Monitor
Dashboards and alerting rules for the ETCD module
Dashboards
The ETCD module provides one monitoring dashboard: ETCD Overview.
ETCD Overview: Overview of the ETCD cluster
This dashboard provides key information about the ETCD status, with the most notable being ETCD Aliveness, which displays the overall service status of the ETCD cluster.
Red bands indicate periods when instances are unavailable, while the blue-gray bands below show when the entire cluster is unavailable.
Alert Rules
Pigsty provides the following five alert rules for the ETCD module:
Alert Rule | Description | Severity |
---|---|---|
EtcdServerDown | An etcd server instance has been down for 1 minute | Critical (CRIT) |
EtcdNoLeader | An etcd cluster has had no leader for 15 seconds | Critical (CRIT) |
EtcdQuotaFull | Cluster quota usage has exceeded 90% for 1 minute | Warning (WARN) |
EtcdNetworkPeerRTSlow | Peer network round-trip time (p95) has exceeded 200ms for 1 minute | Notice (INFO) |
EtcdWalFsyncSlow | WAL fsync latency (p95) has exceeded 50ms for 1 minute | Notice (INFO) |
You can modify or add new etcd alert rules in files/prometheus/rules/etcd.yml.
#==============================================================#
#                          Aliveness                           #
#==============================================================#
# etcd server instance down
- alert: EtcdServerDown
  expr: etcd_up < 1
  for: 1m
  labels: { level: 0, severity: CRIT, category: etcd }
  annotations:
    summary: "CRIT EtcdServerDown {{ $labels.ins }}@{{ $labels.instance }}"
    description: |
      etcd_up[ins={{ $labels.ins }}, instance={{ $labels.instance }}] = {{ $value }} < 1
      http://g.pigsty/d/etcd-overview

#==============================================================#
#                            Error                             #
#==============================================================#
# Etcd no Leader triggers a P0 alert immediately
# if dcs_failsafe mode is not enabled, this may lead to global outage
- alert: EtcdNoLeader
  expr: min(etcd_server_has_leader) by (cls) < 1
  for: 15s
  labels: { level: 0, severity: CRIT, category: etcd }
  annotations:
    summary: "CRIT EtcdNoLeader: {{ $labels.cls }} {{ $value }}"
    description: |
      etcd_server_has_leader[cls={{ $labels.cls }}] = {{ $value }} < 1
      http://g.pigsty/d/etcd-overview?from=now-5m&to=now&var-cls={{$labels.cls}}

#==============================================================#
#                          Saturation                          #
#==============================================================#
- alert: EtcdQuotaFull
  expr: etcd:cls:quota_usage > 0.90
  for: 1m
  labels: { level: 1, severity: WARN, category: etcd }
  annotations:
    summary: "WARN EtcdQuotaFull: {{ $labels.cls }}"
    description: |
      etcd:cls:quota_usage[cls={{ $labels.cls }}] = {{ $value | printf "%.3f" }} > 90%

#==============================================================#
#                           Latency                            #
#==============================================================#
# etcd network peer rt p95 > 200ms for 1m
- alert: EtcdNetworkPeerRTSlow
  expr: etcd:ins:network_peer_rt_p95_5m > 0.200
  for: 1m
  labels: { level: 2, severity: INFO, category: etcd }
  annotations:
    summary: "INFO EtcdNetworkPeerRTSlow: {{ $labels.cls }} {{ $labels.ins }}"
    description: |
      etcd:ins:network_peer_rt_p95_5m[cls={{ $labels.cls }}, ins={{ $labels.ins }}] = {{ $value }} > 200ms
      http://g.pigsty/d/etcd-instance?from=now-10m&to=now&var-cls={{ $labels.cls }}

# Etcd wal fsync rt p95 > 50ms
- alert: EtcdWalFsyncSlow
  expr: etcd:ins:wal_fsync_rt_p95_5m > 0.050
  for: 1m
  labels: { level: 2, severity: INFO, category: etcd }
  annotations:
    summary: "INFO EtcdWalFsyncSlow: {{ $labels.cls }} {{ $labels.ins }}"
    description: |
      etcd:ins:wal_fsync_rt_p95_5m[cls={{ $labels.cls }}, ins={{ $labels.ins }}] = {{ $value }} > 50ms
      http://g.pigsty/d/etcd-instance?from=now-10m&to=now&var-cls={{ $labels.cls }}
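
As a minimal sketch of adding a rule to files/prometheus/rules/etcd.yml, the entry below reuses the label and annotation layout of the existing rules. The alert name EtcdLeaderChangeFrequent, the threshold of 3 leader changes in 15 minutes, and the WARN severity are assumptions chosen for illustration and are not part of the rules Pigsty ships. etcd_server_leader_changes_seen_total is a standard etcd server metric; the cls and ins labels are assumed to be attached to it in the same way as on the metrics used above.

```yaml
# Hypothetical rule, not shipped with Pigsty: frequent leader changes.
# The name, threshold, window, and severity are illustrative assumptions.
- alert: EtcdLeaderChangeFrequent
  expr: increase(etcd_server_leader_changes_seen_total[15m]) > 3
  for: 1m
  labels: { level: 1, severity: WARN, category: etcd }
  annotations:
    summary: "WARN EtcdLeaderChangeFrequent: {{ $labels.cls }} {{ $labels.ins }}"
    description: |
      increase(etcd_server_leader_changes_seen_total[15m])[cls={{ $labels.cls }}, ins={{ $labels.ins }}] = {{ $value }} > 3
      http://g.pigsty/d/etcd-overview?from=now-15m&to=now&var-cls={{ $labels.cls }}
```

A new entry like this would sit alongside the existing rules at the same indentation level. Prometheus only picks up rule changes after its configuration is reloaded, and how that reload is triggered depends on your deployment.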