title: Creating a Kubernetes Test Cluster with kind and Installing Prometheus and Alertmanager (with Inhibition Verification)
date: 2025-11-26
Goals
- Create a local test cluster with kind.
- Install kube-prometheus-stack (Prometheus, Alertmanager, Grafana) via Helm.
- Apply control-plane alert rules and Alertmanager inhibition rules, then verify them.

Prerequisites
- Docker is installed and running.
- macOS with kubectl, helm, and kind installed (if any are missing, use the command below).

```shell
brew install kubectl helm kind
```
Create the kind cluster

```shell
kind create cluster --name k8s-dev
kubectl cluster-info
kubectl get nodes -o wide
```
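kind can also build a multi-node cluster from a config file, which makes the test cluster look a bit more like a real one. A minimal sketch (the filename and node layout are illustrative, not from the original setup):

```yaml
# kind-config.yaml (illustrative): one control-plane node plus two workers
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
```

Pass it with `kind create cluster --name k8s-dev --config kind-config.yaml`.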
Install kube-prometheus-stack

```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring || true
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring \
  -f deploy/monitoring/kube-prometheus-stack-values.yaml
```
Verify the installation:

```shell
kubectl get pods -n monitoring
kubectl get svc -n monitoring | grep -E "prometheus|alertmanager|grafana"
```
Apply the control-plane alert rules

```shell
kubectl apply -f deploy/monitoring/prometheusrule-control-plane.yaml
kubectl -n monitoring get prometheusrule kubernetes-control-plane-rules -o yaml | head -n 20
```
Check in the Prometheus UI that the rules are loaded:

```shell
kubectl -n monitoring port-forward svc/monitoring-kube-prometheus-prometheus 9090:9090 &
# Open http://localhost:9090/rules in a browser
```
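The same check can be scripted against the Prometheus rules API. The JSON below is a trimmed, illustrative sample of the response shape; against the live cluster you would fetch the real thing through the port-forward, as noted in the comment.

```shell
# Trimmed, illustrative sample of the rules API response;
# live equivalent: curl -s http://localhost:9090/api/v1/rules
response='{"status":"success","data":{"groups":[{"name":"kubernetes-control-plane-availability"}]}}'

# Confirm that a rule group defined in the PrometheusRule is present.
echo "$response" | grep -o 'kubernetes-control-plane-availability'
```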
Verify the Alertmanager inhibition rules

Port-forward to Alertmanager:

```shell
kubectl -n monitoring port-forward svc/monitoring-kube-prometheus-alertmanager 9093:9093 &
```
Inject test alerts:

```shell
curl -X POST http://localhost:9093/api/v2/alerts -H 'Content-Type: application/json' -d '[
  {
    "labels": { "alertname": "APIServerDown", "severity": "critical", "job": "apiserver", "instance": "cp-1", "component": "apiserver" },
    "annotations": { "summary": "API Server is down" }
  },
  {
    "labels": { "alertname": "APIServerLatencyP95High", "severity": "warning", "job": "apiserver", "instance": "cp-1", "component": "apiserver" },
    "annotations": { "summary": "API Server P95 latency is high" }
  }
]'
```
Expected result:
APIServerLatencyP95High is marked as inhibited and only APIServerDown is delivered; the Alertmanager UI shows "Inhibited By" on the suppressed alert.
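Beyond the UI, the suppressed state is visible in the Alertmanager v2 API: an inhibited alert carries `status.state: "suppressed"` and a non-empty `inhibitedBy` list. The payload below is a trimmed, illustrative sample of one such entry; live, you would fetch it through the port-forward.

```shell
# Trimmed, illustrative sample of one entry from the alerts API;
# live equivalent: curl -s http://localhost:9093/api/v2/alerts
alert='{"labels":{"alertname":"APIServerLatencyP95High"},"status":{"state":"suppressed","inhibitedBy":["<fingerprint>"]}}'

# state "suppressed" plus a non-empty inhibitedBy list means an inhibit rule matched.
echo "$alert" | grep -o '"state":"suppressed"'
```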
Optional: verify the generic "higher severity inhibits lower severity" rule:

```shell
curl -X POST http://localhost:9093/api/v2/alerts -H 'Content-Type: application/json' -d '[
  { "labels": { "alertname": "HighErrorRate", "severity": "critical", "job": "apiserver", "instance": "cp-1" } },
  { "labels": { "alertname": "HighErrorRate", "severity": "warning", "job": "apiserver", "instance": "cp-1" } }
]'
```
Common issues
- If inhibition does not take effect, check that the labels listed in equal uniquely associate both alerts with the same object, e.g. job+instance.
- Metric names may differ across versions; adjust the job regex in the PrometheusRule if necessary.
- Use amtool check-config inside the container to validate the Alertmanager configuration.
- By default, kube-prometheus-stack's Prometheus only selects PrometheusRule objects carrying the Helm release label (here that would be release: monitoring); if the rules never show up in the UI, add that label to the rule's metadata.
Cleanup

```shell
helm -n monitoring uninstall monitoring || true
kind delete cluster --name k8s-dev
```
Alertmanager inhibition parameters explained

Matchers and association
- source_matchers: match conditions for the source alert (the one that triggers inhibition).
- target_matchers: match conditions for the target alert (the one being inhibited).
- equal: a list of label keys whose values must be identical on source and target, used to tie both alerts to the same object.

Matcher syntax
- Supported forms are key=value, key!=value, key=~regex, and key!~regex; multiple matchers are combined with AND.
- Inhibition has no duration parameter: the target is suppressed for as long as the source alert is firing and the conditions match, and is released as soon as the source resolves.
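The equal association can be sketched in shell (the label values below are made up): the target alert is inhibited only when every label listed under equal carries the same value on both alerts.

```shell
# Hypothetical labels on the source (critical) and target (warning) alerts.
src_job="apiserver"; src_instance="cp-1"
tgt_job="apiserver"; tgt_instance="cp-1"

# equal: [job, instance] — all listed labels must match (AND); otherwise the
# alerts refer to different objects and no inhibition happens.
if [ "$src_job" = "$tgt_job" ] && [ "$src_instance" = "$tgt_instance" ]; then
  echo "target inhibited"
else
  echo "target fires normally"
fi
```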
Complete YAML configuration (copy-ready)

kube-prometheus-stack values (Alertmanager inhibition)

```yaml
alertmanager:
  enabled: true
  config:
    route:
      receiver: default
    receivers:
      - name: "null"
      - name: default
    inhibit_rules:
      - source_matchers:
          - severity=critical
        target_matchers:
          - severity=warning
        equal:
          - alertname
          - job
          - instance
      - source_matchers:
          - alertname=APIServerDown
        target_matchers:
          - component=apiserver
        equal:
          - job
          - instance
      - source_matchers:
          - alertname=EtcdInsufficientMembers
        target_matchers:
          - component=etcd
        equal:
          - component
      - source_matchers:
          - alertname=EtcdNoLeader
        target_matchers:
          - component=etcd
        equal:
          - component
      - source_matchers:
          - alertname=SchedulerDown
        target_matchers:
          - component=scheduler
        equal:
          - job
          - instance
      - source_matchers:
          - alertname=ControllerManagerDown
        target_matchers:
          - component=controller-manager
        equal:
          - job
          - instance
grafana:
  enabled: true
  adminPassword: admin
prometheus:
  prometheusSpec:
    retention: 24h
```
PrometheusRule (control-plane alerts)

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-control-plane-rules
  namespace: monitoring
spec:
  groups:
    - name: kubernetes-control-plane-availability
      rules:
        - alert: APIServerDown
          expr: up{job=~"apiserver|kube-apiserver"} == 0
          for: 5m
          labels:
            severity: critical
            component: apiserver
          annotations:
            summary: API Server is down
            description: The API Server on instance {{ $labels.instance }} cannot be scraped by Prometheus
        - alert: SchedulerDown
          expr: up{job=~"kube-scheduler|scheduler"} == 0
          for: 5m
          labels:
            severity: critical
            component: scheduler
          annotations:
            summary: Scheduler is down
            description: The Scheduler on instance {{ $labels.instance }} cannot be scraped by Prometheus
        - alert: ControllerManagerDown
          expr: up{job=~"kube-controller-manager|controller-manager"} == 0
          for: 5m
          labels:
            severity: critical
            component: controller-manager
          annotations:
            summary: Controller Manager is down
            description: The Controller Manager on instance {{ $labels.instance }} cannot be scraped by Prometheus
        - alert: EtcdMemberDown
          expr: up{job=~"etcd"} == 0
          for: 2m
          labels:
            severity: critical
            component: etcd
          annotations:
            summary: etcd member is down
            description: etcd member {{ $labels.instance }} cannot be scraped by Prometheus
        - alert: EtcdNoLeader
          expr: etcd_server_has_leader == 0
          for: 1m
          labels:
            severity: critical
            component: etcd
          annotations:
            summary: etcd has no leader
            description: The etcd cluster has no leader
        - alert: EtcdInsufficientMembers
          expr: count(up{job=~"etcd"} == 1) < 3
          for: 5m
          labels:
            severity: critical
            component: etcd
          annotations:
            summary: Insufficient etcd members
            description: Fewer than 3 etcd members are available
    - name: apiserver-health
      rules:
        - alert: APIServerErrorRateHigh
          expr: (sum(rate(apiserver_request_total{code=~"5.."}[5m])) / sum(rate(apiserver_request_total[5m]))) > 0.02
          for: 10m
          labels:
            severity: warning
            component: apiserver
          annotations:
            summary: Elevated API Server 5xx ratio
            description: 5xx ratio > 2%
        - alert: APIServerErrorRateCritical
          expr: (sum(rate(apiserver_request_total{code=~"5.."}[5m])) / sum(rate(apiserver_request_total[5m]))) > 0.10
          for: 5m
          labels:
            severity: critical
            component: apiserver
          annotations:
            summary: Critical API Server 5xx ratio
            description: 5xx ratio > 10%
        - alert: APIServerLatencyP95High
          expr: histogram_quantile(0.95, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb)) > 1
          for: 10m
          labels:
            severity: warning
            component: apiserver
          annotations:
            summary: API Server P95 latency is high
            description: P95 request latency > 1s
        - alert: APIServerLatencyP99Critical
          expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb)) > 2
          for: 5m
          labels:
            severity: critical
            component: apiserver
          annotations:
            summary: API Server P99 latency is critical
            description: P99 request latency > 2s
    - name: etcd-stability
      rules:
        - alert: EtcdLeaderChangesHigh
          expr: increase(etcd_server_leader_changes_seen_total[15m]) > 5
          for: 15m
          labels:
            severity: warning
            component: etcd
          annotations:
            summary: Frequent etcd leader changes
            description: More than 5 leader changes within 15 minutes
        - alert: EtcdProposalsFailed
          expr: rate(etcd_server_proposals_failed_total[5m]) > 0
          for: 10m
          labels:
            severity: warning
            component: etcd
          annotations:
            summary: etcd proposals are failing
            description: The etcd proposal failure rate is greater than 0
    - name: scheduler-controller-health
      rules:
        - alert: SchedulerPendingPodsHigh
          expr: sum(kube_pod_status_phase{phase="Pending"}) > 50
          for: 15m
          labels:
            severity: warning
            component: scheduler
          annotations:
            summary: High number of Pending Pods
            description: The number of Pending Pods has stayed high, possibly related to scheduling capacity
        - alert: ControllerManagerWorkQueueLatencyHigh
          expr: sum(workqueue_queue_duration_seconds_sum{job=~"kube-controller-manager|controller-manager"}) / sum(workqueue_queue_duration_seconds_count{job=~"kube-controller-manager|controller-manager"}) > 5
          for: 10m
          labels:
            severity: warning
            component: controller-manager
          annotations:
            summary: Controller Manager queue latency is high
            description: Controller work queues are taking too long to process
```
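As a sanity check on the thresholds above: APIServerErrorRateHigh is simply the 5xx request rate divided by the total request rate compared against 0.02. With made-up rates of 3 errors/s out of 100 requests/s, the 2% threshold is crossed:

```shell
# Illustrative arithmetic behind APIServerErrorRateHigh (values are made up).
errors_per_s=3
total_per_s=100
awk -v e="$errors_per_s" -v t="$total_per_s" \
  'BEGIN { if (e / t > 0.02) print "alert fires"; else print "ok" }'
```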