title: Creating a Kubernetes Test Cluster with kind and Installing Prometheus and Alertmanager (with Inhibition Rule Verification) date: 2025-11-26

Goals

  • Use kind to create a local test cluster.
  • Install kube-prometheus-stack (Prometheus, Alertmanager, Grafana) via Helm.
  • Apply control-plane alerting rules and Alertmanager inhibition rules, then verify them.

Prerequisites

  • Docker installed and running.
  • macOS environment with kubectl, helm, and kind installed (if missing, install with the command below).
brew install kubectl helm kind

Create the kind cluster

kind create cluster --name k8s-dev
kubectl cluster-info
kubectl get nodes -o wide

Install kube-prometheus-stack

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring || true
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring \
  -f deploy/monitoring/kube-prometheus-stack-values.yaml

Verify the installation:

kubectl get pods -n monitoring
kubectl get svc -n monitoring | grep -E "prometheus|alertmanager|grafana"

Apply the control-plane alerting rules

kubectl apply -f deploy/monitoring/prometheusrule-control-plane.yaml
kubectl -n monitoring get prometheusrule kubernetes-control-plane-rules -o yaml | head -n 20

Check in the Prometheus UI that the rules have loaded:

kubectl -n monitoring port-forward svc/monitoring-kube-prometheus-prometheus 9090:9090 &
# open http://localhost:9090/rules in a browser

Verify the Alertmanager inhibition rules

Port-forward to Alertmanager:

kubectl -n monitoring port-forward svc/monitoring-kube-prometheus-alertmanager 9093:9093 &

Inject test alerts:

curl -X POST http://localhost:9093/api/v2/alerts -H 'Content-Type: application/json' -d '[
  {
    "labels": { "alertname": "APIServerDown", "severity": "critical", "job": "apiserver", "instance": "cp-1", "component": "apiserver" },
    "annotations": { "summary": "API Server unavailable" }
  },
  {
    "labels": { "alertname": "APIServerLatencyP95High", "severity": "warning", "job": "apiserver", "instance": "cp-1", "component": "apiserver" },
    "annotations": { "summary": "API Server P95 latency high" }
  }
]'

Expected:

  • APIServerLatencyP95High is marked as inhibited and only APIServerDown is delivered; the Alertmanager UI shows Inhibited By for it.
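Besides the UI, the Alertmanager v2 API reports inhibition state per alert: status.state becomes suppressed and status.inhibitedBy lists the fingerprints of the inhibiting alerts. A minimal Python sketch over a hand-written sample response (the shape follows the v2 GET /api/v2/alerts output; the data here is illustrative, not fetched from a live cluster):

```python
# Hand-written sample shaped like the Alertmanager v2 GET /api/v2/alerts
# response; in practice you would fetch it from http://localhost:9093.
alerts = [
    {
        "labels": {"alertname": "APIServerDown", "severity": "critical"},
        "status": {"state": "active", "inhibitedBy": []},
    },
    {
        "labels": {"alertname": "APIServerLatencyP95High", "severity": "warning"},
        "status": {"state": "suppressed", "inhibitedBy": ["fingerprint-of-source"]},
    },
]

# An alert is inhibited when status.inhibitedBy is non-empty.
inhibited = [a["labels"]["alertname"] for a in alerts if a["status"]["inhibitedBy"]]
print(inhibited)  # ['APIServerLatencyP95High']
```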

Optional: verify the generic "higher severity inhibits lower severity" rule (expected: the warning-level copy is inhibited):

curl -X POST http://localhost:9093/api/v2/alerts -H 'Content-Type: application/json' -d '[
  { "labels": { "alertname": "HighErrorRate", "severity": "critical", "job": "apiserver", "instance": "cp-1" } },
  { "labels": { "alertname": "HighErrorRate", "severity": "warning", "job": "apiserver", "instance": "cp-1" } }
]'

Troubleshooting

  • If inhibition does not take effect, check whether the labels in equal uniquely correlate both alerts to the same object, e.g. job + instance.
  • Metric names can differ across versions; adjust the job regex in the PrometheusRule if necessary.
  • Validate the Alertmanager configuration inside the container with amtool check-config.

Cleanup

helm -n monitoring uninstall monitoring || true
kind delete cluster --name k8s-dev

Alertmanager inhibition parameters in detail

Matchers and correlation

  • source_matchers: conditions the source alert must match (the alert that triggers inhibition).
  • target_matchers: conditions the target alert must match (the alert being inhibited).
  • equal: a list of label keys on which source and target must agree, correlating both alerts to the same object.

Matcher syntax

  • Supports key=value, key!=value, key=~regex, and key!~regex; multiple matchers are ANDed together.
  • Inhibition has no duration parameter: targets stay inhibited while a matching source alert is firing, and are released as soon as the source resolves.
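To make these semantics concrete, here is a minimal Python model of an inhibit rule. This is a hypothetical sketch, not Alertmanager's actual implementation: it only handles key=value matchers and ignores edge cases such as an alert inhibiting itself.

```python
def matches(alert_labels, matchers):
    """Check simple key=value matchers against an alert's labels (AND semantics)."""
    return all(alert_labels.get(k) == v for k, v in matchers.items())

def is_inhibited(target, firing_alerts, rule):
    """A target alert is inhibited if some firing alert matches source_matchers
    and agrees with the target on every label listed in `equal`."""
    if not matches(target, rule["target_matchers"]):
        return False
    return any(
        matches(src, rule["source_matchers"])
        and all(src.get(k) == target.get(k) for k in rule["equal"])
        for src in firing_alerts
    )

# The generic "critical inhibits warning" rule from the values file below.
rule = {
    "source_matchers": {"severity": "critical"},
    "target_matchers": {"severity": "warning"},
    "equal": ["alertname", "job", "instance"],
}

critical = {"alertname": "HighErrorRate", "severity": "critical", "job": "apiserver", "instance": "cp-1"}
warning  = {"alertname": "HighErrorRate", "severity": "warning",  "job": "apiserver", "instance": "cp-1"}
other    = {"alertname": "HighErrorRate", "severity": "warning",  "job": "apiserver", "instance": "cp-2"}

print(is_inhibited(warning, [critical], rule))  # True: same alertname/job/instance
print(is_inhibited(other,   [critical], rule))  # False: instance differs
```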

Complete YAML configuration (copy-ready)

kube-prometheus-stack values (Alertmanager inhibition)

alertmanager:
  enabled: true
  config:
    route:
      receiver: default
    receivers:
    - name: "null"
    - name: default
    inhibit_rules:
    - source_matchers:
      - severity=critical
      target_matchers:
      - severity=warning
      equal:
      - alertname
      - job
      - instance
    - source_matchers:
      - alertname=APIServerDown
      target_matchers:
      - component=apiserver
      equal:
      - job
      - instance
    - source_matchers:
      - alertname=EtcdInsufficientMembers
      target_matchers:
      - component=etcd
      equal:
      - component
    - source_matchers:
      - alertname=EtcdNoLeader
      target_matchers:
      - component=etcd
      equal:
      - component
    - source_matchers:
      - alertname=SchedulerDown
      target_matchers:
      - component=scheduler
      equal:
      - job
      - instance
    - source_matchers:
      - alertname=ControllerManagerDown
      target_matchers:
      - component=controller-manager
      equal:
      - job
      - instance

grafana:
  enabled: true
  adminPassword: admin

prometheus:
  prometheusSpec:
    retention: 24h

PrometheusRule (control-plane alerts)

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-control-plane-rules
  namespace: monitoring
spec:
  groups:
  - name: kubernetes-control-plane-availability
    rules:
    - alert: APIServerDown
      expr: up{job=~"apiserver|kube-apiserver"} == 0
      for: 5m
      labels:
        severity: critical
        component: apiserver
      annotations:
        summary: API Server unavailable
        description: API Server on instance {{ $labels.instance }} cannot be scraped by Prometheus

    - alert: SchedulerDown
      expr: up{job=~"kube-scheduler|scheduler"} == 0
      for: 5m
      labels:
        severity: critical
        component: scheduler
      annotations:
        summary: Scheduler unavailable
        description: Scheduler on instance {{ $labels.instance }} cannot be scraped by Prometheus

    - alert: ControllerManagerDown
      expr: up{job=~"kube-controller-manager|controller-manager"} == 0
      for: 5m
      labels:
        severity: critical
        component: controller-manager
      annotations:
        summary: Controller Manager unavailable
        description: Controller Manager on instance {{ $labels.instance }} cannot be scraped by Prometheus

    - alert: EtcdMemberDown
      expr: up{job=~"etcd"} == 0
      for: 2m
      labels:
        severity: critical
        component: etcd
      annotations:
        summary: etcd member unavailable
        description: etcd member {{ $labels.instance }} cannot be scraped by Prometheus

    - alert: EtcdNoLeader
      expr: etcd_server_has_leader == 0
      for: 1m
      labels:
        severity: critical
        component: etcd
      annotations:
        summary: etcd has no leader
        description: The etcd cluster has no leader

    - alert: EtcdInsufficientMembers
      expr: count(up{job=~"etcd"} == 1) < 3
      for: 5m
      labels:
        severity: critical
        component: etcd
      annotations:
        summary: Insufficient etcd members
        description: Fewer than 3 etcd members are available

  - name: apiserver-health
    rules:
    - alert: APIServerErrorRateHigh
      expr: (sum(rate(apiserver_request_total{code=~"5.."}[5m])) / sum(rate(apiserver_request_total[5m]))) > 0.02
      for: 10m
      labels:
        severity: warning
        component: apiserver
      annotations:
        summary: Elevated API Server 5xx ratio
        description: 5xx ratio > 2%

    - alert: APIServerErrorRateCritical
      expr: (sum(rate(apiserver_request_total{code=~"5.."}[5m])) / sum(rate(apiserver_request_total[5m]))) > 0.10
      for: 5m
      labels:
        severity: critical
        component: apiserver
      annotations:
        summary: Critical API Server 5xx ratio
        description: 5xx ratio > 10%

    - alert: APIServerLatencyP95High
      expr: histogram_quantile(0.95, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb)) > 1
      for: 10m
      labels:
        severity: warning
        component: apiserver
      annotations:
        summary: API Server P95 latency high
        description: P95 request latency > 1s

    - alert: APIServerLatencyP99Critical
      expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb)) > 2
      for: 5m
      labels:
        severity: critical
        component: apiserver
      annotations:
        summary: API Server P99 latency critical
        description: P99 request latency > 2s

  - name: etcd-stability
    rules:
    - alert: EtcdLeaderChangesHigh
      expr: increase(etcd_server_leader_changes_seen_total[15m]) > 5
      for: 15m
      labels:
        severity: warning
        component: etcd
      annotations:
        summary: Frequent etcd leader changes
        description: More than 5 leader changes within 15 minutes

    - alert: EtcdProposalsFailed
      expr: rate(etcd_server_proposals_failed_total[5m]) > 0
      for: 10m
      labels:
        severity: warning
        component: etcd
      annotations:
        summary: etcd proposals failing
        description: etcd proposal failure rate is greater than 0

  - name: scheduler-controller-health
    rules:
    - alert: SchedulerPendingPodsHigh
      expr: sum(kube_pod_status_phase{phase="Pending"}) > 50
      for: 15m
      labels:
        severity: warning
        component: scheduler
      annotations:
        summary: High number of Pending pods
        description: The cluster has had many Pending pods for an extended period, possibly indicating scheduling capacity issues

    - alert: ControllerManagerWorkQueueLatencyHigh
      expr: sum(workqueue_queue_duration_seconds_sum{job=~"kube-controller-manager|controller-manager"}) / sum(workqueue_queue_duration_seconds_count{job=~"kube-controller-manager|controller-manager"}) > 5
      for: 10m
      labels:
        severity: warning
        component: controller-manager
      annotations:
        summary: High average Controller Manager workqueue latency
        description: Controller work items are spending too long queued
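As a sanity check on the apiserver-health thresholds, the warning/critical boundaries (2% and 10% 5xx ratio) can be written out as plain arithmetic. This is illustrative Python mirroring the PromQL above; Prometheus itself evaluates the expressions, not this function:

```python
def severity_for_error_ratio(rate_5xx, rate_total):
    """Map a 5xx request ratio to the severities defined above:
    > 10% -> critical (APIServerErrorRateCritical)
    >  2% -> warning  (APIServerErrorRateHigh)
    otherwise no alert fires."""
    ratio = rate_5xx / rate_total
    if ratio > 0.10:
        return "critical"
    if ratio > 0.02:
        return "warning"
    return None

print(severity_for_error_ratio(0.5, 100.0))   # None (0.5%)
print(severity_for_error_ratio(5.0, 100.0))   # warning (5%)
print(severity_for_error_ratio(15.0, 100.0))  # critical (15%)
```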
