IoT Edge Clusters: Alert Notifications Based on Kubernetes Events (Part 2): Further Configuration
Source: cnblogs  Author: 东风微鸣  Date: 2023/2/17 10:00:06

Previous article

IoT Edge Clusters: Alert Notifications Based on Kubernetes Events

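Goals

  1. Recovery notifications: evaluated and found infeasible.
    1. Reason: an alert and its recovery are separate, entirely unrelated events. The alert is a Warning-level event and the recovery is a Normal-level event, so enabling recovery notifications would mean sending every Normal event, and that volume is overwhelming. Moreover, without a lot of experience and patience, it is impossible to tell which Normal event corresponds to the recovery of a given alert.
  2. Repeated alerts while an issue remains unresolved: supported out of the box, no extra configuration needed.
  3. Alert content shows resource names, such as node and Pod names.
  4. Specific nodes and workloads can be muted, and the mute list can be adjusted dynamically.
    1. For example, node worker-1 in cluster 001 undergoes planned maintenance: monitoring stops during the maintenance and resumes once it is complete.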

Configuration

Showing resource names in the alert content

Several typical events:

    apiVersion: v1
    count: 101557
    eventTime: null
    firstTimestamp: "2022-04-08T03:50:47Z"
    involvedObject:
      apiVersion: v1
      fieldPath: spec.containers{prometheus}
      kind: Pod
      name: prometheus-rancher-monitoring-prometheus-0
      namespace: cattle-monitoring-system
    kind: Event
    lastTimestamp: "2022-04-14T11:39:19Z"
    message: 'Readiness probe failed: Get "http://10.42.0.87:9090/-/ready": context deadline
      exceeded (Client.Timeout exceeded while awaiting headers)'
    metadata:
      creationTimestamp: "2022-04-08T03:51:17Z"
      name: prometheus-rancher-monitoring-prometheus-0.16e3cf53f0793344
      namespace: cattle-monitoring-system
    reason: Unhealthy
    reportingComponent: ""
    reportingInstance: ""
    source:
      component: kubelet
      host: master-1
    type: Warning
    ---
    apiVersion: v1
    count: 116
    eventTime: null
    firstTimestamp: "2022-04-13T02:43:26Z"
    involvedObject:
      apiVersion: v1
      fieldPath: spec.containers{grafana}
      kind: Pod
      name: rancher-monitoring-grafana-57777cc795-2b2x5
      namespace: cattle-monitoring-system
    kind: Event
    lastTimestamp: "2022-04-14T11:18:56Z"
    message: 'Readiness probe failed: Get "http://10.42.0.90:3000/api/health": context
      deadline exceeded (Client.Timeout exceeded while awaiting headers)'
    metadata:
      creationTimestamp: "2022-04-14T11:18:57Z"
      name: rancher-monitoring-grafana-57777cc795-2b2x5.16e5548dd2523a13
      namespace: cattle-monitoring-system
    reason: Unhealthy
    reportingComponent: ""
    reportingInstance: ""
    source:
      component: kubelet
      host: master-1
    type: Warning
    ---
    apiVersion: v1
    count: 20958
    eventTime: null
    firstTimestamp: "2022-04-11T10:34:51Z"
    involvedObject:
      apiVersion: v1
      fieldPath: spec.containers{lb-port-1883}
      kind: Pod
      name: svclb-emqx-dt22t
      namespace: emqx
    kind: Event
    lastTimestamp: "2022-04-14T11:39:48Z"
    message: Back-off restarting failed container
    metadata:
      creationTimestamp: "2022-04-11T10:34:51Z"
      name: svclb-emqx-dt22t.16e4d11e2b9efd27
      namespace: emqx
    reason: BackOff
    reportingComponent: ""
    reportingInstance: ""
    source:
      component: kubelet
      host: worker-1
    type: Warning
    ---
    apiVersion: v1
    count: 21069
    eventTime: null
    firstTimestamp: "2022-04-11T10:34:48Z"
    involvedObject:
      apiVersion: v1
      fieldPath: spec.containers{lb-port-80}
      kind: Pod
      name: svclb-traefik-r5p8t
      namespace: kube-system
    kind: Event
    lastTimestamp: "2022-04-14T11:44:59Z"
    message: Back-off restarting failed container
    metadata:
      creationTimestamp: "2022-04-11T10:34:48Z"
      name: svclb-traefik-r5p8t.16e4d11daf0b79ce
      namespace: kube-system
    reason: BackOff
    reportingComponent: ""
    reportingInstance: ""
    source:
      component: kubelet
      host: worker-1
    type: Warning

    {
      "metadata": {
        "name": "event-exporter-79544df9f7-xj4t5.16e5c540dc32614f",
        "namespace": "monitoring",
        "uid": "baf2f642-2383-4e22-87e0-456b6c3eaf4e",
        "resourceVersion": "14043444",
        "creationTimestamp": "2022-04-14T13:08:40Z"
      },
      "reason": "Pulled",
      "message": "Container image \"ghcr.io/opsgenie/kubernetes-event-exporter:v0.11\" already present on machine",
      "source": {
        "component": "kubelet",
        "host": "worker-2"
      },
      "firstTimestamp": "2022-04-14T13:08:40Z",
      "lastTimestamp": "2022-04-14T13:08:40Z",
      "count": 1,
      "type": "Normal",
      "eventTime": null,
      "reportingComponent": "",
      "reportingInstance": "",
      "involvedObject": {
        "kind": "Pod",
        "namespace": "monitoring",
        "name": "event-exporter-79544df9f7-xj4t5",
        "uid": "b77d3e13-fa9e-484b-8a5a-d1afc9edec75",
        "apiVersion": "v1",
        "resourceVersion": "14043435",
        "fieldPath": "spec.containers{event-exporter}",
        "labels": {
          "app": "event-exporter",
          "pod-template-hash": "79544df9f7",
          "version": "v1"
        }
      }
    }

More fields can be added to the alert message, including:

  • Node: {{ .Source.Host }}
  • Pod: {{ .InvolvedObject.Name }}
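To make the mapping between the template paths and the event fields concrete, here is a minimal Python sketch. It is only an illustration, not the exporter's Go template engine: the `lookup` helper is hypothetical, and it assumes that a template name such as `Source` corresponds to the JSON key `source` (first letter lowered), which holds for the two fields above but not for every field (for example `.UID`).

```python
import json

# Trimmed copy of the Normal event shown above (only the fields used here).
EVENT_JSON = """
{
  "reason": "Pulled",
  "type": "Normal",
  "source": {"component": "kubelet", "host": "worker-2"},
  "involvedObject": {
    "kind": "Pod",
    "namespace": "monitoring",
    "name": "event-exporter-79544df9f7-xj4t5"
  }
}
"""


def lookup(event, path):
    """Resolve a template-style path such as '.Source.Host' against an event dict.

    Hypothetical helper: it lowers only the first character of each path
    segment to obtain the JSON key, which is enough for the fields used here.
    """
    value = event
    for part in path.strip(".").split("."):
        key = part[0].lower() + part[1:]
        value = value[key]
    return value


event = json.loads(EVENT_JSON)
node = lookup(event, ".Source.Host")
pod = lookup(event, ".InvolvedObject.Name")
print(f"node={node} pod={pod}")
```

For the sample event this resolves the node to worker-2 and the Pod to event-exporter-79544df9f7-xj4t5, which are the two values the alert card is meant to surface.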

Putting this together, the modified event-exporter-cfg YAML is as follows:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: event-exporter-cfg
      namespace: monitoring
      resourceVersion: '5779968'
    data:
      config.yaml: |
        logLevel: error
        logFormat: json
        route:
          routes:
            - match:
                - receiver: "dump"
            - drop:
                - type: "Normal"
              match:
                - receiver: "feishu"
        receivers:
          - name: "dump"
            stdout: {}
          - name: "feishu"
            webhook:
              endpoint: "https://open.feishu.cn/open-apis/bot/v2/hook/..."
              headers:
                Content-Type: application/json
              layout:
                msg_type: interactive
                card:
                  config:
                    wide_screen_mode: true
                    enable_forward: true
                  header:
                    title:
                      tag: plain_text
                      content: xxx测试K3S集群告警  # "xxx test K3S cluster alert"
                    template: red
                  elements:
                    - tag: div
                      text:
                        tag: lark_md
                        content: "**EventID:** {{ .UID }}\n**EventNamespace:** {{ .InvolvedObject.Namespace }}\n**EventName:** {{ .InvolvedObject.Name }}\n**EventType:** {{ .Type }}\n**EventKind:** {{ .InvolvedObject.Kind }}\n**EventReason:** {{ .Reason }}\n**EventTime:** {{ .LastTimestamp }}\n**EventMessage:** {{ .Message }}\n**EventComponent:** {{ .Source.Component }}\n**EventHost:** {{ .Source.Host }}\n**EventLabels:** {{ toJson .InvolvedObject.Labels}}\n**EventAnnotations:** {{ toJson .InvolvedObject.Annotations}}"
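To see roughly what the layout section above turns into for one event, the following Python sketch builds an equivalent card payload by hand. It is an approximation under stated assumptions: `build_card` is a hypothetical helper, it covers only a subset of the template fields (no UID, labels, or annotations), and it only prints the payload rather than posting it to the webhook.

```python
import json


def build_card(event, title):
    """Return a dict shaped like the webhook layout in the ConfigMap above."""
    obj = event["involvedObject"]
    src = event["source"]
    lines = [
        f"**EventNamespace:** {obj['namespace']}",
        f"**EventName:** {obj['name']}",
        f"**EventType:** {event['type']}",
        f"**EventKind:** {obj['kind']}",
        f"**EventReason:** {event['reason']}",
        f"**EventTime:** {event['lastTimestamp']}",
        f"**EventMessage:** {event['message']}",
        f"**EventComponent:** {src['component']}",
        f"**EventHost:** {src['host']}",
    ]
    return {
        "msg_type": "interactive",
        "card": {
            "config": {"wide_screen_mode": True, "enable_forward": True},
            "header": {
                "title": {"tag": "plain_text", "content": title},
                "template": "red",
            },
            "elements": [
                {"tag": "div", "text": {"tag": "lark_md", "content": "\n".join(lines)}}
            ],
        },
    }


# Values copied from the svclb-emqx BackOff event listed earlier.
sample = {
    "type": "Warning",
    "reason": "BackOff",
    "message": "Back-off restarting failed container",
    "lastTimestamp": "2022-04-14T11:39:48Z",
    "source": {"component": "kubelet", "host": "worker-1"},
    "involvedObject": {"kind": "Pod", "namespace": "emqx", "name": "svclb-emqx-dt22t"},
}

payload = build_card(sample, "xxx测试K3S集群告警")
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

The point of the sketch is that the node (EventHost) and the Pod name (EventName) now appear as separate lines in the card body.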

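Muting specific nodes and workloads

For example, node worker-1 in cluster 001 undergoes planned maintenance: monitoring stops during the maintenance and resumes once it is complete.

Continue modifying the event-exporter-cfg YAML as follows: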

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: event-exporter-cfg
      namespace: monitoring
    data:
      config.yaml: |
        logLevel: error
        logFormat: json
        route:
          routes:
            - match:
                - receiver: "dump"
            - drop:
                - type: "Normal"
                - source:
                    host: "worker-1"
                - namespace: "cattle-monitoring-system"
                - name: "*emqx*"
                - kind: "Pod|Deployment|ReplicaSet"
                - labels:
                    version: "dev"
              match:
                - receiver: "feishu"
        receivers:
          - name: "dump"
            stdout: {}
          - name: "feishu"
            webhook:
              endpoint: "https://open.feishu.cn/open-apis/bot/v2/hook/..."
              headers:
                Content-Type: application/json
              layout:
                msg_type: interactive
                card:
                  config:
                    wide_screen_mode: true
                    enable_forward: true
                  header:
                    title:
                      tag: plain_text
                      content: xxx测试K3S集群告警  # "xxx test K3S cluster alert"
                    template: red
                  elements:
                    - tag: div
                      text:
                        tag: lark_md
                        content: "**EventID:** {{ .UID }}\n**EventNamespace:** {{ .InvolvedObject.Namespace }}\n**EventName:** {{ .InvolvedObject.Name }}\n**EventType:** {{ .Type }}\n**EventKind:** {{ .InvolvedObject.Kind }}\n**EventReason:** {{ .Reason }}\n**EventTime:** {{ .LastTimestamp }}\n**EventMessage:** {{ .Message }}\n**EventComponent:** {{ .Source.Component }}\n**EventHost:** {{ .Source.Host }}\n**EventLabels:** {{ toJson .InvolvedObject.Labels}}\n**EventAnnotations:** {{ toJson .InvolvedObject.Annotations}}"
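The maintenance scenario comes down to adding a host rule to the drop list before the work starts and removing it afterwards. The sketch below shows that toggle on a plain Python dict shaped like the route section above. The helpers `mute_host` and `unmute_host` are hypothetical, the rule shape is copied from this article's config, and writing the result back into the ConfigMap (and getting the exporter to pick it up) is outside the sketch.

```python
import copy


def mute_host(route, host):
    """Return a copy of `route` with a host drop rule added to each route that has a drop list."""
    updated = copy.deepcopy(route)
    rule = {"source": {"host": host}}
    for entry in updated["routes"]:
        if "drop" in entry and rule not in entry["drop"]:
            entry["drop"].append(rule)
    return updated


def unmute_host(route, host):
    """Return a copy of `route` with the host drop rule removed again."""
    updated = copy.deepcopy(route)
    rule = {"source": {"host": host}}
    for entry in updated["routes"]:
        if "drop" in entry:
            entry["drop"] = [r for r in entry["drop"] if r != rule]
    return updated


# Route section before maintenance: only Normal events are dropped.
base = {
    "routes": [
        {"match": [{"receiver": "dump"}]},
        {"drop": [{"type": "Normal"}], "match": [{"receiver": "feishu"}]},
    ]
}

during = mute_host(base, "worker-1")      # maintenance starts
after = unmute_host(during, "worker-1")   # maintenance finished

print(during["routes"][1]["drop"])
print(after == base)
```

Both helpers return a new dict instead of mutating their input, so the pre-maintenance configuration stays available for comparison or rollback.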

The default drop rule is - type: "Normal", which means Normal-level events do not trigger alerts.

The following rules are now added:

    - source:
        host: "worker-1"
    - namespace: "cattle-monitoring-system"
    - name: "*emqx*"
    - kind: "Pod|Deployment|ReplicaSet"
    - labels:
        version: "dev"

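  • ... host: "worker-1": no alerts for node worker-1;
  • ... namespace: "cattle-monitoring-system": no alerts for the namespace cattle-monitoring-system;
  • ... name: "*emqx*": no alerts for objects whose name (usually the Pod name) contains emqx;
  • kind: "Pod|Deployment|ReplicaSet": no alerts for Pods, Deployments, or ReplicaSets (that is, alerts related to applications and components are ignored);
  • ... version: "dev": no alerts for objects carrying the label version: "dev" (this can be used to mute alerts for a specific application).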
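The way these rules combine can be sketched in Python: an event is dropped as soon as any one rule matches it. This is an illustration of the idea, not the exporter's implementation. It assumes rule values are treated as unanchored regular expressions, and the rule and event shapes simply mirror the keys used in this article. Because "*emqx*" is not a valid pattern for Python's `re` module, the sketch uses ".*emqx.*" instead; how the exporter itself handles the glob-like form is not verified here.

```python
import re


def field_matches(pattern, value):
    """Regex search for scalars; nested dicts (source, labels) are compared key by key."""
    if isinstance(pattern, dict):
        return isinstance(value, dict) and all(
            field_matches(p, value.get(k)) for k, p in pattern.items()
        )
    return value is not None and re.search(pattern, str(value)) is not None


def is_dropped(event, drop_rules):
    """Drop the event if ANY rule matches; within one rule, every field must match."""
    return any(
        all(field_matches(p, event.get(k)) for k, p in rule.items())
        for rule in drop_rules
    )


DROP_RULES = [
    {"type": "Normal"},
    {"source": {"host": "worker-1"}},
    {"namespace": "cattle-monitoring-system"},
    {"name": ".*emqx.*"},  # regex form of the article's "*emqx*" (assumption)
    {"kind": "Pod|Deployment|ReplicaSet"},
    {"labels": {"version": "dev"}},
]

# Hypothetical flattened events, shaped to mirror the rule keys.
on_muted_node = {"type": "Warning", "kind": "Node", "name": "worker-1",
                 "namespace": "", "source": {"host": "worker-1"}, "labels": {}}
on_other_node = {"type": "Warning", "kind": "Node", "name": "worker-2",
                 "namespace": "", "source": {"host": "worker-2"}, "labels": {}}
emqx_pod = {"type": "Warning", "kind": "Pod", "name": "svclb-emqx-dt22t",
            "namespace": "emqx", "source": {"host": "worker-2"}, "labels": {}}

for ev in (on_muted_node, on_other_node, emqx_pod):
    print(ev["name"], "dropped" if is_dropped(ev, DROP_RULES) else "alerted")
```

Note that with an unanchored search the pattern "worker-1" would also match a host named worker-10; anchoring the pattern (for example "^worker-1$") avoids that if such hosts exist.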

Final result

As shown in the screenshots below:

[Figure: Event alert with more information]

[Figure: Event alert with more information (2)]


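When three walk together, one of them can be my teacher; knowledge shared belongs to everyone. This article was written by the 东风微鸣 technical blog, EWhisper.cn.

Original link: https://www.cnblogs.com/east4ming/p/17129096.html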
