ICode9

精准搜索请尝试: 精确搜索
首页 > 其他分享> 文章详细

Kubernetes Prometheus rule

2022-06-12 12:02:15  阅读:212  来源: 互联网

标签:container Kubernetes expr labels rule Prometheus value pod prod


告警规则


常用告警规则配置

  • alerts

    ## CPU告警规则
    groups:
    - name: CpuAlertRule
      rules:
      - alert: PodCPU告警
        expr: onecore:pod > 80 or twocore:pod / 2 > 80 or squarecore:pod / 4 > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          description: "CPU使用率大于80%"
          value: "{{$value}}%"
          #summary: 'CPU使用率大于80%,当前值为{{.Value}}%,CPU使用率: {{ printf `ceil(100 - ((avg by (instance)(irate(node_cpu_seconds_total{mode="idle",instance="%s"}[1m]))) *100))` $labels.instance | query | first | value }}%'
      - alert: NodeCPU告警
        expr: round(100-avg(irate(node_cpu_seconds_total{mode="idle"}[5m]))by(kubernetes_node)*100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          description: "CPU使用率大于80%"
          value: "{{$value}}%"
          #summary: 'CPU使用率大于80%,当前值为{{.Value}}%,CPU使用率: {{ printf `ceil(100 - ((avg by (instance)(irate(node_cpu_seconds_total{mode="idle",instance="%s"}[1m]))) *100))` $labels.instance | query | first | value }}%'
    
    ## DISK告警规则
    - name: DiskAlertRule
      rules:
      - alert: Pod磁盘告警
        expr: round(container_fs_usage_bytes{container=~".+",container!~"POD"}/1024/1024/1024/10*100) > 85
        for: 1m
        labels:
          severity: warning
        annotations:
          description: "磁盘使用率大于85%"
          value: "{{$value}}%"
      - alert: Node磁盘告警
        expr: round((1- node_filesystem_avail_bytes{fstype=~"ext.+|nfs.+",mountpoint!~".*docker.*"}/node_filesystem_size_bytes{fstype=~"ext.+|nfs.+",mountpoint!~".*docker.*"})*100) > 85
        for: 1m
        labels:
          severity: warning
        annotations:
          description: "磁盘使用率大于85%"
          value: "{{$value}}%"
    
    ## MEM告警规则
    - name: MemAlertRule
      rules:
      - alert: Pod内存告警
        expr: round(container_memory_usage_bytes{container=~".+",container!~"POD|.+reload",pod!~"^csi.+"}/container_spec_memory_limit_bytes{container=~".+",container!~"POD|.reload",pod!~"^csi.+"}*100) > 85
        for: 2m
        labels:
          severity: warning
        annotations:
          description: "内存使用率大于85%"
          value: "{{$value}}%"
      - alert: Node内存告警
        expr: round(100-((node_memory_MemAvailable_bytes*100)/node_memory_MemTotal_bytes)) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          description: "内存使用率大于85%"
          value: "{{$value}}%"
    
    ## Pod意外重启
    - name: PodRestartAlertRule
      rules:
      - alert: Pod重启告警
        expr: delta(kube_pod_container_status_restarts_total[1m]) > 0
        for: 1s
        labels:
          severity: warning
        annotations:
          description: "Pod发生意外重启事件"
    
    ## JvmCMSOldGC
    - name: PodJvmOldGCAlertRule
      rules:
      - alert: PodJvmCMSOldGC
        expr: round((jvm_memory_pool_bytes_used{pool=~".+Old Gen"}/jvm_memory_pool_bytes_max{pool=~".+Old Gen"})*100) > 89
        for: 5s
        labels:
          severity: warning
        annotations:
          description: "Pod堆内存触发CMSOldGC"
          value: "{{$value}}%"
    
    ## Pod实例异常
    - name: ContainerInstanceAlertRule
      rules:
      - alert: Pod实例异常
        expr: kube_pod_container_status_ready - kube_pod_container_status_running > 0
        for: 20s
        labels:
          severity: warning
        annotations:
          description: "Container实例异常"
    
    ## Pod实例OOM
    - name: ContainerOOMAlertRule
      rules:
      - alert: Pod实例OOM
        expr: kube_pod_container_status_terminated_reason{reason="OOMKilled"} > 0
        for: 1s
        labels:
          severity: warning
        annotations:
          description: "Container实例OOM"
    
    ## Pod实例驱逐
    - name: ContainerEvictionAlertRule
      rules:
      - alert: Pod实例驱逐
        expr: kube_pod_container_status_terminated_reason{reason="Evicted"} > 0
        for: 1s
        labels:
          severity: warning
        annotations:
          description: "Container实例驱逐"
    
    ## MQ内存告警
    - name: MQMemoryAlertRule
      rules:
      - alert: MQ内存水位线
        expr: rabbitmq_node_mem_alarm{job=~".*rabbitmq.*"} == 1
        for: 1s
        labels:
          severity: warning
        annotations:
          description: "RabbitMQ内存高水位线告警"
          summary: RabbitMQ {{`{{ $labels.instance }}`}} High Memory Alarm is going off.  Which means the node hit highwater mark and has cut off network connectivity, see RabbitMQ WebUI
      - alert: MQ内存使用告警
        expr: round(avg(rabbitmq_node_mem_used{job=~".*rabbitmq.*"} / rabbitmq_node_mem_limit{job=~".*rabbitmq.*"})by(node,kubernetes_namespace)*100) > 90
        for: 10s
        labels:
          severity: warning
        annotations:
          description: "RabbitMQ使用告警"
          value: "{{$value}}%"
          summary: RabbitMQ {{`{{ $labels.instance }}`}} Memory Usage > 90%
    
    ##PodJava进程异常
    - name: PodJavaProcessAlertRule
      rules:
      - alert: PodJava进程异常
        expr: sum(up{job="kubernetes-pods-jvm"})by(kubernetes_container_name,kubernetes_pod_name) == 0
        for: 10s
        labels:
          severity: warning
        annotations:
          description: "PodJava进程异常"
          summary: "赶快看看吧,顶不住了"
  • recording_rules

    groups:
      - name: CpuRecordRules
        rules:
        - record: onecore:pod
          expr: round(sum by(pod, container, instance, namespace, name) (irate(container_cpu_usage_seconds_total{container!~"|POD|prod-xianxiang-edu-loan|prod-xy-fund|prod-common-callcenter|prod-risk-service|prod-qn-web-api|prod-xc-fund|prod-xc-user|sys-ingress|etcd|prod-qn-mp|prod-xc-common|prod-xianxiang-zuul|prod-qn-user|prod-xc-riskapi|prod-common-message|prod-common-trust-service|prod-xc-collection|kube-controller-manager|prod-qn-risk|prod-xy-zuul|metrics-server|prod-nflow-manager|kube-scheduler|prod-qn-gateway|prod-xc-pay|coredns|kube-apiserver|prod-qn-oms|prod-common-service|prod-nfsp-service|pord-ingress|prod-qn-cms|prod-internal-ingress|prod-xc-loan|prod-rabbitmq|prometheus-server"}[5m]) * 100))
        - record: twocore:pod
          expr: round(sum by(pod, container, instance, namespace, name) (irate(container_cpu_usage_seconds_total{container=~"prod-xianxiang-edu-loan|prod-xy-fund|prod-common-callcenter|prod-risk-service|prod-qn-web-api|prod-xc-fund|prod-xc-user|sys-ingress|etcd|prod-qn-mp|prod-xc-common|prod-xianxiang-zuul|prod-qn-user|prod-xc-riskapi|prod-common-message|prod-common-trust-service|prod-xc-collection|kube-controller-manager|prod-qn-risk|prod-xy-zuul|metrics-server|prod-nflow-manager|kube-scheduler|prod-qn-gateway|prod-xc-pay|coredns|kube-apiserver|prod-qn-oms|prod-common-service|prod-nfsp-service|pord-ingress|prod-qn-cms|prod-internal-ingress|prod-xc-loan"}[5m]) * 100))
        - record: squarecore:pod
          expr: round(sum by(pod, container, instance, namespace, name) (irate(container_cpu_usage_seconds_total{container=~"prod-rabbitmq|prometheus-server"}[5m]) * 100))

     

标签:container,Kubernetes,expr,labels,rule,Prometheus,value,pod,prod
来源: https://www.cnblogs.com/apink/p/16367699.html

本站声明: 1. iCode9 技术分享网(下文简称本站)提供的所有内容,仅供技术学习、探讨和分享;
2. 关于本站的所有留言、评论、转载及引用,纯属内容发起人的个人观点,与本站观点和立场无关;
3. 关于本站的所有言论和文字,纯属内容发起人的个人观点,与本站观点和立场无关;
4. 本站文章均是网友提供,不完全保证技术分享内容的完整性、准确性、时效性、风险性和版权归属;如您发现该文章侵犯了您的权益,可联系我们第一时间进行删除;
5. 本站为非盈利性的个人网站,所有内容不会用来进行牟利,也不会利用任何形式的广告来间接获益,纯粹是为了广大技术爱好者提供技术内容和技术思想的分享性交流网站。

专注分享技术,共同学习,共同进步。侵权联系[81616952@qq.com]

Copyright (C)ICode9.com, All Rights Reserved.

ICode9版权所有