How can I alert on Pod Eviction or Failed && Evicted pods in Kubernetes?

Time:09-28

I can see from the pod description that my pod "Failed" because it was "Evicted" under memory pressure. But how can I test for too many "Failed && Evicted" pods with Prometheus alert rules or some other means?

I have Prometheus Operator installed. I can see metrics for Failed pods, but not for pods that are both Failed and Evicted.

kubectl describe pod gives:

Name:         besteffort-evictme-001
Namespace:    skyfii
Priority:     0
Node:         ip-172-17-2-169.ap-southeast-2.compute.internal/
Start Time:   Fri, 24 Sep 2021 15:28:53 +1000
Labels:       <none>
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Failed
Reason:       Evicted
Message:      The node was low on resource: memory. Container termination-demo-container was using 17165108Ki, which exceeds its request of 0. 
IP:           
IPs:          <none>
Containers:

The Prometheus query:

kube_pod_status_phase{phase="Failed"} > 0

shows the failed pod:

kube_pod_status_phase{endpoint="http",instance="172.17.3.141:8080",job="kube-state-metrics",namespace="skyfii",phase="Failed",pod="besteffort-evictme-001",service="prometheus-kube-state-metrics"}

but nothing shows up for

kube_pod_container_status_terminated_reason{reason="Evicted"} > 0

Any ideas?

Thanks, Karl

CodePudding user response:

So it seems I need to update my version of the kube-prometheus-stack Helm chart.

The "Evicted" reason we see in the pod description hangs off the pod's status (podStatus).

Newer kube-prometheus-stack versions bring in a later version (v2) of kube-state-metrics, which in turn exposes the kube_pod_status_reason metric.

So I can now craft the query I need.

I added the following rule group to my Prometheus alerting rules via the additionalPrometheusRulesMap section of the kube-prometheus-stack values.yaml.
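As a rough sketch of where such a rule group sits in values.yaml: the map takes an arbitrary key, and the rule groups go under that key's groups list. The key name eviction-rules is hypothetical and chosen only for illustration.

```yaml
# values.yaml for kube-prometheus-stack (skeleton only)
additionalPrometheusRulesMap:
  # "eviction-rules" is a hypothetical key; any unique name should work
  eviction-rules:
    # replace the empty list with rule groups,
    # each a list item with `- name:` and `rules:`
    groups: []
```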


      - name: kubernetes-container-evictions
        rules:
        # Mem pressure evicted pods are left in a Failed state, alert if we see too many failed pods
        # NB you will need to delete the failed pods after investigating
        - alert: FailedEvictedPods
          expr: sum by(namespace, pod) (kube_pod_status_phase{phase="Failed"} > 0 and on(namespace, pod) kube_pod_status_reason{reason="Evicted"} > 0) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            message: 'Failed Evicted pod:{{ $labels.pod }} namespace:{{ $labels.namespace }}'

        - alert: TooManyEvictedPods
          expr: sum(kube_pod_status_reason{reason="Evicted"}) >= 2
          labels:
            severity: high
          annotations:
            message: 'Too many Failed Evicted Pods: {{ $value }}'

And now I get the alerts I wanted :-)
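To check the FailedEvictedPods rule without waiting for a real eviction, a promtool unit test along these lines may help. This is only a sketch under assumptions: it presumes the rule group has been saved as a standalone rules file under a top-level groups: key, in a hypothetical file named evictions.rules.yaml, and the input series are synthetic values modeled on the pod from the question.

```yaml
# evictions_test.yaml (hypothetical file name)
rule_files:
  - evictions.rules.yaml   # assumed: standalone copy of the rule group

evaluation_interval: 1m

tests:
  - interval: 1m
    # Synthetic samples: both series stay at 1 for 15 minutes
    input_series:
      - series: 'kube_pod_status_phase{namespace="skyfii",pod="besteffort-evictme-001",phase="Failed"}'
        values: '1+0x15'
      - series: 'kube_pod_status_reason{namespace="skyfii",pod="besteffort-evictme-001",reason="Evicted"}'
        values: '1+0x15'
    alert_rule_test:
      # The rule has `for: 10m`, so it should be firing by 11m
      - eval_time: 11m
        alertname: FailedEvictedPods
        exp_alerts:
          - exp_labels:
              severity: warning
              namespace: skyfii
              pod: besteffort-evictme-001
            exp_annotations:
              message: 'Failed Evicted pod:besteffort-evictme-001 namespace:skyfii'
```

It could then be run with promtool test rules evictions_test.yaml.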
