I can see from the pod description that my pod "Failed" because it was "Evicted" under memory pressure. But how can I test for too many "Failed && Evicted" pods with Prometheus alert rules, or by some other means?
I have Prometheus Operator installed. I can see metrics for Failed pods, but not for pods that are both Failed and Evicted.
kubectl describe pod gives:
Name: besteffort-evictme-001
Namespace: skyfii
Priority: 0
Node: ip-172-17-2-169.ap-southeast-2.compute.internal/
Start Time: Fri, 24 Sep 2021 15:28:53 +1000
Labels: <none>
Annotations: kubernetes.io/psp: eks.privileged
Status: Failed
Reason: Evicted
Message: The node was low on resource: memory. Container termination-demo-container was using 17165108Ki, which exceeds its request of 0.
IP:
IPs: <none>
Containers:
The Prometheus query:
kube_pod_status_phase{phase="Failed"} > 0
shows the failed pod:
kube_pod_status_phase{endpoint="http",instance="172.17.3.141:8080",job="kube-state-metrics",namespace="skyfii",phase="Failed",pod="besteffort-evictme-001",service="prometheus-kube-state-metrics"}
but nothing shows up for:
kube_pod_container_status_terminated_reason{reason="Evicted"} > 0
Any ideas?
Thanks, Karl
CodePudding user response:
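So it seems I need to update my version of the kube-prometheus-stack Helm chart.
The "Evicted" Reason we see in the pod description hangs off podStatus, not off a container's terminated state, which would explain why the container-level kube_pod_container_status_terminated_reason query returns nothing.
Newer kube-prometheus-stack versions bring in the later version (v2) of kube-state-metrics, which in turn exposes the pod status reason as a metric (the kube_pod_status_reason used in the rules below), so I'll be able to craft the query I need now.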
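As a cross-check outside Prometheus, the same pod-level fields can be read straight from the pod objects. The following is only a minimal sketch, not something taken from the setup above: it assumes input shaped like the output of kubectl get pods -A -o json, and the function name failed_evicted_pods and the SAMPLE data are made up for illustration.

```python
def failed_evicted_pods(pod_list):
    """Return (namespace, name) for every pod whose pod-level status
    has phase "Failed" and reason "Evicted"."""
    hits = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        # Both fields live on the pod status, not on the container statuses.
        if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
            meta = pod.get("metadata", {})
            hits.append((meta.get("namespace"), meta.get("name")))
    return hits


# Hypothetical sample, trimmed to the fields the filter reads.
SAMPLE = {
    "items": [
        {
            "metadata": {"namespace": "skyfii", "name": "besteffort-evictme-001"},
            "status": {"phase": "Failed", "reason": "Evicted"},
        },
        {
            "metadata": {"namespace": "skyfii", "name": "healthy-pod"},
            "status": {"phase": "Running"},
        },
        {
            # Failed for some other reason: must not be reported.
            "metadata": {"namespace": "default", "name": "crashed-job"},
            "status": {"phase": "Failed", "reason": "DeadlineExceeded"},
        },
    ]
}

if __name__ == "__main__":
    for namespace, name in failed_evicted_pods(SAMPLE):
        print(f"{namespace}/{name}")
```

To run it against a real cluster, the JSON from kubectl could be loaded with json.load(sys.stdin) and passed to the function in place of SAMPLE.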
I added this to my Prometheus alert rules by adding it to the additionalPrometheusRulesMap section of the kube-prometheus-stack's values.yaml:
- name: kubernetes-container-evictions
  rules:
    # Mem pressure evicted pods are left in a Failed state, alert if we see too many failed pods
    # NB you will need to delete the failed pods after investigating
    - alert: FailedEvictedPods
      expr: sum by(namespace, pod) (kube_pod_status_phase{phase="Failed"} > 0 and on(namespace, pod) kube_pod_status_reason{reason="Evicted"} > 0) > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        message: 'Failed Evicted pod:{{ $labels.pod }} namespace:{{ $labels.namespace }}'
    - alert: TooManyEvictedPods
      expr: sum(kube_pod_status_reason{reason="Evicted"}) >= 2
      labels:
        severity: high
      annotations:
        message: 'Too many Failed Evicted Pods: {{ $value }}'
And now I get the alerts I wanted :-)
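Before relying on the alert, it can help to confirm that the expression actually returns series. Below is a rough sketch of how the result of a Prometheus instant query (GET /api/v1/query) could be summarised per namespace. The base URL http://localhost:9090 (for example via a port-forward), the helper names, and SAMPLE_RESPONSE are all assumptions for illustration; only the response layout follows the Prometheus HTTP API.

```python
from urllib.parse import urlencode

# The inner part of the FailedEvictedPods expression.
QUERY = (
    'kube_pod_status_phase{phase="Failed"} > 0 '
    'and on(namespace, pod) kube_pod_status_reason{reason="Evicted"} > 0'
)


def query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def evicted_by_namespace(response):
    """Count the returned series per namespace label."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    counts = {}
    for series in response["data"]["result"]:
        namespace = series["metric"].get("namespace", "")
        counts[namespace] = counts.get(namespace, 0) + 1
    return counts


# Hypothetical response body, already decoded from JSON.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "namespace": "skyfii",
                    "pod": "besteffort-evictme-001",
                    "phase": "Failed",
                },
                "value": [1632461333.0, "1"],
            }
        ],
    },
}

if __name__ == "__main__":
    # Assumed address; adjust to wherever Prometheus is reachable.
    print(query_url("http://localhost:9090", QUERY))
    print(evicted_by_namespace(SAMPLE_RESPONSE))
```

The sketch only parses an already fetched response; fetching the URL (for example with urllib.request.urlopen followed by json.load) is left out so it runs without a cluster.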