EKS Kubernetes disk metrics using datadog

Time:10-01

I am looking for disk space usage metrics that I can monitor for an EKS cluster. I came across three different metric alerts:

k8s-high-filesystem-usage:
  name: "(k8s) High Filesystem Usage Detected"
  type: metric alert
  query: |
    avg(last_10m):avg:kubernetes.filesystem.usage_pct{*} by {cluster_name} > 90
  message: |
    {{#is_warning}}
    {{cluster_name.name}} filesystem usage greater than 80% for 10 minutes
    {{/is_warning}}
    {{#is_alert}}
    {{cluster_name.name}} filesystem usage greater than 90% for 10 minutes
    {{/is_alert}}
k8s-high-disk-usage:
  name: "(k8s) High Disk Usage Detected"
  type: metric alert
  query: |
    min(last_5m):min:kubernetes.kubelet.volume.stats.used_bytes{*} by {cluster_name} / avg:kubernetes.kubelet.volume.stats.capacity_bytes{*} by {cluster_name} * 100 > 90
  message: |
    ({{cluster_name.name}}) High disk usage detected

k8s-high-disk-usage:
  name: "(k8s) High Disk Usage Detected"
  type: metric alert
  query: |
    min(last_5m):min:system.disk.used{*} by {host,cluster_name} / avg:system.disk.total{*} by {host,cluster_name} * 100 > 90
  message: |
    ({{cluster_name.name}}) High disk usage detected on {{host.name}}

What do these three metrics mean? When should I use each of them?

CodePudding user response:

This is a confusing area that is poorly documented. I'm glad you asked the question.

First, some background on each metric.

system.disk.used

This metric is the most straightforward: disk space used, in bytes, on the disk partitions of the k8s host. It is reported by a core check in the Datadog Agent. Find the source for this metric in corechecks/system/disk/disk.go. The check reports disk usage for each volume on the host.

kubernetes.filesystem.usage_pct

This metric reports disk space used for each node in a k8s cluster. The data is pulled from the metrics published by the kubelet under /stats/summary. You can query the data directly using kubectl, e.g.

kubectl get --raw /api/v1/nodes/<node name>/proxy/stats/summary

The code can be found by tracing the calls in the cluster orchestrator and kubelet util files. This metric also reports disk usage percentage by pod, device, and other potentially useful tags.
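To make the relationship to /stats/summary concrete, here is a minimal sketch of deriving a node-level usage percentage from that response. The field names (node.fs.usedBytes, node.fs.capacityBytes) follow the kubelet Summary API; the sample payload and the function name are made up for illustration, and this is not necessarily how the Datadog agent computes the metric. Check the unit your account reports for kubernetes.filesystem.usage_pct before copying a threshold, since a 0-1 fraction and a 0-100 percentage need different thresholds.

```python
import json

# Hypothetical, trimmed /stats/summary payload; real responses have many more fields.
SAMPLE_SUMMARY = json.loads("""
{
  "node": {
    "nodeName": "example-node",
    "fs": {"usedBytes": 45000000000, "capacityBytes": 100000000000}
  }
}
""")


def node_fs_usage_pct(summary: dict) -> float:
    """Return node filesystem usage as a percentage in the range 0-100."""
    fs = summary["node"]["fs"]
    capacity = fs["capacityBytes"]
    if capacity <= 0:
        raise ValueError("capacityBytes must be positive")
    return 100.0 * fs["usedBytes"] / capacity


if __name__ == "__main__":
    print(f"{node_fs_usage_pct(SAMPLE_SUMMARY):.1f}%")
```

To try it against a real node, save the output of the kubectl command above to a file and pass the parsed JSON to the function.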

kubernetes.kubelet.volume.stats.used_bytes

This metric reports data about pods' persistent volume claims (PVCs). You can find out how many bytes are used by each PVC. This metric only exists for pods with persistent volume claims. This is also in the cluster/orchestrator code base.
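The per-claim data is also visible in the same /stats/summary response, under each pod's volume list. The sketch below picks out only the volumes that reference a PVC and computes a usage percentage for each. The field names (pods[].volume[].pvcRef, usedBytes, capacityBytes) follow the kubelet Summary API; the sample payload and the function name are hypothetical, and the agent's own collection path may differ.

```python
# Hypothetical, trimmed /stats/summary payload with one PVC-backed volume
# and one volume that is not backed by a PVC.
SAMPLE_PODS_SUMMARY = {
    "pods": [
        {
            "podRef": {"name": "db-0", "namespace": "default"},
            "volume": [
                {
                    "name": "data",
                    "pvcRef": {"name": "data-db-0", "namespace": "default"},
                    "usedBytes": 2_500_000_000,
                    "capacityBytes": 10_000_000_000,
                },
                {"name": "scratch", "usedBytes": 4096, "capacityBytes": 1_000_000},
            ],
        }
    ]
}


def pvc_usage_pct(summary: dict) -> dict:
    """Map 'namespace/claim' to usage percentage for PVC-backed volumes only."""
    usage = {}
    for pod in summary.get("pods", []):
        for vol in pod.get("volume", []):
            ref = vol.get("pvcRef")
            if ref is None:
                continue  # no pvcRef: not a persistent volume claim
            capacity = vol.get("capacityBytes", 0)
            if capacity > 0:
                key = f'{ref["namespace"]}/{ref["name"]}'
                usage[key] = 100.0 * vol.get("usedBytes", 0) / capacity
    return usage


if __name__ == "__main__":
    for claim, pct in pvc_usage_pct(SAMPLE_PODS_SUMMARY).items():
        print(f"{claim}: {pct:.1f}%")
```

Pods without a PVC contribute nothing to the result, which mirrors why the metric is absent for them.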

So, with that background in mind, when would you use each metric?

Use system.disk.used to track disk usage at the node level. If you want to monitor the disk usage of hosts, watch this value. Group by the device tag: you will be most interested in the physical disk partitions and the Docker volumes. You can probably ignore the shm and tmpfs volumes, which are memory-backed rather than on disk. Note that since this is a core check, this metric is reported for any host with the Datadog agent installed, not just k8s hosts.

Use kubernetes.filesystem.usage_pct to track disk usage by k8s hosts. It probably makes sense to monitor with cluster_name and with host, and to use the max value, e.g. update your query to:

avg(last_10m):max:kubernetes.filesystem.usage_pct{*} by {cluster_name,host}

If you want pod-level usage, you can also add pod_name to the query.

Finally, use kubernetes.kubelet.volume.stats.used_bytes to monitor disk space of persistent volume claims. You'll want to add the persistentvolumeclaim tag to the query so you know which claim you're looking at.
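As a rough illustration of the grouping advice, the sketch below builds used/total percentage queries with the tags suggested above. The helper name and its parameters are hypothetical, and the choice of avg over the window with max as the space aggregator is an assumption rather than the only sensible option (the original monitors use min); adjust both to taste and validate the resulting string in the Datadog monitor editor.

```python
def ratio_query(used, total, group_by, threshold, window="last_5m"):
    """Build a monitor query string alerting when used/total * 100 exceeds threshold."""
    by = ",".join(group_by)
    return (
        f"avg({window}):( max:{used}{{*}} by {{{by}}}"
        f" / max:{total}{{*}} by {{{by}}} ) * 100 > {threshold}"
    )


# Node-level disk usage, one series per host and device.
NODE_QUERY = ratio_query(
    "system.disk.used", "system.disk.total", ["host", "device"], 90
)

# PVC usage, one series per cluster and claim.
PVC_QUERY = ratio_query(
    "kubernetes.kubelet.volume.stats.used_bytes",
    "kubernetes.kubelet.volume.stats.capacity_bytes",
    ["cluster_name", "persistentvolumeclaim"],
    90,
)

if __name__ == "__main__":
    print(NODE_QUERY)
    print(PVC_QUERY)
```

Grouping both sides of the division by the same tags keeps each ratio tied to a single device or claim, so the alert message can name the one that is filling up.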
