Kubernetes OOMKilled with multiple containers


I run my service in a Kubernetes cluster (AWS EKS). Recently, I added a new container (a sidecar) to the pod. Since then, I've started observing OOMKilled restarts, but the metrics do not show any high memory usage. This is the config:

Containers:
  side-car:
    Container ID:   ...
    Image:          ...
    ...
    State:          Running
      Started:      Mon, 21 Feb 2022 09:11:07 +0100
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 17 Feb 2022 18:36:28 +0100
      Finished:     Mon, 21 Feb 2022 09:11:06 +0100
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:        1
      memory:     2Gi
    ...
    ...
  my-service:
    Container ID:   ...
    ...
    ...
    ...

    State:          Running
      Started:      Thu, 17 Feb 2022 18:36:28 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     3
      memory:  3Gi
    Requests:
      cpu:      2
      memory:   3Gi

Both the side-car and my service have memory limits (and requests) set. At the time of the OOMKill, neither container was using more memory than it requested or is limited to. For example, in one case the side-car was using 20MiB and my-service 800MiB, well below the limits. Still, Kubernetes restarted the container (the side-car). For the record, before the side-car was added, my-service was running without problems and no OOMKilled was ever observed.
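
For reference, this is roughly how I check the per-container last-termination state (the pod name my-pod and namespace my-ns below are placeholders):

    # Print name, last termination reason and exit code for each container:
    kubectl -n my-ns get pod my-pod \
      -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'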

CodePudding user response:

Maybe your sidecar container has some performance issue that you can't catch, and at some point in time it requests more memory than its limit?

Check for an OOM kill due to the container limit being reached:

   State:          Running
      Started:      Thu, 10 Oct 2019 11:14:13 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137

Exit code 137 is important because it means that the system terminated the container after it tried to use more memory than its limit.

To monitor this, you always have to compare memory usage against the limit. The percentage of node memory used by a pod is usually a bad indicator, as it says nothing about how close the usage is to the limit. In Kubernetes, limits are applied to containers, not pods, so monitor the memory usage of each container against that container's limit.
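
As a rough sketch of that comparison (assuming metrics-server is installed; the pod name my-pod is a placeholder), you can put a container's current usage next to its configured limit with kubectl:

    # Current per-container memory usage (requires metrics-server):
    kubectl top pod my-pod --containers

    # Configured memory limit of each container, for comparison:
    kubectl get pod my-pod \
      -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits.memory}{"\n"}{end}'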

CodePudding user response:

Most likely you just don't get to see the moment when memory usage goes above the limit, since metrics are usually pulled at fixed intervals (cAdvisor, currently the de-facto source for container metrics, only refreshes its data every 10-15 seconds by default).

How to troubleshoot further? Connect to the node that runs the sidecar container and look at the kernel logs. You can use tail -f /var/log/syslog | grep -i kernel (a sample of what this looks like is in this video). You should see two lines like the ones below, which show the aftermath of the container's cgroup limit being breached and the offending process being terminated:

Jan 16 21:33:51 aks-agentpool-20086390-vmss00003K kernel: [ 8334.895437] Memory cgroup out of memory: Killed process 14300 (dotnet) total-vm:172050596kB, anon-rss:148368kB, file-rss:25056kB, shmem-rss:0kB, UID:0 pgtables:568kB oom_score_adj:-997
Jan 16 21:33:51 aks-agentpool-20086390-vmss00003K kernel: [ 8334.906653] oom_reaper: reaped process 14300 (dotnet), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Pay special attention to the anon-rss and file-rss values, and compare their sum against the limit you've set for the sidecar container.
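
If you want to watch the same counters before a kill happens, a rough sketch is to read the cgroup files from inside the side-car (the pod name my-pod is a placeholder; the paths below assume a cgroup v2 node, while older cgroup v1 nodes expose /sys/fs/cgroup/memory/memory.usage_in_bytes and memory.limit_in_bytes instead):

    # Current memory usage and limit as the kernel sees them (cgroup v2):
    kubectl exec my-pod -c side-car -- \
      cat /sys/fs/cgroup/memory.current /sys/fs/cgroup/memory.max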

If you have control over the code that runs in the sidecar container, then you can add some instrumentation code to print out the amount of memory used at small enough intervals, and simply output that to the console. Once the container is OOMKilled, you'll still have access to the logs to see what happened (use the --previous flag with the kubectl logs command). Have a look at this answer for more info.
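
If the side-car image has a shell, a minimal sketch of such instrumentation (the 5-second interval is arbitrary; adjust the cgroup path for v1 nodes) is a loop that dumps the memory counter to stdout:

    # Log memory usage periodically so it ends up in the container logs:
    while true; do
      echo "$(date -u +%FT%TZ) memory.current=$(cat /sys/fs/cgroup/memory.current)"
      sleep 5
    done

    # After an OOMKill, the previous container's output is still available:
    kubectl logs my-pod -c side-car --previous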

Including this just for completeness: your system could potentially run so low on memory that the kernel's OOM killer is invoked and your sidecar container is chosen to be terminated (such a scenario is described here). That's highly unlikely in your case though, as from my understanding the sidecar container gets terminated repeatedly, which most likely points to an issue with that container only.
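
To rule that scenario out (a sketch; the node name my-node is a placeholder), check whether the node itself reported memory pressure or is close to its allocatable memory:

    # Node conditions (look for MemoryPressure) and current node usage:
    kubectl describe node my-node | grep -A 8 'Conditions:'
    kubectl top node my-node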
