How to solve the OOMKilled (exit code 137) pod problem in Kubernetes on GKE?


I need some help with pods being OOMKilled with status 137; I have been stuck on this for three days now. I built a Docker image of a Flask application. When I ran the image locally it worked fine, with a memory usage of about 2.7 GB. I then uploaded it to GKE with the following specification.

[Cluster screenshot]

The Workloads page shows "Does not have minimum availability".

[Workload screenshot]

Checking the pod "news-category-68797b888c-fmpnc" shows CrashLoopBackOff:

Error "back-off 5m0s restarting failed container=news-category-pbz8s pod=news-category-68797b888c-fmpnc_default(d5b0fff8-3e35-4f14-8926-343e99195543): CrashLoopBackOff"

Checking the pod's YAML shows it was OOMKilled with exit code 137:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2022-02-17T16:07:36Z"
  generateName: news-category-68797b888c-
  labels:
    app: news-category
    pod-template-hash: 68797b888c
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName: {}
        f:labels:
          .: {}
          f:app: {}
          f:pod-template-hash: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"8d99448a-04f6-4651-a652-b1cc6d0ae4fc"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        f:containers:
          k:{"name":"news-category-pbz8s"}:
            .: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:name: {}
            f:resources: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext: {}
        f:terminationGracePeriodSeconds: {}
    manager: kube-controller-manager
    operation: Update
    time: "2022-02-17T16:07:36Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          k:{"type":"ContainersReady"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
          k:{"type":"Initialized"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Ready"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
        f:containerStatuses: {}
        f:hostIP: {}
        f:phase: {}
        f:podIP: {}
        f:podIPs:
          .: {}
          k:{"ip":"10.16.3.4"}:
            .: {}
            f:ip: {}
        f:startTime: {}
    manager: kubelet
    operation: Update
    time: "2022-02-17T16:55:18Z"
  name: news-category-68797b888c-fmpnc
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: news-category-68797b888c
    uid: 8d99448a-04f6-4651-a652-b1cc6d0ae4fc
  resourceVersion: "25100"
  uid: d5b0fff8-3e35-4f14-8926-343e99195543
spec:
  containers:
  - image: gcr.io/projectiris-327708/news_category:noConsoleDebug
    imagePullPolicy: IfNotPresent
    name: news-category-pbz8s
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-z2lbp
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: gke-news-category-cluste-default-pool-42e1e905-ftzb
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-z2lbp
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-02-17T16:07:37Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-02-17T16:55:18Z"
    message: 'containers with unready status: [news-category-pbz8s]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-02-17T16:55:18Z"
    message: 'containers with unready status: [news-category-pbz8s]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-02-17T16:07:36Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://a582af0248a330b7d4087916752bd941949387ed708f00b3aac6f91a6ef75e63
    image: gcr.io/projectiris-327708/news_category:noConsoleDebug
    imageID: gcr.io/projectiris-327708/news_category@sha256:c4b3385bd80eff2a0c0ec0df18c6a28948881e2a90dd1c642ec6960b63dd017a
    lastState:
      terminated:
        containerID: containerd://a582af0248a330b7d4087916752bd941949387ed708f00b3aac6f91a6ef75e63
        exitCode: 137
        finishedAt: "2022-02-17T16:55:17Z"
        reason: OOMKilled
        startedAt: "2022-02-17T16:54:48Z"
    name: news-category-pbz8s
    ready: false
    restartCount: 13
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=news-category-pbz8s pod=news-category-68797b888c-fmpnc_default(d5b0fff8-3e35-4f14-8926-343e99195543)
        reason: CrashLoopBackOff
  hostIP: 10.160.0.42
  phase: Running
  podIP: 10.16.3.4
  podIPs:
  - ip: 10.16.3.4
  qosClass: BestEffort
  startTime: "2022-02-17T16:07:37Z"

My question is what to do and how to solve this. I tried adding resources to the YAML file under spec:

resources:
  limits:
    memory: 32Gi
  requests:
    memory: 16Gi

That also produces errors. How do I increase the memory available to my pods? And when I do increase the memory, the pod shows "Pod Unscheduled".

Could someone please give me some insight into clusters, nodes, and pods, and how to solve this? Thank you.

CodePudding user response:

A Pod always runs on a Node and is the smallest deployable unit in Kubernetes. A Node is a worker machine in Kubernetes and may be either a virtual or a physical machine, depending on the cluster. A cluster is a set of nodes that run your containerized applications.

Now, coming back to your question about the OOM issue: it usually occurs because your pods try to use more memory than the limit you set in the YAML file. You can think of your resource allocation as the range from requests to limits, in your case 16Gi to 32Gi. Should you simply assign more memory to solve this? Absolutely not! That is not how containerization, or microservices in general, are meant to work. Read up on vertical versus horizontal scaling: rather than giving a single pod an ever larger limit, keep each pod's footprint modest and run more replicas.
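As a rough sketch of the horizontal approach (the container name, image and "app" label are taken from your pod dump; the replica count and the 3Gi/4Gi figures are assumptions you would tune from monitoring, starting from the ~2.7 GB you observed locally):

# Sketch of a Deployment that scales out with several modest pods instead of
# one pod with a huge limit. The memory figures are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: news-category
spec:
  replicas: 2                      # scale out instead of up
  selector:
    matchLabels:
      app: news-category
  template:
    metadata:
      labels:
        app: news-category
    spec:
      containers:
      - name: news-category-pbz8s
        image: gcr.io/projectiris-327708/news_category:noConsoleDebug
        resources:
          requests:
            memory: 3Gi            # what the scheduler reserves on a node
            cpu: 500m
          limits:
            memory: 4Gi            # exceeding this gets the container OOMKilled (137)
            cpu: "1"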

Now, how can you avoid this problem? Suppose your container ran a basic Java Spring Boot application. Then you would set the JVM arguments (-Xms, -Xmx, GC policies, etc.), which are standard configuration for any Java application; unless you specify them explicitly in the container environment, the JVM will not behave the way it does on a local machine.
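For example (purely illustrative, since your image actually runs Flask, and the name and image below are made up): a JVM container would typically be handed its heap and GC settings through an environment variable such as JAVA_TOOL_OPTIONS, with the container's memory limit sized above -Xmx so the kernel does not OOM-kill the JVM:

# Illustrative sketch only; hypothetical pod name and image.
apiVersion: v1
kind: Pod
metadata:
  name: spring-boot-demo
spec:
  containers:
  - name: spring-boot-demo
    image: example.com/spring-boot-demo:latest    # hypothetical image
    env:
    - name: JAVA_TOOL_OPTIONS
      value: "-Xms512m -Xmx1g -XX:+UseG1GC"
    resources:
      limits:
        memory: 1536Mi    # headroom above -Xmx for metaspace, threads, etc.

The same principle applies to your Flask image: the process inside the container (for example the WSGI server's worker count) has to be configured with the container's memory limit in mind.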

CodePudding user response:

The message "back-off restarting failed container" appears when you are facing a temporary resource overload, often as the result of an activity spike. The OOMKilled exit code 137 means that a container or pod was terminated because it used more memory than it was allowed. OOM stands for "Out Of Memory".

Based on this information from your GKE configuration (16Gi of memory in total), I suggest reviewing the memory limits configured in your GKE cluster. You can confirm the issue with the following command: kubectl describe pod [name]

You will need to determine why Kubernetes decided to terminate the pod with the OOMKilled error, and then adjust the memory requests and limits accordingly.
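This also explains the "Pod Unscheduled" message you saw: a pod's memory request must fit within a node's allocatable memory, which is less than the node's 16Gi total because GKE reserves a slice for the system, so a 16Gi request can never be placed. A sketch of values that would at least schedule (the exact numbers are assumptions to be refined from the metrics described below):

resources:
  requests:
    memory: 4Gi    # must fit inside the node's allocatable memory
  limits:
    memory: 6Gi    # the OOM-kill threshold for the container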

To review all the GKE metrics, go to the Google Cloud console, open the Monitoring dashboard, and select GKE. There you will find the statistics related to memory and CPU.

Also, it is important to review whether the containers have enough resources to run your applications; see the Kubernetes best practices for resource requests and limits.

GKE also has an interesting feature, autoscaling. It is very useful because it automatically resizes your GKE cluster based on the demands of your workloads. See the GKE cluster autoscaler documentation for more information.
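Cluster autoscaler itself is enabled on the node pool through the console or gcloud rather than with a manifest, but a complementary, workload-level sketch is a HorizontalPodAutoscaler, which adds pod replicas as memory pressure grows. The bounds and target below are illustrative assumptions, and it presumes the Deployment has memory requests set:

# Illustrative HorizontalPodAutoscaler sketch; thresholds are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: news-category
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: news-category
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 75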
