How to prevent or fix Kubernetes pod getting stuck in containerCreating occasionally

Time:07-05

I'm running Kubernetes cron jobs on AWS EKS with Fargate, spinning roughly 1,000 pods up and down over the course of a day.

Very rarely (about once every three weeks) one of the pods gets stuck in ContainerCreating and just hangs there, and because I have concurrency disabled, that particular job never runs. The fix is simply terminating the job or the pod so it restarts, but that is a manual intervention.

Is there a way to get a pod to terminate or restart, if it takes too long to create?

The reason for the pod getting stuck varies quite a bit, so a solution would need to be general. A time-based solution would work: all the pods run the same code with different configurations, so startup time is fairly consistent.

CodePudding user response:

Sadly, there is no built-in mechanism to stop a Job if it fails during image pulling or container creation. I also tried to do what you are trying to achieve.

You can set a backoffLimit in your Job template, but it only counts retries of pods that fail while running; it won't handle pods stuck in ContainerCreating.
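For reference, a minimal sketch of where backoffLimit sits in a Job spec; the job, container, and image names here are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job        # placeholder name
spec:
  backoffLimit: 3          # retries counted only for pods that fail after starting
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker             # placeholder
          image: example/worker:v1 # placeholder image
```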

What you can do is run a script that periodically lists the pods in the namespace, parses the output, and deletes any pod that has been stuck in ContainerCreating for too long; the Job controller will then recreate it.
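A minimal sketch of such a watcher in Python, assuming kubectl access from wherever it runs. The 300-second threshold and the --apply flag are illustrative choices, not part of any standard tooling; pods stuck in ContainerCreating report phase Pending, which is what the script keys on:

```python
import json
import subprocess
import sys
from datetime import datetime, timezone

STUCK_SECONDS = 300  # assumed threshold; tune to your typical startup time


def stuck_pods(pods_json, now, threshold=STUCK_SECONDS):
    """Return names of Pending pods older than `threshold` seconds.

    `pods_json` is the parsed output of `kubectl get pods -o json`.
    Pods stuck in ContainerCreating stay in phase Pending.
    """
    names = []
    for pod in pods_json.get("items", []):
        if pod["status"].get("phase") != "Pending":
            continue
        created = datetime.strptime(
            pod["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if (now - created).total_seconds() > threshold:
            names.append(pod["metadata"]["name"])
    return names


if __name__ == "__main__" and "--apply" in sys.argv:
    # Only touches the cluster when run with --apply.
    out = subprocess.run(
        ["kubectl", "get", "pods", "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    for name in stuck_pods(json.loads(out), datetime.now(timezone.utc)):
        # Deleting the pod lets the Job controller create a replacement.
        subprocess.run(["kubectl", "delete", "pod", name], check=True)
```

Run it from a cron job or a small sidecar deployment; the pure stuck_pods function keeps the cluster-touching part separate so the decision logic is easy to test.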

Or try to debug/trace what is causing this in the first place: run kubectl describe pods while the pod is stuck in ContainerCreating and check the Events section for the underlying error.
