I'm running AWS EKS on Fargate and using Kubernetes to orchestrate multiple cron jobs. I spin roughly 1,000 pods up and down over the course of a day.
Very rarely (about once every three weeks) a pod gets stuck in ContainerCreating and just hangs there, and because I have concurrency disabled, that particular job never runs. The fix is simply terminating the job or the pod and letting it restart, but that is a manual intervention.
Is there a way to get a pod to terminate or restart if it takes too long to create?
The reason for the pod getting stuck varies quite a bit, so the solution would need to be general. It can be a time-based solution: all the pods run the same code with different configurations, so the startup time is fairly consistent.
CodePudding user response:
Sadly, there is no built-in mechanism to stop a job if it fails at image pulling or container creation. I also tried to achieve what you are describing.
You can set a `backoffLimit` inside your Job template, but it only counts retries of containers that fail while running; it does not cover pods stuck in ContainerCreating.
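For reference, this is roughly where `backoffLimit` sits in a CronJob manifest with concurrency disabled; the name, schedule, and image below are placeholders, and as noted, this setting will not help with pods that never leave ContainerCreating:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cron        # placeholder name
spec:
  schedule: "*/15 * * * *"  # placeholder schedule
  concurrencyPolicy: Forbid # "concurrency disabled", as in the question
  jobTemplate:
    spec:
      backoffLimit: 2       # only counts retries after a container has started and failed
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: example.com/worker:latest  # placeholder image
```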
What you can do is write a script that runs `kubectl describe` on each pod in the namespace, parses the output, and restarts (deletes) any pod that is stuck in ContainerCreating.
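A minimal sketch of such a watchdog, run from cron or a loop. It assumes `kubectl` and `jq` are available; the namespace and the timeout are placeholders that you would adjust to comfortably exceed your normal startup time. It deletes Pending pods whose containers have reported ContainerCreating for longer than the threshold, so the Job controller recreates them:

```bash
#!/usr/bin/env bash
# Watchdog sketch: delete pods stuck in ContainerCreating for too long
# so the Job controller recreates them. Requires kubectl and jq.
set -euo pipefail

NAMESPACE="cron-jobs"   # placeholder namespace
MAX_AGE_SECONDS=600     # placeholder threshold; set well above normal startup time

kubectl get pods -n "$NAMESPACE" -o json \
  | jq -r --argjson max "$MAX_AGE_SECONDS" '
      .items[]
      # only pods that have not started yet
      | select(.status.phase == "Pending")
      # only pods whose containers are waiting in ContainerCreating
      | select([.status.containerStatuses[]?.state.waiting.reason]
               | index("ContainerCreating"))
      # only pods older than the threshold
      | select((now - (.metadata.creationTimestamp | fromdateiso8601)) > $max)
      | .metadata.name' \
  | xargs -r -n 1 kubectl delete pod -n "$NAMESPACE"
```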
Or try to debug/trace what is causing this: run `kubectl describe pods` to get more information while your pod is in ContainerCreating.
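For example (the pod name and namespace below are placeholders):

```bash
# Full status plus the Events section, which usually names the cause
kubectl describe pod <pod-name> -n <namespace>

# Or list recent events for the whole namespace, most recent last
kubectl get events -n <namespace> --sort-by=.lastTimestamp
```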