I'm running AWS EKS on Fargate and using Kubernetes to orchestrate multiple cron jobs. I spin roughly 1,000 pods up and down over the course of a day.
Very rarely (about once every three weeks) a pod gets stuck in ContainerCreating and just hangs there, and because I have concurrency disabled, that particular job never runs. The fix is simply terminating the job or the pod and letting it restart, but that is a manual intervention.
Is there a way to get a pod to terminate or restart if it takes too long to create?
The reason for the pod getting stuck varies quite a bit, so the solution would need to be general. It can be a time-based solution: all the pods run the same code with different configurations, so the startup time is fairly consistent.
CodePudding user response:
Sadly, there is no built-in mechanism to stop a job if it fails at image pulling or container creation. I also tried to achieve what you are describing.
You can set a `backoffLimit` inside your Job template, but it only counts retries of containers that fail while running; it does not cover pods stuck in ContainerCreating.
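For reference, this is roughly where `backoffLimit` sits in a CronJob manifest with concurrency disabled; the name, schedule, and image below are placeholders, and as noted, this setting will not help with pods that never leave ContainerCreating:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cron        # placeholder name
spec:
  schedule: "*/15 * * * *"  # placeholder schedule
  concurrencyPolicy: Forbid # "concurrency disabled", as in the question
  jobTemplate:
    spec:
      backoffLimit: 2       # only counts retries after a container has started and failed
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: example.com/worker:latest  # placeholder image
```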
What you can do is write a script that runs `kubectl describe` on each pod in the namespace, parses the output, and restarts (deletes) any pod that is stuck in ContainerCreating.
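A minimal sketch of such a watchdog, run from cron or a loop. It assumes `kubectl` and `jq` are available; the namespace and the timeout are placeholders that you would adjust to comfortably exceed your normal startup time. It deletes Pending pods whose containers have reported ContainerCreating for longer than the threshold, so the Job controller recreates them:

```bash
#!/usr/bin/env bash
# Watchdog sketch: delete pods stuck in ContainerCreating for too long
# so the Job controller recreates them. Requires kubectl and jq.
set -euo pipefail

NAMESPACE="cron-jobs"   # placeholder namespace
MAX_AGE_SECONDS=600     # placeholder threshold; set well above normal startup time

kubectl get pods -n "$NAMESPACE" -o json \
  | jq -r --argjson max "$MAX_AGE_SECONDS" '
      .items[]
      # only pods that have not started yet
      | select(.status.phase == "Pending")
      # only pods whose containers are waiting in ContainerCreating
      | select([.status.containerStatuses[]?.state.waiting.reason]
               | index("ContainerCreating"))
      # only pods older than the threshold
      | select((now - (.metadata.creationTimestamp | fromdateiso8601)) > $max)
      | .metadata.name' \
  | xargs -r -n 1 kubectl delete pod -n "$NAMESPACE"
```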
Or try to debug/trace what is causing this: run `kubectl describe pods` to get more information while your pod is in ContainerCreating.
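For example (the pod name and namespace below are placeholders):

```bash
# Full status plus the Events section, which usually names the cause
kubectl describe pod <pod-name> -n <namespace>

# Or list recent events for the whole namespace, most recent last
kubectl get events -n <namespace> --sort-by=.lastTimestamp
```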