I'm having a problem where a job runs out of memory, and K8s keeps trying to run it again even though it has no chance of succeeding, since it will use the same amount of memory every time. I want K8s to simply let the job fail and sit there; I'll take care of creating a new one with a higher memory limit if desired, and/or deleting the existing failed job.
I have
```yaml
restartPolicy: Never
backoffLimit: 0
```
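For reference, here is a rough sketch of the full Job manifest I'm describing (the name, image, and memory limit below are just placeholders); note that backoffLimit sits at the Job spec level while restartPolicy sits in the pod template:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: my-oom-job              # placeholder name
spec:
  backoffLimit: 0               # Job-level: retries allowed before the Job is marked failed
  template:
    spec:
      restartPolicy: Never      # pod-level: never restart the container in place
      containers:
      - name: worker
        image: my-image:latest  # placeholder image
        resources:
          limits:
            memory: "512Mi"     # placeholder limit; this is what the job exceeds
```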
From the not-so-clear things I've read, setting backoffLimit to 1 might do the trick. But is that true? Would that make it restart once, or is the 1 the number of times it can be run, including the first attempt?
Should I switch from jobs to bare pods? The main issue with that is that I don't think K8s will reschedule the pod onto another worker node if the node it's running on goes down, and that's a situation where I would want the job to be restarted automatically on another node.
CodePudding user response:
backoffLimit should be 1, as shown below:
```yaml
backoffLimit: 1
```
CodePudding user response:
Setting backoffLimit to 0 is correct if the Job is supposed to run once and not be restarted:
> backoffLimit: Specifies the number of retries before marking this job failed.
Switching your workload to a plain Pod would only make sense if you are not interested in the restart and backoff-limit behavior that a Job provides.
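If you do go that route, a minimal sketch of a bare Pod with restartPolicy: Never might look like this (name, image, and memory limit are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-oom-pod            # placeholder name
spec:
  restartPolicy: Never        # run once; never restart the container in place
  containers:
  - name: worker
    image: my-image:latest    # placeholder image
    resources:
      limits:
        memory: "512Mi"       # placeholder memory limit
```

Keep in mind that, as you noted, a standalone Pod is not rescheduled onto another node if its node goes down, so you lose that part of the Job controller's behavior.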