Kubernetes jobs are created but not executed immediately


Creating a job like the following, for example:

apiVersion: batch/v1
kind: Job
metadata:
  name: test-job-sebas
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4

Results in the Job resource being created, but no pod or event is observed. Pod statuses are as follows:

Pods Statuses: 1 Running / 0 Succeeded / 0 Failed

The only visible event is the notification of a successful pod creation. The problem is that this message appears only after ~30 minutes of complete apparent silence.

Normal SuccessfulCreate 21m job-controller Created pod: test-job-sebas-882bh
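
For reference, this is roughly how we are observing the Job (a minimal sketch; adjust the namespace, and note that the job-name label is set on pods by the job controller):

# The "Pods Statuses" line above comes from describing the Job
kubectl describe job test-job-sebas -n test-namespace

# Watch for the pod in real time via the job-name label
kubectl get pods -n test-namespace -l job-name=test-job-sebas --watch

# List namespace events in creation order to time the SuccessfulCreate event
kubectl get events -n test-namespace --sort-by=.metadata.creationTimestamp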

From the moment the kube-apiserver log shows the "create" verb being allowed for the Job resource, we are unable to spot any other log line containing the text "test-job-sebas" in any of the other control-plane pods (controllers/scheduler/apiserver), until ~30 minutes later, when the kube-controller-manager logs the following:

Event occurred" object="test-namespace/job-test-01" kind="Job" apiVersion="batch/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: test-job-sebas-882bh"

This happens with any Job in this cluster, no matter the namespace or the nature of the Job, whether it comes from a CronJob or is created explicitly like the example provided here.

Looking at the job controller code does not reveal any obvious suspect pointing to what could be happening: https://github.com/kubernetes/kubernetes/blob/b5b0cc8bb88fb678c9b065c8da4f4c06a155a628/pkg/controller/job/job_controller.go

Edit: We currently have ~15,000 Jobs in the cluster, most of which appear to be active, all from a single namespace. This leads us to think that we are hitting some limit or causing some sort of saturation, but we can't confirm this from any of the visible data.
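
To quantify the backlog, this is one way to count it (a sketch; the jsonpath filter on .status.active selects Jobs that still report at least one running pod):

# Total number of Job objects in the busy namespace
kubectl get jobs -n test-namespace --no-headers | wc -l

# Count the Jobs that still report at least one active pod
kubectl get jobs -n test-namespace \
  -o jsonpath='{range .items[?(@.status.active>0)]}{.metadata.name}{"\n"}{end}' | wc -l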

CodePudding user response:

This sounds very similar to what I encountered when we had a misbehaving webhook.

If you have a massive number of jobs all showing as active, but no pods appearing (or pods taking a long time to appear), that is a sign of an admission webhook interfering with pod creation. If a CronJob is affected, you will get a "snowball" effect:

Writeup: https://blenderfox.com/2020/08/07/the-snowball-effect-in-kubernetes/

Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/93783

As for fixing your issue, you need to find out what is interfering with pod creation. In our case, we had a misbehaving up9 webhook; disabling it allowed the pods to be created.
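
If you are not sure which webhooks are registered in your cluster, listing them along with their failure policies and timeouts is a reasonable first step (a sketch; replace <name> with a suspect configuration):

# List all admission webhooks registered in the cluster
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations

# Inspect each webhook's failure policy and timeout, which is where a
# misbehaving webhook usually stalls pod creation
kubectl get mutatingwebhookconfigurations <name> \
  -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\t"}{.timeoutSeconds}{"\n"}{end}'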
