Unschedulable GPU workload on GKE from node pool


I am running a GPU-intensive workload on demand on GKE Standard, for which I have created the appropriate node pool with a minimum of 0 and a maximum of 5 nodes. However, when a Job is scheduled on the node pool, GKE reports the following error:

Events:
  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Warning  FailedScheduling   59s (x2 over 60s)  default-scheduler   0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.
  Normal   NotTriggerScaleUp  58s                cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate, 1 in backoff after failed scale-up

I have set up nodeSelector according to the documentation and I have autoscaling enabled. I can confirm that it does find the node pool, despite the error saying "didn't match Pod's node affinity/selector", and tries to scale up the cluster, but shortly thereafter it fails, saying 0/1 nodes are available, which seems wrong given that 0 of the 5 nodes in the pool are in use. What am I doing wrong here?
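
The exact manifest is not shown above; a minimal Job of the kind described, with a placeholder image and an assumed node pool name of gpu-pool, might look roughly like this:

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job                                    # placeholder name
spec:
  template:
    spec:
      containers:
      - name: gpu-task                             # placeholder name
        image: nvidia/cuda:12.2.0-base-ubuntu22.04 # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1                      # request one GPU
      nodeSelector:
        cloud.google.com/gke-nodepool: gpu-pool    # GKE-provided node pool label; pool name assumed
      restartPolicy: Never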

CodePudding user response:

1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate...

Try adding tolerations to your Job's pod spec:

...
spec:
  containers:
  - name: ...
    ...
  tolerations:
  - key: nvidia.com/gpu      # matches the taint key reported in the event
    operator: Equal
    value: present
    effect: NoSchedule

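On GKE, Pods that request GPUs through the nvidia.com/gpu resource normally get this toleration added automatically, so it is also worth confirming that the Job actually requests a GPU; an illustrative fragment of the container spec:

  containers:
  - name: ...
    resources:
      limits:
        nvidia.com/gpu: 1    # with a GPU request in place, GKE should tolerate the nvidia.com/gpu taint for this Pod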

CodePudding user response:

For the node(s) didn't match Pod's node affinity/selector message: you didn't share the manifest file's details, but suppose it has the lines:

nodeSelector:
  nodePool: cluster

One option is to delete those lines from the YAML file. Another option is to add nodePool: cluster as a label to the nodes, so that the Pod can be scheduled using that selector. The following command can be useful for you:

kubectl label nodes <your node name> nodePool=cluster
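
If you go the labelling route, the Pod's nodeSelector must reference exactly the same key and value. On GKE you can also select on the node pool label that GKE already applies to every node, for example (the pool name gpu-pool is an assumption):

nodeSelector:
  cloud.google.com/gke-nodepool: gpu-pool   # replace gpu-pool with your node pool's name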

Regarding the 1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate message, you can follow what @gohm'c suggested, or you can also use the following command to remove the taint from the node; that way you should be able to schedule your pod on that node:

kubectl taint nodes <your node name> nvidia.com/gpu-
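
Keep in mind that GKE applies the nvidia.com/gpu taint to GPU node pools automatically, so newly created or recreated nodes will pick it up again, which makes the toleration approach the more durable fix. To double-check which taints are actually set on a node before removing anything (the node name is a placeholder):

kubectl describe node <your node name> | grep -i taints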

You can use the following threads as reference, as they contain information from real cases: Error : FailedScheduling : nodes didn't match node selector and Node had taints that the pod didn't tolerate error.
