I am running a GPU-intensive workload on demand on GKE Standard, where I have created the appropriate node pool with a minimum of 0 and a maximum of 5 nodes. However, when a Job is scheduled on the node pool, GKE reports the following error:
Events:
  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Warning  FailedScheduling   59s (x2 over 60s)  default-scheduler   0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.
  Normal   NotTriggerScaleUp  58s                cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate, 1 in backoff after failed scale-up
I have set up nodeSelector according to the documentation and I have autoscaling enabled. In spite of the error saying "didn't match Pod's node affinity/selector", I can confirm it does find the node pool and tries to scale up the cluster. But then it fails shortly thereafter saying 0/1 nodes are available, which seems completely wrong, seeing that 0 of the 5 possible nodes in the node pool are in use. What am I doing wrong here?
CodePudding user response:
1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate...
Try adding tolerations to your Job's pod spec:
...
spec:
  containers:
  - name: ...
    ...
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
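(If you prefer operator: Exists, leave the value field out entirely; the API server rejects a toleration that sets a value together with Exists.) For context, a full Job manifest combining the GPU resource request, a nodeSelector and the toleration could look roughly like the sketch below. The accelerator label value, names and image are placeholders, not details from the question, so adjust them to your own node pool:

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job
spec:
  template:
    spec:
      restartPolicy: Never
      # assumes a T4 pool; use the GPU type of your own node pool
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      # matches the nvidia.com/gpu=present:NoSchedule taint GKE puts on GPU nodes
      tolerations:
      - key: nvidia.com/gpu
        operator: Equal
        value: present
        effect: NoSchedule
      containers:
      - name: gpu-task
        image: nvidia/cuda:11.0.3-base-ubuntu20.04   # placeholder image
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1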
CodePudding user response:
For the node(s) didn't match Pod's node affinity/selector message: you don't share the details of your manifest file, but suppose it contains lines like these:

nodeSelector:
  nodePool: cluster

One option is to delete those lines from the YAML file. Another option is to add nodePool: cluster as a label to all the nodes, so the pod can then be scheduled using that selector. The following command can be useful:
kubectl label nodes <your node name> nodePool=cluster
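After labeling, you can double-check that the label is on the node and that the selector would match, for example:

kubectl get nodes -l nodePool=cluster

Note that on GKE, GPU node pools also carry built-in labels such as cloud.google.com/gke-accelerator, so you can select on those instead of maintaining a custom label.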
Regarding the 1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate message, you can follow what @gohm'c suggested, or you can also remove the taint from the master node with the following command, so that you are able to schedule your pod on that node:

kubectl taint nodes <your node name> node-role.kubernetes.io/master-
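If you go that route, it can help to first check which taints are actually present on the node, for example:

kubectl describe node <your node name> | grep -i taints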
You can use the following threads as a reference, since they contain information from real cases: Error : FailedScheduling : nodes didn't match node selector and Node had taints that the pod didn't tolerate error.