I have a GKE cluster v1.19 with a deployment that can run on GPU or CPU-only nodes.
I have two node pools, both preemptible, so nodes can become unavailable:
- GPU
- CPU only
I want to use the GPU node pool as long as GPU nodes are available. If no GPU nodes are available, I want those pods to be assigned to a CPU-only node.
My current yaml for the deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: NAME
  namespace: NAMESPACE
spec:
  selector:
    matchLabels:
      app: NAME
  template:
    metadata:
      labels:
        app: NAME
    spec:
      nodeSelector:
        cloud.google.com/gke-preemptible: "true"
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: In
                values:
                - nvidia-tesla-t4
      containers:
      - name: NAME
        image: IMAGE
        resources:
          requests:
            memory: 28.0Gi
            cpu: 3000m
          limits:
            cpu: 4000m
            nvidia.com/gpu: 1
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
The above YAML allows the pod to be assigned to a GPU node but not to a CPU-only node. Correct me if I am wrong: I have to set limits: nvidia.com/gpu: 1 in order to use the GPU, but that requires the node to expose nvidia.com/gpu, so the pod can't be assigned to a CPU-only node.
How can I achieve such behavior?
CodePudding user response:
I'm assuming that your question is about setting up the GPU before using the limits field. Kindly check the steps and guides below for Kubernetes with GPUs.
Creating a new zonal or regional cluster with GPUs
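As a rough sketch (CLUSTER_NAME, the zone, the GPU type, and the node count are placeholders to replace with your own values):
gcloud container clusters create CLUSTER_NAME \
    --zone COMPUTE_ZONE \
    --accelerator type=nvidia-tesla-t4,count=1 \
    --num-nodes 1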
Installing NVIDIA GPU device drivers
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
Using node auto-provisioning with GPUs
gcloud container clusters update CLUSTER_NAME --enable-autoprovisioning \
    --autoprovisioning-scopes=https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring,https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/compute
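If auto-provisioning should also be allowed to create GPU nodes, accelerator resource limits can be set as well; a sketch with placeholder limits (the CPU, memory, and GPU counts below are example values, not recommendations):
gcloud container clusters update CLUSTER_NAME \
    --enable-autoprovisioning \
    --min-cpu 1 --max-cpu 64 \
    --min-memory 1 --max-memory 256 \
    --min-accelerator type=nvidia-tesla-t4,count=0 \
    --max-accelerator type=nvidia-tesla-t4,count=4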
To learn more about auto-provisioning, see the auto-provisioning page.
You can refer to this link for more guides related to GKE GPUs.
CodePudding user response:
You won't be able to do this. GKE automatically taints GPU nodes with nvidia.com/gpu and the NoSchedule effect, so without the toleration a pod cannot be scheduled on a GPU node. (Note that GKE automatically adds the toleration to pods that have a limit set for nvidia.com/gpu.)
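For reference, the taint GKE documents for GPU nodes and the toleration it injects into pods that request nvidia.com/gpu look roughly like this (a sketch; the value present matches the GKE docs, but verify on your own nodes with kubectl describe node):
# Taint automatically applied to GPU nodes by GKE
taints:
- key: nvidia.com/gpu
  value: present
  effect: NoSchedule
# Toleration automatically added to pods that set a limit for nvidia.com/gpu
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule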
Your best bet here is to either enable the cluster autoscaler or enable node auto-provisioning. At least this way, new nodes will be added to the cluster as needed (and available). Of course, the number of nodes that can be added will depend on your GPU quota.
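As a rough sketch (assuming an existing GPU node pool named gpu-pool; the node counts are placeholders), enabling the cluster autoscaler on that pool would look something like:
gcloud container clusters update CLUSTER_NAME \
    --enable-autoscaling \
    --node-pool gpu-pool \
    --min-nodes 0 \
    --max-nodes 3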