I have a GKE cluster v1.19 with a deployment that can run on GPU or CPU-only nodes.
I have two node pools, both preemptible, so nodes can become unavailable:
- GPU
- CPU only
I want to use the GPU node pool as long as GPU nodes are available. If no GPU nodes are available, I want those pods to be assigned to a CPU-only node.
My current yaml for the deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: NAME
  namespace: NAMESPACE
spec:
  selector:
    matchLabels:
      app: NAME
  template:
    metadata:
      labels:
        app: NAME
    spec:
      nodeSelector:
        cloud.google.com/gke-preemptible: "true"
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: In
                values:
                - nvidia-tesla-t4
      containers:
      - name: NAME
        image: IMAGE
        resources:
          requests:
            memory: 28.0Gi
            cpu: 3000m
          limits:
            cpu: 4000m
            nvidia.com/gpu: 1
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
The above YAML allows the pod to be assigned to a GPU node but not to a CPU-only node. Correct me if I am wrong: I have to set limits: nvidia.com/gpu: 1 in order to use the GPU, but that requires the node to expose nvidia.com/gpu, so the pod can't be assigned to a CPU-only node.
How can I achieve such behavior?
CodePudding user response:
I'm assuming that your question is about setting up the GPU before using the limits field. Kindly check the steps and guides below for Kubernetes with GPUs.
Creating a new zonal or regional cluster with GPUs
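As a rough sketch (CLUSTER_NAME, the zone, the GPU type, and the node count are placeholders to replace with your own values):
gcloud container clusters create CLUSTER_NAME \
    --zone COMPUTE_ZONE \
    --accelerator type=nvidia-tesla-t4,count=1 \
    --num-nodes 1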
Installing NVIDIA GPU device drivers
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
Using node auto-provisioning with GPUs
gcloud container clusters update CLUSTER_NAME --enable-autoprovisioning \
    --autoprovisioning-scopes=https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring,https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/compute
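If auto-provisioning should also be allowed to create GPU nodes, accelerator resource limits can be set as well; a sketch with placeholder limits (the CPU, memory, and GPU counts below are example values, not recommendations):
gcloud container clusters update CLUSTER_NAME \
    --enable-autoprovisioning \
    --min-cpu 1 --max-cpu 64 \
    --min-memory 1 --max-memory 256 \
    --min-accelerator type=nvidia-tesla-t4,count=0 \
    --max-accelerator type=nvidia-tesla-t4,count=4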
To learn more about auto-provisioning, see the auto-provisioning page.
You can refer to this link for more guides related to GKE GPUs.
CodePudding user response:
You won't be able to do this. GKE automatically taints GPU nodes with nvidia.com/gpu and the NoSchedule effect, so without the toleration a pod cannot be scheduled on a GPU node. (Note that GKE automatically adds the toleration to pods that have a limit set for nvidia.com/gpu.)
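For reference, the taint GKE documents for GPU nodes and the toleration it injects into pods that request nvidia.com/gpu look roughly like this (a sketch; the value present matches the GKE docs, but verify on your own nodes with kubectl describe node):
# Taint automatically applied to GPU nodes by GKE
taints:
- key: nvidia.com/gpu
  value: present
  effect: NoSchedule
# Toleration automatically added to pods that set a limit for nvidia.com/gpu
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule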
Your best bet here is to either enable the cluster autoscaler or enable node auto-provisioning. At least this way, new nodes will be added to the cluster as needed (and available). Of course, the number of nodes that can be added will depend on your GPU quota.
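As a rough sketch (assuming an existing GPU node pool named gpu-pool; the node counts are placeholders), enabling the cluster autoscaler on that pool would look something like:
gcloud container clusters update CLUSTER_NAME \
    --enable-autoscaling \
    --node-pool gpu-pool \
    --min-nodes 0 \
    --max-nodes 3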