GKE GPU Timesharing Driver Capabilities


I'm running `nvidia/cuda:11.8.0-base-ubuntu20.04` on Google Kubernetes Engine using GPU time-sharing on T4 GPUs.

Checking the driver capabilities, I get `compute` and `utility`. I was hoping to also get `graphics` and `video`. Is this a limitation of time-sharing on GKE?

CodePudding user response:

Time-sharing should let you use the GPU for graphics and video as well; however, time-sharing GPUs are best suited to workloads that don't use a large share of the GPU's resources all the time.
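The `compute,utility` pair you're seeing usually comes from the image itself: the `nvidia/cuda` base images set `NVIDIA_DRIVER_CAPABILITIES=compute,utility` by default. Below is a minimal sketch of overriding that variable in a Pod spec, assuming the node's runtime honors the NVIDIA container toolkit's capability variable (the Pod name is hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-capabilities-test  # hypothetical name
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu20.04
    command: ["sleep", "infinity"]
    env:
    # The CUDA base images default this to "compute,utility".
    # Adding "graphics" and "video" asks the runtime to also mount
    # the graphics and video encode/decode driver libraries, where
    # the NVIDIA container toolkit is in use.
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "compute,utility,graphics,video"
    resources:
      limits:
        nvidia.com/gpu: 1
```

Whether the extra capabilities are actually exposed depends on the node image and device plugin rather than on time-sharing itself.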

Limitations of using time-sharing GPUs on GKE

  • GKE enforces memory (address space) isolation, performance isolation, and fault isolation between containers that share a physical GPU. However, memory limits aren't enforced on time-shared GPUs. To avoid running into out-of-memory (OOM) issues, set GPU memory limits in your applications. To avoid security issues, only deploy workloads that are in the same trust boundary to time-shared GPUs.
  • GKE might reject certain time-shared GPU requests to prevent unexpected behavior during capacity allocation.
  • The maximum number of containers that can share a single physical GPU is 48. When planning your time-sharing configuration, consider the resource needs of your workloads and the capacity of the underlying physical GPUs to optimize performance and responsiveness. A sketch of a Pod that requests a time-shared GPU follows this list.
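For reference, here is a minimal sketch of a Pod targeting a time-shared T4 node pool; the nodeSelector labels follow the GKE time-sharing documentation, and the Pod name is hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: timeshared-gpu-pod  # hypothetical name
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
    # Land on a node pool created with time-sharing enabled.
    cloud.google.com/gke-gpu-sharing-strategy: time-sharing
    cloud.google.com/gke-max-shared-clients-per-gpu: "2"
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu20.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        # Each container still requests one GPU; up to the configured
        # number of clients (48 at most) share the physical device.
        nvidia.com/gpu: 1
```

Note that the GPU memory limit mentioned above has to be enforced inside your application; Kubernetes resource limits don't cap GPU memory on time-shared nodes.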

CodePudding user response:

As per this official document on GKE, you can create node pools equipped with NVIDIA K80, P100, P4, V100, T4, and A100 GPUs. GPUs provide the compute power to drive deep-learning tasks such as image recognition and natural language processing, as well as other compute-intensive tasks such as video transcoding and image processing.

This means you can utilize the compute, utility, graphics, and video capabilities.

Limitations:

1) GKE might reject certain time-shared GPU requests to prevent unexpected behavior during capacity allocation. For details, see Request limits for time-shared GPUs.

2) You cannot add GPUs to existing node pools.

3) Using multi-instance GPU partitions with GKE is not recommended for untrusted workloads.

4) GPU nodes cannot be live-migrated during a maintenance event.

Refer to Time-sharing GPUs on GKE and GPUs with multiple workloads for more details.
