My machine has 4 GPUs, and at the very beginning of my code I set:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
Through the nvidia-smi command I can see that GPU 1 is actually used. However, the TensorFlow log on the terminal shows that GPU 0 is used:
2021-09-24 02:27:55.691073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:00:0d.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-09-24 02:27:55.691123: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-09-24 02:27:55.694585: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-09-24 02:27:55.698234: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-09-24 02:27:55.698776: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-09-24 02:27:55.702390: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-09-24 02:27:55.703656: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-09-24 02:27:55.709853: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-09-24 02:27:55.710078: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-24 02:27:55.711069: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-24 02:27:55.711917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
...
2021-09-24 02:27:55.906440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-09-24 02:27:55.906571: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-09-24 02:27:57.342555: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-09-24 02:27:57.342608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2021-09-24 02:27:57.342619: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2021-09-24 02:27:57.342980: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-24 02:27:57.343982: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-24 02:27:57.344891: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14419 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:0d.0, compute capability: 7.0)
I have two questions:
GPU 0 is indeed in use, but by another process. My code is using GPU 1. Why is the log above inconsistent with the device actually used?
Also, TensorFlow 2 should automatically detect the available GPUs and use them. If I don't add this line:
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
the log shows that it is trying to use GPU 0 and produces an out-of-memory error.
CodePudding user response:
The CUDA_VISIBLE_DEVICES environment variable remaps whichever devices you select so that, with respect to your CUDA process, those devices (in your list) appear to CUDA as if they started at zero. So when you do:
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
CUDA thereafter sees that device as if it were device 0.
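The renumbering can be illustrated with a small pure-Python sketch. The helper name logical_to_physical is hypothetical (it is not part of CUDA or TensorFlow), and it assumes the variable holds only comma-separated integer indices, not GPU UUIDs:

```python
def logical_to_physical(visible_devices: str) -> dict:
    """Map the device index the process sees to the index that was listed.

    Hypothetical helper for illustration only. Assumes `visible_devices`
    contains comma-separated integer indices (GPU UUIDs are not handled).
    """
    listed = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    # CUDA renumbers the listed devices from zero, in the order they are listed.
    return {logical: physical for logical, physical in enumerate(listed)}


# With "1", the only visible GPU (listed index 1) is reported as device 0.
print(logical_to_physical("1"))
# With "2,0", listed GPU 2 becomes device 0 and listed GPU 0 becomes device 1.
print(logical_to_physical("2,0"))
```

This is why the TensorFlow log reports "device 0" even though nvidia-smi shows activity on GPU 1: the log uses the numbering seen from inside the process.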
The fact that a GPU is in use by another process or user does not mean that it is "not available" for you to use. CUDA doesn't prevent two users or two processes from trying to use the same GPU, and in some cases that scenario is sensible and effective. So TF sees GPU 0 as a usable device, attempts to use it, and runs out of memory. That is one typical reason why people use the CUDA_VISIBLE_DEVICES environment variable described above: it makes only certain devices "visible" or "usable" to your TF process.
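One way to avoid landing on a busy GPU is to choose the device with the most free memory before TensorFlow is imported. The following is only a sketch under stated assumptions: pick_freest_gpu is a hypothetical helper, and the sample text is made up to resemble the output of `nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits`, not taken from the machine in the question. Setting CUDA_DEVICE_ORDER to PCI_BUS_ID is intended to keep CUDA's numbering aligned with the numbering nvidia-smi prints.

```python
import os


def pick_freest_gpu(csv_text: str) -> int:
    """Return the index of the GPU with the most free memory (MiB).

    Hypothetical helper: expects lines of the form "index, free_mib".
    Ties are resolved in favour of the first GPU listed.
    """
    best_index, best_free = -1, -1
    for line in csv_text.strip().splitlines():
        index_str, free_str = line.split(",")
        free = int(free_str)
        if free > best_free:
            best_index, best_free = int(index_str), free
    return best_index


# Made-up sample: GPU 0 is nearly full, the others are idle.
sample = "0, 312\n1, 16130\n2, 16130\n3, 16130\n"
chosen = pick_freest_gpu(sample)

# Both variables must be set before TensorFlow (or any CUDA library) is imported.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = str(chosen)
```

In a real script the sample string would be replaced by the actual nvidia-smi output, for example captured with subprocess.run, and the TensorFlow import would come after these lines.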