I'm using tensorflow.keras to train a 3D CNN. TensorFlow can detect my GPU. When I run the following code:
import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))
print(tf.config.list_logical_devices('GPU'))
I get the following output:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[LogicalDevice(name='/device:GPU:0', device_type='GPU')]
However, when I train my model, I can clearly see in the Windows Task Manager that the GPU is not utilized at all.
Here is the code to build and train the model:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Conv3D, BatchNormalization, Activation, MaxPool3D, Flatten, Dropout, Dense
)
from tensorflow.keras.callbacks import ModelCheckpoint

input_shape = train_gen[0][0][0].shape
model = Sequential()
# 1
model.add(Conv3D(8, kernel_size=(3, 3, 3), padding='same', input_shape=input_shape))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPool3D((2, 2, 1), strides=(2, 2, 1), padding='same'))
# 2
model.add(Conv3D(16, kernel_size=(3, 3, 3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same'))
# 3
model.add(Conv3D(32, kernel_size=(3, 3, 3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same'))
# 4
model.add(Conv3D(64, kernel_size=(3, 3, 3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same'))
# 5
model.add(Conv3D(128, kernel_size=(3, 3, 3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same'))
# final
model.add(Flatten())
model.add(Dropout(0.5))
model.add(Dense(512))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print('input_shape =', input_shape)
model.summary()
checkpoint = ModelCheckpoint(
    'saved-models/3d-cnn/best', monitor='val_loss', mode='min',
    save_weights_only=True, save_best_only=True, verbose=1
)
history = model.fit(
    train_gen,
    validation_data=val_gen,
    epochs=20,
    callbacks=[checkpoint]
)
Both train_gen and val_gen (which are passed to the fit method) are instances of a CustomDataGenerator class that inherits from tf.keras.utils.Sequence and generates batches of data by reading images from disk and storing them in memory as a numpy array.
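For reference, the generator roughly follows the standard Sequence pattern; here is a simplified sketch (not the actual implementation, the volume shape and loading step are just placeholders):
import numpy as np
import tensorflow as tf

class CustomDataGenerator(tf.keras.utils.Sequence):
    def __init__(self, file_paths, labels, batch_size):
        self.file_paths = file_paths
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.file_paths) / self.batch_size))

    def __getitem__(self, idx):
        # read one batch of 3D images from disk and return it as numpy arrays
        batch_paths = self.file_paths[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_labels = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]
        x = np.stack([self._load_volume(p) for p in batch_paths])
        y = np.array(batch_labels)
        return x, y

    def _load_volume(self, path):
        # placeholder for the actual disk read and preprocessing
        return np.zeros((64, 64, 16, 1), dtype='float32')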
How can I make my model use the GPU during training?
Edit:
When I compile my model, the following output is shown in the terminal:
2022-02-21 16:38:59.667337: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-02-21 16:39:00.087775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3989 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5
And I notice that I have 4.1 GB of my GPU memory allocated.
When I call model.fit, two additional lines are shown in the terminal:
2022-02-21 16:42:25.775427: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2022-02-21 16:42:27.101558: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8100
And I have 4.6 GB of my GPU memory allocated. The GPU utilization increases immediately to 100% for about 1 second and then goes down and remains at 0% for the entire training process.
Edit 2:
I entered the following command in my terminal during training: nvidia-smi -l 5
and got output similar to the following (refreshed every 5 seconds):
Mon Feb 21 17:11:17 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 496.13       Driver Version: 496.13       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  WDDM | 00000000:01:00.0 Off |                  N/A |
| N/A   52C    P0    68W /  N/A |   4892MiB /  6144MiB |     70%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory  |
|        ID   ID                                                   Usage       |
|==============================================================================|
|    0   N/A  N/A      12176      C   ...bdul\anaconda3\python.exe     N/A     |
+-----------------------------------------------------------------------------+
The value under GPU-Util kept fluctuating between 70% and 85% during training. Does this indicate that my GPU is utilized?
CodePudding user response:
Easy quick check whether the GPU is being used:
Run your script with CUDA_VISIBLE_DEVICES="-1" ./your_code.py, or put import os; os.environ['CUDA_VISIBLE_DEVICES'] = '-1' in the code.
If you then see a significant change in nvidia-smi and/or in the speed/duration of training, you were using the GPU in the first place (i.e. with CUDA_VISIBLE_DEVICES="0", or "0,1,2" in a multi-GPU setting).
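For example, a minimal sketch of the in-code variant (the expected output is an assumption about what you should see when the GPU is hidden):
# Put this at the very top of the training script, before tensorflow is imported,
# so the setting takes effect before TensorFlow initializes its devices.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import tensorflow as tf

# With the GPU hidden this should print an empty list, and training falls back to the CPU.
print(tf.config.list_physical_devices('GPU'))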
Short checklist:
- Make sure you are importing and using tf.keras.
- Make sure you have installed tensorflow-gpu.
- Watch GPU utilization with watch -n 1 nvidia-smi while .fit is running.
- Check the version compatibility table. This is important.
- Ignore the CUDA version shown in nvidia-smi, as that is the version your driver came with. The installed CUDA version is shown by nvcc -V.
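As an extra check (not part of the list above, just a suggestion), you can ask TensorFlow to log device placement, so you can see directly whether ops run on the GPU:
import tensorflow as tf

# Enable this before building the model; TensorFlow then logs a line for each op it executes,
# e.g. "Executing op ... in device /job:localhost/replica:0/task:0/device:GPU:0" for ops placed on the GPU.
tf.debugging.set_log_device_placement(True)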
In your case:
The model is getting loaded onto the GPU, so that is not the cause of your utilization issue.
It is possible that your train_gen and val_gen take a long time or are buggy. Try running without any specific augmentation to make sure the problem is not related to *_gen.
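One way to test this (a sketch only; the volume shape and sample count are made-up placeholders, use the input_shape your generator actually produces): build the model exactly as above, then fit it on a small random in-memory dataset and watch nvidia-smi. If GPU utilization stays high, the data pipeline is the bottleneck rather than the GPU setup.
import numpy as np
from tensorflow.keras.utils import to_categorical

# Random in-memory data; (64, 64, 16, 1) is only a placeholder for the real input_shape.
x_dummy = np.random.rand(32, 64, 64, 16, 1).astype('float32')
y_dummy = to_categorical(np.random.randint(0, 2, size=32), num_classes=2)

# If this keeps the GPU busy, the slowdown comes from train_gen / val_gen, not from the GPU setup.
model.fit(x_dummy, y_dummy, batch_size=4, epochs=3)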