I'm using tensorflow.keras to train a 3D CNN. TensorFlow can detect my GPU. When I run the following code:
import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))
print(tf.config.list_logical_devices('GPU'))
I get the following output:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[LogicalDevice(name='/device:GPU:0', device_type='GPU')]
However, when I train my model, I can clearly see in the Windows Task Manager that the GPU is not utilized at all.
Here is the code to build and train the model:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Conv3D, BatchNormalization, Activation, MaxPool3D, Flatten, Dropout, Dense
)
from tensorflow.keras.callbacks import ModelCheckpoint

input_shape = train_gen[0][0][0].shape
model = Sequential()
# 1
model.add(Conv3D(8, kernel_size=(3, 3, 3), padding='same', input_shape=input_shape))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPool3D((2, 2, 1), strides=(2, 2, 1), padding='same'))
# 2
model.add(Conv3D(16, kernel_size=(3, 3, 3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same'))
# 3
model.add(Conv3D(32, kernel_size=(3, 3, 3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same'))
# 4
model.add(Conv3D(64, kernel_size=(3, 3, 3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same'))
# 5
model.add(Conv3D(128, kernel_size=(3, 3, 3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same'))
# final
model.add(Flatten())
model.add(Dropout(0.5))
model.add(Dense(512))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print('input_shape =', input_shape)
model.summary()
checkpoint = ModelCheckpoint(
    'saved-models/3d-cnn/best', monitor='val_loss', mode='min',
    save_weights_only=True, save_best_only=True, verbose=1
)
history = model.fit(
    train_gen,
    validation_data=val_gen,
    epochs=20,
    callbacks=[checkpoint]
)
Both train_gen and val_gen (which are passed to the fit method) are instances of a CustomDataGenerator class that inherits from tf.keras.utils.Sequence and generates batches of data by reading images from disk and storing them in memory as a numpy array.
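For reference, the generator roughly follows the standard Sequence pattern; here is a simplified sketch (not the actual implementation, the volume shape and loading step are just placeholders):
import numpy as np
import tensorflow as tf

class CustomDataGenerator(tf.keras.utils.Sequence):
    def __init__(self, file_paths, labels, batch_size):
        self.file_paths = file_paths
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.file_paths) / self.batch_size))

    def __getitem__(self, idx):
        # read one batch of 3D images from disk and return it as numpy arrays
        batch_paths = self.file_paths[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_labels = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]
        x = np.stack([self._load_volume(p) for p in batch_paths])
        y = np.array(batch_labels)
        return x, y

    def _load_volume(self, path):
        # placeholder for the actual disk read and preprocessing
        return np.zeros((64, 64, 16, 1), dtype='float32')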
How can I make my model use the GPU during training?
Edit:
When I compile my model, the following output is shown in the terminal:
2022-02-21 16:38:59.667337: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-02-21 16:39:00.087775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3989 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5
And I notice that I have 4.1 GB of my GPU memory allocated.
When I call model.fit, two additional lines are shown in the terminal:
2022-02-21 16:42:25.775427: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2022-02-21 16:42:27.101558: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8100
And I have 4.6 GB of my GPU memory allocated. The GPU utilization increases immediately to 100% for about 1 second and then goes down and remains at 0% for the entire training process.
Edit 2:
I entered the following command in my terminal during training: nvidia-smi -l 5
and got output similar to the following (refreshed every 5 seconds):
Mon Feb 21 17:11:17 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 496.13       Driver Version: 496.13       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  WDDM | 00000000:01:00.0 Off |                  N/A |
| N/A   52C    P0    68W /  N/A |   4892MiB /  6144MiB |     70%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory  |
|        ID   ID                                                   Usage       |
|==============================================================================|
|    0   N/A  N/A      12176      C   ...bdul\anaconda3\python.exe     N/A     |
+-----------------------------------------------------------------------------+
The value under GPU-Util kept fluctuating between 70% and 85% during training. Does this indicate that my GPU is utilized?
CodePudding user response:
Easy quick check whether the GPU is being used:
Run your script with CUDA_VISIBLE_DEVICES="-1" ./your_code.py, or put import os; os.environ['CUDA_VISIBLE_DEVICES'] = '-1' in the code.
If you then see a significant change in nvidia-smi and/or in the speed/duration of training, you were using the GPU in the first place (i.e. with CUDA_VISIBLE_DEVICES="0", or "0,1,2" in a multi-GPU setting).
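For example, a minimal sketch of the in-code variant (the expected output is an assumption about what you should see when the GPU is hidden):
# Put this at the very top of the training script, before tensorflow is imported,
# so the setting takes effect before TensorFlow initializes its devices.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import tensorflow as tf

# With the GPU hidden this should print an empty list, and training falls back to the CPU.
print(tf.config.list_physical_devices('GPU'))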
Short checklist:
- Make sure you are importing and using tf.keras.
- Make sure you have installed tensorflow-gpu.
- Watch GPU utilization with watch -n 1 nvidia-smi while .fit is running.
- Check the version compatibility table. This is important.
- Ignore the CUDA version shown in nvidia-smi, as that is the version your driver came with. The installed CUDA version is shown by nvcc -V.
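As an extra check (not part of the list above, just a suggestion), you can ask TensorFlow to log device placement, so you can see directly whether ops run on the GPU:
import tensorflow as tf

# Enable this before building the model; TensorFlow then logs a line for each op it executes,
# e.g. "Executing op ... in device /job:localhost/replica:0/task:0/device:GPU:0" for ops placed on the GPU.
tf.debugging.set_log_device_placement(True)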
In your case:
The model is getting loaded onto the GPU, so that is not the cause of your utilization issue.
It is possible that your train_gen and val_gen take a long time or are buggy. Try running without any specific augmentation to make sure the problem is not related to *_gen.
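One way to test this (a sketch only; the volume shape and sample count are made-up placeholders, use the input_shape your generator actually produces): build the model exactly as above, then fit it on a small random in-memory dataset and watch nvidia-smi. If GPU utilization stays high, the data pipeline is the bottleneck rather than the GPU setup.
import numpy as np
from tensorflow.keras.utils import to_categorical

# Random in-memory data; (64, 64, 16, 1) is only a placeholder for the real input_shape.
x_dummy = np.random.rand(32, 64, 64, 16, 1).astype('float32')
y_dummy = to_categorical(np.random.randint(0, 2, size=32), num_classes=2)

# If this keeps the GPU busy, the slowdown comes from train_gen / val_gen, not from the GPU setup.
model.fit(x_dummy, y_dummy, batch_size=4, epochs=3)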