TensorFlow not using GPU, due to a version issue I believe

I am very new to the world of deep learning and I am struggling to train a model on my GPU.

I'm following a Udemy course which more or less goes through the following process: https://github.com/abdelrahman-gaber/tf2-object-detection-api-tutorial

At the moment, I do the following.

$ conda create -n tf2_gpu tensorflow-gpu==2.4.1 cudatoolkit==10.1.243
$ conda activate tf2_gpu

From here, I run a Python script:

import tensorflow as tf

print(f'TensorFlow version is {tf.__version__}')
print(tf.config.list_physical_devices('GPU'))

Which outputs:

2021-10-28 10:20:13.443211: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
TensorFlow version is 2.4.1
2021-10-28 10:20:14.109708: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-10-28 10:20:14.110159: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-10-28 10:20:14.148679: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-28 10:20:14.148880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:26:00.0 name: NVIDIA GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.77GHz coreCount: 40 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2021-10-28 10:20:14.148893: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-10-28 10:20:14.149875: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-10-28 10:20:14.149893: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2021-10-28 10:20:14.151002: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-10-28 10:20:14.151178: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-10-28 10:20:14.152293: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-10-28 10:20:14.153391: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-10-28 10:20:14.156860: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-10-28 10:20:14.156975: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-28 10:20:14.157223: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-28 10:20:14.157371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Yay, I have my GPU at index 0.
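
Listing devices only shows the end result. As a rough sketch, assuming tf.sysconfig.get_build_info() is available in the installed TensorFlow (the code falls back to empty values if it is not), the following also reports which CUDA and cuDNN versions the installed build expects. The function name describe_build is made up for illustration.

```python
def describe_build():
    """Summarise the installed TensorFlow build, or return None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None

    # get_build_info() is assumed to exist; older releases may not have it.
    get_info = getattr(tf.sysconfig, "get_build_info", None)
    info = get_info() if get_info else {}

    return {
        "tensorflow": tf.__version__,
        # Versions the build was compiled against, not what is installed on the system.
        "cuda": info.get("cuda_version"),
        "cudnn": info.get("cudnn_version"),
        "gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    print(describe_build())
```

Running this once before and once after the Object Detection API install makes it easier to see whether the expected CUDA and cuDNN versions changed along with the TensorFlow version.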

I then follow the example:

# remember to activate your python environment first
cd models/research
# compile protos:
protoc object_detection/protos/*.proto --python_out=.
# Install TensorFlow Object Detection API as a python package:
cp object_detection/packages/tf2/setup.py .
python -m pip install .

The setup.py script is: https://github.com/tensorflow/models/blob/master/research/object_detection/packages/tf2/setup.py

At this point, my TensorFlow has been upgraded to 2.6.0, and the Python script above now outputs:

TensorFlow version is 2.6.0
2021-10-28 10:31:56.869027: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-28 10:31:56.873968: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2021-10-28 10:31:56.873996: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]
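
The warning lines in that log name the library that failed to load. A minimal sketch for pulling those names out of a captured log, assuming the warning format shown above (the helper name missing_cuda_libs is made up for illustration):

```python
import re

# Matches the dso_loader warning format seen in the log above.
_MISSING = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_cuda_libs(log_text):
    """Return the library names reported as not loadable, in order, without duplicates."""
    seen = []
    for name in _MISSING.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    sample = (
        "2021-10-28 10:31:56.873968: W tensorflow/stream_executor/platform/default/"
        "dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; "
        "dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory"
    )
    print(missing_cuda_libs(sample))  # prints ['libcudnn.so.8']
```

Here the only missing library is libcudnn.so.8, whereas the 2.4.1 run earlier loaded libcudnn.so.7, which suggests the environment still holds the cuDNN 7 that came with the original install.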

I don't know how my TensorFlow version is changing, nor why the upgraded version then doesn't work with my GPU. I can only assume it's caused by the required libraries in the setup.py script, but I am not seeing it.
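
One way to narrow this down is to ask the environment which installed packages declare a tensorflow requirement. The sketch below assumes Python 3.8 or newer (for importlib.metadata); the function names are hypothetical, and it only reports what the installed metadata says, not which package actually triggered the upgrade.

```python
import re
from importlib import metadata  # standard library from Python 3.8


def tensorflow_requirements(requirement_strings):
    """Keep only requirement strings whose package name starts with 'tensorflow'."""
    hits = []
    for req in requirement_strings or []:
        # The package name is everything before the first version/marker character.
        name = re.split(r"[\s;<>=!~\[(]", req.strip(), maxsplit=1)[0]
        if name.lower().startswith("tensorflow"):
            hits.append(req)
    return hits


def who_requires_tensorflow():
    """Map each installed distribution to the tensorflow-related requirements it declares."""
    result = {}
    for dist in metadata.distributions():
        hits = tensorflow_requirements(dist.requires)
        if hits:
            meta = dist.metadata
            name = meta.get("Name", "unknown") if meta is not None else "unknown"
            result[name] = hits
    return result


if __name__ == "__main__":
    for name, reqs in sorted(who_requires_tensorflow().items()):
        print(name, "->", reqs)
```

Any package listed with a requirement newer than 2.4.1 is a candidate for what made pip replace the conda-installed TensorFlow.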

The following has no effect from what I can see:

# 'tf-models-official>=2.5.1',
'tf-models-official==2.4.0',

Some other information:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.5 LTS
Release:    18.04
Codename:   bionic

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Sep_13_19:13:29_PDT_2021
Cuda compilation tools, release 11.5, V11.5.50
Build cuda_11.5.r11.5/compiler.30411180_0

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:26:00.0  On |                  N/A |
|  0%   43C    P8     5W / 215W |     19MiB /  7974MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       965      G   /usr/lib/xorg/Xorg                 14MiB |
|    0   N/A  N/A      1766      G   ...mviewer/tv_bin/TeamViewer        2MiB |
+-----------------------------------------------------------------------------+

Originally I had CUDA 9.0, but I upgraded to 11.5. Output of ls -la /usr/local/:

lrwxrwxrwx  1 root root   18 Oct 27 18:03 cuda -> /usr/local/cuda-11
lrwxrwxrwx  1 root root   25 Oct 27 17:55 cuda-11 -> /etc/alternatives/cuda-11
drwxr-xr-x 16 root root 4096 Oct 27 17:55 cuda-11.5

The only other clue I have is a pretty big one: the install guide says TensorFlow supports CUDA® 11.2 (TensorFlow >= 2.5.0), but neither 2.6.0 nor 2.5.0 detects my GPU.

Software requirements

The following NVIDIA® software must be installed on your system:

NVIDIA® GPU drivers —CUDA® 11.2 requires 450.80.02 or higher.
CUDA® Toolkit —TensorFlow supports CUDA® 11.2 (TensorFlow >= 2.5.0)
CUPTI ships with the CUDA® Toolkit.
cuDNN SDK 8.1.0 (see cuDNN versions).
(Optional) TensorRT 6.0 to improve latency and throughput for inference on some models.

Any thoughts are much appreciated!

CodePudding user response：

Installing TensorFlow is notoriously painful. There are many possible solutions, but I would recommend that you stop pip from automatically installing the dependencies of the package you're installing, since that is what can replace your working TensorFlow. If anything is missing afterwards, install it manually.

In practice, the only change is to the last line of the setup code you posted:

python -m pip install . --no-dependencies

Then go through the dependency list in the install script and look for each package on Anaconda. This usually means running conda install -c conda-forge [name of package 1] [name of package 2] .... Any package you can't find on Anaconda can be installed with pip inside the environment. The list from setup.py is:

'avro-python3',
'apache-beam',
'pillow',
'lxml',
'matplotlib',
'Cython',
'contextlib2',
'tf-slim',
'six',
'pycocotools',
'lvis',
'scipy',
'pandas',
'tf-models-official>=2.5.1',
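
To avoid typing the names by hand, the list can be turned into a single conda command. This is only a sketch: the helper names are made up, it assumes the conda-forge package names match the PyPI names (which is not guaranteed for every entry), and it deliberately drops the version specifiers so that conda can resolve versions against the TensorFlow already installed.

```python
import re

# install_requires entries copied from the setup.py list above.
REQUIRED_PACKAGES = [
    'avro-python3',
    'apache-beam',
    'pillow',
    'lxml',
    'matplotlib',
    'Cython',
    'contextlib2',
    'tf-slim',
    'six',
    'pycocotools',
    'lvis',
    'scipy',
    'pandas',
    'tf-models-official>=2.5.1',
]


def package_name(requirement):
    """Strip any version specifier, e.g. 'tf-models-official>=2.5.1' -> 'tf-models-official'."""
    return re.split(r"[<>=!~\s;\[]", requirement.strip(), maxsplit=1)[0]


def conda_install_command(requirements, channel="conda-forge"):
    """Build one 'conda install' command line for the given requirement strings."""
    names = [package_name(r) for r in requirements]
    return "conda install -c {} {}".format(channel, " ".join(names))


if __name__ == "__main__":
    print(conda_install_command(REQUIRED_PACKAGES))
```

If conda reports that a package is not found, remove it from the command and install just that one with pip.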

Since it seems like you will need tf-models-official >= 2.5.1, you might also want to target a TensorFlow version >= 2.5.1. If you do this, you will need to match the version of cudatoolkit to the version of TensorFlow as listed here.

CodePudding user response：

I went back to the Docker route and had success.

FROM tensorflow/tensorflow:latest-gpu

ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y protobuf-compiler ffmpeg libsm6 libxext6
RUN python3 -m pip install --upgrade pip

COPY . /app
WORKDIR /app/models/research

RUN protoc object_detection/protos/anchor_generator.proto --python_out=.
RUN protoc object_detection/protos/argmax_matcher.proto --python_out=.
RUN protoc object_detection/protos/bipartite_matcher.proto --python_out=.
RUN protoc object_detection/protos/box_coder.proto --python_out=.
RUN protoc object_detection/protos/box_predictor.proto --python_out=.
RUN protoc object_detection/protos/calibration.proto --python_out=.
RUN protoc object_detection/protos/center_net.proto --python_out=.
RUN protoc object_detection/protos/eval.proto --python_out=.
RUN protoc object_detection/protos/faster_rcnn.proto --python_out=.
RUN protoc object_detection/protos/faster_rcnn_box_coder.proto --python_out=.
RUN protoc object_detection/protos/flexible_grid_anchor_generator.proto --python_out=.
RUN protoc object_detection/protos/fpn.proto --python_out=.
RUN protoc object_detection/protos/graph_rewriter.proto --python_out=.
RUN protoc object_detection/protos/grid_anchor_generator.proto --python_out=.
RUN protoc object_detection/protos/hyperparams.proto --python_out=.
RUN protoc object_detection/protos/image_resizer.proto --python_out=.
RUN protoc object_detection/protos/input_reader.proto --python_out=.
RUN protoc object_detection/protos/keypoint_box_coder.proto --python_out=.
RUN protoc object_detection/protos/losses.proto --python_out=.
RUN protoc object_detection/protos/matcher.proto --python_out=.
RUN protoc object_detection/protos/mean_stddev_box_coder.proto --python_out=.
RUN protoc object_detection/protos/model.proto --python_out=.
RUN protoc object_detection/protos/multiscale_anchor_generator.proto --python_out=.
RUN protoc object_detection/protos/optimizer.proto --python_out=.
RUN protoc object_detection/protos/pipeline.proto --python_out=.
RUN protoc object_detection/protos/post_processing.proto --python_out=.
RUN protoc object_detection/protos/preprocessor.proto --python_out=.
RUN protoc object_detection/protos/region_similarity_calculator.proto --python_out=.
RUN protoc object_detection/protos/square_box_coder.proto --python_out=.
RUN protoc object_detection/protos/ssd.proto --python_out=.
RUN protoc object_detection/protos/ssd_anchor_generator.proto --python_out=.
RUN protoc object_detection/protos/string_int_label_map.proto --python_out=.
RUN protoc object_detection/protos/target_assigner.proto --python_out=.
RUN protoc object_detection/protos/train.proto --python_out=.

RUN cp object_detection/packages/tf2/setup.py .
RUN python3 -m pip install . 

WORKDIR /app

ENV LANG en_US.UTF-8 

ENTRYPOINT ["/usr/bin/python3"]

Originally this was not working for me; however, I have since adjusted the versions according to the following chart: https://www.tensorflow.org/install/source#gpu

After that, I simply followed the official documentation: https://www.tensorflow.org/install/docker
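
The Dockerfile above has one protoc line per .proto file, which could probably be collapsed. A minimal sketch, assuming it is saved as a hypothetical compile_protos.py and run with python3 from models/research (the function name protoc_command is also made up):

```python
import glob
import os
import subprocess


def protoc_command(proto_dir="object_detection/protos", out_dir="."):
    """Build a single protoc invocation covering every .proto file in proto_dir."""
    protos = sorted(glob.glob(os.path.join(proto_dir, "*.proto")))
    return ["protoc", *protos, "--python_out=" + out_dir]


if __name__ == "__main__":
    cmd = protoc_command()
    # Only call protoc if at least one .proto file was found.
    if len(cmd) > 2:
        subprocess.run(cmd, check=True)
    else:
        print("No .proto files found; run this from models/research.")
```

This does the same job as the shell glob object_detection/protos/*.proto used in the tutorial steps, but also picks up new .proto files without editing the Dockerfile.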