I am very new to the world of deep learning and I am struggling to train a model on my GPU.
I'm following a Udemy course which more or less goes through the following process: https://github.com/abdelrahman-gaber/tf2-object-detection-api-tutorial
At the moment, I do the following:
$ conda create -n tf2_gpu tensorflow-gpu==2.4.1 cudatoolkit==10.1.243
$ conda activate tf2_gpu
From here, I run a Python script:
import tensorflow as tf
print(f'TensorFlow version is {tf.__version__}')
print(tf.config.list_physical_devices('GPU'))
Which outputs:
2021-10-28 10:20:13.443211: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
TensorFlow version is 2.4.1
2021-10-28 10:20:14.109708: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-10-28 10:20:14.110159: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-10-28 10:20:14.148679: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-28 10:20:14.148880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:26:00.0 name: NVIDIA GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.77GHz coreCount: 40 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2021-10-28 10:20:14.148893: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-10-28 10:20:14.149875: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-10-28 10:20:14.149893: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2021-10-28 10:20:14.151002: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-10-28 10:20:14.151178: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-10-28 10:20:14.152293: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-10-28 10:20:14.153391: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-10-28 10:20:14.156860: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-10-28 10:20:14.156975: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-28 10:20:14.157223: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-28 10:20:14.157371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Yay, I have my GPU at index 0.
I then follow the example:
# remember to activate your python environment first
cd models/research
# compile protos:
protoc object_detection/protos/*.proto --python_out=.
# Install TensorFlow Object Detection API as a python package:
cp object_detection/packages/tf2/setup.py .
python -m pip install .
The setup.py script is: https://github.com/tensorflow/models/blob/master/research/object_detection/packages/tf2/setup.py
At this point, TensorFlow has been upgraded to 2.6.0, and the Python script above now outputs:
TensorFlow version is 2.6.0
2021-10-28 10:31:56.869027: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-28 10:31:56.873968: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2021-10-28 10:31:56.873996: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]
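The warning above says TensorFlow could not dlopen libcudnn.so.8. Here is a minimal sketch of a similar check done outside TensorFlow, using only the standard library. The library names are copied from the two logs in this question; the helper name can_load is made up for illustration. TensorFlow may search additional paths, so treat the result as a rough hint only, and run it inside the activated conda environment.

```python
import ctypes

# Shared-library names copied from the two TensorFlow logs above.
CANDIDATE_LIBS = ["libcudart.so.10.1", "libcudnn.so.7", "libcudnn.so.8"]


def can_load(lib_name):
    """Return True if the dynamic loader can open lib_name, else False."""
    try:
        ctypes.CDLL(lib_name)
    except OSError:
        # Raised when the loader cannot find or open the library.
        return False
    return True


if __name__ == "__main__":
    for name in CANDIDATE_LIBS:
        status = "found" if can_load(name) else "NOT found"
        print(f"{name}: {status}")
```

If libcudnn.so.7 loads but libcudnn.so.8 does not, that would be consistent with the environment still holding the cuDNN 7 that came with the 2.4.1 install.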
I don't understand how my TensorFlow version is changing, nor why the upgraded version doesn't work with my GPU. I can only assume it's caused by the required libraries in the setup.py script, but I am not seeing it.
Changing the tf-models-official pin in setup.py has no effect from what I can see:
# 'tf-models-official>=2.5.1',
'tf-models-official==2.4.0',
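One way to look for the source of the upgrade is to ask the environment which installed distributions declare a requirement on tensorflow. The sketch below uses only importlib.metadata from the standard library (Python 3.8+). The function names requirement_name and dists_requiring are hypothetical, and the name parsing is deliberately simple rather than a full requirement parser.

```python
import re
from importlib import metadata


def requirement_name(req_line):
    """Extract the normalized project name from a requirement line."""
    match = re.match(r"[A-Za-z0-9._-]+", req_line.strip())
    if not match:
        return ""
    # Treat runs of '-', '_' and '.' as equivalent and ignore case.
    return re.sub(r"[-_.]+", "-", match.group(0)).lower()


def dists_requiring(target):
    """Map each installed distribution to its requirement lines naming target."""
    hits = {}
    for dist in metadata.distributions():
        lines = [r for r in (dist.requires or []) if requirement_name(r) == target]
        if lines:
            hits[dist.metadata["Name"] or "<unknown>"] = lines
    return hits


if __name__ == "__main__":
    for name, lines in sorted(dists_requiring("tensorflow").items()):
        print(name, "->", lines)
```

Any distribution printed with a requirement newer than 2.4.1 is a candidate for what pulled in TensorFlow 2.6.0.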
Some other information:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.5 LTS
Release: 18.04
Codename: bionic
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Sep_13_19:13:29_PDT_2021
Cuda compilation tools, release 11.5, V11.5.50
Build cuda_11.5.r11.5/compiler.30411180_0
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:26:00.0  On |                  N/A |
|  0%   43C    P8     5W / 215W |     19MiB /  7974MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       965      G   /usr/lib/xorg/Xorg                 14MiB |
|    0   N/A  N/A      1766      G   ...mviewer/tv_bin/TeamViewer        2MiB |
+-----------------------------------------------------------------------------+
Originally I had CUDA 9.0, but I upgraded to 11.5.
$ ls -la /usr/local/
lrwxrwxrwx 1 root root 18 Oct 27 18:03 cuda -> /usr/local/cuda-11
lrwxrwxrwx 1 root root 25 Oct 27 17:55 cuda-11 -> /etc/alternatives/cuda-11
drwxr-xr-x 16 root root 4096 Oct 27 17:55 cuda-11.5
The only other clue I have is a pretty big one: the install guide says TensorFlow supports CUDA® 11.2 (TensorFlow >= 2.5.0), but neither 2.6.0 nor 2.5.0 detects my GPU.
Software requirements
The following NVIDIA® software must be installed on your system:
NVIDIA® GPU drivers —CUDA® 11.2 requires 450.80.02 or higher.
CUDA® Toolkit —TensorFlow supports CUDA® 11.2 (TensorFlow >= 2.5.0)
CUPTI ships with the CUDA® Toolkit.
cuDNN SDK 8.1.0 (see cuDNN versions).
(Optional) TensorRT 6.0 to improve latency and throughput for inference on some models.
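To make the version matching concrete, here is a small lookup sketch. The numbers were transcribed by hand from the "Tested build configurations" GPU table in the TensorFlow install docs and should be checked against that chart before relying on them. The names TESTED_GPU_BUILDS and expected_cuda_stack are made up for this example.

```python
# (CUDA, cuDNN) versions per TensorFlow release, transcribed by hand from the
# official "Tested build configurations" GPU table. Assumption: verify these
# against the chart itself, since they are not taken from this thread.
TESTED_GPU_BUILDS = {
    "2.6": ("11.2", "8.1"),
    "2.5": ("11.2", "8.1"),
    "2.4": ("11.0", "8.0"),
    "2.3": ("10.1", "7.6"),
}


def expected_cuda_stack(tf_version):
    """Return (cuda, cudnn) for a TensorFlow version string, or None if unknown."""
    major_minor = ".".join(tf_version.split(".")[:2])
    return TESTED_GPU_BUILDS.get(major_minor)


if __name__ == "__main__":
    for version in ("2.4.1", "2.6.0"):
        print(version, "->", expected_cuda_stack(version))
```

The table describes the official builds. The conda package in the question evidently loads libcudart.so.10.1 and libcudnn.so.7 for 2.4.1, so other builds can differ. Either way, 2.6.0 looking for libcudnn.so.8 fits the cuDNN 8.1 entry.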
Any thoughts are much appreciated!
CodePudding user response:
Installing TensorFlow is notoriously painful. There are many possible solutions, but I would recommend that you stop pip from automatically installing the dependencies of the library you're installing. If anything is missing afterwards, install it manually.
In practice, the only change is to the last line of the setup commands you posted:
python -m pip install . --no-dependencies
Then go through the list of packages in the install script and look for each one on Anaconda. This usually means typing conda install -c conda-forge [name of package 1] [name of package 2] ... Any packages you can't find on Anaconda can be installed with pip in the environment.
'avro-python3',
'apache-beam',
'pillow',
'lxml',
'matplotlib',
'Cython',
'contextlib2',
'tf-slim',
'six',
'pycocotools',
'lvis',
'scipy',
'pandas',
'tf-models-official>=2.5.1',
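After installing with --no-dependencies, a quick way to see what still has to be installed by hand is to check each entry of the list against the current environment. This is only a sketch: it tests for presence and ignores version specifiers such as >=2.5.1, and the function name missing_packages is hypothetical.

```python
import re
from importlib import metadata

# Requirement strings copied from the setup.py list above.
REQUIRED = [
    "avro-python3", "apache-beam", "pillow", "lxml", "matplotlib", "Cython",
    "contextlib2", "tf-slim", "six", "pycocotools", "lvis", "scipy", "pandas",
    "tf-models-official>=2.5.1",
]


def missing_packages(requirements):
    """Return the project names from requirements that are not installed."""
    missing = []
    for spec in requirements:
        # Keep only the project name; drop any version specifier or extras.
        name = re.split(r"[<>=!~\[; ]", spec, maxsplit=1)[0]
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Still to install:", missing_packages(REQUIRED))
```

The names it prints are the ones to look up on conda-forge first, falling back to pip.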
Since it seems like you will need tf-models-official >= 2.5.1, you might also want to target a TensorFlow version >= 2.5.1. If you do this, you will need to match the version of cudatoolkit to the version of TensorFlow as listed here.
CodePudding user response:
I went back to the Docker route and had success.
FROM tensorflow/tensorflow:latest-gpu
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y protobuf-compiler ffmpeg libsm6 libxext6
RUN python3 -m pip install --upgrade pip
COPY . /app
WORKDIR /app/models/research
RUN protoc object_detection/protos/anchor_generator.proto --python_out=.
RUN protoc object_detection/protos/argmax_matcher.proto --python_out=.
RUN protoc object_detection/protos/bipartite_matcher.proto --python_out=.
RUN protoc object_detection/protos/box_coder.proto --python_out=.
RUN protoc object_detection/protos/box_predictor.proto --python_out=.
RUN protoc object_detection/protos/calibration.proto --python_out=.
RUN protoc object_detection/protos/center_net.proto --python_out=.
RUN protoc object_detection/protos/eval.proto --python_out=.
RUN protoc object_detection/protos/faster_rcnn.proto --python_out=.
RUN protoc object_detection/protos/faster_rcnn_box_coder.proto --python_out=.
RUN protoc object_detection/protos/flexible_grid_anchor_generator.proto --python_out=.
RUN protoc object_detection/protos/fpn.proto --python_out=.
RUN protoc object_detection/protos/graph_rewriter.proto --python_out=.
RUN protoc object_detection/protos/grid_anchor_generator.proto --python_out=.
RUN protoc object_detection/protos/hyperparams.proto --python_out=.
RUN protoc object_detection/protos/image_resizer.proto --python_out=.
RUN protoc object_detection/protos/input_reader.proto --python_out=.
RUN protoc object_detection/protos/keypoint_box_coder.proto --python_out=.
RUN protoc object_detection/protos/losses.proto --python_out=.
RUN protoc object_detection/protos/matcher.proto --python_out=.
RUN protoc object_detection/protos/mean_stddev_box_coder.proto --python_out=.
RUN protoc object_detection/protos/model.proto --python_out=.
RUN protoc object_detection/protos/multiscale_anchor_generator.proto --python_out=.
RUN protoc object_detection/protos/optimizer.proto --python_out=.
RUN protoc object_detection/protos/pipeline.proto --python_out=.
RUN protoc object_detection/protos/post_processing.proto --python_out=.
RUN protoc object_detection/protos/preprocessor.proto --python_out=.
RUN protoc object_detection/protos/region_similarity_calculator.proto --python_out=.
RUN protoc object_detection/protos/square_box_coder.proto --python_out=.
RUN protoc object_detection/protos/ssd.proto --python_out=.
RUN protoc object_detection/protos/ssd_anchor_generator.proto --python_out=.
RUN protoc object_detection/protos/string_int_label_map.proto --python_out=.
RUN protoc object_detection/protos/target_assigner.proto --python_out=.
RUN protoc object_detection/protos/train.proto --python_out=.
RUN cp object_detection/packages/tf2/setup.py .
RUN python3 -m pip install .
WORKDIR /app
ENV LANG en_US.UTF-8
ENTRYPOINT ["/usr/bin/python3"]
Originally this was not working for me, but I have since adjusted the versions according to the following chart: https://www.tensorflow.org/install/source#gpu
Since then, I have simply followed the official documentation: https://www.tensorflow.org/install/docker
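For anyone who wants to confirm the container actually sees the GPU, here is a small smoke-test sketch. It assumes TensorFlow 2.3 or newer (for tf.sysconfig.get_build_info) and falls back to empty values if TensorFlow is not importable. The function name gpu_report is made up for illustration.

```python
def gpu_report():
    """Summarize the TensorFlow build and the GPUs it can see."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this interpreter.
        return {"tensorflow": None, "cuda": None, "cudnn": None, "gpus": []}

    # get_build_info is assumed to exist (TF >= 2.3); fall back to {} if not.
    get_info = getattr(tf.sysconfig, "get_build_info", None)
    build = get_info() if get_info else {}
    return {
        "tensorflow": tf.__version__,
        "cuda": build.get("cuda_version"),
        "cudnn": build.get("cudnn_version"),
        "gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

Since the ENTRYPOINT above is /usr/bin/python3, a script like this can be passed as the argument after the image name. The container also needs GPU access when it is started, as described in the official Docker guide linked above.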