Suppose I have 8 GPUs on a server (numbered 0 to 7).
When I train a simple (and small) model on GPU #0, it takes about 20 minutes per epoch. However, when I run more than 5 or 6 models at once, for example 2 experiments per GPU on GPUs #0 to #2 (6 in total), the training time per epoch explodes to about 1 hour.
When I train 2 models per GPU on all GPUs (16 experiments in total), it takes about 3 hours to complete an epoch.
When I check CPU utilization, it looks fine, but GPU utilization drops.
What is the reason for the drop, and how can I solve the problem?
CodePudding user response:
There are basically two ways of using multiple GPUs for deep learning:
- Use torch.nn.DataParallel(module) (DP). This approach is quite discouraged by the official documentation because it replicates the entire module on all GPUs at every forward pass, and the replicas are destroyed at the end of the pass. With big models this can become an important bottleneck in your training time and can even make training slower than on a single GPU, for instance when you freeze a large part of a big module for fine-tuning. That's why you may consider using:
- Use torch.nn.parallel.DistributedDataParallel(module, device_ids=...) (DDP); see the documentation. This usually requires refactoring your code a little bit more, but it improves efficiency because the model is copied to each GPU only once, at the beginning of training. The replicas are persistent over time and the gradients are synchronized after each backward pass via hooks (a minimal sketch follows below). To go further, you can distribute the data and the optimizer as well to avoid data transfers. You can do that simply (as well as parallelize modules) using torch-ignite/distributed.
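Here is a minimal single-node DDP sketch, one process per GPU. The tiny linear model, the random TensorDataset and the hyperparameters are placeholders rather than your actual setup; the point is just to show the refactoring DDP asks for: spawn one process per device, wrap the model once, and shard the data with a DistributedSampler. For comparison, DP would only be model = torch.nn.DataParallel(model) inside a single process.

    # Minimal DDP sketch (single node, one process per GPU).
    # Toy model and random data for illustration only.
    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

    def run(rank, world_size):
        # One process per GPU: initialize the process group and pin the device.
        os.environ.setdefault("MASTER_ADDR", "localhost")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)

        model = torch.nn.Linear(32, 1).cuda(rank)
        # The model is copied to this GPU once, here; gradients are synced via hooks.
        ddp_model = DDP(model, device_ids=[rank])

        dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
        # DistributedSampler gives each process its own shard of the data.
        sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
        loader = DataLoader(dataset, batch_size=64, sampler=sampler)

        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
        loss_fn = torch.nn.MSELoss()

        for epoch in range(2):
            sampler.set_epoch(epoch)  # reshuffle differently each epoch
            for x, y in loader:
                x, y = x.cuda(rank), y.cuda(rank)
                optimizer.zero_grad()
                loss = loss_fn(ddp_model(x), y)
                loss.backward()  # gradients are all-reduced across processes here
                optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = torch.cuda.device_count()
        mp.spawn(run, args=(world_size,), nprocs=world_size)

With this structure each process owns one GPU and one shard of the data, so there is no per-step model replication; torch-ignite/distributed can wrap much of this boilerplate for you if you prefer not to write it by hand.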
I don't know which method you tried, but I encourage you to use DDP instead of DP if you are currently using it.
CodePudding user response:
While GPUs can process data several orders of magnitude faster than a CPU thanks to massive parallelism, GPUs are not as versatile as CPUs. CPUs have large and broad instruction sets and manage every input and output of a computer, which a GPU cannot do. While individual CPU cores are faster (as measured by clock speed) and smarter than individual GPU cores (as measured by available instruction sets), the sheer number of GPU cores and the massive parallelism they offer more than make up for the difference in single-core clock speed and the limited instruction sets.
GPUs are best suited for repetitive, highly parallel computing tasks. If your code is not repetitive or highly parallel, you should use your CPU.
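As a rough illustration of that point (assuming PyTorch and at least one CUDA device; the 4096x4096 size is arbitrary): a big matrix multiplication is exactly the kind of repetitive, highly parallel work where the GPU pulls ahead, whereas a branchy, sequential loop would not benefit.

    # Rough CPU-vs-GPU timing of a highly parallel operation (matrix multiply).
    import time
    import torch

    x = torch.randn(4096, 4096)

    t0 = time.time()
    x @ x                      # runs on the CPU cores
    cpu_s = time.time() - t0

    xg = x.cuda()
    xg @ xg                    # warm-up: the first CUDA call pays a one-time init cost
    torch.cuda.synchronize()   # CUDA calls are asynchronous; sync before timing
    t0 = time.time()
    xg @ xg
    torch.cuda.synchronize()
    gpu_s = time.time() - t0

    print(f"CPU: {cpu_s:.3f}s  GPU: {gpu_s:.3f}s")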