Parameters can't be updated when using torch.nn.DataParallel to train on multiple GPUs


import torch
import torch.nn as nn
import os

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.h = -1
    def forward(self, x):
        self.h = x

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
if torch.cuda.is_available():
    print('using Cuda devices, num:', torch.cuda.device_count())

model = nn.DataParallel(Net())
x = 2
print(model.module.h)
model(x)
print(model.module.h)

When I use multiple GPUs to train my model, I find that the Net's params can't be updated correctly; they remain at their initial values. However, when I use only one GPU, they are updated correctly. How can I fix this problem? Thanks! (The outputs are shown in the screenshots below.)

This is when I use two GPUs; the param 'h' didn't change:

(screenshot of the output)

This is when I use only one GPU; the param 'h' changed:

(screenshot of the output)

CodePudding user response:

From the PyTorch documentation on DataParallel (https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html):

In each forward, module is replicated on each device, so any updates to the running module in forward will be lost. For example, if module has a counter attribute that is incremented in each forward, it will always stay at the initial value because the update is done on the replicas which are destroyed after forward.

I am guessing PyTorch skips the replication step when there is only one GPU, which is why the update survives in that case. Also, your h is just a plain Python attribute; it is not a "parameter" (an nn.Parameter) in PyTorch.
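
Not part of the original answer, but here is a minimal sketch of one common way around this, assuming you only need the value back on the caller's side: return it from forward instead of assigning it to self, since the per-GPU replicas (and any attributes written on them) are discarded after the forward pass. The x * 2 computation is just a placeholder.

import torch
import torch.nn as nn

class Net(nn.Module):
    def forward(self, x):
        # Return the value instead of writing it to self.h;
        # attribute writes happen on per-GPU replicas, which are
        # destroyed once forward finishes.
        return x * 2  # placeholder computation

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = Net().to(device)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

x = torch.ones(4, device=device)
h = model(x)   # the result comes back through the return value
print(h)       # behaves the same on CPU, one GPU, or multiple GPUs
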
