Will the gradients be recorded when finding the model's accuracy in pytorch?

Time:03-21

I am starting to learn PyTorch and I am confused about something. From what I understand, if we call .requires_grad_() on our parameters, then the calculations needed to find the gradients of those parameters will be recorded, and this lets us perform gradient descent. However, gradient values are added on top of the previous gradient values, so after we perform a gradient descent step we should reset our gradients using param.grad.zero_(), where param is a weight or a bias term. I have a model which has just the input layer and one output neuron, so really simple stuff (since I have just one output neuron, you can tell that I only have 2 possible classes). My parameters are called weights and bias, and it is on these 2 tensors that I call requires_grad_(). I also put my training data in a DataLoader called train_dl and my validation data in valid_dl. I use a subset of the MNIST dataset, but that is really not important to this question. These are the functions I use:

def forward_propagation(xb):
    z = xb @ weights + bias
    a = z.sigmoid()
    return a

def mse_loss(predictions, targets):
    loss = ((predictions - targets) ** 2).mean()
    return loss

def backward_propagation(loss):
    loss.backward()
    weights.data -= lr * weights.grad.data
    weights.grad.zero_()
    bias.data -= lr * bias.grad.data
    bias.grad.zero_()

def train_epoch():
    for xb, yb in train_dl:
        a = forward_propagation(xb)
        loss = mse_loss(a, yb)
        backward_propagation(loss)

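(For reference, the functions above assume weights, bias, and lr are defined elsewhere; a minimal hypothetical setup might look like the following, with shapes assumed for flattened 28x28 MNIST images:)

```python
import torch

# Hypothetical setup matching the question's description:
# one linear layer mapping 784 pixels to a single output neuron.
weights = torch.randn(784, 1) * 0.01
bias = torch.zeros(1)
weights.requires_grad_()   # record operations involving these tensors
bias.requires_grad_()
lr = 0.1                   # learning rate used in backward_propagation
```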
As you can see, I use the function train_epoch() to perform: forward propagation (where some of the operations needed for the gradient are recorded, since that is where our parameters are first used), the loss calculation (this step is also recorded for the gradients), and then backward propagation, where I update my parameters and reset the gradients to 0 so that they don't accumulate. I used this code to train my model and it worked fine; I am satisfied with the accuracy I got. So I assume that it works, at least somewhat.

But I also use this code to find the validation data accuracy for my model:

def valid_accuracy():
    accuracies = []
    for xb, yb in valid_dl:
        a = forward_propagation(xb)
        correct = (a > 0.5) == yb
        accuracies.append(correct.float().mean())
    return round(torch.stack(accuracies).mean().item(), 4)

As you can see, in finding the model's accuracy I perform forward propagation (the above function, where I multiply the weights by the data and add the bias). My question is: will the gradients also be recorded here? So the next time when I use .backward() on loss will the gradients be influenced by the steps taken in finding the accuracy? I think that as it is right now, the gradient values will be added each time I find the accuracy of the model (which I do not want and doesn't make sense), but I am not sure. Should I have somewhere in the function valid_accuracy() another 2 lines with weights.grad.zero_() and bias.grad.zero_() so that this doesn't happen? Or is it the case that this doesn't happen automatically, so I get the desired behavior by default and I simply misunderstood something?

CodePudding user response:

There are two things to consider: one is the gradients themselves, and the other is the computational graph that is built in each forward pass.

To compute the gradient after a forward pass, we need to record what operations have been done to what tensors in what order, that is, the computation graph. So whenever we compute a new tensor from other tensors that have requires_grad==True, the new tensor has an attribute .grad_fn that points to the previous operation and the involved tensors. This is basically how backward() "knows" where to go. If you call backward(), it will consider this .grad_fn and recursively do the backward pass.

So the way you currently do it will indeed build this computation graph, even when computing the accuracy. But if the graph is never accessed, the garbage collector will eventually destroy it.

The key thing to notice is that each separate evaluation produces a new computation graph (depending on your model, maybe sharing some parts), but the backward pass only starts from the node on which you called .backward(). So in your snippets you will never get a gradient from the accuracy computation: you never call a.backward(), you only ever call loss.backward().
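A tiny self-contained example (not the question's model) makes this concrete: two forward passes build two graphs, but only the tensor on which .backward() is called contributes to the gradient.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

loss = (w * 3).sum()      # graph 1, the "training" forward pass
extra = (w * 100).sum()   # graph 2, e.g. an "accuracy-style" forward pass

loss.backward()           # only graph 1 is traversed
print(w.grad)             # tensor([3.]) -- 'extra' contributed nothing
```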

Recording the computation graph does incur some overhead, though. This can be disabled using the torch.no_grad() context manager, which is made with this exact use case in mind. Unfortunately the name (as well as the documentation) mentions the gradient, but it really is about recording the (forward) computation graph. Of course, if you disable that, as a consequence you won't be able to compute a backward pass either.
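For instance (a small sketch, independent of the question's model), a tensor computed inside the context carries no graph, while the same computation outside the context does:

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

with torch.no_grad():
    y = w * 3              # no graph is recorded here
print(y.requires_grad)     # False
print(y.grad_fn)           # None

z = w * 3                  # outside the context, a graph is built again
print(z.requires_grad)     # True
```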

CodePudding user response:

The gradients stored in the leaf tensors (weights and bias in this case) are updated when you call .backward() on a tensor that is a descendant of those tensors. Since you aren't calling backward during valid_accuracy, the gradients won't be influenced. That said, PyTorch will store intermediate information in the temporary computation graph (which is discarded when the program returns from valid_accuracy and all the tensors referencing the computation graph go out of scope), and this takes time and memory.

You can, and should, use a torch.no_grad() context if you are certain you won't need to perform backpropagation on the output of the model. This disables the recording of the intermediate results. For example:

def valid_accuracy():
    with torch.no_grad():
        accuracies = []
        for xb, yb in valid_dl:
            a = forward_propagation(xb)
            correct = (a > 0.5) == yb
            accuracies.append(correct.float().mean())
        return round(torch.stack(accuracies).mean().item(), 4)