How does optimizer.step() take the recent loss of the model?


I am looking at this example of a model from the PyTorch tutorials:

https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py

for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
            running_loss = 0.0

print('Finished Training')

And I have a very basic question: the optimizer was never attached to or defined inside the model (the way model.compile does in Keras), nor did it receive the loss or the labels of the last batch or epoch. How does it "know" to perform the optimization step?

CodePudding user response:

When the optimizer is instantiated, you pass it the model's parameters:

optimizer = optim.Adam(model.parameters())

optimizer.step() updates those parameters.

Gradients are computed by the loss.backward() call, which runs before the step method is called.
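
To make that concrete, here is a minimal, self-contained sketch (using a throwaway nn.Linear and made-up inputs instead of the tutorial's net and data) showing that the optimizer holds references to the very same parameter tensors as the model, so loss.backward() fills their grad attribute and optimizer.step() reads it:

    import torch
    import torch.nn as nn
    import torch.optim as optim

    # Tiny stand-in model, only to illustrate the wiring
    model = nn.Linear(4, 2)
    optimizer = optim.Adam(model.parameters())

    # The optimizer stores references to the *same* tensors the model owns
    w = model.weight
    print(any(p is w for g in optimizer.param_groups for p in g['params']))  # True

    x, y = torch.randn(8, 4), torch.randn(8, 2)
    loss = nn.functional.mse_loss(model(x), y)

    loss.backward()              # fills w.grad (and the bias gradient)
    print(w.grad is not None)    # True

    before = w.detach().clone()
    optimizer.step()             # reads w.grad and updates w in place
    print(torch.equal(before, w.detach()))  # False: the weights changed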

CodePudding user response:

Rather than thinking of the loss and the parameters as directly tied together, you should consider them as two separate events that are not explicitly linked. Indeed, there are two distinct mechanisms that affect the parameters and their cached gradients.

  • The autograd mechanism (the process in charge of performing the gradient computation) lets you call backward on a torch.Tensor (your loss), which in turn backpropagates through all the node tensors that were used to compute that final tensor value. In doing so, it navigates what is called the computation graph, updating each parameter's gradient by writing into its grad attribute. This means that at the end of a backward call, every learned parameter of the network that was used to compute this output will have a grad attribute containing the gradient of the loss with respect to that parameter.

    loss.backward()
    
  • The optimizer is independent of the backward pass, since it doesn't rely on it. You can call backward on your graph once, multiple times, or on different loss terms, depending on your use case. The optimizer's task is to take the parameters of the model independently (that is, irrespective of the network architecture or its computation graph) and update them using a given optimization routine (for example Stochastic Gradient Descent, Root Mean Squared Propagation, etc.). It goes through all the parameters it was initialized with and updates them using their respective gradient value (which is supposed to have been stored in the grad attribute by at least one backpropagation); a hand-written version of this update follows this list.

    optimizer.step()
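
As a rough illustration of how these two events fit together, the snippet below writes a plain SGD update by hand on a toy nn.Linear with random data; real optimizers such as Adam keep extra state, but they consume the grad attribute in exactly the same way:

    import torch
    import torch.nn as nn

    model = nn.Linear(3, 1)
    lr = 0.1
    x, y = torch.randn(16, 3), torch.randn(16, 1)
    loss = nn.functional.mse_loss(model(x), y)

    # 1) backward: autograd walks the computation graph and writes
    #    d(loss)/d(parameter) into each parameter's .grad attribute
    loss.backward()

    # 2) "step": the update only needs the parameters and their .grad values;
    #    for vanilla SGD it is simply p <- p - lr * p.grad
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad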
    

Important notes:

  • Keep in mind, though, that the backward process and the actual update call made by the optimizer are linked only implicitly, by the fact that the optimizer will use the results computed by the preceding backward call.

  • In PyTorch, parameter gradients are kept in memory and accumulate across backward calls, so you have to clear them out before performing a new backward call. This is done with the optimizer's zero_grad function; in practice, it clears the grad attribute of the tensors it has registered as parameters (see the sketch below).
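
Here is a small illustrative sketch (toy layer and random data, not the tutorial's) of that accumulation behaviour and of what zero_grad actually does:

    import torch
    import torch.nn as nn

    model = nn.Linear(2, 1)
    x, y = torch.randn(4, 2), torch.randn(4, 1)

    nn.functional.mse_loss(model(x), y).backward()
    g1 = model.weight.grad.clone()

    # Without zero_grad(), a second backward *adds* to the existing .grad ...
    nn.functional.mse_loss(model(x), y).backward()
    print(torch.allclose(model.weight.grad, 2 * g1))  # True

    # ... which is why the training loop clears it before each new batch
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    optimizer.zero_grad()
    print(model.weight.grad)  # None (or zeros on older PyTorch versions)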
