How to Record Variables in Pytorch Without Breaking Gradient Computation?

Time: 01-21

I am trying to implement some policy gradient training, similar to this. However, I would like to manipulate the rewards (e.g., compute the discounted future sum and other differentiable operations) before doing the backward propagation.

Consider the manipulate function defined to calculate the reward to go:

import numpy as np
import torch as T

def manipulate(reward_pool):
    # Reward-to-go: each entry becomes the sum of itself and all later rewards
    n = len(reward_pool)
    R = np.zeros_like(reward_pool)
    for i in reversed(range(n)):
        R[i] = reward_pool[i] + (R[i + 1] if i + 1 < n else 0)
    return T.as_tensor(R)
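
For example, applying it to the pool [1.0, 2.0, 3.0] gives the rewards-to-go [6.0, 5.0, 3.0]:

print(manipulate([1.0, 2.0, 3.0]))  # prints the reward-to-go tensor [6., 5., 3.]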

I tried to store the rewards in a list:

#pseudocode
reward_pool = [0 for i in range(batch_size)]

for k in range(batch_size):
  act = net(state)
  state, reward = env.step(act)
  reward_pool[k] = reward  # write into a preallocated slot

R = manipulate(reward_pool)
R.backward()
optimizer.step()

It seems that the in-place operation breaks the gradient computation; the code gives me an error: one of the variables needed for gradient computation has been modified by an inplace operation.
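
For illustration only (this is not my training loop), the same error class can be reproduced in isolation by modifying, in place, a tensor that autograd has already saved for the backward pass:

import torch

x = torch.ones(3, requires_grad=True)
y = torch.exp(x)    # the backward of exp needs its own output
y[0] = 0.0          # in-place write invalidates the saved output
y.sum().backward()  # RuntimeError: ... has been modified by an inplace operation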

I also tried to initialize an empty tensor first and store the rewards in it, but an in-place operation is still the issue: a view of a leaf Variable that requires grad is being used in an in-place operation.
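
That second message can also be reproduced by itself: indexing a leaf tensor that requires grad produces a view, and writing into that view in place is rejected (again, just an illustration):

import torch

pool = torch.zeros(4, requires_grad=True)  # leaf tensor
pool[0] = 1.0  # RuntimeError: a view of a leaf Variable that requires grad
               # is being used in an in-place operation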

I am kind of new to PyTorch. Does anyone know the right way to record rewards in this case?

Edit: Solution Found

Simply initialize an empty pool (list) for each iteration and append to the pool when a new reward is calculated, i.e.

reward_pool = []

for k in range(batch_size):
  act = net(state)
  state, reward = env.step(act)
  reward_pool.append(reward)  # append a fresh entry instead of assigning into a slot

R = manipulate(reward_pool)
R.backward()
optimizer.step()

CodePudding user response:

The issue is caused by assignment into an existing object. Simply initialize an empty pool (list) for each iteration and append to the pool when a new reward is calculated, i.e.

reward_pool = []

for k in range(batch_size):
  act = net(state)
  state, reward = env.step(act)
  reward_pool.append(reward)  # append a fresh entry instead of assigning into a slot

R = manipulate(reward_pool)
R.backward()
optimizer.step()
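
Note that if the rewards themselves are tensors that should stay in the autograd graph, the manipulate function in the question will still detach them, because it writes into a NumPy array. A minimal sketch of a reward-to-go built from fresh torch tensors only (assuming each entry of reward_pool is a scalar or 0-d tensor):

import torch

def manipulate(reward_pool):
    # Build the reward-to-go without any in-place writes, so autograd can track it
    R = []
    running = torch.zeros(())      # running sum of future rewards
    for r in reversed(reward_pool):
        running = r + running      # fresh tensor each step, nothing assigned in place
        R.append(running)
    R.reverse()
    return torch.stack(R)

Each reward is added into a new running tensor rather than assigned into a preallocated buffer, which is exactly the pattern that avoids both in-place errors above.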