I'm puzzled by the fact that I've seen blocks of code that require tf.GradientTape().watch() to work and blocks that seem to work without it.
For example, this block of code requires the watch() function:
x = tf.ones((2,2))  # Note: in the original posting of the question,
                    # I didn't include this key line.
with tf.GradientTape() as t:
    # Record the actions performed on tensor x with `watch`
    t.watch(x)
    # Define y as the sum of the elements in x
    y = tf.reduce_sum(x)
    # Let z be the square of y
    z = tf.square(y)
# Get the derivative of z wrt the original input tensor x
dz_dx = t.gradient(z, x)
But this block does not:
with tf.GradientTape() as tape:
    logits = model(images, training=True)
    loss_value = loss_object(labels, logits)
loss_history.append(loss_value.numpy().mean())
grads = tape.gradient(loss_value, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
What is the difference between these two cases?
CodePudding user response:
The documentation of watch says the following:

Ensures that tensor is being traced by this tape.

Any trainable variable that is accessed within the context of the tape is watched by default. This means we can calculate the gradient with respect to that trainable variable by calling t.gradient(loss, variable). Check the following example:
def grad(model, inputs, targets):
    with tf.GradientTape() as tape:
        loss_value = loss(model, inputs, targets, training=True)
    return loss_value, tape.gradient(loss_value, model.trainable_variables)
Hence there is no need to use tape.watch in the above code. But sometimes we need to calculate gradients with respect to something that is not a trainable variable; in those cases we need to watch it explicitly:
with tf.GradientTape() as t:
    t.watch(images)
    predictions = cnn_model(images)
    loss = tf.keras.losses.categorical_crossentropy(expected_class_output, predictions)
gradients = t.gradient(loss, images)
In the above code, images is the input to the model, not a trainable variable. I need to calculate the gradient of the loss with respect to the images, and hence I need to watch them.
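A self-contained sketch of that pattern could look like this; the tiny Sequential model, the random images, and the one-hot labels are just placeholders I made up so the snippet runs on its own, not part of the code above:

import tensorflow as tf

# Hypothetical stand-ins so the example is runnable end to end
cnn_model = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
images = tf.random.uniform((4, 8, 8, 1))               # plain EagerTensor, not a Variable
expected_class_output = tf.one_hot([0, 1, 2, 3], 10)   # fake labels

with tf.GradientTape() as t:
    t.watch(images)  # required: the input images are not trainable variables
    predictions = cnn_model(images)
    loss = tf.keras.losses.categorical_crossentropy(expected_class_output, predictions)

gradients = t.gradient(loss, images)
print(gradients.shape)  # (4, 8, 8, 1) -- one gradient value per input pixel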
CodePudding user response:
Although the answer and comment given above were helpful, I think this code most clearly illustrates what I needed to know.
# Define a 2x2 array of 1's
x = tf.ones((2,2))
x1 = tf.Variable([[1,1],[1,1]], name='x1', dtype=tf.float32)
with tf.GradientTape(persistent=True) as t:
    # Note: no `watch` call here, so the plain tensor x is not recorded
    # Define y as the sum of the elements in x plus the sum of the elements in x1
    y = tf.reduce_sum(x) + tf.reduce_sum(x1)
    # Let z be the square of y
    z = tf.square(y)
# Get the derivative of z wrt the original input tensor x
dz_dx = t.gradient(z, x)
dz_dx1 = t.gradient(z, x1)
# Print results
print("dz_dx =", dz_dx)
print("dz_dx1 =", dz_dx1)
print("type(x) =", type(x))
print("type(x1) =", type(x1))
which gives,
dz_dx = None
dz_dx1 = tf.Tensor(
[[16. 16.]
[16. 16.]], shape=(2, 2), dtype=float32)
type(x) = <class 'tensorflow.python.framework.ops.EagerTensor'>
type(x1) = <class 'tensorflow.python.ops.resource_variable_ops.ResourceVariable'>
In TensorFlow, all variables are given the property trainable=True by default, so they are automatically watched. I neglected (out of ignorance) to indicate originally that x was not a variable but an EagerTensor, which is not watched by default. model.trainable_variables is a list of trainable variables, which are watched by the gradient tape, whereas a regular tensor is not watched unless you specifically tell the tape to watch it.
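As a final sanity check, here is a small variation on the same example (a sketch; the only change is the explicit t.watch(x) call): once the plain EagerTensor is watched, its gradient is no longer None.

import tensorflow as tf

x = tf.ones((2, 2))                                # plain EagerTensor
x1 = tf.Variable([[1., 1.], [1., 1.]], name='x1')  # trainable variable

with tf.GradientTape(persistent=True) as t:
    t.watch(x)  # explicitly watch the EagerTensor
    y = tf.reduce_sum(x) + tf.reduce_sum(x1)
    z = tf.square(y)

print("dz_dx  =", t.gradient(z, x))   # now a 2x2 tensor of 16's instead of None
print("dz_dx1 =", t.gradient(z, x1))  # still a 2x2 tensor of 16's (auto-watched)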