pytorch's autograd.detect_anomaly equivalent in tensorflow-CodePudding

I am trying to debug my tensorflow code that suddenly produces a NaN loss after about 30 epochs. You may find my specific problem and things I tried in this SO question.

I monitored the weights of all layers for each mini-batch during training and found that the weights suddenly jump to NaN although all weight values were less than 1 during the previous iteration (I have set kernel_constraint max_norm to 1). This makes it very hard to figure out which operation is the culprit.

Pytorch has a cool debugging method torch.autograd.detect_anomaly that produces an error at any backward computation that produces NaN value and shows the traceback. This makes it easy to debug the code.

Is there something similar in TensorFlow? If not can you suggest a method to debug this?

CodePudding user response：

There is indeed a similar debugging tool in tensorflow. See tf.debugging.check_numerics.

This can be used to track the tensors that produce inf or nan values during training. As soon as such value is found, tensorflow produces an InvalidArgumentError.

tf.debugging.check_numerics(LayerN, "LayerN is producing nans!")

If the tensor LayerN has nans, you would get an error like that:

Traceback (most recent call last):
  File "trainer.py", line 506, in <module>
    worker.train_model()
  File "trainer.py", line 211, in train_model
    l, tmae = train_step(*batch)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 855, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2943, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 560, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError:  LayerN is producing nans! : Tensor had NaN values