Tensorflow NaN loss during training with Tf.math.is_nan function


I have written a custom loss function that returns a loss of 0 when the ground truth labels (6d vector) are NaN and otherwise returns the mean squared error. Either all 6 features in the label are NaN, or there are no NaNs.

My loss function looks like:

tf.reduce_mean(tf.where(tf.math.is_nan(true_labels),
                        tf.zeros_like(true_labels),
                        tf.square(tf.subtract(true_labels, predicted_labels))))

where true_labels and predicted_labels have shape (batch_size, 6), and only entire rows of either matrix can be NaN. I get NaN loss values in this case, even though I should be returning 0 for the loss when the ground truth is NaN. I have also tested a workaround: during preprocessing I replace all the NaN values with a large negative number (-1e4, which is outside the range of my data), and then test for these sentinels in my loss function with

tf.where(tf.math.less(true_labels, -9999),
         tf.zeros_like(true_labels),
         tf.square(tf.subtract(true_labels, predicted_labels)))
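
The preprocessing replacement itself is a one-liner along these lines (a sketch; the values shown are only illustrative):

import numpy as np

# Sketch of the preprocessing workaround described above: replace NaN labels
# with a sentinel far outside the data range
labels = np.array([[0., np.nan, np.nan, np.nan, np.nan, np.nan]])
labels = np.where(np.isnan(labels), -1e4, labels)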

This is a total hack, but it works nonetheless. Therefore, I believe the issue is with the tf.math.is_nan function, but I have no idea why it gives me NaN losses. Furthermore, I have tested the loss function outside of training mode on some labels I made artificially, and it does not return NaNs then. Any help is appreciated.
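
For reference, the out-of-training check was along these lines (a sketch with illustrative values; the forward pass comes back finite):

import numpy as np
import tensorflow as tf

true_labels = tf.constant([[1., 106., 189., 2.64826314, 19., 26.44962941],
                           [0., np.nan, np.nan, np.nan, np.nan, np.nan]])
predicted_labels = tf.zeros_like(true_labels)

loss = tf.reduce_mean(tf.where(tf.math.is_nan(true_labels),
                               tf.zeros_like(true_labels),
                               tf.square(tf.subtract(true_labels, predicted_labels))))
tf.print(loss)  # finite value; no NaN appears in the forward pass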

This is my model below. It returns a (batch_size, 6)-shaped tensor. The first column is sigmoid-activated to lie in [0, 1] and is fed into a binary cross-entropy loss function (not included here, but I confirmed that the NaN is not coming from the binary loss). The remaining 5 columns are fed into the custom loss function defined above; a rough sketch of how the two losses fit together follows the model code.

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Reshape, Conv2D, BatchNormalization,
                                     Activation, MaxPool2D, Flatten, Dense)


def custom_activation(tensor):
    # Sigmoid-activate only the first output column; leave the remaining 5 linear
    first_node_sigmoid = tf.nn.sigmoid(tensor[:, :1])
    return tf.concat([first_node_sigmoid, tensor[:, 1:]], axis=1)


def gen_model():
    IMAGE_SIZE = 200
    CONV_PARAMS = {"kernel_size": 3, "use_bias": False, "padding": "same"}
    CONV_PARAMS2 = {"kernel_size": 5, "use_bias": False, "padding": "same"}

    model = Sequential()
    model.add(
        Reshape((IMAGE_SIZE, IMAGE_SIZE, 1), input_shape=(IMAGE_SIZE, IMAGE_SIZE))
    )
    model.add(Conv2D(16, **CONV_PARAMS))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool2D())
    model.add(Conv2D(32, **CONV_PARAMS))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool2D())
    model.add(Conv2D(64, **CONV_PARAMS))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Conv2D(64, **CONV_PARAMS))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Conv2D(64, **CONV_PARAMS2))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool2D())
    model.add(Conv2D(128, **CONV_PARAMS2))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool2D())
    model.add(Conv2D(128, **CONV_PARAMS2))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool2D())
    model.add(Flatten())
    model.add(Dense(64))
    model.add(Dense(6))
    model.add(tf.keras.layers.Lambda(custom_activation, name = "final_activation_layer"))
    return model
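
For what it's worth, the way the two losses fit together is roughly like this (a sketch, not my exact training code; combined_loss and the binary_crossentropy call are only illustrative of the split described above):

def combined_loss(true_labels, predicted_labels):
    # Column 0: binary cross-entropy on the sigmoid-activated output
    bce = tf.keras.losses.binary_crossentropy(true_labels[:, :1], predicted_labels[:, :1])
    # Columns 1-5: mean squared error that is supposed to ignore NaN ground truth
    reg_true = true_labels[:, 1:]
    reg_pred = predicted_labels[:, 1:]
    mse = tf.reduce_mean(tf.where(tf.math.is_nan(reg_true),
                                  tf.zeros_like(reg_true),
                                  tf.square(tf.subtract(reg_true, reg_pred))))
    return tf.reduce_mean(bce) + mse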

Here is an example of what the ground truth label looks like when the first feature is True (1):

[1., 106., 189., 2.64826314, 19., 26.44962941]

When the first feature is False (0), the label is:

[0, nan, nan, nan, nan, nan]

Update:

After some debugging with tf.print statements, I found that my predicted_labels are coming out as all NaN values. This issue does not occur when I use the 'hack' described above, so I don't think it is an issue with my data. I also checked that none of my images contain any NaNs after preprocessing, when used as input to the network. Somehow, with the loss function described above, I get NaNs in my predicted values, but I have no idea why. I have tried lowering the learning rate and the batch size, but this does not help.
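
For reference, the NaN check on the inputs was essentially of this form (a sketch; check_finite is an illustrative helper, not part of my actual pipeline):

import numpy as np

def check_finite(name, array):
    # Fail loudly if any NaN or Inf is present in the array
    if not np.all(np.isfinite(array)):
        raise ValueError(f"{name} contains NaN or Inf values")

# e.g. check_finite("train_images", train_images) after preprocessing,
# plus tf.print(predicted_labels) inside the loss to see the all-NaN predictions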

CodePudding user response:

Maybe something like the following could work for you. All NaN elements are first converted to 0, while the remaining elements stay the same. For example, [0, np.nan, np.nan, np.nan, np.nan, np.nan] becomes [0, 0, 0, 0, 0, 0], while [1., 106., 189., 2.64826314, 19., 26.44962941] remains untouched. Afterwards, your loss is only calculated for non-zero values; where true_labels are zero, you just return 0.
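
A likely reason the original formulation still blows up even though its forward pass is finite: tf.where masks the NaN branch only in the forward pass, while the gradient of the unselected branch is still computed and multiplied by zero, and 0 * NaN is NaN. Those NaN gradients then corrupt the weights, which matches the all-NaN predictions reported in the update. A minimal sketch of the effect (illustrative values):

import tensorflow as tf

y_true = tf.constant([[0., float("nan")]])
y_pred = tf.Variable([[0.1, 0.2]])

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.where(tf.math.is_nan(y_true),
                                   tf.zeros_like(y_true),
                                   tf.square(y_true - y_pred)))

tf.print(loss)                         # finite in the forward pass
tf.print(tape.gradient(loss, y_pred))  # contains NaN at the masked position

Replacing the NaN labels before the subtraction, as in the loss below, keeps NaN out of both the forward pass and the gradient.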

import tensorflow as tf
import numpy as np

def custom_loss(true_labels, predicted_labels):
    # Replace NaN labels with 0 so the subtraction below never sees a NaN
    true_labels = tf.where(tf.math.is_nan(true_labels), tf.zeros_like(true_labels), true_labels)
    # Where the (cleaned) label is 0, contribute 0; otherwise use the squared error
    loss = tf.reduce_mean(
        tf.where(tf.equal(true_labels, 0.0), true_labels,
                 tf.square(tf.subtract(true_labels, predicted_labels))))
    return loss

def custom_activation(tensor):
    first_node_sigmoid = tf.nn.sigmoid(tensor[:, :1])
    return tf.concat([first_node_sigmoid, tensor[:, 1:]], axis = 1)


def gen_model():
    IMAGE_SIZE = 200
    CONV_PARAMS = {"kernel_size": 3, "use_bias": False, "padding": "same"}
    CONV_PARAMS2 = {"kernel_size": 5, "use_bias": False, "padding": "same"}

    model = tf.keras.Sequential()
    model.add(
        tf.keras.layers.Reshape((IMAGE_SIZE, IMAGE_SIZE, 1), input_shape=(IMAGE_SIZE, IMAGE_SIZE))
    )
    model.add(tf.keras.layers.Conv2D(16, **CONV_PARAMS))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.MaxPool2D())
    model.add(tf.keras.layers.Conv2D(32, **CONV_PARAMS))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.MaxPool2D())
    model.add(tf.keras.layers.Conv2D(64, **CONV_PARAMS))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.Conv2D(64, **CONV_PARAMS))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.Conv2D(64, **CONV_PARAMS2))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.MaxPool2D())
    model.add(tf.keras.layers.Conv2D(128, **CONV_PARAMS2))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.MaxPool2D())
    model.add(tf.keras.layers.Conv2D(128, **CONV_PARAMS2))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.MaxPool2D())
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(64))
    model.add(tf.keras.layers.Dense(6))
    model.add(tf.keras.layers.Lambda(custom_activation, name = "final_activation_layer"))
    return model

Y_train = tf.constant([[1., 106., 189., 2.64826314, 19., 26.44962941], 
                       [0, np.nan, np.nan, np.nan, np.nan, np.nan]])
model = gen_model()
model.compile(loss=custom_loss, optimizer=tf.keras.optimizers.Adam())
model.fit(tf.random.normal((2, 200, 200)), Y_train, epochs=4)
Epoch 1/4
1/1 [==============================] - 1s 1s/step - loss: 4112.9380
Epoch 2/4
1/1 [==============================] - 0s 30ms/step - loss: 947.3030
Epoch 3/4
1/1 [==============================] - 0s 25ms/step - loss: 25.8993
Epoch 4/4
1/1 [==============================] - 0s 24ms/step - loss: 217.2151
<keras.callbacks.History at 0x7f8490b8db90>