I am using custom Recall and Precision metrics in my model. I know they have them built into Keras but I only care about one of the classes.
As an epoch begins, the metrics print values, but after many steps one metric returns NaN, and a few hundred steps later the second custom metric shows NaN as well.
The recall metric is written in the same way.
import tensorflow as tf
import tensorflow.keras.backend as K

def precision(y_true, y_pred):
    '''
    Calculates precision metric over gun label
    Precision = TP / (TP + FP)
    '''
    # I only care about the last label
    y_true = y_true[:, -1]
    y_pred = y_pred[:, -1]
    y_pred = tf.where(y_pred > .5, 1, 0)
    y_pred = tf.cast(y_pred, tf.float32)
    y_true = tf.cast(y_true, tf.float32)
    true_positives = K.sum(y_true * y_pred)
    false_positive = tf.math.reduce_sum(
        tf.where(tf.logical_and(tf.not_equal(y_true, y_pred), tf.equal(y_pred, 1)), 1, 0))
    false_positive = tf.cast(false_positive, tf.float32)
    precision = true_positives / (true_positives + false_positive)
    return precision
I am training a multi-label model, so my last dense layer is:

preds = Dense(num_classes, activation='sigmoid', name='Classifier')(x)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy', precision, recall])
model.fit(train_ds, steps_per_epoch=10000, validation_data=valid_ds, validation_steps=1181, epochs=200)
18/10000 [............] - ETA: 6:43 - loss: 0.6919 - accuracy: 0.0046 - precision: 0.2597 - recall: 0.4691
315/10000 [...........] - ETA: 7:56 - loss: 0.4174 - accuracy: 0.1145 - precision: nan - recall: 0.6115
10000/10000 [=========>] - ETA: 0s - loss: 0.0797 - accuracy: 0.5432 - precision: nan - recall: nan
10000/10000 [=========>] - 576s 56ms/step - loss: 0.0797 - accuracy: 0.5432 - precision: nan - recall: nan - val_loss: 0.0557 - val_accuracy: 0.5807 - val_precision: 0.9698 - val_recall: 0.9529
At the beginning of each epoch, the metrics show numbers again, but after many steps they go back to NaN. From observation, I can confirm the values do not go to 0 or 1 right before turning NaN.
CodePudding user response:
The issue was a divide by zero. Adding a small value to each denominator solved the problem. The NaN occurs whenever the network makes no positive predictions in a batch, which is why it happened intermittently.
import tensorflow.keras.backend as K
precision = true_positives / (true_positives + false_positive + K.epsilon())
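To see why the epsilon matters, here is a minimal NumPy sketch of the same logic (the `precision_np` name and the 1e-7 constant are my own; Keras's `K.epsilon()` defaults to the same order of magnitude). With no positive predictions in a batch, the metric now returns 0.0 instead of NaN:

```python
import numpy as np

EPSILON = 1e-7  # comparable to the Keras backend default epsilon

def precision_np(y_true, y_pred):
    """Precision over the last label only: TP / (TP + FP + eps)."""
    y_true = np.asarray(y_true)[:, -1].astype(np.float32)
    # Threshold the sigmoid output at 0.5, as in the TF metric above
    y_pred = (np.asarray(y_pred)[:, -1] > 0.5).astype(np.float32)
    tp = np.sum(y_true * y_pred)
    fp = np.sum((y_true != y_pred) & (y_pred == 1))
    # Without EPSILON, tp == fp == 0 would yield 0/0 -> NaN
    return tp / (tp + fp + EPSILON)
```

Calling it on a batch where every last-label prediction is below 0.5 (tp = fp = 0) returns 0.0 rather than NaN, which matches the intermittent behavior seen in the training log.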