Keras model predicts NaNs after save/load

A Keras model works perfectly fine after compiling/training:

>>> model.predict(values)
array([[5.28525668e-10, 3.66615766e-12, 2.76005746e-10, ...,
        1.06744905e-10, 3.96939370e-09, 1.54998125e-09],
       [1.08512407e-17, 1.16371355e-20, 3.40085518e-20, ...,
        1.58855026e-15, 3.41645340e-23, 2.22618953e-18],
       [8.91928664e-07, 1.51766372e-07, 5.11579383e-05, ...,
        2.09874074e-07, 1.08243627e-08, 1.00344047e-03],
       ...,
       [1.48135211e-06, 4.81735299e-07, 7.23933127e-08, ...,
        6.75531879e-08, 2.97403737e-08, 5.35680655e-08],
       [2.52744006e-12, 1.91630305e-11, 4.30207465e-13, ...,
        6.73083234e-09, 1.56778467e-13, 6.92025376e-13],
       [2.72180110e-08, 2.60345967e-08, 6.72346505e-05, ...,
        1.04813864e-06, 8.22153803e-11, 6.33114814e-06]], dtype=float32)

But after saving the model and loading it in a different script:

# script 1
model.save('./model')

# script 2
model = tf.keras.models.load_model("./model")

Calling model.predict() on the loaded model returns only NaN values, on the exact same input data:

>>> model.predict(values)
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]], dtype=float32)

This worked fine until recently, but the model has suddenly started behaving like this. Going back to script 1 still works perfectly on the exact same data, and restarting both scripts, re-saving the model, and reloading it does not change anything.

  • I checked that the saved model and the loaded model are exactly the same (e.g. by comparing their weights, as in the sketch after this list)
  • I also tried calling loaded_model(values, training=False), with no success
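
For instance, the comparison can be done layer by layer on the weights. This is only a minimal sketch, assuming model is the freshly trained model from script 1 and loaded_model is the reloaded one:

import numpy as np

# Compare every weight tensor of the original and the reloaded model;
# equal_nan=True makes NaNs in the same position compare as equal
for w_orig, w_loaded in zip(model.get_weights(), loaded_model.get_weights()):
    if not np.allclose(w_orig, w_loaded, equal_nan=True):
        print("Found a weight tensor that differs after save/load")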

Any idea what is happening here and how to fix this? Using TensorFlow 2.3.4.

CodePudding user response:

It turns out this was because some of the values in the training dataset were NaN.

As a result, the weights in some of the layers were also NaN.
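
A quick way to confirm this (a minimal sketch; model stands for the fitted or loaded Keras model) is to scan the layer weights for NaNs:

import numpy as np

# Print every layer that contains at least one NaN weight
for layer in model.layers:
    if any(np.isnan(w).any() for w in layer.get_weights()):
        print(f"NaN weights in layer: {layer.name}")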

The surprising bit is that running model.predict() on GPU worked perfectly fine, while on CPU it produced all-NaN predictions.

I was using the fitted model directly on GPU and the saved model on CPU, which is why I believed the problem had something to do with saving the model, but it did not. It was purely a GPU versus CPU difference.

I ended up cleaning the NaN values from the training dataset; the model no longer has NaN weights and runs fine on both CPU and GPU.
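
For reference, a minimal sketch of dropping NaN rows before training. The array names x_train and y_train are hypothetical, and both are assumed to be 2-D NumPy arrays (samples x features/targets):

import numpy as np

# Keep only the rows where every feature and target value is a real number
row_ok = ~np.isnan(x_train).any(axis=1) & ~np.isnan(y_train).any(axis=1)
x_train, y_train = x_train[row_ok], y_train[row_ok]
print(f"Kept {row_ok.sum()} of {row_ok.size} training rows")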
