A Keras model works perfectly fine after compiling/training:
>>> model.predict(values)
array([[5.28525668e-10, 3.66615766e-12, 2.76005746e-10, ...,
1.06744905e-10, 3.96939370e-09, 1.54998125e-09],
[1.08512407e-17, 1.16371355e-20, 3.40085518e-20, ...,
1.58855026e-15, 3.41645340e-23, 2.22618953e-18],
[8.91928664e-07, 1.51766372e-07, 5.11579383e-05, ...,
2.09874074e-07, 1.08243627e-08, 1.00344047e-03],
...,
[1.48135211e-06, 4.81735299e-07, 7.23933127e-08, ...,
6.75531879e-08, 2.97403737e-08, 5.35680655e-08],
[2.52744006e-12, 1.91630305e-11, 4.30207465e-13, ...,
6.73083234e-09, 1.56778467e-13, 6.92025376e-13],
[2.72180110e-08, 2.60345967e-08, 6.72346505e-05, ...,
1.04813864e-06, 8.22153803e-11, 6.33114814e-06]], dtype=float32)
But after saving the model and loading it in a different script:
# script 1
model.save('./model')

# script 2
import tensorflow as tf

model = tf.keras.models.load_model("./model")
Calling model.predict() on the loaded model returns only nan values, on the exact same input data:
>>> model.predict(values)
array([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]], dtype=float32)
This worked perfectly fine until recently, but the model suddenly started to behave like this. Going back to script 1 still works perfectly on the exact same data, and restarting both scripts, saving the model again, and reloading it does not change anything.
- I checked that the saved model and the loaded model are exactly the same (a sketch of the check follows this list)
- I also tried calling loaded_model(values, training=False), with no success
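For completeness, here is a minimal sketch of the equality check, assuming model is the fitted model and loaded_model is the reloaded one (np.testing.assert_array_equal treats nan values at matching positions as equal):

import numpy as np

# Compare every weight array of the two models pairwise
for w1, w2 in zip(model.get_weights(), loaded_model.get_weights()):
    np.testing.assert_array_equal(w1, w2)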
Any idea what is happening here and how to fix it? I am using TensorFlow 2.3.4.
CodePudding user response:
Turns out this was because some of the values in the training dataset were nan. As a result, the weights in some of the layers were also nan.
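A quick diagnostic for this, as a minimal sketch assuming the fitted model is in scope as model:

import numpy as np

# Print every layer whose weight arrays contain at least one nan
for layer in model.layers:
    if any(np.isnan(w).any() for w in layer.get_weights()):
        print("nan weights in layer:", layer.name)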
The surprising bit is that running model.predict() on GPU was perfectly fine, while on CPU it resulted in all-nan predictions.
I was using the fitted model directly on GPU and the saved model on CPU, so I assumed it had something to do with model saving, but it did not: it was purely a GPU-versus-CPU difference.
I ended up cleaning the nan values from the training dataset; the model is now free of nan weights and runs fine on both CPU and GPU.
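For reference, the cleanup boils down to dropping every training row that contains a nan; x_train and y_train below are placeholder names for the training arrays:

import numpy as np

# x_train / y_train are hypothetical names for the training arrays.
# Flatten each sample so the check works for any input rank, then
# keep only rows whose features are entirely nan-free.
mask = ~np.isnan(x_train.reshape(len(x_train), -1)).any(axis=1)
x_train, y_train = x_train[mask], y_train[mask]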