Tensorflow - Loss stagnant from first epoch for the same model which showed better results earlier


I was training a model for my project on Optical Communication on Colab and something weird happened. The model I trained first showed close to 99% training and 97% validation accuracy, but the runtime expired sometime in the night. When I re-trained the same model after reconnecting to the runtime, the accuracy stayed constant at 25% from the first epoch. There are 4 categories, and the model is predicting each of them with probability 0.25. I am not sure what is causing this, because after a few restarts the model briefly showed performance similar to the original run, but now it is back to 25% accuracy. Please refer to the model specs and training logs below.

Model Summary

import tensorflow as tf

model_fm = tf.keras.Sequential([
    tf.keras.layers.Conv1D(256, kernel_size=3, activation='relu', input_shape=x_train.shape[1:]),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Conv1D(128, kernel_size=3, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(4, activation='softmax')
])
model_fm.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# earlystopping and reduce_lr are callbacks defined elsewhere in the notebook
# (presumably tf.keras.callbacks.EarlyStopping and ReduceLROnPlateau)
model_fm.fit(x_train, y_train, batch_size=256, verbose=1, epochs=60,
             validation_data=(x_val, y_val), callbacks=[earlystopping, reduce_lr])

Earlier progress

Epoch 1/60 612/612 [==============================] - 170s 275ms/step - loss: 0.9359 - accuracy: 0.5621 - val_loss: 0.7793 - val_accuracy: 0.6299

Epoch 2/60 612/612 [==============================] - 168s 274ms/step - loss: 0.5998 - accuracy: 0.7369 - val_loss: 0.4597 - val_accuracy: 0.8002

Epoch 3/60 612/612 [==============================] - 173s 284ms/step - loss: 0.4464 - accuracy: 0.8078 - val_loss: 0.3138 - val_accuracy: 0.8693

Epoch 4/60 612/612 [==============================] - 174s 284ms/step - loss: 0.3427 - accuracy: 0.8578 - val_loss: 0.2393 - val_accuracy: 0.9037

After restarting runtime:

Epoch 1/60 409/409 [==============================] - 112s 273ms/step - loss: 1.3865 - accuracy: 0.2493 - val_loss: 1.3862 - val_accuracy: 0.2594

Epoch 2/60 409/409 [==============================] - 111s 271ms/step - loss: 1.3863 - accuracy: 0.2501 - val_loss: 1.3864 - val_accuracy: 0.2435

P.S. Ignore the change in the number of training samples in the latter case; the model showed the same behavior (25% accuracy) on the full dataset as well. I thought using a smaller number of samples might ease the situation, but it didn't. Your help is very much appreciated.

CodePudding user response:

I assume it is because the first dense layer is very large (about 16M parameters, roughly 99% of the total parameter count), so your model is very sensitive to initialization and can sometimes be hard to train.
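A minimal sketch (assuming TF 2.x and the model_fm from the question) to confirm where those parameters sit and to make the weight initialization reproducible across restarts; the seed value is arbitrary:

import numpy as np
import tensorflow as tf

# Set the seeds BEFORE building the model so the initial weights are the same on every run
np.random.seed(42)
tf.random.set_seed(42)

# ... build model_fm as in the question ...

# The Flatten -> Dense(256) layer should dominate the "Param #" column (~16M of the total)
model_fm.summary()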

CodePudding user response:

You use multiple dense layers in your architecture. When you flatten the last convolutional layer, an array of ~68k values is created, and each of those values is connected to all 256 neurons of the first dense layer. What you can do instead is use a GAP (global average pooling) layer followed by just one dense layer (4 neurons), or two dense layers (the first with 8 neurons and the second with 4), as sketched below.
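A minimal sketch of the suggested change (assuming the same x_train and 4 classes as in the question): GlobalAveragePooling1D replaces Flatten and the large dense stack, which removes the ~16M-parameter layer entirely.

import tensorflow as tf

model_gap = tf.keras.Sequential([
    tf.keras.layers.Conv1D(256, kernel_size=3, activation='relu', input_shape=x_train.shape[1:]),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Conv1D(128, kernel_size=3, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation='relu'),
    tf.keras.layers.GlobalAveragePooling1D(),     # replaces Flatten; output has only 64 values
    tf.keras.layers.Dense(8, activation='relu'),  # optional small hidden layer
    tf.keras.layers.Dense(4, activation='softmax')
])
model_gap.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])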
