How to split model.fit to continue training over multiple days


The TensorFlow model uses the following code for training:


steps_per_epoch is 10000 and the total number of epochs is 20000.

Is it possible to split the training across multiple days, for example:

day 1:

model.fit(..., steps_per_epoch=10000, ..., epochs=10, ....)
model.fit(..., steps_per_epoch=10000, ..., epochs=20, ....)
model.fit(..., steps_per_epoch=10000, ..., epochs=30, ....)

day 2:

model.fit(..., steps_per_epoch=10000, ..., epochs=100, ....)

day 3:

model.fit(..., steps_per_epoch=10000, ..., epochs=5, ....)

day (n):

model.fit(..., steps_per_epoch=10000, ..., epochs=n, ....)

The expected total number of epochs is:

20000 = (day1 + day2 + day3 + ... + dayn)

Can I simply stop the model.fit and start the model.fit on another day?

Is it the same as running once with "epochs=20000"?

CodePudding user response:

You can save your model at the end of each day as a pickle file, then load it the next day and continue training:

Training the model on day 1:

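import tensorflow_datasets as tfds
import tensorflow as tf
import joblib

# The dataset name and the other tfds.load arguments were lost from the
# original post; 'fashion_mnist' is an assumption (28x28x1 images, 10 classes).
train, test = tfds.load(
    'fashion_mnist',
    split = ['train', 'test'],
    as_supervised=True,  # yield (image, label) pairs, as model.fit expects
)

# The datasets are batched here, so batch_size is not passed to model.fit.
train = train.repeat(15).batch(64).prefetch(tf.data.AUTOTUNE)
test = test.batch(64).prefetch(tf.data.AUTOTUNE)

model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(28, 28, 1)))
model.add(tf.keras.layers.Conv2D(128, (3,3), activation='relu'))
model.add(tf.keras.layers.Flatten())  # assumed: needed before the Dense layers
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(10, activation='softmax'))  # softmax for 10 classes
# The start of the compile call was lost; the loss below is an assumption.
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])

model.fit(train, steps_per_epoch=150, epochs=3, verbose=1)
model.evaluate(test, verbose=1)
joblib.dump(model, 'model_day_1.pkl')  # pickle the trained model for day 2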

Output after day_1:

Epoch 1/3
150/150 [==============================] - 7s 17ms/step - loss: 23.0504 - accuracy: 0.5786
Epoch 2/3
150/150 [==============================] - 2s 16ms/step - loss: 0.9366 - accuracy: 0.7208
Epoch 3/3
150/150 [==============================] - 3s 17ms/step - loss: 0.7321 - accuracy: 0.7682
157/157 [==============================] - 1s 8ms/step - loss: 0.4627 - accuracy: 0.8405
INFO:tensorflow:Assets written to: ram://***/assets
INFO:tensorflow:Assets written to: ram://***/assets

Load the model on day 2 and continue training (train and test are built the same way as on day 1):

model = joblib.load("/content/model_day_1.pkl")
model.fit(train, steps_per_epoch=150, epochs=3, verbose=1)
model.evaluate(test, verbose=1)
joblib.dump(model, 'model_day_2.pkl')

Output after day_2:

Epoch 1/3
150/150 [==============================] - 3s 17ms/step - loss: 0.6288 - accuracy: 0.7981
Epoch 2/3
150/150 [==============================] - 2s 16ms/step - loss: 0.5290 - accuracy: 0.8222
Epoch 3/3
150/150 [==============================] - 2s 16ms/step - loss: 0.5124 - accuracy: 0.8272
157/157 [==============================] - 1s 5ms/step - loss: 0.4131 - accuracy: 0.8598
INFO:tensorflow:Assets written to: ram://***/assets
INFO:tensorflow:Assets written to: ram://***/assets

Load the model on day 3 and continue training:

model = joblib.load("/content/model_day_2.pkl")
model.fit(train, steps_per_epoch=150, epochs=3, verbose=1)
model.evaluate(test, verbose=1)
joblib.dump(model, 'model_day_3.pkl')

Output after day_3:

Epoch 1/3
150/150 [==============================] - 3s 17ms/step - loss: 0.4579 - accuracy: 0.8498
Epoch 2/3
150/150 [==============================] - 2s 17ms/step - loss: 0.4078 - accuracy: 0.8589
Epoch 3/3
150/150 [==============================] - 2s 16ms/step - loss: 0.4073 - accuracy: 0.8560
157/157 [==============================] - 1s 5ms/step - loss: 0.3997 - accuracy: 0.8603
INFO:tensorflow:Assets written to: ram://***/assets
INFO:tensorflow:Assets written to: ram://***/assets
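
As an alternative to pickling, Keras has its own save and load calls, which store the architecture, the weights and, for a compiled model, normally the optimizer state in a single file. The following is only a minimal sketch, not the method used above: the toy model, the random data and the file name model_day_1.keras are hypothetical, and the .keras format assumes a reasonably recent TensorFlow/Keras release.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Hypothetical toy data standing in for the real training set.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4)).astype("float32")
y = rng.integers(0, 2, size=(64,)).astype("int32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# "Day 1": train for a while, then save the model to one file.
model.fit(x, y, epochs=2, verbose=0)
checkpoint_path = os.path.join(tempfile.mkdtemp(), "model_day_1.keras")
model.save(checkpoint_path)

# "Day 2": typically in a fresh process, reload the model and keep training.
restored = tf.keras.models.load_model(checkpoint_path)
weights_match = all(
    np.allclose(a, b)
    for a, b in zip(model.get_weights(), restored.get_weights())
)
history_day_2 = restored.fit(x, y, epochs=2, verbose=0)
```

Here weights_match confirms that the reloaded model starts from the weights that were saved, so the second fit call picks up where the first one stopped.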

CodePudding user response:

I think you're asking whether multiple calls to model.fit continue training the model instead of starting from scratch. The answer is yes. However, each model.fit call returns a new History object, so if you are recording the training history, you will need to combine the results yourself.
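
One possible way to combine the separate histories is to concatenate the per-epoch lists stored in each History.history dict. The helper name merge_histories and the metric values below are made up for illustration; the plain dicts stand in for what model.fit(...).history would return.

```python
from collections import defaultdict


def merge_histories(history_dicts):
    """Concatenate per-epoch metric lists from several History.history dicts."""
    merged = defaultdict(list)
    for history in history_dicts:
        for metric_name, values in history.items():
            merged[metric_name].extend(values)
    return dict(merged)


# Stand-ins for model.fit(...).history from two separate calls (made-up values).
day_1 = {"loss": [0.9, 0.7], "accuracy": [0.72, 0.77]}
day_2 = {"loss": [0.6, 0.5], "accuracy": [0.80, 0.82]}

full_history = merge_histories([day_1, day_2])
print(full_history["loss"])  # prints [0.9, 0.7, 0.6, 0.5]
```

When training spans several days, each day's dict could be written to disk (for example as JSON) next to the saved model and merged at the end.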

So running

model.fit(..., epochs=10)
model.fit(..., epochs=10)

will train the model for 20 epochs in total.
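
If the epoch numbering should carry over between calls (for example so that progress logs or epoch-dependent callbacks continue from where the last run stopped), model.fit also accepts an initial_epoch argument. In that case epochs is the index of the final epoch, not the number of epochs to run. A minimal sketch, assuming a hypothetical toy model and random data:

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy regression data; any dataset would do.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 3)).astype("float32")
y = rng.normal(size=(32, 1)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")

# First call: runs epoch indices 0, 1, 2 (shown as "Epoch 1/3" to "Epoch 3/3").
first = model.fit(x, y, epochs=3, verbose=0)

# Second call: resumes at index 3 and stops after index 5 has run.
second = model.fit(x, y, initial_epoch=3, epochs=6, verbose=0)

# History.epoch records the epoch indices that each call actually ran.
epochs_run = first.epoch + second.epoch
```

Without initial_epoch the second call would still continue from the trained weights; only the epoch counter would restart at zero.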
