Can't train model from checkpoint on Google Colab because the checkpoints are all deleted after a few hours

Time:01-24

I'm using Google Colab for finetuning a pre-trained model.

I successfully preprocessed a dataset and created an instance of the Seq2SeqTrainer class:

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

The problem is resuming training from the last checkpoint after the session has ended.

If I run trainer.train(), it runs fine. Because training takes a long time, I come back to the Colab tab after a few hours. I know that if the session crashes I can continue training from the last checkpoint like this: trainer.train("checkpoint-5500")

But the checkpoint data no longer exists on Google Colab if I come back too late. So even though I know how far training got, do I have to start all over again?

Is there any way to solve this problem?

CodePudding user response:

To fix your problem, save your checkpoints to a fixed, persistent path, for example a folder on your Google Drive, so that checkpoint-5500 is written there instead of to the session's temporary storage.

With your trainer, you can set the output directory to a Google Drive path when you create the Seq2SeqTrainingArguments instance.
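A minimal sketch of that setup, assuming Drive is mounted at /content/drive (so "My Drive" appears at /content/drive/MyDrive). The helper name drive_checkpoint_kwargs, the run name, and the save interval are hypothetical choices for illustration; output_dir, save_strategy, save_steps and save_total_limit are existing training-argument fields.

```python
import os

# Default location of "My Drive" once Drive is mounted at /content/drive.
DRIVE_ROOT = "/content/drive/MyDrive"


def drive_checkpoint_kwargs(run_name, drive_root=DRIVE_ROOT, save_steps=500):
    """Build keyword arguments that make the trainer write checkpoints to Drive.

    run_name and save_steps are illustrative values, not taken from the question.
    """
    return {
        # Checkpoints land in <drive_root>/<run_name>/checkpoint-<step>.
        "output_dir": os.path.join(drive_root, run_name),
        # Save every `save_steps` optimizer steps rather than only at the end.
        "save_strategy": "steps",
        "save_steps": save_steps,
        # Keep only the two most recent checkpoints to limit Drive usage.
        "save_total_limit": 2,
    }


# Usage in Colab, after mounting Drive (not executed here):
# from transformers import Seq2SeqTrainingArguments
# args = Seq2SeqTrainingArguments(**drive_checkpoint_kwargs("my-finetune-run"))
```

Writing to Drive is slower than writing to the local disk, so a larger save interval may be a reasonable trade-off for big models.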

When you come back to your code, if the session has indeed ended, you only need to load checkpoint-5500 from your Google Drive instead of retraining everything.

Add the following code to mount your Google Drive:

from google.colab import drive
# Mount Google Drive; its files then appear under /content/drive/MyDrive
drive.mount('/content/drive')

Then, after trainer.train("checkpoint-5500") finishes (or as its last step), save your checkpoint to your Google Drive. Or, if you prefer, you can configure the trainer (through its save settings or a callback) to save a checkpoint after every epoch. That way, if the session crashes before training finishes, you still have some progress saved.
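Once checkpoints live on Drive, resuming in a new session comes down to finding the newest checkpoint-N folder and passing it to trainer.train. The sketch below does this with the standard library only. find_latest_checkpoint and the run folder name are hypothetical, and it assumes checkpoint folders follow the checkpoint-<step> naming shown in the question. The transformers library also provides a similar helper, get_last_checkpoint, in transformers.trainer_utils.

```python
import os
import re


def find_latest_checkpoint(output_dir):
    """Return the path of the checkpoint-<step> folder with the highest step.

    Returns None if output_dir does not exist or holds no checkpoint folders.
    """
    if not os.path.isdir(output_dir):
        return None

    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        # Only directories named checkpoint-<digits> count as checkpoints.
        if match and os.path.isdir(path):
            step = int(match.group(1))
            if step > best_step:
                best_step, best_path = step, path
    return best_path


# Usage in Colab, after mounting Drive and rebuilding the trainer (not executed here):
# latest = find_latest_checkpoint("/content/drive/MyDrive/my-finetune-run")
# trainer.train(resume_from_checkpoint=latest)  # None starts training from scratch
```

Comparing the step numbers as integers matters here: a plain string sort would rank checkpoint-900 above checkpoint-5500.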
