Huggingface Trainer load_best_model f1 score vs. loss and overfitting

Time:08-04

I have trained a roberta-large model with load_best_model_at_end=True and metric_for_best_model=f1. During training, I can see overfitting after the 6th epoch, which is the sweet spot. In epoch 8, which is the next one to be evaluated due to gradient accumulation, the train loss decreases while eval_loss increases, so overfitting has started. In the end, the transformers Trainer loads the model from epoch 8, checkpoint-14928, because its f1 score is slightly higher. In theory, wouldn't the model from epoch 6 be better suited, since it did not overfit? Or does one really go by the f1 metric here even though the model did overfit? (The eval loss decreased steadily in epochs <6.)

The test_loss of the second checkpoint, which is the one loaded as the "best", is 0.128. Is it possible to lower that by using the first checkpoint, which should be the better model anyway?

checkpoint-11196:
{'loss': 0.0638, 'learning_rate': 8.666799323450404e-06, 'epoch': 6.0}

{'eval_loss': 0.09599845856428146, 'eval_accuracy': 0.9749235986101227, 'eval_precision': 0.9648319293367138, 'eval_recall': 0.9858766505097777, 'eval_f1': 0.9752407721241682, 'eval_runtime': 282.2294, 'eval_samples_per_second': 84.637, 'eval_steps_per_second': 2.647, 'epoch': 6.0}

VS.

checkpoint-14928:
{'loss': 0.0312, 'learning_rate': 7.4291115311909265e-06, 'epoch': 8.0}

{'eval_loss': 0.12377820163965225, 'eval_accuracy': 0.976305103194206, 'eval_precision': 0.9719324391455539, 'eval_recall': 0.9810295838208257, 'eval_f1': 0.9764598236566295, 'eval_runtime': 276.7619, 'eval_samples_per_second': 86.309, 'eval_steps_per_second': 2.699, 'epoch': 8.0}

Answer:

You could simply comment out the metric_for_best_model='f1' part and see for yourself; loss is the default setting. Alternatively, use from_pretrained('path/to/checkpoint') to load the two checkpoints and compare them side by side. F-score is threshold-sensitive, so it's entirely possible for the lower-loss checkpoint to turn out better in the end (assuming you do optimize the threshold).
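To illustrate the threshold point, here is a minimal sketch in plain Python, assuming a binary task where each checkpoint yields a positive-class probability per validation example. The helper names f1_at_threshold and best_threshold and the toy probabilities are made up for illustration; they are not part of transformers and are not taken from the run above.

```python
from typing import Sequence, Tuple


def f1_at_threshold(probs: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Binary F1 when an example is predicted positive if prob >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs: Sequence[float], labels: Sequence[int], steps: int = 99) -> Tuple[float, float]:
    """Sweep evenly spaced thresholds in (0, 1); return (threshold, f1) with the highest F1."""
    candidates = [(i + 1) / (steps + 1) for i in range(steps)]
    scored = ((t, f1_at_threshold(probs, labels, t)) for t in candidates)
    return max(scored, key=lambda pair: pair[1])


# Hypothetical validation labels and positive-class probabilities (toy data).
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.6, 0.4, 0.45, 0.2, 0.1]

default_f1 = f1_at_threshold(probs, labels, 0.5)  # F1 at the usual 0.5 cut-off
tuned_threshold, tuned_f1 = best_threshold(probs, labels)
print(f"F1 at 0.5: {default_f1:.3f}")
print(f"best threshold: {tuned_threshold:.2f}, F1: {tuned_f1:.3f}")
```

Running the same sweep on the validation probabilities of both checkpoints, and only then comparing them on the test set, would show whether the lower-loss checkpoint overtakes the other once the threshold is tuned. The threshold should be picked on validation data, not on the test set.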
