Fix/best model checkpoint fix #35885
The test you created, `tests/trainer/test_trainer.py::TrainerIntegrationTest::test_best_model_checkpoint_behavior`, is failing. Can you check?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
e68d64f to 9963e52
I think the error should be fixed now. The problem was that I had hardcoded values and wasn't using the patch correctly. Thanks for pointing it out.
a00279e to 4108f6c
Make sure to run `make style` for the CI. There is also a failing test, but that's unrelated to this PR.
Previously, `best_model_checkpoint` was set explicitly without checking whether the checkpoint directory even exists. The setting logic now lives inside `_save_checkpoint`, and the value is set only if the directory exists.
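For readers skimming the thread, here is a minimal sketch of the idea described in this PR: remember the step of the best evaluation, and only point `best_model_checkpoint` at a directory once that directory is actually on disk. This is not the Trainer's code. `SketchState`, `record_evaluation`, and `update_best_checkpoint` are hypothetical names, and the `checkpoint-<step>` folder naming is an assumption made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchState:
    """Hypothetical stand-in for the trainer state; not the real TrainerState."""

    best_metric: Optional[float] = None
    best_global_step: Optional[int] = None
    best_model_checkpoint: Optional[str] = None


def record_evaluation(state: SketchState, metric: float, global_step: int,
                      greater_is_better: bool = True) -> bool:
    """At evaluation time, remember the best metric and its step, but no path."""
    if state.best_metric is None:
        is_better = True
    elif greater_is_better:
        is_better = metric > state.best_metric
    else:
        is_better = metric < state.best_metric
    if is_better:
        state.best_metric = metric
        state.best_global_step = global_step
    return is_better


def update_best_checkpoint(state: SketchState, run_dir: str) -> None:
    """At save time, set the best checkpoint path only if the folder exists."""
    if state.best_global_step is None:
        return
    # Assumed folder naming for this sketch: "checkpoint-<step>".
    candidate = os.path.join(run_dir, f"checkpoint-{state.best_global_step}")
    if os.path.isdir(candidate):
        state.best_model_checkpoint = candidate
```

Under these assumptions, an evaluation that happens on a step with no save leaves `best_model_checkpoint` untouched, so nothing later tries to load a directory that was never written.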
4108f6c to 492211b
Done. I'm not sure why, but there was a discrepancy between my local setup and the CI pipeline, so I had to make the changes through a Docker image since
Thanks! I will wait for @tomaarsen to test the PR before reviewing it.
I haven't reviewed the code in this PR, but I can confirm that my test passes with this branch again!
What does this PR do?
Fixes #35609 (TL;DR: `best_model_checkpoint` is being set when the checkpoint may not even exist).
- Adds `best_global_step`, which keeps track of the step where we performed evaluation and achieved a new best metric.
- Moves the setting of `best_model_checkpoint` from `_determine_best_metric` to `_save_checkpoint`.
- In `_save_checkpoint`, checks whether the `best_model_checkpoint` actually exists, and sets it if so.
This was a bit trickier than I thought it would be, since the Trainer's saving and evaluation logic are not tightly coupled, as they would be in the `save_strategy == "best"` scenario.
Before submitting
- Related issue: Trainer sets `state.best_model_checkpoint` even when it doesn't save there; leads to training crash #35609
Who can review?
@muellerzr
@SunMarc
@tomaarsen
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.