Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/best model checkpoint fix #35885

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

seanswyi
Copy link
Contributor

What does this PR do?

Fixes #35609 (TL;DR best_model_checkpoint is being set when the checkpoint may not even exist)

  • Add a new attribute to TrainerState called best_global_step that keeps track of the step where we performed evaluation and achieved a new best metric.
  • Moved setting best_model_checkpoint from _determine_best_metric to _save_checkpoint.
  • Inside _save_checkpoint, check if the best_model_checkpoint actually exists, and set it if so.

This was a bit trickier than I thought it would be, since the Trainer's saving and evaluation logic are not tightly coupled as would be in the save_strategy == "best" scenario.

Before submitting

Who can review?

@muellerzr
@SunMarc
@tomaarsen

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@SunMarc SunMarc requested a review from tomaarsen January 27, 2025 15:05
@SunMarc
Copy link
Member

SunMarc commented Jan 27, 2025

tests/trainer/test_trainer.py::TrainerIntegrationTest::test_best_model_checkpoint_behavior test you created is failing, can you check ?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@seanswyi seanswyi force-pushed the fix/best-model-checkpoint-fix branch from e68d64f to 9963e52 Compare January 27, 2025 21:14
@seanswyi
Copy link
Contributor Author

tests/trainer/test_trainer.py::TrainerIntegrationTest::test_best_model_checkpoint_behavior test you created is failing, can you check ?

I think the error should be fixed now. Problem was I had hardcoded values and wasn't using the patch correctly. Thanks for pointing it out.

@seanswyi seanswyi force-pushed the fix/best-model-checkpoint-fix branch from a00279e to 4108f6c Compare January 28, 2025 12:02
@SunMarc
Copy link
Member

SunMarc commented Jan 28, 2025

Make sure to run make style for the CI. There is also a failing test but that's unrelated to this PR

@seanswyi seanswyi force-pushed the fix/best-model-checkpoint-fix branch from 4108f6c to 492211b Compare January 28, 2025 13:48
@seanswyi
Copy link
Contributor Author

Make sure to run make style for the CI. There is also a failing test but that's unrelated to this PR

Done. I'm not sure why but there was a bit of discrepancy between my local setting and the CI pipeline so I had to make the changes through a Docker image since make style wasn't working.

@SunMarc
Copy link
Member

SunMarc commented Jan 28, 2025

thanks ! i will wait for @tomaarsen to test the PR before reviewing it

@tomaarsen
Copy link
Member

I haven't reviewed the code in this PR, but I can confirm that my test passes with this branch again!
cc @SunMarc

  • Tom Aarsen

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Trainer sets state.best_model_checkpoint even when it doesn't save there; leads to training crash
4 participants