Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check if max pairs limit reached in generate_pairs and generate_multilabel_pairs #549

Merged
merged 6 commits into from
Sep 18, 2024

Conversation

OscarRunsCode
Copy link
Contributor

No description provided.

Copy link
Contributor Author

@OscarRunsCode OscarRunsCode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In trainer.py, generate_pairs and generate_multilabel_pairsthe program can run out of memory in highly imbalanced datasets due to not checking if max_pairs has been satisfied prior appending additional positive/negative examples.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@tomaarsen
Copy link
Member

Hello!

Apologies for the delay.
I think this direction makes a lot of sense. I'm actually a bit surprised that I didn't think of this initially. Previously, if you set max_pairs to e.g. 20, you would often get e.g. 22 pairs or something - which is just very confusing/surprising. This only got worse the larger the discrepancy between classes.

I merged your changes with some other recent changes from this week (e.g. moving away from the InputExample class), and then I rewrote it all somewhat. I think the direction is a bit clearer now, while still being equivalent to what you wrote. If max_pairs is not set, then the pairs are also still sampled like normal. All good there.

Thanks for putting this together and bringing it to my attention!

  • Tom Aarsen

@tomaarsen tomaarsen merged commit ac0e8bc into huggingface:main Sep 18, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants