-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(l2g): implement new training strategy splitting between EFO/gene pairs and with cross validation #938
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…-gold-standard-schema-simple
ireneisdoomed
changed the title
feat(l2g): implement new training strategy with cross validation and after regularization
feat(l2g): implement new training strategy splitting between EFO/gene pairs and with cross validation
Dec 2, 2024
project-defiant
approved these changes
Dec 9, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All looks good, although I am not 100% expert on the cross validation.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
✨ Context
The changes here reflect the training strategy used for training the latest L2G model (presented in the Genetics meeting on 21/11 and discussed here).
I've essentially changed the
LocusToGeneTrainer.train
function to follow that diagram with a 5 fold cross validation strategy set by default.The most complicated part has been to integrate it with W&B Sweeps. Sweeps allow to group different runs together and are useful to test several parameter configurations and compare between them. They have a known issue where each cross validation fold keeps being overwritten, so tweaking this so that a Sweep shows results for each fold + the evaluation for the test set was the part that kept me busy. For the record, I took inspiration from here.
An example run can be found here: https://wandb.ai/open-targets/gentropy-locus-to-gene/sweeps/917iffr0?nw=nwuseropentargets
This closes opentargets/issues#3253
🛠 What does this PR implement
cross_validate
. Whether to run this step. It is set to true by default. I've tested that the function works on both scenarios.train
: to implement the new strategy (described in docs). The key difference is the strategy to split between train and test sets, which takes into account gene/EFO pairs so that they are always kept separate.hyperparameter_tuning
tocross_validate
. This function now not only accepts a grid of parameters to sweep over, but is the responsible for doing and tracking the cross validation. This is complex because I had to play a lot around the W&B config variables so that each sweep contained all my runs.LocusToGeneStep
config defaults and to thehyperparameters
default dict inside the model.🙈 Missing
🚦 Before submitting
dev
branch?make test
)?poetry run pre-commit run --all-files
)?