
feat(l2g): implement new training strategy splitting between EFO/gene pairs and with cross validation #938

Merged
merged 15 commits into dev from il-3253 on Dec 9, 2024

Conversation

@ireneisdoomed (Contributor) commented on Dec 2, 2024

✨ Context

The changes here reflect the training strategy used to train the latest L2G model (presented at the Genetics meeting on 21/11 and discussed here).

I've essentially changed the LocusToGeneTrainer.train function to follow that diagram, with a 5-fold cross-validation strategy enabled by default.
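
To make the default concrete, here is a minimal, self-contained sketch of a 5-fold cross-validation loop grouped by gene/EFO pair. The classifier, features, and group identifiers are toy stand-ins; this is not the actual gentropy implementation.

```python
# Hedged sketch: 5-fold CV where folds are grouped so that rows sharing a
# gene/EFO pair never land in both the training and the validation fold.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))           # toy feature matrix
y = rng.integers(0, 2, size=100)        # toy gold-standard labels
groups = rng.integers(0, 20, size=100)  # toy gene/EFO pair identifiers

cv = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(cv.split(X, y, groups)):
    model = GradientBoostingClassifier().fit(X[train_idx], y[train_idx])
    auc = roc_auc_score(y[val_idx], model.predict_proba(X[val_idx])[:, 1])
    print(f"fold {fold}: validation AUC = {auc:.3f}")
```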

The most complicated part has been integrating it with W&B Sweeps. Sweeps group different runs together and are useful for testing several parameter configurations and comparing them. They have a known issue where each cross-validation fold keeps being overwritten, so tweaking this so that a sweep shows results for each fold plus the evaluation on the test set was the part that kept me busy. For the record, I took inspiration from here.
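
As an illustration of the workaround, the hedged sketch below logs each fold as its own W&B run inside a sweep so folds are not overwritten. `train_one_fold`, the project name, and the group/run names are placeholders, not the actual gentropy code.

```python
# Hedged sketch (loosely following the public W&B cross-validation example):
# give every fold its own run, grouped together, so a sweep reports each
# fold instead of overwriting a single run.
import wandb

def train_one_fold(train_data, val_data, **params):
    # Placeholder for the real per-fold training; returns dummy metrics.
    return {"val_auc": 0.5}

def run_cross_validation(folds, sweep_config: dict, group_name: str):
    for fold_id, (train_data, val_data) in enumerate(folds):
        run = wandb.init(
            project="gentropy-locus-to-gene",
            group=group_name,           # ties the fold runs together in the UI
            job_type="cross-validation",
            name=f"fold-{fold_id}",
            config=sweep_config,
            reinit=True,                # allow multiple runs in one process
        )
        run.log(train_one_fold(train_data, val_data, **sweep_config))
        run.finish()
```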

An example run can be found here: https://wandb.ai/open-targets/gentropy-locus-to-gene/sweeps/917iffr0?nw=nwuseropentargets

This closes opentargets/issues#3253

🛠 What does this PR implement

  • New param in the L2G step: cross_validate, which controls whether cross-validation is run. It is set to true by default. I've tested that the function works in both scenarios.
  • Rewritten train to implement the new strategy (described in the docs). The key difference is the train/test split, which takes gene/EFO pairs into account so that the same pair is never shared between the two sets (a minimal sketch of such a split follows this list).
  • Changed hyperparameter_tuning to cross_validate. This function now not only accepts a grid of parameters to sweep over, but is also responsible for running and tracking the cross-validation. This was complex because I had to play around a lot with the W&B config variables so that each sweep contained all my runs.
  • Added the best set of params to the dictionary of LocusToGeneStep config defaults and to the default hyperparameters dict inside the model.
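
For the gene/EFO-aware split mentioned above, here is a minimal, self-contained sketch using scikit-learn's GroupShuffleSplit. The geneId/efoId column names are assumptions; this is not the actual gentropy code.

```python
# Hedged sketch: hold out a test set such that every gene/EFO pair falls
# entirely on one side of the split. Column names are assumed.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_gene_efo(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    # Each unique gene/EFO combination forms one group; GroupShuffleSplit
    # guarantees groups never straddle the train/test boundary.
    groups = df["geneId"] + "_" + df["efoId"]
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=groups))
    return df.iloc[train_idx], df.iloc[test_idx]

toy = pd.DataFrame(
    {
        "geneId": ["ENSG1", "ENSG1", "ENSG2", "ENSG2", "ENSG3"],
        "efoId": ["EFO_1", "EFO_1", "EFO_2", "EFO_3", "EFO_1"],
        "goldStandard": [1, 0, 1, 0, 1],
    }
)
train_df, test_df = split_by_gene_efo(toy, test_size=0.4)
```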

🙈 Missing

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

@ireneisdoomed changed the title from "feat(l2g): implement new training strategy with cross validation and after regularization" to "feat(l2g): implement new training strategy splitting between EFO/gene pairs and with cross validation" on Dec 2, 2024
@project-defiant (Contributor) left a comment

All looks good, although I am not a 100% expert on cross-validation.

src/gentropy/method/l2g/trainer.py (review thread resolved)
src/gentropy/method/l2g/trainer.py (outdated review thread, resolved)
@ireneisdoomed ireneisdoomed merged commit 79f6fcc into dev Dec 9, 2024
5 checks passed
@ireneisdoomed ireneisdoomed deleted the il-3253 branch December 9, 2024 16:35
Development

Successfully merging this pull request may close these issues.

Perform hyperparameter tuning and cross validation on L2G