
feat(l2g): implement new training strategy splitting between EFO/gene pairs and with cross validation #938

Merged
merged 15 commits into dev from il-3253 on Dec 9, 2024

Conversation

@ireneisdoomed (Contributor) commented on Dec 2, 2024

✨ Context

The changes here reflect the training strategy used to train the latest L2G model (presented at the Genetics meeting on 21/11 and discussed here).

I've essentially changed the LocusToGeneTrainer.train function to follow that diagram, with a 5-fold cross-validation strategy enabled by default.
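
To make the default concrete, here is a minimal, self-contained sketch of a 5-fold cross-validation loop grouped by gene/EFO pair. The classifier, features, and group identifiers are toy stand-ins; this is not the actual gentropy implementation.

```python
# Hedged sketch: 5-fold CV where folds are grouped so that rows sharing a
# gene/EFO pair never land in both the training and the validation fold.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))           # toy feature matrix
y = rng.integers(0, 2, size=100)        # toy gold-standard labels
groups = rng.integers(0, 20, size=100)  # toy gene/EFO pair identifiers

cv = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(cv.split(X, y, groups)):
    model = GradientBoostingClassifier().fit(X[train_idx], y[train_idx])
    auc = roc_auc_score(y[val_idx], model.predict_proba(X[val_idx])[:, 1])
    print(f"fold {fold}: validation AUC = {auc:.3f}")
```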

The most complicated part has been integrating it with W&B Sweeps. Sweeps group different runs together and are useful for testing several parameter configurations and comparing them. They have a known issue where each cross-validation fold keeps being overwritten, so tweaking this so that a sweep shows results for each fold plus the evaluation on the test set was the part that kept me busy. For the record, I took inspiration from here.
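
As an illustration of the workaround, the hedged sketch below logs each fold as its own W&B run inside a sweep so folds are not overwritten. `train_one_fold`, the project name, and the group/run names are placeholders, not the actual gentropy code.

```python
# Hedged sketch (loosely following the public W&B cross-validation example):
# give every fold its own run, grouped together, so a sweep reports each
# fold instead of overwriting a single run.
import wandb

def train_one_fold(train_data, val_data, **params):
    # Placeholder for the real per-fold training; returns dummy metrics.
    return {"val_auc": 0.5}

def run_cross_validation(folds, sweep_config: dict, group_name: str):
    for fold_id, (train_data, val_data) in enumerate(folds):
        run = wandb.init(
            project="gentropy-locus-to-gene",
            group=group_name,           # ties the fold runs together in the UI
            job_type="cross-validation",
            name=f"fold-{fold_id}",
            config=sweep_config,
            reinit=True,                # allow multiple runs in one process
        )
        run.log(train_one_fold(train_data, val_data, **sweep_config))
        run.finish()
```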

An example run can be found here: https://wandb.ai/open-targets/gentropy-locus-to-gene/sweeps/917iffr0?nw=nwuseropentargets

This closes opentargets/issues#3253

🛠 What does this PR implement

  • New param in the L2G step: cross_validate, which controls whether cross-validation is run. It is set to true by default. I've tested that the function works in both scenarios.
  • Rewritten train to implement the new strategy (described in the docs). The key difference is the train/test split, which takes gene/EFO pairs into account so that the same pair is never shared between the two sets (a minimal sketch of such a split follows this list).
  • Changed hyperparameter_tuning to cross_validate. This function now not only accepts a grid of parameters to sweep over, but is also responsible for running and tracking the cross-validation. This was complex because I had to play around a lot with the W&B config variables so that each sweep contained all my runs.
  • Added the best set of params to the dictionary of LocusToGeneStep config defaults and to the default hyperparameters dict inside the model.
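
For the gene/EFO-aware split mentioned above, here is a minimal, self-contained sketch using scikit-learn's GroupShuffleSplit. The geneId/efoId column names are assumptions; this is not the actual gentropy code.

```python
# Hedged sketch: hold out a test set such that every gene/EFO pair falls
# entirely on one side of the split. Column names are assumed.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_gene_efo(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    # Each unique gene/EFO combination forms one group; GroupShuffleSplit
    # guarantees groups never straddle the train/test boundary.
    groups = df["geneId"] + "_" + df["efoId"]
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=groups))
    return df.iloc[train_idx], df.iloc[test_idx]

toy = pd.DataFrame(
    {
        "geneId": ["ENSG1", "ENSG1", "ENSG2", "ENSG2", "ENSG3"],
        "efoId": ["EFO_1", "EFO_1", "EFO_2", "EFO_3", "EFO_1"],
        "goldStandard": [1, 0, 1, 0, 1],
    }
)
train_df, test_df = split_by_gene_efo(toy, test_size=0.4)
```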

🙈 Missing

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

@ireneisdoomed changed the title from "feat(l2g): implement new training strategy with cross validation and after regularization" to "feat(l2g): implement new training strategy splitting between EFO/gene pairs and with cross validation" on Dec 2, 2024
@project-defiant (Contributor) left a comment

All looks good, although I am not a 100% expert on cross-validation.

src/gentropy/method/l2g/trainer.py (review thread resolved)
src/gentropy/method/l2g/trainer.py (outdated review thread, resolved)
@ireneisdoomed ireneisdoomed merged commit 79f6fcc into dev Dec 9, 2024
5 checks passed
@ireneisdoomed ireneisdoomed deleted the il-3253 branch December 9, 2024 16:35
Development

Successfully merging this pull request may close these issues.

Perform hyperparameter tuning and cross validation on L2G