feat(datasets): adding new variant annotation model #641

Merged
merged 57 commits into from
Jun 30, 2024
76cb983
feat(variant annotation): new variant annotation schema + logic to ex…
DSuveges Jun 12, 2024
aa963d2
fix: typehints in function
DSuveges Jun 12, 2024
bbb18af
refactor(variant annotation): migrating methods to the new schema
DSuveges Jun 14, 2024
4bfa2d4
chore: pre-commit auto fixes [...]
pre-commit-ci[bot] Jun 14, 2024
7e6572d
refactor(variant index): sorting out new variant index dataset
DSuveges Jun 14, 2024
053483e
Merge branch 'ds_3333_new_variant_index' of https://github.com/openta…
DSuveges Jun 14, 2024
ea152df
chore: pre-commit auto fixes [...]
pre-commit-ci[bot] Jun 14, 2024
5c70a90
feature(vep): adding predictors to vep transcript object
DSuveges Jun 17, 2024
73103bc
Merge branch 'ds_3333_new_variant_index' of https://github.com/openta…
DSuveges Jun 17, 2024
0e112ef
fix(schema): fixing schema missing fields
DSuveges Jun 17, 2024
d92679b
fix(schema): fixing schema missing fields
DSuveges Jun 17, 2024
bf975c6
fix(schema): fixing schema missing fields
DSuveges Jun 17, 2024
b95cc09
fix(schema): fixing schema missing fields
DSuveges Jun 18, 2024
5b3c58c
chore: pre-commit auto fixes [...]
pre-commit-ci[bot] Jun 18, 2024
3690359
fix(annotation): array union under condition
DSuveges Jun 18, 2024
6f48f7b
fix: resolving merge conflicts
DSuveges Jun 18, 2024
5e9e6fa
fix: merging dbxref objects
DSuveges Jun 18, 2024
8225864
feat(variants): updating variants to make more robust
DSuveges Jun 18, 2024
73ebc86
feat: migrating methods to new variant index
DSuveges Jun 18, 2024
6a4f301
adjusting variant index methods
DSuveges Jun 19, 2024
052446f
some updates
DSuveges Jun 19, 2024
77eef57
rename v2g to variant to gene
DSuveges Jun 19, 2024
1e53432
chore: pre-commit auto fixes [...]
pre-commit-ci[bot] Jun 19, 2024
213e7d3
adding test
DSuveges Jun 19, 2024
6a175b5
chore: pre-commit auto fixes [...]
pre-commit-ci[bot] Jun 19, 2024
1d87add
fix(precommit): json file needed to rename to jsonl
DSuveges Jun 19, 2024
144ce5c
merge remote
DSuveges Jun 19, 2024
0345c3e
fix(precommit): removing steps depending on old data model
DSuveges Jun 20, 2024
707ac66
fix(coftest): fixing variant index mock generation
DSuveges Jun 20, 2024
ad9db90
fix: typo in package import
DSuveges Jun 20, 2024
fc09036
fix: sorting out conftest
DSuveges Jun 20, 2024
264ad91
refactor(gwas ingest): Updating GnomAD handling
DSuveges Jun 20, 2024
28dd486
refactor(gnomad): variant annotation removed, changed to variant inde…
DSuveges Jun 20, 2024
2a2dc2f
refactor: shuffling around gnomad logic
DSuveges Jun 20, 2024
876666c
Merge branch 'dev' of https://github.com/opentargets/gentropy into ds…
DSuveges Jun 20, 2024
c172641
fix: references in tests
DSuveges Jun 20, 2024
831c2fa
refactor: sorting out gnomad variant dag
DSuveges Jun 20, 2024
adfe73f
refactor: cleaning configs and tests
DSuveges Jun 21, 2024
2d7b121
docs(vep): adding datasource description
DSuveges Jun 21, 2024
0544cdc
test(vep): adding more test to the vep parser
DSuveges Jun 25, 2024
95da474
test(vep): tests are now running
DSuveges Jun 25, 2024
d8b8280
Merge branch 'dev' into ds_3333_new_variant_index
DSuveges Jun 25, 2024
caab094
fix: removing version suffix from pyproject and airflow config
DSuveges Jun 25, 2024
5efa2b2
Merge branch 'ds_3333_new_variant_index' of https://github.com/openta…
DSuveges Jun 25, 2024
d3a2016
fix: reverting DAGs - removing temporary modifications I added for te…
DSuveges Jun 25, 2024
841a83d
Merge branch 'dev' into ds_3333_new_variant_index
DSuveges Jun 26, 2024
a5a016b
Merge branch 'dev' into ds_3333_new_variant_index
DSuveges Jun 27, 2024
0339c25
fix: addressing reviewer comments
DSuveges Jun 27, 2024
d62e784
refactor: fiddling with variant index annotation logic
DSuveges Jun 27, 2024
f24062f
chore: addressing comments
DSuveges Jun 28, 2024
6c84d1e
Merge branch 'dev' into ds_3333_new_variant_index
DSuveges Jun 28, 2024
bdf38ae
fix: variant cross-ref snake case
DSuveges Jun 28, 2024
561a928
Merge branch 'dev' of https://github.com/opentargets/gentropy into ds…
DSuveges Jun 28, 2024
0c5d0cc
Merge branch 'ds_3333_new_variant_index' of https://github.com/openta…
DSuveges Jun 28, 2024
edf5536
Merge branch 'dev' into ds_3333_new_variant_index
DSuveges Jun 28, 2024
4899fdf
fix: correcting join strategy
DSuveges Jun 28, 2024
407eec6
Merge branch 'ds_3333_new_variant_index' of https://github.com/openta…
DSuveges Jun 28, 2024
5 changes: 4 additions & 1 deletion .vscode/settings.json
@@ -25,5 +25,8 @@
   "python.testing.pytestEnabled": true,
   "mypy-type-checker.severity": {
     "error": "Information"
-  }
+  },
+  "yaml.extension.recommendations": false,
+  "workbench.remoteIndicator.showExtensionRecommendations": false,
+  "extensions.ignoreRecommendations": true
 }
10 changes: 6 additions & 4 deletions config/datasets/ot_gcp.yaml
@@ -37,13 +37,13 @@ gnomad_public_bucket: gs://gcp-public-data--gnomad/release/
 ld_matrix_template: ${datasets.gnomad_public_bucket}/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.adj.ld.bm
 ld_index_raw_template: ${datasets.gnomad_public_bucket}/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.ld.variant_indices.ht
 liftover_ht_path: ${datasets.gnomad_public_bucket}/2.1.1/liftover_grch38/ht/genomes/gnomad.genomes.r2.1.1.sites.liftover_grch38.ht
-# variant_annotation
+# GnomAD variant set:
 gnomad_genomes_path: ${datasets.gnomad_public_bucket}4.0/ht/genomes/gnomad.genomes.v4.0.sites.ht/

 # Others
 chain_38_37: gs://hail-common/references/grch38_to_grch37.over.chain.gz
 chain_37_38: ${datasets.static_assets}/grch37_to_grch38.over.chain
-vep_consequences: ${datasets.static_assets}/vep_consequences.tsv
+vep_consequences: ${datasets.static_assets}/variant_consequence_to_score.tsv
 anderson: ${datasets.static_assets}/andersson2014/enhancer_tss_associations.bed
 javierre: ${datasets.static_assets}/javierre_2016_preprocessed
 jung: ${datasets.static_assets}/jung2019_pchic_tableS3.csv
@@ -55,7 +55,7 @@ finngen_finemapping_results_path: ${datasets.inputs}/Finngen_susie_finemapping_r
 finngen_finemapping_summaries_path: ${datasets.inputs}/Finngen_susie_finemapping_r10/Finngen_susie_credset_summary_r10.tsv

 # Dev output datasets
-variant_annotation: ${datasets.outputs}/variant_annotation
+gnomad_variants: ${datasets.outputs}/gnomad_variants
 study_locus: ${datasets.outputs}/study_locus
 summary_statistics: ${datasets.outputs}/summary_statistics
 study_locus_overlap: ${datasets.outputs}/study_locus_overlap
@@ -68,6 +68,8 @@ catalog_study_locus: ${datasets.study_locus}/catalog_study_locus
 from_sumstats_study_locus: ${datasets.study_locus}/from_sumstats
 from_sumstats_pics: ${datasets.credible_set}/from_sumstats

+vep_output_path: gs://genetics_etl_python_playground/vep/full_variant_index_vcf
+
 # ETL output datasets:
 l2g_gold_standard_curation: ${datasets.release_folder}/locus_to_gene_gold_standard.json
 l2g_model: ${datasets.release_folder}/locus_to_gene_model/classifier.skops
@@ -78,4 +80,4 @@ study_index: ${datasets.release_folder}/study_index
 variant_index: ${datasets.release_folder}/variant_index
 credible_set: ${datasets.release_folder}/credible_set
 gene_index: ${datasets.release_folder}/gene_index
-v2g: ${datasets.release_folder}/variant_to_gene
+variant_to_gene: ${datasets.release_folder}/variant_to_gene
19 changes: 0 additions & 19 deletions config/step/ot_variant_annotation.yaml

This file was deleted.

4 changes: 2 additions & 2 deletions config/step/ot_variant_index.yaml
@@ -1,6 +1,6 @@
 defaults:
   - variant_index

-variant_annotation_path: ${datasets.variant_annotation}
-credible_set_path: ${datasets.credible_set}
+vep_output_json_path: ${datasets.vep_output_path}
+gnomad_variant_annotations_path: ${datasets.gnomad_variants}
 variant_index_path: ${datasets.variant_index}
3 changes: 1 addition & 2 deletions config/step/ot_variant_to_gene.yaml
@@ -2,7 +2,6 @@ defaults:
   - variant_to_gene

 variant_index_path: ${datasets.variant_index}
-variant_annotation_path: ${datasets.variant_annotation}
 gene_index_path: ${datasets.gene_index}
 vep_consequences_path: ${datasets.vep_consequences}
 liftover_chain_file_path: ${datasets.chain_37_38}
@@ -11,4 +10,4 @@ interval_sources:
   javierre: ${datasets.javierre}
   jung: ${datasets.jung}
   thurman: ${datasets.thurman}
-v2g_path: ${datasets.v2g}
+v2g_path: ${datasets.variant_to_gene}
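
Reviewer note: the step configs above reference dataset keys through `${datasets.…}` placeholders, which look like OmegaConf/Hydra-style interpolation. The sketch below is a plain-Python stand-in for that mechanism, not the library itself, and it only illustrates why renaming `v2g` to `variant_to_gene` in the datasets config must be mirrored in every step config that refers to it. The bucket path in `release_folder` and the helper names `lookup` and `resolve` are assumptions made for illustration; they do not come from this PR.

```python
from __future__ import annotations

import re

# Hypothetical, trimmed-down config; the real release folder is not shown in this diff.
CONFIG = {
    "datasets": {
        "release_folder": "gs://example-bucket/release",  # assumption, not from the PR
        "variant_index": "${datasets.release_folder}/variant_index",
        "gene_index": "${datasets.release_folder}/gene_index",
        "variant_to_gene": "${datasets.release_folder}/variant_to_gene",
    },
    "step": {
        "variant_index_path": "${datasets.variant_index}",
        "gene_index_path": "${datasets.gene_index}",
        "v2g_path": "${datasets.variant_to_gene}",
    },
}

# Matches one ${dotted.key} placeholder.
_REF = re.compile(r"\$\{([^}]+)\}")


def lookup(config: dict, dotted_key: str) -> str:
    """Follow a dotted key such as 'datasets.variant_index' through nested dicts."""
    node = config
    for part in dotted_key.split("."):
        node = node[part]  # raises KeyError for a stale key such as 'datasets.v2g'
    return node


def resolve(config: dict, value: str) -> str:
    """Expand ${a.b} placeholders until none are left (no cycle detection)."""
    while (match := _REF.search(value)) is not None:
        replacement = resolve(config, lookup(config, match.group(1)))
        value = value[: match.start()] + replacement + value[match.end():]
    return value


resolved_step = {key: resolve(CONFIG, raw) for key, raw in CONFIG["step"].items()}
```

In this sketch a step config that still pointed at `${datasets.v2g}` would fail with a `KeyError` at resolution time, which is the kind of breakage the paired rename in `ot_variant_to_gene.yaml` avoids.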
Binary file added docs/assets/imgs/ensembl_logo.png
9 changes: 0 additions & 9 deletions docs/python_api/datasets/variant_annotation.md

This file was deleted.

3 changes: 2 additions & 1 deletion docs/python_api/datasources/_datasources.md
@@ -23,7 +23,8 @@ This section contains information about the data source harmonisation tools avai
 ## Variant annotation/validation

 1. [GnomAD](gnomad/_gnomad.md) v4.0
-1. GWAS catalog harmonisation pipeline [more info](https://www.ebi.ac.uk/gwas/docs/methods/summary-statistics#_harmonised_summary_statistics_data)
+2. GWAS catalog's [harmonisation pipeline](https://www.ebi.ac.uk/gwas/docs/methods/summary-statistics#_harmonised_summary_statistics_data)
+3. Ensembl's [Variant Effect Predictor](https://www.ensembl.org/info/docs/tools/vep/index.html)

 ## Linkage desiquilibrium

10 changes: 10 additions & 0 deletions docs/python_api/datasources/ensembl/_ensembl.md
@@ -0,0 +1,10 @@
+---
+title: Ensembl annotations
+---
+
+<div align="center">
+  <img width="100" height="100" src="../../../../assets/imgs/ensembl_logo.png">
+  <h1>Ensembl</h1>
+</div>
+
+[Ensembl](https://www.ensembl.org/index.html) provides a diverse set of genetic data Gentropy takes advantage of including gene set, and variant annotations.
@@ -0,0 +1,5 @@
+---
+title: Variant effector parser
+---
+
+::: gentropy.datasource.ensembl.vep_parser.VariantEffectPredictorParser
4 changes: 2 additions & 2 deletions docs/python_api/steps/ld_index.md
@@ -1,5 +1,5 @@
 ---
-title: ld_index
+title: GnomAD Linkage data ingestion
 ---

-::: gentropy.ld_index.LDIndexStep
+::: gentropy.gnomad_ingestion.LDIndexStep
4 changes: 2 additions & 2 deletions docs/python_api/steps/variant_annotation_step.md
@@ -1,5 +1,5 @@
 ---
-title: variant_annotation
+title: GnomAD variant data ingestion
 ---

-::: gentropy.variant_annotation.VariantAnnotationStep
+::: gentropy.gnomad_ingestion.GnomadVariantIndexStep
2 changes: 1 addition & 1 deletion docs/python_api/steps/variant_to_gene_step.md
@@ -2,4 +2,4 @@
 title: variant_to_gene
 ---

-::: gentropy.v2g.V2GStep
+::: gentropy.variant_to_gene.V2GStep
1 change: 1 addition & 0 deletions poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 3 additions & 3 deletions src/airflow/dags/gnomad_preprocess.py
@@ -1,4 +1,4 @@
-"""Airflow DAG for the Preprocess part of the pipeline."""
+"""Airflow DAG for the Preprocess GnomAD datasets - LD index and GnomAD variant set."""

 from __future__ import annotations

@@ -11,13 +11,13 @@

 ALL_STEPS = [
     "ot_ld_index",
-    "ot_variant_annotation",
+    "ot_gnomad_variants",
 ]


 with DAG(
     dag_id=Path(__file__).stem,
-    description="Open Targets Genetics — Preprocess",
+    description="Open Targets Genetics — GnomAD Preprocess",
     default_args=common.shared_dag_args,
     **common.shared_dag_kwargs,
 ):
43 changes: 43 additions & 0 deletions src/gentropy/assets/data/so_mappings.json
@@ -0,0 +1,43 @@
+{
+  "transcript_ablation": "SO_0001893",
+  "splice_acceptor_variant": "SO_0001574",
+  "splice_donor_variant": "SO_0001575",
+  "stop_gained": "SO_0001587",
+  "frameshift_variant": "SO_0001589",
+  "stop_lost": "SO_0001578",
+  "start_lost": "SO_0002012",
+  "transcript_amplification": "SO_0001889",
+  "feature_elongation": "SO_0001907",
+  "feature_truncation": "SO_0001906",
+  "inframe_insertion": "SO_0001821",
+  "inframe_deletion": "SO_0001822",
+  "missense_variant": "SO_0001583",
+  "protein_altering_variant": "SO_0001818",
+  "splice_donor_5th_base_variant": "SO_0001787",
+  "splice_region_variant": "SO_0001630",
+  "splice_donor_region_variant": "SO_0002170",
+  "splice_polypyrimidine_tract_variant": "SO_0002169",
+  "incomplete_terminal_codon_variant": "SO_0001626",
+  "start_retained_variant": "SO_0002019",
+  "stop_retained_variant": "SO_0001567",
+  "synonymous_variant": "SO_0001819",
+  "coding_sequence_variant": "SO_0001580",
+  "mature_miRNA_variant": "SO_0001620",
+  "5_prime_UTR_variant": "SO_0001623",
+  "3_prime_UTR_variant": "SO_0001624",
+  "non_coding_transcript_exon_variant": "SO_0001792",
+  "intron_variant": "SO_0001627",
+  "NMD_transcript_variant": "SO_0001621",
+  "non_coding_transcript_variant": "SO_0001619",
+  "coding_transcript_variant": "SO_0001968",
+  "upstream_gene_variant": "SO_0001631",
+  "downstream_gene_variant": "SO_0001632",
+  "TFBS_ablation": "SO_0001895",
+  "TFBS_amplification": "SO_0001892",
+  "TF_binding_site_variant": "SO_0001782",
+  "regulatory_region_ablation": "SO_0001894",
+  "regulatory_region_amplification": "SO_0001891",
+  "regulatory_region_variant": "SO_0001566",
+  "intergenic_variant": "SO_0001628",
+  "sequence_variant": "SO_0001060"
+}
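
Reviewer note: `so_mappings.json` is a flat lookup from a consequence term to a Sequence Ontology identifier. How the PR's own code consumes the file is not visible in this excerpt, so the following is only a minimal sketch of one plausible use: translating a list of consequence terms into SO identifiers. The three entries are copied from the file above; the function name `to_so_ids` and the choice to return `None` for unknown terms are assumptions made for illustration.

```python
from __future__ import annotations

import json

# A few entries copied verbatim from so_mappings.json; the full file has 41 entries.
SO_MAPPINGS_JSON = """
{
  "missense_variant": "SO_0001583",
  "intron_variant": "SO_0001627",
  "stop_gained": "SO_0001587"
}
"""

so_mappings: dict[str, str] = json.loads(SO_MAPPINGS_JSON)


def to_so_ids(consequence_terms: list[str]) -> list[str | None]:
    """Map consequence terms to SO identifiers; unknown terms map to None."""
    return [so_mappings.get(term) for term in consequence_terms]
```

Returning `None` rather than raising keeps unmapped terms visible to the caller, which may be useful if an annotation source emits a consequence term that the mapping file does not yet cover.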