Commit

feat(datasets): adding new variant index model (#641)
* feat(variant annotation): new variant annotation schema + logic to extract from VEP

* fix: typehints in function

* refactor(variant annotation): migrating methods to the new schema

* chore: pre-commit auto fixes [...]

* refactor(variant index): sorting out new variant index dataset

* chore: pre-commit auto fixes [...]

* feature(vep): adding predictors to vep transcript object

* fix(schema): fixing schema missing fields

* fix(schema): fixing schema missing fields

* fix(schema): fixing schema missing fields

* fix(schema): fixing schema missing fields

* chore: pre-commit auto fixes [...]

* fix(annotation): array union under condition

* fix: merging dbxref objects

* feat(variants): updating variants to make more robust

* feat: migrating methods to new variant index

* adjusting variant index methods

* some updates

* rename v2g to variant to gene

* chore: pre-commit auto fixes [...]

* adding test

* chore: pre-commit auto fixes [...]

* fix(precommit): json file needed to rename to jsonl

* fix(precommit): removing steps depending on old data model

* fix(conftest): fixing variant index mock generation

* fix: typo in package import

* fix: sorting out conftest

* refactor(gwas ingest): Updating GnomAD handling

* refactor(gnomad): variant annotation removed, changed to variant index, steps updated

* refactor: shuffling around gnomad logic

* fix: references in tests

* refactor: sorting out gnomad variant dag

* refactor: cleaning configs and tests

* docs(vep): adding datasource description

* test(vep): adding more test to the vep parser

* test(vep): tests are now running

* fix: removing version suffix from pyproject and airflow config

* fix: reverting DAGs - removing temporary modifications I added for testing

* fix: addressing reviewer comments

* refactor: fiddling with variant index annotation logic

* chore: addressing comments

* fix: variant cross-ref snake case

* fix: correcting join strategy

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
DSuveges and pre-commit-ci[bot] authored Jun 30, 2024
1 parent b3e89bb commit f79c789
Showing 42 changed files with 2,239 additions and 730 deletions.
5 changes: 4 additions & 1 deletion .vscode/settings.json
Original file line number Diff line number Diff line change
@@ -25,5 +25,8 @@
   "python.testing.pytestEnabled": true,
   "mypy-type-checker.severity": {
     "error": "Information"
-  }
+  },
+  "yaml.extension.recommendations": false,
+  "workbench.remoteIndicator.showExtensionRecommendations": false,
+  "extensions.ignoreRecommendations": true
 }
10 changes: 6 additions & 4 deletions config/datasets/ot_gcp.yaml
@@ -37,13 +37,13 @@ gnomad_public_bucket: gs://gcp-public-data--gnomad/release/
 ld_matrix_template: ${datasets.gnomad_public_bucket}/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.adj.ld.bm
 ld_index_raw_template: ${datasets.gnomad_public_bucket}/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.ld.variant_indices.ht
 liftover_ht_path: ${datasets.gnomad_public_bucket}/2.1.1/liftover_grch38/ht/genomes/gnomad.genomes.r2.1.1.sites.liftover_grch38.ht
-# variant_annotation
+# GnomAD variant set:
 gnomad_genomes_path: ${datasets.gnomad_public_bucket}4.0/ht/genomes/gnomad.genomes.v4.0.sites.ht/

 # Others
 chain_38_37: gs://hail-common/references/grch38_to_grch37.over.chain.gz
 chain_37_38: ${datasets.static_assets}/grch37_to_grch38.over.chain
-vep_consequences: ${datasets.static_assets}/vep_consequences.tsv
+vep_consequences: ${datasets.static_assets}/variant_consequence_to_score.tsv
 anderson: ${datasets.static_assets}/andersson2014/enhancer_tss_associations.bed
 javierre: ${datasets.static_assets}/javierre_2016_preprocessed
 jung: ${datasets.static_assets}/jung2019_pchic_tableS3.csv
@@ -55,7 +55,7 @@ finngen_finemapping_results_path: ${datasets.inputs}/Finngen_susie_finemapping_r
 finngen_finemapping_summaries_path: ${datasets.inputs}/Finngen_susie_finemapping_r10/Finngen_susie_credset_summary_r10.tsv

 # Dev output datasets
-variant_annotation: ${datasets.outputs}/variant_annotation
+gnomad_variants: ${datasets.outputs}/gnomad_variants
 study_locus: ${datasets.outputs}/study_locus
 summary_statistics: ${datasets.outputs}/summary_statistics
 study_locus_overlap: ${datasets.outputs}/study_locus_overlap
@@ -68,6 +68,8 @@ catalog_study_locus: ${datasets.study_locus}/catalog_study_locus
 from_sumstats_study_locus: ${datasets.study_locus}/from_sumstats
 from_sumstats_pics: ${datasets.credible_set}/from_sumstats

+vep_output_path: gs://genetics_etl_python_playground/vep/full_variant_index_vcf
+
 # ETL output datasets:
 l2g_gold_standard_curation: ${datasets.release_folder}/locus_to_gene_gold_standard.json
 l2g_model: ${datasets.release_folder}/locus_to_gene_model/classifier.skops
@@ -78,4 +80,4 @@ study_index: ${datasets.release_folder}/study_index
 variant_index: ${datasets.release_folder}/variant_index
 credible_set: ${datasets.release_folder}/credible_set
 gene_index: ${datasets.release_folder}/gene_index
-v2g: ${datasets.release_folder}/variant_to_gene
+variant_to_gene: ${datasets.release_folder}/variant_to_gene
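
The `ld_matrix_template` and `ld_index_raw_template` entries in this file carry a `{POP}` placeholder. As a minimal sketch, such a template could be expanded per population with plain `str.format`. The population labels and the helper name `expand_template` are illustrative assumptions and are not taken from the commit.

```python
# File-name part of ld_matrix_template as it appears in the config.
LD_MATRIX_TEMPLATE = "gnomad.genomes.r2.1.1.{POP}.common.adj.ld.bm"

# Assumption: example population labels, chosen only for illustration.
POPULATIONS = ["afr", "nfe"]


def expand_template(template: str, populations: list[str]) -> dict[str, str]:
    """Return one concrete file name per population by filling {POP}."""
    return {pop: template.format(POP=pop) for pop in populations}


ld_matrix_paths = expand_template(LD_MATRIX_TEMPLATE, POPULATIONS)
```

Keeping the placeholder in the config and filling it at run time means a new population only needs a new label, not a new config key.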
19 changes: 0 additions & 19 deletions config/step/ot_variant_annotation.yaml

This file was deleted.

4 changes: 2 additions & 2 deletions config/step/ot_variant_index.yaml
@@ -1,6 +1,6 @@
 defaults:
   - variant_index

-variant_annotation_path: ${datasets.variant_annotation}
-credible_set_path: ${datasets.credible_set}
+vep_output_json_path: ${datasets.vep_output_path}
+gnomad_variant_annotations_path: ${datasets.gnomad_variants}
 variant_index_path: ${datasets.variant_index}
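
The step config refers to dataset paths through `${datasets.<key>}` references. The sketch below illustrates how such references resolve, using only the standard library. It is not the configuration framework's actual resolver. The bucket paths for `release_folder` and `outputs` are placeholders, and the helper names `lookup` and `resolve` are hypothetical.

```python
import re

# Trimmed-down, partly hypothetical copy of the values shown in the diff.
config = {
    "datasets": {
        "release_folder": "gs://example-bucket/releases/24.06",  # placeholder
        "outputs": "gs://example-bucket/outputs",  # placeholder
        "variant_index": "${datasets.release_folder}/variant_index",
        "gnomad_variants": "${datasets.outputs}/gnomad_variants",
        "vep_output_path": "gs://genetics_etl_python_playground/vep/full_variant_index_vcf",
    },
    "step": {
        "vep_output_json_path": "${datasets.vep_output_path}",
        "gnomad_variant_annotations_path": "${datasets.gnomad_variants}",
        "variant_index_path": "${datasets.variant_index}",
    },
}

_REFERENCE = re.compile(r"\$\{([^}]+)\}")


def lookup(cfg: dict, dotted_key: str) -> str:
    """Walk a nested dict following a dotted key such as 'datasets.outputs'."""
    node = cfg
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve(cfg: dict, value: str) -> str:
    """Substitute ${...} references until none remain (no cycle detection)."""
    while True:
        match = _REFERENCE.search(value)
        if match is None:
            return value
        value = value[: match.start()] + lookup(cfg, match.group(1)) + value[match.end():]


resolved_step = {key: resolve(config, raw) for key, raw in config["step"].items()}
```

Because the references are nested, renaming a dataset key (as this commit does for `v2g` and `variant_annotation`) requires updating every step config that points at the old key.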
3 changes: 1 addition & 2 deletions config/step/ot_variant_to_gene.yaml
@@ -2,7 +2,6 @@ defaults:
   - variant_to_gene

 variant_index_path: ${datasets.variant_index}
-variant_annotation_path: ${datasets.variant_annotation}
 gene_index_path: ${datasets.gene_index}
 vep_consequences_path: ${datasets.vep_consequences}
 liftover_chain_file_path: ${datasets.chain_37_38}
@@ -11,4 +10,4 @@ interval_sources:
   javierre: ${datasets.javierre}
   jung: ${datasets.jung}
   thurman: ${datasets.thurman}
-v2g_path: ${datasets.v2g}
+v2g_path: ${datasets.variant_to_gene}
Binary file added docs/assets/imgs/ensembl_logo.png
9 changes: 0 additions & 9 deletions docs/python_api/datasets/variant_annotation.md

This file was deleted.

3 changes: 2 additions & 1 deletion docs/python_api/datasources/_datasources.md
@@ -23,7 +23,8 @@ This section contains information about the data source harmonisation tools avai
 ## Variant annotation/validation

 1. [GnomAD](gnomad/_gnomad.md) v4.0
-1. GWAS catalog harmonisation pipeline [more info](https://www.ebi.ac.uk/gwas/docs/methods/summary-statistics#_harmonised_summary_statistics_data)
+2. GWAS catalog's [harmonisation pipeline](https://www.ebi.ac.uk/gwas/docs/methods/summary-statistics#_harmonised_summary_statistics_data)
+3. Ensembl's [Variant Effect Predictor](https://www.ensembl.org/info/docs/tools/vep/index.html)

 ## Linkage disequilibrium

10 changes: 10 additions & 0 deletions docs/python_api/datasources/ensembl/_ensembl.md
@@ -0,0 +1,10 @@
---
title: Ensembl annotations
---

<div align="center">
<img width="100" height="100" src="../../../../assets/imgs/ensembl_logo.png">
<h1>Ensembl</h1>
</div>

[Ensembl](https://www.ensembl.org/index.html) provides a diverse set of genetic data that Gentropy takes advantage of, including gene sets and variant annotations.
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
---
title: Variant effect predictor parser
---

::: gentropy.datasource.ensembl.vep_parser.VariantEffectPredictorParser
4 changes: 2 additions & 2 deletions docs/python_api/steps/ld_index.md
@@ -1,5 +1,5 @@
 ---
-title: ld_index
+title: GnomAD Linkage data ingestion
 ---

-::: gentropy.ld_index.LDIndexStep
+::: gentropy.gnomad_ingestion.LDIndexStep
4 changes: 2 additions & 2 deletions docs/python_api/steps/variant_annotation_step.md
@@ -1,5 +1,5 @@
 ---
-title: variant_annotation
+title: GnomAD variant data ingestion
 ---

-::: gentropy.variant_annotation.VariantAnnotationStep
+::: gentropy.gnomad_ingestion.GnomadVariantIndexStep
2 changes: 1 addition & 1 deletion docs/python_api/steps/variant_to_gene_step.md
@@ -2,4 +2,4 @@
 title: variant_to_gene
 ---

-::: gentropy.v2g.V2GStep
+::: gentropy.variant_to_gene.V2GStep
1 change: 1 addition & 0 deletions poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 3 additions & 3 deletions src/airflow/dags/gnomad_preprocess.py
@@ -1,4 +1,4 @@
-"""Airflow DAG for the Preprocess part of the pipeline."""
+"""Airflow DAG for the Preprocess GnomAD datasets - LD index and GnomAD variant set."""

 from __future__ import annotations

@@ -11,13 +11,13 @@

 ALL_STEPS = [
     "ot_ld_index",
-    "ot_variant_annotation",
+    "ot_gnomad_variants",
 ]


 with DAG(
     dag_id=Path(__file__).stem,
-    description="Open Targets Genetics — Preprocess",
+    description="Open Targets Genetics — GnomAD Preprocess",
     default_args=common.shared_dag_args,
     **common.shared_dag_kwargs,
 ):
43 changes: 43 additions & 0 deletions src/gentropy/assets/data/so_mappings.json
@@ -0,0 +1,43 @@
{
"transcript_ablation": "SO_0001893",
"splice_acceptor_variant": "SO_0001574",
"splice_donor_variant": "SO_0001575",
"stop_gained": "SO_0001587",
"frameshift_variant": "SO_0001589",
"stop_lost": "SO_0001578",
"start_lost": "SO_0002012",
"transcript_amplification": "SO_0001889",
"feature_elongation": "SO_0001907",
"feature_truncation": "SO_0001906",
"inframe_insertion": "SO_0001821",
"inframe_deletion": "SO_0001822",
"missense_variant": "SO_0001583",
"protein_altering_variant": "SO_0001818",
"splice_donor_5th_base_variant": "SO_0001787",
"splice_region_variant": "SO_0001630",
"splice_donor_region_variant": "SO_0002170",
"splice_polypyrimidine_tract_variant": "SO_0002169",
"incomplete_terminal_codon_variant": "SO_0001626",
"start_retained_variant": "SO_0002019",
"stop_retained_variant": "SO_0001567",
"synonymous_variant": "SO_0001819",
"coding_sequence_variant": "SO_0001580",
"mature_miRNA_variant": "SO_0001620",
"5_prime_UTR_variant": "SO_0001623",
"3_prime_UTR_variant": "SO_0001624",
"non_coding_transcript_exon_variant": "SO_0001792",
"intron_variant": "SO_0001627",
"NMD_transcript_variant": "SO_0001621",
"non_coding_transcript_variant": "SO_0001619",
"coding_transcript_variant": "SO_0001968",
"upstream_gene_variant": "SO_0001631",
"downstream_gene_variant": "SO_0001632",
"TFBS_ablation": "SO_0001895",
"TFBS_amplification": "SO_0001892",
"TF_binding_site_variant": "SO_0001782",
"regulatory_region_ablation": "SO_0001894",
"regulatory_region_amplification": "SO_0001891",
"regulatory_region_variant": "SO_0001566",
"intergenic_variant": "SO_0001628",
"sequence_variant": "SO_0001060"
}
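
A mapping like `so_mappings.json` lets consequence terms be translated into Sequence Ontology identifiers. The sketch below shows one way a list of terms could be looked up. It assumes the terms arrive as a list of strings and uses only a four-entry subset of the file. The helper name `consequence_terms_to_so_ids` is hypothetical and is not the parser's actual method.

```python
import json

# Subset of so_mappings.json, inlined so the example is self-contained.
SO_MAPPINGS_SUBSET = json.loads(
    """
    {
      "stop_gained": "SO_0001587",
      "missense_variant": "SO_0001583",
      "splice_region_variant": "SO_0001630",
      "intron_variant": "SO_0001627"
    }
    """
)


def consequence_terms_to_so_ids(terms, mappings):
    """Map each consequence term to its SO id; unknown terms become None."""
    return [mappings.get(term) for term in terms]


so_ids = consequence_terms_to_so_ids(
    ["missense_variant", "splice_region_variant", "not_a_real_term"],
    SO_MAPPINGS_SUBSET,
)
```

Returning `None` for unmapped terms keeps the output aligned with the input list, so a caller can spot terms missing from the mapping.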