feat(config): gnomAD steps configuration extraction and versioning (#620)

* feat: drop .coverage files from tracked files
* feat: new configuration variables for DAGs
* build(linting): resolved ruff warnings in make check
* build(airflow_config): extract additional input parameters for gnomad steps
* feat(step_config): extracted new input parameters from gnomad step configs

Configuration updates for:
- [x] ld_index_step
- [x] ld_variant_annotation_step

Both steps and their underlying classes take their default values from
the StepConfig data classes, while preserving the ability to set inputs
at each stage, in case the end user wants to use the step function API,
the step CLI or the datasource functions from the API.

* refactor(types): added a file for storing library types
* feat(version_engine): add version engine to infer datasource versions
* docs: added version engine to documentation

---------

Signed-off-by: Szymon Szyszkowski <[email protected]>
Co-authored-by: Szymon Szyszkowski <[email protected]>
project-defiant and Szymon Szyszkowski authored May 28, 2024
1 parent e355970 commit c2bfa18
Showing 18 changed files with 534 additions and 73 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -11,3 +11,4 @@ src/airflow/logs/*
!src/airflow/logs/.gitkeep
site/
.env
.coverage*
2 changes: 1 addition & 1 deletion Makefile
@@ -22,7 +22,7 @@ setup-dev: ## Setup development environment

check: ## Lint and format code
@echo "Linting API..."
@poetry run ruff src/gentropy .
@poetry run ruff check src/gentropy .
@echo "Linting docstrings..."
@poetry run pydoclint --config=pyproject.toml src
@poetry run pydoclint --config=pyproject.toml --skip-checking-short-docstrings=true tests
14 changes: 13 additions & 1 deletion config/datasets/ot_gcp.yaml
@@ -8,6 +8,7 @@ static_assets: gs://genetics_etl_python_playground/static_assets
outputs: gs://genetics_etl_python_playground/output/python_etl/parquet/${datasets.dev_version}

## Datasets:
# GWAS
gwas_catalog_dataset: gs://gwas_catalog_data
# Ingestion input files:
gwas_catalog_associations: ${datasets.gwas_catalog_dataset}/curated_inputs/gwas_catalog_associations_ontology_annotated.tsv
@@ -29,7 +30,18 @@ gwas_catalog_study_index: ${datasets.gwas_catalog_dataset}/study_index
gwas_catalog_study_locus_folder: ${datasets.gwas_catalog_dataset}/study_locus_datasets
gwas_catalog_credible_set_folder: ${datasets.gwas_catalog_dataset}/credible_set_datasets

# Input datasets
# GnomAD
gnomad_public_bucket: gs://gcp-public-data--gnomad/release
# LD generation
# Templates require a {POP} placeholder, which is expanded into one path per population
ld_matrix_template: ${datasets.gnomad_public_bucket}/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.adj.ld.bm
ld_index_raw_template: ${datasets.gnomad_public_bucket}/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.ld.variant_indices.ht
liftover_ht_path: ${datasets.gnomad_public_bucket}/2.1.1/liftover_grch38/ht/genomes/gnomad.genomes.r2.1.1.sites.liftover_grch38.ht
# variant_annotation
gnomad_genomes_path: ${datasets.gnomad_public_bucket}/4.0/ht/genomes/gnomad.genomes.v4.0.sites.ht/

# Others
chain_38_37: gs://hail-common/references/grch38_to_grch37.over.chain.gz
chain_37_38: ${datasets.static_assets}/grch37_to_grch38.over.chain
vep_consequences: ${datasets.static_assets}/vep_consequences.tsv
anderson: ${datasets.static_assets}/andersson2014/enhancer_tss_associations.bed
16 changes: 16 additions & 0 deletions config/step/ot_ld_index.yaml
@@ -2,3 +2,19 @@ defaults:
- ld_index

ld_index_out: ${datasets.ld_index}
ld_matrix_template: ${datasets.ld_matrix_template}
ld_index_raw_template: ${datasets.ld_index_raw_template}
grch37_to_grch38_chain_path: ${datasets.chain_37_38}
liftover_ht_path: ${datasets.liftover_ht_path}
ld_populations:
- afr # African-American
- amr # American Admixed/Latino
- asj # Ashkenazi Jewish
- eas # East Asian
- est # Estonian
- fin # Finnish
- nfe # Non-Finnish European
- nwe # Northwestern European
- seu # Southeastern European
# The gnomAD version will be inferred from ld_matrix_template and appended to ld_index_out.
use_version_from_input: true
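
The `{POP}` placeholder in the two LD templates is presumably filled in once per entry of `ld_populations`. A minimal sketch of that expansion, assuming plain `str.format` substitution; the helper name `expand_population_paths` is hypothetical and not part of gentropy:

```python
def expand_population_paths(template: str, populations: list[str]) -> dict[str, str]:
    """Return one concrete path per population by filling the {POP} placeholder."""
    if "{POP}" not in template:
        raise ValueError(f"Template has no {{POP}} placeholder: {template}")
    return {pop: template.format(POP=pop) for pop in populations}


# Template copied from the dataset config above.
ld_matrix_template = (
    "gs://gcp-public-data--gnomad/release/2.1.1/ld/"
    "gnomad.genomes.r2.1.1.{POP}.common.adj.ld.bm"
)
paths = expand_population_paths(ld_matrix_template, ["afr", "nfe"])
print(paths["nfe"])
```

Failing early on a template without `{POP}` avoids silently reading the same path for every population.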
15 changes: 15 additions & 0 deletions config/step/ot_variant_annotation.yaml
@@ -2,3 +2,18 @@ defaults:
- variant_annotation

variant_annotation_path: ${datasets.variant_annotation}
gnomad_genomes_path: ${datasets.gnomad_genomes_path}
chain_38_37: ${datasets.chain_38_37}
gnomad_variant_populations:
- afr # African-American
- amr # American Admixed/Latino
- ami # Amish ancestry
- asj # Ashkenazi Jewish
- eas # East Asian
- fin # Finnish
- nfe # Non-Finnish European
- mid # Middle Eastern
- sas # South Asian
- remaining # Other
# The gnomAD version will be inferred from gnomad_genomes_path and appended to variant_annotation_path.
use_version_from_input: true
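
When `use_version_from_input` is true, the step is expected to derive the gnomAD release from its input path and append it to the output path. A rough standalone sketch of that behaviour, assuming a simple `major.minor[.patch]` pattern; `append_version` and the `gs://example-bucket` output location are illustrative names only, not gentropy API:

```python
import re


def append_version(input_path: str, output_path: str) -> str:
    """Append the version found in input_path to output_path unless already present."""
    found = re.search(r"\d+\.\d+(?:\.\d+)?", input_path)
    if found is None:
        raise ValueError(f"No version found in {input_path}")
    version = found.group(0)
    if version in output_path:
        # Output path is already versioned, leave it untouched.
        return output_path
    return f"{output_path.rstrip('/')}/{version}"


gnomad_genomes_path = (
    "gs://gcp-public-data--gnomad/release/4.0/ht/genomes/gnomad.genomes.v4.0.sites.ht/"
)
print(append_version(gnomad_genomes_path, "gs://example-bucket/variant_annotation"))
```

Because an already versioned output path is returned unchanged, applying the function twice gives the same result as applying it once.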
1 change: 1 addition & 0 deletions docs/python_api/_python_api.md
@@ -10,3 +10,4 @@ The overall architecture of the package distinguishes between:
- [**Datasets**](datasets/_datasets.md): data model
- [**Methods**](methods/_methods.md): statistical analysis tools
- [**Steps**](steps/_steps.md): pipeline steps
- [**Common**](common/_common.md): Common classes
8 changes: 8 additions & 0 deletions docs/python_api/common/_common.md
@@ -0,0 +1,8 @@
---
title: Common
---

Common utilities used in the gentropy package.

- [**Version Engine**](version_engine.md): class to extract the version from datasource input paths
- [**Types**](types.md): Literal types used in gentropy
8 changes: 8 additions & 0 deletions docs/python_api/common/types.md
@@ -0,0 +1,8 @@
---
title: Literal Types
---

:::gentropy.common.types
:::gentropy.common.types.LD_Population
:::gentropy.common.types.VariantPopulation
:::gentropy.common.types.DataSourceType
12 changes: 12 additions & 0 deletions docs/python_api/common/version_engine.md
@@ -0,0 +1,12 @@
---
title: VersionEngine
---

**VersionEngine**:

The version engine lets you register a datasource-specific version seeker class that retrieves the version of a datasource used as input to gentropy steps. It is currently implemented only for the gnomAD datasource.

The class can then be used to automate the versioning of output directories.

:::gentropy.common.version_engine.VersionEngine
:::gentropy.common.version_engine.GnomADVersionSeeker
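
The registration idea described above can be sketched without gentropy installed. The following is an illustrative, self-contained approximation of the pattern (one seeker class per datasource, looked up through a mapping); the names `VersionSeeker`, `GnomadLikeSeeker`, `SEEKERS` and `seek` are assumptions made for this example and differ from the real classes:

```python
import re
from abc import ABC, abstractmethod


class VersionSeeker(ABC):
    """Interface every datasource-specific seeker implements."""

    @staticmethod
    @abstractmethod
    def seek_version(text: str) -> str:
        """Return the version found in text or raise ValueError."""


class GnomadLikeSeeker(VersionSeeker):
    """Finds major.minor[.patch] versions such as 2.1.1 or 4.0."""

    @staticmethod
    def seek_version(text: str) -> str:
        found = re.search(r"\d+\.\d+(?:\.\d+)?", text)
        if found is None:
            raise ValueError(f"No version found in: {text}")
        return found.group(0)


# Registry mapping a datasource name to its seeker; new sources are added here.
SEEKERS: dict[str, VersionSeeker] = {"gnomad": GnomadLikeSeeker()}


def seek(datasource: str, text: str) -> str:
    """Dispatch to the seeker registered for the datasource."""
    if datasource not in SEEKERS:
        raise ValueError(f"Invalid datasource {datasource}")
    return SEEKERS[datasource].seek_version(text)


print(seek("gnomad", "gs://gcp-public-data--gnomad/release/2.1.1/ld/"))
```

Supporting another datasource then only requires a new subclass and one more entry in the mapping.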
6 changes: 3 additions & 3 deletions pyproject.toml
@@ -247,15 +247,15 @@ ignore = [

]

[tool.ruff.per-file-ignores]
[tool.ruff.lint.per-file-ignores]
"__init__.py" = ["E402"]
"path/to/file.py" = ["E402"]
"**/{tests,docs,tools}/*" = ["E402"]

[tool.ruff.flake8-quotes]
[tool.ruff.lint.flake8-quotes]
docstring-quotes = "double"

[tool.ruff.pydocstyle]
[tool.ruff.lint.pydocstyle]
convention = "google"

[tool.pydoclint]
18 changes: 18 additions & 0 deletions src/gentropy/common/types.py
@@ -0,0 +1,18 @@
"""Types and type aliases used in the package."""

from typing import Literal

LD_Population = Literal["afr", "amr", "asj", "eas", "est", "fin", "nfe", "nwe", "seu"]

VariantPopulation = Literal[
    "afr", "amr", "ami", "asj", "eas", "fin", "nfe", "mid", "sas", "remaining"
]
DataSourceType = Literal[
    "gnomad",
    "finngen",
    "gwas_catalog",
    "eqtl_catalog",
    "ukbiobank",
    "open_targets",
    "intervals",
]
154 changes: 154 additions & 0 deletions src/gentropy/common/version_engine.py
@@ -0,0 +1,154 @@
"""Mechanism to seek version from specific datasource."""

from __future__ import annotations

import re
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Callable

from gentropy.common.types import DataSourceType


class VersionEngine:
    """Seek version from the datasource."""

    def __init__(self, datasource: DataSourceType) -> None:
        """Initialize VersionEngine.

        Args:
            datasource (DataSourceType): datasource to seek the version from
        """
        self.datasource = datasource

    @staticmethod
    def version_seekers() -> dict[DataSourceType, DatasourceVersionSeeker]:
        """List version seekers.

        Returns:
            dict[DataSourceType, DatasourceVersionSeeker]: mapping of data sources to their version seekers.
        """
        return {
            "gnomad": GnomADVersionSeeker(),
        }

    def seek(self, text: str | Path) -> str:
        """Interface for inferring the version from text by using the registered data source version inference method.

        Args:
            text (str | Path): text to seek version from

        Returns:
            str: inferred version

        Raises:
            TypeError: if text is neither a string nor a Path

        Examples:
            >>> VersionEngine("gnomad").seek("gs://gcp-public-data--gnomad/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz")
            '2.1.1'
        """
        match text:
            case Path() | str():
                text = str(text)
            case _:
                msg = f"Can not find version in {text}"
                raise TypeError(msg)
        infer_method = self._get_version_seek_method()
        return infer_method(text)

    def _get_version_seek_method(self) -> Callable[[str], str]:
        """Method that gets the version seeker for the datasource.

        Returns:
            Callable[[str], str]: Method to seek version based on the initialized datasource

        Raises:
            ValueError: if datasource is not registered in the list of version seekers
        """
        if self.datasource not in self.version_seekers():
            raise ValueError(f"Invalid datasource {self.datasource}")
        return self.version_seekers()[self.datasource].seek_version

    def amend_version(
        self, analysis_input_path: str | Path, analysis_output_path: str | Path
    ) -> str:
        """Amend version to the analysis output path if it is not already present.

        Path can be a gs:// path or a Path object, absolute or relative.
        The analysis_input_path has to contain the version number.
        If the analysis_output_path already contains the version inferred from the input path,
        then it will not be appended.

        Args:
            analysis_input_path (str | Path): step input path
            analysis_output_path (str | Path): step output path

        Returns:
            str: Path with the amended version, does not return Path object!

        Examples:
            >>> VersionEngine("gnomad").amend_version("gs://gcp-public-data--gnomad/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz", "/some/path/without/version")
            '/some/path/without/version/2.1.1'
        """
        version = self.seek(analysis_input_path)
        output_path = str(analysis_output_path)
        if version in output_path:
            return output_path
        if output_path.endswith("/"):
            return f"{analysis_output_path}{version}"
        return f"{analysis_output_path}/{version}"


class DatasourceVersionSeeker(ABC):
    """Interface for datasource version seeker."""

    @staticmethod
    @abstractmethod
    def seek_version(text: str) -> str:
        """Seek version from text. Implement this method in the subclass.

        Args:
            text (str): text to seek version from

        Returns:
            str: version found in the text

        Raises:
            ValueError: if no version can be found
        """
        raise NotImplementedError


class GnomADVersionSeeker(DatasourceVersionSeeker):
    """Seek version from GnomAD datasource."""

    @staticmethod
    def seek_version(text: str) -> str:
        """Seek GnomAD version from provided text by using regex.

        Up to three numeric components (major.minor.patch) are allowed in the version number.
        Historically gnomAD version numbers have been in the format
        2.1.1, 3.1, etc. as of 2024-05. GnomAD versions can be found by
        listing `gs://gcp-public-data--gnomad/release/*/*/*`.

        Args:
            text (str): text to seek version from

        Raises:
            ValueError: if no version can be found

        Returns:
            str: version found in the text

        Examples:
            >>> GnomADVersionSeeker.seek_version("gs://gcp-public-data--gnomad/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz")
            '2.1.1'
        """
        # major.minor with an optional .patch component; a leading "v" is skipped
        # and a trailing dot is never captured as part of the version.
        result = re.search(r"v?((\d+)\.(\d+)(?:\.(\d+))?)", text)
        match result:
            case None:
                raise ValueError(f"No GnomAD version found in provided text: {text}")
            case _:
                return result.group(1)
41 changes: 39 additions & 2 deletions src/gentropy/config.py
@@ -1,4 +1,5 @@
"""Interface for application configuration."""

import os
from dataclasses import dataclass, field
from typing import Any, Dict, List
@@ -157,8 +158,28 @@ class LDIndexConfig(StepConfig):
"start_hail": True,
}
)
min_r2: float = 0.5
ld_index_out: str = MISSING
min_r2: float = 0.5
ld_matrix_template: str = "gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.adj.ld.bm"
ld_index_raw_template: str = "gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.ld.variant_indices.ht"
liftover_ht_path: str = "gs://gcp-public-data--gnomad/release/2.1.1/liftover_grch38/ht/genomes/gnomad.genomes.r2.1.1.sites.liftover_grch38.ht"
grch37_to_grch38_chain_path: str = (
"gs://hail-common/references/grch37_to_grch38.over.chain.gz"
)
ld_populations: list[str] = field(
default_factory=lambda: [
"afr", # African-American
"amr", # American Admixed/Latino
"asj", # Ashkenazi Jewish
"eas", # East Asian
"est", # Estonian
"fin", # Finnish
"nfe", # Non-Finnish European
"nwe", # Northwestern European
"seu", # Southeastern European
]
)
use_version_from_input: bool = False
_target_: str = "gentropy.ld_index.LDIndexStep"


@@ -270,6 +291,23 @@ class VariantAnnotationConfig(StepConfig):
}
)
variant_annotation_path: str = MISSING
gnomad_genomes_path: str = "gs://gcp-public-data--gnomad/release/4.0/ht/genomes/gnomad.genomes.v4.0.sites.ht/"
chain_38_37: str = "gs://hail-common/references/grch38_to_grch37.over.chain.gz"
gnomad_variant_populations: list[str] = field(
default_factory=lambda: [
"afr", # African-American
"amr", # American Admixed/Latino
"ami", # Amish ancestry
"asj", # Ashkenazi Jewish
"eas", # East Asian
"fin", # Finnish
"nfe", # Non-Finnish European
"mid", # Middle Eastern
"sas", # South Asian
"remaining", # Other
]
)
use_version_from_input: bool = False
_target_: str = "gentropy.variant_annotation.VariantAnnotationStep"
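
The commit message notes that the steps take their defaults from the `StepConfig` data classes while still allowing each input to be overridden. A stripped-down sketch of that pattern using only the standard library; `ExampleAnnotationConfig` is a hypothetical stand-in that omits the `MISSING` and `_target_` fields used by the real configs:

```python
from dataclasses import dataclass, field


@dataclass
class ExampleAnnotationConfig:
    """Hypothetical config: every field has a default that callers may override."""

    gnomad_genomes_path: str = (
        "gs://gcp-public-data--gnomad/release/4.0/ht/genomes/gnomad.genomes.v4.0.sites.ht/"
    )
    # Mutable defaults go through default_factory so instances do not share one list.
    populations: list[str] = field(default_factory=lambda: ["afr", "amr", "nfe"])
    use_version_from_input: bool = False


default_cfg = ExampleAnnotationConfig()
custom_cfg = ExampleAnnotationConfig(populations=["fin"], use_version_from_input=True)
default_cfg.populations.append("eas")  # affects only this instance
print(ExampleAnnotationConfig().populations)
```

Using `default_factory` for the population list is what keeps a change made on one config instance from leaking into later ones.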


@@ -358,7 +396,6 @@ class FinemapperConfig(StepConfig):
imputed_r2_threshold: float = MISSING
ld_score_threshold: float = MISSING
output_path_log: str = MISSING
_target_: str = "gentropy.susie_finemapper.SusieFineMapperStep"


@dataclass