Commit
feat(config): gnomAD steps configuration extraction and versioning (#620)

* feat: drop .coverage files from tracked files
* feat: new configuration variables for DAGs
* build(linting): resolved ruff warnings in make check
* build(airflow_config): extract additional input parameters for gnomad steps
* feat(step_config): extracted new input parameters from gnomad step configs

  Configuration updates for:
  - [x] ld_index_step
  - [x] ld_variant_annotation_step

  Both steps and their underlying classes take their defaults from the StepConfig data classes, while preserving the ability to set inputs at each stage, in case the end user wants to use the step function API, the step CLI, or a datasource function from the API.
* refactor(types): added a file for storing library types
* feat(version_engine): add version engine to infer datasource versions
* docs: added version engine to documentation

---------

Signed-off-by: Szymon Szyszkowski <[email protected]>
Co-authored-by: Szymon Szyszkowski <[email protected]>
1 parent e355970 commit c2bfa18
Showing 18 changed files with 534 additions and 73 deletions.
@@ -11,3 +11,4 @@ src/airflow/logs/*
!src/airflow/logs/.gitkeep
site/
.env
.coverage*
@@ -0,0 +1,8 @@
---
title: Common
---

Common utilities used in the gentropy package.

- [**Version Engine**](version_engine.md): class to extract the version from datasource input paths
- [**Types**](types.md): Literal types used in the gentropy package
@@ -0,0 +1,8 @@
---
title: Literal Types
---

:::gentropy.common.types
:::gentropy.common.types.LD_Population
:::gentropy.common.types.VariantPopulation
:::gentropy.common.types.DataSourceType
@@ -0,0 +1,12 @@
---
title: VersionEngine
---

**VersionEngine**:

The version engine lets you register a datasource-specific version seeker class that retrieves the version of the datasource used as input to gentropy steps. It is currently implemented only for the gnomAD datasource.

This class can then be used to automate the versioning of output directories.

:::gentropy.common.version_engine.VersionEngine
:::gentropy.common.version_engine.GnomADVersionSeeker
@@ -0,0 +1,18 @@
"""Types and type aliases used in the package."""

from typing import Literal

LD_Population = Literal["afr", "amr", "asj", "eas", "est", "fin", "nfe", "nwe", "seu"]

VariantPopulation = Literal[
    "afr", "amr", "ami", "asj", "eas", "fin", "nfe", "mid", "sas", "remaining"
]
DataSourceType = Literal[
    "gnomad",
    "fingenn",
    "gwas_catalog",
    "eqtl_catalog",
    "ukbiobank",
    "open_targets",
    "intervals",
]
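
`Literal` aliases like the ones above are only checked by static type checkers. If a runtime check is also wanted (for example when a population code arrives from a CLI or a config file), `typing.get_args` can recover the allowed values. The sketch below redefines `LD_Population` locally so that it runs on its own; `is_ld_population` is a hypothetical helper, not part of the commit.

```python
from typing import Literal, get_args

# Redefined locally so the example is self-contained; mirrors the alias in the diff.
LD_Population = Literal["afr", "amr", "asj", "eas", "est", "fin", "nfe", "nwe", "seu"]


def is_ld_population(value: str) -> bool:
    """Return True if value is one of the codes allowed by LD_Population."""
    # get_args(Literal[...]) returns the tuple of literal values.
    return value in get_args(LD_Population)


print(is_ld_population("nfe"))  # prints True
print(is_ld_population("sas"))  # prints False ("sas" is only in VariantPopulation)
```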
@@ -0,0 +1,154 @@
"""Mechanism to seek version from specific datasource."""

from __future__ import annotations

import re
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Callable

from gentropy.common.types import DataSourceType


class VersionEngine:
    """Seek version from the datasource."""

    def __init__(self, datasource: DataSourceType) -> None:
        """Initialize VersionEngine.

        Args:
            datasource (DataSourceType): datasource to seek the version from
        """
        self.datasource = datasource

    @staticmethod
    def version_seekers() -> dict[DataSourceType, DatasourceVersionSeeker]:
        """List version seekers.

        Returns:
            dict[DataSourceType, DatasourceVersionSeeker]: mapping of each registered datasource to its version seeker.
        """
        return {
            "gnomad": GnomADVersionSeeker(),
        }

    def seek(self, text: str | Path) -> str:
        """Interface for inferring the version from text by using the registered datasource version infer method.

        Args:
            text (str | Path): text to seek version from

        Returns:
            str: inferred version

        Raises:
            TypeError: if text is neither a str nor a Path

        Examples:
            >>> VersionEngine("gnomad").seek("gs://gcp-public-data--gnomad/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz")
            '2.1.1'
        """
        match text:
            case Path() | str():
                text = str(text)
            case _:
                msg = f"Can not find version in {text}"
                raise TypeError(msg)
        infer_method = self._get_version_seek_method()
        return infer_method(text)

    def _get_version_seek_method(self) -> Callable[[str], str]:
        """Method that gets the version seeker for the datasource.

        Returns:
            Callable[[str], str]: Method to seek version based on the initialized datasource

        Raises:
            ValueError: if datasource is not registered in the list of version seekers
        """
        if self.datasource not in self.version_seekers():
            raise ValueError(f"Invalid datasource {self.datasource}")
        return self.version_seekers()[self.datasource].seek_version

    def amend_version(
        self, analysis_input_path: str | Path, analysis_output_path: str | Path
    ) -> str:
        """Amend version to the analysis output path if it is not already present.

        Path can be path to gs:// or Path object, absolute or relative.
        The analysis_input_path has to contain the version number.
        If the analysis_output_path already contains the version inferred from the input path,
        then it will not be appended.

        Args:
            analysis_input_path (str | Path): step input path
            analysis_output_path (str | Path): step output path

        Returns:
            str: Path with the amended version, does not return Path object!

        Examples:
            >>> VersionEngine("gnomad").amend_version("gs://gcp-public-data--gnomad/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz", "/some/path/without/version")
            '/some/path/without/version/2.1.1'
        """
        version = self.seek(analysis_input_path)
        output_path = str(analysis_output_path)
        if version in output_path:
            return output_path
        if output_path.endswith("/"):
            return f"{analysis_output_path}{version}"
        return f"{analysis_output_path}/{version}"


class DatasourceVersionSeeker(ABC):
    """Interface for datasource version seeker."""

    @staticmethod
    @abstractmethod
    def seek_version(text: str) -> str:
        """Seek version from text. Implement this method in the subclass.

        Args:
            text (str): text to seek version from

        Returns:
            str: version found in the text

        Raises:
            ValueError: if the version cannot be found
        """
        raise NotImplementedError


class GnomADVersionSeeker(DatasourceVersionSeeker):
    """Seek version from GnomAD datasource."""

    @staticmethod
    def seek_version(text: str) -> str:
        """Seek GnomAD version from provided text by using regex.

        Up to 3 numeric components are allowed in the version number.
        Historically gnomAD version numbers have been in the format
        2.1.1, 3.1, etc. as of 2024-05. GnomAD versions can be found by
        running `"gs://gcp-public-data--gnomad/release/*/*/*"`

        Args:
            text (str): text to seek version from

        Raises:
            ValueError: if the version cannot be found

        Returns:
            str: version found in the text

        Examples:
            >>> GnomADVersionSeeker.seek_version("gs://gcp-public-data--gnomad/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz")
            '2.1.1'
        """
        result = re.search(r"v?((\d+){1}\.(\d+){1}\.?(\d+)?)", text)
        match result:
            case None:
                raise ValueError(f"No GnomAD version found in provided text: {text}")
            case _:
                return result.group(1)
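
One behaviour of the pattern above is worth checking when reviewing: the optional `\.?` sits inside the captured group, so a two-part version followed by a dot and a non-digit (as in a bare file name) is captured with a trailing dot. The standalone sketch below reproduces this with the commit's pattern and compares it with a tighter alternative. `seek_loose`, `seek_strict`, and `STRICT_PATTERN` are hypothetical names for this example only, and the stricter pattern is a suggestion rather than what the commit ships.

```python
import re

# Pattern copied verbatim from GnomADVersionSeeker.seek_version.
LOOSE_PATTERN = re.compile(r"v?((\d+){1}\.(\d+){1}\.?(\d+)?)")
# Hypothetical alternative: the third component, including its dot, is optional as a unit.
STRICT_PATTERN = re.compile(r"(\d+\.\d+(?:\.\d+)?)")


def seek_loose(text: str) -> str:
    """Return the first version found using the commit's pattern."""
    match = LOOSE_PATTERN.search(text)
    if match is None:
        raise ValueError(f"No GnomAD version found in provided text: {text}")
    return match.group(1)


def seek_strict(text: str) -> str:
    """Return the first version found using the stricter, hypothetical pattern."""
    match = STRICT_PATTERN.search(text)
    if match is None:
        raise ValueError(f"No GnomAD version found in provided text: {text}")
    return match.group(1)


full_path = "gs://gcp-public-data--gnomad/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz"
file_name_only = "gnomad.genomes.v3.1.sites.ht"  # made-up file name for illustration

print(seek_loose(full_path))        # prints 2.1.1
print(seek_strict(full_path))       # prints 2.1.1
print(seek_loose(file_name_only))   # prints 3.1. (note the trailing dot)
print(seek_strict(file_name_only))  # prints 3.1
```

With full release paths such as the one in the doctest, the release directory is matched first, so both patterns agree. The difference only shows up when the first version-like token is followed by `.` and a non-digit.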