diff --git a/404.html b/404.html index fb3174466..6bd77a8f5 100644 --- a/404.html +++ b/404.html @@ -1 +1 @@ - Open Targets Genetics
\ No newline at end of file + Open Targets Genetics
\ No newline at end of file diff --git a/development/_development/index.html b/development/_development/index.html index 4cd8d49fc..5f02cb3a0 100644 --- a/development/_development/index.html +++ b/development/_development/index.html @@ -1 +1 @@ - Development - Open Targets Genetics
\ No newline at end of file + Development - Open Targets Genetics
\ No newline at end of file diff --git a/development/airflow/index.html b/development/airflow/index.html index dee819bf1..559f0fba1 100644 --- a/development/airflow/index.html +++ b/development/airflow/index.html @@ -1,4 +1,4 @@ - Running Airflow workflows - Open Targets Genetics

Running Airflow workflows

Airflow code is located in src/airflow. Make sure to execute all of the instructions from that directory, unless stated otherwise.

Set up Docker

We will be running a local Airflow setup using Docker Compose. First, make sure it is installed (this and subsequent commands are tested on Ubuntu):

sudo apt install docker-compose
+ Running Airflow workflows - Open Targets Genetics       

Running Airflow workflows

Airflow code is located in src/airflow. Make sure to execute all of the instructions from that directory, unless stated otherwise.

Set up Docker

We will be running a local Airflow setup using Docker Compose. First, make sure it is installed (this and subsequent commands are tested on Ubuntu):

sudo apt install docker-compose
 

Next, verify that you can run Docker. This should say "Hello from Docker":

docker run hello-world
 

If the command above raises a permission error, add your user to the docker group and refresh the group membership (or reboot):

sudo usermod -a -G docker $USER
 newgrp docker
diff --git a/development/contributing/index.html b/development/contributing/index.html
index c92987e33..1fbf29848 100644
--- a/development/contributing/index.html
+++ b/development/contributing/index.html
@@ -1 +1 @@
- Contributing guidelines - Open Targets Genetics       

Contributing guidelines

One-time configuration

The steps in this section only ever need to be done once on any particular system.

Google Cloud configuration:

  1. Install Google Cloud SDK: https://cloud.google.com/sdk/docs/install.
  2. Log in to your work Google Account: run gcloud auth login and follow instructions.
  3. Obtain Google application credentials: run gcloud auth application-default login and follow instructions.

Check that you have the make utility installed, and if not (which is unlikely), install it using your system package manager.

Check that you have java installed.

Environment configuration

Run make setup-dev to install/update the necessary packages and activate the development environment. You need to do this every time you open a new shell.

It is recommended to use VS Code as an IDE for development.

How to run the code

All pipelines in this repository are intended to be run in Google Dataproc. Running them locally is not currently supported.

In order to run the code:

  1. Manually edit your local workflow/dag.yaml file and comment out the steps you do not want to run.

  2. Manually edit your local pyproject.toml file and modify the version of the code.

    • This must be different from the version used by any other people working on the repository to avoid any deployment conflicts, so it's a good idea to use your name, for example: 1.2.3+jdoe.
    • You can also add a brief branch description, for example: 1.2.3+jdoe.myfeature.
    • Note that the version must comply with PEP440 conventions, otherwise Poetry will not allow it to be deployed.
    • Do not use underscores or hyphens in your version name. When building the WHL file, they will be automatically converted to dots, which means the file name will no longer match the version and the build will fail. Use dots instead.
  3. Run make build.

    • This will create a bundle containing the necessary code, configuration and dependencies to run the ETL pipeline, and then upload this bundle to Google Cloud.
    • A version-specific subpath is used, so uploading the code will not affect any branches but your own.
    • If there was already a code bundle uploaded with the same version number, it will be replaced.
  4. Submit the Dataproc job with poetry run python workflow/workflow_template.py

    • You will need to specify additional parameters; some are mandatory and some are optional. Run with --help to see usage.
    • The script will provision the cluster and submit the job.
    • The cluster will take a few minutes to get provisioned and running, during which the script will not output anything; this is normal.
    • Once submitted, you can monitor the progress of your job on this page: https://console.cloud.google.com/dataproc/jobs?project=open-targets-genetics-dev.
    • On completion (whether successful or a failure), the cluster will be automatically removed, so you don't have to worry about shutting it down to avoid incurring charges.

Contributing checklist

When making changes, and especially when implementing a new module or feature, it's essential to ensure that all relevant sections of the code base are modified.

  • [ ] Run make check. This will run the linter and formatter to ensure that the code is compliant with the project conventions.
  • [ ] Develop unit tests for your code and run make test. This will run all unit tests in the repository, including the examples appended in the docstrings of some methods.
  • [ ] Update the configuration if necessary.
  • [ ] Update the documentation and check it with make build-documentation. This will start a local server to browse it (URL will be printed, usually http://127.0.0.1:8000/)

For more details on each of these steps, see the sections below.

Documentation

  • If during development you had a question which wasn't covered in the documentation, and someone explained it to you, add it to the documentation. The same applies if you encountered any instructions in the documentation which were obsolete or incorrect.
  • Documentation autogeneration expressions start with :::. They will automatically generate sections of the documentation based on class and method docstrings. Be sure to update them for:
  • Dataset definitions in docs/reference/dataset (example: docs/reference/dataset/study_index/study_index_finngen.md)
  • Step definitions in docs/reference/step (example: docs/reference/step/finngen.md)

Configuration

  • Input and output paths in config/datasets/gcp.yaml
  • Step configuration in config/step/STEP.yaml (example: config/step/finngen.yaml)

Classes

  • Dataset class in src/otg/dataset/ (example: StudyIndexFinnGen in src/otg/dataset/study_index.py)
  • Step main running class in src/otg/STEP.py (example: src/otg/finngen.py)

Tests

  • Test study fixture in tests/conftest.py (example: mock_study_index_finngen in that module)
  • Test sample data in tests/data_samples (example: tests/data_samples/finngen_studies_sample.json)
  • Test definition in tests/ (example: test_study_index_finngen_creation in tests/dataset/test_study_index.py)

\ No newline at end of file + Contributing guidelines - Open Targets Genetics

Contributing guidelines

One-time configuration

The steps in this section only ever need to be done once on any particular system.

Google Cloud configuration:

  1. Install Google Cloud SDK: https://cloud.google.com/sdk/docs/install.
  2. Log in to your work Google Account: run gcloud auth login and follow instructions.
  3. Obtain Google application credentials: run gcloud auth application-default login and follow instructions.

Check that you have the make utility installed, and if not (which is unlikely), install it using your system package manager.

Check that you have java installed.

Environment configuration

Run make setup-dev to install/update the necessary packages and activate the development environment. You need to do this every time you open a new shell.

It is recommended to use VS Code as an IDE for development.

How to run the code

All pipelines in this repository are intended to be run in Google Dataproc. Running them locally is not currently supported.

In order to run the code:

  1. Manually edit your local workflow/dag.yaml file and comment out the steps you do not want to run.

  2. Manually edit your local pyproject.toml file and modify the version of the code.

    • This must be different from the version used by any other people working on the repository to avoid any deployment conflicts, so it's a good idea to use your name, for example: 1.2.3+jdoe.
    • You can also add a brief branch description, for example: 1.2.3+jdoe.myfeature.
    • Note that the version must comply with PEP 440 conventions, otherwise Poetry will not allow it to be deployed. A quick way to validate a candidate version string is shown in the sketch after this list.
    • Do not use underscores or hyphens in your version name. When building the WHL file, they will be automatically converted to dots, which means the file name will no longer match the version and the build will fail. Use dots instead.
  3. Run make build.

    • This will create a bundle containing the necessary code, configuration and dependencies to run the ETL pipeline, and then upload this bundle to Google Cloud.
    • A version-specific subpath is used, so uploading the code will not affect any branches but your own.
    • If there was already a code bundle uploaded with the same version number, it will be replaced.
  4. Submit the Dataproc job with poetry run python workflow/workflow_template.py

    • You will need to specify additional parameters; some are mandatory and some are optional. Run with --help to see usage.
    • The script will provision the cluster and submit the job.
    • The cluster will take a few minutes to get provisioned and running, during which the script will not output anything; this is normal.
    • Once submitted, you can monitor the progress of your job on this page: https://console.cloud.google.com/dataproc/jobs?project=open-targets-genetics-dev.
    • On completion (whether successful or a failure), the cluster will be automatically removed, so you don't have to worry about shutting it down to avoid incurring charges.
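A quick way to check that a candidate version string is PEP 440-compliant, and to see how hyphens and underscores get normalised to dots, is to parse it with the packaging library. This is an illustrative sketch, not part of the repository tooling; the version strings are made-up examples:

from packaging.version import InvalidVersion, Version

for candidate in ("1.2.3+jdoe.myfeature", "1.2.3+jdoe_my-feature", "1.2.3+jdoe myfeature"):
    try:
        parsed = Version(candidate)
        # Hyphens and underscores in the local segment are normalised to dots,
        # which is why the built WHL file name can stop matching the raw version.
        print(f"{candidate!r} is valid, normalised form: {parsed}")
    except InvalidVersion:
        print(f"{candidate!r} is not a valid PEP 440 version")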

Contributing checklist

When making changes, and especially when implementing a new module or feature, it's essential to ensure that all relevant sections of the code base are modified.

  • [ ] Run make check. This will run the linter and formatter to ensure that the code is compliant with the project conventions.
  • [ ] Develop unit tests for your code and run make test. This will run all unit tests in the repository, including the examples appended in the docstrings of some methods.
  • [ ] Update the configuration if necessary.
  • [ ] Update the documentation and check it with make build-documentation. This will start a local server to browse it (URL will be printed, usually http://127.0.0.1:8000/)

For more details on each of these steps, see the sections below.

Documentation

  • If during development you had a question which wasn't covered in the documentation, and someone explained it to you, add it to the documentation. The same applies if you encountered any instructions in the documentation which were obsolete or incorrect.
  • Documentation autogeneration expressions start with :::. They will automatically generate sections of the documentation based on class and method docstrings. Be sure to update them for:
  • Dataset definitions in docs/reference/dataset (example: docs/reference/dataset/study_index/study_index_finngen.md)
  • Step definitions in docs/reference/step (example: docs/reference/step/finngen.md)

Configuration

  • Input and output paths in config/datasets/gcp.yaml
  • Step configuration in config/step/STEP.yaml (example: config/step/finngen.yaml)

Classes

  • Dataset class in src/otg/dataset/ (example: StudyIndexFinnGen in src/otg/dataset/study_index.py)
  • Step main running class in src/otg/STEP.py (example: src/otg/finngen.py)

Tests

  • Test study fixture in tests/conftest.py (example: mock_study_index_finngen in that module)
  • Test sample data in tests/data_samples (example: tests/data_samples/finngen_studies_sample.json)
  • Test definition in tests/ (example: test_study_index_finngen_creation in tests/dataset/test_study_index.py)

\ No newline at end of file diff --git a/development/troubleshooting/index.html b/development/troubleshooting/index.html index dc96298ea..d6c3aec1f 100644 --- a/development/troubleshooting/index.html +++ b/development/troubleshooting/index.html @@ -1 +1 @@ - Troubleshooting - Open Targets Genetics

Troubleshooting

BLAS/LAPACK

If you see errors related to BLAS/LAPACK libraries, see this StackOverflow post for guidance.

Pyenv and Poetry

If you see various errors thrown by Pyenv or Poetry, they can be hard to specifically diagnose and resolve. In this case, it often helps to remove those tools from the system completely. Follow these steps:

  1. Close your currently activated environment, if any: exit
  2. Uninstall Poetry: curl -sSL https://install.python-poetry.org | python3 - --uninstall
  3. Clear Poetry cache: rm -rf ~/.cache/pypoetry
  4. Clear pre-commit cache: rm -rf ~/.cache/pre-commit
  5. Switch to system Python shell: pyenv shell system
  6. Edit ~/.bashrc to remove the lines related to Pyenv configuration
  7. Remove Pyenv configuration and cache: rm -rf ~/.pyenv

After that, open a fresh shell session and run make setup-dev again.

Java

Officially, PySpark requires Java version 8 (a.k.a. 1.8) or above to work. However, if you have a very recent version of Java, you may experience issues, as it may introduce breaking changes that PySpark hasn't had time to integrate. For example, as of May 2023, PySpark did not work with Java 20.

If you are encountering problems with initialising a Spark session, try using Java 11.

Pre-commit

If you see an error message thrown by pre-commit, which looks like this (SyntaxError: Unexpected token '?'), followed by a JavaScript traceback, the issue is likely with your system NodeJS version.

One solution which can help in this case is to upgrade your system NodeJS version. However, this may not always be possible. For example, the Ubuntu repository is several major versions behind the latest version as of July 2023.

Another solution which helps is to remove Node, NodeJS, and npm from your system entirely. In this case, pre-commit will not try to rely on a system version of NodeJS and will install its own, suitable one.

On Ubuntu, this can be done using sudo apt remove node nodejs npm, followed by sudo apt autoremove. But in some cases, depending on your existing installation, you may need to also manually remove some files. See this StackOverflow answer for guidance.

After running these commands, you are advised to open a fresh shell, and then also reinstall Pyenv and Poetry to make sure they pick up the changes (see relevant section above).


\ No newline at end of file + Troubleshooting - Open Targets Genetics

Troubleshooting

BLAS/LAPACK

If you see errors related to BLAS/LAPACK libraries, see this StackOverflow post for guidance.

Pyenv and Poetry

If you see various errors thrown by Pyenv or Poetry, they can be hard to specifically diagnose and resolve. In this case, it often helps to remove those tools from the system completely. Follow these steps:

  1. Close your currently activated environment, if any: exit
  2. Uninstall Poetry: curl -sSL https://install.python-poetry.org | python3 - --uninstall
  3. Clear Poetry cache: rm -rf ~/.cache/pypoetry
  4. Clear pre-commit cache: rm -rf ~/.cache/pre-commit
  5. Switch to system Python shell: pyenv shell system
  6. Edit ~/.bashrc to remove the lines related to Pyenv configuration
  7. Remove Pyenv configuration and cache: rm -rf ~/.pyenv

After that, open a fresh shell session and run make setup-dev again.

Java

Officially, PySpark requires Java version 8 (a.k.a. 1.8) or above to work. However, if you have a very recent version of Java, you may experience issues, as it may introduce breaking changes that PySpark hasn't had time to integrate. For example, as of May 2023, PySpark did not work with Java 20.

If you are encountering problems with initialising a Spark session, try using Java 11.
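A minimal way to check whether your current Java installation can back a Spark session is to start and stop a throwaway local session. This is a generic PySpark sketch (the app name is arbitrary), not a script shipped with the repository:

from pyspark.sql import SparkSession

# If this hangs or fails with a Py4J/JVM error, an unsupported Java version is a likely culprit.
spark = SparkSession.builder.master("local[1]").appName("java-smoke-test").getOrCreate()
print(f"Spark {spark.version} started successfully")
spark.stop()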

Pre-commit

If you see an error message thrown by pre-commit, which looks like this (SyntaxError: Unexpected token '?'), followed by a JavaScript traceback, the issue is likely with your system NodeJS version.

One solution which can help in this case is to upgrade your system NodeJS version. However, this may not always be possible. For example, the Ubuntu repository is several major versions behind the latest version as of July 2023.

Another solution which helps is to remove Node, NodeJS, and npm from your system entirely. In this case, pre-commit will not try to rely on a system version of NodeJS and will install its own, suitable one.

On Ubuntu, this can be done using sudo apt remove node nodejs npm, followed by sudo apt autoremove. But in some cases, depending on your existing installation, you may need to also manually remove some files. See this StackOverflow answer for guidance.

After running these commands, you are advised to open a fresh shell, and then also reinstall Pyenv and Poetry to make sure they pick up the changes (see relevant section above).


\ No newline at end of file diff --git a/index.html b/index.html index 2110f6f99..95d3d0a03 100644 --- a/index.html +++ b/index.html @@ -1,4 +1,4 @@ - Open Targets Genetics - Open Targets Genetics

Open Targets Genetics

Open Targets Genetics

\ No newline at end of file + Installation - Open Targets Genetics
\ No newline at end of file diff --git a/objects.inv b/objects.inv index 8278155af..61ba946f1 100644 Binary files a/objects.inv and b/objects.inv differ diff --git a/python_api/_python_api/index.html b/python_api/_python_api/index.html index 5375ccc98..7d95169fe 100644 --- a/python_api/_python_api/index.html +++ b/python_api/_python_api/index.html @@ -1 +1 @@ - Python API - Open Targets Genetics
\ No newline at end of file + Python API - Open Targets Genetics
\ No newline at end of file diff --git a/python_api/dataset/_dataset/index.html b/python_api/dataset/_dataset/index.html index e0a2997da..e000a9ab5 100644 --- a/python_api/dataset/_dataset/index.html +++ b/python_api/dataset/_dataset/index.html @@ -1,4 +1,4 @@ - Dataset - Open Targets Genetics

Dataset

otg.dataset.dataset.Dataset dataclass

Bases: ABC

Open Targets Genetics Dataset.

Dataset is a wrapper around a Spark DataFrame with a predefined schema. Schemas for each child dataset are described in the json.schemas module.

Source code in src/otg/dataset/dataset.py
+ Dataset - Open Targets Genetics       

Dataset

otg.dataset.dataset.Dataset dataclass

Bases: ABC

Open Targets Genetics Dataset.

Dataset is a wrapper around a Spark DataFrame with a predefined schema. Schemas for each child dataset are described in the schemas module.

Source code in src/otg/dataset/dataset.py
@@ -145,7 +145,7 @@
 class Dataset(ABC):
     """Open Targets Genetics Dataset.
 
-    `Dataset` is a wrapper around a Spark DataFrame with a predefined schema. Schemas for each child dataset are described in the `json.schemas` module.
+    `Dataset` is a wrapper around a Spark DataFrame with a predefined schema. Schemas for each child dataset are described in the `schemas` module.
     """
 
     _df: DataFrame
diff --git a/python_api/dataset/colocalisation/index.html b/python_api/dataset/colocalisation/index.html
index b88d04c05..4043b57c5 100644
--- a/python_api/dataset/colocalisation/index.html
+++ b/python_api/dataset/colocalisation/index.html
@@ -1,4 +1,4 @@
- Colocalisation - Open Targets Genetics       

Colocalisation

otg.dataset.colocalisation.Colocalisation dataclass

Bases: Dataset

Colocalisation results for pairs of overlapping study-locus.

Source code in src/otg/dataset/colocalisation.py
+ Colocalisation - Open Targets Genetics       

Colocalisation

otg.dataset.colocalisation.Colocalisation dataclass

Bases: Dataset

Colocalisation results for pairs of overlapping study-locus.

Source code in src/otg/dataset/colocalisation.py
diff --git a/python_api/dataset/gene_index/index.html b/python_api/dataset/gene_index/index.html
index ae9657802..7eb860dab 100644
--- a/python_api/dataset/gene_index/index.html
+++ b/python_api/dataset/gene_index/index.html
@@ -1,4 +1,4 @@
- Gene Index - Open Targets Genetics       

Gene Index

otg.dataset.gene_index.GeneIndex dataclass

Bases: Dataset

Gene index dataset.

Gene-based annotation.

Source code in src/otg/dataset/gene_index.py
+ Gene Index - Open Targets Genetics       

Gene Index

otg.dataset.gene_index.GeneIndex dataclass

Bases: Dataset

Gene index dataset.

Gene-based annotation.

Source code in src/otg/dataset/gene_index.py
diff --git a/python_api/dataset/intervals/index.html b/python_api/dataset/intervals/index.html
index 9284ee362..1c46e000a 100644
--- a/python_api/dataset/intervals/index.html
+++ b/python_api/dataset/intervals/index.html
@@ -1,4 +1,4 @@
- Intervals - Open Targets Genetics       

Intervals

otg.dataset.intervals.Intervals dataclass

Bases: Dataset

Intervals dataset links genes to genomic regions based on genome interaction studies.

Source code in src/otg/dataset/intervals.py
+ Intervals - Open Targets Genetics       

Intervals

otg.dataset.intervals.Intervals dataclass

Bases: Dataset

Intervals dataset links genes to genomic regions based on genome interaction studies.

Source code in src/otg/dataset/intervals.py
diff --git a/python_api/dataset/l2g_feature_matrix/index.html b/python_api/dataset/l2g_feature_matrix/index.html
new file mode 100644
index 000000000..3bc31ad17
--- /dev/null
+++ b/python_api/dataset/l2g_feature_matrix/index.html
@@ -0,0 +1,421 @@
+ L2G Feature Matrix - Open Targets Genetics       

L2G Feature Matrix

otg.dataset.l2g_feature_matrix.L2GFeatureMatrix dataclass

Bases: Dataset

Dataset with features for Locus to Gene prediction.

Source code in src/otg/dataset/l2g_feature_matrix.py
@dataclass
+class L2GFeatureMatrix(Dataset):
+    """Dataset with features for Locus to Gene prediction."""
+
+    @classmethod
+    def generate_features(
+        cls: Type[L2GFeatureMatrix],
+        study_locus: StudyLocus,
+        study_index: StudyIndex,
+        variant_gene: V2G,
+        # colocalisation: Colocalisation,
+    ) -> L2GFeatureMatrix:
+        """Generate features from the OTG datasets.
+
+        Args:
+            study_locus (StudyLocus): Study locus dataset
+            study_index (StudyIndex): Study index dataset
+            variant_gene (V2G): Variant to gene dataset
+
+        Returns:
+            L2GFeatureMatrix: L2G feature matrix dataset
+
+        Raises:
+            ValueError: If the feature matrix is empty
+        """
+        if features_dfs := [
+            # Extract features
+            # ColocalisationFactory._get_coloc_features(
+            #     study_locus, study_index, colocalisation
+            # ).df,
+            StudyLocusFactory._get_tss_distance_features(study_locus, variant_gene).df,
+        ]:
+            fm = reduce(
+                lambda x, y: x.unionByName(y),
+                features_dfs,
+            )
+        else:
+            raise ValueError("No features found")
+
+        # raise error if the feature matrix is empty
+        if fm.limit(1).count() != 0:
+            return cls(
+                _df=_convert_from_long_to_wide(
+                    fm, ["studyLocusId", "geneId"], "featureName", "featureValue"
+                ),
+                _schema=cls.get_schema(),
+            )
+        raise ValueError("L2G Feature matrix is empty")
+
+    @classmethod
+    def get_schema(cls: type[L2GFeatureMatrix]) -> StructType:
+        """Provides the schema for the L2gFeatureMatrix dataset.
+
+        Returns:
+            StructType: Schema for the L2gFeatureMatrix dataset
+        """
+        return parse_spark_schema("l2g_feature_matrix.json")
+
+    def fill_na(
+        self: L2GFeatureMatrix, value: float = 0.0, subset: list[str] | None = None
+    ) -> L2GFeatureMatrix:
+        """Fill missing values in a column with a given value.
+
+        Args:
+            value (float): Value to replace missing values with. Defaults to 0.0.
+            subset (list[str] | None): Subset of columns to consider. Defaults to None.
+
+        Returns:
+            L2GFeatureMatrix: L2G feature matrix dataset
+        """
+        self.df = self._df.fillna(value, subset=subset)
+        return self
+
+    def select_features(
+        self: L2GFeatureMatrix, features_list: list[str]
+    ) -> L2GFeatureMatrix:
+        """Select a subset of features from the feature matrix.
+
+        Args:
+            features_list (list[str]): List of features to select
+
+        Returns:
+            L2GFeatureMatrix: L2G feature matrix dataset
+        """
+        fixed_rows = ["studyLocusId", "geneId", "goldStandardSet"]
+        self.df = self._df.select(fixed_rows + features_list)
+        return self
+
+    def train_test_split(
+        self: L2GFeatureMatrix, fraction: float
+    ) -> tuple[L2GFeatureMatrix, L2GFeatureMatrix]:
+        """Split the dataset into training and test sets.
+
+        Args:
+            fraction (float): Fraction of the dataset to use for training
+
+        Returns:
+            tuple[L2GFeatureMatrix, L2GFeatureMatrix]: Training and test datasets
+        """
+        train, test = self._df.randomSplit([fraction, 1 - fraction], seed=42)
+        return (
+            L2GFeatureMatrix(
+                _df=train, _schema=L2GFeatureMatrix.get_schema()
+            ).persist(),
+            L2GFeatureMatrix(_df=test, _schema=L2GFeatureMatrix.get_schema()).persist(),
+        )
+

fill_na(value: float = 0.0, subset: list[str] | None = None) -> L2GFeatureMatrix

Fill missing values in a column with a given value.

Parameters:

Name Type Description Default
value float

Value to replace missing values with. Defaults to 0.0.

0.0
subset list[str] | None

Subset of columns to consider. Defaults to None.

None

Returns:

Name Type Description
L2GFeatureMatrix L2GFeatureMatrix

L2G feature matrix dataset

Source code in src/otg/dataset/l2g_feature_matrix.py
def fill_na(
+    self: L2GFeatureMatrix, value: float = 0.0, subset: list[str] | None = None
+) -> L2GFeatureMatrix:
+    """Fill missing values in a column with a given value.
+
+    Args:
+        value (float): Value to replace missing values with. Defaults to 0.0.
+        subset (list[str] | None): Subset of columns to consider. Defaults to None.
+
+    Returns:
+        L2GFeatureMatrix: L2G feature matrix dataset
+    """
+    self.df = self._df.fillna(value, subset=subset)
+    return self
+

generate_features(study_locus: StudyLocus, study_index: StudyIndex, variant_gene: V2G) -> L2GFeatureMatrix classmethod

Generate features from the OTG datasets.

Parameters:

Name Type Description Default
study_locus StudyLocus

Study locus dataset

required
study_index StudyIndex

Study index dataset

required
variant_gene V2G

Variant to gene dataset

required

Returns:

Name Type Description
L2GFeatureMatrix L2GFeatureMatrix

L2G feature matrix dataset

Raises:

Type Description
ValueError

If the feature matrix is empty

Source code in src/otg/dataset/l2g_feature_matrix.py
@classmethod
+def generate_features(
+    cls: Type[L2GFeatureMatrix],
+    study_locus: StudyLocus,
+    study_index: StudyIndex,
+    variant_gene: V2G,
+    # colocalisation: Colocalisation,
+) -> L2GFeatureMatrix:
+    """Generate features from the OTG datasets.
+
+    Args:
+        study_locus (StudyLocus): Study locus dataset
+        study_index (StudyIndex): Study index dataset
+        variant_gene (V2G): Variant to gene dataset
+
+    Returns:
+        L2GFeatureMatrix: L2G feature matrix dataset
+
+    Raises:
+        ValueError: If the feature matrix is empty
+    """
+    if features_dfs := [
+        # Extract features
+        # ColocalisationFactory._get_coloc_features(
+        #     study_locus, study_index, colocalisation
+        # ).df,
+        StudyLocusFactory._get_tss_distance_features(study_locus, variant_gene).df,
+    ]:
+        fm = reduce(
+            lambda x, y: x.unionByName(y),
+            features_dfs,
+        )
+    else:
+        raise ValueError("No features found")
+
+    # raise error if the feature matrix is empty
+    if fm.limit(1).count() != 0:
+        return cls(
+            _df=_convert_from_long_to_wide(
+                fm, ["studyLocusId", "geneId"], "featureName", "featureValue"
+            ),
+            _schema=cls.get_schema(),
+        )
+    raise ValueError("L2G Feature matrix is empty")
+

get_schema() -> StructType classmethod

Provides the schema for the L2gFeatureMatrix dataset.

Returns:

Name Type Description
StructType StructType

Schema for the L2gFeatureMatrix dataset

Source code in src/otg/dataset/l2g_feature_matrix.py
@classmethod
+def get_schema(cls: type[L2GFeatureMatrix]) -> StructType:
+    """Provides the schema for the L2gFeatureMatrix dataset.
+
+    Returns:
+        StructType: Schema for the L2gFeatureMatrix dataset
+    """
+    return parse_spark_schema("l2g_feature_matrix.json")
+

select_features(features_list: list[str]) -> L2GFeatureMatrix

Select a subset of features from the feature matrix.

Parameters:

Name Type Description Default
features_list list[str]

List of features to select

required

Returns:

Name Type Description
L2GFeatureMatrix L2GFeatureMatrix

L2G feature matrix dataset

Source code in src/otg/dataset/l2g_feature_matrix.py
def select_features(
+    self: L2GFeatureMatrix, features_list: list[str]
+) -> L2GFeatureMatrix:
+    """Select a subset of features from the feature matrix.
+
+    Args:
+        features_list (list[str]): List of features to select
+
+    Returns:
+        L2GFeatureMatrix: L2G feature matrix dataset
+    """
+    fixed_rows = ["studyLocusId", "geneId", "goldStandardSet"]
+    self.df = self._df.select(fixed_rows + features_list)
+    return self
+

train_test_split(fraction: float) -> tuple[L2GFeatureMatrix, L2GFeatureMatrix]

Split the dataset into training and test sets.

Parameters:

Name Type Description Default
fraction float

Fraction of the dataset to use for training

required

Returns:

Type Description
tuple[L2GFeatureMatrix, L2GFeatureMatrix]

tuple[L2GFeatureMatrix, L2GFeatureMatrix]: Training and test datasets

Source code in src/otg/dataset/l2g_feature_matrix.py
def train_test_split(
+    self: L2GFeatureMatrix, fraction: float
+) -> tuple[L2GFeatureMatrix, L2GFeatureMatrix]:
+    """Split the dataset into training and test sets.
+
+    Args:
+        fraction (float): Fraction of the dataset to use for training
+
+    Returns:
+        tuple[L2GFeatureMatrix, L2GFeatureMatrix]: Training and test datasets
+    """
+    train, test = self._df.randomSplit([fraction, 1 - fraction], seed=42)
+    return (
+        L2GFeatureMatrix(
+            _df=train, _schema=L2GFeatureMatrix.get_schema()
+        ).persist(),
+        L2GFeatureMatrix(_df=test, _schema=L2GFeatureMatrix.get_schema()).persist(),
+    )
+

Schema

root
+ |-- studyLocusId: long (nullable = false)
+ |-- geneId: string (nullable = false)
+ |-- goldStandardSet: string (nullable = true)
+ |-- distanceTssMean: float (nullable = true)
+ |-- distanceTssMinimum: float (nullable = true)
+ |-- eqtlColocClppLocalMaximum: double (nullable = true)
+ |-- eqtlColocClppNeighborhoodMaximum: double (nullable = true)
+ |-- eqtlColocLlrLocalMaximum: double (nullable = true)
+ |-- eqtlColocLlrNeighborhoodMaximum: double (nullable = true)
+ |-- pqtlColocClppLocalMaximum: double (nullable = true)
+ |-- pqtlColocClppNeighborhoodMaximum: double (nullable = true)
+ |-- pqtlColocLlrLocalMaximum: double (nullable = true)
+ |-- pqtlColocLlrNeighborhoodMaximum: double (nullable = true)
+ |-- sqtlColocClppLocalMaximum: double (nullable = true)
+ |-- sqtlColocClppNeighborhoodMaximum: double (nullable = true)
+ |-- sqtlColocLlrLocalMaximum: double (nullable = true)
+ |-- sqtlColocLlrNeighborhoodMaximum: double (nullable = true)
+
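A hypothetical end-to-end sketch of how this class is meant to be used, assuming study_locus, study_index and v2g datasets have already been loaded elsewhere (the 0.8 split fraction is illustrative):

from otg.dataset.l2g_feature_matrix import L2GFeatureMatrix

# Build the feature matrix from the core OTG datasets and fill missing feature values with 0.0
fm = L2GFeatureMatrix.generate_features(
    study_locus=study_locus,
    study_index=study_index,
    variant_gene=v2g,
).fill_na()

# Split into training and test sets
train, test = fm.train_test_split(fraction=0.8)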

\ No newline at end of file diff --git a/python_api/dataset/l2g_gold_standard/index.html b/python_api/dataset/l2g_gold_standard/index.html new file mode 100644 index 000000000..48f9bde4e --- /dev/null +++ b/python_api/dataset/l2g_gold_standard/index.html @@ -0,0 +1,150 @@ + L2G Gold Standard - Open Targets Genetics

L2G Gold Standard

otg.dataset.l2g_gold_standard.L2GGoldStandard dataclass

Bases: Dataset

L2G gold standard dataset.

Source code in src/otg/dataset/l2g_gold_standard.py
@dataclass
+class L2GGoldStandard(Dataset):
+    """L2G gold standard dataset."""
+
+    @classmethod
+    def from_otg_curation(
+        cls: type[L2GGoldStandard],
+        gold_standard_curation: DataFrame,
+        v2g: V2G,
+        study_locus_overlap: StudyLocusOverlap,
+        interactions: DataFrame,
+    ) -> L2GGoldStandard:
+        """Initialise L2GGoldStandard from source dataset.
+
+        Args:
+            gold_standard_curation (DataFrame): Gold standard curation dataframe, extracted from
+            v2g (V2G): Variant to gene dataset to bring distance between a variant and a gene's TSS
+            study_locus_overlap (StudyLocusOverlap): Study locus overlap dataset to remove duplicated loci
+            interactions (DataFrame): Gene-gene interactions dataset to remove negative cases where the gene interacts with a positive gene
+
+        Returns:
+            L2GGoldStandard: L2G Gold Standard dataset
+        """
+        from otg.datasource.open_targets.l2g_gold_standard import (
+            OpenTargetsL2GGoldStandard,
+        )
+
+        return OpenTargetsL2GGoldStandard.as_l2g_gold_standard(
+            gold_standard_curation, v2g, study_locus_overlap, interactions
+        )
+
+    @classmethod
+    def get_schema(cls: type[L2GGoldStandard]) -> StructType:
+        """Provides the schema for the L2GGoldStandard dataset.
+
+        Returns:
+            StructType: Spark schema for the L2GGoldStandard dataset
+        """
+        return parse_spark_schema("l2g_gold_standard.json")
+

from_otg_curation(gold_standard_curation: DataFrame, v2g: V2G, study_locus_overlap: StudyLocusOverlap, interactions: DataFrame) -> L2GGoldStandard classmethod

Initialise L2GGoldStandard from source dataset.

Parameters:

Name Type Description Default
gold_standard_curation DataFrame

Gold standard curation dataframe, extracted from

required
v2g V2G

Variant to gene dataset to bring distance between a variant and a gene's TSS

required
study_locus_overlap StudyLocusOverlap

Study locus overlap dataset to remove duplicated loci

required
interactions DataFrame

Gene-gene interactions dataset to remove negative cases where the gene interacts with a positive gene

required

Returns:

Name Type Description
L2GGoldStandard L2GGoldStandard

L2G Gold Standard dataset

Source code in src/otg/dataset/l2g_gold_standard.py
@classmethod
+def from_otg_curation(
+    cls: type[L2GGoldStandard],
+    gold_standard_curation: DataFrame,
+    v2g: V2G,
+    study_locus_overlap: StudyLocusOverlap,
+    interactions: DataFrame,
+) -> L2GGoldStandard:
+    """Initialise L2GGoldStandard from source dataset.
+
+    Args:
+        gold_standard_curation (DataFrame): Gold standard curation dataframe, extracted from
+        v2g (V2G): Variant to gene dataset to bring distance between a variant and a gene's TSS
+        study_locus_overlap (StudyLocusOverlap): Study locus overlap dataset to remove duplicated loci
+        interactions (DataFrame): Gene-gene interactions dataset to remove negative cases where the gene interacts with a positive gene
+
+    Returns:
+        L2GGoldStandard: L2G Gold Standard dataset
+    """
+    from otg.datasource.open_targets.l2g_gold_standard import (
+        OpenTargetsL2GGoldStandard,
+    )
+
+    return OpenTargetsL2GGoldStandard.as_l2g_gold_standard(
+        gold_standard_curation, v2g, study_locus_overlap, interactions
+    )
+

get_schema() -> StructType classmethod

Provides the schema for the L2GGoldStandard dataset.

Returns:

Name Type Description
StructType StructType

Spark schema for the L2GGoldStandard dataset

Source code in src/otg/dataset/l2g_gold_standard.py
@classmethod
+def get_schema(cls: type[L2GGoldStandard]) -> StructType:
+    """Provides the schema for the L2GGoldStandard dataset.
+
+    Returns:
+        StructType: Spark schema for the L2GGoldStandard dataset
+    """
+    return parse_spark_schema("l2g_gold_standard.json")
+

Schema

root
+ |-- studyLocusId: long (nullable = false)
+ |-- geneId: string (nullable = false)
+ |-- goldStandardSet: string (nullable = false)
+ |-- sources: array (nullable = false)
+ |    |-- element: string (containsNull = true)
+
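A hypothetical construction sketch, assuming the curation dataframe, the V2G dataset, the StudyLocusOverlap dataset and the interactions dataframe are already available in the session (all variable names are illustrative):

from otg.dataset.l2g_gold_standard import L2GGoldStandard

# Build the gold standard from the Open Targets curation, using V2G distances,
# study-locus overlaps (to drop duplicated loci) and gene-gene interactions
# (to drop negative cases that interact with a positive gene).
gold_standard = L2GGoldStandard.from_otg_curation(
    gold_standard_curation=curation_df,
    v2g=v2g,
    study_locus_overlap=overlaps,
    interactions=interactions_df,
)
print(L2GGoldStandard.get_schema())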

\ No newline at end of file diff --git a/python_api/dataset/l2g_prediction/index.html b/python_api/dataset/l2g_prediction/index.html new file mode 100644 index 000000000..10959bb7c --- /dev/null +++ b/python_api/dataset/l2g_prediction/index.html @@ -0,0 +1,222 @@ + L2G Prediction - Open Targets Genetics

L2G Prediction

otg.dataset.l2g_prediction.L2GPrediction dataclass

Bases: Dataset

Dataset that contains the Locus to Gene predictions.

It is the result of applying the L2G model on a feature matrix, which contains all the study/locus pairs and their functional annotations. The score column informs the confidence of the prediction that a gene is causal to an association.

Source code in src/otg/dataset/l2g_prediction.py
@dataclass
+class L2GPrediction(Dataset):
+    """Dataset that contains the Locus to Gene predictions.
+
+    It is the result of applying the L2G model on a feature matrix, which contains all
+    the study/locus pairs and their functional annotations. The score column informs the
+    confidence of the prediction that a gene is causal to an association.
+    """
+
+    @classmethod
+    def get_schema(cls: type[L2GPrediction]) -> StructType:
+        """Provides the schema for the L2GPrediction dataset.
+
+        Returns:
+            StructType: Schema for the L2GPrediction dataset
+        """
+        return parse_spark_schema("l2g_predictions.json")
+
+    @classmethod
+    def from_study_locus(
+        cls: Type[L2GPrediction],
+        model_path: str,
+        study_locus: StudyLocus,
+        study_index: StudyIndex,
+        v2g: V2G,
+        # coloc: Colocalisation,
+    ) -> L2GPrediction:
+        """Initialise L2G from feature matrix.
+
+        Args:
+            model_path (str): Path to the fitted model
+            study_locus (StudyLocus): Study locus dataset
+            study_index (StudyIndex): Study index dataset
+            v2g (V2G): Variant to gene dataset
+
+        Returns:
+            L2GPrediction: L2G dataset
+        """
+        fm = L2GFeatureMatrix.generate_features(
+            study_locus=study_locus,
+            study_index=study_index,
+            variant_gene=v2g,
+            # colocalisation=coloc,
+        ).fill_na()
+        return L2GPrediction(
+            # Load and apply fitted model
+            _df=(
+                LocusToGeneModel.load_from_disk(
+                    model_path,
+                    features_list=fm.df.drop("studyLocusId", "geneId").columns,
+                ).predict(fm)
+                # the probability of the positive class is the second element inside the probability array
+                # - this is selected as the L2G probability
+                .select(
+                    "studyLocusId",
+                    "geneId",
+                    vector_to_array("probability")[1].alias("score"),
+                )
+            ),
+            _schema=cls.get_schema(),
+        )
+

from_study_locus(model_path: str, study_locus: StudyLocus, study_index: StudyIndex, v2g: V2G) -> L2GPrediction classmethod

Initialise L2G from feature matrix.

Parameters:

Name Type Description Default
model_path str

Path to the fitted model

required
study_locus StudyLocus

Study locus dataset

required
study_index StudyIndex

Study index dataset

required
v2g V2G

Variant to gene dataset

required

Returns:

Name Type Description
L2GPrediction L2GPrediction

L2G dataset

Source code in src/otg/dataset/l2g_prediction.py
@classmethod
+def from_study_locus(
+    cls: Type[L2GPrediction],
+    model_path: str,
+    study_locus: StudyLocus,
+    study_index: StudyIndex,
+    v2g: V2G,
+    # coloc: Colocalisation,
+) -> L2GPrediction:
+    """Initialise L2G from feature matrix.
+
+    Args:
+        model_path (str): Path to the fitted model
+        study_locus (StudyLocus): Study locus dataset
+        study_index (StudyIndex): Study index dataset
+        v2g (V2G): Variant to gene dataset
+
+    Returns:
+        L2GPrediction: L2G dataset
+    """
+    fm = L2GFeatureMatrix.generate_features(
+        study_locus=study_locus,
+        study_index=study_index,
+        variant_gene=v2g,
+        # colocalisation=coloc,
+    ).fill_na()
+    return L2GPrediction(
+        # Load and apply fitted model
+        _df=(
+            LocusToGeneModel.load_from_disk(
+                model_path,
+                features_list=fm.df.drop("studyLocusId", "geneId").columns,
+            ).predict(fm)
+            # the probability of the positive class is the second element inside the probability array
+            # - this is selected as the L2G probability
+            .select(
+                "studyLocusId",
+                "geneId",
+                vector_to_array("probability")[1].alias("score"),
+            )
+        ),
+        _schema=cls.get_schema(),
+    )
+

get_schema() -> StructType classmethod

Provides the schema for the L2GPrediction dataset.

Returns:

Name Type Description
StructType StructType

Schema for the L2GPrediction dataset

Source code in src/otg/dataset/l2g_prediction.py
@classmethod
+def get_schema(cls: type[L2GPrediction]) -> StructType:
+    """Provides the schema for the L2GPrediction dataset.
+
+    Returns:
+        StructType: Schema for the L2GPrediction dataset
+    """
+    return parse_spark_schema("l2g_predictions.json")
+
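A hypothetical scoring sketch, assuming a fitted model has been written to the given path and the study_locus, study_index and v2g datasets are already loaded (the model path is illustrative):

from otg.dataset.l2g_prediction import L2GPrediction

predictions = L2GPrediction.from_study_locus(
    model_path="gs://some-bucket/l2g/model",  # illustrative location of the fitted model
    study_locus=study_locus,
    study_index=study_index,
    v2g=v2g,
)
# One row per (studyLocusId, geneId) pair, with the L2G probability in the `score` column
predictions.df.show(5)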

Schema


\ No newline at end of file diff --git a/python_api/dataset/ld_index/index.html b/python_api/dataset/ld_index/index.html index b1f4b4d17..832339573 100644 --- a/python_api/dataset/ld_index/index.html +++ b/python_api/dataset/ld_index/index.html @@ -1,4 +1,4 @@ - LD Index - Open Targets Genetics

LD Index

otg.dataset.ld_index.LDIndex dataclass

Bases: Dataset

Dataset containing linkage disequilibrium information between variants.

Source code in src/otg/dataset/ld_index.py
+ LD Index - Open Targets Genetics       

LD Index

otg.dataset.ld_index.LDIndex dataclass

Bases: Dataset

Dataset containing linkage disequilibrium information between variants.

Source code in src/otg/dataset/ld_index.py
diff --git a/python_api/dataset/study_index/index.html b/python_api/dataset/study_index/index.html
index bbbeb4e0a..bc9c07d8c 100644
--- a/python_api/dataset/study_index/index.html
+++ b/python_api/dataset/study_index/index.html
@@ -1,4 +1,4 @@
- Study Index - Open Targets Genetics       

Study Index

otg.dataset.study_index.StudyIndex dataclass

Bases: Dataset

Study index dataset.

A study index dataset captures all the metadata for all studies including GWAS and Molecular QTL.

Source code in src/otg/dataset/study_index.py
+ Study Index - Open Targets Genetics       

Study Index

otg.dataset.study_index.StudyIndex dataclass

Bases: Dataset

Study index dataset.

A study index dataset captures all the metadata for all studies including GWAS and Molecular QTL.

Source code in src/otg/dataset/study_index.py
@@ -363,6 +363,7 @@
  |-- traitFromSource: string (nullable = false)
  |-- traitFromSourceMappedIds: array (nullable = true)
  |    |-- element: string (containsNull = true)
+ |-- geneId: string (nullable = true)
  |-- pubmedId: string (nullable = true)
  |-- publicationTitle: string (nullable = true)
  |-- publicationFirstAuthor: string (nullable = true)
diff --git a/python_api/dataset/study_locus/index.html b/python_api/dataset/study_locus/index.html
index 0e5c7c6b8..69cfec636 100644
--- a/python_api/dataset/study_locus/index.html
+++ b/python_api/dataset/study_locus/index.html
@@ -1,4 +1,4 @@
- Study Locus - Open Targets Genetics       

Study Locus

otg.dataset.study_locus.StudyLocus dataclass

Bases: Dataset

Study-Locus dataset.

This dataset captures associations between studies/traits and genetic loci as provided by fine-mapping methods.

Source code in src/otg/dataset/study_locus.py
+ Study Locus - Open Targets Genetics       

Study Locus

otg.dataset.study_locus.StudyLocus dataclass

Bases: Dataset

Study-Locus dataset.

This dataset captures associations between studies/traits and genetic loci as provided by fine-mapping methods.

Source code in src/otg/dataset/study_locus.py
@@ -369,7 +369,10 @@
@dataclass
@dataclass
 class StudyLocus(Dataset):
     """Study-Locus dataset.
 
@@ -422,7 +425,7 @@
 
         Args:
             loci_to_overlap (DataFrame): containing `studyLocusId`, `studyType`, `chromosome`, `tagVariantId`, `logABF` and `posteriorProbability` columns.
-            peak_overlaps (DataFrame): containing `left_studyLocusId`, `right_studyLocusId` and `chromosome` columns.
+            peak_overlaps (DataFrame): containing `leftStudyLocusId`, `rightStudyLocusId` and `chromosome` columns.
 
         Returns:
             StudyLocusOverlap: Pairs of overlapping study-locus with aligned tags.
@@ -541,7 +544,10 @@
         """
         self.df = self._df.withColumn(
             "locus",
-            f.expr(f"filter(locus, tag -> (tag.{credible_interval.value}))"),
+            f.filter(
+                f.col("locus"),
+                lambda tag: (tag[credible_interval.value]),
+            ),
         )
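        # The old (f.expr) and new (f.filter) forms above are equivalent. For example, assuming the
        # `locus` array elements carry an `is95CredibleSet` boolean field:
        #   f.expr("filter(locus, tag -> (tag.is95CredibleSet))")          # Spark SQL expression form
        #   f.filter(f.col("locus"), lambda tag: tag["is95CredibleSet"])   # DataFrame API form (Spark >= 3.1)
        # Both keep only the tags that fall inside the requested credible interval.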
         return self
 
@@ -741,10 +747,7 @@
             ),
         )
         return self
-

annotate_credible_sets() -> StudyLocus

Annotate study-locus dataset with credible set flags.

Sorts the array in the locus column elements by their posteriorProbability values in descending order and adds is95CredibleSet and is99CredibleSet fields to the elements, indicating which are the tagging variants whose cumulative sum of their posteriorProbability values is below 0.95 and 0.99, respectively.

Returns:

Name Type Description
StudyLocus StudyLocus

including annotation on is95CredibleSet and is99CredibleSet.

Raises:

Type Description
ValueError

If locus column is not available.

Source code in src/otg/dataset/study_locus.py

annotate_credible_sets() -> StudyLocus

Annotate study-locus dataset with credible set flags.

Sorts the array in the locus column elements by their posteriorProbability values in descending order and adds is95CredibleSet and is99CredibleSet fields to the elements, indicating which are the tagging variants whose cumulative sum of their posteriorProbability values is below 0.95 and 0.99, respectively.

Returns:

Name Type Description
StudyLocus StudyLocus

including annotation on is95CredibleSet and is99CredibleSet.

Raises:

Type Description
ValueError

If locus column is not available.

Source code in src/otg/dataset/study_locus.py
@@ -794,7 +797,10 @@
def annotate_credible_sets(self: StudyLocus) -> StudyLocus:
def annotate_credible_sets(self: StudyLocus) -> StudyLocus:
     """Annotate study-locus dataset with credible set flags.
 
     Sorts the array in the `locus` column elements by their `posteriorProbability` values in descending order and adds
@@ -901,10 +907,7 @@
         <BLANKLINE>
     """
     return f.xxhash64(*[study_id_col, variant_id_col]).alias("studyLocusId")
-

clump() -> StudyLocus

Perform LD clumping of the studyLocus.

Evaluates whether a lead variant is linked to a tag (with lowest p-value) in the same studyLocus dataset.

Returns:

Name Type Description
StudyLocus StudyLocus

with empty credible sets for linked variants and QC flag.

Source code in src/otg/dataset/study_locus.py

clump() -> StudyLocus

Perform LD clumping of the studyLocus.

Evaluates whether a lead variant is linked to a tag (with lowest p-value) in the same studyLocus dataset.

Returns:

Name Type Description
StudyLocus StudyLocus

with empty credible sets for linked variants and QC flag.

Source code in src/otg/dataset/study_locus.py
@@ -934,7 +937,10 @@
def clump(self: StudyLocus) -> StudyLocus:
def clump(self: StudyLocus) -> StudyLocus:
     """Perform LD clumping of the studyLocus.
 
     Evaluates whether a lead variant is linked to a tag (with lowest p-value) in the same studyLocus dataset.
@@ -984,7 +990,10 @@
def filter_credible_set(
def filter_credible_set(
     self: StudyLocus,
     credible_interval: CredibleInterval,
 ) -> StudyLocus:
@@ -998,13 +1007,13 @@
     """
     self.df = self._df.withColumn(
         "locus",
-        f.expr(f"filter(locus, tag -> (tag.{credible_interval.value}))"),
+        f.filter(
+            f.col("locus"),
+            lambda tag: (tag[credible_interval.value]),
+        ),
     )
     return self
-

find_overlaps(study_index: StudyIndex) -> StudyLocusOverlap

Calculate overlapping study-locus.

Find overlapping study-locus that share at least one tagging variant. All GWAS-GWAS and all GWAS-Molecular traits are computed with the Molecular traits always appearing on the right side.

Parameters:

Name Type Description Default
study_index StudyIndex

Study index to resolve study types.

required

Returns:

Name Type Description
StudyLocusOverlap StudyLocusOverlap

Pairs of overlapping study-locus with aligned tags.

Source code in src/otg/dataset/study_locus.py

find_overlaps(study_index: StudyIndex) -> StudyLocusOverlap

Calculate overlapping study-locus.

Find overlapping study-locus that share at least one tagging variant. All GWAS-GWAS and all GWAS-Molecular traits are computed with the Molecular traits always appearing on the right side.

Parameters:

Name Type Description Default
study_index StudyIndex

Study index to resolve study types.

required

Returns:

Name Type Description
StudyLocusOverlap StudyLocusOverlap

Pairs of overlapping study-locus with aligned tags.

Source code in src/otg/dataset/study_locus.py
@@ -1034,7 +1043,10 @@
def find_overlaps(self: StudyLocus, study_index: StudyIndex) -> StudyLocusOverlap:
def find_overlaps(self: StudyLocus, study_index: StudyIndex) -> StudyLocusOverlap:
     """Calculate overlapping study-locus.
 
     Find overlapping study-locus that share at least one tagging variant. All GWAS-GWAS and all GWAS-Molecular traits are computed with the Molecular traits always
@@ -1083,16 +1095,16 @@
         StructType: schema for the StudyLocus dataset.
     """
     return parse_spark_schema("study_locus.json")
-

neglog_pvalue() -> Column

Returns the negative log p-value.

Returns:

Name Type Description
Column Column

Negative log p-value

Source code in src/otg/dataset/study_locus.py

neglog_pvalue() -> Column

Returns the negative log p-value.

Returns:

Name Type Description
Column Column

Negative log p-value

Source code in src/otg/dataset/study_locus.py
def neglog_pvalue(self: StudyLocus) -> Column:
def neglog_pvalue(self: StudyLocus) -> Column:
     """Returns the negative log p-value.
 
     Returns:
@@ -1102,10 +1114,7 @@
         self.df.pValueMantissa,
         self.df.pValueExponent,
     )
-

unique_variants_in_locus() -> DataFrame

All unique variants collected in a StudyLocus dataframe.

Returns:

Name Type Description
DataFrame DataFrame

A dataframe containing variantId and chromosome columns.

Source code in src/otg/dataset/study_locus.py

unique_variants_in_locus() -> DataFrame

All unique variants collected in a StudyLocus dataframe.

Returns:

Name Type Description
DataFrame DataFrame

A dataframe containing variantId and chromosome columns.

Source code in src/otg/dataset/study_locus.py
@@ -1123,7 +1132,10 @@
def unique_variants_in_locus(self: StudyLocus) -> DataFrame:
def unique_variants_in_locus(self: StudyLocus) -> DataFrame:
     """All unique variants collected in a `StudyLocus` dataframe.
 
     Returns:
diff --git a/python_api/dataset/study_locus_overlap/index.html b/python_api/dataset/study_locus_overlap/index.html
index 76bc18763..e91c5eb82 100644
--- a/python_api/dataset/study_locus_overlap/index.html
+++ b/python_api/dataset/study_locus_overlap/index.html
@@ -1,4 +1,4 @@
- Study Locus Overlap - Open Targets Genetics       

Study Locus Overlap

otg.dataset.study_locus_overlap.StudyLocusOverlap dataclass

Bases: Dataset

Study-Locus overlap.

This dataset captures pairs of overlapping StudyLocus: that is associations whose credible sets share at least one tagging variant.

Note

This is a helpful dataset for other downstream analyses, such as colocalisation. This dataset will contain the overlapping signals between studyLocus associations once they have been clumped and fine-mapped.

Source code in src/otg/dataset/study_locus_overlap.py
+ Study Locus Overlap - Open Targets Genetics       

Study Locus Overlap

otg.dataset.study_locus_overlap.StudyLocusOverlap dataclass

Bases: Dataset

Study-Locus overlap.

This dataset captures pairs of overlapping StudyLocus: that is associations whose credible sets share at least one tagging variant.

Note

This is a helpful dataset for other downstream analyses, such as colocalisation. This dataset will contain the overlapping signals between studyLocus associations once they have been clumped and fine-mapped.

Source code in src/otg/dataset/study_locus_overlap.py
diff --git a/python_api/dataset/summary_statistics/index.html b/python_api/dataset/summary_statistics/index.html
index 0982b07fa..81fd1de2e 100644
--- a/python_api/dataset/summary_statistics/index.html
+++ b/python_api/dataset/summary_statistics/index.html
@@ -1,4 +1,4 @@
- Summary Statistics - Open Targets Genetics       

Summary Statistics

otg.dataset.summary_statistics.SummaryStatistics dataclass

Bases: Dataset

Summary Statistics dataset.

A summary statistics dataset contains all single point statistics resulting from a GWAS.

Source code in src/otg/dataset/summary_statistics.py
+ Summary Statistics - Open Targets Genetics       

Summary Statistics

otg.dataset.summary_statistics.SummaryStatistics dataclass

Bases: Dataset

Summary Statistics dataset.

A summary statistics dataset contains all single point statistics resulting from a GWAS.

Source code in src/otg/dataset/summary_statistics.py
diff --git a/python_api/dataset/variant_annotation/index.html b/python_api/dataset/variant_annotation/index.html
index 2d67dd658..94a85c090 100644
--- a/python_api/dataset/variant_annotation/index.html
+++ b/python_api/dataset/variant_annotation/index.html
@@ -1,4 +1,4 @@
- Variant annotation - Open Targets Genetics       

Variant annotation

otg.dataset.variant_annotation.VariantAnnotation dataclass

Bases: Dataset

Dataset with variant-level annotations.

Source code in src/otg/dataset/variant_annotation.py
 21
+ Variant annotation - Open Targets Genetics       

Variant annotation

otg.dataset.variant_annotation.VariantAnnotation dataclass

Bases: Dataset

Dataset with variant-level annotations.

Source code in src/otg/dataset/variant_annotation.py
 21
  22
  23
  24
diff --git a/python_api/dataset/variant_index/index.html b/python_api/dataset/variant_index/index.html
index 66889e7df..bc74a1c83 100644
--- a/python_api/dataset/variant_index/index.html
+++ b/python_api/dataset/variant_index/index.html
@@ -1,4 +1,4 @@
- Variant index - Open Targets Genetics       

Variant index

otg.dataset.variant_index.VariantIndex dataclass

Bases: Dataset

Variant index dataset.

Variant index dataset is the result of intersecting the variant annotation dataset with the variants with V2D available information.

Source code in src/otg/dataset/variant_index.py
20
+ Variant index - Open Targets Genetics       

Variant index

otg.dataset.variant_index.VariantIndex dataclass

Bases: Dataset

Variant index dataset.

Variant index dataset is the result of intersecting the variant annotation dataset with the variants with V2D available information.

Source code in src/otg/dataset/variant_index.py
20
 21
 22
 23
diff --git a/python_api/dataset/variant_to_gene/index.html b/python_api/dataset/variant_to_gene/index.html
index d2cbc4ec5..82d28d391 100644
--- a/python_api/dataset/variant_to_gene/index.html
+++ b/python_api/dataset/variant_to_gene/index.html
@@ -1,6 +1,4 @@
- Variant-to-gene - Open Targets Genetics       

Variant-to-gene

otg.dataset.v2g.V2G dataclass

Bases: Dataset

Variant-to-gene (V2G) evidence dataset.

A variant-to-gene (V2G) evidence is understood as any piece of evidence that supports the association of a variant with a likely causal gene. The evidence can sometimes be context-specific and refer to specific biofeatures (e.g. cell types)

Source code in src/otg/dataset/v2g.py
16
-17
-18
+ Variant-to-gene - Open Targets Genetics       

Variant-to-gene

otg.dataset.v2g.V2G dataclass

Bases: Dataset

Variant-to-gene (V2G) evidence dataset.

A variant-to-gene (V2G) evidence is understood as any piece of evidence that supports the association of a variant with a likely causal gene. The evidence can sometimes be context-specific and refer to specific biofeatures (e.g. cell types)

Source code in src/otg/dataset/v2g.py
18
 19
 20
 21
@@ -24,7 +22,16 @@
 39
 40
 41
-42
@dataclass
+42
+43
+44
+45
+46
+47
+48
+49
+50
+51
@dataclass
 class V2G(Dataset):
     """Variant-to-gene (V2G) evidence dataset.
 
@@ -41,7 +48,7 @@
         return parse_spark_schema("v2g.json")
 
     def filter_by_genes(self: V2G, genes: GeneIndex) -> V2G:
-        """Filter by V2G dataset by genes.
+        """Filter V2G dataset by genes.
 
         Args:
             genes (GeneIndex): Gene index dataset to filter by
@@ -51,9 +58,25 @@
         """
         self.df = self._df.join(genes.df.select("geneId"), on="geneId", how="inner")
         return self
-

filter_by_genes(genes: GeneIndex) -> V2G

Filter by V2G dataset by genes.

Parameters:

Name Type Description Default
genes GeneIndex

Gene index dataset to filter by

required

Returns:

Name Type Description
V2G V2G

V2G dataset filtered by genes

Source code in src/otg/dataset/v2g.py
32
-33
-34
+
+    def extract_distance_tss_minimum(self: V2G) -> None:
+        """Extract minimum distance to TSS."""
+        self.df = self._df.filter(f.col("distance")).withColumn(
+            "distanceTssMinimum",
+            f.expr("min(distTss) OVER (PARTITION BY studyLocusId)"),
+        )
+

extract_distance_tss_minimum() -> None

Extract minimum distance to TSS.

Source code in src/otg/dataset/v2g.py
46
+47
+48
+49
+50
+51
def extract_distance_tss_minimum(self: V2G) -> None:
+    """Extract minimum distance to TSS."""
+    self.df = self._df.filter(f.col("distance")).withColumn(
+        "distanceTssMinimum",
+        f.expr("min(distTss) OVER (PARTITION BY studyLocusId)"),
+    )
+

filter_by_genes(genes: GeneIndex) -> V2G

Filter V2G dataset by genes.

Parameters:

Name Type Description Default
genes GeneIndex

Gene index dataset to filter by

required

Returns:

Name Type Description
V2G V2G

V2G dataset filtered by genes

Source code in src/otg/dataset/v2g.py
34
 35
 36
 37
@@ -61,8 +84,10 @@
 39
 40
 41
-42
def filter_by_genes(self: V2G, genes: GeneIndex) -> V2G:
-    """Filter by V2G dataset by genes.
+42
+43
+44
def filter_by_genes(self: V2G, genes: GeneIndex) -> V2G:
+    """Filter V2G dataset by genes.
 
     Args:
         genes (GeneIndex): Gene index dataset to filter by
@@ -72,14 +97,14 @@
     """
     self.df = self._df.join(genes.df.select("geneId"), on="geneId", how="inner")
     return self
-

get_schema() -> StructType classmethod

Provides the schema for the V2G dataset.

Returns:

Name Type Description
StructType StructType

Schema for the V2G dataset

Source code in src/otg/dataset/v2g.py
23
-24
-25
+

get_schema() -> StructType classmethod

Provides the schema for the V2G dataset.

Returns:

Name Type Description
StructType StructType

Schema for the V2G dataset

Source code in src/otg/dataset/v2g.py
25
 26
 27
 28
 29
-30
@classmethod
+30
+31
+32
@classmethod
 def get_schema(cls: type[V2G]) -> StructType:
     """Provides the schema for the V2G dataset.
 
diff --git a/python_api/datasource/_datasource/index.html b/python_api/datasource/_datasource/index.html
index aa764a836..6a2687481 100644
--- a/python_api/datasource/_datasource/index.html
+++ b/python_api/datasource/_datasource/index.html
@@ -1 +1 @@
- Data Source - Open Targets Genetics       
\ No newline at end of file + Data Source - Open Targets Genetics
\ No newline at end of file diff --git a/python_api/datasource/finngen/_finngen/index.html b/python_api/datasource/finngen/_finngen/index.html index a6ba86bba..5d577fc5a 100644 --- a/python_api/datasource/finngen/_finngen/index.html +++ b/python_api/datasource/finngen/_finngen/index.html @@ -1,4 +1,4 @@ - FinnGen - Open Targets Genetics

Finngen

Finngen

Study Index

otg.datasource.finngen.study_index.FinnGenStudyIndex

Bases: StudyIndex

Study index dataset from FinnGen.

The following information is aggregated/extracted:

  • Study ID in the special format (FINNGEN_R9_*)
  • Trait name (for example, Amoebiasis)
  • Number of cases and controls
  • Link to the summary statistics location

Some fields are also populated as constants, such as study type and the initial sample size.

Source code in src/otg/datasource/finngen/study_index.py
14
+ Study Index - Open Targets Genetics       

Study Index

otg.datasource.finngen.study_index.FinnGenStudyIndex

Bases: StudyIndex

Study index dataset from FinnGen.

The following information is aggregated/extracted:

  • Study ID in the special format (FINNGEN_R9_*)
  • Trait name (for example, Amoebiasis)
  • Number of cases and controls
  • Link to the summary statistics location

Some fields are also populated as constants, such as study type and the initial sample size.

Source code in src/otg/datasource/finngen/study_index.py
14
 15
 16
 17
diff --git a/python_api/datasource/gnomad/_gnomad/index.html b/python_api/datasource/gnomad/_gnomad/index.html
index 5691fbb8d..84794a1e6 100644
--- a/python_api/datasource/gnomad/_gnomad/index.html
+++ b/python_api/datasource/gnomad/_gnomad/index.html
@@ -1,4 +1,4 @@
- GnomAD - Open Targets Genetics      

Gnomad

Gnomad

LD Matrix

otg.datasource.gnomad.ld.GnomADLDMatrix

Importer of LD information from GnomAD.

The information comes from LD matrices made available by GnomAD in Hail's native format. We aggregate the LD information across 8 ancestries. The basic steps to generate the LDIndex are:

  1. Convert a LD matrix to a Spark DataFrame.
  2. Resolve the matrix indices to variant IDs by lifting over the coordinates to GRCh38.
  3. Aggregate the LD information across populations.
Source code in src/otg/datasource/gnomad/ld.py
 20
+ LD Matrix - Open Targets Genetics       

LD Matrix

otg.datasource.gnomad.ld.GnomADLDMatrix

Importer of LD information from GnomAD.

The information comes from LD matrices made available by GnomAD in Hail's native format. We aggregate the LD information across 8 ancestries. The basic steps to generate the LDIndex are:

  1. Convert a LD matrix to a Spark DataFrame.
  2. Resolve the matrix indices to variant IDs by lifting over the coordinates to GRCh38.
  3. Aggregate the LD information across populations.
Source code in src/otg/datasource/gnomad/ld.py
 20
  21
  22
  23
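As a loose illustration of step 1 above (the liftover and aggregation steps are not shown), a GnomAD LD BlockMatrix can be turned into a Spark DataFrame with the public Hail API; the path below is a placeholder, not the pipeline's configuration:

    import hail as hl

    hl.init()
    # Placeholder path; the actual GnomAD LD matrix locations are not shown on this page.
    ld_matrix_path = "gs://<gnomad-ld-bucket>/<population>.ld.bm"
    bm = hl.linalg.BlockMatrix.read(ld_matrix_path)
    # entries() yields a Hail Table of (i, j, entry): row index, column index, r value.
    ld_df = bm.entries().to_spark()
    ld_df.show(5)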
diff --git a/python_api/datasource/gnomad/gnomad_variants/index.html b/python_api/datasource/gnomad/gnomad_variants/index.html
index 05b3fc6d2..d6bd9623b 100644
--- a/python_api/datasource/gnomad/gnomad_variants/index.html
+++ b/python_api/datasource/gnomad/gnomad_variants/index.html
@@ -1,4 +1,4 @@
- Variants - Open Targets Genetics       

Variants

otg.datasource.gnomad.variants.GnomADVariants

GnomAD variants included in the GnomAD genomes dataset.

Source code in src/otg/datasource/gnomad/variants.py
 14
+ Variants - Open Targets Genetics       

Variants

otg.datasource.gnomad.variants.GnomADVariants

GnomAD variants included in the GnomAD genomes dataset.

Source code in src/otg/datasource/gnomad/variants.py
 14
  15
  16
  17
diff --git a/python_api/datasource/gwas_catalog/_gwas_catalog/index.html b/python_api/datasource/gwas_catalog/_gwas_catalog/index.html
index 2d5b3ad21..bcb688cd2 100644
--- a/python_api/datasource/gwas_catalog/_gwas_catalog/index.html
+++ b/python_api/datasource/gwas_catalog/_gwas_catalog/index.html
@@ -1 +1 @@
- GWAS Catalog - Open Targets Genetics      
\ No newline at end of file + GWAS Catalog - Open Targets Genetics
\ No newline at end of file diff --git a/python_api/datasource/gwas_catalog/associations/index.html b/python_api/datasource/gwas_catalog/associations/index.html index 877607400..b810b0adc 100644 --- a/python_api/datasource/gwas_catalog/associations/index.html +++ b/python_api/datasource/gwas_catalog/associations/index.html @@ -1,4 +1,4 @@ - Associations - Open Targets Genetics

Associations

otg.datasource.gwas_catalog.associations.GWASCatalogAssociations dataclass

Bases: StudyLocus

Study-locus dataset derived from GWAS Catalog.

Source code in src/otg/datasource/gwas_catalog/associations.py
  30
+ Associations - Open Targets Genetics       

Associations

otg.datasource.gwas_catalog.associations.GWASCatalogAssociations dataclass

Bases: StudyLocus

Study-locus dataset derived from GWAS Catalog.

Source code in src/otg/datasource/gwas_catalog/associations.py
  30
   31
   32
   33
diff --git a/python_api/datasource/gwas_catalog/study_index/index.html b/python_api/datasource/gwas_catalog/study_index/index.html
index 0592203f7..12dea6e3f 100644
--- a/python_api/datasource/gwas_catalog/study_index/index.html
+++ b/python_api/datasource/gwas_catalog/study_index/index.html
@@ -1,4 +1,4 @@
- Study Index - Open Targets Genetics       

Study Index

otg.datasource.gwas_catalog.study_index.GWASCatalogStudyIndex dataclass

Bases: StudyIndex

Study index from GWAS Catalog.

The following information is harmonised from the GWAS Catalog:

  • All publication related information retained.
  • Mapped measured and background traits parsed.
  • Flagged if harmonized summary statistics datasets available.
  • If available, the ftp path to these files presented.
  • Ancestries from the discovery and replication stages are structured with sample counts.
  • Case/control counts extracted.
  • The number of samples with European ancestry extracted.
Source code in src/otg/datasource/gwas_catalog/study_index.py
 18
+ Study Index - Open Targets Genetics       

Study Index

otg.datasource.gwas_catalog.study_index.GWASCatalogStudyIndex dataclass

Bases: StudyIndex

Study index from GWAS Catalog.

The following information is harmonised from the GWAS Catalog:

  • All publication related information retained.
  • Mapped measured and background traits parsed.
  • Flagged if harmonized summary statistics datasets available.
  • If available, the ftp path to these files presented.
  • Ancestries from the discovery and replication stages are structured with sample counts.
  • Case/control counts extracted.
  • The number of samples with European ancestry extracted.
Source code in src/otg/datasource/gwas_catalog/study_index.py
 18
  19
  20
  21
diff --git a/python_api/datasource/gwas_catalog/study_splitter/index.html b/python_api/datasource/gwas_catalog/study_splitter/index.html
index 37e3065ae..427823c06 100644
--- a/python_api/datasource/gwas_catalog/study_splitter/index.html
+++ b/python_api/datasource/gwas_catalog/study_splitter/index.html
@@ -1,4 +1,4 @@
- Study Splitter - Open Targets Genetics       

Study Splitter

otg.datasource.gwas_catalog.study_splitter.GWASCatalogStudySplitter

Splitting multi-trait GWAS Catalog studies.

Source code in src/otg/datasource/gwas_catalog/study_splitter.py
 17
+ Study Splitter - Open Targets Genetics       

Study Splitter

otg.datasource.gwas_catalog.study_splitter.GWASCatalogStudySplitter

Splitting multi-trait GWAS Catalog studies.

Source code in src/otg/datasource/gwas_catalog/study_splitter.py
 17
  18
  19
  20
diff --git a/python_api/datasource/gwas_catalog/summary_statistics/index.html b/python_api/datasource/gwas_catalog/summary_statistics/index.html
index de86080b9..d284c859f 100644
--- a/python_api/datasource/gwas_catalog/summary_statistics/index.html
+++ b/python_api/datasource/gwas_catalog/summary_statistics/index.html
@@ -1,4 +1,4 @@
- Summary statistics - Open Targets Genetics       

Summary statistics

otg.datasource.gwas_catalog.summary_statistics.GWASCatalogSummaryStatistics dataclass

Bases: SummaryStatistics

GWAS Catalog Summary Statistics reader.

Source code in src/otg/datasource/gwas_catalog/summary_statistics.py
22
+ Summary statistics - Open Targets Genetics       

Summary statistics

otg.datasource.gwas_catalog.summary_statistics.GWASCatalogSummaryStatistics dataclass

Bases: SummaryStatistics

GWAS Catalog Summary Statistics reader.

Source code in src/otg/datasource/gwas_catalog/summary_statistics.py
22
 23
 24
 25
diff --git a/python_api/datasource/intervals/_intervals/index.html b/python_api/datasource/intervals/_intervals/index.html
index 406afca39..86ddbf8e9 100644
--- a/python_api/datasource/intervals/_intervals/index.html
+++ b/python_api/datasource/intervals/_intervals/index.html
@@ -1 +1 @@
- Chromatin intervals - Open Targets Genetics       
\ No newline at end of file + Chromatin intervals - Open Targets Genetics
\ No newline at end of file diff --git a/python_api/datasource/intervals/andersson/index.html b/python_api/datasource/intervals/andersson/index.html index 196d5a8a9..7c096a4cf 100644 --- a/python_api/datasource/intervals/andersson/index.html +++ b/python_api/datasource/intervals/andersson/index.html @@ -1,4 +1,4 @@ - Andersson et al. - Open Targets Genetics

Andersson et al.

otg.datasource.intervals.andersson.IntervalsAndersson

Bases: Intervals

Interval dataset from Andersson et al. 2014.

Source code in src/otg/datasource/intervals/andersson.py
 21
+ Andersson et al. - Open Targets Genetics       

Andersson et al.

otg.datasource.intervals.andersson.IntervalsAndersson

Bases: Intervals

Interval dataset from Andersson et al. 2014.

Source code in src/otg/datasource/intervals/andersson.py
 21
  22
  23
  24
diff --git a/python_api/datasource/intervals/javierre/index.html b/python_api/datasource/intervals/javierre/index.html
index e14181275..853a1bdfa 100644
--- a/python_api/datasource/intervals/javierre/index.html
+++ b/python_api/datasource/intervals/javierre/index.html
@@ -1,4 +1,4 @@
- Javierre et al. - Open Targets Genetics       

Javierre et al.

otg.datasource.intervals.javierre.IntervalsJavierre

Bases: Intervals

Interval dataset from Javierre et al. 2016.

Source code in src/otg/datasource/intervals/javierre.py
 18
+ Javierre et al. - Open Targets Genetics       

Javierre et al.

otg.datasource.intervals.javierre.IntervalsJavierre

Bases: Intervals

Interval dataset from Javierre et al. 2016.

Source code in src/otg/datasource/intervals/javierre.py
 18
  19
  20
  21
diff --git a/python_api/datasource/intervals/jung/index.html b/python_api/datasource/intervals/jung/index.html
index eceb217f0..b015d800c 100644
--- a/python_api/datasource/intervals/jung/index.html
+++ b/python_api/datasource/intervals/jung/index.html
@@ -1,4 +1,4 @@
- Jung et al. - Open Targets Genetics       

Jung et al.

otg.datasource.intervals.jung.IntervalsJung

Bases: Intervals

Interval dataset from Jung et al. 2019.

Source code in src/otg/datasource/intervals/jung.py
 18
+ Jung et al. - Open Targets Genetics       

Jung et al.

otg.datasource.intervals.jung.IntervalsJung

Bases: Intervals

Interval dataset from Jung et al. 2019.

Source code in src/otg/datasource/intervals/jung.py
 18
  19
  20
  21
diff --git a/python_api/datasource/intervals/thurman/index.html b/python_api/datasource/intervals/thurman/index.html
index fe9f2c5d6..f2bda9b95 100644
--- a/python_api/datasource/intervals/thurman/index.html
+++ b/python_api/datasource/intervals/thurman/index.html
@@ -1,4 +1,4 @@
- Thurman et al. - Open Targets Genetics       

Thurman et al.

otg.datasource.intervals.thurman.IntervalsThurman

Bases: Intervals

Interval dataset from Thurman et al. 2012.

Source code in src/otg/datasource/intervals/thurman.py
 18
+ Thurman et al. - Open Targets Genetics       

Thurman et al.

otg.datasource.intervals.thurman.IntervalsThurman

Bases: Intervals

Interval dataset from Thurman et al. 2012.

Source code in src/otg/datasource/intervals/thurman.py
 18
  19
  20
  21
diff --git a/python_api/datasource/open_targets/_open_targets/index.html b/python_api/datasource/open_targets/_open_targets/index.html
index 52b85efde..7d1adea03 100644
--- a/python_api/datasource/open_targets/_open_targets/index.html
+++ b/python_api/datasource/open_targets/_open_targets/index.html
@@ -1,4 +1,4 @@
- Open Targets - Open Targets Genetics      

Open targets

Open targets

L2G Gold Standard

otg.datasource.open_targets.l2g_gold_standard.OpenTargetsL2GGoldStandard

Parser for OTGenetics locus to gene gold standards curation.

The curation is processed to generate a dataset with 2 labels
  • Gold Standard Positive (GSP): Variant is within 500kb of gene
  • Gold Standard Negative (GSN): Variant is not within 500kb of gene
Source code in src/otg/datasource/open_targets/l2g_gold_standard.py
 14
+ 15
+ 16
+ 17
+ 18
+ 19
+ 20
+ 21
+ 22
+ 23
+ 24
+ 25
+ 26
+ 27
+ 28
+ 29
+ 30
+ 31
+ 32
+ 33
+ 34
+ 35
+ 36
+ 37
+ 38
+ 39
+ 40
+ 41
+ 42
+ 43
+ 44
+ 45
+ 46
+ 47
+ 48
+ 49
+ 50
+ 51
+ 52
+ 53
+ 54
+ 55
+ 56
+ 57
+ 58
+ 59
+ 60
+ 61
+ 62
+ 63
+ 64
+ 65
+ 66
+ 67
+ 68
+ 69
+ 70
+ 71
+ 72
+ 73
+ 74
+ 75
+ 76
+ 77
+ 78
+ 79
+ 80
+ 81
+ 82
+ 83
+ 84
+ 85
+ 86
+ 87
+ 88
+ 89
+ 90
+ 91
+ 92
+ 93
+ 94
+ 95
+ 96
+ 97
+ 98
+ 99
+100
+101
+102
+103
+104
+105
+106
+107
+108
+109
+110
+111
+112
+113
+114
+115
+116
+117
+118
+119
+120
+121
+122
+123
+124
+125
+126
+127
+128
+129
+130
+131
+132
+133
+134
+135
+136
+137
class OpenTargetsL2GGoldStandard:
+    """Parser for OTGenetics locus to gene gold standards curation.
+
+    The curation is processed to generate a dataset with 2 labels:
+        - Gold Standard Positive (GSP): Variant is within 500kb of gene
+        - Gold Standard Negative (GSN): Variant is not within 500kb of gene
+    """
+
+    @staticmethod
+    def process_gene_interactions(interactions: DataFrame) -> DataFrame:
+        """Extract top scoring gene-gene interaction from the interactions dataset of the Platform.
+
+        Args:
+            interactions (DataFrame): Gene-gene interactions dataset
+
+        Returns:
+            DataFrame: Top scoring gene-gene interaction per pair of genes
+        """
+        return get_record_with_maximum_value(
+            interactions,
+            ["targetA", "targetB"],
+            "scoring",
+        ).selectExpr(
+            "targetA as geneIdA",
+            "targetB as geneIdB",
+            "scoring as score",
+        )
+
+    @classmethod
+    def as_l2g_gold_standard(
+        cls: type[OpenTargetsL2GGoldStandard],
+        gold_standard_curation: DataFrame,
+        v2g: V2G,
+        study_locus_overlap: StudyLocusOverlap,
+        interactions: DataFrame,
+    ) -> L2GGoldStandard:
+        """Initialise L2GGoldStandard from source dataset.
+
+        Args:
+            gold_standard_curation (DataFrame): Gold standard curation dataframe, extracted from https://github.com/opentargets/genetics-gold-standards
+            v2g (V2G): Variant to gene dataset to bring distance between a variant and a gene's TSS
+            study_locus_overlap (StudyLocusOverlap): Study locus overlap dataset to remove duplicated loci
+            interactions (DataFrame): Gene-gene interactions dataset to remove negative cases where the gene interacts with a positive gene
+
+        Returns:
+            L2GGoldStandard: L2G Gold Standard dataset
+        """
+        overlaps_df = study_locus_overlap._df.select(
+            "leftStudyLocusId", "rightStudyLocusId"
+        )
+        interactions_df = cls.process_gene_interactions(interactions)
+        return L2GGoldStandard(
+            _df=(
+                gold_standard_curation.filter(
+                    f.col("gold_standard_info.highest_confidence").isin(
+                        ["High", "Medium"]
+                    )
+                )
+                .select(
+                    f.col("association_info.otg_id").alias("studyId"),
+                    f.col("gold_standard_info.gene_id").alias("geneId"),
+                    f.concat_ws(
+                        "_",
+                        f.col("sentinel_variant.locus_GRCh38.chromosome"),
+                        f.col("sentinel_variant.locus_GRCh38.position"),
+                        f.col("sentinel_variant.alleles.reference"),
+                        f.col("sentinel_variant.alleles.alternative"),
+                    ).alias("variantId"),
+                    f.col("metadata.set_label").alias("source"),
+                )
+                .withColumn(
+                    "studyLocusId",
+                    StudyLocus.assign_study_locus_id("studyId", "variantId"),
+                )
+                .groupBy("studyLocusId", "studyId", "variantId", "geneId")
+                .agg(
+                    f.collect_set("source").alias("sources"),
+                )
+                # Assign Positive or Negative Status based on confidence
+                .join(
+                    v2g.df.filter(f.col("distance").isNotNull()).select(
+                        "variantId", "geneId", "distance"
+                    ),
+                    on=["variantId", "geneId"],
+                    how="inner",
+                )
+                .withColumn(
+                    "goldStandardSet",
+                    f.when(f.col("distance") <= 500_000, f.lit("positive")).otherwise(
+                        f.lit("negative")
+                    ),
+                )
+                # Remove redundant loci by testing they are truly independent
+                .alias("left")
+                .join(
+                    overlaps_df.alias("right"),
+                    (f.col("left.variantId") == f.col("right.leftStudyLocusId"))
+                    | (f.col("left.variantId") == f.col("right.rightStudyLocusId")),
+                    how="left",
+                )
+                .distinct()
+                # Remove redundant genes by testing they do not interact with a positive gene
+                .join(
+                    interactions_df.alias("interactions"),
+                    (f.col("left.geneId") == f.col("interactions.geneIdA"))
+                    | (f.col("left.geneId") == f.col("interactions.geneIdB")),
+                    how="left",
+                )
+                .withColumn("interacting", (f.col("score") > 0.7))
+                # filter out genes where geneIdA has goldStandardSet negative but geneIdA and gene IdB are interacting
+                .filter(
+                    ~(
+                        (f.col("goldStandardSet") == 0)
+                        & (f.col("interacting"))
+                        & (
+                            (f.col("left.geneId") == f.col("interactions.geneIdA"))
+                            | (f.col("left.geneId") == f.col("interactions.geneIdB"))
+                        )
+                    )
+                )
+                .select("studyLocusId", "geneId", "goldStandardSet", "sources")
+            ),
+            _schema=L2GGoldStandard.get_schema(),
+        )
+

as_l2g_gold_standard(gold_standard_curation: DataFrame, v2g: V2G, study_locus_overlap: StudyLocusOverlap, interactions: DataFrame) -> L2GGoldStandard classmethod

Initialise L2GGoldStandard from source dataset.

Parameters:

Name Type Description Default
gold_standard_curation DataFrame

Gold standard curation dataframe, extracted from https://github.com/opentargets/genetics-gold-standards

required
v2g V2G

Variant to gene dataset to bring distance between a variant and a gene's TSS

required
study_locus_overlap StudyLocusOverlap

Study locus overlap dataset to remove duplicated loci

required
interactions DataFrame

Gene-gene interactions dataset to remove negative cases where the gene interacts with a positive gene

required

Returns:

Name Type Description
L2GGoldStandard L2GGoldStandard

L2G Gold Standard dataset

Source code in src/otg/datasource/open_targets/l2g_gold_standard.py
 42
+ 43
+ 44
+ 45
+ 46
+ 47
+ 48
+ 49
+ 50
+ 51
+ 52
+ 53
+ 54
+ 55
+ 56
+ 57
+ 58
+ 59
+ 60
+ 61
+ 62
+ 63
+ 64
+ 65
+ 66
+ 67
+ 68
+ 69
+ 70
+ 71
+ 72
+ 73
+ 74
+ 75
+ 76
+ 77
+ 78
+ 79
+ 80
+ 81
+ 82
+ 83
+ 84
+ 85
+ 86
+ 87
+ 88
+ 89
+ 90
+ 91
+ 92
+ 93
+ 94
+ 95
+ 96
+ 97
+ 98
+ 99
+100
+101
+102
+103
+104
+105
+106
+107
+108
+109
+110
+111
+112
+113
+114
+115
+116
+117
+118
+119
+120
+121
+122
+123
+124
+125
+126
+127
+128
+129
+130
+131
+132
+133
+134
+135
+136
+137
@classmethod
+def as_l2g_gold_standard(
+    cls: type[OpenTargetsL2GGoldStandard],
+    gold_standard_curation: DataFrame,
+    v2g: V2G,
+    study_locus_overlap: StudyLocusOverlap,
+    interactions: DataFrame,
+) -> L2GGoldStandard:
+    """Initialise L2GGoldStandard from source dataset.
+
+    Args:
+        gold_standard_curation (DataFrame): Gold standard curation dataframe, extracted from https://github.com/opentargets/genetics-gold-standards
+        v2g (V2G): Variant to gene dataset to bring distance between a variant and a gene's TSS
+        study_locus_overlap (StudyLocusOverlap): Study locus overlap dataset to remove duplicated loci
+        interactions (DataFrame): Gene-gene interactions dataset to remove negative cases where the gene interacts with a positive gene
+
+    Returns:
+        L2GGoldStandard: L2G Gold Standard dataset
+    """
+    overlaps_df = study_locus_overlap._df.select(
+        "leftStudyLocusId", "rightStudyLocusId"
+    )
+    interactions_df = cls.process_gene_interactions(interactions)
+    return L2GGoldStandard(
+        _df=(
+            gold_standard_curation.filter(
+                f.col("gold_standard_info.highest_confidence").isin(
+                    ["High", "Medium"]
+                )
+            )
+            .select(
+                f.col("association_info.otg_id").alias("studyId"),
+                f.col("gold_standard_info.gene_id").alias("geneId"),
+                f.concat_ws(
+                    "_",
+                    f.col("sentinel_variant.locus_GRCh38.chromosome"),
+                    f.col("sentinel_variant.locus_GRCh38.position"),
+                    f.col("sentinel_variant.alleles.reference"),
+                    f.col("sentinel_variant.alleles.alternative"),
+                ).alias("variantId"),
+                f.col("metadata.set_label").alias("source"),
+            )
+            .withColumn(
+                "studyLocusId",
+                StudyLocus.assign_study_locus_id("studyId", "variantId"),
+            )
+            .groupBy("studyLocusId", "studyId", "variantId", "geneId")
+            .agg(
+                f.collect_set("source").alias("sources"),
+            )
+            # Assign Positive or Negative Status based on confidence
+            .join(
+                v2g.df.filter(f.col("distance").isNotNull()).select(
+                    "variantId", "geneId", "distance"
+                ),
+                on=["variantId", "geneId"],
+                how="inner",
+            )
+            .withColumn(
+                "goldStandardSet",
+                f.when(f.col("distance") <= 500_000, f.lit("positive")).otherwise(
+                    f.lit("negative")
+                ),
+            )
+            # Remove redundant loci by testing they are truly independent
+            .alias("left")
+            .join(
+                overlaps_df.alias("right"),
+                (f.col("left.variantId") == f.col("right.leftStudyLocusId"))
+                | (f.col("left.variantId") == f.col("right.rightStudyLocusId")),
+                how="left",
+            )
+            .distinct()
+            # Remove redundant genes by testing they do not interact with a positive gene
+            .join(
+                interactions_df.alias("interactions"),
+                (f.col("left.geneId") == f.col("interactions.geneIdA"))
+                | (f.col("left.geneId") == f.col("interactions.geneIdB")),
+                how="left",
+            )
+            .withColumn("interacting", (f.col("score") > 0.7))
+            # filter out genes where geneIdA has goldStandardSet negative but geneIdA and gene IdB are interacting
+            .filter(
+                ~(
+                    (f.col("goldStandardSet") == 0)
+                    & (f.col("interacting"))
+                    & (
+                        (f.col("left.geneId") == f.col("interactions.geneIdA"))
+                        | (f.col("left.geneId") == f.col("interactions.geneIdB"))
+                    )
+                )
+            )
+            .select("studyLocusId", "geneId", "goldStandardSet", "sources")
+        ),
+        _schema=L2GGoldStandard.get_schema(),
+    )
+

process_gene_interactions(interactions: DataFrame) -> DataFrame staticmethod

Extract top scoring gene-gene interaction from the interactions dataset of the Platform.

Parameters:

Name Type Description Default
interactions DataFrame

Gene-gene interactions dataset

required

Returns:

Name Type Description
DataFrame DataFrame

Top scoring gene-gene interaction per pair of genes

Source code in src/otg/datasource/open_targets/l2g_gold_standard.py
22
+23
+24
+25
+26
+27
+28
+29
+30
+31
+32
+33
+34
+35
+36
+37
+38
+39
+40
@staticmethod
+def process_gene_interactions(interactions: DataFrame) -> DataFrame:
+    """Extract top scoring gene-gene interaction from the interactions dataset of the Platform.
+
+    Args:
+        interactions (DataFrame): Gene-gene interactions dataset
+
+    Returns:
+        DataFrame: Top scoring gene-gene interaction per pair of genes
+    """
+    return get_record_with_maximum_value(
+        interactions,
+        ["targetA", "targetB"],
+        "scoring",
+    ).selectExpr(
+        "targetA as geneIdA",
+        "targetB as geneIdB",
+        "scoring as score",
+    )
+

\ No newline at end of file diff --git a/python_api/datasource/open_targets/target/index.html b/python_api/datasource/open_targets/target/index.html index dac5c9e95..3509b1293 100644 --- a/python_api/datasource/open_targets/target/index.html +++ b/python_api/datasource/open_targets/target/index.html @@ -1,4 +1,4 @@ - Target - Open Targets Genetics

Target

otg.datasource.open_targets.target.OpenTargetsTarget

Parser for OTPlatform target dataset.

Genomic data from Open Targets provides gene identification and genomic coordinates that are integrated into the gene index of our ETL pipeline.

The EMBL-EBI Ensembl database is used as a source for human targets in the Platform, with the Ensembl gene ID as the primary identifier. The criteria for target inclusion are: - Genes from all biotypes encoded in canonical chromosomes - Genes in alternative assemblies encoding for a reviewed protein product.

Source code in src/otg/datasource/open_targets/target.py
10
+ Target - Open Targets Genetics       

Target

otg.datasource.open_targets.target.OpenTargetsTarget

Parser for OTPlatform target dataset.

Genomic data from Open Targets provides gene identification and genomic coordinates that are integrated into the gene index of our ETL pipeline.

The EMBL-EBI Ensembl database is used as a source for human targets in the Platform, with the Ensembl gene ID as the primary identifier. The criteria for target inclusion are: - Genes from all biotypes encoded in canonical chromosomes - Genes in alternative assemblies encoding for a reviewed protein product.

Source code in src/otg/datasource/open_targets/target.py
10
 11
 12
 13
diff --git a/python_api/datasource/ukbiobank/_ukbiobank/index.html b/python_api/datasource/ukbiobank/_ukbiobank/index.html
index c0aa9da30..70f3ab8a4 100644
--- a/python_api/datasource/ukbiobank/_ukbiobank/index.html
+++ b/python_api/datasource/ukbiobank/_ukbiobank/index.html
@@ -1,4 +1,4 @@
- UK Biobank - Open Targets Genetics      

Ukbiobank

Ukbiobank

Study Index

otg.datasource.ukbiobank.study_index.UKBiobankStudyIndex

Bases: StudyIndex

Study index dataset from UKBiobank.

The following information is extracted:

  • studyId
  • pubmedId
  • publicationDate
  • publicationJournal
  • publicationTitle
  • publicationFirstAuthor
  • traitFromSource
  • ancestry_discoverySamples
  • ancestry_replicationSamples
  • initialSampleSize
  • nCases
  • replicationSamples

Some fields are populated as constants, such as projectID, studyType, and initial sample size.

Source code in src/otg/datasource/ukbiobank/study_index.py
 14
+ Study Index - Open Targets Genetics       

Study Index

otg.datasource.ukbiobank.study_index.UKBiobankStudyIndex

Bases: StudyIndex

Study index dataset from UKBiobank.

The following information is extracted:

  • studyId
  • pubmedId
  • publicationDate
  • publicationJournal
  • publicationTitle
  • publicationFirstAuthor
  • traitFromSource
  • ancestry_discoverySamples
  • ancestry_replicationSamples
  • initialSampleSize
  • nCases
  • replicationSamples

Some fields are populated as constants, such as projectID, studyType, and initial sample size.

Source code in src/otg/datasource/ukbiobank/study_index.py
 14
  15
  16
  17
diff --git a/python_api/method/_method/index.html b/python_api/method/_method/index.html
index c55794c1a..acc103189 100644
--- a/python_api/method/_method/index.html
+++ b/python_api/method/_method/index.html
@@ -1 +1 @@
- Method - Open Targets Genetics       
\ No newline at end of file + Method - Open Targets Genetics
\ No newline at end of file diff --git a/python_api/method/clumping/index.html b/python_api/method/clumping/index.html index 7b6f08d4b..0b3a70401 100644 --- a/python_api/method/clumping/index.html +++ b/python_api/method/clumping/index.html @@ -1,4 +1,4 @@ - Clumping - Open Targets Genetics

Clumping

Clumping is a commonly used post-processing method that allows for identification of independent association signals from GWAS summary statistics and curated associations. This process is critical because of the complex linkage disequilibrium (LD) structure in human populations, which can result in multiple statistically significant associations within the same genomic region. Clumping methods help reduce redundancy in GWAS results and ensure that each reported association represents an independent signal.

We have implemented 2 clumping methods:

otg.method.clump.LDclumping

LD clumping reports the most significant genetic associations in a region in terms of a smaller number of “clumps” of genetically linked SNPs.

Source code in src/otg/method/clump.py
17
+ Clumping - Open Targets Genetics       

Clumping

Clumping is a commonly used post-processing method that allows for identification of independent association signals from GWAS summary statistics and curated associations. This process is critical because of the complex linkage disequilibrium (LD) structure in human populations, which can result in multiple statistically significant associations within the same genomic region. Clumping methods help reduce redundancy in GWAS results and ensure that each reported association represents an independent signal.

We have implemented 2 clumping methods:

otg.method.clump.LDclumping

LD clumping reports the most significant genetic associations in a region in terms of a smaller number of “clumps” of genetically linked SNPs.

Source code in src/otg/method/clump.py
17
 18
 19
 20
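For orientation, the general idea behind LD clumping can be sketched in a few lines of plain Python. This greedy scheme (keep the most significant SNP, drop SNPs in LD with it, repeat) is a generic illustration, not the otg.method.clump.LDclumping code:

    def ld_clump(snps, r2, r2_threshold=0.5):
        """snps: list of (snp_id, p_value); r2: dict mapping (snp_a, snp_b) -> r^2."""
        remaining = sorted(snps, key=lambda s: s[1])  # most significant first
        lead_snps = []
        while remaining:
            lead, _ = remaining[0]
            lead_snps.append(lead)
            # keep only SNPs not in LD with the current lead
            remaining = [
                (s, p)
                for s, p in remaining[1:]
                if r2.get((lead, s), r2.get((s, lead), 0.0)) < r2_threshold
            ]
        return lead_snps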
diff --git a/python_api/method/coloc/index.html b/python_api/method/coloc/index.html
index f93213728..a1201ad40 100644
--- a/python_api/method/coloc/index.html
+++ b/python_api/method/coloc/index.html
@@ -1,4 +1,4 @@
- Coloc - Open Targets Genetics       

Coloc

otg.method.colocalisation.Coloc

Calculate bayesian colocalisation based on overlapping signals from credible sets.

Based on the R COLOC package, which uses the Bayes factors from the credible set to estimate the posterior probability of colocalisation. This method makes the simplifying assumption that only one single causal variant exists for any given trait in any genomic region.

Hypothesis Description
H0 no association with either trait in the region
H1 association with trait 1 only
H2 association with trait 2 only
H3 both traits are associated, but have different single causal variants
H4 both traits are associated and share the same single causal variant

Approximate Bayes factors required

Coloc requires the availability of approximate Bayes factors (ABF) for each variant in the credible set (logABF column).

Source code in src/otg/method/colocalisation.py
 89
+ Coloc - Open Targets Genetics       

Coloc

otg.method.colocalisation.Coloc

Calculate bayesian colocalisation based on overlapping signals from credible sets.

Based on the R COLOC package, which uses the Bayes factors from the credible set to estimate the posterior probability of colocalisation. This method makes the simplifying assumption that only one single causal variant exists for any given trait in any genomic region.

Hypothesis Description
H0 no association with either trait in the region
H1 association with trait 1 only
H2 association with trait 2 only
H3 both traits are associated, but have different single causal variants
H4 both traits are associated and share the same single causal variant

Approximate Bayes factors required

Coloc requires the availability of approximate Bayes factors (ABF) for each variant in the credible set (logABF column).

Source code in src/otg/method/colocalisation.py
 89
  90
  91
  92
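The hypothesis table above maps onto the standard COLOC calculation from per-variant log approximate Bayes factors. The sketch below follows the published formulation with conventional default priors (p1, p2, p12) and is an illustration rather than the repository's implementation:

    import numpy as np
    from scipy.special import logsumexp

    def coloc_posteriors(logabf1, logabf2, p1=1e-4, p2=1e-4, p12=1e-5):
        """Posterior probabilities PP(H0..H4) from per-variant log ABFs of two traits."""
        logabf1, logabf2 = np.asarray(logabf1), np.asarray(logabf2)
        lsum1 = logsumexp(logabf1)              # trait 1 association evidence (H1)
        lsum2 = logsumexp(logabf2)              # trait 2 association evidence (H2)
        lsum12 = logsumexp(logabf1 + logabf2)   # shared causal variant evidence (H4)
        lh0 = 0.0
        lh1 = np.log(p1) + lsum1
        lh2 = np.log(p2) + lsum2
        # H3: different causal variants = all cross terms minus the shared-variant diagonal
        lh3 = np.log(p1) + np.log(p2) + lsum12 + np.log(np.expm1(lsum1 + lsum2 - lsum12))
        lh4 = np.log(p12) + lsum12
        lh = np.array([lh0, lh1, lh2, lh3, lh4])
        return np.exp(lh - logsumexp(lh))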
diff --git a/python_api/method/ecaviar/index.html b/python_api/method/ecaviar/index.html
index 428a288a3..76e459ac5 100644
--- a/python_api/method/ecaviar/index.html
+++ b/python_api/method/ecaviar/index.html
@@ -1,4 +1,4 @@
- eCAVIAR - Open Targets Genetics       

eCAVIAR

otg.method.colocalisation.ECaviar

ECaviar-based colocalisation analysis.

It extends CAVIAR framework to explicitly estimate the posterior probability that the same variant is causal in 2 studies while accounting for the uncertainty of LD. eCAVIAR computes the colocalization posterior probability (CLPP) by utilizing the marginal posterior probabilities. This framework allows for multiple variants to be causal in a single locus.

Source code in src/otg/method/colocalisation.py
22
+ eCAVIAR - Open Targets Genetics       

eCAVIAR

otg.method.colocalisation.ECaviar

ECaviar-based colocalisation analysis.

It extends CAVIAR framework to explicitly estimate the posterior probability that the same variant is causal in 2 studies while accounting for the uncertainty of LD. eCAVIAR computes the colocalization posterior probability (CLPP) by utilizing the marginal posterior probabilities. This framework allows for multiple variants to be causal in a single locus.

Source code in src/otg/method/colocalisation.py
22
 23
 24
 25
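In the same hedged spirit, the CLPP mentioned above is, in the eCAVIAR formulation, the product of the two studies' per-variant posterior causal probabilities; a one-function numpy sketch (locus-level summaries such as a sum or maximum over the credible set are left to the caller):

    import numpy as np

    def clpp_per_variant(pp_study1, pp_study2):
        """Per-variant CLPP: product of posterior causal probabilities from two studies."""
        return np.asarray(pp_study1) * np.asarray(pp_study2)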
diff --git a/python_api/method/l2g/_l2g/index.html b/python_api/method/l2g/_l2g/index.html
new file mode 100644
index 000000000..820a18d08
--- /dev/null
+++ b/python_api/method/l2g/_l2g/index.html
@@ -0,0 +1 @@
+ Locus to Gene (L2G) classifier - Open Targets Genetics      
\ No newline at end of file diff --git a/python_api/method/l2g/evaluator/index.html b/python_api/method/l2g/evaluator/index.html new file mode 100644 index 000000000..c46df596e --- /dev/null +++ b/python_api/method/l2g/evaluator/index.html @@ -0,0 +1,450 @@ + W&B evaluator - Open Targets Genetics

W&B evaluator

otg.method.l2g.evaluator.WandbEvaluator

Bases: Evaluator

Wrapper for pyspark Evaluators. It is expected that the user will provide an Evaluator, and this wrapper will log metrics from said evaluator to W&B.

Source code in src/otg/method/l2g/evaluator.py
 21
+ 22
+ 23
+ 24
+ 25
+ 26
+ 27
+ 28
+ 29
+ 30
+ 31
+ 32
+ 33
+ 34
+ 35
+ 36
+ 37
+ 38
+ 39
+ 40
+ 41
+ 42
+ 43
+ 44
+ 45
+ 46
+ 47
+ 48
+ 49
+ 50
+ 51
+ 52
+ 53
+ 54
+ 55
+ 56
+ 57
+ 58
+ 59
+ 60
+ 61
+ 62
+ 63
+ 64
+ 65
+ 66
+ 67
+ 68
+ 69
+ 70
+ 71
+ 72
+ 73
+ 74
+ 75
+ 76
+ 77
+ 78
+ 79
+ 80
+ 81
+ 82
+ 83
+ 84
+ 85
+ 86
+ 87
+ 88
+ 89
+ 90
+ 91
+ 92
+ 93
+ 94
+ 95
+ 96
+ 97
+ 98
+ 99
+100
+101
+102
+103
+104
+105
+106
+107
+108
+109
+110
+111
+112
+113
+114
+115
+116
+117
+118
+119
+120
+121
+122
+123
+124
+125
+126
+127
+128
+129
+130
+131
+132
+133
+134
+135
+136
+137
+138
+139
+140
+141
+142
+143
+144
+145
+146
+147
+148
+149
+150
+151
+152
+153
+154
+155
+156
+157
+158
+159
+160
+161
+162
+163
+164
+165
+166
+167
+168
+169
+170
+171
+172
+173
+174
+175
+176
+177
+178
+179
+180
+181
+182
+183
+184
+185
+186
+187
+188
+189
+190
+191
+192
+193
+194
+195
+196
+197
+198
+199
+200
+201
+202
+203
+204
+205
+206
class WandbEvaluator(Evaluator):
+    """Wrapper for pyspark Evaluators. It is expected that the user will provide an Evaluators, and this wrapper will log metrics from said evaluator to W&B."""
+
+    spark_ml_evaluator: Param = Param(
+        Params._dummy(), "spark_ml_evaluator", "evaluator from pyspark.ml.evaluation"  # type: ignore
+    )
+
+    wandb_run: Param = Param(
+        Params._dummy(),  # type: ignore
+        "wandb_run",
+        "wandb run.  Expects an already initialized run.  You should set this, or wandb_run_kwargs, NOT BOTH",
+    )
+
+    wandb_run_kwargs: Param = Param(
+        Params._dummy(),
+        "wandb_run_kwargs",
+        "kwargs to be passed to wandb.init.  You should set this, or wandb_runId, NOT BOTH.  Setting this is useful when using with WandbCrossValdidator",
+    )
+
+    wandb_runId: Param = Param(  # noqa: N815
+        Params._dummy(),  # type: ignore
+        "wandb_runId",
+        "wandb run id.  if not providing an intialized run to wandb_run, a run with id wandb_runId will be resumed",
+    )
+
+    wandb_project_name: Param = Param(
+        Params._dummy(),
+        "wandb_project_name",
+        "name of W&B project",
+        typeConverter=TypeConverters.toString,
+    )
+
+    label_values: Param = Param(
+        Params._dummy(),
+        "label_values",
+        "for classification and multiclass classification, this is a list of values the label can assume\nIf provided Multiclass or Multilabel evaluator without label_values, we'll figure it out from dataset passed through to evaluate.",
+    )
+
+    _input_kwargs: Dict[str, Any]
+
+    @keyword_only
+    def __init__(
+        self: WandbEvaluator,
+        *,
+        label_values: list,
+        wandb_run: wandb.sdk.wandb_run.Run | None = None,
+        spark_ml_evaluator: Evaluator | None = None,
+    ) -> None:
+        """Initialize a WandbEvaluator.
+
+        Args:
+            label_values (list): List of label values.
+            wandb_run (wandb.sdk.wandb_run.Run | None): Wandb run object. Defaults to None.
+            spark_ml_evaluator (Evaluator | None): Spark ML evaluator. Defaults to None.
+        """
+        if label_values is None:
+            label_values = []
+        super(Evaluator, self).__init__()
+
+        self.metrics = {
+            MulticlassClassificationEvaluator: [
+                "f1",
+                "accuracy",
+                "weightedPrecision",
+                "weightedRecall",
+                "weightedTruePositiveRate",
+                "weightedFalsePositiveRate",
+                "weightedFMeasure",
+                "truePositiveRateByLabel",
+                "falsePositiveRateByLabel",
+                "precisionByLabel",
+                "recallByLabel",
+                "fMeasureByLabel",
+                "logLoss",
+                "hammingLoss",
+            ],
+            BinaryClassificationEvaluator: ["areaUnderROC", "areaUnderPR"],
+        }
+
+        self._setDefault(label_values=[])
+        kwargs = self._input_kwargs
+        self._set(**kwargs)
+
+    def setspark_ml_evaluator(self: WandbEvaluator, value: Evaluator) -> None:
+        """Set the spark_ml_evaluator parameter.
+
+        Args:
+            value (Evaluator): Spark ML evaluator.
+        """
+        self._set(spark_ml_evaluator=value)
+
+    def setlabel_values(self: WandbEvaluator, value: list) -> None:
+        """Set the label_values parameter.
+
+        Args:
+            value (list): List of label values.
+        """
+        self._set(label_values=value)
+
+    def getspark_ml_evaluator(self: WandbEvaluator) -> Evaluator:
+        """Get the spark_ml_evaluator parameter.
+
+        Returns:
+            Evaluator: Spark ML evaluator.
+        """
+        return self.getOrDefault(self.spark_ml_evaluator)
+
+    def getwandb_run(self: WandbEvaluator) -> wandb.sdk.wandb_run.Run:
+        """Get the wandb_run parameter.
+
+        Returns:
+            wandb.sdk.wandb_run.Run: Wandb run object.
+        """
+        return self.getOrDefault(self.wandb_run)
+
+    def getwandb_project_name(self: WandbEvaluator) -> str:
+        """Get the wandb_project_name parameter.
+
+        Returns:
+            str: Name of the W&B project.
+        """
+        return self.getOrDefault(self.wandb_project_name)
+
+    def getlabel_values(self: WandbEvaluator) -> list:
+        """Get the label_values parameter.
+
+        Returns:
+            list: List of label values.
+        """
+        return self.getOrDefault(self.label_values)
+
+    def _evaluate(self: WandbEvaluator, dataset: DataFrame) -> float:
+        """Evaluate the model on the given dataset.
+
+        Args:
+            dataset (DataFrame): Dataset to evaluate the model on.
+
+        Returns:
+            float: Metric value.
+        """
+        dataset.persist()
+        metric_values = []
+        label_values = self.getlabel_values()
+        spark_ml_evaluator = self.getspark_ml_evaluator()
+        run = self.getwandb_run()
+        evaluator_type = type(spark_ml_evaluator)
+        if isinstance(spark_ml_evaluator, RankingEvaluator):
+            metric_values.append(("k", spark_ml_evaluator.getK()))
+        for metric in self.metrics[evaluator_type]:
+            if "ByLabel" in metric and label_values == []:
+                print(
+                    "no label_values for the target have been provided and will be determined by the dataset.  This could take some time"
+                )
+                label_values = [
+                    r[spark_ml_evaluator.getLabelCol()]
+                    for r in dataset.select(spark_ml_evaluator.getLabelCol())
+                    .distinct()
+                    .collect()
+                ]
+                if isinstance(label_values[0], list):
+                    merged = list(itertools.chain(*label_values))
+                    label_values = list(dict.fromkeys(merged).keys())
+                    self.setlabel_values(label_values)
+            for label in label_values:
+                out = spark_ml_evaluator.evaluate(
+                    dataset,
+                    {
+                        spark_ml_evaluator.metricLabel: label,
+                        spark_ml_evaluator.metricName: metric,
+                    },
+                )
+                metric_values.append((f"{metric}:{label}", out))
+            out = spark_ml_evaluator.evaluate(
+                dataset, {spark_ml_evaluator.metricName: metric}
+            )
+            metric_values.append((f"{metric}", out))
+        run.log(dict(metric_values))
+        config = [
+            (f"{k.parent.split('_')[0]}.{k.name}", v)
+            for k, v in spark_ml_evaluator.extractParamMap().items()
+            if "metric" not in k.name
+        ]
+        run.config.update(dict(config))
+        return_metric = spark_ml_evaluator.evaluate(dataset)
+        dataset.unpersist()
+        return return_metric
+

getlabel_values() -> list

Get the label_values parameter.

Returns:

Name Type Description
list list

List of label values.

Source code in src/otg/method/l2g/evaluator.py
144
+145
+146
+147
+148
+149
+150
def getlabel_values(self: WandbEvaluator) -> list:
+    """Get the label_values parameter.
+
+    Returns:
+        list: List of label values.
+    """
+    return self.getOrDefault(self.label_values)
+

getspark_ml_evaluator() -> Evaluator

Get the spark_ml_evaluator parameter.

Returns:

Name Type Description
Evaluator Evaluator

Spark ML evaluator.

Source code in src/otg/method/l2g/evaluator.py
120
+121
+122
+123
+124
+125
+126
def getspark_ml_evaluator(self: WandbEvaluator) -> Evaluator:
+    """Get the spark_ml_evaluator parameter.
+
+    Returns:
+        Evaluator: Spark ML evaluator.
+    """
+    return self.getOrDefault(self.spark_ml_evaluator)
+

getwandb_project_name() -> str

Get the wandb_project_name parameter.

Returns:

Name Type Description
str str

Name of the W&B project.

Source code in src/otg/method/l2g/evaluator.py
136
+137
+138
+139
+140
+141
+142
def getwandb_project_name(self: WandbEvaluator) -> str:
+    """Get the wandb_project_name parameter.
+
+    Returns:
+        str: Name of the W&B project.
+    """
+    return self.getOrDefault(self.wandb_project_name)
+

getwandb_run() -> wandb.sdk.wandb_run.Run

Get the wandb_run parameter.

Returns:

Type Description
Run

wandb.sdk.wandb_run.Run: Wandb run object.

Source code in src/otg/method/l2g/evaluator.py
128
+129
+130
+131
+132
+133
+134
def getwandb_run(self: WandbEvaluator) -> wandb.sdk.wandb_run.Run:
+    """Get the wandb_run parameter.
+
+    Returns:
+        wandb.sdk.wandb_run.Run: Wandb run object.
+    """
+    return self.getOrDefault(self.wandb_run)
+

setlabel_values(value: list) -> None

Set the label_values parameter.

Parameters:

Name Type Description Default
value list

List of label values.

required
Source code in src/otg/method/l2g/evaluator.py
112
+113
+114
+115
+116
+117
+118
def setlabel_values(self: WandbEvaluator, value: list) -> None:
+    """Set the label_values parameter.
+
+    Args:
+        value (list): List of label values.
+    """
+    self._set(label_values=value)
+

setspark_ml_evaluator(value: Evaluator) -> None

Set the spark_ml_evaluator parameter.

Parameters:

Name Type Description Default
value Evaluator

Spark ML evaluator.

required
Source code in src/otg/method/l2g/evaluator.py
104
+105
+106
+107
+108
+109
+110
def setspark_ml_evaluator(self: WandbEvaluator, value: Evaluator) -> None:
+    """Set the spark_ml_evaluator parameter.
+
+    Args:
+        value (Evaluator): Spark ML evaluator.
+    """
+    self._set(spark_ml_evaluator=value)
+

\ No newline at end of file diff --git a/python_api/method/l2g/feature_factory/index.html b/python_api/method/l2g/feature_factory/index.html new file mode 100644 index 000000000..2d6240eac --- /dev/null +++ b/python_api/method/l2g/feature_factory/index.html @@ -0,0 +1,467 @@ + L2G Feature Factory - Open Targets Genetics

L2G Feature Factory

otg.method.l2g.feature_factory.L2GFeature dataclass

Bases: Dataset

Property of a study locus pair.

Source code in src/otg/method/l2g/feature_factory.py
26
+27
+28
+29
+30
+31
+32
+33
+34
+35
+36
+37
@dataclass
+class L2GFeature(Dataset):
+    """Property of a study locus pair."""
+
+    @classmethod
+    def get_schema(cls: type[L2GFeature]) -> StructType:
+        """Provides the schema for the L2GFeature dataset.
+
+        Returns:
+            StructType: Schema for the L2GFeature dataset
+        """
+        return parse_spark_schema("l2g_feature.json")
+

get_schema() -> StructType classmethod

Provides the schema for the L2GFeature dataset.

Returns:

Name Type Description
StructType StructType

Schema for the L2GFeature dataset

Source code in src/otg/method/l2g/feature_factory.py
30
+31
+32
+33
+34
+35
+36
+37
@classmethod
+def get_schema(cls: type[L2GFeature]) -> StructType:
+    """Provides the schema for the L2GFeature dataset.
+
+    Returns:
+        StructType: Schema for the L2GFeature dataset
+    """
+    return parse_spark_schema("l2g_feature.json")
+

otg.method.l2g.feature_factory.ColocalisationFactory

Feature extraction in colocalisation.

Source code in src/otg/method/l2g/feature_factory.py
 40
+ 41
+ 42
+ 43
+ 44
+ 45
+ 46
+ 47
+ 48
+ 49
+ 50
+ 51
+ 52
+ 53
+ 54
+ 55
+ 56
+ 57
+ 58
+ 59
+ 60
+ 61
+ 62
+ 63
+ 64
+ 65
+ 66
+ 67
+ 68
+ 69
+ 70
+ 71
+ 72
+ 73
+ 74
+ 75
+ 76
+ 77
+ 78
+ 79
+ 80
+ 81
+ 82
+ 83
+ 84
+ 85
+ 86
+ 87
+ 88
+ 89
+ 90
+ 91
+ 92
+ 93
+ 94
+ 95
+ 96
+ 97
+ 98
+ 99
+100
+101
+102
+103
+104
+105
+106
+107
+108
+109
+110
+111
+112
+113
+114
+115
+116
+117
+118
+119
+120
+121
+122
+123
+124
+125
+126
+127
+128
+129
+130
+131
+132
+133
+134
+135
+136
+137
+138
+139
+140
+141
+142
+143
+144
+145
+146
+147
+148
+149
+150
+151
+152
+153
+154
+155
+156
+157
+158
+159
+160
+161
+162
+163
+164
+165
+166
+167
+168
+169
+170
+171
+172
+173
+174
+175
+176
+177
+178
+179
+180
+181
+182
+183
+184
+185
+186
+187
+188
+189
+190
+191
+192
+193
+194
+195
+196
+197
+198
+199
+200
+201
+202
+203
+204
+205
+206
+207
class ColocalisationFactory:
+    """Feature extraction in colocalisation."""
+
+    @staticmethod
+    def _get_max_coloc_per_study_locus(
+        study_locus: StudyLocus,
+        studies: StudyIndex,
+        colocalisation: Colocalisation,
+        colocalisation_method: str,
+    ) -> L2GFeature:
+        """Get the maximum colocalisation posterior probability for each pair of overlapping study-locus per type of colocalisation method and QTL type.
+
+        Args:
+            study_locus (StudyLocus): Study locus dataset
+            studies (StudyIndex): Study index dataset
+            colocalisation (Colocalisation): Colocalisation dataset
+            colocalisation_method (str): Colocalisation method to extract the max from
+
+        Returns:
+            L2GFeature: Stores the features with the max coloc probabilities for each pair of study-locus
+
+        Raises:
+            ValueError: If the colocalisation method is not supported
+        """
+        if colocalisation_method not in ["COLOC", "eCAVIAR"]:
+            raise ValueError(
+                f"Colocalisation method {colocalisation_method} not supported"
+            )
+        if colocalisation_method == "COLOC":
+            coloc_score_col_name = "log2h4h3"
+            coloc_feature_col_template = "max_coloc_llr"
+
+        elif colocalisation_method == "eCAVIAR":
+            coloc_score_col_name = "clpp"
+            coloc_feature_col_template = "max_coloc_clpp"
+
+        colocalising_study_locus = (
+            study_locus.df.select("studyLocusId", "studyId")
+            # annotate studyLoci with overlapping IDs on the left - to just keep GWAS associations
+            .join(
+                colocalisation._df.selectExpr(
+                    "leftStudyLocusId as studyLocusId",
+                    "rightStudyLocusId",
+                    "colocalisationMethod",
+                    f"{coloc_score_col_name} as coloc_score",
+                ),
+                on="studyLocusId",
+                how="inner",
+            )
+            # bring study metadata to just keep QTL studies on the right
+            .join(
+                study_locus.df.selectExpr(
+                    "studyLocusId as rightStudyLocusId", "studyId as right_studyId"
+                ),
+                on="rightStudyLocusId",
+                how="inner",
+            )
+            .join(
+                f.broadcast(
+                    studies._df.selectExpr(
+                        "studyId as right_studyId",
+                        "studyType as right_studyType",
+                        "geneId",
+                    )
+                ),
+                on="right_studyId",
+                how="inner",
+            )
+            .filter(
+                (f.col("colocalisationMethod") == colocalisation_method)
+                & (f.col("right_studyType") != "gwas")
+            )
+            .select("studyLocusId", "right_studyType", "geneId", "coloc_score")
+        )
+
+        # Max LLR calculation per studyLocus AND type of QTL
+        local_max = get_record_with_maximum_value(
+            colocalising_study_locus,
+            ["studyLocusId", "right_studyType", "geneId"],
+            "coloc_score",
+        )
+        neighbourhood_max = (
+            get_record_with_maximum_value(
+                colocalising_study_locus,
+                ["studyLocusId", "right_studyType"],
+                "coloc_score",
+            )
+            .join(
+                local_max.selectExpr("studyLocusId", "coloc_score as coloc_local_max"),
+                on="studyLocusId",
+                how="inner",
+            )
+            .withColumn(
+                f"{coloc_feature_col_template}_nbh",
+                f.col("coloc_local_max") - f.col("coloc_score"),
+            )
+        )
+
+        # Split feature per molQTL
+        local_dfs = []
+        nbh_dfs = []
+        study_types = (
+            colocalising_study_locus.select("right_studyType").distinct().collect()
+        )
+
+        for qtl_type in study_types:
+            local_max = local_max.filter(
+                f.col("right_studyType") == qtl_type
+            ).withColumnRenamed(
+                "coloc_score", f"{qtl_type}_{coloc_feature_col_template}_local"
+            )
+            local_dfs.append(local_max)
+
+            neighbourhood_max = neighbourhood_max.filter(
+                f.col("right_studyType") == qtl_type
+            ).withColumnRenamed(
+                f"{coloc_feature_col_template}_nbh",
+                f"{qtl_type}_{coloc_feature_col_template}_nbh",
+            )
+            nbh_dfs.append(neighbourhood_max)
+
+        wide_dfs = reduce(
+            lambda x, y: x.unionByName(y, allowMissingColumns=True),
+            local_dfs + nbh_dfs,
+            colocalising_study_locus.limit(0),
+        )
+
+        return L2GFeature(
+            _df=_convert_from_wide_to_long(
+                wide_dfs,
+                id_vars=("studyLocusId", "geneId"),
+                var_name="featureName",
+                value_name="featureValue",
+            ),
+            _schema=L2GFeature.get_schema(),
+        )
+
+    @staticmethod
+    def _get_coloc_features(
+        study_locus: StudyLocus, studies: StudyIndex, colocalisation: Colocalisation
+    ) -> L2GFeature:
+        """Calls _get_max_coloc_per_study_locus for both methods and concatenates the results.
+
+        Args:
+            study_locus (StudyLocus): Study locus dataset
+            studies (StudyIndex): Study index dataset
+            colocalisation (Colocalisation): Colocalisation dataset
+
+        Returns:
+            L2GFeature: Stores the features with the max coloc probabilities for each pair of study-locus
+        """
+        coloc_llr = ColocalisationFactory._get_max_coloc_per_study_locus(
+            study_locus,
+            studies,
+            colocalisation,
+            "COLOC",
+        )
+        coloc_clpp = ColocalisationFactory._get_max_coloc_per_study_locus(
+            study_locus,
+            studies,
+            colocalisation,
+            "eCAVIAR",
+        )
+
+        return L2GFeature(
+            _df=coloc_llr.df.unionByName(coloc_clpp.df, allowMissingColumns=True),
+            _schema=L2GFeature.get_schema(),
+        )
+
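A minimal usage sketch, assuming the StudyLocus, StudyIndex and Colocalisation datasets have already been read from Parquet. The paths and the dataset module locations below follow the repository layout but are assumptions, not part of the rendered source:

from otg.common.session import Session
from otg.dataset.colocalisation import Colocalisation
from otg.dataset.study_index import StudyIndex
from otg.dataset.study_locus import StudyLocus
from otg.method.l2g.feature_factory import ColocalisationFactory

session = Session()
study_locus = StudyLocus.from_parquet(session, "path/to/study_locus")    # hypothetical path
studies = StudyIndex.from_parquet(session, "path/to/study_index")        # hypothetical path
coloc = Colocalisation.from_parquet(session, "path/to/colocalisation")   # hypothetical path

# One long-format L2GFeature dataset with the max COLOC and eCAVIAR scores per (studyLocusId, geneId)
coloc_features = ColocalisationFactory._get_coloc_features(study_locus, studies, coloc)
coloc_features.df.show()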

otg.method.l2g.feature_factory.StudyLocusFactory

Bases: StudyLocus

Feature extraction in study locus.

Source code in src/otg/method/l2g/feature_factory.py
class StudyLocusFactory(StudyLocus):
+    """Feature extraction in study locus."""
+
+    @staticmethod
+    def _get_tss_distance_features(
+        study_locus: StudyLocus, distances: V2G
+    ) -> L2GFeature:
+        """Joins StudyLocus with the V2G to extract the minimum distance to a gene TSS of all variants in a StudyLocus credible set.
+
+        Args:
+            study_locus (StudyLocus): Study locus dataset
+            distances (V2G): Dataframe containing the distances of all variants to all genes TSS within a region
+
+        Returns:
+            L2GFeature: Stores the features with the minimum distance among all variants in the credible set and a gene TSS.
+
+        """
+        wide_df = (
+            study_locus.filter_credible_set(CredibleInterval.IS95)
+            .df.select(
+                "studyLocusId",
+                "variantId",
+                f.explode("locus.variantId").alias("tagVariantId"),
+            )
+            .join(
+                distances.df.selectExpr(
+                    "variantId as tagVariantId", "geneId", "distance"
+                ),
+                on="tagVariantId",
+                how="inner",
+            )
+            .groupBy("studyLocusId", "geneId")
+            .agg(
+                f.min("distance").alias("distanceTssMinimum"),
+                f.mean("distance").alias("distanceTssMean"),
+            )
+        )
+
+        return L2GFeature(
+            _df=_convert_from_wide_to_long(
+                wide_df,
+                id_vars=("studyLocusId", "geneId"),
+                var_name="featureName",
+                value_name="featureValue",
+            ),
+            _schema=L2GFeature.get_schema(),
+        )
+
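Both factories rely on _convert_from_wide_to_long to melt the per-gene feature columns into the long (studyLocusId, geneId, featureName, featureValue) layout that L2GFeature expects. A minimal sketch of an equivalent melt in plain PySpark, using hypothetical distance features rather than the helper's actual implementation:

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

wide_df = spark.createDataFrame(
    [(1, "ENSG00000000001", 12000.0, 45000.0)],
    "studyLocusId LONG, geneId STRING, distanceTssMinimum DOUBLE, distanceTssMean DOUBLE",
)

# Every feature column becomes one (featureName, featureValue) row per studyLocusId/geneId pair
long_df = wide_df.select(
    "studyLocusId",
    "geneId",
    f.expr(
        "stack(2, 'distanceTssMinimum', distanceTssMinimum,"
        " 'distanceTssMean', distanceTssMean) as (featureName, featureValue)"
    ),
)
long_df.show()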

\ No newline at end of file diff --git a/python_api/method/l2g/model/index.html b/python_api/method/l2g/model/index.html new file mode 100644 index 000000000..a1fb2d967 --- /dev/null +++ b/python_api/method/l2g/model/index.html @@ -0,0 +1,904 @@ + L2G Model - Open Targets Genetics

L2G Model

otg.method.l2g.model.LocusToGeneModel dataclass

Wrapper for the Locus to Gene classifier.

Source code in src/otg/method/l2g/model.py
@dataclass
+class LocusToGeneModel:
+    """Wrapper for the Locus to Gene classifier."""
+
+    features_list: list[str]
+    estimator: Any = None
+    pipeline: Pipeline = Pipeline(stages=[])
+    model: PipelineModel | None = None
+
+    def __post_init__(self: LocusToGeneModel) -> None:
+        """Post init that adds the model to the ML pipeline."""
+        label_indexer = StringIndexer(
+            inputCol="goldStandardSet", outputCol="label", handleInvalid="keep"
+        )
+        vector_assembler = LocusToGeneModel.features_vector_assembler(
+            self.features_list
+        )
+
+        self.pipeline = Pipeline(
+            stages=[
+                label_indexer,
+                vector_assembler,
+            ]
+        )
+
+    def save(self: LocusToGeneModel, path: str) -> None:
+        """Saves fitted pipeline model to disk.
+
+        Args:
+            path (str): Path to save the model to
+
+        Raises:
+            ValueError: If the model has not been fitted yet
+        """
+        if self.model is None:
+            raise ValueError("Model has not been fitted yet.")
+        self.model.write().overwrite().save(path)
+
+    @property
+    def classifier(self: LocusToGeneModel) -> Any:
+        """Return the model.
+
+        Returns:
+            Any: An estimator object from Spark ML
+        """
+        return self.estimator
+
+    @staticmethod
+    def features_vector_assembler(features_cols: list[str]) -> VectorAssembler:
+        """Spark transformer to assemble the feature columns into a vector.
+
+        Args:
+            features_cols (list[str]): List of feature columns to assemble
+
+        Returns:
+            VectorAssembler: Spark transformer to assemble the feature columns into a vector
+
+        Examples:
+            >>> from pyspark.ml.feature import VectorAssembler
+            >>> df = spark.createDataFrame([(5.2, 3.5)], schema="feature_1 FLOAT, feature_2 FLOAT")
+            >>> assembler = LocusToGeneModel.features_vector_assembler(["feature_1", "feature_2"])
+            >>> assembler.transform(df).show()
+            +---------+---------+--------------------+
+            |feature_1|feature_2|            features|
+            +---------+---------+--------------------+
+            |      5.2|      3.5|[5.19999980926513...|
+            +---------+---------+--------------------+
+            <BLANKLINE>
+        """
+        return (
+            VectorAssembler(handleInvalid="error")
+            .setInputCols(features_cols)
+            .setOutputCol("features")
+        )
+
+    @staticmethod
+    def log_to_wandb(
+        results: DataFrame,
+        binary_evaluator: BinaryClassificationEvaluator,
+        multi_evaluator: MulticlassClassificationEvaluator,
+        wandb_run: Run,
+    ) -> None:
+        """Perform evaluation of the model by applying it to a test set and tracking the results with W&B.
+
+        Args:
+            results (DataFrame): Dataframe containing the predictions
+            binary_evaluator (BinaryClassificationEvaluator): Binary evaluator
+            multi_evaluator (MulticlassClassificationEvaluator): Multiclass evaluator
+            wandb_run (Run): W&B run to log the results to
+        """
+        binary_wandb_evaluator = WandbEvaluator(
+            spark_ml_evaluator=binary_evaluator, wandb_run=wandb_run
+        )
+        binary_wandb_evaluator.evaluate(results)
+        multi_wandb_evaluator = WandbEvaluator(
+            spark_ml_evaluator=multi_evaluator, wandb_run=wandb_run
+        )
+        multi_wandb_evaluator.evaluate(results)
+
+    @classmethod
+    def load_from_disk(
+        cls: Type[LocusToGeneModel], path: str, features_list: list[str]
+    ) -> LocusToGeneModel:
+        """Load a fitted pipeline model from disk.
+
+        Args:
+            path (str): Path to the model
+            features_list (list[str]): List of features used for the model
+
+        Returns:
+            LocusToGeneModel: L2G model loaded from disk
+        """
+        return cls(model=PipelineModel.load(path), features_list=features_list)
+
+    @classifier.setter  # type: ignore
+    def classifier(self: LocusToGeneModel, new_estimator: Any) -> None:
+        """Set the model.
+
+        Args:
+            new_estimator (Any): An estimator object from Spark ML
+        """
+        self.estimator = new_estimator
+
+    def get_param_grid(self: LocusToGeneModel) -> list:
+        """Return the parameter grid for the model.
+
+        Returns:
+            list: List of parameter maps to use for cross validation
+        """
+        return (
+            ParamGridBuilder()
+            .addGrid(self.estimator.max_depth, [3, 5, 7])
+            .addGrid(self.estimator.learning_rate, [0.01, 0.1, 1.0])
+            .build()
+        )
+
+    def add_pipeline_stage(
+        self: LocusToGeneModel, transformer: Transformer
+    ) -> LocusToGeneModel:
+        """Adds a stage to the L2G pipeline.
+
+        Args:
+            transformer (Transformer): Spark transformer to add to the pipeline
+
+        Returns:
+            LocusToGeneModel: L2G model with the new transformer
+
+        Examples:
+            >>> from pyspark.ml.regression import LinearRegression
+            >>> estimator = LinearRegression()
+            >>> test_model = LocusToGeneModel(features_list=["a", "b"])
+            >>> print(len(test_model.pipeline.getStages()))
+            2
+            >>> print(len(test_model.add_pipeline_stage(estimator).pipeline.getStages()))
+            3
+        """
+        pipeline_stages = self.pipeline.getStages()
+        new_stages = pipeline_stages + [transformer]
+        self.pipeline = Pipeline(stages=new_stages)
+        return self
+
+    def evaluate(
+        self: LocusToGeneModel,
+        results: DataFrame,
+        hyperparameters: dict,
+        wandb_run_name: str | None,
+    ) -> None:
+        """Perform evaluation of the model by applying it to a test set and tracking the results with W&B.
+
+        Args:
+            results (DataFrame): Dataframe containing the predictions
+            hyperparameters (dict): Hyperparameters used for the model
+            wandb_run_name (str | None): Descriptive name for the run to be tracked with W&B
+        """
+        binary_evaluator = BinaryClassificationEvaluator(
+            rawPredictionCol="rawPrediction", labelCol="label"
+        )
+        multi_evaluator = MulticlassClassificationEvaluator(
+            labelCol="label", predictionCol="prediction"
+        )
+
+        print("Evaluating model...")
+        print(
+            "... Area under ROC curve:",
+            binary_evaluator.evaluate(
+                results, {binary_evaluator.metricName: "areaUnderROC"}
+            ),
+        )
+        print(
+            "... Area under Precision-Recall curve:",
+            binary_evaluator.evaluate(
+                results, {binary_evaluator.metricName: "areaUnderPR"}
+            ),
+        )
+        print(
+            "... Accuracy:",
+            multi_evaluator.evaluate(results, {multi_evaluator.metricName: "accuracy"}),
+        )
+        print(
+            "... F1 score:",
+            multi_evaluator.evaluate(results, {multi_evaluator.metricName: "f1"}),
+        )
+
+        if wandb_run_name:
+            print("Logging to W&B...")
+            run = wandb.init(
+                project="otg_l2g", config=hyperparameters, name=wandb_run_name
+            )
+            if isinstance(run, Run):
+                LocusToGeneModel.log_to_wandb(
+                    results, binary_evaluator, multi_evaluator, run
+                )
+                run.finish()
+
+    def plot_importance(self: LocusToGeneModel) -> None:
+        """Plot the feature importance of the model."""
+        # xgb_plot_importance(self)  # FIXME: What is the attribute that stores the model?
+
+    def fit(
+        self: LocusToGeneModel,
+        feature_matrix: L2GFeatureMatrix,
+    ) -> LocusToGeneModel:
+        """Fit the pipeline to the feature matrix dataframe.
+
+        Args:
+            feature_matrix (L2GFeatureMatrix): Feature matrix dataframe to fit the model to
+
+        Returns:
+            LocusToGeneModel: Fitted model
+        """
+        self.model = self.pipeline.fit(feature_matrix.df)
+        return self
+
+    def predict(
+        self: LocusToGeneModel,
+        feature_matrix: L2GFeatureMatrix,
+    ) -> DataFrame:
+        """Apply the model to a given feature matrix dataframe. The feature matrix needs to be preprocessed first.
+
+        Args:
+            feature_matrix (L2GFeatureMatrix): Feature matrix dataframe to apply the model to
+
+        Returns:
+            DataFrame: Dataframe with predictions
+
+        Raises:
+            ValueError: If the model has not been fitted yet
+        """
+        if not self.model:
+            raise ValueError("Model not fitted yet. `fit()` has to be called first.")
+        return self.model.transform(feature_matrix.df)
+
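A hedged end-to-end sketch of the wrapper, assuming fm is an L2GFeatureMatrix built upstream (for example with L2GFeatureMatrix.generate_features) and using Spark ML's LogisticRegression as an arbitrary example estimator:

from pyspark.ml.classification import LogisticRegression

from otg.method.l2g.model import LocusToGeneModel

model = LocusToGeneModel(features_list=["distanceTssMean"])   # feature names must exist in fm
model.classifier = LogisticRegression(featuresCol="features", labelCol="label")

# Append the estimator after the StringIndexer/VectorAssembler stages created in __post_init__
model = model.add_pipeline_stage(model.classifier).fit(fm)

predictions = model.predict(fm)
model.save("path/to/l2g_model")                               # hypothetical output path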

classifier: Any property writable

Return the model.

Returns:

Any: An estimator object from Spark ML

add_pipeline_stage(transformer: Transformer) -> LocusToGeneModel

Adds a stage to the L2G pipeline.

Parameters:

transformer (Transformer): Spark transformer to add to the pipeline. Required.

Returns:

LocusToGeneModel: L2G model with the new transformer

Examples:

>>> from pyspark.ml.regression import LinearRegression
+>>> estimator = LinearRegression()
+>>> test_model = LocusToGeneModel(features_list=["a", "b"])
+>>> print(len(test_model.pipeline.getStages()))
+2
+>>> print(len(test_model.add_pipeline_stage(estimator).pipeline.getStages()))
+3
+
Source code in src/otg/method/l2g/model.py
def add_pipeline_stage(
+    self: LocusToGeneModel, transformer: Transformer
+) -> LocusToGeneModel:
+    """Adds a stage to the L2G pipeline.
+
+    Args:
+        transformer (Transformer): Spark transformer to add to the pipeline
+
+    Returns:
+        LocusToGeneModel: L2G model with the new transformer
+
+    Examples:
+        >>> from pyspark.ml.regression import LinearRegression
+        >>> estimator = LinearRegression()
+        >>> test_model = LocusToGeneModel(features_list=["a", "b"])
+        >>> print(len(test_model.pipeline.getStages()))
+        2
+        >>> print(len(test_model.add_pipeline_stage(estimator).pipeline.getStages()))
+        3
+    """
+    pipeline_stages = self.pipeline.getStages()
+    new_stages = pipeline_stages + [transformer]
+    self.pipeline = Pipeline(stages=new_stages)
+    return self
+

evaluate(results: DataFrame, hyperparameters: dict, wandb_run_name: str | None) -> None

Perform evaluation of the model by applying it to a test set and tracking the results with W&B.

Parameters:

results (DataFrame): Dataframe containing the predictions. Required.
hyperparameters (dict): Hyperparameters used for the model. Required.
wandb_run_name (str | None): Descriptive name for the run to be tracked with W&B. Required.
Source code in src/otg/method/l2g/model.py
def evaluate(
+    self: LocusToGeneModel,
+    results: DataFrame,
+    hyperparameters: dict,
+    wandb_run_name: str | None,
+) -> None:
+    """Perform evaluation of the model by applying it to a test set and tracking the results with W&B.
+
+    Args:
+        results (DataFrame): Dataframe containing the predictions
+        hyperparameters (dict): Hyperparameters used for the model
+        wandb_run_name (str | None): Descriptive name for the run to be tracked with W&B
+    """
+    binary_evaluator = BinaryClassificationEvaluator(
+        rawPredictionCol="rawPrediction", labelCol="label"
+    )
+    multi_evaluator = MulticlassClassificationEvaluator(
+        labelCol="label", predictionCol="prediction"
+    )
+
+    print("Evaluating model...")
+    print(
+        "... Area under ROC curve:",
+        binary_evaluator.evaluate(
+            results, {binary_evaluator.metricName: "areaUnderROC"}
+        ),
+    )
+    print(
+        "... Area under Precision-Recall curve:",
+        binary_evaluator.evaluate(
+            results, {binary_evaluator.metricName: "areaUnderPR"}
+        ),
+    )
+    print(
+        "... Accuracy:",
+        multi_evaluator.evaluate(results, {multi_evaluator.metricName: "accuracy"}),
+    )
+    print(
+        "... F1 score:",
+        multi_evaluator.evaluate(results, {multi_evaluator.metricName: "f1"}),
+    )
+
+    if wandb_run_name:
+        print("Logging to W&B...")
+        run = wandb.init(
+            project="otg_l2g", config=hyperparameters, name=wandb_run_name
+        )
+        if isinstance(run, Run):
+            LocusToGeneModel.log_to_wandb(
+                results, binary_evaluator, multi_evaluator, run
+            )
+            run.finish()
+
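A short sketch of evaluating held-out predictions, assuming model is a fitted LocusToGeneModel and test an L2GFeatureMatrix prepared earlier; passing wandb_run_name=None prints the metrics but skips the W&B branch:

predictions = model.predict(test)
model.evaluate(
    results=predictions,
    hyperparameters={"max_depth": 5},   # whatever was used to configure the estimator
    wandb_run_name=None,
)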

features_vector_assembler(features_cols: list[str]) -> VectorAssembler staticmethod

Spark transformer to assemble the feature columns into a vector.

Parameters:

features_cols (list[str]): List of feature columns to assemble. Required.

Returns:

VectorAssembler: Spark transformer to assemble the feature columns into a vector

Examples:

>>> from pyspark.ml.feature import VectorAssembler
+>>> df = spark.createDataFrame([(5.2, 3.5)], schema="feature_1 FLOAT, feature_2 FLOAT")
+>>> assembler = LocusToGeneModel.features_vector_assembler(["feature_1", "feature_2"])
+>>> assembler.transform(df).show()
++---------+---------+--------------------+
+|feature_1|feature_2|            features|
++---------+---------+--------------------+
+|      5.2|      3.5|[5.19999980926513...|
++---------+---------+--------------------+
+
Source code in src/otg/method/l2g/model.py
@staticmethod
+def features_vector_assembler(features_cols: list[str]) -> VectorAssembler:
+    """Spark transformer to assemble the feature columns into a vector.
+
+    Args:
+        features_cols (list[str]): List of feature columns to assemble
+
+    Returns:
+        VectorAssembler: Spark transformer to assemble the feature columns into a vector
+
+    Examples:
+        >>> from pyspark.ml.feature import VectorAssembler
+        >>> df = spark.createDataFrame([(5.2, 3.5)], schema="feature_1 FLOAT, feature_2 FLOAT")
+        >>> assembler = LocusToGeneModel.features_vector_assembler(["feature_1", "feature_2"])
+        >>> assembler.transform(df).show()
+        +---------+---------+--------------------+
+        |feature_1|feature_2|            features|
+        +---------+---------+--------------------+
+        |      5.2|      3.5|[5.19999980926513...|
+        +---------+---------+--------------------+
+        <BLANKLINE>
+    """
+    return (
+        VectorAssembler(handleInvalid="error")
+        .setInputCols(features_cols)
+        .setOutputCol("features")
+    )
+

fit(feature_matrix: L2GFeatureMatrix) -> LocusToGeneModel

Fit the pipeline to the feature matrix dataframe.

Parameters:

feature_matrix (L2GFeatureMatrix): Feature matrix dataframe to fit the model to. Required.

Returns:

LocusToGeneModel: Fitted model

Source code in src/otg/method/l2g/model.py
def fit(
+    self: LocusToGeneModel,
+    feature_matrix: L2GFeatureMatrix,
+) -> LocusToGeneModel:
+    """Fit the pipeline to the feature matrix dataframe.
+
+    Args:
+        feature_matrix (L2GFeatureMatrix): Feature matrix dataframe to fit the model to
+
+    Returns:
+        LocusToGeneModel: Fitted model
+    """
+    self.model = self.pipeline.fit(feature_matrix.df)
+    return self
+

get_param_grid() -> list

Return the parameter grid for the model.

Returns:

list: List of parameter maps to use for cross validation

Source code in src/otg/method/l2g/model.py
def get_param_grid(self: LocusToGeneModel) -> list:
+    """Return the parameter grid for the model.
+
+    Returns:
+        list: List of parameter maps to use for cross validation
+    """
+    return (
+        ParamGridBuilder()
+        .addGrid(self.estimator.max_depth, [3, 5, 7])
+        .addGrid(self.estimator.learning_rate, [0.01, 0.1, 1.0])
+        .build()
+    )
+

load_from_disk(path: str, features_list: list[str]) -> LocusToGeneModel classmethod

Load a fitted pipeline model from disk.

Parameters:

path (str): Path to the model. Required.
features_list (list[str]): List of features used for the model. Required.

Returns:

LocusToGeneModel: L2G model loaded from disk

Source code in src/otg/method/l2g/model.py
@classmethod
+def load_from_disk(
+    cls: Type[LocusToGeneModel], path: str, features_list: list[str]
+) -> LocusToGeneModel:
+    """Load a fitted pipeline model from disk.
+
+    Args:
+        path (str): Path to the model
+        features_list (list[str]): List of features used for the model
+
+    Returns:
+        LocusToGeneModel: L2G model loaded from disk
+    """
+    return cls(model=PipelineModel.load(path), features_list=features_list)
+
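A minimal sketch for reusing a previously saved model at prediction time; the path and feature names are hypothetical and must match what was used when the model was trained and saved:

loaded = LocusToGeneModel.load_from_disk("path/to/l2g_model", features_list=["distanceTssMean"])
predictions = loaded.predict(fm)   # fm: an L2GFeatureMatrix prepared with the same features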

log_to_wandb(results: DataFrame, binary_evaluator: BinaryClassificationEvaluator, multi_evaluator: MulticlassClassificationEvaluator, wandb_run: Run) -> None staticmethod

Perform evaluation of the model by applying it to a test set and tracking the results with W&B.

Parameters:

results (DataFrame): Dataframe containing the predictions. Required.
binary_evaluator (BinaryClassificationEvaluator): Binary evaluator. Required.
multi_evaluator (MulticlassClassificationEvaluator): Multiclass evaluator. Required.
wandb_run (Run): W&B run to log the results to. Required.
Source code in src/otg/method/l2g/model.py
@staticmethod
+def log_to_wandb(
+    results: DataFrame,
+    binary_evaluator: BinaryClassificationEvaluator,
+    multi_evaluator: MulticlassClassificationEvaluator,
+    wandb_run: Run,
+) -> None:
+    """Perform evaluation of the model by applying it to a test set and tracking the results with W&B.
+
+    Args:
+        results (DataFrame): Dataframe containing the predictions
+        binary_evaluator (BinaryClassificationEvaluator): Binary evaluator
+        multi_evaluator (MulticlassClassificationEvaluator): Multiclass evaluator
+        wandb_run (Run): W&B run to log the results to
+    """
+    binary_wandb_evaluator = WandbEvaluator(
+        spark_ml_evaluator=binary_evaluator, wandb_run=wandb_run
+    )
+    binary_wandb_evaluator.evaluate(results)
+    multi_wandb_evaluator = WandbEvaluator(
+        spark_ml_evaluator=multi_evaluator, wandb_run=wandb_run
+    )
+    multi_wandb_evaluator.evaluate(results)
+

plot_importance() -> None

Plot the feature importance of the model.

Source code in src/otg/method/l2g/model.py
def plot_importance(self: LocusToGeneModel) -> None:
+    """Plot the feature importance of the model."""
+

predict(feature_matrix: L2GFeatureMatrix) -> DataFrame

Apply the model to a given feature matrix dataframe. The feature matrix needs to be preprocessed first.

Parameters:

feature_matrix (L2GFeatureMatrix): Feature matrix dataframe to apply the model to. Required.

Returns:

DataFrame: Dataframe with predictions

Raises:

ValueError: If the model has not been fitted yet

Source code in src/otg/method/l2g/model.py
def predict(
+    self: LocusToGeneModel,
+    feature_matrix: L2GFeatureMatrix,
+) -> DataFrame:
+    """Apply the model to a given feature matrix dataframe. The feature matrix needs to be preprocessed first.
+
+    Args:
+        feature_matrix (L2GFeatureMatrix): Feature matrix dataframe to apply the model to
+
+    Returns:
+        DataFrame: Dataframe with predictions
+
+    Raises:
+        ValueError: If the model has not been fitted yet
+    """
+    if not self.model:
+        raise ValueError("Model not fitted yet. `fit()` has to be called first.")
+    return self.model.transform(feature_matrix.df)
+

save(path: str) -> None

Saves fitted pipeline model to disk.

Parameters:

path (str): Path to save the model to. Required.

Raises:

ValueError: If the model has not been fitted yet

Source code in src/otg/method/l2g/model.py
def save(self: LocusToGeneModel, path: str) -> None:
+    """Saves fitted pipeline model to disk.
+
+    Args:
+        path (str): Path to save the model to
+
+    Raises:
+        ValueError: If the model has not been fitted yet
+    """
+    if self.model is None:
+        raise ValueError("Model has not been fitted yet.")
+    self.model.write().overwrite().save(path)
+

\ No newline at end of file diff --git a/python_api/method/l2g/trainer/index.html b/python_api/method/l2g/trainer/index.html new file mode 100644 index 000000000..efe573542 --- /dev/null +++ b/python_api/method/l2g/trainer/index.html @@ -0,0 +1,374 @@ + L2G Trainer - Open Targets Genetics

L2G Trainer

otg.method.l2g.trainer.LocusToGeneTrainer dataclass

Modelling of the most likely causal gene associated with a given locus.

Source code in src/otg/method/l2g/trainer.py
@dataclass
+class LocusToGeneTrainer:
+    """Modelling of what is the most likely causal gene associated with a given locus."""
+
+    _model: LocusToGeneModel
+    train_set: L2GFeatureMatrix
+
+    @classmethod
+    def train(
+        cls: type[LocusToGeneTrainer],
+        data: L2GFeatureMatrix,
+        l2g_model: LocusToGeneModel,
+        features_list: list[str],
+        evaluate: bool,
+        wandb_run_name: str | None = None,
+        model_path: str | None = None,
+        **hyperparams: dict,
+    ) -> LocusToGeneModel:
+        """Train the Locus to Gene model.
+
+        Args:
+            data (L2GFeatureMatrix): Feature matrix containing the data
+            l2g_model (LocusToGeneModel): Model to fit to the data on
+            features_list (list[str]): List of features to use for the model
+            evaluate (bool): Whether to evaluate the model on a test set
+            wandb_run_name (str | None): Descriptive name for the run to be tracked with W&B
+            model_path (str | None): Path to save the model to
+            **hyperparams (dict): Hyperparameters to use for the model
+
+        Returns:
+            LocusToGeneModel: Trained model
+        """
+        train, test = data.select_features(features_list).train_test_split(fraction=0.8)
+
+        model = l2g_model.add_pipeline_stage(l2g_model.estimator).fit(train)
+
+        if evaluate:
+            l2g_model.evaluate(
+                results=model.predict(test),
+                hyperparameters=hyperparams,
+                wandb_run_name=wandb_run_name,
+            )
+        if model_path:
+            l2g_model.save(model_path)
+        return l2g_model
+
+    @classmethod
+    def cross_validate(
+        cls: type[LocusToGeneTrainer],
+        l2g_model: LocusToGeneModel,
+        data: L2GFeatureMatrix,
+        num_folds: int,
+        param_grid: Optional[list] = None,
+    ) -> LocusToGeneModel:
+        """Perform k-fold cross validation on the model.
+
+        By providing a model with a parameter grid, this method will perform k-fold cross validation on the model for each
+        combination of parameters and return the best model.
+
+        Args:
+            l2g_model (LocusToGeneModel): Model to fit to the data on
+            data (L2GFeatureMatrix): Data to perform cross validation on
+            num_folds (int): Number of folds to use for cross validation
+            param_grid (Optional[list]): List of parameter maps to use for cross validation
+
+        Returns:
+            LocusToGeneModel: Trained model fitted with the best hyperparameters
+
+        Raises:
+            ValueError: Parameter grid is empty. Cannot perform cross-validation.
+            ValueError: Unable to retrieve the best model.
+        """
+        evaluator = MulticlassClassificationEvaluator()
+        params_grid = param_grid or l2g_model.get_param_grid()
+        if not param_grid:
+            raise ValueError(
+                "Parameter grid is empty. Cannot perform cross-validation."
+            )
+        cv = CrossValidator(
+            numFolds=num_folds,
+            estimator=l2g_model.estimator,
+            estimatorParamMaps=params_grid,
+            evaluator=evaluator,
+            parallelism=2,
+            collectSubModels=False,
+            seed=42,
+        )
+
+        l2g_model.add_pipeline_stage(cv)
+
+        # Integrate the best model from the last stage of the pipeline
+        if (full_pipeline_model := l2g_model.fit(data).model) is None or not hasattr(
+            full_pipeline_model, "stages"
+        ):
+            raise ValueError("Unable to retrieve the best model.")
+        l2g_model.model = full_pipeline_model.stages[-1].bestModel
+
+        return l2g_model
+
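A hedged sketch of a training run, assuming fm is an L2GFeatureMatrix that already contains the selected features plus the goldStandardSet label, and using LogisticRegression as a stand-in estimator; keyword arguments after model_path are collected into **hyperparams:

from pyspark.ml.classification import LogisticRegression

from otg.method.l2g.model import LocusToGeneModel
from otg.method.l2g.trainer import LocusToGeneTrainer

l2g = LocusToGeneModel(features_list=["distanceTssMean"])
l2g.classifier = LogisticRegression(featuresCol="features", labelCol="label")

trained = LocusToGeneTrainer.train(
    data=fm,
    l2g_model=l2g,
    features_list=["distanceTssMean"],
    evaluate=True,
    wandb_run_name=None,              # metrics are printed; W&B logging is skipped
    model_path="path/to/l2g_model",   # hypothetical output path
    max_depth=5,                      # forwarded to evaluate as **hyperparams
)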

cross_validate(l2g_model: LocusToGeneModel, data: L2GFeatureMatrix, num_folds: int, param_grid: Optional[list] = None) -> LocusToGeneModel classmethod

Perform k-fold cross validation on the model.

By providing a model with a parameter grid, this method will perform k-fold cross validation on the model for each combination of parameters and return the best model.

Parameters:

l2g_model (LocusToGeneModel): Model to fit to the data on. Required.
data (L2GFeatureMatrix): Data to perform cross validation on. Required.
num_folds (int): Number of folds to use for cross validation. Required.
param_grid (Optional[list]): List of parameter maps to use for cross validation. Defaults to None.

Returns:

LocusToGeneModel: Trained model fitted with the best hyperparameters

Raises:

ValueError: Parameter grid is empty. Cannot perform cross-validation.
ValueError: Unable to retrieve the best model.

Source code in src/otg/method/l2g/trainer.py
@classmethod
+def cross_validate(
+    cls: type[LocusToGeneTrainer],
+    l2g_model: LocusToGeneModel,
+    data: L2GFeatureMatrix,
+    num_folds: int,
+    param_grid: Optional[list] = None,
+) -> LocusToGeneModel:
+    """Perform k-fold cross validation on the model.
+
+    By providing a model with a parameter grid, this method will perform k-fold cross validation on the model for each
+    combination of parameters and return the best model.
+
+    Args:
+        l2g_model (LocusToGeneModel): Model to fit to the data on
+        data (L2GFeatureMatrix): Data to perform cross validation on
+        num_folds (int): Number of folds to use for cross validation
+        param_grid (Optional[list]): List of parameter maps to use for cross validation
+
+    Returns:
+        LocusToGeneModel: Trained model fitted with the best hyperparameters
+
+    Raises:
+        ValueError: Parameter grid is empty. Cannot perform cross-validation.
+        ValueError: Unable to retrieve the best model.
+    """
+    evaluator = MulticlassClassificationEvaluator()
+    params_grid = param_grid or l2g_model.get_param_grid()
+    if not param_grid:
+        raise ValueError(
+            "Parameter grid is empty. Cannot perform cross-validation."
+        )
+    cv = CrossValidator(
+        numFolds=num_folds,
+        estimator=l2g_model.estimator,
+        estimatorParamMaps=params_grid,
+        evaluator=evaluator,
+        parallelism=2,
+        collectSubModels=False,
+        seed=42,
+    )
+
+    l2g_model.add_pipeline_stage(cv)
+
+    # Integrate the best model from the last stage of the pipeline
+    if (full_pipeline_model := l2g_model.fit(data).model) is None or not hasattr(
+        full_pipeline_model, "stages"
+    ):
+        raise ValueError("Unable to retrieve the best model.")
+    l2g_model.model = full_pipeline_model.stages[-1].bestModel
+
+    return l2g_model
+
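A sketch of cross-validating an explicitly supplied grid, assuming fm is an L2GFeatureMatrix and again using LogisticRegression as a placeholder estimator (the default grid from get_param_grid targets max_depth and learning_rate, which only exist on tree-based estimators):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder

l2g = LocusToGeneModel(features_list=["distanceTssMean"])
l2g.classifier = LogisticRegression(featuresCol="features", labelCol="label")

# Grid over a parameter that the chosen estimator actually exposes
grid = ParamGridBuilder().addGrid(l2g.classifier.regParam, [0.01, 0.1]).build()
best = LocusToGeneTrainer.cross_validate(l2g_model=l2g, data=fm, num_folds=5, param_grid=grid)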

train(data: L2GFeatureMatrix, l2g_model: LocusToGeneModel, features_list: list[str], evaluate: bool, wandb_run_name: str | None = None, model_path: str | None = None, **hyperparams: dict) -> LocusToGeneModel classmethod

Train the Locus to Gene model.

Parameters:

data (L2GFeatureMatrix): Feature matrix containing the data. Required.
l2g_model (LocusToGeneModel): Model to fit to the data on. Required.
features_list (list[str]): List of features to use for the model. Required.
evaluate (bool): Whether to evaluate the model on a test set. Required.
wandb_run_name (str | None): Descriptive name for the run to be tracked with W&B. Defaults to None.
model_path (str | None): Path to save the model to. Defaults to None.
**hyperparams (dict): Hyperparameters to use for the model. Defaults to {}.

Returns:

LocusToGeneModel: Trained model

Source code in src/otg/method/l2g/trainer.py
@classmethod
+def train(
+    cls: type[LocusToGeneTrainer],
+    data: L2GFeatureMatrix,
+    l2g_model: LocusToGeneModel,
+    features_list: list[str],
+    evaluate: bool,
+    wandb_run_name: str | None = None,
+    model_path: str | None = None,
+    **hyperparams: dict,
+) -> LocusToGeneModel:
+    """Train the Locus to Gene model.
+
+    Args:
+        data (L2GFeatureMatrix): Feature matrix containing the data
+        l2g_model (LocusToGeneModel): Model to fit to the data on
+        features_list (list[str]): List of features to use for the model
+        evaluate (bool): Whether to evaluate the model on a test set
+        wandb_run_name (str | None): Descriptive name for the run to be tracked with W&B
+        model_path (str | None): Path to save the model to
+        **hyperparams (dict): Hyperparameters to use for the model
+
+    Returns:
+        LocusToGeneModel: Trained model
+    """
+    train, test = data.select_features(features_list).train_test_split(fraction=0.8)
+
+    model = l2g_model.add_pipeline_stage(l2g_model.estimator).fit(train)
+
+    if evaluate:
+        l2g_model.evaluate(
+            results=model.predict(test),
+            hyperparameters=hyperparams,
+            wandb_run_name=wandb_run_name,
+        )
+    if model_path:
+        l2g_model.save(model_path)
+    return l2g_model
+

\ No newline at end of file diff --git a/python_api/method/ld_annotator/index.html b/python_api/method/ld_annotator/index.html index 69cd28e14..e6a4de317 100644 --- a/python_api/method/ld_annotator/index.html +++ b/python_api/method/ld_annotator/index.html @@ -1,4 +1,4 @@ - LDAnnotator - Open Targets Genetics

LDAnnotator

otg.method.ld.LDAnnotator

Class to annotate linkage disequilibrium (LD) operations from GnomAD.

Source code in src/otg/method/ld.py
 17
+ LDAnnotator - Open Targets Genetics       

LDAnnotator

otg.method.ld.LDAnnotator

Class to annotate linkage disequilibrium (LD) operations from GnomAD.

Source code in src/otg/method/ld.py
 17
  18
  19
  20
diff --git a/python_api/method/pics/index.html b/python_api/method/pics/index.html
index 513a27d60..fe90de233 100644
--- a/python_api/method/pics/index.html
+++ b/python_api/method/pics/index.html
@@ -1,4 +1,4 @@
- PICS - Open Targets Genetics       

PICS

otg.method.pics.PICS

Probabilistic Identification of Causal SNPs (PICS), an algorithm estimating the probability that an individual variant is causal considering the haplotype structure and observed pattern of association at the genetic locus.

Source code in src/otg/method/pics.py
 17
+ PICS - Open Targets Genetics       

PICS

otg.method.pics.PICS

Probabilistic Identification of Causal SNPs (PICS), an algorithm estimating the probability that an individual variant is causal considering the haplotype structure and observed pattern of association at the genetic locus.

Source code in src/otg/method/pics.py
 17
  18
  19
  20
diff --git a/python_api/method/window_based_clumping/index.html b/python_api/method/window_based_clumping/index.html
index 3c232ab56..70e7b60b4 100644
--- a/python_api/method/window_based_clumping/index.html
+++ b/python_api/method/window_based_clumping/index.html
@@ -1,4 +1,4 @@
- Window-based clumping - Open Targets Genetics       

Window-based clumping

otg.method.window_based_clumping.WindowBasedClumping

Get semi-lead snps from summary statistics using a window based function.

Source code in src/otg/method/window_based_clumping.py
 23
+ Window-based clumping - Open Targets Genetics       

Window-based clumping

otg.method.window_based_clumping.WindowBasedClumping

Get semi-lead snps from summary statistics using a window based function.

Source code in src/otg/method/window_based_clumping.py
 23
  24
  25
  26
diff --git a/python_api/step/_step/index.html b/python_api/step/_step/index.html
index 92faa5a81..31a460f4c 100644
--- a/python_api/step/_step/index.html
+++ b/python_api/step/_step/index.html
@@ -1 +1 @@
- Step - Open Targets Genetics       
\ No newline at end of file + Step - Open Targets Genetics
\ No newline at end of file diff --git a/python_api/step/colocalisation/index.html b/python_api/step/colocalisation/index.html index 4c6d22d14..c0637ba94 100644 --- a/python_api/step/colocalisation/index.html +++ b/python_api/step/colocalisation/index.html @@ -1,4 +1,4 @@ - Colocalisation - Open Targets Genetics

Colocalisation

otg.colocalisation.ColocalisationStep dataclass

Colocalisation step.

This workflow runs colocalization analyses that assess the degree to which independent signals of the association share the same causal variant in a region of the genome, typically limited by linkage disequilibrium (LD).

Attributes:

Name Type Description
study_locus_path DictConfig

Input Study-locus path.

coloc_path DictConfig

Output Colocalisation path.

priorc1 float

Prior on variant being causal for trait 1.

priorc2 float

Prior on variant being causal for trait 2.

priorc12 float

Prior on variant being causal for traits 1 and 2.

Source code in src/otg/colocalisation.py
14
+ Colocalisation - Open Targets Genetics       

Colocalisation

otg.colocalisation.ColocalisationStep dataclass

Colocalisation step.

This workflow runs colocalization analyses that assess the degree to which independent signals of the association share the same causal variant in a region of the genome, typically limited by linkage disequilibrium (LD).

Attributes:

Name Type Description
session Session

Session object.

study_locus_path DictConfig

Input Study-locus path.

coloc_path DictConfig

Output Colocalisation path.

priorc1 float

Prior on variant being causal for trait 1.

priorc2 float

Prior on variant being causal for trait 2.

priorc12 float

Prior on variant being causal for traits 1 and 2.

Source code in src/otg/colocalisation.py
14
 15
 16
 17
@@ -37,13 +37,15 @@
 50
 51
 52
-53
@dataclass
+53
+54
@dataclass
 class ColocalisationStep:
     """Colocalisation step.
 
     This workflow runs colocalization analyses that assess the degree to which independent signals of the association share the same causal variant in a region of the genome, typically limited by linkage disequilibrium (LD).
 
     Attributes:
+        session (Session): Session object.
         study_locus_path (DictConfig): Input Study-locus path.
         coloc_path (DictConfig): Output Colocalisation path.
         priorc1 (float): Prior on variant being causal for trait 1.
diff --git a/python_api/step/finngen/index.html b/python_api/step/finngen/index.html
index 132b3075a..7cb2ed162 100644
--- a/python_api/step/finngen/index.html
+++ b/python_api/step/finngen/index.html
@@ -1,4 +1,4 @@
- FinnGen - Open Targets Genetics       

FinnGen

otg.finngen.FinnGenStep dataclass

FinnGen ingestion step.

Attributes:

Name Type Description
finngen_phenotype_table_url str

FinnGen API for fetching the list of studies.

finngen_release_prefix str

Release prefix pattern.

finngen_sumstat_url_prefix str

URL prefix for summary statistics location.

finngen_sumstat_url_suffix str

URL prefix suffix for summary statistics location.

finngen_study_index_out str

Output path for the FinnGen study index dataset.

finngen_summary_stats_out str

Output path for the FinnGen summary statistics.

Source code in src/otg/finngen.py
15
+ FinnGen - Open Targets Genetics       

FinnGen

otg.finngen.FinnGenStep dataclass

FinnGen ingestion step.

Attributes:

Name Type Description
session Session

Session object.

finngen_phenotype_table_url str

FinnGen API for fetching the list of studies.

finngen_release_prefix str

Release prefix pattern.

finngen_sumstat_url_prefix str

URL prefix for summary statistics location.

finngen_sumstat_url_suffix str

URL prefix suffix for summary statistics location.

finngen_study_index_out str

Output path for the FinnGen study index dataset.

finngen_summary_stats_out str

Output path for the FinnGen summary statistics.

Source code in src/otg/finngen.py
15
 16
 17
 18
@@ -56,11 +56,13 @@
 70
 71
 72
-73
@dataclass
+73
+74
@dataclass
 class FinnGenStep:
     """FinnGen ingestion step.
 
     Attributes:
+        session (Session): Session object.
         finngen_phenotype_table_url (str): FinnGen API for fetching the list of studies.
         finngen_release_prefix (str): Release prefix pattern.
         finngen_sumstat_url_prefix (str): URL prefix for summary statistics location.
diff --git a/python_api/step/gene_index/index.html b/python_api/step/gene_index/index.html
index c139fa5ae..7ac1628eb 100644
--- a/python_api/step/gene_index/index.html
+++ b/python_api/step/gene_index/index.html
@@ -1,4 +1,4 @@
- Gene Index - Open Targets Genetics       

Gene Index

otg.gene_index.GeneIndexStep dataclass

Gene index step.

This step generates a gene index dataset from an Open Targets Platform target dataset.

Attributes:

Name Type Description
target_path str

Open targets Platform target dataset path.

gene_index_path str

Output gene index path.

Source code in src/otg/gene_index.py
12
+ Gene Index - Open Targets Genetics       

Gene Index

otg.gene_index.GeneIndexStep dataclass

Gene index step.

This step generates a gene index dataset from an Open Targets Platform target dataset.

Attributes:

Name Type Description
session Session

Session object.

target_path str

Open targets Platform target dataset path.

gene_index_path str

Output gene index path.

Source code in src/otg/gene_index.py
12
 13
 14
 15
@@ -21,13 +21,15 @@
 32
 33
 34
-35
@dataclass
+35
+36
@dataclass
 class GeneIndexStep:
     """Gene index step.
 
     This step generates a gene index dataset from an Open Targets Platform target dataset.
 
     Attributes:
+        session (Session): Session object.
         target_path (str): Open targets Platform target dataset path.
         gene_index_path (str): Output gene index path.
     """
diff --git a/python_api/step/gwas_catalog/index.html b/python_api/step/gwas_catalog/index.html
index a6fdc7f1d..33f9d31a0 100644
--- a/python_api/step/gwas_catalog/index.html
+++ b/python_api/step/gwas_catalog/index.html
@@ -1,4 +1,4 @@
- GWAS Catalog - Open Targets Genetics       

GWAS Catalog

otg.gwas_catalog.GWASCatalogStep dataclass

GWAS Catalog ingestion step to extract GWASCatalog Study and StudyLocus tables.

Attributes:

Name Type Description
catalog_studies_file str

Raw GWAS catalog studies file.

catalog_ancestry_file str

Ancestry annotations file from GWAS Catalog.

catalog_sumstats_lut str

GWAS Catalog summary statistics lookup table.

catalog_associations_file str

Raw GWAS catalog associations file.

variant_annotation_path str

Input variant annotation path.

ld_populations list

List of populations to include.

min_r2 float

Minimum r2 to consider when considering variants within a window.

catalog_studies_out str

Output GWAS catalog studies path.

catalog_associations_out str

Output GWAS catalog associations path.

Source code in src/otg/gwas_catalog.py
19
+ GWAS Catalog - Open Targets Genetics       

GWAS Catalog

otg.gwas_catalog.GWASCatalogStep dataclass

GWAS Catalog ingestion step to extract GWASCatalog Study and StudyLocus tables.

Attributes:

Name Type Description
session Session

Session object.

catalog_studies_file str

Raw GWAS catalog studies file.

catalog_ancestry_file str

Ancestry annotations file from GWAS Catalog.

catalog_sumstats_lut str

GWAS Catalog summary statistics lookup table.

catalog_associations_file str

Raw GWAS catalog associations file.

variant_annotation_path str

Input variant annotation path.

ld_populations list

List of populations to include.

min_r2 float

Minimum r2 to consider when considering variants within a window.

catalog_studies_out str

Output GWAS catalog studies path.

catalog_associations_out str

Output GWAS catalog associations path.

Source code in src/otg/gwas_catalog.py
19
 20
 21
 22
@@ -72,11 +72,13 @@
 90
 91
 92
-93
@dataclass
+93
+94
@dataclass
 class GWASCatalogStep:
     """GWAS Catalog ingestion step to extract GWASCatalog Study and StudyLocus tables.
 
     Attributes:
+        session (Session): Session object.
         catalog_studies_file (str): Raw GWAS catalog studies file.
         catalog_ancestry_file (str): Ancestry annotations file from GWAS Catalog.
         catalog_sumstats_lut (str): GWAS Catalog summary statistics lookup table.
diff --git a/python_api/step/gwas_catalog_sumstat_preprocess/index.html b/python_api/step/gwas_catalog_sumstat_preprocess/index.html
index da76c8b29..8af9e076a 100644
--- a/python_api/step/gwas_catalog_sumstat_preprocess/index.html
+++ b/python_api/step/gwas_catalog_sumstat_preprocess/index.html
@@ -1,4 +1,4 @@
- GWAS Catalog sumstat preprocess - Open Targets Genetics       

GWAS Catalog sumstat preprocess

otg.gwas_catalog_sumstat_preprocess.GWASCatalogSumstatsPreprocessStep dataclass

Step to preprocess GWAS Catalog harmonised summary stats.

Attributes:

Name Type Description
raw_sumstats_path str

Input raw GWAS Catalog summary statistics path.

out_sumstats_path str

Output GWAS Catalog summary statistics path.

study_id str

GWAS Catalog study identifier.

Source code in src/otg/gwas_catalog_sumstat_preprocess.py
12
+ GWAS Catalog sumstat preprocess - Open Targets Genetics       

GWAS Catalog sumstat preprocess

otg.gwas_catalog_sumstat_preprocess.GWASCatalogSumstatsPreprocessStep dataclass

Step to preprocess GWAS Catalog harmonised summary stats.

Attributes:

Name Type Description
session Session

Session object.

raw_sumstats_path str

Input raw GWAS Catalog summary statistics path.

out_sumstats_path str

Output GWAS Catalog summary statistics path.

study_id str

GWAS Catalog study identifier.

Source code in src/otg/gwas_catalog_sumstat_preprocess.py
12
 13
 14
 15
@@ -33,11 +33,13 @@
 44
 45
 46
-47
@dataclass
+47
+48
@dataclass
 class GWASCatalogSumstatsPreprocessStep:
     """Step to preprocess GWAS Catalog harmonised summary stats.
 
     Attributes:
+        session (Session): Session object.
         raw_sumstats_path (str): Input raw GWAS Catalog summary statistics path.
         out_sumstats_path (str): Output GWAS Catalog summary statistics path.
         study_id (str): GWAS Catalog study identifier.
diff --git a/python_api/step/l2g/index.html b/python_api/step/l2g/index.html
new file mode 100644
index 000000000..4b7c902bd
--- /dev/null
+++ b/python_api/step/l2g/index.html
@@ -0,0 +1,354 @@
+ Locus-to-gene (L2G) - Open Targets Genetics       

Locus-to-gene (L2G)

otg.l2g.LocusToGeneStep dataclass

Locus to gene step.

Attributes:

session (Session): Session object.
extended_spark_conf (dict[str, str] | None): Extended Spark configuration.
run_mode (str): One of "train" or "predict".
wandb_run_name (str | None): Name of the run to be tracked on W&B.
perform_cross_validation (bool): Whether to perform cross validation.
model_path (str | None): Path to save the model.
predictions_path (str | None): Path to save the predictions.
study_locus_path (str): Path to study locus Parquet files.
variant_gene_path (str): Path to variant to gene Parquet files.
colocalisation_path (str): Path to colocalisation Parquet files.
study_index_path (str): Path to study index Parquet files.
study_locus_overlap_path (str | None): Path to study locus overlap Parquet files.
gold_standard_curation_path (str | None): Path to gold standard curation JSON files.
gene_interactions_path (str | None): Path to gene interactions Parquet files.
features_list (list[str]): List of features to use.
hyperparameters (dict): Hyperparameters for the model.
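
Instantiating the dataclass runs the step through __post_init__. A hedged sketch of a training configuration with entirely hypothetical paths; in practice these values are normally supplied through the step configuration rather than written by hand:

LocusToGeneStep(
    run_mode="train",
    study_locus_path="gs://bucket/study_locus",                   # hypothetical paths throughout
    variant_gene_path="gs://bucket/v2g",
    colocalisation_path="gs://bucket/colocalisation",
    study_index_path="gs://bucket/study_index",
    study_locus_overlap_path="gs://bucket/study_locus_overlap",
    gold_standard_curation_path="gs://bucket/gold_standard.json",
    gene_interactions_path="gs://bucket/interactions",
    model_path="gs://bucket/l2g_model",
)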

Source code in src/otg/l2g.py
@dataclass
+class LocusToGeneStep:
+    """Locus to gene step.
+
+    Attributes:
+        session (Session): Session object.
+        extended_spark_conf (dict[str, str] | None): Extended Spark configuration.
+        run_mode (str): One of "train" or "predict".
+        wandb_run_name (str | None): Name of the run to be tracked on W&B.
+        perform_cross_validation (bool): Whether to perform cross validation.
+        model_path (str | None): Path to save the model.
+        predictions_path (str | None): Path to save the predictions.
+        study_locus_path (str): Path to study locus Parquet files.
+        variant_gene_path (str): Path to variant to gene Parquet files.
+        colocalisation_path (str): Path to colocalisation Parquet files.
+        study_index_path (str): Path to study index Parquet files.
+        study_locus_overlap_path (str | None): Path to study locus overlap Parquet files.
+        gold_standard_curation_path (str | None): Path to gold standard curation JSON files.
+        gene_interactions_path (str | None): Path to gene interactions Parquet files.
+        features_list (list[str]): List of features to use.
+        hyperparameters (dict): Hyperparameters for the model.
+    """
+
+    session: Session = Session()
+    extended_spark_conf: dict[str, str] | None = None
+
+    run_mode: str = MISSING
+    wandb_run_name: str | None = None
+    perform_cross_validation: bool = False
+    model_path: str | None = None
+    predictions_path: str | None = None
+    study_locus_path: str = MISSING
+    variant_gene_path: str = MISSING
+    colocalisation_path: str = MISSING
+    study_index_path: str = MISSING
+    study_locus_overlap_path: str | None = None
+    gold_standard_curation_path: str | None = None
+    gene_interactions_path: str | None = None
+    features_list: list[str] = field(
+        default_factory=lambda: [
+            # average distance of all tagging variants to gene TSS
+            "distanceTssMean",
+            # # minimum distance of all tagging variants to gene TSS
+            # "distanceTssMinimum",
+            # # max clpp for each (study, locus, gene) aggregating over all eQTLs
+            # "eqtlColocClppLocalMaximum",
+            # # max clpp for each (study, locus) aggregating over all eQTLs
+            # "eqtlColocClppNeighborhoodMaximum",
+            # # max log-likelihood ratio value for each (study, locus, gene) aggregating over all eQTLs
+            # "eqtlColocLlrLocalMaximum",
+            # # max log-likelihood ratio value for each (study, locus) aggregating over all eQTLs
+            # "eqtlColocLlrNeighborhoodMaximum",
+            # # max clpp for each (study, locus, gene) aggregating over all pQTLs
+            # "pqtlColocClppLocalMaximum",
+            # # max clpp for each (study, locus) aggregating over all pQTLs
+            # "pqtlColocClppNeighborhoodMaximum",
+            # # max log-likelihood ratio value for each (study, locus, gene) aggregating over all pQTLs
+            # "pqtlColocLlrLocalMaximum",
+            # # max log-likelihood ratio value for each (study, locus) aggregating over all pQTLs
+            # "pqtlColocLlrNeighborhoodMaximum",
+            # # max clpp for each (study, locus, gene) aggregating over all sQTLs
+            # "sqtlColocClppLocalMaximum",
+            # # max clpp for each (study, locus) aggregating over all sQTLs
+            # "sqtlColocClppNeighborhoodMaximum",
+            # # max log-likelihood ratio value for each (study, locus, gene) aggregating over all sQTLs
+            # "sqtlColocLlrLocalMaximum",
+            # # max log-likelihood ratio value for each (study, locus) aggregating over all sQTLs
+            # "sqtlColocLlrNeighborhoodMaximum",
+        ]
+    )
+    hyperparameters: dict = field(
+        default_factory=lambda: {
+            "max_depth": 5,
+            "loss_function": "binary:logistic",
+        }
+    )
+
+    def __post_init__(self: LocusToGeneStep) -> None:
+        """Run step.
+
+        Raises:
+            ValueError: if run_mode is not one of "train" or "predict".
+        """
+        if self.run_mode not in ["train", "predict"]:
+            raise ValueError(
+                f"run_mode must be one of 'train' or 'predict', got {self.run_mode}"
+            )
+        # Load common inputs
+        study_locus = StudyLocus.from_parquet(
+            self.session, self.study_locus_path, recursiveFileLookup=True
+        )
+        studies = StudyIndex.from_parquet(self.session, self.study_index_path)
+        v2g = V2G.from_parquet(self.session, self.variant_gene_path)
+        # coloc = Colocalisation.from_parquet(self.session, self.colocalisation_path) # TODO: run step
+
+        if self.run_mode == "train":
+            # Process gold standard and L2G features
+            study_locus_overlap = StudyLocusOverlap.from_parquet(
+                self.session, self.study_locus_overlap_path
+            )
+            gs_curation = self.session.spark.read.json(self.gold_standard_curation_path)
+            interactions = self.session.spark.read.parquet(self.gene_interactions_path)
+
+            gold_standards = L2GGoldStandard.from_otg_curation(
+                gold_standard_curation=gs_curation,
+                v2g=v2g,
+                study_locus_overlap=study_locus_overlap,
+                interactions=interactions,
+            )
+
+            fm = L2GFeatureMatrix.generate_features(
+                study_locus=study_locus,
+                study_index=studies,
+                variant_gene=v2g,
+                # colocalisation=coloc,
+            )
+
+            # Join and fill null values with 0
+            data = L2GFeatureMatrix(
+                _df=gold_standards.df.drop("sources").join(
+                    fm.df, on=["studyLocusId", "geneId"], how="inner"
+                ),
+                _schema=L2GFeatureMatrix.get_schema(),
+            ).fill_na()
+
+            # Instantiate classifier
+            estimator = SparkXGBClassifier(
+                eval_metric="logloss",
+                features_col="features",
+                label_col="label",
+                max_depth=5,
+            )
+            l2g_model = LocusToGeneModel(
+                features_list=list(self.features_list), estimator=estimator
+            )
+            if self.perform_cross_validation:
+                # Perform cross validation to extract what are the best hyperparameters
+                cv_folds = self.hyperparameters.get("cross_validation_folds", 5)
+                LocusToGeneTrainer.cross_validate(
+                    l2g_model=l2g_model,
+                    data=data,
+                    num_folds=cv_folds,
+                )
+            else:
+                # Train model
+                model = LocusToGeneTrainer.train(
+                    data=data,
+                    l2g_model=l2g_model,
+                    features_list=list(self.features_list),
+                    model_path=self.model_path,
+                    evaluate=True,
+                    wandb_run_name=self.wandb_run_name,
+                    **self.hyperparameters,
+                )
+                model.save(self.model_path)
+                self.session.logger.info(
+                    f"Finished L2G step. L2G model saved to {self.model_path}"
+                )
+
+        if self.run_mode == "predict":
+            if not self.model_path or not self.predictions_path:
+                raise ValueError(
+                    "model_path and predictions_path must be set for predict mode."
+                )
+            predictions = L2GPrediction.from_study_locus(
+                self.model_path,
+                study_locus,
+                studies,
+                v2g,
+                # coloc
+            )
+            predictions.df.write.mode(self.session.write_mode).parquet(
+                self.predictions_path
+            )
+            self.session.logger.info(
+                f"Finished L2G step. L2G predictions saved to {self.predictions_path}"
+            )
+
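Because __post_init__ runs the step, constructing the dataclass executes it end to end. Purely as an illustrative sketch (every path below is a placeholder, and in practice the step is normally launched via the workflow configuration rather than instantiated by hand), predict mode would look roughly like this:

# Illustrative sketch only -- all paths are placeholders, not real locations.
from otg.l2g import LocusToGeneStep

LocusToGeneStep(
    run_mode="predict",
    model_path="gs://my-bucket/l2g/model",                # placeholder
    predictions_path="gs://my-bucket/l2g/predictions",    # placeholder
    study_locus_path="gs://my-bucket/study_locus",        # placeholder
    variant_gene_path="gs://my-bucket/v2g",               # placeholder
    colocalisation_path="gs://my-bucket/colocalisation",  # placeholder
    study_index_path="gs://my-bucket/study_index",        # placeholder
)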

\ No newline at end of file diff --git a/python_api/step/ld_index/index.html b/python_api/step/ld_index/index.html index 2bfffda75..7c9458356 100644 --- a/python_api/step/ld_index/index.html +++ b/python_api/step/ld_index/index.html @@ -1,4 +1,4 @@ - LD Index - Open Targets Genetics

LD Index

otg.ld_index.LDIndexStep dataclass

LD index step.

This step is resource-intensive.

Suggested params: high memory machine, 5TB of boot disk, no SSDs.

Attributes:

Name Type Description
ld_matrix_template str

Template path for LD matrix from gnomAD.

ld_index_raw_template str

Template path for the variant indices correspondence in the LD Matrix from gnomAD.

min_r2 float

Minimum r2 value to consider for variants within a window.

grch37_to_grch38_chain_path str

Path to GRCh37 to GRCh38 chain file.

ld_populations List[str]

List of population-specific LD matrices to process.

ld_index_out str

Output LD index path.

Source code in src/otg/ld_index.py
14
+ LD Index - Open Targets Genetics       

LD Index

otg.ld_index.LDIndexStep dataclass

LD index step.

This step is resource-intensive.

Suggested params: high memory machine, 5TB of boot disk, no SSDs.

Attributes:

Name Type Description
session Session

Session object.

start_hail bool

Whether to start Hail. Defaults to True.

ld_matrix_template str

Template path for LD matrix from gnomAD.

ld_index_raw_template str

Template path for the variant indices correspondence in the LD Matrix from gnomAD.

min_r2 float

Minimum r2 value to consider for variants within a window.

grch37_to_grch38_chain_path str

Path to GRCh37 to GRCh38 chain file.

ld_populations List[str]

List of population-specific LD matrices to process.

ld_index_out str

Output LD index path.
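The {POP} placeholder in the two template paths is presumably resolved once per entry of ld_populations. A minimal sketch of that substitution (the population codes below are assumptions, not taken from this page):

# Illustrative only: derive a per-population LD matrix path from the template.
ld_matrix_template = (
    "gs://gcp-public-data--gnomad/release/2.1.1/ld/"
    "gnomad.genomes.r2.1.1.{POP}.common.adj.ld.bm"
)
for population in ["nfe", "afr"]:  # hypothetical ld_populations entries
    print(ld_matrix_template.format(POP=population))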

Source code in src/otg/ld_index.py
14
 15
 16
 17
@@ -51,7 +51,10 @@
 64
 65
 66
-67
@dataclass
+67
+68
+69
+70
@dataclass
 class LDIndexStep:
     """LD index step.
 
@@ -59,6 +62,8 @@
         Suggested params: high memory machine, 5TB of boot disk, no SSDs.
 
     Attributes:
+        session (Session): Session object.
+        start_hail (bool): Whether to start Hail. Defaults to True.
         ld_matrix_template (str): Template path for LD matrix from gnomAD.
         ld_index_raw_template (str): Template path for the variant indices correspondance in the LD Matrix from gnomAD.
         min_r2 (float): Minimum r2 to consider when considering variants within a window.
@@ -68,6 +73,7 @@
     """
 
     session: Session = Session()
+    start_hail: bool = True
 
     ld_matrix_template: str = "gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.adj.ld.bm"
     ld_index_raw_template: str = "gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.ld.variant_indices.ht"
diff --git a/python_api/step/ukbiobank/index.html b/python_api/step/ukbiobank/index.html
index adf7b0855..e5ea0e891 100644
--- a/python_api/step/ukbiobank/index.html
+++ b/python_api/step/ukbiobank/index.html
@@ -1,4 +1,4 @@
- UK Biobank - Open Targets Genetics       

UK Biobank

otg.ukbiobank.UKBiobankStep dataclass

UKBiobank study table ingestion step.

Attributes:

Name Type Description
ukbiobank_manifest str

UKBiobank manifest of studies.

ukbiobank_study_index_out str

Output path for the UKBiobank study index dataset.

Source code in src/otg/ukbiobank.py
13
+ UK Biobank - Open Targets Genetics       

UK Biobank

otg.ukbiobank.UKBiobankStep dataclass

UKBiobank study table ingestion step.

Attributes:

Name Type Description
session Session

Session object.

ukbiobank_manifest str

UKBiobank manifest of studies.

ukbiobank_study_index_out str

Output path for the UKBiobank study index dataset.

Source code in src/otg/ukbiobank.py
13
 14
 15
 16
@@ -25,11 +25,13 @@
 37
 38
 39
-40
@dataclass
+40
+41
@dataclass
 class UKBiobankStep:
     """UKBiobank study table ingestion step.
 
     Attributes:
+        session (Session): Session object.
         ukbiobank_manifest (str): UKBiobank manifest of studies.
         ukbiobank_study_index_out (str): Output path for the UKBiobank study index dataset.
     """
diff --git a/python_api/step/variant_annotation_step/index.html b/python_api/step/variant_annotation_step/index.html
index 72f021982..54e710efb 100644
--- a/python_api/step/variant_annotation_step/index.html
+++ b/python_api/step/variant_annotation_step/index.html
@@ -1,4 +1,4 @@
- Variant Annotation - Open Targets Genetics       

Variant Annotation

otg.variant_annotation.VariantAnnotationStep dataclass

Variant annotation step.

The variant annotation step produces a dataset of the type VariantAnnotation derived from gnomAD's gnomad.genomes.vX.X.X.sites.ht Hail table. This dataset is used to validate variants and as a source of annotation.

Attributes:

Name Type Description
gnomad_genomes str

Path to gnomAD genomes hail table.

chain_38_to_37 str

Path to GRCh38 to GRCh37 chain file.

variant_annotation_path str

Output variant annotation path.

populations List[str]

List of populations to include.

Source code in src/otg/variant_annotation.py
14
+ Variant Annotation - Open Targets Genetics       

Variant Annotation

otg.variant_annotation.VariantAnnotationStep dataclass

Variant annotation step.

The variant annotation step produces a dataset of the type VariantAnnotation derived from gnomAD's gnomad.genomes.vX.X.X.sites.ht Hail table. This dataset is used to validate variants and as a source of annotation.

Attributes:

Name Type Description
session Session

Session object.

start_hail bool

Whether to start a Hail session. Defaults to True.

gnomad_genomes str

Path to gnomAD genomes hail table.

chain_38_to_37 str

Path to GRCh38 to GRCh37 chain file.

variant_annotation_path str

Output variant annotation path.

populations List[str]

List of populations to include.

Source code in src/otg/variant_annotation.py
14
 15
 16
 17
@@ -48,13 +48,18 @@
 61
 62
 63
-64
@dataclass
+64
+65
+66
+67
@dataclass
 class VariantAnnotationStep:
     """Variant annotation step.
 
     Variant annotation step produces a dataset of the type `VariantAnnotation` derived from gnomADs `gnomad.genomes.vX.X.X.sites.ht` Hail's table. This dataset is used to validate variants and as a source of annotation.
 
     Attributes:
+        session (Session): Session object.
+        start_hail (bool): Whether to start a Hail session. Defaults to True.
         gnomad_genomes (str): Path to gnomAD genomes hail table.
         chain_38_to_37 (str): Path to GRCh38 to GRCh37 chain file.
         variant_annotation_path (str): Output variant annotation path.
@@ -62,6 +67,7 @@
     """
 
     session: Session = Session()
+    start_hail: bool = True
 
     gnomad_genomes: str = MISSING
     chain_38_to_37: str = MISSING
diff --git a/python_api/step/variant_index_step/index.html b/python_api/step/variant_index_step/index.html
index 38dea8ca5..3f5e99074 100644
--- a/python_api/step/variant_index_step/index.html
+++ b/python_api/step/variant_index_step/index.html
@@ -1,4 +1,4 @@
- Variant Index - Open Targets Genetics       

Variant Index

otg.variant_index.VariantIndexStep dataclass

Run the variant index step, restricted to variants in study-locus sets.

Using a VariantAnnotation dataset as a reference, this step creates and writes a dataset of the type VariantIndex that includes only variants that have disease-association data with a reduced set of annotations.

Attributes:

Name Type Description
variant_annotation_path str

Input variant annotation path.

study_locus_path str

Input study-locus path.

variant_index_path str

Output variant index path.

Source code in src/otg/variant_index.py
14
+ Variant Index - Open Targets Genetics       

Variant Index

otg.variant_index.VariantIndexStep dataclass

Run the variant index step, restricted to variants in study-locus sets.

Using a VariantAnnotation dataset as a reference, this step creates and writes a dataset of the type VariantIndex that includes only variants that have disease-association data with a reduced set of annotations.

Attributes:

Name Type Description
session Session

Session object.

variant_annotation_path str

Input variant annotation path.

study_locus_path str

Input study-locus path.

variant_index_path str

Output variant index path.

Source code in src/otg/variant_index.py
14
 15
 16
 17
@@ -33,13 +33,15 @@
 46
 47
 48
-49
@dataclass
+49
+50
@dataclass
 class VariantIndexStep:
     """Run variant index step to only variants in study-locus sets.
 
     Using a `VariantAnnotation` dataset as a reference, this step creates and writes a dataset of the type `VariantIndex` that includes only variants that have disease-association data with a reduced set of annotations.
 
     Attributes:
+        session (Session): Session object.
         variant_annotation_path (str): Input variant annotation path.
         study_locus_path (str): Input study-locus path.
         variant_index_path (str): Output variant index path.
diff --git a/python_api/step/variant_to_gene_step/index.html b/python_api/step/variant_to_gene_step/index.html
index 4eb460820..afddd4800 100644
--- a/python_api/step/variant_to_gene_step/index.html
+++ b/python_api/step/variant_to_gene_step/index.html
@@ -1,4 +1,4 @@
- Variant-to-gene - Open Targets Genetics       

Variant-to-gene

otg.v2g.V2GStep dataclass

Variant-to-gene (V2G) step.

This step aims to generate a dataset that contains multiple pieces of evidence supporting the functional association of specific variants with genes. Some of the evidence types include:

  1. Chromatin interaction experiments, e.g. Promoter Capture Hi-C (PCHi-C).
  2. In silico functional predictions, e.g. Variant Effect Predictor (VEP) from Ensembl.
  3. Distance between the variant and each gene's canonical transcription start site (TSS).

Attributes:

Name Type Description
variant_index_path str

Input variant index path.

variant_annotation_path str

Input variant annotation path.

gene_index_path str

Input gene index path.

vep_consequences_path str

Input VEP consequences path.

liftover_chain_file_path str

Path to GRCh37 to GRCh38 chain file.

liftover_max_length_difference int

Maximum length difference for liftover.

max_distance int

Maximum distance to consider.

approved_biotypes list[str]

List of approved biotypes.

intervals dict

Dictionary of interval sources.

v2g_path str

Output V2G path.

Source code in src/otg/v2g.py
 20
+ Variant-to-gene - Open Targets Genetics       

Variant-to-gene

otg.v2g.V2GStep dataclass

Variant-to-gene (V2G) step.

This step aims to generate a dataset that contains multiple pieces of evidence supporting the functional association of specific variants with genes. Some of the evidence types include:

  1. Chromatin interaction experiments, e.g. Promoter Capture Hi-C (PCHi-C).
  2. In silico functional predictions, e.g. Variant Effect Predictor (VEP) from Ensembl.
  3. Distance between the variant and each gene's canonical transcription start site (TSS).

Attributes:

Name Type Description
session Session

Session object.

variant_index_path str

Input variant index path.

variant_annotation_path str

Input variant annotation path.

gene_index_path str

Input gene index path.

vep_consequences_path str

Input VEP consequences path.

liftover_chain_file_path str

Path to GRCh37 to GRCh38 chain file.

liftover_max_length_difference int

Maximum length difference for liftover.

max_distance int

Maximum distance to consider.

approved_biotypes list[str]

List of approved biotypes.

intervals dict

Dictionary of interval sources.

v2g_path str

Output V2G path.

Source code in src/otg/v2g.py
 20
  21
  22
  23
@@ -114,7 +114,8 @@
 133
 134
 135
-136
@dataclass
+136
+137
@dataclass
 class V2GStep:
     """Variant-to-gene (V2G) step.
 
@@ -125,6 +126,7 @@
     3. Distance between the variant and each gene's canonical transcription start site (TSS).
 
     Attributes:
+        session (Session): Session object.
         variant_index_path (str): Input variant index path.
         variant_annotation_path (str): Input variant annotation path.
         gene_index_path (str): Input gene index path.
diff --git a/roadmap/index.html b/roadmap/index.html
index 1c15e21a2..5482decbb 100644
--- a/roadmap/index.html
+++ b/roadmap/index.html
@@ -1 +1 @@
- Roadmap - Open Targets Genetics       
\ No newline at end of file + Roadmap - Open Targets Genetics
\ No newline at end of file diff --git a/search/search_index.json b/search/search_index.json index ba6055b6f..0b45bc56e 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Open Targets Genetics","text":"

Ingestion and analysis of genetic and functional genomic data for the identification and prioritisation of drug targets.

This project is still in an experimental phase. Please refer to the roadmap section for more information.

For all development information, including running the code, troubleshooting, or contributing, see the development section.

"},{"location":"installation/","title":"Installation","text":"

TBC

"},{"location":"roadmap/","title":"Roadmap","text":"

The Open Targets core team is working on refactoring Open Targets Genetics, aiming to:

  • Re-focus the product around Target ID
  • Create a gold standard toolkit for post-GWAS analysis
  • Enable faster and more robust addition of new datasets and data types
  • Reduce computational and financial cost

See here for a list of open issues for this project.

Schematic diagram representing the drafted process:

"},{"location":"usage/","title":"How-to","text":"

TBC

"},{"location":"development/_development/","title":"Development","text":"

This section contains various technical information on how to develop and run the code.

"},{"location":"development/airflow/","title":"Running Airflow workflows","text":"

Airflow code is located in src/airflow. Make sure to execute all of the instructions from that directory, unless stated otherwise.

"},{"location":"development/airflow/#set-up-docker","title":"Set up Docker","text":"

We will be running a local Airflow setup using Docker Compose. First, make sure it is installed (this and subsequent commands are tested on Ubuntu):

sudo apt install docker-compose\n

Next, verify that you can run Docker. This should say \"Hello from Docker\":

docker run hello-world\n

If the command above raises a permission error, fix it and reboot:

sudo usermod -a -G docker $USER\nnewgrp docker\n
"},{"location":"development/airflow/#set-up-airflow","title":"Set up Airflow","text":"

This section is adapted from the instructions at https://airflow.apache.org/docs/apache-airflow/stable/tutorial/pipeline.html. When you run the commands, make sure your current working directory is src/airflow.

# Download the latest docker-compose.yaml file.\ncurl -sLfO https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml\n\n# Make expected directories.\nmkdir -p ./config ./dags ./logs ./plugins\n\n# Construct the modified Docker image with additional PIP dependencies.\ndocker build . --tag opentargets-airflow:2.7.1\n\n# Set environment variables.\ncat << EOF > .env\nAIRFLOW_UID=$(id -u)\nAIRFLOW_IMAGE_NAME=opentargets-airflow:2.7.1\nEOF\n

Now modify docker-compose.yaml and add the following to the x-airflow-common \u2192 environment section:

GOOGLE_APPLICATION_CREDENTIALS: '/opt/airflow/config/application_default_credentials.json'\nAIRFLOW__CELERY__WORKER_CONCURRENCY: 32\nAIRFLOW__CORE__PARALLELISM: 32\nAIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG: 32\nAIRFLOW__SCHEDULER__MAX_TIS_PER_QUERY: 16\nAIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG: 1\n

"},{"location":"development/airflow/#start-airflow","title":"Start Airflow","text":"
docker-compose up\n

Airflow UI will now be available at http://localhost:8080/home. Default username and password are both airflow.

"},{"location":"development/airflow/#configure-google-cloud-access","title":"Configure Google Cloud access","text":"

In order to be able to access Google Cloud and do work with Dataproc, Airflow will need to be configured. First, obtain Google default application credentials by running this command and following the instructions:

gcloud auth application-default login\n

Next, copy the file into the config/ subdirectory which we created above:

cp ~/.config/gcloud/application_default_credentials.json config/\n

Now open the Airflow UI and:

  • Navigate to Admin \u2192 Connections.
  • Click on \"Add new record\".
  • Set \"Connection type\" to `Google Cloud``.
  • Set \"Connection ID\" to google_cloud_default.
  • Set \"Credential Configuration File\" to /opt/airflow/config/application_default_credentials.json.
  • Click on \"Save\".
"},{"location":"development/airflow/#run-a-workflow","title":"Run a workflow","text":"

Workflows, which must be placed under the dags/ directory, will appear in the \"DAGs\" section of the UI, which is also the main page. They can be triggered manually by opening a workflow and clicking on the \"Play\" button in the upper right corner.
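As a minimal illustrative sketch (the file name, DAG id and task are placeholders, not one of the project's workflows), a file dropped into dags/ could look like this:

# dags/example_dag.py -- hypothetical example, not part of the repository.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_dag",            # placeholder name
    start_date=datetime(2023, 1, 1),
    schedule=None,                   # only triggered manually from the UI
    catchup=False,
) as dag:
    hello = BashOperator(task_id="say_hello", bash_command="echo hello")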

In order to restart a failed task, click on it and then click on \"Clear task\".

"},{"location":"development/airflow/#troubleshooting","title":"Troubleshooting","text":"

Note that when you add a new workflow under dags/, Airflow will not pick it up immediately. By default, the filesystem is only scanned for new DAGs every 300 seconds. However, once a DAG is added, updates to it are applied nearly instantaneously.

Also, if you edit a DAG while an instance of it is running, it might cause problems with that run, as Airflow will try to update the tasks and their properties in the DAG according to the file changes.

"},{"location":"development/contributing/","title":"Contributing guidelines","text":""},{"location":"development/contributing/#one-time-configuration","title":"One-time configuration","text":"

The steps in this section only ever need to be done once on any particular system.

Google Cloud configuration: 1. Install Google Cloud SDK: https://cloud.google.com/sdk/docs/install. 1. Log in to your work Google Account: run gcloud auth login and follow instructions. 1. Obtain Google application credentials: run gcloud auth application-default login and follow instructions.

Check that you have the make utility installed, and if not (which is unlikely), install it using your system package manager.

Check that you have java installed.

"},{"location":"development/contributing/#environment-configuration","title":"Environment configuration","text":"

Run make setup-dev to install/update the necessary packages and activate the development environment. You need to do this every time you open a new shell.

It is recommended to use VS Code as an IDE for development.

"},{"location":"development/contributing/#how-to-run-the-code","title":"How to run the code","text":"

All pipelines in this repository are intended to be run in Google Dataproc. Running them locally is not currently supported.

In order to run the code:

  1. Manually edit your local workflow/dag.yaml file and comment out the steps you do not want to run.

  2. Manually edit your local pyproject.toml file and modify the version of the code.

    • This must be different from the version used by any other people working on the repository to avoid any deployment conflicts, so it's a good idea to use your name, for example: 1.2.3+jdoe.
    • You can also add a brief branch description, for example: 1.2.3+jdoe.myfeature.
    • Note that the version must comply with PEP440 conventions, otherwise Poetry will not allow it to be deployed.
    • Do not use underscores or hyphens in your version name. When building the WHL file, they will be automatically converted to dots, which means the file name will no longer match the version and the build will fail. Use dots instead.
  3. Run make build.

    • This will create a bundle containing the necessary code, configuration and dependencies to run the ETL pipeline, and then upload this bundle to Google Cloud.
    • A version-specific subpath is used, so uploading the code will not affect any branches but your own.
    • If there was already a code bundle uploaded with the same version number, it will be replaced.
  4. Submit the Dataproc job with poetry run python workflow/workflow_template.py

    • You will need to specify additional parameters, some are mandatory and some are optional. Run with --help to see usage.
    • The script will provision the cluster and submit the job.
    • The cluster will take a few minutes to get provisioned and running, during which the script will not output anything; this is normal.
    • Once submitted, you can monitor the progress of your job on this page: https://console.cloud.google.com/dataproc/jobs?project=open-targets-genetics-dev.
    • On completion (whether successful or a failure), the cluster will be automatically removed, so you don't have to worry about shutting it down to avoid incurring charges.
"},{"location":"development/contributing/#contributing-checklist","title":"Contributing checklist","text":"

When making changes, and especially when implementing a new module or feature, it's essential to ensure that all relevant sections of the code base are modified.
- [ ] Run make check. This will run the linter and formatter to ensure that the code is compliant with the project conventions.
- [ ] Develop unit tests for your code and run make test. This will run all unit tests in the repository, including the examples appended in the docstrings of some methods.
- [ ] Update the configuration if necessary.
- [ ] Update the documentation and check it with make build-documentation. This will start a local server to browse it (URL will be printed, usually http://127.0.0.1:8000/)

For more details on each of these steps, see the sections below.

"},{"location":"development/contributing/#documentation","title":"Documentation","text":"
  • If during development you had a question which wasn't covered in the documentation, and someone explained it to you, add it to the documentation. The same applies if you encountered any instructions in the documentation which were obsolete or incorrect.
  • Documentation autogeneration expressions start with :::. They will automatically generate sections of the documentation based on class and method docstrings. Be sure to update them for:
  • Dataset definitions in docs/reference/dataset (example: docs/reference/dataset/study_index/study_index_finngen.md)
  • Step definition in docs/reference/step (example: docs/reference/step/finngen.md)
"},{"location":"development/contributing/#configuration","title":"Configuration","text":"
  • Input and output paths in config/datasets/gcp.yaml
  • Step configuration in config/step/STEP.yaml (example: config/step/finngen.yaml)
"},{"location":"development/contributing/#classes","title":"Classes","text":"
  • Dataset class in src/otg/dataset/ (example: src/otg/dataset/study_index.py \u2192 StudyIndexFinnGen)
  • Step main running class in src/otg/STEP.py (example: src/otg/finngen.py)
"},{"location":"development/contributing/#tests","title":"Tests","text":"
  • Test study fixture in tests/conftest.py (example: mock_study_index_finngen in that module)
  • Test sample data in tests/data_samples (example: tests/data_samples/finngen_studies_sample.json)
  • Test definition in tests/ (example: tests/dataset/test_study_index.py \u2192 test_study_index_finngen_creation)
"},{"location":"development/troubleshooting/","title":"Troubleshooting","text":""},{"location":"development/troubleshooting/#blaslapack","title":"BLAS/LAPACK","text":"

If you see errors related to BLAS/LAPACK libraries, see this StackOverflow post for guidance.

"},{"location":"development/troubleshooting/#pyenv-and-poetry","title":"Pyenv and Poetry","text":"

If you see various errors thrown by Pyenv or Poetry, they can be hard to specifically diagnose and resolve. In this case, it often helps to remove those tools from the system completely. Follow these steps:

  1. Close your currently activated environment, if any: exit
  2. Uninstall Poetry: curl -sSL https://install.python-poetry.org | python3 - --uninstall
  3. Clear Poetry cache: rm -rf ~/.cache/pypoetry
  4. Clear pre-commit cache: rm -rf ~/.cache/pre-commit
  5. Switch to system Python shell: pyenv shell system
  6. Edit ~/.bashrc to remove the lines related to Pyenv configuration
  7. Remove Pyenv configuration and cache: rm -rf ~/.pyenv

After that, open a fresh shell session and run make setup-dev again.

"},{"location":"development/troubleshooting/#java","title":"Java","text":"

Officially, PySpark requires Java version 8 (a.k.a. 1.8) or above to work. However, if you have a very recent version of Java, you may experience issues, as it may introduce breaking changes that PySpark hasn't had time to integrate. For example, as of May 2023, PySpark did not work with Java 20.

If you are encountering problems with initialising a Spark session, try using Java 11.

"},{"location":"development/troubleshooting/#pre-commit","title":"Pre-commit","text":"

If you see an error message thrown by pre-commit, which looks like this (SyntaxError: Unexpected token '?'), followed by a JavaScript traceback, the issue is likely with your system NodeJS version.

One solution which can help in this case is to upgrade your system NodeJS version. However, this may not always be possible. For example, the Ubuntu repository is several major versions behind the latest release as of July 2023.

Another solution which helps is to remove Node, NodeJS, and npm from your system entirely. In this case, pre-commit will not try to rely on a system version of NodeJS and will install its own, suitable one.

On Ubuntu, this can be done using sudo apt remove node nodejs npm, followed by sudo apt autoremove. But in some cases, depending on your existing installation, you may need to also manually remove some files. See this StackOverflow answer for guidance.

After running these commands, you are advised to open a fresh shell, and then also reinstall Pyenv and Poetry to make sure they pick up the changes (see relevant section above).

"},{"location":"python_api/dataset/_dataset/","title":"Dataset","text":""},{"location":"python_api/dataset/_dataset/#otg.dataset.dataset.Dataset","title":"otg.dataset.dataset.Dataset dataclass","text":"

Bases: ABC

Open Targets Genetics Dataset.

Dataset is a wrapper around a Spark DataFrame with a predefined schema. Schemas for each child dataset are described in the json.schemas module.

Source code in src/otg/dataset/dataset.py
@dataclass\nclass Dataset(ABC):\n    \"\"\"Open Targets Genetics Dataset.\n\n    `Dataset` is a wrapper around a Spark DataFrame with a predefined schema. Schemas for each child dataset are described in the `json.schemas` module.\n    \"\"\"\n\n    _df: DataFrame\n    _schema: StructType\n\n    def __post_init__(self: Dataset) -> None:\n        \"\"\"Post init.\"\"\"\n        self.validate_schema()\n\n    @property\n    def df(self: Dataset) -> DataFrame:\n        \"\"\"Dataframe included in the Dataset.\n\n        Returns:\n            DataFrame: Dataframe included in the Dataset\n        \"\"\"\n        return self._df\n\n    @df.setter\n    def df(self: Dataset, new_df: DataFrame) -> None:  # noqa: CCE001\n        \"\"\"Dataframe setter.\n\n        Args:\n            new_df (DataFrame): New dataframe to be included in the Dataset\n        \"\"\"\n        self._df: DataFrame = new_df\n        self.validate_schema()\n\n    @property\n    def schema(self: Dataset) -> StructType:\n        \"\"\"Dataframe expected schema.\n\n        Returns:\n            StructType: Dataframe expected schema\n        \"\"\"\n        return self._schema\n\n    @classmethod\n    @abstractmethod\n    def get_schema(cls: type[Dataset]) -> StructType:\n        \"\"\"Abstract method to get the schema. Must be implemented by child classes.\n\n        Returns:\n            StructType: Schema for the Dataset\n        \"\"\"\n        pass\n\n    @classmethod\n    def from_parquet(\n        cls: type[Dataset], session: Session, path: str, **kwargs: dict[str, Any]\n    ) -> Dataset:\n        \"\"\"Reads a parquet file into a Dataset with a given schema.\n\n        Args:\n            session (Session): Spark session\n            path (str): Path to the parquet file\n            **kwargs (dict[str, Any]): Additional arguments to pass to spark.read.parquet\n\n        Returns:\n            Dataset: Dataset with the parquet file contents\n        \"\"\"\n        schema = cls.get_schema()\n        df = session.read_parquet(path=path, schema=schema, **kwargs)\n        return cls(_df=df, _schema=schema)\n\n    def validate_schema(self: Dataset) -> None:  # sourcery skip: invert-any-all\n        \"\"\"Validate DataFrame schema against expected class schema.\n\n        Raises:\n            ValueError: DataFrame schema is not valid\n        \"\"\"\n        expected_schema = self._schema\n        expected_fields = flatten_schema(expected_schema)\n        observed_schema = self._df.schema\n        observed_fields = flatten_schema(observed_schema)\n\n        # Unexpected fields in dataset\n        if unexpected_field_names := [\n            x.name\n            for x in observed_fields\n            if x.name not in [y.name for y in expected_fields]\n        ]:\n            raise ValueError(\n                f\"The {unexpected_field_names} fields are not included in DataFrame schema: {expected_fields}\"\n            )\n\n        # Required fields not in dataset\n        required_fields = [x.name for x in expected_schema if not x.nullable]\n        if missing_required_fields := [\n            req\n            for req in required_fields\n            if not any(field.name == req for field in observed_fields)\n        ]:\n            raise ValueError(\n                f\"The {missing_required_fields} fields are required but missing: {required_fields}\"\n            )\n\n        # Fields with duplicated names\n        if duplicated_fields := [\n            x for x in set(observed_fields) if observed_fields.count(x) > 1\n        ]:\n         
   raise ValueError(\n                f\"The following fields are duplicated in DataFrame schema: {duplicated_fields}\"\n            )\n\n        # Fields with different datatype\n        observed_field_types = {\n            field.name: type(field.dataType) for field in observed_fields\n        }\n        expected_field_types = {\n            field.name: type(field.dataType) for field in expected_fields\n        }\n        if fields_with_different_observed_datatype := [\n            name\n            for name, observed_type in observed_field_types.items()\n            if name in expected_field_types\n            and observed_type != expected_field_types[name]\n        ]:\n            raise ValueError(\n                f\"The following fields present differences in their datatypes: {fields_with_different_observed_datatype}.\"\n            )\n\n    def persist(self: Dataset) -> Dataset:\n        \"\"\"Persist in memory the DataFrame included in the Dataset.\n\n        Returns:\n            Dataset: Persisted Dataset\n        \"\"\"\n        self.df = self._df.persist()\n        return self\n\n    def unpersist(self: Dataset) -> Dataset:\n        \"\"\"Remove the persisted DataFrame from memory.\n\n        Returns:\n            Dataset: Unpersisted Dataset\n        \"\"\"\n        self.df = self._df.unpersist()\n        return self\n
"},{"location":"python_api/dataset/_dataset/#otg.dataset.dataset.Dataset.df","title":"df: DataFrame property writable","text":"

Dataframe included in the Dataset.

Returns:

Name Type Description DataFrame DataFrame

Dataframe included in the Dataset

"},{"location":"python_api/dataset/_dataset/#otg.dataset.dataset.Dataset.schema","title":"schema: StructType property","text":"

Dataframe expected schema.

Returns:

Name Type Description StructType StructType

Dataframe expected schema

"},{"location":"python_api/dataset/_dataset/#otg.dataset.dataset.Dataset.from_parquet","title":"from_parquet(session: Session, path: str, **kwargs: dict[str, Any]) -> Dataset classmethod","text":"

Reads a parquet file into a Dataset with a given schema.

Parameters:

Name Type Description Default session Session

Spark session

required path str

Path to the parquet file

required **kwargs dict[str, Any]

Additional arguments to pass to spark.read.parquet

{}

Returns:

Name Type Description Dataset Dataset

Dataset with the parquet file contents

Source code in src/otg/dataset/dataset.py
@classmethod\ndef from_parquet(\n    cls: type[Dataset], session: Session, path: str, **kwargs: dict[str, Any]\n) -> Dataset:\n    \"\"\"Reads a parquet file into a Dataset with a given schema.\n\n    Args:\n        session (Session): Spark session\n        path (str): Path to the parquet file\n        **kwargs (dict[str, Any]): Additional arguments to pass to spark.read.parquet\n\n    Returns:\n        Dataset: Dataset with the parquet file contents\n    \"\"\"\n    schema = cls.get_schema()\n    df = session.read_parquet(path=path, schema=schema, **kwargs)\n    return cls(_df=df, _schema=schema)\n
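As a usage sketch (the path is a placeholder and session stands for an existing Session instance), any concrete child dataset, for example the Colocalisation dataset documented below, can be loaded this way:

# Illustrative only: the path is a placeholder.
coloc = Colocalisation.from_parquet(session, "gs://my-bucket/colocalisation")
coloc.validate_schema()  # also runs automatically via __post_init__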
"},{"location":"python_api/dataset/_dataset/#otg.dataset.dataset.Dataset.get_schema","title":"get_schema() -> StructType abstractmethod classmethod","text":"

Abstract method to get the schema. Must be implemented by child classes.

Returns:

Name Type Description StructType StructType

Schema for the Dataset

Source code in src/otg/dataset/dataset.py
@classmethod\n@abstractmethod\ndef get_schema(cls: type[Dataset]) -> StructType:\n    \"\"\"Abstract method to get the schema. Must be implemented by child classes.\n\n    Returns:\n        StructType: Schema for the Dataset\n    \"\"\"\n    pass\n
"},{"location":"python_api/dataset/_dataset/#otg.dataset.dataset.Dataset.persist","title":"persist() -> Dataset","text":"

Persist in memory the DataFrame included in the Dataset.

Returns:

Name Type Description Dataset Dataset

Persisted Dataset

Source code in src/otg/dataset/dataset.py
def persist(self: Dataset) -> Dataset:\n    \"\"\"Persist in memory the DataFrame included in the Dataset.\n\n    Returns:\n        Dataset: Persisted Dataset\n    \"\"\"\n    self.df = self._df.persist()\n    return self\n
"},{"location":"python_api/dataset/_dataset/#otg.dataset.dataset.Dataset.unpersist","title":"unpersist() -> Dataset","text":"

Remove the persisted DataFrame from memory.

Returns:

Name Type Description Dataset Dataset

Unpersisted Dataset

Source code in src/otg/dataset/dataset.py
def unpersist(self: Dataset) -> Dataset:\n    \"\"\"Remove the persisted DataFrame from memory.\n\n    Returns:\n        Dataset: Unpersisted Dataset\n    \"\"\"\n    self.df = self._df.unpersist()\n    return self\n
"},{"location":"python_api/dataset/_dataset/#otg.dataset.dataset.Dataset.validate_schema","title":"validate_schema() -> None","text":"

Validate DataFrame schema against expected class schema.

Raises:

Type Description ValueError

DataFrame schema is not valid

Source code in src/otg/dataset/dataset.py
def validate_schema(self: Dataset) -> None:  # sourcery skip: invert-any-all\n    \"\"\"Validate DataFrame schema against expected class schema.\n\n    Raises:\n        ValueError: DataFrame schema is not valid\n    \"\"\"\n    expected_schema = self._schema\n    expected_fields = flatten_schema(expected_schema)\n    observed_schema = self._df.schema\n    observed_fields = flatten_schema(observed_schema)\n\n    # Unexpected fields in dataset\n    if unexpected_field_names := [\n        x.name\n        for x in observed_fields\n        if x.name not in [y.name for y in expected_fields]\n    ]:\n        raise ValueError(\n            f\"The {unexpected_field_names} fields are not included in DataFrame schema: {expected_fields}\"\n        )\n\n    # Required fields not in dataset\n    required_fields = [x.name for x in expected_schema if not x.nullable]\n    if missing_required_fields := [\n        req\n        for req in required_fields\n        if not any(field.name == req for field in observed_fields)\n    ]:\n        raise ValueError(\n            f\"The {missing_required_fields} fields are required but missing: {required_fields}\"\n        )\n\n    # Fields with duplicated names\n    if duplicated_fields := [\n        x for x in set(observed_fields) if observed_fields.count(x) > 1\n    ]:\n        raise ValueError(\n            f\"The following fields are duplicated in DataFrame schema: {duplicated_fields}\"\n        )\n\n    # Fields with different datatype\n    observed_field_types = {\n        field.name: type(field.dataType) for field in observed_fields\n    }\n    expected_field_types = {\n        field.name: type(field.dataType) for field in expected_fields\n    }\n    if fields_with_different_observed_datatype := [\n        name\n        for name, observed_type in observed_field_types.items()\n        if name in expected_field_types\n        and observed_type != expected_field_types[name]\n    ]:\n        raise ValueError(\n            f\"The following fields present differences in their datatypes: {fields_with_different_observed_datatype}.\"\n        )\n
"},{"location":"python_api/dataset/colocalisation/","title":"Colocalisation","text":""},{"location":"python_api/dataset/colocalisation/#otg.dataset.colocalisation.Colocalisation","title":"otg.dataset.colocalisation.Colocalisation dataclass","text":"

Bases: Dataset

Colocalisation results for pairs of overlapping study-locus.

Source code in src/otg/dataset/colocalisation.py
@dataclass\nclass Colocalisation(Dataset):\n    \"\"\"Colocalisation results for pairs of overlapping study-locus.\"\"\"\n\n    @classmethod\n    def get_schema(cls: type[Colocalisation]) -> StructType:\n        \"\"\"Provides the schema for the Colocalisation dataset.\n\n        Returns:\n            StructType: Schema for the Colocalisation dataset\n        \"\"\"\n        return parse_spark_schema(\"colocalisation.json\")\n
"},{"location":"python_api/dataset/colocalisation/#otg.dataset.colocalisation.Colocalisation.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the Colocalisation dataset.

Returns:

Name Type Description StructType StructType

Schema for the Colocalisation dataset

Source code in src/otg/dataset/colocalisation.py
@classmethod\ndef get_schema(cls: type[Colocalisation]) -> StructType:\n    \"\"\"Provides the schema for the Colocalisation dataset.\n\n    Returns:\n        StructType: Schema for the Colocalisation dataset\n    \"\"\"\n    return parse_spark_schema(\"colocalisation.json\")\n
"},{"location":"python_api/dataset/colocalisation/#schema","title":"Schema","text":"
root\n |-- leftStudyLocusId: long (nullable = false)\n |-- rightStudyLocusId: long (nullable = false)\n |-- chromosome: string (nullable = false)\n |-- colocalisationMethod: string (nullable = false)\n |-- numberColocalisingVariants: long (nullable = false)\n |-- h0: double (nullable = true)\n |-- h1: double (nullable = true)\n |-- h2: double (nullable = true)\n |-- h3: double (nullable = true)\n |-- h4: double (nullable = true)\n |-- log2h4h3: double (nullable = true)\n |-- clpp: double (nullable = true)\n
"},{"location":"python_api/dataset/gene_index/","title":"Gene Index","text":""},{"location":"python_api/dataset/gene_index/#otg.dataset.gene_index.GeneIndex","title":"otg.dataset.gene_index.GeneIndex dataclass","text":"

Bases: Dataset

Gene index dataset.

Gene-based annotation.

Source code in src/otg/dataset/gene_index.py
@dataclass\nclass GeneIndex(Dataset):\n    \"\"\"Gene index dataset.\n\n    Gene-based annotation.\n    \"\"\"\n\n    @classmethod\n    def get_schema(cls: type[GeneIndex]) -> StructType:\n        \"\"\"Provides the schema for the GeneIndex dataset.\n\n        Returns:\n            StructType: Schema for the GeneIndex dataset\n        \"\"\"\n        return parse_spark_schema(\"gene_index.json\")\n\n    def filter_by_biotypes(self: GeneIndex, biotypes: list) -> GeneIndex:\n        \"\"\"Filter by approved biotypes.\n\n        Args:\n            biotypes (list): List of Ensembl biotypes to keep.\n\n        Returns:\n            GeneIndex: Gene index dataset filtered by biotypes.\n        \"\"\"\n        self.df = self._df.filter(f.col(\"biotype\").isin(biotypes))\n        return self\n\n    def locations_lut(self: GeneIndex) -> DataFrame:\n        \"\"\"Gene location information.\n\n        Returns:\n            DataFrame: Gene LUT including genomic location information.\n        \"\"\"\n        return self.df.select(\n            \"geneId\",\n            \"chromosome\",\n            \"start\",\n            \"end\",\n            \"strand\",\n            \"tss\",\n        )\n\n    def symbols_lut(self: GeneIndex) -> DataFrame:\n        \"\"\"Gene symbol lookup table.\n\n        Pre-processess gene/target dataset to create lookup table of gene symbols, including\n        obsoleted gene symbols.\n\n        Returns:\n            DataFrame: Gene LUT for symbol mapping containing `geneId` and `geneSymbol` columns.\n        \"\"\"\n        return self.df.select(\n            f.explode(\n                f.array_union(f.array(\"approvedSymbol\"), f.col(\"obsoleteSymbols\"))\n            ).alias(\"geneSymbol\"),\n            \"*\",\n        )\n
"},{"location":"python_api/dataset/gene_index/#otg.dataset.gene_index.GeneIndex.filter_by_biotypes","title":"filter_by_biotypes(biotypes: list) -> GeneIndex","text":"

Filter by approved biotypes.

Parameters:

Name Type Description Default biotypes list

List of Ensembl biotypes to keep.

required

Returns:

Name Type Description GeneIndex GeneIndex

Gene index dataset filtered by biotypes.

Source code in src/otg/dataset/gene_index.py
def filter_by_biotypes(self: GeneIndex, biotypes: list) -> GeneIndex:\n    \"\"\"Filter by approved biotypes.\n\n    Args:\n        biotypes (list): List of Ensembl biotypes to keep.\n\n    Returns:\n        GeneIndex: Gene index dataset filtered by biotypes.\n    \"\"\"\n    self.df = self._df.filter(f.col(\"biotype\").isin(biotypes))\n    return self\n
"},{"location":"python_api/dataset/gene_index/#otg.dataset.gene_index.GeneIndex.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the GeneIndex dataset.

Returns:

Name Type Description StructType StructType

Schema for the GeneIndex dataset

Source code in src/otg/dataset/gene_index.py
@classmethod\ndef get_schema(cls: type[GeneIndex]) -> StructType:\n    \"\"\"Provides the schema for the GeneIndex dataset.\n\n    Returns:\n        StructType: Schema for the GeneIndex dataset\n    \"\"\"\n    return parse_spark_schema(\"gene_index.json\")\n
"},{"location":"python_api/dataset/gene_index/#otg.dataset.gene_index.GeneIndex.locations_lut","title":"locations_lut() -> DataFrame","text":"

Gene location information.

Returns:

Name Type Description DataFrame DataFrame

Gene LUT including genomic location information.

Source code in src/otg/dataset/gene_index.py
def locations_lut(self: GeneIndex) -> DataFrame:\n    \"\"\"Gene location information.\n\n    Returns:\n        DataFrame: Gene LUT including genomic location information.\n    \"\"\"\n    return self.df.select(\n        \"geneId\",\n        \"chromosome\",\n        \"start\",\n        \"end\",\n        \"strand\",\n        \"tss\",\n    )\n
"},{"location":"python_api/dataset/gene_index/#otg.dataset.gene_index.GeneIndex.symbols_lut","title":"symbols_lut() -> DataFrame","text":"

Gene symbol lookup table.

Pre-processes the gene/target dataset to create a lookup table of gene symbols, including obsolete gene symbols.

Returns:

Name Type Description DataFrame DataFrame

Gene LUT for symbol mapping containing geneId and geneSymbol columns.

Source code in src/otg/dataset/gene_index.py
def symbols_lut(self: GeneIndex) -> DataFrame:\n    \"\"\"Gene symbol lookup table.\n\n    Pre-processess gene/target dataset to create lookup table of gene symbols, including\n    obsoleted gene symbols.\n\n    Returns:\n        DataFrame: Gene LUT for symbol mapping containing `geneId` and `geneSymbol` columns.\n    \"\"\"\n    return self.df.select(\n        f.explode(\n            f.array_union(f.array(\"approvedSymbol\"), f.col(\"obsoleteSymbols\"))\n        ).alias(\"geneSymbol\"),\n        \"*\",\n    )\n
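As an illustrative sketch (gene_index stands for an already loaded GeneIndex; "protein_coding" is a standard Ensembl biotype used here only as an example):

# Illustrative only: keep protein-coding genes, then build the lookup tables.
protein_coding = gene_index.filter_by_biotypes(["protein_coding"])
locations = protein_coding.locations_lut()  # geneId, chromosome, start, end, strand, tss
symbols = protein_coding.symbols_lut()      # one row per symbol, including obsolete symbols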
"},{"location":"python_api/dataset/gene_index/#schema","title":"Schema","text":"
root\n |-- geneId: string (nullable = false)\n |-- chromosome: string (nullable = false)\n |-- approvedSymbol: string (nullable = true)\n |-- biotype: string (nullable = true)\n |-- approvedName: string (nullable = true)\n |-- obsoleteSymbols: array (nullable = true)\n |    |-- element: string (containsNull = true)\n |-- tss: long (nullable = true)\n |-- start: long (nullable = true)\n |-- end: long (nullable = true)\n |-- strand: integer (nullable = true)\n
"},{"location":"python_api/dataset/intervals/","title":"Intervals","text":""},{"location":"python_api/dataset/intervals/#otg.dataset.intervals.Intervals","title":"otg.dataset.intervals.Intervals dataclass","text":"

Bases: Dataset

Intervals dataset links genes to genomic regions based on genome interaction studies.

Source code in src/otg/dataset/intervals.py
@dataclass\nclass Intervals(Dataset):\n    \"\"\"Intervals dataset links genes to genomic regions based on genome interaction studies.\"\"\"\n\n    @classmethod\n    def get_schema(cls: type[Intervals]) -> StructType:\n        \"\"\"Provides the schema for the Intervals dataset.\n\n        Returns:\n            StructType: Schema for the Intervals dataset\n        \"\"\"\n        return parse_spark_schema(\"intervals.json\")\n\n    @classmethod\n    def from_source(\n        cls: type[Intervals],\n        spark: SparkSession,\n        source_name: str,\n        source_path: str,\n        gene_index: GeneIndex,\n        lift: LiftOverSpark,\n    ) -> Intervals:\n        \"\"\"Collect interval data for a particular source.\n\n        Args:\n            spark (SparkSession): Spark session\n            source_name (str): Name of the interval source\n            source_path (str): Path to the interval source file\n            gene_index (GeneIndex): Gene index\n            lift (LiftOverSpark): LiftOverSpark instance to convert coordinats from hg37 to hg38\n\n        Returns:\n            Intervals: Intervals dataset\n\n        Raises:\n            ValueError: If the source name is not recognised\n        \"\"\"\n        from otg.datasource.intervals.andersson import IntervalsAndersson\n        from otg.datasource.intervals.javierre import IntervalsJavierre\n        from otg.datasource.intervals.jung import IntervalsJung\n        from otg.datasource.intervals.thurman import IntervalsThurman\n\n        source_to_class = {\n            \"andersson\": IntervalsAndersson,\n            \"javierre\": IntervalsJavierre,\n            \"jung\": IntervalsJung,\n            \"thurman\": IntervalsThurman,\n        }\n\n        if source_name not in source_to_class:\n            raise ValueError(f\"Unknown interval source: {source_name}\")\n\n        source_class = source_to_class[source_name]\n        data = source_class.read(spark, source_path)\n        return source_class.parse(data, gene_index, lift)\n\n    def v2g(self: Intervals, variant_index: VariantIndex) -> V2G:\n        \"\"\"Convert intervals into V2G by intersecting with a variant index.\n\n        Args:\n            variant_index (VariantIndex): Variant index dataset\n\n        Returns:\n            V2G: Variant-to-gene evidence dataset\n        \"\"\"\n        return V2G(\n            _df=(\n                self.df.alias(\"interval\")\n                .join(\n                    variant_index.df.selectExpr(\n                        \"chromosome as vi_chromosome\", \"variantId\", \"position\"\n                    ).alias(\"vi\"),\n                    on=[\n                        f.col(\"vi.vi_chromosome\") == f.col(\"interval.chromosome\"),\n                        f.col(\"vi.position\").between(\n                            f.col(\"interval.start\"), f.col(\"interval.end\")\n                        ),\n                    ],\n                    how=\"inner\",\n                )\n                .drop(\"start\", \"end\", \"vi_chromosome\", \"position\")\n            ),\n            _schema=V2G.get_schema(),\n        )\n
"},{"location":"python_api/dataset/intervals/#otg.dataset.intervals.Intervals.from_source","title":"from_source(spark: SparkSession, source_name: str, source_path: str, gene_index: GeneIndex, lift: LiftOverSpark) -> Intervals classmethod","text":"

Collect interval data for a particular source.

Parameters:

Name Type Description Default spark SparkSession

Spark session

required source_name str

Name of the interval source

required source_path str

Path to the interval source file

required gene_index GeneIndex

Gene index

required lift LiftOverSpark

LiftOverSpark instance to convert coordinates from hg37 to hg38

required

Returns:

Name Type Description Intervals Intervals

Intervals dataset

Raises:

Type Description ValueError

If the source name is not recognised

Source code in src/otg/dataset/intervals.py
@classmethod\ndef from_source(\n    cls: type[Intervals],\n    spark: SparkSession,\n    source_name: str,\n    source_path: str,\n    gene_index: GeneIndex,\n    lift: LiftOverSpark,\n) -> Intervals:\n    \"\"\"Collect interval data for a particular source.\n\n    Args:\n        spark (SparkSession): Spark session\n        source_name (str): Name of the interval source\n        source_path (str): Path to the interval source file\n        gene_index (GeneIndex): Gene index\n        lift (LiftOverSpark): LiftOverSpark instance to convert coordinats from hg37 to hg38\n\n    Returns:\n        Intervals: Intervals dataset\n\n    Raises:\n        ValueError: If the source name is not recognised\n    \"\"\"\n    from otg.datasource.intervals.andersson import IntervalsAndersson\n    from otg.datasource.intervals.javierre import IntervalsJavierre\n    from otg.datasource.intervals.jung import IntervalsJung\n    from otg.datasource.intervals.thurman import IntervalsThurman\n\n    source_to_class = {\n        \"andersson\": IntervalsAndersson,\n        \"javierre\": IntervalsJavierre,\n        \"jung\": IntervalsJung,\n        \"thurman\": IntervalsThurman,\n    }\n\n    if source_name not in source_to_class:\n        raise ValueError(f\"Unknown interval source: {source_name}\")\n\n    source_class = source_to_class[source_name]\n    data = source_class.read(spark, source_path)\n    return source_class.parse(data, gene_index, lift)\n
"},{"location":"python_api/dataset/intervals/#otg.dataset.intervals.Intervals.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the Intervals dataset.

Returns:

- StructType: Schema for the Intervals dataset.

Source code in src/otg/dataset/intervals.py
@classmethod\ndef get_schema(cls: type[Intervals]) -> StructType:\n    \"\"\"Provides the schema for the Intervals dataset.\n\n    Returns:\n        StructType: Schema for the Intervals dataset\n    \"\"\"\n    return parse_spark_schema(\"intervals.json\")\n
"},{"location":"python_api/dataset/intervals/#otg.dataset.intervals.Intervals.v2g","title":"v2g(variant_index: VariantIndex) -> V2G","text":"

Convert intervals into V2G by intersecting with a variant index.

Parameters:

- variant_index (VariantIndex): Variant index dataset. Required.

Returns:

- V2G: Variant-to-gene evidence dataset.

Source code in src/otg/dataset/intervals.py
def v2g(self: Intervals, variant_index: VariantIndex) -> V2G:\n    \"\"\"Convert intervals into V2G by intersecting with a variant index.\n\n    Args:\n        variant_index (VariantIndex): Variant index dataset\n\n    Returns:\n        V2G: Variant-to-gene evidence dataset\n    \"\"\"\n    return V2G(\n        _df=(\n            self.df.alias(\"interval\")\n            .join(\n                variant_index.df.selectExpr(\n                    \"chromosome as vi_chromosome\", \"variantId\", \"position\"\n                ).alias(\"vi\"),\n                on=[\n                    f.col(\"vi.vi_chromosome\") == f.col(\"interval.chromosome\"),\n                    f.col(\"vi.position\").between(\n                        f.col(\"interval.start\"), f.col(\"interval.end\")\n                    ),\n                ],\n                how=\"inner\",\n            )\n            .drop(\"start\", \"end\", \"vi_chromosome\", \"position\")\n        ),\n        _schema=V2G.get_schema(),\n    )\n
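A short usage sketch of turning intervals into variant-to-gene evidence. It assumes an Intervals instance (for example, one built with from_source above) and a pre-built VariantIndex; both objects are assumptions here.

```python
# `intervals` is an Intervals instance and `variant_index` a VariantIndex
# built elsewhere in the pipeline.
v2g = intervals.v2g(variant_index)

# The result is a V2G dataset; its underlying Spark dataframe is exposed as .df
v2g.df.show(5, truncate=False)
```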
"},{"location":"python_api/dataset/intervals/#schema","title":"Schema","text":"
root\n |-- chromosome: string (nullable = false)\n |-- start: string (nullable = false)\n |-- end: string (nullable = false)\n |-- geneId: string (nullable = false)\n |-- resourceScore: double (nullable = true)\n |-- score: double (nullable = true)\n |-- datasourceId: string (nullable = false)\n |-- datatypeId: string (nullable = false)\n |-- pmid: string (nullable = true)\n |-- biofeature: string (nullable = true)\n
"},{"location":"python_api/dataset/ld_index/","title":"LD Index","text":""},{"location":"python_api/dataset/ld_index/#otg.dataset.ld_index.LDIndex","title":"otg.dataset.ld_index.LDIndex dataclass","text":"

Bases: Dataset

Dataset containing linkage disequilibrium information between variants.

Source code in src/otg/dataset/ld_index.py
@dataclass\nclass LDIndex(Dataset):\n    \"\"\"Dataset containing linkage desequilibrium information between variants.\"\"\"\n\n    @classmethod\n    def get_schema(cls: type[LDIndex]) -> StructType:\n        \"\"\"Provides the schema for the LDIndex dataset.\n\n        Returns:\n            StructType: Schema for the LDIndex dataset\n        \"\"\"\n        return parse_spark_schema(\"ld_index.json\")\n
"},{"location":"python_api/dataset/ld_index/#otg.dataset.ld_index.LDIndex.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the LDIndex dataset.

Returns:

- StructType: Schema for the LDIndex dataset.

Source code in src/otg/dataset/ld_index.py
@classmethod\ndef get_schema(cls: type[LDIndex]) -> StructType:\n    \"\"\"Provides the schema for the LDIndex dataset.\n\n    Returns:\n        StructType: Schema for the LDIndex dataset\n    \"\"\"\n    return parse_spark_schema(\"ld_index.json\")\n
"},{"location":"python_api/dataset/ld_index/#schema","title":"Schema","text":"
root\n |-- variantId: string (nullable = false)\n |-- chromosome: string (nullable = false)\n |-- ldSet: array (nullable = false)\n |    |-- element: struct (containsNull = false)\n |    |    |-- tagVariantId: string (nullable = false)\n |    |    |-- rValues: array (nullable = false)\n |    |    |    |-- element: struct (containsNull = false)\n |    |    |    |    |-- population: string (nullable = false)\n |    |    |    |    |-- r: double (nullable = false)\n
"},{"location":"python_api/dataset/study_index/","title":"Study Index","text":""},{"location":"python_api/dataset/study_index/#otg.dataset.study_index.StudyIndex","title":"otg.dataset.study_index.StudyIndex dataclass","text":"

Bases: Dataset

Study index dataset.

A study index dataset captures all the metadata for all studies including GWAS and Molecular QTL.

Source code in src/otg/dataset/study_index.py
@dataclass\nclass StudyIndex(Dataset):\n    \"\"\"Study index dataset.\n\n    A study index dataset captures all the metadata for all studies including GWAS and Molecular QTL.\n    \"\"\"\n\n    @staticmethod\n    def _aggregate_samples_by_ancestry(merged: Column, ancestry: Column) -> Column:\n        \"\"\"Aggregate sample counts by ancestry in a list of struct colmns.\n\n        Args:\n            merged (Column): A column representing merged data (list of structs).\n            ancestry (Column): The `ancestry` parameter is a column that represents the ancestry of each\n                sample. (a struct)\n\n        Returns:\n            Column: the modified \"merged\" column after aggregating the samples by ancestry.\n        \"\"\"\n        # Iterating over the list of ancestries and adding the sample size if label matches:\n        return f.transform(\n            merged,\n            lambda a: f.when(\n                a.ancestry == ancestry.ancestry,\n                f.struct(\n                    a.ancestry.alias(\"ancestry\"),\n                    (a.sampleSize + ancestry.sampleSize).alias(\"sampleSize\"),\n                ),\n            ).otherwise(a),\n        )\n\n    @staticmethod\n    def _map_ancestries_to_ld_population(gwas_ancestry_label: Column) -> Column:\n        \"\"\"Normalise ancestry column from GWAS studies into reference LD panel based on a pre-defined map.\n\n        This function assumes all possible ancestry categories have a corresponding\n        LD panel in the LD index. It is very important to have the ancestry labels\n        moved to the LD panel map.\n\n        Args:\n            gwas_ancestry_label (Column): A struct column with ancestry label like Finnish,\n                European, African etc. and the corresponding sample size.\n\n        Returns:\n            Column: Struct column with the mapped LD population label and the sample size.\n        \"\"\"\n        # Loading ancestry label to LD population label:\n        json_dict = json.loads(\n            pkg_resources.read_text(\n                data, \"gwas_population_2_LD_panel_map.json\", encoding=\"utf-8\"\n            )\n        )\n        map_expr = f.create_map(*[f.lit(x) for x in chain(*json_dict.items())])\n\n        return f.struct(\n            map_expr[gwas_ancestry_label.ancestry].alias(\"ancestry\"),\n            gwas_ancestry_label.sampleSize.alias(\"sampleSize\"),\n        )\n\n    @classmethod\n    def get_schema(cls: type[StudyIndex]) -> StructType:\n        \"\"\"Provide the schema for the StudyIndex dataset.\n\n        Returns:\n            StructType: The schema of the StudyIndex dataset.\n        \"\"\"\n        return parse_spark_schema(\"study_index.json\")\n\n    @classmethod\n    def aggregate_and_map_ancestries(\n        cls: type[StudyIndex], discovery_samples: Column\n    ) -> Column:\n        \"\"\"Map ancestries to populations in the LD reference and calculate relative sample size.\n\n        Args:\n            discovery_samples (Column): A list of struct column. 
Has an `ancestry` column and a `sampleSize` columns\n\n        Returns:\n            Column: A list of struct with mapped LD population and their relative sample size.\n        \"\"\"\n        # Map ancestry categories to population labels of the LD index:\n        mapped_ancestries = f.transform(\n            discovery_samples, cls._map_ancestries_to_ld_population\n        )\n\n        # Aggregate sample sizes belonging to the same LD population:\n        aggregated_counts = f.aggregate(\n            mapped_ancestries,\n            f.array_distinct(\n                f.transform(\n                    mapped_ancestries,\n                    lambda x: f.struct(\n                        x.ancestry.alias(\"ancestry\"), f.lit(0.0).alias(\"sampleSize\")\n                    ),\n                )\n            ),\n            cls._aggregate_samples_by_ancestry,\n        )\n        # Getting total sample count:\n        total_sample_count = f.aggregate(\n            aggregated_counts, f.lit(0.0), lambda total, pop: total + pop.sampleSize\n        ).alias(\"sampleSize\")\n\n        # Calculating relative sample size for each LD population:\n        return f.transform(\n            aggregated_counts,\n            lambda ld_population: f.struct(\n                ld_population.ancestry.alias(\"ldPopulation\"),\n                (ld_population.sampleSize / total_sample_count).alias(\n                    \"relativeSampleSize\"\n                ),\n            ),\n        )\n\n    def study_type_lut(self: StudyIndex) -> DataFrame:\n        \"\"\"Return a lookup table of study type.\n\n        Returns:\n            DataFrame: A dataframe containing `studyId` and `studyType` columns.\n        \"\"\"\n        return self.df.select(\"studyId\", \"studyType\")\n
"},{"location":"python_api/dataset/study_index/#otg.dataset.study_index.StudyIndex.aggregate_and_map_ancestries","title":"aggregate_and_map_ancestries(discovery_samples: Column) -> Column classmethod","text":"

Map ancestries to populations in the LD reference and calculate relative sample size.

Parameters:

- discovery_samples (Column): A list-of-structs column, each struct carrying an ancestry and a sampleSize field. Required.

Returns:

- Column: A list of structs with the mapped LD populations and their relative sample sizes.

Source code in src/otg/dataset/study_index.py
@classmethod\ndef aggregate_and_map_ancestries(\n    cls: type[StudyIndex], discovery_samples: Column\n) -> Column:\n    \"\"\"Map ancestries to populations in the LD reference and calculate relative sample size.\n\n    Args:\n        discovery_samples (Column): A list of struct column. Has an `ancestry` column and a `sampleSize` columns\n\n    Returns:\n        Column: A list of struct with mapped LD population and their relative sample size.\n    \"\"\"\n    # Map ancestry categories to population labels of the LD index:\n    mapped_ancestries = f.transform(\n        discovery_samples, cls._map_ancestries_to_ld_population\n    )\n\n    # Aggregate sample sizes belonging to the same LD population:\n    aggregated_counts = f.aggregate(\n        mapped_ancestries,\n        f.array_distinct(\n            f.transform(\n                mapped_ancestries,\n                lambda x: f.struct(\n                    x.ancestry.alias(\"ancestry\"), f.lit(0.0).alias(\"sampleSize\")\n                ),\n            )\n        ),\n        cls._aggregate_samples_by_ancestry,\n    )\n    # Getting total sample count:\n    total_sample_count = f.aggregate(\n        aggregated_counts, f.lit(0.0), lambda total, pop: total + pop.sampleSize\n    ).alias(\"sampleSize\")\n\n    # Calculating relative sample size for each LD population:\n    return f.transform(\n        aggregated_counts,\n        lambda ld_population: f.struct(\n            ld_population.ancestry.alias(\"ldPopulation\"),\n            (ld_population.sampleSize / total_sample_count).alias(\n                \"relativeSampleSize\"\n            ),\n        ),\n    )\n
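A sketch of applying the ancestry mapping to the discoverySamples column of a study dataframe. The dataframe variable study_df is an assumption; it only needs a discoverySamples column of structs with ancestry and sampleSize fields.

```python
import pyspark.sql.functions as f

from otg.dataset.study_index import StudyIndex

# `study_df` is assumed to be a Spark dataframe with a `discoverySamples`
# column of structs (ancestry, sampleSize), as described above.
study_df = study_df.withColumn(
    "ldPopulationStructure",
    StudyIndex.aggregate_and_map_ancestries(f.col("discoverySamples")),
)
```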
"},{"location":"python_api/dataset/study_index/#otg.dataset.study_index.StudyIndex.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provide the schema for the StudyIndex dataset.

Returns:

- StructType: The schema of the StudyIndex dataset.

Source code in src/otg/dataset/study_index.py
@classmethod\ndef get_schema(cls: type[StudyIndex]) -> StructType:\n    \"\"\"Provide the schema for the StudyIndex dataset.\n\n    Returns:\n        StructType: The schema of the StudyIndex dataset.\n    \"\"\"\n    return parse_spark_schema(\"study_index.json\")\n
"},{"location":"python_api/dataset/study_index/#otg.dataset.study_index.StudyIndex.study_type_lut","title":"study_type_lut() -> DataFrame","text":"

Return a lookup table of study type.

Returns:

- DataFrame: A dataframe containing studyId and studyType columns.

Source code in src/otg/dataset/study_index.py
def study_type_lut(self: StudyIndex) -> DataFrame:\n    \"\"\"Return a lookup table of study type.\n\n    Returns:\n        DataFrame: A dataframe containing `studyId` and `studyType` columns.\n    \"\"\"\n    return self.df.select(\"studyId\", \"studyType\")\n
"},{"location":"python_api/dataset/study_index/#schema","title":"Schema","text":"
root\n |-- studyId: string (nullable = false)\n |-- projectId: string (nullable = false)\n |-- studyType: string (nullable = false)\n |-- traitFromSource: string (nullable = false)\n |-- traitFromSourceMappedIds: array (nullable = true)\n |    |-- element: string (containsNull = true)\n |-- pubmedId: string (nullable = true)\n |-- publicationTitle: string (nullable = true)\n |-- publicationFirstAuthor: string (nullable = true)\n |-- publicationDate: string (nullable = true)\n |-- publicationJournal: string (nullable = true)\n |-- backgroundTraitFromSourceMappedIds: array (nullable = true)\n |    |-- element: string (containsNull = true)\n |-- initialSampleSize: string (nullable = true)\n |-- nCases: long (nullable = true)\n |-- nControls: long (nullable = true)\n |-- nSamples: long (nullable = true)\n |-- ldPopulationStructure: array (nullable = true)\n |    |-- element: struct (containsNull = false)\n |    |    |-- ldPopulation: string (nullable = true)\n |    |    |-- relativeSampleSize: double (nullable = true)\n |-- discoverySamples: array (nullable = true)\n |    |-- element: struct (containsNull = false)\n |    |    |-- sampleSize: long (nullable = true)\n |    |    |-- ancestry: string (nullable = true)\n |-- replicationSamples: array (nullable = true)\n |    |-- element: struct (containsNull = false)\n |    |    |-- sampleSize: long (nullable = true)\n |    |    |-- ancestry: string (nullable = true)\n |-- summarystatsLocation: string (nullable = true)\n |-- hasSumstats: boolean (nullable = true)\n
"},{"location":"python_api/dataset/study_locus/","title":"Study Locus","text":""},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocus","title":"otg.dataset.study_locus.StudyLocus dataclass","text":"

Bases: Dataset

Study-Locus dataset.

This dataset captures associations between studies/traits and genetic loci as provided by fine-mapping methods.

Source code in src/otg/dataset/study_locus.py
@dataclass\nclass StudyLocus(Dataset):\n    \"\"\"Study-Locus dataset.\n\n    This dataset captures associations between study/traits and a genetic loci as provided by finemapping methods.\n    \"\"\"\n\n    @staticmethod\n    def _overlapping_peaks(credset_to_overlap: DataFrame) -> DataFrame:\n        \"\"\"Calculate overlapping signals (study-locus) between GWAS-GWAS and GWAS-Molecular trait.\n\n        Args:\n            credset_to_overlap (DataFrame): DataFrame containing at least `studyLocusId`, `studyType`, `chromosome` and `tagVariantId` columns.\n\n        Returns:\n            DataFrame: containing `leftStudyLocusId`, `rightStudyLocusId` and `chromosome` columns.\n        \"\"\"\n        # Reduce columns to the minimum to reduce the size of the dataframe\n        credset_to_overlap = credset_to_overlap.select(\n            \"studyLocusId\", \"studyType\", \"chromosome\", \"tagVariantId\"\n        )\n        return (\n            credset_to_overlap.alias(\"left\")\n            .filter(f.col(\"studyType\") == \"gwas\")\n            # Self join with complex condition. Left it's all gwas and right can be gwas or molecular trait\n            .join(\n                credset_to_overlap.alias(\"right\"),\n                on=[\n                    f.col(\"left.chromosome\") == f.col(\"right.chromosome\"),\n                    f.col(\"left.tagVariantId\") == f.col(\"right.tagVariantId\"),\n                    (f.col(\"right.studyType\") != \"gwas\")\n                    | (f.col(\"left.studyLocusId\") > f.col(\"right.studyLocusId\")),\n                ],\n                how=\"inner\",\n            )\n            .select(\n                f.col(\"left.studyLocusId\").alias(\"leftStudyLocusId\"),\n                f.col(\"right.studyLocusId\").alias(\"rightStudyLocusId\"),\n                f.col(\"left.chromosome\").alias(\"chromosome\"),\n            )\n            .distinct()\n            .repartition(\"chromosome\")\n            .persist()\n        )\n\n    @staticmethod\n    def _align_overlapping_tags(\n        loci_to_overlap: DataFrame, peak_overlaps: DataFrame\n    ) -> StudyLocusOverlap:\n        \"\"\"Align overlapping tags in pairs of overlapping study-locus, keeping all tags in both loci.\n\n        Args:\n            loci_to_overlap (DataFrame): containing `studyLocusId`, `studyType`, `chromosome`, `tagVariantId`, `logABF` and `posteriorProbability` columns.\n            peak_overlaps (DataFrame): containing `left_studyLocusId`, `right_studyLocusId` and `chromosome` columns.\n\n        Returns:\n            StudyLocusOverlap: Pairs of overlapping study-locus with aligned tags.\n        \"\"\"\n        # Complete information about all tags in the left study-locus of the overlap\n        stats_cols = [\n            \"logABF\",\n            \"posteriorProbability\",\n            \"beta\",\n            \"pValueMantissa\",\n            \"pValueExponent\",\n        ]\n        overlapping_left = loci_to_overlap.select(\n            f.col(\"chromosome\"),\n            f.col(\"tagVariantId\"),\n            f.col(\"studyLocusId\").alias(\"leftStudyLocusId\"),\n            *[f.col(col).alias(f\"left_{col}\") for col in stats_cols],\n        ).join(peak_overlaps, on=[\"chromosome\", \"leftStudyLocusId\"], how=\"inner\")\n\n        # Complete information about all tags in the right study-locus of the overlap\n        overlapping_right = loci_to_overlap.select(\n            f.col(\"chromosome\"),\n            f.col(\"tagVariantId\"),\n            
f.col(\"studyLocusId\").alias(\"rightStudyLocusId\"),\n            *[f.col(col).alias(f\"right_{col}\") for col in stats_cols],\n        ).join(peak_overlaps, on=[\"chromosome\", \"rightStudyLocusId\"], how=\"inner\")\n\n        # Include information about all tag variants in both study-locus aligned by tag variant id\n        overlaps = overlapping_left.join(\n            overlapping_right,\n            on=[\n                \"chromosome\",\n                \"rightStudyLocusId\",\n                \"leftStudyLocusId\",\n                \"tagVariantId\",\n            ],\n            how=\"outer\",\n        ).select(\n            \"leftStudyLocusId\",\n            \"rightStudyLocusId\",\n            \"chromosome\",\n            \"tagVariantId\",\n            f.struct(\n                *[f\"left_{e}\" for e in stats_cols] + [f\"right_{e}\" for e in stats_cols]\n            ).alias(\"statistics\"),\n        )\n        return StudyLocusOverlap(\n            _df=overlaps,\n            _schema=StudyLocusOverlap.get_schema(),\n        )\n\n    @staticmethod\n    def _update_quality_flag(\n        qc: Column, flag_condition: Column, flag_text: StudyLocusQualityCheck\n    ) -> Column:\n        \"\"\"Update the provided quality control list with a new flag if condition is met.\n\n        Args:\n            qc (Column): Array column with the current list of qc flags.\n            flag_condition (Column): This is a column of booleans, signing which row should be flagged\n            flag_text (StudyLocusQualityCheck): Text for the new quality control flag\n\n        Returns:\n            Column: Array column with the updated list of qc flags.\n        \"\"\"\n        qc = f.when(qc.isNull(), f.array()).otherwise(qc)\n        return f.when(\n            flag_condition,\n            f.array_union(qc, f.array(f.lit(flag_text.value))),\n        ).otherwise(qc)\n\n    @staticmethod\n    def assign_study_locus_id(study_id_col: Column, variant_id_col: Column) -> Column:\n        \"\"\"Hashes a column with a variant ID and a study ID to extract a consistent studyLocusId.\n\n        Args:\n            study_id_col (Column): column name with a study ID\n            variant_id_col (Column): column name with a variant ID\n\n        Returns:\n            Column: column with a study locus ID\n\n        Examples:\n            >>> df = spark.createDataFrame([(\"GCST000001\", \"1_1000_A_C\"), (\"GCST000002\", \"1_1000_A_C\")]).toDF(\"studyId\", \"variantId\")\n            >>> df.withColumn(\"study_locus_id\", StudyLocus.assign_study_locus_id(*[f.col(\"variantId\"), f.col(\"studyId\")])).show()\n            +----------+----------+--------------------+\n            |   studyId| variantId|      study_locus_id|\n            +----------+----------+--------------------+\n            |GCST000001|1_1000_A_C| 7437284926964690765|\n            |GCST000002|1_1000_A_C|-7653912547667845377|\n            +----------+----------+--------------------+\n            <BLANKLINE>\n        \"\"\"\n        return f.xxhash64(*[study_id_col, variant_id_col]).alias(\"studyLocusId\")\n\n    @classmethod\n    def get_schema(cls: type[StudyLocus]) -> StructType:\n        \"\"\"Provides the schema for the StudyLocus dataset.\n\n        Returns:\n            StructType: schema for the StudyLocus dataset.\n        \"\"\"\n        return parse_spark_schema(\"study_locus.json\")\n\n    def filter_credible_set(\n        self: StudyLocus,\n        credible_interval: CredibleInterval,\n    ) -> StudyLocus:\n        \"\"\"Filter study-locus tag variants 
based on given credible interval.\n\n        Args:\n            credible_interval (CredibleInterval): Credible interval to filter for.\n\n        Returns:\n            StudyLocus: Filtered study-locus dataset.\n        \"\"\"\n        self.df = self._df.withColumn(\n            \"locus\",\n            f.expr(f\"filter(locus, tag -> (tag.{credible_interval.value}))\"),\n        )\n        return self\n\n    def find_overlaps(self: StudyLocus, study_index: StudyIndex) -> StudyLocusOverlap:\n        \"\"\"Calculate overlapping study-locus.\n\n        Find overlapping study-locus that share at least one tagging variant. All GWAS-GWAS and all GWAS-Molecular traits are computed with the Molecular traits always\n        appearing on the right side.\n\n        Args:\n            study_index (StudyIndex): Study index to resolve study types.\n\n        Returns:\n            StudyLocusOverlap: Pairs of overlapping study-locus with aligned tags.\n        \"\"\"\n        loci_to_overlap = (\n            self.df.join(study_index.study_type_lut(), on=\"studyId\", how=\"inner\")\n            .withColumn(\"locus\", f.explode(\"locus\"))\n            .select(\n                \"studyLocusId\",\n                \"studyType\",\n                \"chromosome\",\n                f.col(\"locus.variantId\").alias(\"tagVariantId\"),\n                f.col(\"locus.logABF\").alias(\"logABF\"),\n                f.col(\"locus.posteriorProbability\").alias(\"posteriorProbability\"),\n                f.col(\"locus.pValueMantissa\").alias(\"pValueMantissa\"),\n                f.col(\"locus.pValueExponent\").alias(\"pValueExponent\"),\n                f.col(\"locus.beta\").alias(\"beta\"),\n            )\n            .persist()\n        )\n\n        # overlapping study-locus\n        peak_overlaps = self._overlapping_peaks(loci_to_overlap)\n\n        # study-locus overlap by aligning overlapping variants\n        return self._align_overlapping_tags(loci_to_overlap, peak_overlaps)\n\n    def unique_variants_in_locus(self: StudyLocus) -> DataFrame:\n        \"\"\"All unique variants collected in a `StudyLocus` dataframe.\n\n        Returns:\n            DataFrame: A dataframe containing `variantId` and `chromosome` columns.\n        \"\"\"\n        return (\n            self.df.withColumn(\n                \"variantId\",\n                # Joint array of variants in that studylocus. 
Locus can be null\n                f.explode(\n                    f.array_union(\n                        f.array(f.col(\"variantId\")),\n                        f.coalesce(f.col(\"locus.variantId\"), f.array()),\n                    )\n                ),\n            )\n            .select(\n                \"variantId\", f.split(f.col(\"variantId\"), \"_\")[0].alias(\"chromosome\")\n            )\n            .distinct()\n        )\n\n    def neglog_pvalue(self: StudyLocus) -> Column:\n        \"\"\"Returns the negative log p-value.\n\n        Returns:\n            Column: Negative log p-value\n        \"\"\"\n        return calculate_neglog_pvalue(\n            self.df.pValueMantissa,\n            self.df.pValueExponent,\n        )\n\n    def annotate_credible_sets(self: StudyLocus) -> StudyLocus:\n        \"\"\"Annotate study-locus dataset with credible set flags.\n\n        Sorts the array in the `locus` column elements by their `posteriorProbability` values in descending order and adds\n        `is95CredibleSet` and `is99CredibleSet` fields to the elements, indicating which are the tagging variants whose cumulative sum\n        of their `posteriorProbability` values is below 0.95 and 0.99, respectively.\n\n        Returns:\n            StudyLocus: including annotation on `is95CredibleSet` and `is99CredibleSet`.\n\n        Raises:\n            ValueError: If `locus` column is not available.\n        \"\"\"\n        if \"locus\" not in self.df.columns:\n            raise ValueError(\"Locus column not available.\")\n\n        self.df = self.df.withColumn(\n            # Sort credible set by posterior probability in descending order\n            \"locus\",\n            f.when(\n                f.col(\"locus\").isNotNull() & (f.size(f.col(\"locus\")) > 0),\n                order_array_of_structs_by_field(\"locus\", \"posteriorProbability\"),\n            ),\n        ).withColumn(\n            # Calculate array of cumulative sums of posterior probabilities to determine which variants are in the 95% and 99% credible sets\n            # and zip the cumulative sums array with the credible set array to add the flags\n            \"locus\",\n            f.when(\n                f.col(\"locus\").isNotNull() & (f.size(f.col(\"locus\")) > 0),\n                f.zip_with(\n                    f.col(\"locus\"),\n                    f.transform(\n                        f.sequence(f.lit(1), f.size(f.col(\"locus\"))),\n                        lambda index: f.aggregate(\n                            f.slice(\n                                # By using `index - 1` we introduce a value of `0.0` in the cumulative sums array. 
to ensure that the last variant\n                                # that exceeds the 0.95 threshold is included in the cumulative sum, as its probability is necessary to satisfy the threshold.\n                                f.col(\"locus.posteriorProbability\"),\n                                1,\n                                index - 1,\n                            ),\n                            f.lit(0.0),\n                            lambda acc, el: acc + el,\n                        ),\n                    ),\n                    lambda struct_e, acc: struct_e.withField(\n                        CredibleInterval.IS95.value, (acc < 0.95) & acc.isNotNull()\n                    ).withField(\n                        CredibleInterval.IS99.value, (acc < 0.99) & acc.isNotNull()\n                    ),\n                ),\n            ),\n        )\n        return self\n\n    def clump(self: StudyLocus) -> StudyLocus:\n        \"\"\"Perform LD clumping of the studyLocus.\n\n        Evaluates whether a lead variant is linked to a tag (with lowest p-value) in the same studyLocus dataset.\n\n        Returns:\n            StudyLocus: with empty credible sets for linked variants and QC flag.\n        \"\"\"\n        self.df = (\n            self.df.withColumn(\n                \"is_lead_linked\",\n                LDclumping._is_lead_linked(\n                    self.df.studyId,\n                    self.df.variantId,\n                    self.df.pValueExponent,\n                    self.df.pValueMantissa,\n                    self.df.ldSet,\n                ),\n            )\n            .withColumn(\n                \"ldSet\",\n                f.when(f.col(\"is_lead_linked\"), f.array()).otherwise(f.col(\"ldSet\")),\n            )\n            .withColumn(\n                \"qualityControls\",\n                StudyLocus._update_quality_flag(\n                    f.col(\"qualityControls\"),\n                    f.col(\"is_lead_linked\"),\n                    StudyLocusQualityCheck.LD_CLUMPED,\n                ),\n            )\n            .drop(\"is_lead_linked\")\n        )\n        return self\n\n    def _qc_unresolved_ld(\n        self: StudyLocus,\n    ) -> StudyLocus:\n        \"\"\"Flag associations with variants that are not found in the LD reference.\n\n        Returns:\n            StudyLocus: Updated study locus.\n        \"\"\"\n        self.df = self.df.withColumn(\n            \"qualityControls\",\n            self._update_quality_flag(\n                f.col(\"qualityControls\"),\n                f.col(\"ldSet\").isNull(),\n                StudyLocusQualityCheck.UNRESOLVED_LD,\n            ),\n        )\n        return self\n\n    def _qc_no_population(self: StudyLocus) -> StudyLocus:\n        \"\"\"Flag associations where the study doesn't have population information to resolve LD.\n\n        Returns:\n            StudyLocus: Updated study locus.\n        \"\"\"\n        # If the tested column is not present, return self unchanged:\n        if \"ldPopulationStructure\" not in self.df.columns:\n            return self\n\n        self.df = self.df.withColumn(\n            \"qualityControls\",\n            self._update_quality_flag(\n                f.col(\"qualityControls\"),\n                f.col(\"ldPopulationStructure\").isNull(),\n                StudyLocusQualityCheck.NO_POPULATION,\n            ),\n        )\n        return self\n
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocus.annotate_credible_sets","title":"annotate_credible_sets() -> StudyLocus","text":"

Annotate study-locus dataset with credible set flags.

Sorts the elements of the locus array column by their posteriorProbability values in descending order and adds is95CredibleSet and is99CredibleSet fields to each element, indicating which tagging variants have a cumulative posteriorProbability sum below 0.95 and 0.99, respectively.

Returns:

- StudyLocus: including annotation on is95CredibleSet and is99CredibleSet.

Raises:

- ValueError: If the locus column is not available.

Source code in src/otg/dataset/study_locus.py
def annotate_credible_sets(self: StudyLocus) -> StudyLocus:\n    \"\"\"Annotate study-locus dataset with credible set flags.\n\n    Sorts the array in the `locus` column elements by their `posteriorProbability` values in descending order and adds\n    `is95CredibleSet` and `is99CredibleSet` fields to the elements, indicating which are the tagging variants whose cumulative sum\n    of their `posteriorProbability` values is below 0.95 and 0.99, respectively.\n\n    Returns:\n        StudyLocus: including annotation on `is95CredibleSet` and `is99CredibleSet`.\n\n    Raises:\n        ValueError: If `locus` column is not available.\n    \"\"\"\n    if \"locus\" not in self.df.columns:\n        raise ValueError(\"Locus column not available.\")\n\n    self.df = self.df.withColumn(\n        # Sort credible set by posterior probability in descending order\n        \"locus\",\n        f.when(\n            f.col(\"locus\").isNotNull() & (f.size(f.col(\"locus\")) > 0),\n            order_array_of_structs_by_field(\"locus\", \"posteriorProbability\"),\n        ),\n    ).withColumn(\n        # Calculate array of cumulative sums of posterior probabilities to determine which variants are in the 95% and 99% credible sets\n        # and zip the cumulative sums array with the credible set array to add the flags\n        \"locus\",\n        f.when(\n            f.col(\"locus\").isNotNull() & (f.size(f.col(\"locus\")) > 0),\n            f.zip_with(\n                f.col(\"locus\"),\n                f.transform(\n                    f.sequence(f.lit(1), f.size(f.col(\"locus\"))),\n                    lambda index: f.aggregate(\n                        f.slice(\n                            # By using `index - 1` we introduce a value of `0.0` in the cumulative sums array. to ensure that the last variant\n                            # that exceeds the 0.95 threshold is included in the cumulative sum, as its probability is necessary to satisfy the threshold.\n                            f.col(\"locus.posteriorProbability\"),\n                            1,\n                            index - 1,\n                        ),\n                        f.lit(0.0),\n                        lambda acc, el: acc + el,\n                    ),\n                ),\n                lambda struct_e, acc: struct_e.withField(\n                    CredibleInterval.IS95.value, (acc < 0.95) & acc.isNotNull()\n                ).withField(\n                    CredibleInterval.IS99.value, (acc < 0.99) & acc.isNotNull()\n                ),\n            ),\n        ),\n    )\n    return self\n
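The cumulative-sum logic can be illustrated with a small, self-contained example using toy posterior probabilities (not real data). Each tag is flagged as part of the 95% credible set while the running sum of the preceding tags is still below 0.95, mirroring the index - 1 slice in the code above.

```python
# Toy posterior probabilities, already sorted in descending order:
probs = [0.60, 0.25, 0.10, 0.04, 0.01]

cumulative = 0.0
is95 = []
for p in probs:
    # The flag uses the sum of the *preceding* elements, so the tag that
    # pushes the total past 0.95 is still included in the credible set.
    is95.append(cumulative < 0.95)
    cumulative += p

print(is95)  # [True, True, True, False, False]
```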
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocus.assign_study_locus_id","title":"assign_study_locus_id(study_id_col: Column, variant_id_col: Column) -> Column staticmethod","text":"

Hashes a column with a variant ID and a study ID to extract a consistent studyLocusId.

Parameters:

- study_id_col (Column): column with a study ID. Required.
- variant_id_col (Column): column with a variant ID. Required.

Returns:

- Column: column with a study locus ID.

Examples:

>>> df = spark.createDataFrame([(\"GCST000001\", \"1_1000_A_C\"), (\"GCST000002\", \"1_1000_A_C\")]).toDF(\"studyId\", \"variantId\")\n>>> df.withColumn(\"study_locus_id\", StudyLocus.assign_study_locus_id(*[f.col(\"variantId\"), f.col(\"studyId\")])).show()\n+----------+----------+--------------------+\n|   studyId| variantId|      study_locus_id|\n+----------+----------+--------------------+\n|GCST000001|1_1000_A_C| 7437284926964690765|\n|GCST000002|1_1000_A_C|-7653912547667845377|\n+----------+----------+--------------------+\n
Source code in src/otg/dataset/study_locus.py
@staticmethod\ndef assign_study_locus_id(study_id_col: Column, variant_id_col: Column) -> Column:\n    \"\"\"Hashes a column with a variant ID and a study ID to extract a consistent studyLocusId.\n\n    Args:\n        study_id_col (Column): column name with a study ID\n        variant_id_col (Column): column name with a variant ID\n\n    Returns:\n        Column: column with a study locus ID\n\n    Examples:\n        >>> df = spark.createDataFrame([(\"GCST000001\", \"1_1000_A_C\"), (\"GCST000002\", \"1_1000_A_C\")]).toDF(\"studyId\", \"variantId\")\n        >>> df.withColumn(\"study_locus_id\", StudyLocus.assign_study_locus_id(*[f.col(\"variantId\"), f.col(\"studyId\")])).show()\n        +----------+----------+--------------------+\n        |   studyId| variantId|      study_locus_id|\n        +----------+----------+--------------------+\n        |GCST000001|1_1000_A_C| 7437284926964690765|\n        |GCST000002|1_1000_A_C|-7653912547667845377|\n        +----------+----------+--------------------+\n        <BLANKLINE>\n    \"\"\"\n    return f.xxhash64(*[study_id_col, variant_id_col]).alias(\"studyLocusId\")\n
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocus.clump","title":"clump() -> StudyLocus","text":"

Perform LD clumping of the studyLocus.

Evaluates whether a lead variant is linked to a tag (with lowest p-value) in the same studyLocus dataset.

Returns:

- StudyLocus: with empty credible sets for linked variants and QC flag.

Source code in src/otg/dataset/study_locus.py
def clump(self: StudyLocus) -> StudyLocus:\n    \"\"\"Perform LD clumping of the studyLocus.\n\n    Evaluates whether a lead variant is linked to a tag (with lowest p-value) in the same studyLocus dataset.\n\n    Returns:\n        StudyLocus: with empty credible sets for linked variants and QC flag.\n    \"\"\"\n    self.df = (\n        self.df.withColumn(\n            \"is_lead_linked\",\n            LDclumping._is_lead_linked(\n                self.df.studyId,\n                self.df.variantId,\n                self.df.pValueExponent,\n                self.df.pValueMantissa,\n                self.df.ldSet,\n            ),\n        )\n        .withColumn(\n            \"ldSet\",\n            f.when(f.col(\"is_lead_linked\"), f.array()).otherwise(f.col(\"ldSet\")),\n        )\n        .withColumn(\n            \"qualityControls\",\n            StudyLocus._update_quality_flag(\n                f.col(\"qualityControls\"),\n                f.col(\"is_lead_linked\"),\n                StudyLocusQualityCheck.LD_CLUMPED,\n            ),\n        )\n        .drop(\"is_lead_linked\")\n    )\n    return self\n
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocus.filter_credible_set","title":"filter_credible_set(credible_interval: CredibleInterval) -> StudyLocus","text":"

Filter study-locus tag variants based on given credible interval.

Parameters:

- credible_interval (CredibleInterval): Credible interval to filter for. Required.

Returns:

- StudyLocus: Filtered study-locus dataset.

Source code in src/otg/dataset/study_locus.py
def filter_credible_set(\n    self: StudyLocus,\n    credible_interval: CredibleInterval,\n) -> StudyLocus:\n    \"\"\"Filter study-locus tag variants based on given credible interval.\n\n    Args:\n        credible_interval (CredibleInterval): Credible interval to filter for.\n\n    Returns:\n        StudyLocus: Filtered study-locus dataset.\n    \"\"\"\n    self.df = self._df.withColumn(\n        \"locus\",\n        f.expr(f\"filter(locus, tag -> (tag.{credible_interval.value}))\"),\n    )\n    return self\n
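A one-line usage sketch, assuming a StudyLocus instance called study_locus built earlier in the pipeline.

```python
from otg.dataset.study_locus import CredibleInterval

# Keep only tag variants flagged as part of the 95% credible set.
study_locus_95 = study_locus.filter_credible_set(CredibleInterval.IS95)
```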
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocus.find_overlaps","title":"find_overlaps(study_index: StudyIndex) -> StudyLocusOverlap","text":"

Calculate overlapping study-locus.

Find overlapping study-locus pairs that share at least one tagging variant. All GWAS-GWAS and GWAS-molecular trait overlaps are computed, with the molecular traits always appearing on the right side.

Parameters:

- study_index (StudyIndex): Study index to resolve study types. Required.

Returns:

- StudyLocusOverlap: Pairs of overlapping study-locus with aligned tags.

Source code in src/otg/dataset/study_locus.py
def find_overlaps(self: StudyLocus, study_index: StudyIndex) -> StudyLocusOverlap:\n    \"\"\"Calculate overlapping study-locus.\n\n    Find overlapping study-locus that share at least one tagging variant. All GWAS-GWAS and all GWAS-Molecular traits are computed with the Molecular traits always\n    appearing on the right side.\n\n    Args:\n        study_index (StudyIndex): Study index to resolve study types.\n\n    Returns:\n        StudyLocusOverlap: Pairs of overlapping study-locus with aligned tags.\n    \"\"\"\n    loci_to_overlap = (\n        self.df.join(study_index.study_type_lut(), on=\"studyId\", how=\"inner\")\n        .withColumn(\"locus\", f.explode(\"locus\"))\n        .select(\n            \"studyLocusId\",\n            \"studyType\",\n            \"chromosome\",\n            f.col(\"locus.variantId\").alias(\"tagVariantId\"),\n            f.col(\"locus.logABF\").alias(\"logABF\"),\n            f.col(\"locus.posteriorProbability\").alias(\"posteriorProbability\"),\n            f.col(\"locus.pValueMantissa\").alias(\"pValueMantissa\"),\n            f.col(\"locus.pValueExponent\").alias(\"pValueExponent\"),\n            f.col(\"locus.beta\").alias(\"beta\"),\n        )\n        .persist()\n    )\n\n    # overlapping study-locus\n    peak_overlaps = self._overlapping_peaks(loci_to_overlap)\n\n    # study-locus overlap by aligning overlapping variants\n    return self._align_overlapping_tags(loci_to_overlap, peak_overlaps)\n
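A usage sketch, assuming study_locus (StudyLocus) and study_index (StudyIndex) instances produced by earlier pipeline steps.

```python
# Overlaps between GWAS-GWAS and GWAS-molecular trait signals.
overlaps = study_locus.find_overlaps(study_index)

# Column names follow the StudyLocusOverlap schema shown below.
overlaps.df.select("leftStudyLocusId", "rightStudyLocusId", "chromosome").show(5)
```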
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocus.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the StudyLocus dataset.

Returns:

- StructType: schema for the StudyLocus dataset.

Source code in src/otg/dataset/study_locus.py
@classmethod\ndef get_schema(cls: type[StudyLocus]) -> StructType:\n    \"\"\"Provides the schema for the StudyLocus dataset.\n\n    Returns:\n        StructType: schema for the StudyLocus dataset.\n    \"\"\"\n    return parse_spark_schema(\"study_locus.json\")\n
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocus.neglog_pvalue","title":"neglog_pvalue() -> Column","text":"

Returns the negative log p-value.

Returns:

- Column: Negative log p-value.

Source code in src/otg/dataset/study_locus.py
def neglog_pvalue(self: StudyLocus) -> Column:\n    \"\"\"Returns the negative log p-value.\n\n    Returns:\n        Column: Negative log p-value\n    \"\"\"\n    return calculate_neglog_pvalue(\n        self.df.pValueMantissa,\n        self.df.pValueExponent,\n    )\n
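Because the p-value is stored as a mantissa/exponent pair, the negative log can be computed as -log10(mantissa) - exponent. A self-contained numeric check of that identity:

```python
import math

# p = mantissa * 10 ** exponent, e.g. 3.2e-12
mantissa, exponent = 3.2, -12
neglog_p = -math.log10(mantissa) - exponent
print(round(neglog_p, 3))  # 11.495
```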
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocus.unique_variants_in_locus","title":"unique_variants_in_locus() -> DataFrame","text":"

All unique variants collected in a StudyLocus dataframe.

Returns:

- DataFrame: A dataframe containing variantId and chromosome columns.

Source code in src/otg/dataset/study_locus.py
def unique_variants_in_locus(self: StudyLocus) -> DataFrame:\n    \"\"\"All unique variants collected in a `StudyLocus` dataframe.\n\n    Returns:\n        DataFrame: A dataframe containing `variantId` and `chromosome` columns.\n    \"\"\"\n    return (\n        self.df.withColumn(\n            \"variantId\",\n            # Joint array of variants in that studylocus. Locus can be null\n            f.explode(\n                f.array_union(\n                    f.array(f.col(\"variantId\")),\n                    f.coalesce(f.col(\"locus.variantId\"), f.array()),\n                )\n            ),\n        )\n        .select(\n            \"variantId\", f.split(f.col(\"variantId\"), \"_\")[0].alias(\"chromosome\")\n        )\n        .distinct()\n    )\n
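The chromosome is derived from the variant identifier itself, which follows the chromosome_position_ref_alt convention used elsewhere in these docs (e.g. 1_1000_A_C). A tiny self-contained illustration of the split:

```python
variant_id = "1_1000_A_C"              # chromosome_position_ref_alt
chromosome = variant_id.split("_")[0]  # mirrors f.split(col("variantId"), "_")[0]
print(chromosome)                      # 1
```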
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocusQualityCheck","title":"otg.dataset.study_locus.StudyLocusQualityCheck","text":"

Bases: Enum

Study-Locus quality control options listing concerns about the quality of the association.

Attributes:

- SUBSIGNIFICANT_FLAG (str): p-value below significance threshold
- NO_GENOMIC_LOCATION_FLAG (str): Incomplete genomic mapping
- COMPOSITE_FLAG (str): Composite association due to variant x variant interactions
- VARIANT_INCONSISTENCY_FLAG (str): Inconsistencies in the reported variants
- NON_MAPPED_VARIANT_FLAG (str): Variant not mapped to gnomAD
- PALINDROMIC_ALLELE_FLAG (str): Alleles are palindromic - cannot harmonize
- AMBIGUOUS_STUDY (str): Association with ambiguous study
- UNRESOLVED_LD (str): Variant not found in LD reference
- LD_CLUMPED (str): Explained by a more significant variant in high LD (clumped)

Source code in src/otg/dataset/study_locus.py
class StudyLocusQualityCheck(Enum):\n    \"\"\"Study-Locus quality control options listing concerns on the quality of the association.\n\n    Attributes:\n        SUBSIGNIFICANT_FLAG (str): p-value below significance threshold\n        NO_GENOMIC_LOCATION_FLAG (str): Incomplete genomic mapping\n        COMPOSITE_FLAG (str): Composite association due to variant x variant interactions\n        VARIANT_INCONSISTENCY_FLAG (str): Inconsistencies in the reported variants\n        NON_MAPPED_VARIANT_FLAG (str): Variant not mapped to GnomAd\n        PALINDROMIC_ALLELE_FLAG (str): Alleles are palindromic - cannot harmonize\n        AMBIGUOUS_STUDY (str): Association with ambiguous study\n        UNRESOLVED_LD (str): Variant not found in LD reference\n        LD_CLUMPED (str): Explained by a more significant variant in high LD (clumped)\n    \"\"\"\n\n    SUBSIGNIFICANT_FLAG = \"Subsignificant p-value\"\n    NO_GENOMIC_LOCATION_FLAG = \"Incomplete genomic mapping\"\n    COMPOSITE_FLAG = \"Composite association\"\n    INCONSISTENCY_FLAG = \"Variant inconsistency\"\n    NON_MAPPED_VARIANT_FLAG = \"No mapping in GnomAd\"\n    PALINDROMIC_ALLELE_FLAG = \"Palindrome alleles - cannot harmonize\"\n    AMBIGUOUS_STUDY = \"Association with ambiguous study\"\n    UNRESOLVED_LD = \"Variant not found in LD reference\"\n    LD_CLUMPED = \"Explained by a more significant variant in high LD (clumped)\"\n    NO_POPULATION = \"Study does not have population annotation to resolve LD\"\n
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.CredibleInterval","title":"otg.dataset.study_locus.CredibleInterval","text":"

Bases: Enum

Credible interval enum.

Interval within which an unobserved parameter value falls with a particular probability.

Attributes:

- IS95 (str): 95% credible interval
- IS99 (str): 99% credible interval

Source code in src/otg/dataset/study_locus.py
class CredibleInterval(Enum):\n    \"\"\"Credible interval enum.\n\n    Interval within which an unobserved parameter value falls with a particular probability.\n\n    Attributes:\n        IS95 (str): 95% credible interval\n        IS99 (str): 99% credible interval\n    \"\"\"\n\n    IS95 = \"is95CredibleSet\"\n    IS99 = \"is99CredibleSet\"\n
"},{"location":"python_api/dataset/study_locus/#schema","title":"Schema","text":"
root\n |-- studyLocusId: long (nullable = false)\n |-- variantId: string (nullable = false)\n |-- chromosome: string (nullable = true)\n |-- position: integer (nullable = true)\n |-- studyId: string (nullable = false)\n |-- beta: double (nullable = true)\n |-- oddsRatio: double (nullable = true)\n |-- oddsRatioConfidenceIntervalLower: double (nullable = true)\n |-- oddsRatioConfidenceIntervalUpper: double (nullable = true)\n |-- betaConfidenceIntervalLower: double (nullable = true)\n |-- betaConfidenceIntervalUpper: double (nullable = true)\n |-- pValueMantissa: float (nullable = true)\n |-- pValueExponent: integer (nullable = true)\n |-- effectAlleleFrequencyFromSource: float (nullable = true)\n |-- standardError: double (nullable = true)\n |-- subStudyDescription: string (nullable = true)\n |-- qualityControls: array (nullable = true)\n |    |-- element: string (containsNull = false)\n |-- finemappingMethod: string (nullable = true)\n |-- ldSet: array (nullable = true)\n |    |-- element: struct (containsNull = true)\n |    |    |-- tagVariantId: string (nullable = true)\n |    |    |-- r2Overall: double (nullable = true)\n |-- locus: array (nullable = true)\n |    |-- element: struct (containsNull = true)\n |    |    |-- is95CredibleSet: boolean (nullable = true)\n |    |    |-- is99CredibleSet: boolean (nullable = true)\n |    |    |-- logABF: double (nullable = true)\n |    |    |-- posteriorProbability: double (nullable = true)\n |    |    |-- variantId: string (nullable = true)\n |    |    |-- pValueMantissa: float (nullable = true)\n |    |    |-- pValueExponent: integer (nullable = true)\n |    |    |-- pValueMantissaConditioned: float (nullable = true)\n |    |    |-- pValueExponentConditioned: integer (nullable = true)\n |    |    |-- beta: double (nullable = true)\n |    |    |-- standardError: double (nullable = true)\n |    |    |-- betaConditioned: double (nullable = true)\n |    |    |-- standardErrorConditioned: double (nullable = true)\n |    |    |-- r2Overall: double (nullable = true)\n
"},{"location":"python_api/dataset/study_locus_overlap/","title":"Study Locus Overlap","text":""},{"location":"python_api/dataset/study_locus_overlap/#otg.dataset.study_locus_overlap.StudyLocusOverlap","title":"otg.dataset.study_locus_overlap.StudyLocusOverlap dataclass","text":"

Bases: Dataset

Study-Locus overlap.

This dataset captures pairs of overlapping StudyLocus: that is, associations whose credible sets share at least one tagging variant.

Note

This is a helpful dataset for other downstream analyses, such as colocalisation. This dataset will contain the overlapping signals between studyLocus associations once they have been clumped and fine-mapped.

Source code in src/otg/dataset/study_locus_overlap.py
@dataclass\nclass StudyLocusOverlap(Dataset):\n    \"\"\"Study-Locus overlap.\n\n    This dataset captures pairs of overlapping `StudyLocus`: that is associations whose credible sets share at least one tagging variant.\n\n    !!! note\n        This is a helpful dataset for other downstream analyses, such as colocalisation. This dataset will contain the overlapping signals between studyLocus associations once they have been clumped and fine-mapped.\n    \"\"\"\n\n    @classmethod\n    def get_schema(cls: type[StudyLocusOverlap]) -> StructType:\n        \"\"\"Provides the schema for the StudyLocusOverlap dataset.\n\n        Returns:\n            StructType: Schema for the StudyLocusOverlap dataset\n        \"\"\"\n        return parse_spark_schema(\"study_locus_overlap.json\")\n\n    @classmethod\n    def from_associations(\n        cls: type[StudyLocusOverlap], study_locus: StudyLocus, study_index: StudyIndex\n    ) -> StudyLocusOverlap:\n        \"\"\"Find the overlapping signals in a particular set of associations (StudyLocus dataset).\n\n        Args:\n            study_locus (StudyLocus): Study-locus associations to find the overlapping signals\n            study_index (StudyIndex): Study index to find the overlapping signals\n\n        Returns:\n            StudyLocusOverlap: Study-locus overlap dataset\n        \"\"\"\n        return study_locus.find_overlaps(study_index)\n
"},{"location":"python_api/dataset/study_locus_overlap/#otg.dataset.study_locus_overlap.StudyLocusOverlap.from_associations","title":"from_associations(study_locus: StudyLocus, study_index: StudyIndex) -> StudyLocusOverlap classmethod","text":"

Find the overlapping signals in a particular set of associations (StudyLocus dataset).

Parameters:

- study_locus (StudyLocus): Study-locus associations to find the overlapping signals. Required.
- study_index (StudyIndex): Study index to find the overlapping signals. Required.

Returns:

- StudyLocusOverlap: Study-locus overlap dataset.

Source code in src/otg/dataset/study_locus_overlap.py
@classmethod\ndef from_associations(\n    cls: type[StudyLocusOverlap], study_locus: StudyLocus, study_index: StudyIndex\n) -> StudyLocusOverlap:\n    \"\"\"Find the overlapping signals in a particular set of associations (StudyLocus dataset).\n\n    Args:\n        study_locus (StudyLocus): Study-locus associations to find the overlapping signals\n        study_index (StudyIndex): Study index to find the overlapping signals\n\n    Returns:\n        StudyLocusOverlap: Study-locus overlap dataset\n    \"\"\"\n    return study_locus.find_overlaps(study_index)\n
"},{"location":"python_api/dataset/study_locus_overlap/#otg.dataset.study_locus_overlap.StudyLocusOverlap.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the StudyLocusOverlap dataset.

Returns:

- StructType: Schema for the StudyLocusOverlap dataset.

Source code in src/otg/dataset/study_locus_overlap.py
@classmethod\ndef get_schema(cls: type[StudyLocusOverlap]) -> StructType:\n    \"\"\"Provides the schema for the StudyLocusOverlap dataset.\n\n    Returns:\n        StructType: Schema for the StudyLocusOverlap dataset\n    \"\"\"\n    return parse_spark_schema(\"study_locus_overlap.json\")\n
"},{"location":"python_api/dataset/study_locus_overlap/#schema","title":"Schema","text":"
root\n |-- leftStudyLocusId: long (nullable = false)\n |-- rightStudyLocusId: long (nullable = false)\n |-- chromosome: string (nullable = false)\n |-- tagVariantId: string (nullable = false)\n |-- statistics: struct (nullable = false)\n |    |-- left_pValueMantissa: float (nullable = true)\n |    |-- left_pValueExponent: integer (nullable = true)\n |    |-- right_pValueMantissa: float (nullable = true)\n |    |-- right_pValueExponent: integer (nullable = true)\n |    |-- left_beta: double (nullable = true)\n |    |-- right_beta: double (nullable = true)\n |    |-- left_logABF: double (nullable = true)\n |    |-- right_logABF: double (nullable = true)\n |    |-- left_posteriorProbability: double (nullable = true)\n |    |-- right_posteriorProbability: double (nullable = true)\n
"},{"location":"python_api/dataset/summary_statistics/","title":"Summary Statistics","text":""},{"location":"python_api/dataset/summary_statistics/#otg.dataset.summary_statistics.SummaryStatistics","title":"otg.dataset.summary_statistics.SummaryStatistics dataclass","text":"

Bases: Dataset

Summary Statistics dataset.

A summary statistics dataset contains all single point statistics resulting from a GWAS.

Source code in src/otg/dataset/summary_statistics.py
@dataclass\nclass SummaryStatistics(Dataset):\n    \"\"\"Summary Statistics dataset.\n\n    A summary statistics dataset contains all single point statistics resulting from a GWAS.\n    \"\"\"\n\n    @classmethod\n    def get_schema(cls: type[SummaryStatistics]) -> StructType:\n        \"\"\"Provides the schema for the SummaryStatistics dataset.\n\n        Returns:\n            StructType: Schema for the SummaryStatistics dataset\n        \"\"\"\n        return parse_spark_schema(\"summary_statistics.json\")\n\n    def pvalue_filter(self: SummaryStatistics, pvalue: float) -> SummaryStatistics:\n        \"\"\"Filter summary statistics based on the provided p-value threshold.\n\n        Args:\n            pvalue (float): upper limit of the p-value to be filtered upon.\n\n        Returns:\n            SummaryStatistics: summary statistics object containing single point associations with p-values at least as significant as the provided threshold.\n        \"\"\"\n        # Converting p-value to mantissa and exponent:\n        (mantissa, exponent) = split_pvalue(pvalue)\n\n        # Applying filter:\n        df = self._df.filter(\n            (f.col(\"pValueExponent\") < exponent)\n            | (\n                (f.col(\"pValueExponent\") == exponent)\n                & (f.col(\"pValueMantissa\") <= mantissa)\n            )\n        )\n        return SummaryStatistics(_df=df, _schema=self._schema)\n\n    def window_based_clumping(\n        self: SummaryStatistics,\n        distance: int,\n        gwas_significance: float = 5e-8,\n        baseline_significance: float = 0.05,\n        locus_collect_distance: int | None = None,\n    ) -> StudyLocus:\n        \"\"\"Generate study-locus from summary statistics by distance based clumping + collect locus.\n\n        Args:\n            distance (int): Distance in base pairs to be used for clumping.\n            gwas_significance (float, optional): GWAS significance threshold. Defaults to 5e-8.\n            baseline_significance (float, optional): Baseline significance threshold for inclusion in the locus. Defaults to 0.05.\n            locus_collect_distance (int | None): The distance to collect locus around semi-indices. 
If not provided, defaults to `distance`.\n\n        Returns:\n            StudyLocus: Clumped study-locus containing variants based on window.\n        \"\"\"\n        # If locus collect distance is present, collect locus with the provided distance:\n        if locus_collect_distance:\n            clumped_df = WindowBasedClumping.clump_with_locus(\n                self,\n                window_length=distance,\n                p_value_significance=gwas_significance,\n                p_value_baseline=baseline_significance,\n                locus_window_length=locus_collect_distance,\n            )\n        else:\n            clumped_df = WindowBasedClumping.clump(\n                self, window_length=distance, p_value_significance=gwas_significance\n            )\n\n        return clumped_df\n\n    def exclude_region(self: SummaryStatistics, region: str) -> SummaryStatistics:\n        \"\"\"Exclude a region from the summary stats dataset.\n\n        Args:\n            region (str): region given in \"chr##:#####-####\" format\n\n        Returns:\n            SummaryStatistics: filtered summary statistics.\n        \"\"\"\n        (chromosome, start_position, end_position) = parse_region(region)\n\n        return SummaryStatistics(\n            _df=(\n                self.df.filter(\n                    ~(\n                        (f.col(\"chromosome\") == chromosome)\n                        & (\n                            (f.col(\"position\") >= start_position)\n                            & (f.col(\"position\") <= end_position)\n                        )\n                    )\n                )\n            ),\n            _schema=SummaryStatistics.get_schema(),\n        )\n
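A sketch of distance-based clumping on a SummaryStatistics instance. Here sumstats is assumed to be a SummaryStatistics object loaded earlier, and the parameter values are illustrative only.

```python
# Clump to one lead variant per 500 kb window at genome-wide significance,
# collecting the surrounding locus within 250 kb of each lead.
study_locus = sumstats.window_based_clumping(
    distance=500_000,
    gwas_significance=5e-8,
    locus_collect_distance=250_000,
)
```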
"},{"location":"python_api/dataset/summary_statistics/#otg.dataset.summary_statistics.SummaryStatistics.exclude_region","title":"exclude_region(region: str) -> SummaryStatistics","text":"

Exclude a region from the summary stats dataset.

Parameters:

Name Type Description Default region str

region given in \"chr##:#####-####\" format

required

Returns:

Name Type Description SummaryStatistics SummaryStatistics

filtered summary statistics.

Source code in src/otg/dataset/summary_statistics.py
def exclude_region(self: SummaryStatistics, region: str) -> SummaryStatistics:\n    \"\"\"Exclude a region from the summary stats dataset.\n\n    Args:\n        region (str): region given in \"chr##:#####-####\" format\n\n    Returns:\n        SummaryStatistics: filtered summary statistics.\n    \"\"\"\n    (chromosome, start_position, end_position) = parse_region(region)\n\n    return SummaryStatistics(\n        _df=(\n            self.df.filter(\n                ~(\n                    (f.col(\"chromosome\") == chromosome)\n                    & (\n                        (f.col(\"position\") >= start_position)\n                        & (f.col(\"position\") <= end_position)\n                    )\n                )\n            )\n        ),\n        _schema=SummaryStatistics.get_schema(),\n    )\n
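A minimal usage sketch, reusing the hypothetical `summary_stats` instance from above. The coordinates are illustrative (roughly the extended MHC region on GRCh38) and follow the "chr##:#####-####" format the method expects.

# Drop a region from the summary statistics; coordinates are illustrative.
filtered_sumstats = summary_stats.exclude_region("chr6:25000000-34000000")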
"},{"location":"python_api/dataset/summary_statistics/#otg.dataset.summary_statistics.SummaryStatistics.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the SummaryStatistics dataset.

Returns:

Name Type Description StructType StructType

Schema for the SummaryStatistics dataset

Source code in src/otg/dataset/summary_statistics.py
@classmethod\ndef get_schema(cls: type[SummaryStatistics]) -> StructType:\n    \"\"\"Provides the schema for the SummaryStatistics dataset.\n\n    Returns:\n        StructType: Schema for the SummaryStatistics dataset\n    \"\"\"\n    return parse_spark_schema(\"summary_statistics.json\")\n
"},{"location":"python_api/dataset/summary_statistics/#otg.dataset.summary_statistics.SummaryStatistics.pvalue_filter","title":"pvalue_filter(pvalue: float) -> SummaryStatistics","text":"

Filter summary statistics based on the provided p-value threshold.

Parameters:

Name Type Description Default pvalue float

upper limit of the p-value to be filtered upon.

required

Returns:

Name Type Description SummaryStatistics SummaryStatistics

summary statistics object containing single point associations with p-values at least as significant as the provided threshold.

Source code in src/otg/dataset/summary_statistics.py
def pvalue_filter(self: SummaryStatistics, pvalue: float) -> SummaryStatistics:\n    \"\"\"Filter summary statistics based on the provided p-value threshold.\n\n    Args:\n        pvalue (float): upper limit of the p-value to be filtered upon.\n\n    Returns:\n        SummaryStatistics: summary statistics object containing single point associations with p-values at least as significant as the provided threshold.\n    \"\"\"\n    # Converting p-value to mantissa and exponent:\n    (mantissa, exponent) = split_pvalue(pvalue)\n\n    # Applying filter:\n    df = self._df.filter(\n        (f.col(\"pValueExponent\") < exponent)\n        | (\n            (f.col(\"pValueExponent\") == exponent)\n            & (f.col(\"pValueMantissa\") <= mantissa)\n        )\n    )\n    return SummaryStatistics(_df=df, _schema=self._schema)\n
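A usage sketch under the same assumption (`summary_stats` is a pre-existing instance). The threshold is split into mantissa and exponent internally, so the comparison does not rely on a single floating-point p-value column.

# Keep associations at or below genome-wide significance (p <= 5e-8).
significant = summary_stats.pvalue_filter(5e-8)
significant.df.select("studyId", "variantId", "pValueMantissa", "pValueExponent").show(5)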
"},{"location":"python_api/dataset/summary_statistics/#otg.dataset.summary_statistics.SummaryStatistics.window_based_clumping","title":"window_based_clumping(distance: int, gwas_significance: float = 5e-08, baseline_significance: float = 0.05, locus_collect_distance: int | None = None) -> StudyLocus","text":"

Generate a study-locus from summary statistics by distance-based clumping and, optionally, collect the locus.

Parameters:

Name Type Description Default distance int

Distance in base pairs to be used for clumping.

required gwas_significance float

GWAS significance threshold. Defaults to 5e-8.

5e-08 baseline_significance float

Baseline significance threshold for inclusion in the locus. Defaults to 0.05.

0.05 locus_collect_distance int | None

The distance around semi-index variants within which the locus is collected. If not provided, defaults to distance.

None

Returns:

Name Type Description StudyLocus StudyLocus

Clumped study-locus containing variants based on window.

Source code in src/otg/dataset/summary_statistics.py
def window_based_clumping(\n    self: SummaryStatistics,\n    distance: int,\n    gwas_significance: float = 5e-8,\n    baseline_significance: float = 0.05,\n    locus_collect_distance: int | None = None,\n) -> StudyLocus:\n    \"\"\"Generate study-locus from summary statistics by distance based clumping + collect locus.\n\n    Args:\n        distance (int): Distance in base pairs to be used for clumping.\n        gwas_significance (float, optional): GWAS significance threshold. Defaults to 5e-8.\n        baseline_significance (float, optional): Baseline significance threshold for inclusion in the locus. Defaults to 0.05.\n        locus_collect_distance (int | None): The distance to collect locus around semi-indices. If not provided, defaults to `distance`.\n\n    Returns:\n        StudyLocus: Clumped study-locus containing variants based on window.\n    \"\"\"\n    # If locus collect distance is present, collect locus with the provided distance:\n    if locus_collect_distance:\n        clumped_df = WindowBasedClumping.clump_with_locus(\n            self,\n            window_length=distance,\n            p_value_significance=gwas_significance,\n            p_value_baseline=baseline_significance,\n            locus_window_length=locus_collect_distance,\n        )\n    else:\n        clumped_df = WindowBasedClumping.clump(\n            self, window_length=distance, p_value_significance=gwas_significance\n        )\n\n    return clumped_df\n
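A hedged sketch with illustrative window sizes, again assuming the hypothetical `summary_stats` instance; passing `locus_collect_distance` switches to the clump-with-locus code path shown above.

# Distance-based clumping with a 500 kb window (illustrative value).
study_locus = summary_stats.window_based_clumping(distance=500_000)

# Additionally collect the locus within 250 kb of each semi-index variant.
study_locus_with_locus = summary_stats.window_based_clumping(
    distance=500_000,
    locus_collect_distance=250_000,
)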
"},{"location":"python_api/dataset/summary_statistics/#schema","title":"Schema","text":"
root\n |-- studyId: string (nullable = false)\n |-- variantId: string (nullable = false)\n |-- chromosome: string (nullable = false)\n |-- position: integer (nullable = false)\n |-- beta: double (nullable = false)\n |-- betaConfidenceIntervalLower: double (nullable = true)\n |-- betaConfidenceIntervalUpper: double (nullable = true)\n |-- pValueMantissa: float (nullable = false)\n |-- pValueExponent: integer (nullable = false)\n |-- effectAlleleFrequencyFromSource: float (nullable = true)\n |-- standardError: double (nullable = true)\n
"},{"location":"python_api/dataset/variant_annotation/","title":"Variant annotation","text":""},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation","title":"otg.dataset.variant_annotation.VariantAnnotation dataclass","text":"

Bases: Dataset

Dataset with variant-level annotations.

Source code in src/otg/dataset/variant_annotation.py
@dataclass\nclass VariantAnnotation(Dataset):\n    \"\"\"Dataset with variant-level annotations.\"\"\"\n\n    @classmethod\n    def get_schema(cls: type[VariantAnnotation]) -> StructType:\n        \"\"\"Provides the schema for the VariantAnnotation dataset.\n\n        Returns:\n            StructType: Schema for the VariantAnnotation dataset\n        \"\"\"\n        return parse_spark_schema(\"variant_annotation.json\")\n\n    def max_maf(self: VariantAnnotation) -> Column:\n        \"\"\"Maximum minor allele frequency accross all populations.\n\n        Returns:\n            Column: Maximum minor allele frequency accross all populations.\n        \"\"\"\n        return f.array_max(\n            f.transform(\n                self.df.alleleFrequencies,\n                lambda af: f.when(\n                    af.alleleFrequency > 0.5, 1 - af.alleleFrequency\n                ).otherwise(af.alleleFrequency),\n            )\n        )\n\n    def filter_by_variant_df(\n        self: VariantAnnotation, df: DataFrame\n    ) -> VariantAnnotation:\n        \"\"\"Filter variant annotation dataset by a variant dataframe.\n\n        Args:\n            df (DataFrame): A dataframe of variants\n\n        Returns:\n            VariantAnnotation: A filtered variant annotation dataset\n        \"\"\"\n        self.df = self._df.join(\n            f.broadcast(df.select(\"variantId\", \"chromosome\")),\n            on=[\"variantId\", \"chromosome\"],\n            how=\"inner\",\n        )\n        return self\n\n    def get_transcript_consequence_df(\n        self: VariantAnnotation, gene_index: GeneIndex | None = None\n    ) -> DataFrame:\n        \"\"\"Dataframe of exploded transcript consequences.\n\n        Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n        Args:\n            gene_index (GeneIndex | None): A gene index. Defaults to None.\n\n        Returns:\n            DataFrame: A dataframe exploded by transcript consequences with the columns variantId, chromosome, transcriptConsequence\n        \"\"\"\n        # exploding the array removes records without VEP annotation\n        transript_consequences = self.df.withColumn(\n            \"transcriptConsequence\", f.explode(\"vep.transcriptConsequences\")\n        ).select(\n            \"variantId\",\n            \"chromosome\",\n            \"position\",\n            \"transcriptConsequence\",\n            f.col(\"transcriptConsequence.geneId\").alias(\"geneId\"),\n        )\n        if gene_index:\n            transript_consequences = transript_consequences.join(\n                f.broadcast(gene_index.df),\n                on=[\"chromosome\", \"geneId\"],\n            )\n        return transript_consequences.persist()\n\n    def get_most_severe_vep_v2g(\n        self: VariantAnnotation,\n        vep_consequences: DataFrame,\n        gene_index: GeneIndex,\n    ) -> V2G:\n        \"\"\"Creates a dataset with variant to gene assignments based on VEP's predicted consequence of the transcript.\n\n        Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n        Args:\n            vep_consequences (DataFrame): A dataframe of VEP consequences\n            gene_index (GeneIndex): A gene index to filter by. 
Defaults to None.\n\n        Returns:\n            V2G: High and medium severity variant to gene assignments\n        \"\"\"\n        return V2G(\n            _df=self.get_transcript_consequence_df(gene_index)\n            .select(\n                \"variantId\",\n                \"chromosome\",\n                f.col(\"transcriptConsequence.geneId\").alias(\"geneId\"),\n                f.explode(\"transcriptConsequence.consequenceTerms\").alias(\"label\"),\n                f.lit(\"vep\").alias(\"datatypeId\"),\n                f.lit(\"variantConsequence\").alias(\"datasourceId\"),\n            )\n            .join(\n                f.broadcast(vep_consequences),\n                on=\"label\",\n                how=\"inner\",\n            )\n            .drop(\"label\")\n            .filter(f.col(\"score\") != 0)\n            # A variant can have multiple predicted consequences on a transcript, the most severe one is selected\n            .transform(\n                lambda df: get_record_with_maximum_value(\n                    df, [\"variantId\", \"geneId\"], \"score\"\n                )\n            ),\n            _schema=V2G.get_schema(),\n        )\n\n    def get_polyphen_v2g(\n        self: VariantAnnotation, gene_index: GeneIndex | None = None\n    ) -> V2G:\n        \"\"\"Creates a dataset with variant to gene assignments with a PolyPhen's predicted score on the transcript.\n\n        Polyphen informs about the probability that a substitution is damaging.The score can be interpreted as follows:\n            - 0.0 to 0.15 -- Predicted to be benign.\n            - 0.15 to 1.0 -- Possibly damaging.\n            - 0.85 to 1.0 -- Predicted to be damaging.\n\n        Args:\n            gene_index (GeneIndex | None): A gene index to filter by. Defaults to None.\n\n        Returns:\n            V2G: variant to gene assignments with their polyphen scores\n        \"\"\"\n        return V2G(\n            _df=(\n                self.get_transcript_consequence_df(gene_index)\n                .filter(f.col(\"transcriptConsequence.polyphenScore\").isNotNull())\n                .select(\n                    \"variantId\",\n                    \"chromosome\",\n                    \"geneId\",\n                    f.col(\"transcriptConsequence.polyphenScore\").alias(\"score\"),\n                    f.lit(\"vep\").alias(\"datatypeId\"),\n                    f.lit(\"polyphen\").alias(\"datasourceId\"),\n                )\n            ),\n            _schema=V2G.get_schema(),\n        )\n\n    def get_sift_v2g(self: VariantAnnotation, gene_index: GeneIndex) -> V2G:\n        \"\"\"Creates a dataset with variant to gene assignments with a SIFT's predicted score on the transcript.\n\n        SIFT informs about the probability that a substitution is tolerated. 
The score can be interpreted as follows:\n            - 0.0 to 0.05 -- Likely to be deleterious.\n            - 0.05 to 1.0 -- Likely to be tolerated.\n\n        Args:\n            gene_index (GeneIndex): A gene index to filter by.\n\n        Returns:\n            V2G: variant to gene assignments with their SIFT scores\n        \"\"\"\n        return V2G(\n            _df=(\n                self.get_transcript_consequence_df(gene_index)\n                .filter(f.col(\"transcriptConsequence.siftScore\").isNotNull())\n                .select(\n                    \"variantId\",\n                    \"chromosome\",\n                    \"geneId\",\n                    f.expr(\"1 - transcriptConsequence.siftScore\").alias(\"score\"),\n                    f.lit(\"vep\").alias(\"datatypeId\"),\n                    f.lit(\"sift\").alias(\"datasourceId\"),\n                )\n            ),\n            _schema=V2G.get_schema(),\n        )\n\n    def get_plof_v2g(self: VariantAnnotation, gene_index: GeneIndex) -> V2G:\n        \"\"\"Creates a dataset with variant to gene assignments with a flag indicating if the variant is predicted to be a loss-of-function variant by the LOFTEE algorithm.\n\n        Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n        Args:\n            gene_index (GeneIndex): A gene index to filter by.\n\n        Returns:\n            V2G: variant to gene assignments from the LOFTEE algorithm\n        \"\"\"\n        return V2G(\n            _df=(\n                self.get_transcript_consequence_df(gene_index)\n                .filter(f.col(\"transcriptConsequence.lof\").isNotNull())\n                .withColumn(\n                    \"isHighQualityPlof\",\n                    f.when(f.col(\"transcriptConsequence.lof\") == \"HC\", True).when(\n                        f.col(\"transcriptConsequence.lof\") == \"LC\", False\n                    ),\n                )\n                .withColumn(\n                    \"score\",\n                    f.when(f.col(\"isHighQualityPlof\"), 1.0).when(\n                        ~f.col(\"isHighQualityPlof\"), 0\n                    ),\n                )\n                .select(\n                    \"variantId\",\n                    \"chromosome\",\n                    \"geneId\",\n                    \"isHighQualityPlof\",\n                    f.col(\"score\"),\n                    f.lit(\"vep\").alias(\"datatypeId\"),\n                    f.lit(\"loftee\").alias(\"datasourceId\"),\n                )\n            ),\n            _schema=V2G.get_schema(),\n        )\n\n    def get_distance_to_tss(\n        self: VariantAnnotation,\n        gene_index: GeneIndex,\n        max_distance: int = 500_000,\n    ) -> V2G:\n        \"\"\"Extracts variant to gene assignments for variants falling within a window of a gene's TSS.\n\n        Args:\n            gene_index (GeneIndex): A gene index to filter by.\n            max_distance (int): The maximum distance from the TSS to consider. 
Defaults to 500_000.\n\n        Returns:\n            V2G: variant to gene assignments with their distance to the TSS\n        \"\"\"\n        return V2G(\n            _df=(\n                self.df.alias(\"variant\")\n                .join(\n                    f.broadcast(gene_index.locations_lut()).alias(\"gene\"),\n                    on=[\n                        f.col(\"variant.chromosome\") == f.col(\"gene.chromosome\"),\n                        f.abs(f.col(\"variant.position\") - f.col(\"gene.tss\"))\n                        <= max_distance,\n                    ],\n                    how=\"inner\",\n                )\n                .withColumn(\n                    \"distance\", f.abs(f.col(\"variant.position\") - f.col(\"gene.tss\"))\n                )\n                .withColumn(\n                    \"inverse_distance\",\n                    max_distance - f.col(\"distance\"),\n                )\n                .transform(lambda df: normalise_column(df, \"inverse_distance\", \"score\"))\n                .select(\n                    \"variantId\",\n                    f.col(\"variant.chromosome\").alias(\"chromosome\"),\n                    \"distance\",\n                    \"geneId\",\n                    \"score\",\n                    f.lit(\"distance\").alias(\"datatypeId\"),\n                    f.lit(\"canonical_tss\").alias(\"datasourceId\"),\n                )\n            ),\n            _schema=V2G.get_schema(),\n        )\n
"},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation.filter_by_variant_df","title":"filter_by_variant_df(df: DataFrame) -> VariantAnnotation","text":"

Filter variant annotation dataset by a variant dataframe.

Parameters:

Name Type Description Default df DataFrame

A dataframe of variants

required

Returns:

Name Type Description VariantAnnotation VariantAnnotation

A filtered variant annotation dataset

Source code in src/otg/dataset/variant_annotation.py
def filter_by_variant_df(\n    self: VariantAnnotation, df: DataFrame\n) -> VariantAnnotation:\n    \"\"\"Filter variant annotation dataset by a variant dataframe.\n\n    Args:\n        df (DataFrame): A dataframe of variants\n\n    Returns:\n        VariantAnnotation: A filtered variant annotation dataset\n    \"\"\"\n    self.df = self._df.join(\n        f.broadcast(df.select(\"variantId\", \"chromosome\")),\n        on=[\"variantId\", \"chromosome\"],\n        how=\"inner\",\n    )\n    return self\n
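A sketch assuming `variant_annotation` is an existing VariantAnnotation instance and `spark` an active SparkSession. Only the `variantId` and `chromosome` columns of the supplied dataframe are used in the broadcast join, and the method updates the dataset in place and returns it.

# Hypothetical dataframe of variants of interest.
variants_of_interest = spark.createDataFrame(
    [("1_154453788_C_T", "1")], ["variantId", "chromosome"]
)
variant_annotation = variant_annotation.filter_by_variant_df(variants_of_interest)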
"},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation.get_distance_to_tss","title":"get_distance_to_tss(gene_index: GeneIndex, max_distance: int = 500000) -> V2G","text":"

Extracts variant to gene assignments for variants falling within a window of a gene's TSS.

Parameters:

Name Type Description Default gene_index GeneIndex

A gene index to filter by.

required max_distance int

The maximum distance from the TSS to consider. Defaults to 500_000.

500000

Returns:

Name Type Description V2G V2G

variant to gene assignments with their distance to the TSS

Source code in src/otg/dataset/variant_annotation.py
def get_distance_to_tss(\n    self: VariantAnnotation,\n    gene_index: GeneIndex,\n    max_distance: int = 500_000,\n) -> V2G:\n    \"\"\"Extracts variant to gene assignments for variants falling within a window of a gene's TSS.\n\n    Args:\n        gene_index (GeneIndex): A gene index to filter by.\n        max_distance (int): The maximum distance from the TSS to consider. Defaults to 500_000.\n\n    Returns:\n        V2G: variant to gene assignments with their distance to the TSS\n    \"\"\"\n    return V2G(\n        _df=(\n            self.df.alias(\"variant\")\n            .join(\n                f.broadcast(gene_index.locations_lut()).alias(\"gene\"),\n                on=[\n                    f.col(\"variant.chromosome\") == f.col(\"gene.chromosome\"),\n                    f.abs(f.col(\"variant.position\") - f.col(\"gene.tss\"))\n                    <= max_distance,\n                ],\n                how=\"inner\",\n            )\n            .withColumn(\n                \"distance\", f.abs(f.col(\"variant.position\") - f.col(\"gene.tss\"))\n            )\n            .withColumn(\n                \"inverse_distance\",\n                max_distance - f.col(\"distance\"),\n            )\n            .transform(lambda df: normalise_column(df, \"inverse_distance\", \"score\"))\n            .select(\n                \"variantId\",\n                f.col(\"variant.chromosome\").alias(\"chromosome\"),\n                \"distance\",\n                \"geneId\",\n                \"score\",\n                f.lit(\"distance\").alias(\"datatypeId\"),\n                f.lit(\"canonical_tss\").alias(\"datasourceId\"),\n            )\n        ),\n        _schema=V2G.get_schema(),\n    )\n
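A usage sketch, assuming pre-existing `variant_annotation` and `gene_index` instances; the window size is illustrative and defaults to 500 kb.

# V2G assignments for variants within 250 kb of a canonical TSS.
distance_v2g = variant_annotation.get_distance_to_tss(gene_index, max_distance=250_000)
distance_v2g.df.select("variantId", "geneId", "distance", "score").show(5)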
"},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation.get_most_severe_vep_v2g","title":"get_most_severe_vep_v2g(vep_consequences: DataFrame, gene_index: GeneIndex) -> V2G","text":"

Creates a dataset with variant to gene assignments based on VEP's predicted consequence of the transcript.

Optionally the transcript consequences can be reduced to the universe of a gene index.

Parameters:

Name Type Description Default vep_consequences DataFrame

A dataframe of VEP consequences

required gene_index GeneIndex

A gene index to filter by.

required

Returns:

Name Type Description V2G V2G

High and medium severity variant to gene assignments

Source code in src/otg/dataset/variant_annotation.py
def get_most_severe_vep_v2g(\n    self: VariantAnnotation,\n    vep_consequences: DataFrame,\n    gene_index: GeneIndex,\n) -> V2G:\n    \"\"\"Creates a dataset with variant to gene assignments based on VEP's predicted consequence of the transcript.\n\n    Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n    Args:\n        vep_consequences (DataFrame): A dataframe of VEP consequences\n        gene_index (GeneIndex): A gene index to filter by. Defaults to None.\n\n    Returns:\n        V2G: High and medium severity variant to gene assignments\n    \"\"\"\n    return V2G(\n        _df=self.get_transcript_consequence_df(gene_index)\n        .select(\n            \"variantId\",\n            \"chromosome\",\n            f.col(\"transcriptConsequence.geneId\").alias(\"geneId\"),\n            f.explode(\"transcriptConsequence.consequenceTerms\").alias(\"label\"),\n            f.lit(\"vep\").alias(\"datatypeId\"),\n            f.lit(\"variantConsequence\").alias(\"datasourceId\"),\n        )\n        .join(\n            f.broadcast(vep_consequences),\n            on=\"label\",\n            how=\"inner\",\n        )\n        .drop(\"label\")\n        .filter(f.col(\"score\") != 0)\n        # A variant can have multiple predicted consequences on a transcript, the most severe one is selected\n        .transform(\n            lambda df: get_record_with_maximum_value(\n                df, [\"variantId\", \"geneId\"], \"score\"\n            )\n        ),\n        _schema=V2G.get_schema(),\n    )\n
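A hedged sketch: `vep_consequences` is assumed to be a lookup dataframe with at least a `label` column (the consequence term) and a numeric `score` column, since those are the columns the join and filter above rely on. The file path is a placeholder.

# Placeholder path; the lookup must expose `label` and `score` columns.
vep_consequences = spark.read.csv(
    "path/to/vep_consequences.tsv", sep="\t", header=True, inferSchema=True
)
vep_v2g = variant_annotation.get_most_severe_vep_v2g(vep_consequences, gene_index)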
"},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation.get_plof_v2g","title":"get_plof_v2g(gene_index: GeneIndex) -> V2G","text":"

Creates a dataset with variant to gene assignments with a flag indicating if the variant is predicted to be a loss-of-function variant by the LOFTEE algorithm.

Optionally the transcript consequences can be reduced to the universe of a gene index.

Parameters:

Name Type Description Default gene_index GeneIndex

A gene index to filter by.

required

Returns:

Name Type Description V2G V2G

variant to gene assignments from the LOFTEE algorithm

Source code in src/otg/dataset/variant_annotation.py
def get_plof_v2g(self: VariantAnnotation, gene_index: GeneIndex) -> V2G:\n    \"\"\"Creates a dataset with variant to gene assignments with a flag indicating if the variant is predicted to be a loss-of-function variant by the LOFTEE algorithm.\n\n    Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n    Args:\n        gene_index (GeneIndex): A gene index to filter by.\n\n    Returns:\n        V2G: variant to gene assignments from the LOFTEE algorithm\n    \"\"\"\n    return V2G(\n        _df=(\n            self.get_transcript_consequence_df(gene_index)\n            .filter(f.col(\"transcriptConsequence.lof\").isNotNull())\n            .withColumn(\n                \"isHighQualityPlof\",\n                f.when(f.col(\"transcriptConsequence.lof\") == \"HC\", True).when(\n                    f.col(\"transcriptConsequence.lof\") == \"LC\", False\n                ),\n            )\n            .withColumn(\n                \"score\",\n                f.when(f.col(\"isHighQualityPlof\"), 1.0).when(\n                    ~f.col(\"isHighQualityPlof\"), 0\n                ),\n            )\n            .select(\n                \"variantId\",\n                \"chromosome\",\n                \"geneId\",\n                \"isHighQualityPlof\",\n                f.col(\"score\"),\n                f.lit(\"vep\").alias(\"datatypeId\"),\n                f.lit(\"loftee\").alias(\"datasourceId\"),\n            )\n        ),\n        _schema=V2G.get_schema(),\n    )\n
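A short sketch under the same assumptions; as the source shows, `isHighQualityPlof` is derived from LOFTEE's "HC"/"LC" calls.

plof_v2g = variant_annotation.get_plof_v2g(gene_index)
# Keep only high-confidence loss-of-function assignments.
high_confidence = plof_v2g.df.filter("isHighQualityPlof")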
"},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation.get_polyphen_v2g","title":"get_polyphen_v2g(gene_index: GeneIndex | None = None) -> V2G","text":"

Creates a dataset with variant to gene assignments with PolyPhen's predicted score on the transcript.

PolyPhen informs about the probability that a substitution is damaging. The score can be interpreted as follows:

  • 0.0 to 0.15 -- Predicted to be benign.
  • 0.15 to 1.0 -- Possibly damaging.
  • 0.85 to 1.0 -- Predicted to be damaging.

Parameters:

Name Type Description Default gene_index GeneIndex | None

A gene index to filter by. Defaults to None.

None

Returns:

Name Type Description V2G V2G

variant to gene assignments with their polyphen scores

Source code in src/otg/dataset/variant_annotation.py
def get_polyphen_v2g(\n    self: VariantAnnotation, gene_index: GeneIndex | None = None\n) -> V2G:\n    \"\"\"Creates a dataset with variant to gene assignments with a PolyPhen's predicted score on the transcript.\n\n    Polyphen informs about the probability that a substitution is damaging.The score can be interpreted as follows:\n        - 0.0 to 0.15 -- Predicted to be benign.\n        - 0.15 to 1.0 -- Possibly damaging.\n        - 0.85 to 1.0 -- Predicted to be damaging.\n\n    Args:\n        gene_index (GeneIndex | None): A gene index to filter by. Defaults to None.\n\n    Returns:\n        V2G: variant to gene assignments with their polyphen scores\n    \"\"\"\n    return V2G(\n        _df=(\n            self.get_transcript_consequence_df(gene_index)\n            .filter(f.col(\"transcriptConsequence.polyphenScore\").isNotNull())\n            .select(\n                \"variantId\",\n                \"chromosome\",\n                \"geneId\",\n                f.col(\"transcriptConsequence.polyphenScore\").alias(\"score\"),\n                f.lit(\"vep\").alias(\"datatypeId\"),\n                f.lit(\"polyphen\").alias(\"datasourceId\"),\n            )\n        ),\n        _schema=V2G.get_schema(),\n    )\n
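A usage sketch, assuming pre-existing `variant_annotation` and `gene_index` instances; the 0.85 cut-off mirrors the "predicted to be damaging" band described above.

polyphen_v2g = variant_annotation.get_polyphen_v2g(gene_index)
# Restrict to substitutions predicted to be damaging (score >= 0.85).
damaging = polyphen_v2g.df.filter("score >= 0.85")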
"},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the VariantAnnotation dataset.

Returns:

Name Type Description StructType StructType

Schema for the VariantAnnotation dataset

Source code in src/otg/dataset/variant_annotation.py
@classmethod\ndef get_schema(cls: type[VariantAnnotation]) -> StructType:\n    \"\"\"Provides the schema for the VariantAnnotation dataset.\n\n    Returns:\n        StructType: Schema for the VariantAnnotation dataset\n    \"\"\"\n    return parse_spark_schema(\"variant_annotation.json\")\n
"},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation.get_sift_v2g","title":"get_sift_v2g(gene_index: GeneIndex) -> V2G","text":"

Creates a dataset with variant to gene assignments with SIFT's predicted score on the transcript.

SIFT informs about the probability that a substitution is tolerated. The score can be interpreted as follows:

  • 0.0 to 0.05 -- Likely to be deleterious.
  • 0.05 to 1.0 -- Likely to be tolerated.

Parameters:

Name Type Description Default gene_index GeneIndex

A gene index to filter by.

required

Returns:

Name Type Description V2G V2G

variant to gene assignments with their SIFT scores

Source code in src/otg/dataset/variant_annotation.py
def get_sift_v2g(self: VariantAnnotation, gene_index: GeneIndex) -> V2G:\n    \"\"\"Creates a dataset with variant to gene assignments with a SIFT's predicted score on the transcript.\n\n    SIFT informs about the probability that a substitution is tolerated. The score can be interpreted as follows:\n        - 0.0 to 0.05 -- Likely to be deleterious.\n        - 0.05 to 1.0 -- Likely to be tolerated.\n\n    Args:\n        gene_index (GeneIndex): A gene index to filter by.\n\n    Returns:\n        V2G: variant to gene assignments with their SIFT scores\n    \"\"\"\n    return V2G(\n        _df=(\n            self.get_transcript_consequence_df(gene_index)\n            .filter(f.col(\"transcriptConsequence.siftScore\").isNotNull())\n            .select(\n                \"variantId\",\n                \"chromosome\",\n                \"geneId\",\n                f.expr(\"1 - transcriptConsequence.siftScore\").alias(\"score\"),\n                f.lit(\"vep\").alias(\"datatypeId\"),\n                f.lit(\"sift\").alias(\"datasourceId\"),\n            )\n        ),\n        _schema=V2G.get_schema(),\n    )\n
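The same pattern applies to SIFT; note that, as shown in the source, the stored score is 1 - siftScore, so higher values indicate a more deleterious prediction. Names are assumed as in the previous sketches.

sift_v2g = variant_annotation.get_sift_v2g(gene_index)
# score = 1 - siftScore, so score >= 0.95 corresponds to siftScore <= 0.05 ("deleterious").
deleterious = sift_v2g.df.filter("score >= 0.95")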
"},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation.get_transcript_consequence_df","title":"get_transcript_consequence_df(gene_index: GeneIndex | None = None) -> DataFrame","text":"

Dataframe of exploded transcript consequences.

Optionally the transcript consequences can be reduced to the universe of a gene index.

Parameters:

Name Type Description Default gene_index GeneIndex | None

A gene index. Defaults to None.

None

Returns:

Name Type Description DataFrame DataFrame

A dataframe exploded by transcript consequences with the columns variantId, chromosome, transcriptConsequence

Source code in src/otg/dataset/variant_annotation.py
def get_transcript_consequence_df(\n    self: VariantAnnotation, gene_index: GeneIndex | None = None\n) -> DataFrame:\n    \"\"\"Dataframe of exploded transcript consequences.\n\n    Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n    Args:\n        gene_index (GeneIndex | None): A gene index. Defaults to None.\n\n    Returns:\n        DataFrame: A dataframe exploded by transcript consequences with the columns variantId, chromosome, transcriptConsequence\n    \"\"\"\n    # exploding the array removes records without VEP annotation\n    transript_consequences = self.df.withColumn(\n        \"transcriptConsequence\", f.explode(\"vep.transcriptConsequences\")\n    ).select(\n        \"variantId\",\n        \"chromosome\",\n        \"position\",\n        \"transcriptConsequence\",\n        f.col(\"transcriptConsequence.geneId\").alias(\"geneId\"),\n    )\n    if gene_index:\n        transript_consequences = transript_consequences.join(\n            f.broadcast(gene_index.df),\n            on=[\"chromosome\", \"geneId\"],\n        )\n    return transript_consequences.persist()\n
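A sketch under the same assumptions; passing a gene index restricts the exploded consequences to the genes present in that index.

tc_df = variant_annotation.get_transcript_consequence_df(gene_index)
tc_df.select("variantId", "geneId", "transcriptConsequence.consequenceTerms").show(5, truncate=False)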
"},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation.max_maf","title":"max_maf() -> Column","text":"

Maximum minor allele frequency across all populations.

Returns:

Name Type Description Column Column

Maximum minor allele frequency across all populations.

Source code in src/otg/dataset/variant_annotation.py
def max_maf(self: VariantAnnotation) -> Column:\n    \"\"\"Maximum minor allele frequency accross all populations.\n\n    Returns:\n        Column: Maximum minor allele frequency accross all populations.\n    \"\"\"\n    return f.array_max(\n        f.transform(\n            self.df.alleleFrequencies,\n            lambda af: f.when(\n                af.alleleFrequency > 0.5, 1 - af.alleleFrequency\n            ).otherwise(af.alleleFrequency),\n        )\n    )\n
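Because max_maf() returns a Column expression rather than a dataset, it is typically attached to the underlying dataframe, for example (assuming the hypothetical `variant_annotation` instance):

# Add the maximum minor allele frequency as a new column.
with_max_maf = variant_annotation.df.withColumn("maxMaf", variant_annotation.max_maf())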
"},{"location":"python_api/dataset/variant_annotation/#schema","title":"Schema","text":"
root\n |-- variantId: string (nullable = false)\n |-- chromosome: string (nullable = false)\n |-- position: integer (nullable = false)\n |-- gnomad3VariantId: string (nullable = false)\n |-- referenceAllele: string (nullable = false)\n |-- alternateAllele: string (nullable = false)\n |-- chromosomeB37: string (nullable = true)\n |-- positionB37: integer (nullable = true)\n |-- alleleType: string (nullable = true)\n |-- rsIds: array (nullable = true)\n |    |-- element: string (containsNull = true)\n |-- alleleFrequencies: array (nullable = false)\n |    |-- element: struct (containsNull = true)\n |    |    |-- populationName: string (nullable = true)\n |    |    |-- alleleFrequency: double (nullable = true)\n |-- cadd: struct (nullable = true)\n |    |-- phred: float (nullable = true)\n |    |-- raw: float (nullable = true)\n |-- vep: struct (nullable = false)\n |    |-- mostSevereConsequence: string (nullable = true)\n |    |-- transcriptConsequences: array (nullable = true)\n |    |    |-- element: struct (containsNull = true)\n |    |    |    |-- aminoAcids: string (nullable = true)\n |    |    |    |-- consequenceTerms: array (nullable = true)\n |    |    |    |    |-- element: string (containsNull = true)\n |    |    |    |-- geneId: string (nullable = true)\n |    |    |    |-- lof: string (nullable = true)\n |    |    |    |-- polyphenScore: double (nullable = true)\n |    |    |    |-- polyphenPrediction: string (nullable = true)\n |    |    |    |-- siftScore: double (nullable = true)\n |    |    |    |-- siftPrediction: string (nullable = true)\n
"},{"location":"python_api/dataset/variant_index/","title":"Variant index","text":""},{"location":"python_api/dataset/variant_index/#otg.dataset.variant_index.VariantIndex","title":"otg.dataset.variant_index.VariantIndex dataclass","text":"

Bases: Dataset

Variant index dataset.

The variant index dataset is the result of intersecting the variant annotation dataset with the variants for which V2D information is available.

Source code in src/otg/dataset/variant_index.py
@dataclass\nclass VariantIndex(Dataset):\n    \"\"\"Variant index dataset.\n\n    Variant index dataset is the result of intersecting the variant annotation dataset with the variants with V2D available information.\n    \"\"\"\n\n    @classmethod\n    def get_schema(cls: type[VariantIndex]) -> StructType:\n        \"\"\"Provides the schema for the VariantIndex dataset.\n\n        Returns:\n            StructType: Schema for the VariantIndex dataset\n        \"\"\"\n        return parse_spark_schema(\"variant_index.json\")\n\n    @classmethod\n    def from_variant_annotation(\n        cls: type[VariantIndex],\n        variant_annotation: VariantAnnotation,\n        study_locus: StudyLocus,\n    ) -> VariantIndex:\n        \"\"\"Initialise VariantIndex from pre-existing variant annotation dataset.\n\n        Args:\n            variant_annotation (VariantAnnotation): Variant annotation dataset\n            study_locus (StudyLocus): Study locus dataset with the variants to intersect with the variant annotation dataset\n\n        Returns:\n            VariantIndex: Variant index dataset\n        \"\"\"\n        unchanged_cols = [\n            \"variantId\",\n            \"chromosome\",\n            \"position\",\n            \"referenceAllele\",\n            \"alternateAllele\",\n            \"chromosomeB37\",\n            \"positionB37\",\n            \"alleleType\",\n            \"alleleFrequencies\",\n            \"cadd\",\n        ]\n        va_slimmed = variant_annotation.filter_by_variant_df(\n            study_locus.unique_variants_in_locus()\n        )\n        return cls(\n            _df=(\n                va_slimmed.df.select(\n                    *unchanged_cols,\n                    f.col(\"vep.mostSevereConsequence\").alias(\"mostSevereConsequence\"),\n                    # filters/rsid are arrays that can be empty, in this case we convert them to null\n                    nullify_empty_array(f.col(\"rsIds\")).alias(\"rsIds\"),\n                )\n                .repartition(400, \"chromosome\")\n                .sortWithinPartitions(\"chromosome\", \"position\")\n            ),\n            _schema=cls.get_schema(),\n        )\n
"},{"location":"python_api/dataset/variant_index/#otg.dataset.variant_index.VariantIndex.from_variant_annotation","title":"from_variant_annotation(variant_annotation: VariantAnnotation, study_locus: StudyLocus) -> VariantIndex classmethod","text":"

Initialise VariantIndex from pre-existing variant annotation dataset.

Parameters:

Name Type Description Default variant_annotation VariantAnnotation

Variant annotation dataset

required study_locus StudyLocus

Study locus dataset with the variants to intersect with the variant annotation dataset

required

Returns:

Name Type Description VariantIndex VariantIndex

Variant index dataset

Source code in src/otg/dataset/variant_index.py
@classmethod\ndef from_variant_annotation(\n    cls: type[VariantIndex],\n    variant_annotation: VariantAnnotation,\n    study_locus: StudyLocus,\n) -> VariantIndex:\n    \"\"\"Initialise VariantIndex from pre-existing variant annotation dataset.\n\n    Args:\n        variant_annotation (VariantAnnotation): Variant annotation dataset\n        study_locus (StudyLocus): Study locus dataset with the variants to intersect with the variant annotation dataset\n\n    Returns:\n        VariantIndex: Variant index dataset\n    \"\"\"\n    unchanged_cols = [\n        \"variantId\",\n        \"chromosome\",\n        \"position\",\n        \"referenceAllele\",\n        \"alternateAllele\",\n        \"chromosomeB37\",\n        \"positionB37\",\n        \"alleleType\",\n        \"alleleFrequencies\",\n        \"cadd\",\n    ]\n    va_slimmed = variant_annotation.filter_by_variant_df(\n        study_locus.unique_variants_in_locus()\n    )\n    return cls(\n        _df=(\n            va_slimmed.df.select(\n                *unchanged_cols,\n                f.col(\"vep.mostSevereConsequence\").alias(\"mostSevereConsequence\"),\n                # filters/rsid are arrays that can be empty, in this case we convert them to null\n                nullify_empty_array(f.col(\"rsIds\")).alias(\"rsIds\"),\n            )\n            .repartition(400, \"chromosome\")\n            .sortWithinPartitions(\"chromosome\", \"position\")\n        ),\n        _schema=cls.get_schema(),\n    )\n
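A construction sketch, assuming pre-existing `variant_annotation` (VariantAnnotation) and `study_locus` (StudyLocus) instances as described in the parameter table above.

from otg.dataset.variant_index import VariantIndex

variant_index = VariantIndex.from_variant_annotation(
    variant_annotation=variant_annotation,
    study_locus=study_locus,
)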
"},{"location":"python_api/dataset/variant_index/#otg.dataset.variant_index.VariantIndex.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the VariantIndex dataset.

Returns:

Name Type Description StructType StructType

Schema for the VariantIndex dataset

Source code in src/otg/dataset/variant_index.py
@classmethod\ndef get_schema(cls: type[VariantIndex]) -> StructType:\n    \"\"\"Provides the schema for the VariantIndex dataset.\n\n    Returns:\n        StructType: Schema for the VariantIndex dataset\n    \"\"\"\n    return parse_spark_schema(\"variant_index.json\")\n
"},{"location":"python_api/dataset/variant_index/#schema","title":"Schema","text":"
root\n |-- variantId: string (nullable = false)\n |-- chromosome: string (nullable = false)\n |-- position: integer (nullable = false)\n |-- referenceAllele: string (nullable = false)\n |-- alternateAllele: string (nullable = false)\n |-- chromosomeB37: string (nullable = true)\n |-- positionB37: integer (nullable = true)\n |-- alleleType: string (nullable = false)\n |-- alleleFrequencies: array (nullable = false)\n |    |-- element: struct (containsNull = true)\n |    |    |-- populationName: string (nullable = true)\n |    |    |-- alleleFrequency: double (nullable = true)\n |-- cadd: struct (nullable = true)\n |    |-- phred: float (nullable = true)\n |    |-- raw: float (nullable = true)\n |-- mostSevereConsequence: string (nullable = true)\n |-- rsIds: array (nullable = true)\n |    |-- element: string (containsNull = true)\n
"},{"location":"python_api/dataset/variant_to_gene/","title":"Variant-to-gene","text":""},{"location":"python_api/dataset/variant_to_gene/#otg.dataset.v2g.V2G","title":"otg.dataset.v2g.V2G dataclass","text":"

Bases: Dataset

Variant-to-gene (V2G) evidence dataset.

Variant-to-gene (V2G) evidence is understood as any piece of evidence that supports the association of a variant with a likely causal gene. The evidence can sometimes be context-specific and refer to specific biofeatures (e.g. cell types).

Source code in src/otg/dataset/v2g.py
@dataclass\nclass V2G(Dataset):\n    \"\"\"Variant-to-gene (V2G) evidence dataset.\n\n    A variant-to-gene (V2G) evidence is understood as any piece of evidence that supports the association of a variant with a likely causal gene. The evidence can sometimes be context-specific and refer to specific `biofeatures` (e.g. cell types)\n    \"\"\"\n\n    @classmethod\n    def get_schema(cls: type[V2G]) -> StructType:\n        \"\"\"Provides the schema for the V2G dataset.\n\n        Returns:\n            StructType: Schema for the V2G dataset\n        \"\"\"\n        return parse_spark_schema(\"v2g.json\")\n\n    def filter_by_genes(self: V2G, genes: GeneIndex) -> V2G:\n        \"\"\"Filter by V2G dataset by genes.\n\n        Args:\n            genes (GeneIndex): Gene index dataset to filter by\n\n        Returns:\n            V2G: V2G dataset filtered by genes\n        \"\"\"\n        self.df = self._df.join(genes.df.select(\"geneId\"), on=\"geneId\", how=\"inner\")\n        return self\n
"},{"location":"python_api/dataset/variant_to_gene/#otg.dataset.v2g.V2G.filter_by_genes","title":"filter_by_genes(genes: GeneIndex) -> V2G","text":"

Filter the V2G dataset by genes.

Parameters:

Name Type Description Default genes GeneIndex

Gene index dataset to filter by

required

Returns:

Name Type Description V2G V2G

V2G dataset filtered by genes

Source code in src/otg/dataset/v2g.py
def filter_by_genes(self: V2G, genes: GeneIndex) -> V2G:\n    \"\"\"Filter by V2G dataset by genes.\n\n    Args:\n        genes (GeneIndex): Gene index dataset to filter by\n\n    Returns:\n        V2G: V2G dataset filtered by genes\n    \"\"\"\n    self.df = self._df.join(genes.df.select(\"geneId\"), on=\"geneId\", how=\"inner\")\n    return self\n
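A short sketch, assuming `v2g` is an existing V2G instance and `gene_index` a GeneIndex; as with other filters in this module, the dataset is updated in place and returned.

# Keep only evidence for genes present in the gene index.
v2g = v2g.filter_by_genes(gene_index)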
"},{"location":"python_api/dataset/variant_to_gene/#otg.dataset.v2g.V2G.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the V2G dataset.

Returns:

Name Type Description StructType StructType

Schema for the V2G dataset

Source code in src/otg/dataset/v2g.py
@classmethod\ndef get_schema(cls: type[V2G]) -> StructType:\n    \"\"\"Provides the schema for the V2G dataset.\n\n    Returns:\n        StructType: Schema for the V2G dataset\n    \"\"\"\n    return parse_spark_schema(\"v2g.json\")\n
"},{"location":"python_api/dataset/variant_to_gene/#schema","title":"Schema","text":"
root\n |-- geneId: string (nullable = false)\n |-- variantId: string (nullable = false)\n |-- distance: long (nullable = true)\n |-- chromosome: string (nullable = false)\n |-- datatypeId: string (nullable = false)\n |-- datasourceId: string (nullable = false)\n |-- score: double (nullable = true)\n |-- resourceScore: double (nullable = true)\n |-- pmid: string (nullable = true)\n |-- biofeature: string (nullable = true)\n |-- variantFunctionalConsequenceId: string (nullable = true)\n |-- isHighQualityPlof: boolean (nullable = true)\n
"},{"location":"python_api/datasource/_datasource/","title":"Data Source","text":"

TBC

"},{"location":"python_api/datasource/finngen/_finngen/","title":"FinnGen","text":""},{"location":"python_api/datasource/finngen/study_index/","title":"Study Index","text":""},{"location":"python_api/datasource/finngen/study_index/#otg.datasource.finngen.study_index.FinnGenStudyIndex","title":"otg.datasource.finngen.study_index.FinnGenStudyIndex","text":"

Bases: StudyIndex

Study index dataset from FinnGen.

The following information is aggregated/extracted:

  • Study ID in the special format (FINNGEN_R9_*)
  • Trait name (for example, Amoebiasis)
  • Number of cases and controls
  • Link to the summary statistics location

Some fields are also populated as constants, such as study type and the initial sample size.

Source code in src/otg/datasource/finngen/study_index.py
class FinnGenStudyIndex(StudyIndex):\n    \"\"\"Study index dataset from FinnGen.\n\n    The following information is aggregated/extracted:\n\n    - Study ID in the special format (FINNGEN_R9_*)\n    - Trait name (for example, Amoebiasis)\n    - Number of cases and controls\n    - Link to the summary statistics location\n\n    Some fields are also populated as constants, such as study type and the initial sample size.\n    \"\"\"\n\n    @classmethod\n    def from_source(\n        cls: type[FinnGenStudyIndex],\n        finngen_studies: DataFrame,\n        finngen_release_prefix: str,\n        finngen_summary_stats_url_prefix: str,\n        finngen_summary_stats_url_suffix: str,\n    ) -> FinnGenStudyIndex:\n        \"\"\"This function ingests study level metadata from FinnGen.\n\n        Args:\n            finngen_studies (DataFrame): FinnGen raw study table\n            finngen_release_prefix (str): Release prefix pattern.\n            finngen_summary_stats_url_prefix (str): URL prefix for summary statistics location.\n            finngen_summary_stats_url_suffix (str): URL prefix suffix for summary statistics location.\n\n        Returns:\n            FinnGenStudyIndex: Parsed and annotated FinnGen study table.\n        \"\"\"\n        return FinnGenStudyIndex(\n            _df=finngen_studies.select(\n                f.concat(f.lit(f\"{finngen_release_prefix}_\"), f.col(\"phenocode\")).alias(\n                    \"studyId\"\n                ),\n                f.col(\"phenostring\").alias(\"traitFromSource\"),\n                f.col(\"num_cases\").alias(\"nCases\"),\n                f.col(\"num_controls\").alias(\"nControls\"),\n                (f.col(\"num_cases\") + f.col(\"num_controls\")).alias(\"nSamples\"),\n                f.lit(finngen_release_prefix).alias(\"projectId\"),\n                f.lit(\"gwas\").alias(\"studyType\"),\n                f.lit(True).alias(\"hasSumstats\"),\n                f.lit(\"377,277 (210,870 females and 166,407 males)\").alias(\n                    \"initialSampleSize\"\n                ),\n                f.array(\n                    f.struct(\n                        f.lit(377277).cast(\"long\").alias(\"sampleSize\"),\n                        f.lit(\"Finnish\").alias(\"ancestry\"),\n                    )\n                ).alias(\"discoverySamples\"),\n                f.concat(\n                    f.lit(finngen_summary_stats_url_prefix),\n                    f.col(\"phenocode\"),\n                    f.lit(finngen_summary_stats_url_suffix),\n                ).alias(\"summarystatsLocation\"),\n            ).withColumn(\n                \"ldPopulationStructure\",\n                cls.aggregate_and_map_ancestries(f.col(\"discoverySamples\")),\n            ),\n            _schema=cls.get_schema(),\n        )\n
"},{"location":"python_api/datasource/finngen/study_index/#otg.datasource.finngen.study_index.FinnGenStudyIndex.from_source","title":"from_source(finngen_studies: DataFrame, finngen_release_prefix: str, finngen_summary_stats_url_prefix: str, finngen_summary_stats_url_suffix: str) -> FinnGenStudyIndex classmethod","text":"

This function ingests study level metadata from FinnGen.

Parameters:

Name Type Description Default finngen_studies DataFrame

FinnGen raw study table

required finngen_release_prefix str

Release prefix pattern.

required finngen_summary_stats_url_prefix str

URL prefix for summary statistics location.

required finngen_summary_stats_url_suffix str

URL suffix for summary statistics location.

required

Returns:

Name Type Description FinnGenStudyIndex FinnGenStudyIndex

Parsed and annotated FinnGen study table.

Source code in src/otg/datasource/finngen/study_index.py
@classmethod\ndef from_source(\n    cls: type[FinnGenStudyIndex],\n    finngen_studies: DataFrame,\n    finngen_release_prefix: str,\n    finngen_summary_stats_url_prefix: str,\n    finngen_summary_stats_url_suffix: str,\n) -> FinnGenStudyIndex:\n    \"\"\"This function ingests study level metadata from FinnGen.\n\n    Args:\n        finngen_studies (DataFrame): FinnGen raw study table\n        finngen_release_prefix (str): Release prefix pattern.\n        finngen_summary_stats_url_prefix (str): URL prefix for summary statistics location.\n        finngen_summary_stats_url_suffix (str): URL prefix suffix for summary statistics location.\n\n    Returns:\n        FinnGenStudyIndex: Parsed and annotated FinnGen study table.\n    \"\"\"\n    return FinnGenStudyIndex(\n        _df=finngen_studies.select(\n            f.concat(f.lit(f\"{finngen_release_prefix}_\"), f.col(\"phenocode\")).alias(\n                \"studyId\"\n            ),\n            f.col(\"phenostring\").alias(\"traitFromSource\"),\n            f.col(\"num_cases\").alias(\"nCases\"),\n            f.col(\"num_controls\").alias(\"nControls\"),\n            (f.col(\"num_cases\") + f.col(\"num_controls\")).alias(\"nSamples\"),\n            f.lit(finngen_release_prefix).alias(\"projectId\"),\n            f.lit(\"gwas\").alias(\"studyType\"),\n            f.lit(True).alias(\"hasSumstats\"),\n            f.lit(\"377,277 (210,870 females and 166,407 males)\").alias(\n                \"initialSampleSize\"\n            ),\n            f.array(\n                f.struct(\n                    f.lit(377277).cast(\"long\").alias(\"sampleSize\"),\n                    f.lit(\"Finnish\").alias(\"ancestry\"),\n                )\n            ).alias(\"discoverySamples\"),\n            f.concat(\n                f.lit(finngen_summary_stats_url_prefix),\n                f.col(\"phenocode\"),\n                f.lit(finngen_summary_stats_url_suffix),\n            ).alias(\"summarystatsLocation\"),\n        ).withColumn(\n            \"ldPopulationStructure\",\n            cls.aggregate_and_map_ancestries(f.col(\"discoverySamples\")),\n        ),\n        _schema=cls.get_schema(),\n    )\n
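A hedged sketch: `finngen_studies` is assumed to be a dataframe with at least the phenocode, phenostring, num_cases and num_controls columns used above; the release prefix matches the FINNGEN_R9 pattern mentioned earlier, and the URL pieces are placeholders rather than the real endpoints.

from otg.datasource.finngen.study_index import FinnGenStudyIndex

finngen_study_index = FinnGenStudyIndex.from_source(
    finngen_studies=finngen_studies,
    finngen_release_prefix="FINNGEN_R9",
    finngen_summary_stats_url_prefix="https://example.org/finngen_R9_",  # placeholder
    finngen_summary_stats_url_suffix=".gz",  # placeholder
)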
"},{"location":"python_api/datasource/gnomad/_gnomad/","title":"GnomAD","text":""},{"location":"python_api/datasource/gnomad/gnomad_ld/","title":"LD Matrix","text":""},{"location":"python_api/datasource/gnomad/gnomad_ld/#otg.datasource.gnomad.ld.GnomADLDMatrix","title":"otg.datasource.gnomad.ld.GnomADLDMatrix","text":"

Importer of LD information from GnomAD.

The information comes from LD matrices made available by GnomAD in Hail's native format. We aggregate the LD information across 8 ancestries. The basic steps to generate the LDIndex are:

  1. Convert a LD matrix to a Spark DataFrame.
  2. Resolve the matrix indices to variant IDs by lifting over the coordinates to GRCh38.
  3. Aggregate the LD information across populations.
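A hedged sketch of the as_ld_index entry point documented below; all paths, population codes and the min_r2 value are placeholders, and the exact template convention is set by the pipeline configuration rather than shown here.

from otg.datasource.gnomad.ld import GnomADLDMatrix

ld_index = GnomADLDMatrix.as_ld_index(
    ld_populations=["afr", "nfe"],  # placeholder population codes
    ld_matrix_template="gs://bucket/gnomad_ld_{POP}.bm",  # placeholder
    ld_index_raw_template="gs://bucket/gnomad_ld_{POP}.ht",  # placeholder
    grch37_to_grch38_chain_path="gs://bucket/grch37_to_grch38.over.chain.gz",  # placeholder
    min_r2=0.5,
)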
Source code in src/otg/datasource/gnomad/ld.py
class GnomADLDMatrix:\n    \"\"\"Importer of LD information from GnomAD.\n\n    The information comes from LD matrices [made available by GnomAD](https://gnomad.broadinstitute.org/downloads/#v2-linkage-disequilibrium) in Hail's native format. We aggregate the LD information across 8 ancestries.\n    The basic steps to generate the LDIndex are:\n\n    1. Convert a LD matrix to a Spark DataFrame.\n    2. Resolve the matrix indices to variant IDs by lifting over the coordinates to GRCh38.\n    3. Aggregate the LD information across populations.\n\n    \"\"\"\n\n    @staticmethod\n    def _aggregate_ld_index_across_populations(\n        unaggregated_ld_index: DataFrame,\n    ) -> DataFrame:\n        \"\"\"Aggregate LDIndex across populations.\n\n        Args:\n            unaggregated_ld_index (DataFrame): Unaggregate LDIndex index dataframe  each row is a variant pair in a population\n\n        Returns:\n            DataFrame: Aggregated LDIndex index dataframe  each row is a variant with the LD set across populations\n\n        Examples:\n            >>> data = [(\"1.0\", \"var1\", \"X\", \"var1\", \"pop1\"), (\"1.0\", \"X\", \"var2\", \"var2\", \"pop1\"),\n            ...         (\"0.5\", \"var1\", \"X\", \"var2\", \"pop1\"), (\"0.5\", \"var1\", \"X\", \"var2\", \"pop2\"),\n            ...         (\"0.5\", \"var2\", \"X\", \"var1\", \"pop1\"), (\"0.5\", \"X\", \"var2\", \"var1\", \"pop2\")]\n            >>> df = spark.createDataFrame(data, [\"r\", \"variantId\", \"chromosome\", \"tagvariantId\", \"population\"])\n            >>> GnomADLDMatrix._aggregate_ld_index_across_populations(df).printSchema()\n            root\n             |-- variantId: string (nullable = true)\n             |-- chromosome: string (nullable = true)\n             |-- ldSet: array (nullable = false)\n             |    |-- element: struct (containsNull = false)\n             |    |    |-- tagVariantId: string (nullable = true)\n             |    |    |-- rValues: array (nullable = false)\n             |    |    |    |-- element: struct (containsNull = false)\n             |    |    |    |    |-- population: string (nullable = true)\n             |    |    |    |    |-- r: string (nullable = true)\n            <BLANKLINE>\n        \"\"\"\n        return (\n            unaggregated_ld_index\n            # First level of aggregation: get r/population for each variant/tagVariant pair\n            .withColumn(\"r_pop_struct\", f.struct(\"population\", \"r\"))\n            .groupBy(\"chromosome\", \"variantId\", \"tagVariantId\")\n            .agg(\n                f.collect_set(\"r_pop_struct\").alias(\"rValues\"),\n            )\n            # Second level of aggregation: get r/population for each variant\n            .withColumn(\"r_pop_tag_struct\", f.struct(\"tagVariantId\", \"rValues\"))\n            .groupBy(\"variantId\", \"chromosome\")\n            .agg(\n                f.collect_set(\"r_pop_tag_struct\").alias(\"ldSet\"),\n            )\n        )\n\n    @staticmethod\n    def _convert_ld_matrix_to_table(\n        block_matrix: BlockMatrix, min_r2: float\n    ) -> DataFrame:\n        \"\"\"Convert LD matrix to table.\n\n        Args:\n            block_matrix (BlockMatrix): LD matrix\n            min_r2 (float): Minimum r2 value to keep in the table\n\n        Returns:\n            DataFrame: LD matrix as a Spark DataFrame\n        \"\"\"\n        table = block_matrix.entries(keyed=False)\n        return (\n            table.filter(hl.abs(table.entry) >= min_r2**0.5)\n            .to_spark()\n            
.withColumnRenamed(\"entry\", \"r\")\n        )\n\n    @staticmethod\n    def _create_ldindex_for_population(\n        population_id: str,\n        ld_matrix_path: str,\n        ld_index_raw_path: str,\n        grch37_to_grch38_chain_path: str,\n        min_r2: float,\n    ) -> DataFrame:\n        \"\"\"Create LDIndex for a specific population.\n\n        Args:\n            population_id (str): Population ID\n            ld_matrix_path (str): Path to the LD matrix\n            ld_index_raw_path (str): Path to the LD index\n            grch37_to_grch38_chain_path (str): Path to the chain file used to lift over the coordinates\n            min_r2 (float): Minimum r2 value to keep in the table\n\n        Returns:\n            DataFrame: LDIndex for a specific population\n        \"\"\"\n        # Prepare LD Block matrix\n        ld_matrix = GnomADLDMatrix._convert_ld_matrix_to_table(\n            BlockMatrix.read(ld_matrix_path), min_r2\n        )\n\n        # Prepare table with variant indices\n        ld_index = GnomADLDMatrix._process_variant_indices(\n            hl.read_table(ld_index_raw_path),\n            grch37_to_grch38_chain_path,\n        )\n\n        return GnomADLDMatrix._resolve_variant_indices(ld_index, ld_matrix).select(\n            \"*\",\n            f.lit(population_id).alias(\"population\"),\n        )\n\n    @staticmethod\n    def _process_variant_indices(\n        ld_index_raw: hl.Table, grch37_to_grch38_chain_path: str\n    ) -> DataFrame:\n        \"\"\"Creates a look up table between variants and their coordinates in the LD Matrix.\n\n        !!! info \"Gnomad's LD Matrix and Index are based on GRCh37 coordinates. This function will lift over the coordinates to GRCh38 to build the lookup table.\"\n\n        Args:\n            ld_index_raw (hl.Table): LD index table from GnomAD\n            grch37_to_grch38_chain_path (str): Path to the chain file used to lift over the coordinates\n\n        Returns:\n            DataFrame: Look up table between variants in build hg38 and their coordinates in the LD Matrix\n        \"\"\"\n        ld_index_38 = _liftover_loci(\n            ld_index_raw, grch37_to_grch38_chain_path, \"GRCh38\"\n        )\n\n        return (\n            ld_index_38.to_spark()\n            # Filter out variants where the liftover failed\n            .filter(f.col(\"`locus_GRCh38.position`\").isNotNull())\n            .withColumn(\n                \"chromosome\", f.regexp_replace(\"`locus_GRCh38.contig`\", \"chr\", \"\")\n            )\n            .withColumn(\n                \"position\",\n                convert_gnomad_position_to_ensembl(\n                    f.col(\"`locus_GRCh38.position`\"),\n                    f.col(\"`alleles`\").getItem(0),\n                    f.col(\"`alleles`\").getItem(1),\n                ),\n            )\n            .select(\n                \"chromosome\",\n                f.concat_ws(\n                    \"_\",\n                    f.col(\"chromosome\"),\n                    f.col(\"position\"),\n                    f.col(\"`alleles`\").getItem(0),\n                    f.col(\"`alleles`\").getItem(1),\n                ).alias(\"variantId\"),\n                f.col(\"idx\"),\n            )\n            # Filter out ambiguous liftover results: multiple indices for the same variant\n            .withColumn(\"count\", f.count(\"*\").over(Window.partitionBy([\"variantId\"])))\n            .filter(f.col(\"count\") == 1)\n            .drop(\"count\")\n        )\n\n    @staticmethod\n    def _resolve_variant_indices(\n    
    ld_index: DataFrame, ld_matrix: DataFrame\n    ) -> DataFrame:\n        \"\"\"Resolve the `i` and `j` indices of the block matrix to variant IDs (build 38).\n\n        Args:\n            ld_index (DataFrame): Dataframe with resolved variant indices\n            ld_matrix (DataFrame): Dataframe with the filtered LD matrix\n\n        Returns:\n            DataFrame: Dataframe with variant IDs instead of `i` and `j` indices\n        \"\"\"\n        ld_index_i = ld_index.selectExpr(\n            \"idx as i\", \"variantId as variantId_i\", \"chromosome\"\n        )\n        ld_index_j = ld_index.selectExpr(\"idx as j\", \"variantId as variantId_j\")\n        return (\n            ld_matrix.join(ld_index_i, on=\"i\", how=\"inner\")\n            .join(ld_index_j, on=\"j\", how=\"inner\")\n            .drop(\"i\", \"j\")\n        )\n\n    @staticmethod\n    def _transpose_ld_matrix(ld_matrix: DataFrame) -> DataFrame:\n        \"\"\"Transpose LD matrix to a square matrix format.\n\n        Args:\n            ld_matrix (DataFrame): Triangular LD matrix converted to a Spark DataFrame\n\n        Returns:\n            DataFrame: Square LD matrix without diagonal duplicates\n\n        Examples:\n            >>> df = spark.createDataFrame(\n            ...     [\n            ...         (1, 1, 1.0, \"1\", \"AFR\"),\n            ...         (1, 2, 0.5, \"1\", \"AFR\"),\n            ...         (2, 2, 1.0, \"1\", \"AFR\"),\n            ...     ],\n            ...     [\"variantId_i\", \"variantId_j\", \"r\", \"chromosome\", \"population\"],\n            ... )\n            >>> GnomADLDMatrix._transpose_ld_matrix(df).show()\n            +-----------+-----------+---+----------+----------+\n            |variantId_i|variantId_j|  r|chromosome|population|\n            +-----------+-----------+---+----------+----------+\n            |          1|          2|0.5|         1|       AFR|\n            |          1|          1|1.0|         1|       AFR|\n            |          2|          1|0.5|         1|       AFR|\n            |          2|          2|1.0|         1|       AFR|\n            +-----------+-----------+---+----------+----------+\n            <BLANKLINE>\n        \"\"\"\n        ld_matrix_transposed = ld_matrix.selectExpr(\n            \"variantId_i as variantId_j\",\n            \"variantId_j as variantId_i\",\n            \"r\",\n            \"chromosome\",\n            \"population\",\n        )\n        return ld_matrix.filter(\n            f.col(\"variantId_i\") != f.col(\"variantId_j\")\n        ).unionByName(ld_matrix_transposed)\n\n    @classmethod\n    def as_ld_index(\n        cls: type[GnomADLDMatrix],\n        ld_populations: list[str],\n        ld_matrix_template: str,\n        ld_index_raw_template: str,\n        grch37_to_grch38_chain_path: str,\n        min_r2: float,\n    ) -> LDIndex:\n        \"\"\"Create LDIndex dataset aggregating the LD information across a set of populations.\n\n        Args:\n            ld_populations (list[str]): List of populations to aggregate\n            ld_matrix_template (str): Template path to the LD matrix\n            ld_index_raw_template (str): Template path to the LD variants index\n            grch37_to_grch38_chain_path (str): Path to the chain file used to lift over the coordinates\n            min_r2 (float): Minimum r2 value to keep in the table\n\n        Returns:\n            LDIndex: LDIndex dataset\n        \"\"\"\n        ld_indices_unaggregated = []\n        for pop in ld_populations:\n            try:\n                ld_matrix_path = 
ld_matrix_template.format(POP=pop)\n                ld_index_raw_path = ld_index_raw_template.format(POP=pop)\n                pop_ld_index = cls._create_ldindex_for_population(\n                    pop,\n                    ld_matrix_path,\n                    ld_index_raw_path.format(pop),\n                    grch37_to_grch38_chain_path,\n                    min_r2,\n                )\n                ld_indices_unaggregated.append(pop_ld_index)\n            except Exception as e:\n                print(f\"Failed to create LDIndex for population {pop}: {e}\")\n                sys.exit(1)\n\n        ld_index_unaggregated = (\n            GnomADLDMatrix._transpose_ld_matrix(\n                reduce(lambda df1, df2: df1.unionByName(df2), ld_indices_unaggregated)\n            )\n            .withColumnRenamed(\"variantId_i\", \"variantId\")\n            .withColumnRenamed(\"variantId_j\", \"tagVariantId\")\n        )\n        return LDIndex(\n            _df=cls._aggregate_ld_index_across_populations(ld_index_unaggregated),\n            _schema=LDIndex.get_schema(),\n        )\n
"},{"location":"python_api/datasource/gnomad/gnomad_ld/#otg.datasource.gnomad.ld.GnomADLDMatrix.as_ld_index","title":"as_ld_index(ld_populations: list[str], ld_matrix_template: str, ld_index_raw_template: str, grch37_to_grch38_chain_path: str, min_r2: float) -> LDIndex classmethod","text":"

Create LDIndex dataset aggregating the LD information across a set of populations.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| ld_populations | list[str] | List of populations to aggregate | required |
| ld_matrix_template | str | Template path to the LD matrix | required |
| ld_index_raw_template | str | Template path to the LD variants index | required |
| grch37_to_grch38_chain_path | str | Path to the chain file used to lift over the coordinates | required |
| min_r2 | float | Minimum r2 value to keep in the table | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| LDIndex | LDIndex | LDIndex dataset |

Source code in src/otg/datasource/gnomad/ld.py
@classmethod\ndef as_ld_index(\n    cls: type[GnomADLDMatrix],\n    ld_populations: list[str],\n    ld_matrix_template: str,\n    ld_index_raw_template: str,\n    grch37_to_grch38_chain_path: str,\n    min_r2: float,\n) -> LDIndex:\n    \"\"\"Create LDIndex dataset aggregating the LD information across a set of populations.\n\n    Args:\n        ld_populations (list[str]): List of populations to aggregate\n        ld_matrix_template (str): Template path to the LD matrix\n        ld_index_raw_template (str): Template path to the LD variants index\n        grch37_to_grch38_chain_path (str): Path to the chain file used to lift over the coordinates\n        min_r2 (float): Minimum r2 value to keep in the table\n\n    Returns:\n        LDIndex: LDIndex dataset\n    \"\"\"\n    ld_indices_unaggregated = []\n    for pop in ld_populations:\n        try:\n            ld_matrix_path = ld_matrix_template.format(POP=pop)\n            ld_index_raw_path = ld_index_raw_template.format(POP=pop)\n            pop_ld_index = cls._create_ldindex_for_population(\n                pop,\n                ld_matrix_path,\n                ld_index_raw_path.format(pop),\n                grch37_to_grch38_chain_path,\n                min_r2,\n            )\n            ld_indices_unaggregated.append(pop_ld_index)\n        except Exception as e:\n            print(f\"Failed to create LDIndex for population {pop}: {e}\")\n            sys.exit(1)\n\n    ld_index_unaggregated = (\n        GnomADLDMatrix._transpose_ld_matrix(\n            reduce(lambda df1, df2: df1.unionByName(df2), ld_indices_unaggregated)\n        )\n        .withColumnRenamed(\"variantId_i\", \"variantId\")\n        .withColumnRenamed(\"variantId_j\", \"tagVariantId\")\n    )\n    return LDIndex(\n        _df=cls._aggregate_ld_index_across_populations(ld_index_unaggregated),\n        _schema=LDIndex.get_schema(),\n    )\n
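For orientation, the parameters above map onto a call like the following. This is a minimal, hypothetical sketch: the population codes, template URLs and `min_r2` value are illustrative placeholders, a running Hail/Spark session is assumed, and the dataset is assumed to expose its Spark DataFrame via the `.df` property, as `VariantAnnotation` does elsewhere in this documentation.

```python
from otg.datasource.gnomad.ld import GnomADLDMatrix

# Placeholder template paths; the real locations are supplied by the step configuration.
ld_index = GnomADLDMatrix.as_ld_index(
    ld_populations=["afr", "nfe"],  # illustrative subset of gnomAD LD populations
    ld_matrix_template="gs://my-bucket/gnomad_ld/{POP}.ld.bm",
    ld_index_raw_template="gs://my-bucket/gnomad_ld/{POP}.ld.variant_indices.ht",
    grch37_to_grch38_chain_path="gs://my-bucket/chains/grch37_to_grch38.over.chain.gz",
    min_r2=0.5,
)

# Inspect the aggregated LD table (assumes the Dataset wrapper exposes `.df`).
ld_index.df.show(5)
```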
"},{"location":"python_api/datasource/gnomad/gnomad_variants/","title":"Variants","text":""},{"location":"python_api/datasource/gnomad/gnomad_variants/#otg.datasource.gnomad.variants.GnomADVariants","title":"otg.datasource.gnomad.variants.GnomADVariants","text":"

GnomAD variants included in the GnomAD genomes dataset.

Source code in src/otg/datasource/gnomad/variants.py
class GnomADVariants:\n    \"\"\"GnomAD variants included in the GnomAD genomes dataset.\"\"\"\n\n    @staticmethod\n    def _convert_gnomad_position_to_ensembl_hail(\n        position: Int32Expression,\n        reference: StringExpression,\n        alternate: StringExpression,\n    ) -> Int32Expression:\n        \"\"\"Convert GnomAD variant position to Ensembl variant position in hail table.\n\n        For indels (the reference or alternate allele is longer than 1), then adding 1 to the position, for SNPs, the position is unchanged.\n        More info about the problem: https://www.biostars.org/p/84686/\n\n        Args:\n            position (Int32Expression): Position of the variant in the GnomAD genome.\n            reference (StringExpression): The reference allele.\n            alternate (StringExpression): The alternate allele\n\n        Returns:\n            Int32Expression: The position of the variant according to Ensembl genome.\n        \"\"\"\n        return hl.if_else(\n            (reference.length() > 1) | (alternate.length() > 1), position + 1, position\n        )\n\n    @classmethod\n    def as_variant_annotation(\n        cls: type[GnomADVariants],\n        gnomad_file: str,\n        grch38_to_grch37_chain: str,\n        populations: list,\n    ) -> VariantAnnotation:\n        \"\"\"Generate variant annotation dataset from gnomAD.\n\n        Some relevant modifications to the original dataset are:\n\n        1. The transcript consequences features provided by VEP are filtered to only refer to the Ensembl canonical transcript.\n        2. Genome coordinates are liftovered from GRCh38 to GRCh37 to keep as annotation.\n        3. Field names are converted to camel case to follow the convention.\n\n        Args:\n            gnomad_file (str): Path to `gnomad.genomes.vX.X.X.sites.ht` gnomAD dataset\n            grch38_to_grch37_chain (str): Path to chain file for liftover\n            populations (list): List of populations to include in the dataset\n\n        Returns:\n            VariantAnnotation: Variant annotation dataset\n        \"\"\"\n        # Load variants dataset\n        ht = hl.read_table(\n            gnomad_file,\n            _load_refs=False,\n        )\n\n        # Liftover\n        grch37 = hl.get_reference(\"GRCh37\")\n        grch38 = hl.get_reference(\"GRCh38\")\n        grch38.add_liftover(grch38_to_grch37_chain, grch37)\n\n        # Drop non biallelic variants\n        ht = ht.filter(ht.alleles.length() == 2)\n        # Liftover\n        ht = ht.annotate(locus_GRCh37=hl.liftover(ht.locus, \"GRCh37\"))\n        # Select relevant fields and nested records to create class\n        return VariantAnnotation(\n            _df=(\n                ht.select(\n                    gnomad3VariantId=hl.str(\"-\").join(\n                        [\n                            ht.locus.contig.replace(\"chr\", \"\"),\n                            hl.str(ht.locus.position),\n                            ht.alleles[0],\n                            ht.alleles[1],\n                        ]\n                    ),\n                    chromosome=ht.locus.contig.replace(\"chr\", \"\"),\n                    position=GnomADVariants._convert_gnomad_position_to_ensembl_hail(\n                        ht.locus.position, ht.alleles[0], ht.alleles[1]\n                    ),\n                    variantId=hl.str(\"_\").join(\n                        [\n                            ht.locus.contig.replace(\"chr\", \"\"),\n                            hl.str(\n                              
  GnomADVariants._convert_gnomad_position_to_ensembl_hail(\n                                    ht.locus.position, ht.alleles[0], ht.alleles[1]\n                                )\n                            ),\n                            ht.alleles[0],\n                            ht.alleles[1],\n                        ]\n                    ),\n                    chromosomeB37=ht.locus_GRCh37.contig.replace(\"chr\", \"\"),\n                    positionB37=ht.locus_GRCh37.position,\n                    referenceAllele=ht.alleles[0],\n                    alternateAllele=ht.alleles[1],\n                    rsIds=ht.rsid,\n                    alleleType=ht.allele_info.allele_type,\n                    cadd=hl.struct(\n                        phred=ht.cadd.phred,\n                        raw=ht.cadd.raw_score,\n                    ),\n                    alleleFrequencies=hl.set([f\"{pop}-adj\" for pop in populations]).map(\n                        lambda p: hl.struct(\n                            populationName=p,\n                            alleleFrequency=ht.freq[ht.globals.freq_index_dict[p]].AF,\n                        )\n                    ),\n                    vep=hl.struct(\n                        mostSevereConsequence=ht.vep.most_severe_consequence,\n                        transcriptConsequences=hl.map(\n                            lambda x: hl.struct(\n                                aminoAcids=x.amino_acids,\n                                consequenceTerms=x.consequence_terms,\n                                geneId=x.gene_id,\n                                lof=x.lof,\n                                polyphenScore=x.polyphen_score,\n                                polyphenPrediction=x.polyphen_prediction,\n                                siftScore=x.sift_score,\n                                siftPrediction=x.sift_prediction,\n                            ),\n                            # Only keeping canonical transcripts\n                            ht.vep.transcript_consequences.filter(\n                                lambda x: (x.canonical == 1)\n                                & (x.gene_symbol_source == \"HGNC\")\n                            ),\n                        ),\n                    ),\n                )\n                .key_by(\"chromosome\", \"position\")\n                .drop(\"locus\", \"alleles\")\n                .select_globals()\n                .to_spark(flatten=False)\n            ),\n            _schema=VariantAnnotation.get_schema(),\n        )\n
"},{"location":"python_api/datasource/gnomad/gnomad_variants/#otg.datasource.gnomad.variants.GnomADVariants.as_variant_annotation","title":"as_variant_annotation(gnomad_file: str, grch38_to_grch37_chain: str, populations: list) -> VariantAnnotation classmethod","text":"

Generate variant annotation dataset from gnomAD.

Some relevant modifications to the original dataset are:

  1. The transcript consequence features provided by VEP are filtered to only refer to the Ensembl canonical transcript.
  2. Genome coordinates are lifted over from GRCh38 to GRCh37 and kept as additional annotation.
  3. Field names are converted to camel case to follow the project convention.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| gnomad_file | str | Path to gnomad.genomes.vX.X.X.sites.ht gnomAD dataset | required |
| grch38_to_grch37_chain | str | Path to chain file for liftover | required |
| populations | list | List of populations to include in the dataset | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| VariantAnnotation | VariantAnnotation | Variant annotation dataset |

Source code in src/otg/datasource/gnomad/variants.py
@classmethod\ndef as_variant_annotation(\n    cls: type[GnomADVariants],\n    gnomad_file: str,\n    grch38_to_grch37_chain: str,\n    populations: list,\n) -> VariantAnnotation:\n    \"\"\"Generate variant annotation dataset from gnomAD.\n\n    Some relevant modifications to the original dataset are:\n\n    1. The transcript consequences features provided by VEP are filtered to only refer to the Ensembl canonical transcript.\n    2. Genome coordinates are liftovered from GRCh38 to GRCh37 to keep as annotation.\n    3. Field names are converted to camel case to follow the convention.\n\n    Args:\n        gnomad_file (str): Path to `gnomad.genomes.vX.X.X.sites.ht` gnomAD dataset\n        grch38_to_grch37_chain (str): Path to chain file for liftover\n        populations (list): List of populations to include in the dataset\n\n    Returns:\n        VariantAnnotation: Variant annotation dataset\n    \"\"\"\n    # Load variants dataset\n    ht = hl.read_table(\n        gnomad_file,\n        _load_refs=False,\n    )\n\n    # Liftover\n    grch37 = hl.get_reference(\"GRCh37\")\n    grch38 = hl.get_reference(\"GRCh38\")\n    grch38.add_liftover(grch38_to_grch37_chain, grch37)\n\n    # Drop non biallelic variants\n    ht = ht.filter(ht.alleles.length() == 2)\n    # Liftover\n    ht = ht.annotate(locus_GRCh37=hl.liftover(ht.locus, \"GRCh37\"))\n    # Select relevant fields and nested records to create class\n    return VariantAnnotation(\n        _df=(\n            ht.select(\n                gnomad3VariantId=hl.str(\"-\").join(\n                    [\n                        ht.locus.contig.replace(\"chr\", \"\"),\n                        hl.str(ht.locus.position),\n                        ht.alleles[0],\n                        ht.alleles[1],\n                    ]\n                ),\n                chromosome=ht.locus.contig.replace(\"chr\", \"\"),\n                position=GnomADVariants._convert_gnomad_position_to_ensembl_hail(\n                    ht.locus.position, ht.alleles[0], ht.alleles[1]\n                ),\n                variantId=hl.str(\"_\").join(\n                    [\n                        ht.locus.contig.replace(\"chr\", \"\"),\n                        hl.str(\n                            GnomADVariants._convert_gnomad_position_to_ensembl_hail(\n                                ht.locus.position, ht.alleles[0], ht.alleles[1]\n                            )\n                        ),\n                        ht.alleles[0],\n                        ht.alleles[1],\n                    ]\n                ),\n                chromosomeB37=ht.locus_GRCh37.contig.replace(\"chr\", \"\"),\n                positionB37=ht.locus_GRCh37.position,\n                referenceAllele=ht.alleles[0],\n                alternateAllele=ht.alleles[1],\n                rsIds=ht.rsid,\n                alleleType=ht.allele_info.allele_type,\n                cadd=hl.struct(\n                    phred=ht.cadd.phred,\n                    raw=ht.cadd.raw_score,\n                ),\n                alleleFrequencies=hl.set([f\"{pop}-adj\" for pop in populations]).map(\n                    lambda p: hl.struct(\n                        populationName=p,\n                        alleleFrequency=ht.freq[ht.globals.freq_index_dict[p]].AF,\n                    )\n                ),\n                vep=hl.struct(\n                    mostSevereConsequence=ht.vep.most_severe_consequence,\n                    transcriptConsequences=hl.map(\n                        lambda x: hl.struct(\n                           
 aminoAcids=x.amino_acids,\n                            consequenceTerms=x.consequence_terms,\n                            geneId=x.gene_id,\n                            lof=x.lof,\n                            polyphenScore=x.polyphen_score,\n                            polyphenPrediction=x.polyphen_prediction,\n                            siftScore=x.sift_score,\n                            siftPrediction=x.sift_prediction,\n                        ),\n                        # Only keeping canonical transcripts\n                        ht.vep.transcript_consequences.filter(\n                            lambda x: (x.canonical == 1)\n                            & (x.gene_symbol_source == \"HGNC\")\n                        ),\n                    ),\n                ),\n            )\n            .key_by(\"chromosome\", \"position\")\n            .drop(\"locus\", \"alleles\")\n            .select_globals()\n            .to_spark(flatten=False)\n        ),\n        _schema=VariantAnnotation.get_schema(),\n    )\n
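As a quick illustration of how this method is invoked, here is a minimal sketch. The gnomAD sites table path, chain file and population codes are placeholders, and an initialised Hail session is assumed; the resulting dataset exposes its Spark DataFrame via `.df`, as used elsewhere in this documentation.

```python
from otg.datasource.gnomad.variants import GnomADVariants

# Placeholder paths; point these at the gnomAD sites Hail table and liftover chain you use.
variant_annotation = GnomADVariants.as_variant_annotation(
    gnomad_file="gs://my-bucket/gnomad.genomes.v3.1.2.sites.ht",
    grch38_to_grch37_chain="gs://my-bucket/chains/grch38_to_grch37.over.chain.gz",
    populations=["afr", "nfe", "eas"],  # illustrative population codes
)

# Spark DataFrame conforming to the VariantAnnotation schema.
variant_annotation.df.printSchema()
```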
"},{"location":"python_api/datasource/gwas_catalog/_gwas_catalog/","title":"GWAS Catalog","text":"GWAS Catalog"},{"location":"python_api/datasource/gwas_catalog/associations/","title":"Associations","text":""},{"location":"python_api/datasource/gwas_catalog/associations/#otg.datasource.gwas_catalog.associations.GWASCatalogAssociations","title":"otg.datasource.gwas_catalog.associations.GWASCatalogAssociations dataclass","text":"

Bases: StudyLocus

Study-locus dataset derived from GWAS Catalog.
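The dataset is typically built with the `from_source` class method shown in the source below. A minimal, hypothetical sketch follows: the raw associations path and the tab-separated read options are placeholders, `spark` is an active SparkSession (as in the doctest examples in this module), and `variant_annotation` is a previously built `VariantAnnotation` dataset.

```python
from otg.datasource.gwas_catalog.associations import GWASCatalogAssociations

# Placeholder path to the GWAS Catalog associations download (tab-separated).
raw_associations = spark.read.csv(
    "gs://my-bucket/gwas_catalog/associations.tsv", sep="\t", header=True
)

# Map associations to variant annotation and apply the QC flags described below.
study_locus = GWASCatalogAssociations.from_source(
    gwas_associations=raw_associations,
    variant_annotation=variant_annotation,
    pvalue_threshold=5e-8,  # default significance cut-off used for QC flagging
)
```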

Source code in src/otg/datasource/gwas_catalog/associations.py
@dataclass\nclass GWASCatalogAssociations(StudyLocus):\n    \"\"\"Study-locus dataset derived from GWAS Catalog.\"\"\"\n\n    @staticmethod\n    def _parse_pvalue(pvalue: Column) -> tuple[Column, Column]:\n        \"\"\"Parse p-value column.\n\n        Args:\n            pvalue (Column): p-value [string]\n\n        Returns:\n            tuple[Column, Column]: p-value mantissa and exponent\n\n        Example:\n            >>> import pyspark.sql.types as t\n            >>> d = [(\"1.0\"), (\"0.5\"), (\"1E-20\"), (\"3E-3\"), (\"1E-1000\")]\n            >>> df = spark.createDataFrame(d, t.StringType())\n            >>> df.select('value',*GWASCatalogAssociations._parse_pvalue(f.col('value'))).show()\n            +-------+--------------+--------------+\n            |  value|pValueMantissa|pValueExponent|\n            +-------+--------------+--------------+\n            |    1.0|           1.0|             1|\n            |    0.5|           0.5|             1|\n            |  1E-20|           1.0|           -20|\n            |   3E-3|           3.0|            -3|\n            |1E-1000|           1.0|         -1000|\n            +-------+--------------+--------------+\n            <BLANKLINE>\n\n        \"\"\"\n        split = f.split(pvalue, \"E\")\n        return split.getItem(0).cast(\"float\").alias(\"pValueMantissa\"), f.coalesce(\n            split.getItem(1).cast(\"integer\"), f.lit(1)\n        ).alias(\"pValueExponent\")\n\n    @staticmethod\n    def _normalise_pvaluetext(p_value_text: Column) -> Column:\n        \"\"\"Normalised p-value text column to a standardised format.\n\n        For cases where there is no mapping, the value is set to null.\n\n        Args:\n            p_value_text (Column): `pValueText` column from GWASCatalog\n\n        Returns:\n            Column: Array column after using GWAS Catalog mappings. 
There might be multiple mappings for a single p-value text.\n\n        Example:\n            >>> import pyspark.sql.types as t\n            >>> d = [(\"European Ancestry\"), (\"African ancestry\"), (\"Alzheimer\u2019s Disease\"), (\"(progression)\"), (\"\"), (None)]\n            >>> df = spark.createDataFrame(d, t.StringType())\n            >>> df.withColumn('normalised', GWASCatalogAssociations._normalise_pvaluetext(f.col('value'))).show()\n            +-------------------+----------+\n            |              value|normalised|\n            +-------------------+----------+\n            |  European Ancestry|      [EA]|\n            |   African ancestry|      [AA]|\n            |Alzheimer\u2019s Disease|      [AD]|\n            |      (progression)|      null|\n            |                   |      null|\n            |               null|      null|\n            +-------------------+----------+\n            <BLANKLINE>\n\n        \"\"\"\n        # GWAS Catalog to p-value mapping\n        json_dict = json.loads(\n            pkg_resources.read_text(data, \"gwas_pValueText_map.json\", encoding=\"utf-8\")\n        )\n        map_expr = f.create_map(*[f.lit(x) for x in chain(*json_dict.items())])\n\n        splitted_col = f.split(f.regexp_replace(p_value_text, r\"[\\(\\)]\", \"\"), \",\")\n        mapped_col = f.transform(splitted_col, lambda x: map_expr[x])\n        return f.when(f.forall(mapped_col, lambda x: x.isNull()), None).otherwise(\n            mapped_col\n        )\n\n    @staticmethod\n    def _normalise_risk_allele(risk_allele: Column) -> Column:\n        \"\"\"Normalised risk allele column to a standardised format.\n\n        If multiple risk alleles are present, the first one is returned.\n\n        Args:\n            risk_allele (Column): `riskAllele` column from GWASCatalog\n\n        Returns:\n            Column: mapped using GWAS Catalog mapping\n\n        Example:\n            >>> import pyspark.sql.types as t\n            >>> d = [(\"rs1234-A-G\"), (\"rs1234-A\"), (\"rs1234-A; rs1235-G\")]\n            >>> df = spark.createDataFrame(d, t.StringType())\n            >>> df.withColumn('normalised', GWASCatalogAssociations._normalise_risk_allele(f.col('value'))).show()\n            +------------------+----------+\n            |             value|normalised|\n            +------------------+----------+\n            |        rs1234-A-G|         A|\n            |          rs1234-A|         A|\n            |rs1234-A; rs1235-G|         A|\n            +------------------+----------+\n            <BLANKLINE>\n\n        \"\"\"\n        # GWAS Catalog to risk allele mapping\n        return f.split(f.split(risk_allele, \"; \").getItem(0), \"-\").getItem(1)\n\n    @staticmethod\n    def _collect_rsids(\n        snp_id: Column, snp_id_current: Column, risk_allele: Column\n    ) -> Column:\n        \"\"\"It takes three columns, and returns an array of distinct values from those columns.\n\n        Args:\n            snp_id (Column): The original snp id from the GWAS catalog.\n            snp_id_current (Column): The current snp id field is just a number at the moment (stored as a string). Adding 'rs' prefix if looks good.\n            risk_allele (Column): The risk allele for the SNP.\n\n        Returns:\n            Column: An array of distinct values.\n        \"\"\"\n        # The current snp id field is just a number at the moment (stored as a string). 
Adding 'rs' prefix if looks good.\n        snp_id_current = f.when(\n            snp_id_current.rlike(\"^[0-9]*$\"),\n            f.format_string(\"rs%s\", snp_id_current),\n        )\n        # Cleaning risk allele:\n        risk_allele = f.split(risk_allele, \"-\").getItem(0)\n\n        # Collecting all values:\n        return f.array_distinct(f.array(snp_id, snp_id_current, risk_allele))\n\n    @staticmethod\n    def _map_to_variant_annotation_variants(\n        gwas_associations: DataFrame, variant_annotation: VariantAnnotation\n    ) -> DataFrame:\n        \"\"\"Add variant metadata in associations.\n\n        Args:\n            gwas_associations (DataFrame): raw GWAS Catalog associations\n            variant_annotation (VariantAnnotation): variant annotation dataset\n\n        Returns:\n            DataFrame: GWAS Catalog associations data including `variantId`, `referenceAllele`,\n            `alternateAllele`, `chromosome`, `position` with variant metadata\n        \"\"\"\n        # Subset of GWAS Catalog associations required for resolving variant IDs:\n        gwas_associations_subset = gwas_associations.select(\n            \"studyLocusId\",\n            f.col(\"CHR_ID\").alias(\"chromosome\"),\n            f.col(\"CHR_POS\").cast(IntegerType()).alias(\"position\"),\n            # List of all SNPs associated with the variant\n            GWASCatalogAssociations._collect_rsids(\n                f.split(f.col(\"SNPS\"), \"; \").getItem(0),\n                f.col(\"SNP_ID_CURRENT\"),\n                f.split(f.col(\"STRONGEST SNP-RISK ALLELE\"), \"; \").getItem(0),\n            ).alias(\"rsIdsGwasCatalog\"),\n            GWASCatalogAssociations._normalise_risk_allele(\n                f.col(\"STRONGEST SNP-RISK ALLELE\")\n            ).alias(\"riskAllele\"),\n        )\n\n        # Subset of variant annotation required for GWAS Catalog annotations:\n        va_subset = variant_annotation.df.select(\n            \"variantId\",\n            \"chromosome\",\n            \"position\",\n            f.col(\"rsIds\").alias(\"rsIdsGnomad\"),\n            \"referenceAllele\",\n            \"alternateAllele\",\n            \"alleleFrequencies\",\n            variant_annotation.max_maf().alias(\"maxMaf\"),\n        ).join(\n            f.broadcast(\n                gwas_associations_subset.select(\"chromosome\", \"position\").distinct()\n            ),\n            on=[\"chromosome\", \"position\"],\n            how=\"inner\",\n        )\n\n        # Semi-resolved ids (still contains duplicates when conclusion was not possible to make\n        # based on rsIds or allele concordance)\n        filtered_associations = (\n            gwas_associations_subset.join(\n                f.broadcast(va_subset),\n                on=[\"chromosome\", \"position\"],\n                how=\"left\",\n            )\n            .withColumn(\n                \"rsIdFilter\",\n                GWASCatalogAssociations._flag_mappings_to_retain(\n                    f.col(\"studyLocusId\"),\n                    GWASCatalogAssociations._compare_rsids(\n                        f.col(\"rsIdsGnomad\"), f.col(\"rsIdsGwasCatalog\")\n                    ),\n                ),\n            )\n            .withColumn(\n                \"concordanceFilter\",\n                GWASCatalogAssociations._flag_mappings_to_retain(\n                    f.col(\"studyLocusId\"),\n                    GWASCatalogAssociations._check_concordance(\n                        f.col(\"riskAllele\"),\n                        
f.col(\"referenceAllele\"),\n                        f.col(\"alternateAllele\"),\n                    ),\n                ),\n            )\n            .filter(\n                # Filter out rows where GWAS Catalog rsId does not match with GnomAD rsId,\n                # but there is corresponding variant for the same association\n                f.col(\"rsIdFilter\")\n                # or filter out rows where GWAS Catalog alleles are not concordant with GnomAD alleles,\n                # but there is corresponding variant for the same association\n                | f.col(\"concordanceFilter\")\n            )\n        )\n\n        # Keep only highest maxMaf variant per studyLocusId\n        fully_mapped_associations = get_record_with_maximum_value(\n            filtered_associations, grouping_col=\"studyLocusId\", sorting_col=\"maxMaf\"\n        ).select(\n            \"studyLocusId\",\n            \"variantId\",\n            \"referenceAllele\",\n            \"alternateAllele\",\n            \"chromosome\",\n            \"position\",\n        )\n\n        return gwas_associations.join(\n            fully_mapped_associations, on=\"studyLocusId\", how=\"left\"\n        )\n\n    @staticmethod\n    def _compare_rsids(gnomad: Column, gwas: Column) -> Column:\n        \"\"\"If the intersection of the two arrays is greater than 0, return True, otherwise return False.\n\n        Args:\n            gnomad (Column): rsids from gnomad\n            gwas (Column): rsids from the GWAS Catalog\n\n        Returns:\n            Column: A boolean column that is true if the GnomAD rsIDs can be found in the GWAS rsIDs.\n\n        Examples:\n            >>> d = [\n            ...    (1, [\"rs123\", \"rs523\"], [\"rs123\"]),\n            ...    (2, [], [\"rs123\"]),\n            ...    (3, [\"rs123\", \"rs523\"], []),\n            ...    (4, [], []),\n            ... ]\n            >>> df = spark.createDataFrame(d, ['associationId', 'gnomad', 'gwas'])\n            >>> df.withColumn(\"rsid_matches\", GWASCatalogAssociations._compare_rsids(f.col(\"gnomad\"),f.col('gwas'))).show()\n            +-------------+--------------+-------+------------+\n            |associationId|        gnomad|   gwas|rsid_matches|\n            +-------------+--------------+-------+------------+\n            |            1|[rs123, rs523]|[rs123]|        true|\n            |            2|            []|[rs123]|       false|\n            |            3|[rs123, rs523]|     []|       false|\n            |            4|            []|     []|       false|\n            +-------------+--------------+-------+------------+\n            <BLANKLINE>\n\n        \"\"\"\n        return f.when(f.size(f.array_intersect(gnomad, gwas)) > 0, True).otherwise(\n            False\n        )\n\n    @staticmethod\n    def _flag_mappings_to_retain(\n        association_id: Column, filter_column: Column\n    ) -> Column:\n        \"\"\"Flagging mappings to drop for each association.\n\n        Some associations have multiple mappings. Some has matching rsId others don't. We only\n        want to drop the non-matching mappings, when a matching is available for the given association.\n        This logic can be generalised for other measures eg. allele concordance.\n\n        Args:\n            association_id (Column): association identifier column\n            filter_column (Column): boolean col indicating to keep a mapping\n\n        Returns:\n            Column: A column with a boolean value.\n\n        Examples:\n        >>> d = [\n        ...    
(1, False),\n        ...    (1, False),\n        ...    (2, False),\n        ...    (2, True),\n        ...    (3, True),\n        ...    (3, True),\n        ... ]\n        >>> df = spark.createDataFrame(d, ['associationId', 'filter'])\n        >>> df.withColumn(\"isConcordant\", GWASCatalogAssociations._flag_mappings_to_retain(f.col(\"associationId\"),f.col('filter'))).show()\n        +-------------+------+------------+\n        |associationId|filter|isConcordant|\n        +-------------+------+------------+\n        |            1| false|        true|\n        |            1| false|        true|\n        |            2| false|       false|\n        |            2|  true|        true|\n        |            3|  true|        true|\n        |            3|  true|        true|\n        +-------------+------+------------+\n        <BLANKLINE>\n\n        \"\"\"\n        w = Window.partitionBy(association_id)\n\n        # Generating a boolean column informing if the filter column contains true anywhere for the association:\n        aggregated_filter = f.when(\n            f.array_contains(f.collect_set(filter_column).over(w), True), True\n        ).otherwise(False)\n\n        # Generate a filter column:\n        return f.when(aggregated_filter & (~filter_column), False).otherwise(True)\n\n    @staticmethod\n    def _check_concordance(\n        risk_allele: Column, reference_allele: Column, alternate_allele: Column\n    ) -> Column:\n        \"\"\"A function to check if the risk allele is concordant with the alt or ref allele.\n\n        If the risk allele is the same as the reference or alternate allele, or if the reverse complement of\n        the risk allele is the same as the reference or alternate allele, then the allele is concordant.\n        If no mapping is available (ref/alt is null), the function returns True.\n\n        Args:\n            risk_allele (Column): The allele that is associated with the risk of the disease.\n            reference_allele (Column): The reference allele from the GWAS catalog\n            alternate_allele (Column): The alternate allele of the variant.\n\n        Returns:\n            Column: A boolean column that is True if the risk allele is the same as the reference or alternate allele,\n            or if the reverse complement of the risk allele is the same as the reference or alternate allele.\n\n        Examples:\n            >>> d = [\n            ...     ('A', 'A', 'G'),\n            ...     ('A', 'T', 'G'),\n            ...     ('A', 'C', 'G'),\n            ...     ('A', 'A', '?'),\n            ...     (None, None, 'A'),\n            ... 
]\n            >>> df = spark.createDataFrame(d, ['riskAllele', 'referenceAllele', 'alternateAllele'])\n            >>> df.withColumn(\"isConcordant\", GWASCatalogAssociations._check_concordance(f.col(\"riskAllele\"),f.col('referenceAllele'), f.col('alternateAllele'))).show()\n            +----------+---------------+---------------+------------+\n            |riskAllele|referenceAllele|alternateAllele|isConcordant|\n            +----------+---------------+---------------+------------+\n            |         A|              A|              G|        true|\n            |         A|              T|              G|        true|\n            |         A|              C|              G|       false|\n            |         A|              A|              ?|        true|\n            |      null|           null|              A|        true|\n            +----------+---------------+---------------+------------+\n            <BLANKLINE>\n\n        \"\"\"\n        # Calculating the reverse complement of the risk allele:\n        risk_allele_reverse_complement = f.when(\n            risk_allele.rlike(r\"^[ACTG]+$\"),\n            f.reverse(f.translate(risk_allele, \"ACTG\", \"TGAC\")),\n        ).otherwise(risk_allele)\n\n        # OK, is the risk allele or the reverse complent is the same as the mapped alleles:\n        return (\n            f.when(\n                (risk_allele == reference_allele) | (risk_allele == alternate_allele),\n                True,\n            )\n            # If risk allele is found on the negative strand:\n            .when(\n                (risk_allele_reverse_complement == reference_allele)\n                | (risk_allele_reverse_complement == alternate_allele),\n                True,\n            )\n            # If risk allele is ambiguous, still accepted: < This condition could be reconsidered\n            .when(risk_allele == \"?\", True)\n            # If the association could not be mapped we keep it:\n            .when(reference_allele.isNull(), True)\n            # Allele is discordant:\n            .otherwise(False)\n        )\n\n    @staticmethod\n    def _get_reverse_complement(allele_col: Column) -> Column:\n        \"\"\"A function to return the reverse complement of an allele column.\n\n        It takes a string and returns the reverse complement of that string if it's a DNA sequence,\n        otherwise it returns the original string. 
Assumes alleles in upper case.\n\n        Args:\n            allele_col (Column): The column containing the allele to reverse complement.\n\n        Returns:\n            Column: A column that is the reverse complement of the allele column.\n\n        Examples:\n            >>> d = [{\"allele\": 'A'}, {\"allele\": 'T'},{\"allele\": 'G'}, {\"allele\": 'C'},{\"allele\": 'AC'}, {\"allele\": 'GTaatc'},{\"allele\": '?'}, {\"allele\": None}]\n            >>> df = spark.createDataFrame(d)\n            >>> df.withColumn(\"revcom_allele\", GWASCatalogAssociations._get_reverse_complement(f.col(\"allele\"))).show()\n            +------+-------------+\n            |allele|revcom_allele|\n            +------+-------------+\n            |     A|            T|\n            |     T|            A|\n            |     G|            C|\n            |     C|            G|\n            |    AC|           GT|\n            |GTaatc|       GATTAC|\n            |     ?|            ?|\n            |  null|         null|\n            +------+-------------+\n            <BLANKLINE>\n\n        \"\"\"\n        allele_col = f.upper(allele_col)\n        return f.when(\n            allele_col.rlike(\"[ACTG]+\"),\n            f.reverse(f.translate(allele_col, \"ACTG\", \"TGAC\")),\n        ).otherwise(allele_col)\n\n    @staticmethod\n    def _effect_needs_harmonisation(\n        risk_allele: Column, reference_allele: Column\n    ) -> Column:\n        \"\"\"A function to check if the effect allele needs to be harmonised.\n\n        Args:\n            risk_allele (Column): Risk allele column\n            reference_allele (Column): Effect allele column\n\n        Returns:\n            Column: A boolean column indicating if the effect allele needs to be harmonised.\n\n        Examples:\n            >>> d = [{\"risk\": 'A', \"reference\": 'A'}, {\"risk\": 'A', \"reference\": 'T'}, {\"risk\": 'AT', \"reference\": 'TA'}, {\"risk\": 'AT', \"reference\": 'AT'}]\n            >>> df = spark.createDataFrame(d)\n            >>> df.withColumn(\"needs_harmonisation\", GWASCatalogAssociations._effect_needs_harmonisation(f.col(\"risk\"), f.col(\"reference\"))).show()\n            +---------+----+-------------------+\n            |reference|risk|needs_harmonisation|\n            +---------+----+-------------------+\n            |        A|   A|               true|\n            |        T|   A|               true|\n            |       TA|  AT|              false|\n            |       AT|  AT|               true|\n            +---------+----+-------------------+\n            <BLANKLINE>\n\n        \"\"\"\n        return (risk_allele == reference_allele) | (\n            risk_allele\n            == GWASCatalogAssociations._get_reverse_complement(reference_allele)\n        )\n\n    @staticmethod\n    def _are_alleles_palindromic(\n        reference_allele: Column, alternate_allele: Column\n    ) -> Column:\n        \"\"\"A function to check if the alleles are palindromic.\n\n        Args:\n            reference_allele (Column): Reference allele column\n            alternate_allele (Column): Alternate allele column\n\n        Returns:\n            Column: A boolean column indicating if the alleles are palindromic.\n\n        Examples:\n            >>> d = [{\"reference\": 'A', \"alternate\": 'T'}, {\"reference\": 'AT', \"alternate\": 'AG'}, {\"reference\": 'AT', \"alternate\": 'AT'}, {\"reference\": 'CATATG', \"alternate\": 'CATATG'}, {\"reference\": '-', \"alternate\": None}]\n            >>> df = spark.createDataFrame(d)\n            >>> 
df.withColumn(\"is_palindromic\", GWASCatalogAssociations._are_alleles_palindromic(f.col(\"reference\"), f.col(\"alternate\"))).show()\n            +---------+---------+--------------+\n            |alternate|reference|is_palindromic|\n            +---------+---------+--------------+\n            |        T|        A|          true|\n            |       AG|       AT|         false|\n            |       AT|       AT|          true|\n            |   CATATG|   CATATG|          true|\n            |     null|        -|         false|\n            +---------+---------+--------------+\n            <BLANKLINE>\n\n        \"\"\"\n        revcomp = GWASCatalogAssociations._get_reverse_complement(alternate_allele)\n        return (\n            f.when(reference_allele == revcomp, True)\n            .when(revcomp.isNull(), False)\n            .otherwise(False)\n        )\n\n    @staticmethod\n    def _harmonise_beta(\n        risk_allele: Column,\n        reference_allele: Column,\n        alternate_allele: Column,\n        effect_size: Column,\n        confidence_interval: Column,\n    ) -> Column:\n        \"\"\"A function to extract the beta value from the effect size and confidence interval.\n\n        If the confidence interval contains the word \"increase\" or \"decrease\" it indicates, we are dealing with betas.\n        If it's \"increase\" and the effect size needs to be harmonized, then multiply the effect size by -1\n\n        Args:\n            risk_allele (Column): Risk allele column\n            reference_allele (Column): Reference allele column\n            alternate_allele (Column): Alternate allele column\n            effect_size (Column): GWAS Catalog effect size column\n            confidence_interval (Column): GWAS Catalog confidence interval column\n\n        Returns:\n            Column: A column containing the beta value.\n        \"\"\"\n        return (\n            f.when(\n                GWASCatalogAssociations._are_alleles_palindromic(\n                    reference_allele, alternate_allele\n                ),\n                None,\n            )\n            .when(\n                (\n                    GWASCatalogAssociations._effect_needs_harmonisation(\n                        risk_allele, reference_allele\n                    )\n                    & confidence_interval.contains(\"increase\")\n                )\n                | (\n                    ~GWASCatalogAssociations._effect_needs_harmonisation(\n                        risk_allele, reference_allele\n                    )\n                    & confidence_interval.contains(\"decrease\")\n                ),\n                -effect_size,\n            )\n            .otherwise(effect_size)\n            .cast(DoubleType())\n        )\n\n    @staticmethod\n    def _harmonise_beta_ci(\n        risk_allele: Column,\n        reference_allele: Column,\n        alternate_allele: Column,\n        effect_size: Column,\n        confidence_interval: Column,\n        p_value: Column,\n        direction: str,\n    ) -> Column:\n        \"\"\"Calculating confidence intervals for beta values.\n\n        Args:\n            risk_allele (Column): Risk allele column\n            reference_allele (Column): Reference allele column\n            alternate_allele (Column): Alternate allele column\n            effect_size (Column): GWAS Catalog effect size column\n            confidence_interval (Column): GWAS Catalog confidence interval column\n            p_value (Column): GWAS Catalog p-value column\n            direction (str): 
This is the direction of the confidence interval. It can be either \"upper\" or \"lower\".\n\n        Returns:\n            Column: The upper and lower bounds of the confidence interval for the beta coefficient.\n        \"\"\"\n        zscore_95 = f.lit(1.96)\n        beta = GWASCatalogAssociations._harmonise_beta(\n            risk_allele,\n            reference_allele,\n            alternate_allele,\n            effect_size,\n            confidence_interval,\n        )\n        zscore = pvalue_to_zscore(p_value)\n        return (\n            f.when(f.lit(direction) == \"upper\", beta + f.abs(zscore_95 * beta) / zscore)\n            .when(f.lit(direction) == \"lower\", beta - f.abs(zscore_95 * beta) / zscore)\n            .otherwise(None)\n        )\n\n    @staticmethod\n    def _harmonise_odds_ratio(\n        risk_allele: Column,\n        reference_allele: Column,\n        alternate_allele: Column,\n        effect_size: Column,\n        confidence_interval: Column,\n    ) -> Column:\n        \"\"\"Harmonizing odds ratio.\n\n        Args:\n            risk_allele (Column): Risk allele column\n            reference_allele (Column): Reference allele column\n            alternate_allele (Column): Alternate allele column\n            effect_size (Column): GWAS Catalog effect size column\n            confidence_interval (Column): GWAS Catalog confidence interval column\n\n        Returns:\n            Column: A column with the odds ratio, or 1/odds_ratio if harmonization required.\n        \"\"\"\n        return (\n            f.when(\n                GWASCatalogAssociations._are_alleles_palindromic(\n                    reference_allele, alternate_allele\n                ),\n                None,\n            )\n            .when(\n                (\n                    GWASCatalogAssociations._effect_needs_harmonisation(\n                        risk_allele, reference_allele\n                    )\n                    & ~confidence_interval.rlike(\"|\".join([\"decrease\", \"increase\"]))\n                ),\n                1 / effect_size,\n            )\n            .otherwise(effect_size)\n            .cast(DoubleType())\n        )\n\n    @staticmethod\n    def _harmonise_odds_ratio_ci(\n        risk_allele: Column,\n        reference_allele: Column,\n        alternate_allele: Column,\n        effect_size: Column,\n        confidence_interval: Column,\n        p_value: Column,\n        direction: str,\n    ) -> Column:\n        \"\"\"Calculating confidence intervals for beta values.\n\n        Args:\n            risk_allele (Column): Risk allele column\n            reference_allele (Column): Reference allele column\n            alternate_allele (Column): Alternate allele column\n            effect_size (Column): GWAS Catalog effect size column\n            confidence_interval (Column): GWAS Catalog confidence interval column\n            p_value (Column): GWAS Catalog p-value column\n            direction (str): This is the direction of the confidence interval. 
It can be either \"upper\" or \"lower\".\n\n        Returns:\n            Column: The upper and lower bounds of the 95% confidence interval for the odds ratio.\n        \"\"\"\n        zscore_95 = f.lit(1.96)\n        odds_ratio = GWASCatalogAssociations._harmonise_odds_ratio(\n            risk_allele,\n            reference_allele,\n            alternate_allele,\n            effect_size,\n            confidence_interval,\n        )\n        odds_ratio_estimate = f.log(odds_ratio)\n        zscore = pvalue_to_zscore(p_value)\n        odds_ratio_se = odds_ratio_estimate / zscore\n        return f.when(\n            f.lit(direction) == \"upper\",\n            f.exp(odds_ratio_estimate + f.abs(zscore_95 * odds_ratio_se)),\n        ).when(\n            f.lit(direction) == \"lower\",\n            f.exp(odds_ratio_estimate - f.abs(zscore_95 * odds_ratio_se)),\n        )\n\n    @staticmethod\n    def _concatenate_substudy_description(\n        association_trait: Column, pvalue_text: Column, mapped_trait_uri: Column\n    ) -> Column:\n        \"\"\"Substudy description parsing. Complex string containing metadata about the substudy (e.g. QTL, specific EFO, etc.).\n\n        Args:\n            association_trait (Column): GWAS Catalog association trait column\n            pvalue_text (Column): GWAS Catalog p-value text column\n            mapped_trait_uri (Column): GWAS Catalog mapped trait URI column\n\n        Returns:\n            Column: A column with the substudy description in the shape trait|pvaluetext1_pvaluetext2|EFO1_EFO2.\n\n        Examples:\n        >>> df = spark.createDataFrame([\n        ...    (\"Height\", \"http://www.ebi.ac.uk/efo/EFO_0000001,http://www.ebi.ac.uk/efo/EFO_0000002\", \"European Ancestry\"),\n        ...    (\"Schizophrenia\", \"http://www.ebi.ac.uk/efo/MONDO_0005090\", None)],\n        ...    [\"association_trait\", \"mapped_trait_uri\", \"pvalue_text\"]\n        ... 
)\n        >>> df.withColumn('substudy_description', GWASCatalogAssociations._concatenate_substudy_description(df.association_trait, df.pvalue_text, df.mapped_trait_uri)).show(truncate=False)\n        +-----------------+-------------------------------------------------------------------------+-----------------+------------------------------------------+\n        |association_trait|mapped_trait_uri                                                         |pvalue_text      |substudy_description                      |\n        +-----------------+-------------------------------------------------------------------------+-----------------+------------------------------------------+\n        |Height           |http://www.ebi.ac.uk/efo/EFO_0000001,http://www.ebi.ac.uk/efo/EFO_0000002|European Ancestry|Height|EA|EFO_0000001/EFO_0000002         |\n        |Schizophrenia    |http://www.ebi.ac.uk/efo/MONDO_0005090                                   |null             |Schizophrenia|no_pvalue_text|MONDO_0005090|\n        +-----------------+-------------------------------------------------------------------------+-----------------+------------------------------------------+\n        <BLANKLINE>\n        \"\"\"\n        p_value_text = f.coalesce(\n            GWASCatalogAssociations._normalise_pvaluetext(pvalue_text),\n            f.array(f.lit(\"no_pvalue_text\")),\n        )\n        return f.concat_ws(\n            \"|\",\n            association_trait,\n            f.concat_ws(\n                \"/\",\n                p_value_text,\n            ),\n            f.concat_ws(\n                \"/\",\n                parse_efos(mapped_trait_uri),\n            ),\n        )\n\n    @staticmethod\n    def _qc_all(\n        qc: Column,\n        chromosome: Column,\n        position: Column,\n        reference_allele: Column,\n        alternate_allele: Column,\n        strongest_snp_risk_allele: Column,\n        p_value_mantissa: Column,\n        p_value_exponent: Column,\n        p_value_cutoff: float,\n    ) -> Column:\n        \"\"\"Flag associations that fail any QC.\n\n        Args:\n            qc (Column): QC column\n            chromosome (Column): Chromosome column\n            position (Column): Position column\n            reference_allele (Column): Reference allele column\n            alternate_allele (Column): Alternate allele column\n            strongest_snp_risk_allele (Column): Strongest SNP risk allele column\n            p_value_mantissa (Column): P-value mantissa column\n            p_value_exponent (Column): P-value exponent column\n            p_value_cutoff (float): P-value cutoff\n\n        Returns:\n            Column: Updated QC column with flag.\n        \"\"\"\n        qc = GWASCatalogAssociations._qc_variant_interactions(\n            qc, strongest_snp_risk_allele\n        )\n        qc = GWASCatalogAssociations._qc_subsignificant_associations(\n            qc, p_value_mantissa, p_value_exponent, p_value_cutoff\n        )\n        qc = GWASCatalogAssociations._qc_genomic_location(qc, chromosome, position)\n        qc = GWASCatalogAssociations._qc_variant_inconsistencies(\n            qc, chromosome, position, strongest_snp_risk_allele\n        )\n        qc = GWASCatalogAssociations._qc_unmapped_variants(qc, alternate_allele)\n        qc = GWASCatalogAssociations._qc_palindromic_alleles(\n            qc, reference_allele, alternate_allele\n        )\n        return qc\n\n    @staticmethod\n    def _qc_variant_interactions(\n        qc: Column, strongest_snp_risk_allele: Column\n    ) 
-> Column:\n        \"\"\"Flag associations based on variant x variant interactions.\n\n        Args:\n            qc (Column): QC column\n            strongest_snp_risk_allele (Column): Column with the strongest SNP risk allele\n\n        Returns:\n            Column: Updated QC column with flag.\n        \"\"\"\n        return GWASCatalogAssociations._update_quality_flag(\n            qc,\n            strongest_snp_risk_allele.contains(\";\"),\n            StudyLocusQualityCheck.COMPOSITE_FLAG,\n        )\n\n    @staticmethod\n    def _qc_subsignificant_associations(\n        qc: Column,\n        p_value_mantissa: Column,\n        p_value_exponent: Column,\n        pvalue_cutoff: float,\n    ) -> Column:\n        \"\"\"Flag associations below significant threshold.\n\n        Args:\n            qc (Column): QC column\n            p_value_mantissa (Column): P-value mantissa column\n            p_value_exponent (Column): P-value exponent column\n            pvalue_cutoff (float): association p-value cut-off\n\n        Returns:\n            Column: Updated QC column with flag.\n\n        Examples:\n            >>> import pyspark.sql.types as t\n            >>> d = [{'qc': None, 'p_value_mantissa': 1, 'p_value_exponent': -7}, {'qc': None, 'p_value_mantissa': 1, 'p_value_exponent': -8}, {'qc': None, 'p_value_mantissa': 5, 'p_value_exponent': -8}, {'qc': None, 'p_value_mantissa': 1, 'p_value_exponent': -9}]\n            >>> df = spark.createDataFrame(d, t.StructType([t.StructField('qc', t.ArrayType(t.StringType()), True), t.StructField('p_value_mantissa', t.IntegerType()), t.StructField('p_value_exponent', t.IntegerType())]))\n            >>> df.withColumn('qc', GWASCatalogAssociations._qc_subsignificant_associations(f.col(\"qc\"), f.col(\"p_value_mantissa\"), f.col(\"p_value_exponent\"), 5e-8)).show(truncate = False)\n            +------------------------+----------------+----------------+\n            |qc                      |p_value_mantissa|p_value_exponent|\n            +------------------------+----------------+----------------+\n            |[Subsignificant p-value]|1               |-7              |\n            |[]                      |1               |-8              |\n            |[]                      |5               |-8              |\n            |[]                      |1               |-9              |\n            +------------------------+----------------+----------------+\n            <BLANKLINE>\n\n        \"\"\"\n        return StudyLocus._update_quality_flag(\n            qc,\n            calculate_neglog_pvalue(p_value_mantissa, p_value_exponent)\n            < f.lit(-np.log10(pvalue_cutoff)),\n            StudyLocusQualityCheck.SUBSIGNIFICANT_FLAG,\n        )\n\n    @staticmethod\n    def _qc_genomic_location(\n        qc: Column, chromosome: Column, position: Column\n    ) -> Column:\n        \"\"\"Flag associations without genomic location in GWAS Catalog.\n\n        Args:\n            qc (Column): QC column\n            chromosome (Column): Chromosome column in GWAS Catalog\n            position (Column): Position column in GWAS Catalog\n\n        Returns:\n            Column: Updated QC column with flag.\n\n        Examples:\n            >>> import pyspark.sql.types as t\n            >>> d = [{'qc': None, 'chromosome': None, 'position': None}, {'qc': None, 'chromosome': '1', 'position': None}, {'qc': None, 'chromosome': None, 'position': 1}, {'qc': None, 'chromosome': '1', 'position': 1}]\n            >>> df = spark.createDataFrame(d, 
schema=t.StructType([t.StructField('qc', t.ArrayType(t.StringType()), True), t.StructField('chromosome', t.StringType()), t.StructField('position', t.IntegerType())]))\n            >>> df.withColumn('qc', GWASCatalogAssociations._qc_genomic_location(df.qc, df.chromosome, df.position)).show(truncate=False)\n            +----------------------------+----------+--------+\n            |qc                          |chromosome|position|\n            +----------------------------+----------+--------+\n            |[Incomplete genomic mapping]|null      |null    |\n            |[Incomplete genomic mapping]|1         |null    |\n            |[Incomplete genomic mapping]|null      |1       |\n            |[]                          |1         |1       |\n            +----------------------------+----------+--------+\n            <BLANKLINE>\n\n        \"\"\"\n        return StudyLocus._update_quality_flag(\n            qc,\n            position.isNull() | chromosome.isNull(),\n            StudyLocusQualityCheck.NO_GENOMIC_LOCATION_FLAG,\n        )\n\n    @staticmethod\n    def _qc_variant_inconsistencies(\n        qc: Column,\n        chromosome: Column,\n        position: Column,\n        strongest_snp_risk_allele: Column,\n    ) -> Column:\n        \"\"\"Flag associations with inconsistencies in the variant annotation.\n\n        Args:\n            qc (Column): QC column\n            chromosome (Column): Chromosome column in GWAS Catalog\n            position (Column): Position column in GWAS Catalog\n            strongest_snp_risk_allele (Column): Strongest SNP risk allele column in GWAS Catalog\n\n        Returns:\n            Column: Updated QC column with flag.\n        \"\"\"\n        return GWASCatalogAssociations._update_quality_flag(\n            qc,\n            # Number of chromosomes does not correspond to the number of positions:\n            (f.size(f.split(chromosome, \";\")) != f.size(f.split(position, \";\")))\n            # Number of chromosome values different from riskAllele values:\n            | (\n                f.size(f.split(chromosome, \";\"))\n                != f.size(f.split(strongest_snp_risk_allele, \";\"))\n            ),\n            StudyLocusQualityCheck.INCONSISTENCY_FLAG,\n        )\n\n    @staticmethod\n    def _qc_unmapped_variants(qc: Column, alternate_allele: Column) -> Column:\n        \"\"\"Flag associations with variants not mapped to variantAnnotation.\n\n        Args:\n            qc (Column): QC column\n            alternate_allele (Column): alternate allele\n\n        Returns:\n            Column: Updated QC column with flag.\n\n        Example:\n            >>> import pyspark.sql.types as t\n            >>> d = [{'alternate_allele': 'A', 'qc': None}, {'alternate_allele': None, 'qc': None}]\n            >>> schema = t.StructType([t.StructField('alternate_allele', t.StringType(), True), t.StructField('qc', t.ArrayType(t.StringType()), True)])\n            >>> df = spark.createDataFrame(data=d, schema=schema)\n            >>> df.withColumn(\"new_qc\", GWASCatalogAssociations._qc_unmapped_variants(f.col(\"qc\"), f.col(\"alternate_allele\"))).show()\n            +----------------+----+--------------------+\n            |alternate_allele|  qc|              new_qc|\n            +----------------+----+--------------------+\n            |               A|null|                  []|\n            |            null|null|[No mapping in Gn...|\n            +----------------+----+--------------------+\n            <BLANKLINE>\n\n        \"\"\"\n        return 
GWASCatalogAssociations._update_quality_flag(\n            qc,\n            alternate_allele.isNull(),\n            StudyLocusQualityCheck.NON_MAPPED_VARIANT_FLAG,\n        )\n\n    @staticmethod\n    def _qc_palindromic_alleles(\n        qc: Column, reference_allele: Column, alternate_allele: Column\n    ) -> Column:\n        \"\"\"Flag associations with palindromic variants which effects can not be harmonised.\n\n        Args:\n            qc (Column): QC column\n            reference_allele (Column): reference allele\n            alternate_allele (Column): alternate allele\n\n        Returns:\n            Column: Updated QC column with flag.\n\n        Example:\n            >>> import pyspark.sql.types as t\n            >>> schema = t.StructType([t.StructField('reference_allele', t.StringType(), True), t.StructField('alternate_allele', t.StringType(), True), t.StructField('qc', t.ArrayType(t.StringType()), True)])\n            >>> d = [{'reference_allele': 'A', 'alternate_allele': 'T', 'qc': None}, {'reference_allele': 'AT', 'alternate_allele': 'TA', 'qc': None}, {'reference_allele': 'AT', 'alternate_allele': 'AT', 'qc': None}]\n            >>> df = spark.createDataFrame(data=d, schema=schema)\n            >>> df.withColumn(\"qc\", GWASCatalogAssociations._qc_palindromic_alleles(f.col(\"qc\"), f.col(\"reference_allele\"), f.col(\"alternate_allele\"))).show(truncate=False)\n            +----------------+----------------+---------------------------------------+\n            |reference_allele|alternate_allele|qc                                     |\n            +----------------+----------------+---------------------------------------+\n            |A               |T               |[Palindrome alleles - cannot harmonize]|\n            |AT              |TA              |[]                                     |\n            |AT              |AT              |[Palindrome alleles - cannot harmonize]|\n            +----------------+----------------+---------------------------------------+\n            <BLANKLINE>\n\n        \"\"\"\n        return StudyLocus._update_quality_flag(\n            qc,\n            GWASCatalogAssociations._are_alleles_palindromic(\n                reference_allele, alternate_allele\n            ),\n            StudyLocusQualityCheck.PALINDROMIC_ALLELE_FLAG,\n        )\n\n    @classmethod\n    def from_source(\n        cls: type[GWASCatalogAssociations],\n        gwas_associations: DataFrame,\n        variant_annotation: VariantAnnotation,\n        pvalue_threshold: float = 5e-8,\n    ) -> GWASCatalogAssociations:\n        \"\"\"Read GWASCatalog associations.\n\n        It reads the GWAS Catalog association dataset, selects and renames columns, casts columns, and\n        applies some pre-defined filters on the data:\n\n        Args:\n            gwas_associations (DataFrame): GWAS Catalog raw associations dataset\n            variant_annotation (VariantAnnotation): Variant annotation dataset\n            pvalue_threshold (float): P-value threshold for flagging associations\n\n        Returns:\n            GWASCatalogAssociations: GWASCatalogAssociations dataset\n        \"\"\"\n        return GWASCatalogAssociations(\n            _df=gwas_associations.withColumn(\n                \"studyLocusId\", f.monotonically_increasing_id().cast(LongType())\n            )\n            .transform(\n                # Map/harmonise variants to variant annotation dataset:\n                # This function adds columns: variantId, referenceAllele, alternateAllele, chromosome, 
position\n                lambda df: GWASCatalogAssociations._map_to_variant_annotation_variants(\n                    df, variant_annotation\n                )\n            )\n            .withColumn(\n                # Perform all quality control checks:\n                \"qualityControls\",\n                GWASCatalogAssociations._qc_all(\n                    f.array().alias(\"qualityControls\"),\n                    f.col(\"CHR_ID\"),\n                    f.col(\"CHR_POS\").cast(IntegerType()),\n                    f.col(\"referenceAllele\"),\n                    f.col(\"alternateAllele\"),\n                    f.col(\"STRONGEST SNP-RISK ALLELE\"),\n                    *GWASCatalogAssociations._parse_pvalue(f.col(\"P-VALUE\")),\n                    pvalue_threshold,\n                ),\n            )\n            .select(\n                # INSIDE STUDY-LOCUS SCHEMA:\n                \"studyLocusId\",\n                \"variantId\",\n                # Mapped genomic location of the variant (; separated list)\n                \"chromosome\",\n                \"position\",\n                f.col(\"STUDY ACCESSION\").alias(\"studyId\"),\n                # beta value of the association\n                GWASCatalogAssociations._harmonise_beta(\n                    GWASCatalogAssociations._normalise_risk_allele(\n                        f.col(\"STRONGEST SNP-RISK ALLELE\")\n                    ),\n                    f.col(\"referenceAllele\"),\n                    f.col(\"alternateAllele\"),\n                    f.col(\"OR or BETA\"),\n                    f.col(\"95% CI (TEXT)\"),\n                ).alias(\"beta\"),\n                # odds ratio of the association\n                GWASCatalogAssociations._harmonise_odds_ratio(\n                    GWASCatalogAssociations._normalise_risk_allele(\n                        f.col(\"STRONGEST SNP-RISK ALLELE\")\n                    ),\n                    f.col(\"referenceAllele\"),\n                    f.col(\"alternateAllele\"),\n                    f.col(\"OR or BETA\"),\n                    f.col(\"95% CI (TEXT)\"),\n                ).alias(\"oddsRatio\"),\n                # CI lower of the beta value\n                GWASCatalogAssociations._harmonise_beta_ci(\n                    GWASCatalogAssociations._normalise_risk_allele(\n                        f.col(\"STRONGEST SNP-RISK ALLELE\")\n                    ),\n                    f.col(\"referenceAllele\"),\n                    f.col(\"alternateAllele\"),\n                    f.col(\"OR or BETA\"),\n                    f.col(\"95% CI (TEXT)\"),\n                    f.col(\"P-VALUE\"),\n                    \"lower\",\n                ).alias(\"betaConfidenceIntervalLower\"),\n                # CI upper for the beta value\n                GWASCatalogAssociations._harmonise_beta_ci(\n                    GWASCatalogAssociations._normalise_risk_allele(\n                        f.col(\"STRONGEST SNP-RISK ALLELE\")\n                    ),\n                    f.col(\"referenceAllele\"),\n                    f.col(\"alternateAllele\"),\n                    f.col(\"OR or BETA\"),\n                    f.col(\"95% CI (TEXT)\"),\n                    f.col(\"P-VALUE\"),\n                    \"upper\",\n                ).alias(\"betaConfidenceIntervalUpper\"),\n                # CI lower of the odds ratio value\n                GWASCatalogAssociations._harmonise_odds_ratio_ci(\n                    GWASCatalogAssociations._normalise_risk_allele(\n                        f.col(\"STRONGEST SNP-RISK 
ALLELE\")\n                    ),\n                    f.col(\"referenceAllele\"),\n                    f.col(\"alternateAllele\"),\n                    f.col(\"OR or BETA\"),\n                    f.col(\"95% CI (TEXT)\"),\n                    f.col(\"P-VALUE\"),\n                    \"lower\",\n                ).alias(\"oddsRatioConfidenceIntervalLower\"),\n                # CI upper of the odds ratio value\n                GWASCatalogAssociations._harmonise_odds_ratio_ci(\n                    GWASCatalogAssociations._normalise_risk_allele(\n                        f.col(\"STRONGEST SNP-RISK ALLELE\")\n                    ),\n                    f.col(\"referenceAllele\"),\n                    f.col(\"alternateAllele\"),\n                    f.col(\"OR or BETA\"),\n                    f.col(\"95% CI (TEXT)\"),\n                    f.col(\"P-VALUE\"),\n                    \"upper\",\n                ).alias(\"oddsRatioConfidenceIntervalUpper\"),\n                # p-value of the association, string: split into exponent and mantissa.\n                *GWASCatalogAssociations._parse_pvalue(f.col(\"P-VALUE\")),\n                # Capturing phenotype granularity at the association level\n                GWASCatalogAssociations._concatenate_substudy_description(\n                    f.col(\"DISEASE/TRAIT\"),\n                    f.col(\"P-VALUE (TEXT)\"),\n                    f.col(\"MAPPED_TRAIT_URI\"),\n                ).alias(\"subStudyDescription\"),\n                # Quality controls (array of strings)\n                \"qualityControls\",\n            ),\n            _schema=GWASCatalogAssociations.get_schema(),\n        )\n\n    def update_study_id(\n        self: GWASCatalogAssociations, study_annotation: DataFrame\n    ) -> GWASCatalogAssociations:\n        \"\"\"Update final studyId and studyLocusId with a dataframe containing study annotation.\n\n        Args:\n            study_annotation (DataFrame): Dataframe containing `updatedStudyId` and key columns `studyId` and `subStudyDescription`.\n\n        Returns:\n            GWASCatalogAssociations: Updated study locus with new `studyId` and `studyLocusId`.\n        \"\"\"\n        self.df = (\n            self._df.join(\n                study_annotation, on=[\"studyId\", \"subStudyDescription\"], how=\"left\"\n            )\n            .withColumn(\"studyId\", f.coalesce(\"updatedStudyId\", \"studyId\"))\n            .drop(\"subStudyDescription\", \"updatedStudyId\")\n        ).withColumn(\n            \"studyLocusId\",\n            StudyLocus.assign_study_locus_id(f.col(\"studyId\"), f.col(\"variantId\")),\n        )\n        return self\n\n    def _qc_ambiguous_study(self: GWASCatalogAssociations) -> GWASCatalogAssociations:\n        \"\"\"Flag associations with variants that can not be unambiguously associated with one study.\n\n        Returns:\n            GWASCatalogAssociations: Updated study locus.\n        \"\"\"\n        assoc_ambiguity_window = Window.partitionBy(\n            f.col(\"studyId\"), f.col(\"variantId\")\n        )\n\n        self._df.withColumn(\n            \"qualityControls\",\n            StudyLocus._update_quality_flag(\n                f.col(\"qualityControls\"),\n                f.count(f.col(\"variantId\")).over(assoc_ambiguity_window) > 1,\n                StudyLocusQualityCheck.AMBIGUOUS_STUDY,\n            ),\n        )\n        return self\n
"},{"location":"python_api/datasource/gwas_catalog/associations/#otg.datasource.gwas_catalog.associations.GWASCatalogAssociations.from_source","title":"from_source(gwas_associations: DataFrame, variant_annotation: VariantAnnotation, pvalue_threshold: float = 5e-08) -> GWASCatalogAssociations classmethod","text":"

Read GWASCatalog associations.

It reads the GWAS Catalog association dataset, selects and renames columns, casts column types, and applies some pre-defined filters on the data.

Parameters:

    gwas_associations (DataFrame): GWAS Catalog raw associations dataset. Required.
    variant_annotation (VariantAnnotation): Variant annotation dataset. Required.
    pvalue_threshold (float): P-value threshold for flagging associations. Default: 5e-08.

Returns:

    GWASCatalogAssociations: GWASCatalogAssociations dataset
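
A minimal usage sketch, assuming an active SparkSession named `spark` and an already loaded `VariantAnnotation` dataset named `variant_annotation`; the file path is a placeholder for the raw GWAS Catalog associations download, not part of the documented API:

    from otg.datasource.gwas_catalog.associations import GWASCatalogAssociations

    # The raw GWAS Catalog association export is a tab-separated file with a header row.
    gwas_associations_raw = spark.read.csv(
        "gwas_catalog_associations.tsv",  # placeholder path
        sep="\t",
        header=True,
    )

    # Harmonise variants, run the QC flags and project the raw columns onto the study-locus schema.
    study_locus = GWASCatalogAssociations.from_source(
        gwas_associations=gwas_associations_raw,
        variant_annotation=variant_annotation,  # assumed VariantAnnotation instance
        pvalue_threshold=5e-8,
    )

    study_locus.df.show(5, truncate=False)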

Source code in src/otg/datasource/gwas_catalog/associations.py
@classmethod\ndef from_source(\n    cls: type[GWASCatalogAssociations],\n    gwas_associations: DataFrame,\n    variant_annotation: VariantAnnotation,\n    pvalue_threshold: float = 5e-8,\n) -> GWASCatalogAssociations:\n    \"\"\"Read GWASCatalog associations.\n\n    It reads the GWAS Catalog association dataset, selects and renames columns, casts columns, and\n    applies some pre-defined filters on the data:\n\n    Args:\n        gwas_associations (DataFrame): GWAS Catalog raw associations dataset\n        variant_annotation (VariantAnnotation): Variant annotation dataset\n        pvalue_threshold (float): P-value threshold for flagging associations\n\n    Returns:\n        GWASCatalogAssociations: GWASCatalogAssociations dataset\n    \"\"\"\n    return GWASCatalogAssociations(\n        _df=gwas_associations.withColumn(\n            \"studyLocusId\", f.monotonically_increasing_id().cast(LongType())\n        )\n        .transform(\n            # Map/harmonise variants to variant annotation dataset:\n            # This function adds columns: variantId, referenceAllele, alternateAllele, chromosome, position\n            lambda df: GWASCatalogAssociations._map_to_variant_annotation_variants(\n                df, variant_annotation\n            )\n        )\n        .withColumn(\n            # Perform all quality control checks:\n            \"qualityControls\",\n            GWASCatalogAssociations._qc_all(\n                f.array().alias(\"qualityControls\"),\n                f.col(\"CHR_ID\"),\n                f.col(\"CHR_POS\").cast(IntegerType()),\n                f.col(\"referenceAllele\"),\n                f.col(\"alternateAllele\"),\n                f.col(\"STRONGEST SNP-RISK ALLELE\"),\n                *GWASCatalogAssociations._parse_pvalue(f.col(\"P-VALUE\")),\n                pvalue_threshold,\n            ),\n        )\n        .select(\n            # INSIDE STUDY-LOCUS SCHEMA:\n            \"studyLocusId\",\n            \"variantId\",\n            # Mapped genomic location of the variant (; separated list)\n            \"chromosome\",\n            \"position\",\n            f.col(\"STUDY ACCESSION\").alias(\"studyId\"),\n            # beta value of the association\n            GWASCatalogAssociations._harmonise_beta(\n                GWASCatalogAssociations._normalise_risk_allele(\n                    f.col(\"STRONGEST SNP-RISK ALLELE\")\n                ),\n                f.col(\"referenceAllele\"),\n                f.col(\"alternateAllele\"),\n                f.col(\"OR or BETA\"),\n                f.col(\"95% CI (TEXT)\"),\n            ).alias(\"beta\"),\n            # odds ratio of the association\n            GWASCatalogAssociations._harmonise_odds_ratio(\n                GWASCatalogAssociations._normalise_risk_allele(\n                    f.col(\"STRONGEST SNP-RISK ALLELE\")\n                ),\n                f.col(\"referenceAllele\"),\n                f.col(\"alternateAllele\"),\n                f.col(\"OR or BETA\"),\n                f.col(\"95% CI (TEXT)\"),\n            ).alias(\"oddsRatio\"),\n            # CI lower of the beta value\n            GWASCatalogAssociations._harmonise_beta_ci(\n                GWASCatalogAssociations._normalise_risk_allele(\n                    f.col(\"STRONGEST SNP-RISK ALLELE\")\n                ),\n                f.col(\"referenceAllele\"),\n                f.col(\"alternateAllele\"),\n                f.col(\"OR or BETA\"),\n                f.col(\"95% CI (TEXT)\"),\n                f.col(\"P-VALUE\"),\n                \"lower\",\n 
           ).alias(\"betaConfidenceIntervalLower\"),\n            # CI upper for the beta value\n            GWASCatalogAssociations._harmonise_beta_ci(\n                GWASCatalogAssociations._normalise_risk_allele(\n                    f.col(\"STRONGEST SNP-RISK ALLELE\")\n                ),\n                f.col(\"referenceAllele\"),\n                f.col(\"alternateAllele\"),\n                f.col(\"OR or BETA\"),\n                f.col(\"95% CI (TEXT)\"),\n                f.col(\"P-VALUE\"),\n                \"upper\",\n            ).alias(\"betaConfidenceIntervalUpper\"),\n            # CI lower of the odds ratio value\n            GWASCatalogAssociations._harmonise_odds_ratio_ci(\n                GWASCatalogAssociations._normalise_risk_allele(\n                    f.col(\"STRONGEST SNP-RISK ALLELE\")\n                ),\n                f.col(\"referenceAllele\"),\n                f.col(\"alternateAllele\"),\n                f.col(\"OR or BETA\"),\n                f.col(\"95% CI (TEXT)\"),\n                f.col(\"P-VALUE\"),\n                \"lower\",\n            ).alias(\"oddsRatioConfidenceIntervalLower\"),\n            # CI upper of the odds ratio value\n            GWASCatalogAssociations._harmonise_odds_ratio_ci(\n                GWASCatalogAssociations._normalise_risk_allele(\n                    f.col(\"STRONGEST SNP-RISK ALLELE\")\n                ),\n                f.col(\"referenceAllele\"),\n                f.col(\"alternateAllele\"),\n                f.col(\"OR or BETA\"),\n                f.col(\"95% CI (TEXT)\"),\n                f.col(\"P-VALUE\"),\n                \"upper\",\n            ).alias(\"oddsRatioConfidenceIntervalUpper\"),\n            # p-value of the association, string: split into exponent and mantissa.\n            *GWASCatalogAssociations._parse_pvalue(f.col(\"P-VALUE\")),\n            # Capturing phenotype granularity at the association level\n            GWASCatalogAssociations._concatenate_substudy_description(\n                f.col(\"DISEASE/TRAIT\"),\n                f.col(\"P-VALUE (TEXT)\"),\n                f.col(\"MAPPED_TRAIT_URI\"),\n            ).alias(\"subStudyDescription\"),\n            # Quality controls (array of strings)\n            \"qualityControls\",\n        ),\n        _schema=GWASCatalogAssociations.get_schema(),\n    )\n
"},{"location":"python_api/datasource/gwas_catalog/associations/#otg.datasource.gwas_catalog.associations.GWASCatalogAssociations.update_study_id","title":"update_study_id(study_annotation: DataFrame) -> GWASCatalogAssociations","text":"

Update final studyId and studyLocusId with a dataframe containing study annotation.

Parameters:

    study_annotation (DataFrame): Dataframe containing updatedStudyId and key columns studyId and subStudyDescription. Required.

Returns:

    GWASCatalogAssociations: Updated study locus with new studyId and studyLocusId.
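
A hedged sketch of assembling and applying the study annotation dataframe, continuing from the sketch above; the identifier values are invented for illustration only:

    # One row per (studyId, subStudyDescription) pair that should be re-keyed.
    study_annotation = spark.createDataFrame(
        [("GCST000001", "Trait A [subset]", "GCST000001_1")],  # placeholder values
        ["studyId", "subStudyDescription", "updatedStudyId"],
    )

    # studyId is coalesced to updatedStudyId and studyLocusId is re-assigned from (studyId, variantId).
    study_locus = study_locus.update_study_id(study_annotation)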

Source code in src/otg/datasource/gwas_catalog/associations.py
def update_study_id(\n    self: GWASCatalogAssociations, study_annotation: DataFrame\n) -> GWASCatalogAssociations:\n    \"\"\"Update final studyId and studyLocusId with a dataframe containing study annotation.\n\n    Args:\n        study_annotation (DataFrame): Dataframe containing `updatedStudyId` and key columns `studyId` and `subStudyDescription`.\n\n    Returns:\n        GWASCatalogAssociations: Updated study locus with new `studyId` and `studyLocusId`.\n    \"\"\"\n    self.df = (\n        self._df.join(\n            study_annotation, on=[\"studyId\", \"subStudyDescription\"], how=\"left\"\n        )\n        .withColumn(\"studyId\", f.coalesce(\"updatedStudyId\", \"studyId\"))\n        .drop(\"subStudyDescription\", \"updatedStudyId\")\n    ).withColumn(\n        \"studyLocusId\",\n        StudyLocus.assign_study_locus_id(f.col(\"studyId\"), f.col(\"variantId\")),\n    )\n    return self\n
"},{"location":"python_api/datasource/gwas_catalog/study_index/","title":"Study Index","text":""},{"location":"python_api/datasource/gwas_catalog/study_index/#otg.datasource.gwas_catalog.study_index.GWASCatalogStudyIndex","title":"otg.datasource.gwas_catalog.study_index.GWASCatalogStudyIndex dataclass","text":"

Bases: StudyIndex

Study index from GWAS Catalog.

The following information is harmonised from the GWAS Catalog:

  • All publication-related information is retained.
  • Mapped measured and background traits are parsed.
  • Studies are flagged if harmonised summary statistics datasets are available.
  • If available, the FTP path to these files is presented.
  • Ancestries from the discovery and replication stages are structured with sample counts; counts reported for mixed ancestries are split evenly across the listed ancestries (see the sketch below).
  • Case/control counts are extracted.
  • The number of samples with European ancestry is extracted.
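
A plain-Python illustration of the even-split rule applied to mixed-ancestry sample counts (implemented by `_parse_discovery_samples` in the source below); the helper name is hypothetical and the sketch is simplified compared to the Spark implementation, which also handles parenthesised labels and de-duplication:

    def split_sample_size(ancestries, sample_size):
        """Evenly divide a reported sample size across a comma-separated ancestry label."""
        labels = [label.strip() for label in ancestries.split(",")]
        share = sample_size // len(labels)
        return [(label, share) for label in labels]

    # ["European, African", 100] -> [("European", 50), ("African", 50)]
    print(split_sample_size("European, African", 100))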
Source code in src/otg/datasource/gwas_catalog/study_index.py
@dataclass\nclass GWASCatalogStudyIndex(StudyIndex):\n    \"\"\"Study index from GWAS Catalog.\n\n    The following information is harmonised from the GWAS Catalog:\n\n    - All publication related information retained.\n    - Mapped measured and background traits parsed.\n    - Flagged if harmonized summary statistics datasets available.\n    - If available, the ftp path to these files presented.\n    - Ancestries from the discovery and replication stages are structured with sample counts.\n    - Case/control counts extracted.\n    - The number of samples with European ancestry extracted.\n\n    \"\"\"\n\n    @staticmethod\n    def _parse_discovery_samples(discovery_samples: Column) -> Column:\n        \"\"\"Parse discovery sample sizes from GWAS Catalog.\n\n        This is a curated field. From publication sometimes it is not clear how the samples were split\n        across the reported ancestries. In such cases we are assuming the ancestries were evenly presented\n        and the total sample size is split:\n\n        [\"European, African\", 100] -> [\"European, 50], [\"African\", 50]\n\n        Args:\n            discovery_samples (Column): Raw discovery sample sizes\n\n        Returns:\n            Column: Parsed and de-duplicated list of discovery ancestries with sample size.\n\n        Examples:\n            >>> data = [('s1', \"European\", 10), ('s1', \"African\", 10), ('s2', \"European, African, Asian\", 100), ('s2', \"European\", 50)]\n            >>> df = (\n            ...    spark.createDataFrame(data, ['studyId', 'ancestry', 'sampleSize'])\n            ...    .groupBy('studyId')\n            ...    .agg(\n            ...        f.collect_set(\n            ...            f.struct('ancestry', 'sampleSize')\n            ...        ).alias('discoverySampleSize')\n            ...    )\n            ...    .orderBy('studyId')\n            ...    .withColumn('discoverySampleSize', GWASCatalogStudyIndex._parse_discovery_samples(f.col('discoverySampleSize')))\n            ...    .select('discoverySampleSize')\n            ...    .show(truncate=False)\n            ... 
)\n            +--------------------------------------------+\n            |discoverySampleSize                         |\n            +--------------------------------------------+\n            |[{African, 10}, {European, 10}]             |\n            |[{European, 83}, {African, 33}, {Asian, 33}]|\n            +--------------------------------------------+\n            <BLANKLINE>\n        \"\"\"\n        # To initialize return objects for aggregate functions, schema has to be definied:\n        schema = t.ArrayType(\n            t.StructType(\n                [\n                    t.StructField(\"ancestry\", t.StringType(), True),\n                    t.StructField(\"sampleSize\", t.IntegerType(), True),\n                ]\n            )\n        )\n\n        # Splitting comma separated ancestries:\n        exploded_ancestries = f.transform(\n            discovery_samples,\n            lambda sample: f.split(sample.ancestry, r\",\\s(?![^()]*\\))\"),\n        )\n\n        # Initialize discoverySample object from unique list of ancestries:\n        unique_ancestries = f.transform(\n            f.aggregate(\n                exploded_ancestries,\n                f.array().cast(t.ArrayType(t.StringType())),\n                lambda x, y: f.array_union(x, y),\n                f.array_distinct,\n            ),\n            lambda ancestry: f.struct(\n                ancestry.alias(\"ancestry\"),\n                f.lit(0).cast(t.LongType()).alias(\"sampleSize\"),\n            ),\n        )\n\n        # Computing sample sizes for ancestries when splitting is needed:\n        resolved_sample_count = f.transform(\n            f.arrays_zip(\n                f.transform(exploded_ancestries, lambda pop: f.size(pop)).alias(\n                    \"pop_size\"\n                ),\n                f.transform(discovery_samples, lambda pop: pop.sampleSize).alias(\n                    \"pop_count\"\n                ),\n            ),\n            lambda pop: (pop.pop_count / pop.pop_size).cast(t.IntegerType()),\n        )\n\n        # Flattening out ancestries with sample sizes:\n        parsed_sample_size = f.aggregate(\n            f.transform(\n                f.arrays_zip(\n                    exploded_ancestries.alias(\"ancestries\"),\n                    resolved_sample_count.alias(\"sample_count\"),\n                ),\n                GWASCatalogStudyIndex._merge_ancestries_and_counts,\n            ),\n            f.array().cast(schema),\n            lambda x, y: f.array_union(x, y),\n        )\n\n        # Normalize ancestries:\n        return f.aggregate(\n            parsed_sample_size,\n            unique_ancestries,\n            GWASCatalogStudyIndex._normalize_ancestries,\n        )\n\n    @staticmethod\n    def _normalize_ancestries(merged: Column, ancestry: Column) -> Column:\n        \"\"\"Normalize ancestries from a list of structs.\n\n        As some ancestry label might be repeated with different sample counts,\n        these counts need to be collected.\n\n        Args:\n            merged (Column): Resulting list of struct with unique ancestries.\n            ancestry (Column): One ancestry object coming from raw.\n\n        Returns:\n            Column: Unique list of ancestries with the sample counts.\n        \"\"\"\n        # Iterating over the list of unique ancestries and adding the sample size if label matches:\n        return f.transform(\n            merged,\n            lambda a: f.when(\n                a.ancestry == ancestry.ancestry,\n                f.struct(\n           
         a.ancestry.alias(\"ancestry\"),\n                    (a.sampleSize + ancestry.sampleSize)\n                    .cast(t.LongType())\n                    .alias(\"sampleSize\"),\n                ),\n            ).otherwise(a),\n        )\n\n    @staticmethod\n    def _merge_ancestries_and_counts(ancestry_group: Column) -> Column:\n        \"\"\"Merge ancestries with sample sizes.\n\n        After splitting ancestry annotations, all resulting ancestries needs to be assigned\n        with the proper sample size.\n\n        Args:\n            ancestry_group (Column): Each element is a struct with `sample_count` (int) and `ancestries` (list)\n\n        Returns:\n            Column: a list of structs with `ancestry` and `sampleSize` fields.\n\n        Examples:\n            >>> data = [(12, ['African', 'European']),(12, ['African'])]\n            >>> (\n            ...     spark.createDataFrame(data, ['sample_count', 'ancestries'])\n            ...     .select(GWASCatalogStudyIndex._merge_ancestries_and_counts(f.struct('sample_count', 'ancestries')).alias('test'))\n            ...     .show(truncate=False)\n            ... )\n            +-------------------------------+\n            |test                           |\n            +-------------------------------+\n            |[{African, 12}, {European, 12}]|\n            |[{African, 12}]                |\n            +-------------------------------+\n            <BLANKLINE>\n        \"\"\"\n        # Extract sample size for the ancestry group:\n        count = ancestry_group.sample_count\n\n        # We need to loop through the ancestries:\n        return f.transform(\n            ancestry_group.ancestries,\n            lambda ancestry: f.struct(\n                ancestry.alias(\"ancestry\"),\n                count.alias(\"sampleSize\"),\n            ),\n        )\n\n    @classmethod\n    def _parse_study_table(\n        cls: type[GWASCatalogStudyIndex], catalog_studies: DataFrame\n    ) -> GWASCatalogStudyIndex:\n        \"\"\"Harmonise GWASCatalog study table with `StudyIndex` schema.\n\n        Args:\n            catalog_studies (DataFrame): GWAS Catalog study table\n\n        Returns:\n            GWASCatalogStudyIndex: Parsed and annotated GWAS Catalog study table.\n        \"\"\"\n        return GWASCatalogStudyIndex(\n            _df=catalog_studies.select(\n                f.coalesce(\n                    f.col(\"STUDY ACCESSION\"), f.monotonically_increasing_id()\n                ).alias(\"studyId\"),\n                f.lit(\"GCST\").alias(\"projectId\"),\n                f.lit(\"gwas\").alias(\"studyType\"),\n                f.col(\"PUBMED ID\").alias(\"pubmedId\"),\n                f.col(\"FIRST AUTHOR\").alias(\"publicationFirstAuthor\"),\n                f.col(\"DATE\").alias(\"publicationDate\"),\n                f.col(\"JOURNAL\").alias(\"publicationJournal\"),\n                f.col(\"STUDY\").alias(\"publicationTitle\"),\n                f.coalesce(f.col(\"DISEASE/TRAIT\"), f.lit(\"Unreported\")).alias(\n                    \"traitFromSource\"\n                ),\n                f.col(\"INITIAL SAMPLE SIZE\").alias(\"initialSampleSize\"),\n                parse_efos(f.col(\"MAPPED_TRAIT_URI\")).alias(\"traitFromSourceMappedIds\"),\n                parse_efos(f.col(\"MAPPED BACKGROUND TRAIT URI\")).alias(\n                    \"backgroundTraitFromSourceMappedIds\"\n                ),\n            ),\n            _schema=GWASCatalogStudyIndex.get_schema(),\n        )\n\n    @classmethod\n    def from_source(\n        
cls: type[GWASCatalogStudyIndex],\n        catalog_studies: DataFrame,\n        ancestry_file: DataFrame,\n        sumstats_lut: DataFrame,\n    ) -> StudyIndex:\n        \"\"\"Ingests study level metadata from the GWAS Catalog.\n\n        Args:\n            catalog_studies (DataFrame): GWAS Catalog raw study table\n            ancestry_file (DataFrame): GWAS Catalog ancestry table.\n            sumstats_lut (DataFrame): GWAS Catalog summary statistics list.\n\n        Returns:\n            StudyIndex: Parsed and annotated GWAS Catalog study table.\n        \"\"\"\n        # Read GWAS Catalogue raw data\n        return (\n            cls._parse_study_table(catalog_studies)\n            ._annotate_ancestries(ancestry_file)\n            ._annotate_sumstats_info(sumstats_lut)\n            ._annotate_discovery_sample_sizes()\n        )\n\n    def update_study_id(\n        self: GWASCatalogStudyIndex, study_annotation: DataFrame\n    ) -> GWASCatalogStudyIndex:\n        \"\"\"Update studyId with a dataframe containing study.\n\n        Args:\n            study_annotation (DataFrame): Dataframe containing `updatedStudyId`, `traitFromSource`, `traitFromSourceMappedIds` and key column `studyId`.\n\n        Returns:\n            GWASCatalogStudyIndex: Updated study table.\n        \"\"\"\n        self.df = (\n            self._df.join(\n                study_annotation.select(\n                    *[\n                        f.col(c).alias(f\"updated{c}\")\n                        if c not in [\"studyId\", \"updatedStudyId\"]\n                        else f.col(c)\n                        for c in study_annotation.columns\n                    ]\n                ),\n                on=\"studyId\",\n                how=\"left\",\n            )\n            .withColumn(\n                \"studyId\",\n                f.coalesce(f.col(\"updatedStudyId\"), f.col(\"studyId\")),\n            )\n            .withColumn(\n                \"traitFromSource\",\n                f.coalesce(f.col(\"updatedtraitFromSource\"), f.col(\"traitFromSource\")),\n            )\n            .withColumn(\n                \"traitFromSourceMappedIds\",\n                f.coalesce(\n                    f.col(\"updatedtraitFromSourceMappedIds\"),\n                    f.col(\"traitFromSourceMappedIds\"),\n                ),\n            )\n            .select(self._df.columns)\n        )\n\n        return self\n\n    def _annotate_ancestries(\n        self: GWASCatalogStudyIndex, ancestry_lut: DataFrame\n    ) -> GWASCatalogStudyIndex:\n        \"\"\"Extracting sample sizes and ancestry information.\n\n        This function parses the ancestry data. 
Also get counts for the europeans in the same\n        discovery stage.\n\n        Args:\n            ancestry_lut (DataFrame): Ancestry table as downloaded from the GWAS Catalog\n\n        Returns:\n            GWASCatalogStudyIndex: Slimmed and cleaned version of the ancestry annotation.\n        \"\"\"\n        ancestry = (\n            ancestry_lut\n            # Convert column headers to camelcase:\n            .transform(\n                lambda df: df.select(\n                    *[f.expr(column2camel_case(x)) for x in df.columns]\n                )\n            ).withColumnRenamed(\n                \"studyAccession\", \"studyId\"\n            )  # studyId has not been split yet\n        )\n\n        # Get a high resolution dataset on experimental stage:\n        ancestry_stages = (\n            ancestry.groupBy(\"studyId\")\n            .pivot(\"stage\")\n            .agg(\n                f.collect_set(\n                    f.struct(\n                        f.col(\"broadAncestralCategory\").alias(\"ancestry\"),\n                        f.col(\"numberOfIndividuals\")\n                        .cast(t.LongType())\n                        .alias(\"sampleSize\"),\n                    )\n                )\n            )\n            .withColumn(\n                \"discoverySamples\", self._parse_discovery_samples(f.col(\"initial\"))\n            )\n            .withColumnRenamed(\"replication\", \"replicationSamples\")\n            # Mapping discovery stage ancestries to LD reference:\n            .withColumn(\n                \"ldPopulationStructure\",\n                self.aggregate_and_map_ancestries(f.col(\"discoverySamples\")),\n            )\n            .drop(\"initial\")\n            .persist()\n        )\n\n        # Generate information on the ancestry composition of the discovery stage, and calculate\n        # the proportion of the Europeans:\n        europeans_deconvoluted = (\n            ancestry\n            # Focus on discovery stage:\n            .filter(f.col(\"stage\") == \"initial\")\n            # Sorting ancestries if European:\n            .withColumn(\n                \"ancestryFlag\",\n                # Excluding finnish:\n                f.when(\n                    f.col(\"initialSampleDescription\").contains(\"Finnish\"),\n                    f.lit(\"other\"),\n                )\n                # Excluding Icelandic population:\n                .when(\n                    f.col(\"initialSampleDescription\").contains(\"Icelandic\"),\n                    f.lit(\"other\"),\n                )\n                # Including European ancestry:\n                .when(f.col(\"broadAncestralCategory\") == \"European\", f.lit(\"european\"))\n                # Exclude all other population:\n                .otherwise(\"other\"),\n            )\n            # Grouping by study accession and initial sample description:\n            .groupBy(\"studyId\")\n            .pivot(\"ancestryFlag\")\n            .agg(\n                # Summarizing sample sizes for all ancestries:\n                f.sum(f.col(\"numberOfIndividuals\"))\n            )\n            # Do arithmetics to make sure we have the right proportion of european in the set:\n            .withColumn(\n                \"initialSampleCountEuropean\",\n                f.when(f.col(\"european\").isNull(), f.lit(0)).otherwise(\n                    f.col(\"european\")\n                ),\n            )\n            .withColumn(\n                \"initialSampleCountOther\",\n                
f.when(f.col(\"other\").isNull(), f.lit(0)).otherwise(f.col(\"other\")),\n            )\n            .withColumn(\n                \"initialSampleCount\",\n                f.col(\"initialSampleCountEuropean\") + f.col(\"other\"),\n            )\n            .drop(\n                \"european\",\n                \"other\",\n                \"initialSampleCount\",\n                \"initialSampleCountEuropean\",\n                \"initialSampleCountOther\",\n            )\n        )\n\n        parsed_ancestry_lut = ancestry_stages.join(\n            europeans_deconvoluted, on=\"studyId\", how=\"outer\"\n        )\n\n        self.df = self.df.join(parsed_ancestry_lut, on=\"studyId\", how=\"left\")\n        return self\n\n    def _annotate_sumstats_info(\n        self: GWASCatalogStudyIndex, sumstats_lut: DataFrame\n    ) -> GWASCatalogStudyIndex:\n        \"\"\"Annotate summary stat locations.\n\n        Args:\n            sumstats_lut (DataFrame): listing GWAS Catalog summary stats paths\n\n        Returns:\n            GWASCatalogStudyIndex: including `summarystatsLocation` and `hasSumstats` columns\n        \"\"\"\n        gwas_sumstats_base_uri = (\n            \"ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/\"\n        )\n\n        parsed_sumstats_lut = sumstats_lut.withColumn(\n            \"summarystatsLocation\",\n            f.concat(\n                f.lit(gwas_sumstats_base_uri),\n                f.regexp_replace(f.col(\"_c0\"), r\"^\\.\\/\", \"\"),\n            ),\n        ).select(\n            f.regexp_extract(f.col(\"summarystatsLocation\"), r\"\\/(GCST\\d+)\\/\", 1).alias(\n                \"studyId\"\n            ),\n            \"summarystatsLocation\",\n            f.lit(True).alias(\"hasSumstats\"),\n        )\n\n        self.df = (\n            self.df.drop(\"hasSumstats\")\n            .join(parsed_sumstats_lut, on=\"studyId\", how=\"left\")\n            .withColumn(\"hasSumstats\", f.coalesce(f.col(\"hasSumstats\"), f.lit(False)))\n        )\n        return self\n\n    def _annotate_discovery_sample_sizes(\n        self: GWASCatalogStudyIndex,\n    ) -> GWASCatalogStudyIndex:\n        \"\"\"Extract the sample size of the discovery stage of the study as annotated in the GWAS Catalog.\n\n        For some studies that measure quantitative traits, nCases and nControls can't be extracted. 
Therefore, we assume these are 0.\n\n        Returns:\n            GWASCatalogStudyIndex: object with columns `nCases`, `nControls`, and `nSamples` per `studyId` correctly extracted.\n        \"\"\"\n        sample_size_lut = (\n            self.df.select(\n                \"studyId\",\n                f.explode_outer(f.split(f.col(\"initialSampleSize\"), r\",\\s+\")).alias(\n                    \"samples\"\n                ),\n            )\n            # Extracting the sample size from the string:\n            .withColumn(\n                \"sampleSize\",\n                f.regexp_extract(\n                    f.regexp_replace(f.col(\"samples\"), \",\", \"\"), r\"[0-9,]+\", 0\n                ).cast(t.IntegerType()),\n            )\n            .select(\n                \"studyId\",\n                \"sampleSize\",\n                f.when(f.col(\"samples\").contains(\"cases\"), f.col(\"sampleSize\"))\n                .otherwise(f.lit(0))\n                .alias(\"nCases\"),\n                f.when(f.col(\"samples\").contains(\"controls\"), f.col(\"sampleSize\"))\n                .otherwise(f.lit(0))\n                .alias(\"nControls\"),\n            )\n            # Aggregating sample sizes for all ancestries:\n            .groupBy(\"studyId\")  # studyId has not been split yet\n            .agg(\n                f.sum(\"nCases\").alias(\"nCases\"),\n                f.sum(\"nControls\").alias(\"nControls\"),\n                f.sum(\"sampleSize\").alias(\"nSamples\"),\n            )\n        )\n        self.df = self.df.join(sample_size_lut, on=\"studyId\", how=\"left\")\n        return self\n
"},{"location":"python_api/datasource/gwas_catalog/study_index/#otg.datasource.gwas_catalog.study_index.GWASCatalogStudyIndex.from_source","title":"from_source(catalog_studies: DataFrame, ancestry_file: DataFrame, sumstats_lut: DataFrame) -> StudyIndex classmethod","text":"

Ingests study level metadata from the GWAS Catalog.

Parameters:

    catalog_studies (DataFrame): GWAS Catalog raw study table. Required.
    ancestry_file (DataFrame): GWAS Catalog ancestry table. Required.
    sumstats_lut (DataFrame): GWAS Catalog summary statistics list. Required.

Returns:

    StudyIndex: Parsed and annotated GWAS Catalog study table.
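
A usage sketch, again assuming an active `spark` session; the three inputs correspond to the raw study, ancestry and summary-statistics listing files distributed by the GWAS Catalog, and the paths are placeholders:

    from otg.datasource.gwas_catalog.study_index import GWASCatalogStudyIndex

    catalog_studies = spark.read.csv("gwas_catalog_studies.tsv", sep="\t", header=True)  # placeholder path
    ancestry_file = spark.read.csv("gwas_catalog_ancestries.tsv", sep="\t", header=True)  # placeholder path
    # The summary statistics listing is read without a header, so its single column arrives as `_c0`.
    sumstats_lut = spark.read.csv("harmonised_sumstats_list.txt", header=False)  # placeholder path

    study_index = GWASCatalogStudyIndex.from_source(
        catalog_studies=catalog_studies,
        ancestry_file=ancestry_file,
        sumstats_lut=sumstats_lut,
    )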

Source code in src/otg/datasource/gwas_catalog/study_index.py
@classmethod\ndef from_source(\n    cls: type[GWASCatalogStudyIndex],\n    catalog_studies: DataFrame,\n    ancestry_file: DataFrame,\n    sumstats_lut: DataFrame,\n) -> StudyIndex:\n    \"\"\"Ingests study level metadata from the GWAS Catalog.\n\n    Args:\n        catalog_studies (DataFrame): GWAS Catalog raw study table\n        ancestry_file (DataFrame): GWAS Catalog ancestry table.\n        sumstats_lut (DataFrame): GWAS Catalog summary statistics list.\n\n    Returns:\n        StudyIndex: Parsed and annotated GWAS Catalog study table.\n    \"\"\"\n    # Read GWAS Catalogue raw data\n    return (\n        cls._parse_study_table(catalog_studies)\n        ._annotate_ancestries(ancestry_file)\n        ._annotate_sumstats_info(sumstats_lut)\n        ._annotate_discovery_sample_sizes()\n    )\n
"},{"location":"python_api/datasource/gwas_catalog/study_index/#otg.datasource.gwas_catalog.study_index.GWASCatalogStudyIndex.update_study_id","title":"update_study_id(study_annotation: DataFrame) -> GWASCatalogStudyIndex","text":"

Update studyId with a dataframe containing study annotation.

Parameters:

    study_annotation (DataFrame): Dataframe containing updatedStudyId, traitFromSource, traitFromSourceMappedIds and key column studyId. Required.

Returns:

    GWASCatalogStudyIndex: Updated study table.

Source code in src/otg/datasource/gwas_catalog/study_index.py
def update_study_id(\n    self: GWASCatalogStudyIndex, study_annotation: DataFrame\n) -> GWASCatalogStudyIndex:\n    \"\"\"Update studyId with a dataframe containing study.\n\n    Args:\n        study_annotation (DataFrame): Dataframe containing `updatedStudyId`, `traitFromSource`, `traitFromSourceMappedIds` and key column `studyId`.\n\n    Returns:\n        GWASCatalogStudyIndex: Updated study table.\n    \"\"\"\n    self.df = (\n        self._df.join(\n            study_annotation.select(\n                *[\n                    f.col(c).alias(f\"updated{c}\")\n                    if c not in [\"studyId\", \"updatedStudyId\"]\n                    else f.col(c)\n                    for c in study_annotation.columns\n                ]\n            ),\n            on=\"studyId\",\n            how=\"left\",\n        )\n        .withColumn(\n            \"studyId\",\n            f.coalesce(f.col(\"updatedStudyId\"), f.col(\"studyId\")),\n        )\n        .withColumn(\n            \"traitFromSource\",\n            f.coalesce(f.col(\"updatedtraitFromSource\"), f.col(\"traitFromSource\")),\n        )\n        .withColumn(\n            \"traitFromSourceMappedIds\",\n            f.coalesce(\n                f.col(\"updatedtraitFromSourceMappedIds\"),\n                f.col(\"traitFromSourceMappedIds\"),\n            ),\n        )\n        .select(self._df.columns)\n    )\n\n    return self\n
"},{"location":"python_api/datasource/gwas_catalog/study_splitter/","title":"Study Splitter","text":""},{"location":"python_api/datasource/gwas_catalog/study_splitter/#otg.datasource.gwas_catalog.study_splitter.GWASCatalogStudySplitter","title":"otg.datasource.gwas_catalog.study_splitter.GWASCatalogStudySplitter","text":"

Splitting multi-trait GWAS Catalog studies.

Source code in src/otg/datasource/gwas_catalog/study_splitter.py
class GWASCatalogStudySplitter:\n    \"\"\"Splitting multi-trait GWAS Catalog studies.\"\"\"\n\n    @staticmethod\n    def _resolve_trait(\n        study_trait: Column, association_trait: Column, p_value_text: Column\n    ) -> Column:\n        \"\"\"Resolve trait names by consolidating association-level and study-level trait names.\n\n        Args:\n            study_trait (Column): Study-level trait name.\n            association_trait (Column): Association-level trait name.\n            p_value_text (Column): P-value text.\n\n        Returns:\n            Column: Resolved trait name.\n        \"\"\"\n        return (\n            f.when(\n                (p_value_text.isNotNull()) & (p_value_text != (\"no_pvalue_text\")),\n                f.concat(\n                    association_trait,\n                    f.lit(\" [\"),\n                    p_value_text,\n                    f.lit(\"]\"),\n                ),\n            )\n            .when(\n                association_trait.isNotNull(),\n                association_trait,\n            )\n            .otherwise(study_trait)\n        )\n\n    @staticmethod\n    def _resolve_efo(association_efo: Column, study_efo: Column) -> Column:\n        \"\"\"Resolve EFOs by consolidating association-level and study-level EFOs.\n\n        Args:\n            association_efo (Column): EFO column from the association table.\n            study_efo (Column): EFO column from the study table.\n\n        Returns:\n            Column: Consolidated EFO column.\n        \"\"\"\n        return f.coalesce(f.split(association_efo, r\"\\/\"), study_efo)\n\n    @staticmethod\n    def _resolve_study_id(study_id: Column, sub_study_description: Column) -> Column:\n        \"\"\"Resolve study IDs by exploding association-level information (e.g. 
pvalue_text, EFO).\n\n        Args:\n            study_id (Column): Study ID column.\n            sub_study_description (Column): Sub-study description column from the association table.\n\n        Returns:\n            Column: Resolved study ID column.\n        \"\"\"\n        split_w = Window.partitionBy(study_id).orderBy(sub_study_description)\n        row_number = f.dense_rank().over(split_w)\n        substudy_count = f.count(row_number).over(split_w)\n        return f.when(substudy_count == 1, study_id).otherwise(\n            f.concat_ws(\"_\", study_id, row_number)\n        )\n\n    @classmethod\n    def split(\n        cls: type[GWASCatalogStudySplitter],\n        studies: GWASCatalogStudyIndex,\n        associations: GWASCatalogAssociations,\n    ) -> Tuple[GWASCatalogStudyIndex, GWASCatalogAssociations]:\n        \"\"\"Splitting multi-trait GWAS Catalog studies.\n\n        If assigned disease of the study and the association don't agree, we assume the study needs to be split.\n        Then disease EFOs, trait names and study ID are consolidated\n\n        Args:\n            studies (GWASCatalogStudyIndex): GWAS Catalog studies.\n            associations (GWASCatalogAssociations): GWAS Catalog associations.\n\n        Returns:\n            Tuple[GWASCatalogStudyIndex, GWASCatalogAssociations]: Split studies and associations.\n        \"\"\"\n        # Composite of studies and associations to resolve scattered information\n        st_ass = (\n            associations.df.join(f.broadcast(studies.df), on=\"studyId\", how=\"inner\")\n            .select(\n                \"studyId\",\n                \"subStudyDescription\",\n                cls._resolve_study_id(\n                    f.col(\"studyId\"), f.col(\"subStudyDescription\")\n                ).alias(\"updatedStudyId\"),\n                cls._resolve_trait(\n                    f.col(\"traitFromSource\"),\n                    f.split(\"subStudyDescription\", r\"\\|\").getItem(0),\n                    f.split(\"subStudyDescription\", r\"\\|\").getItem(1),\n                ).alias(\"traitFromSource\"),\n                cls._resolve_efo(\n                    f.split(\"subStudyDescription\", r\"\\|\").getItem(2),\n                    f.col(\"traitFromSourceMappedIds\"),\n                ).alias(\"traitFromSourceMappedIds\"),\n            )\n            .persist()\n        )\n\n        return (\n            studies.update_study_id(\n                st_ass.select(\n                    \"studyId\",\n                    \"updatedStudyId\",\n                    \"traitFromSource\",\n                    \"traitFromSourceMappedIds\",\n                ).distinct()\n            ),\n            associations.update_study_id(\n                st_ass.select(\n                    \"updatedStudyId\", \"studyId\", \"subStudyDescription\"\n                ).distinct()\n            )._qc_ambiguous_study(),\n        )\n
"},{"location":"python_api/datasource/gwas_catalog/study_splitter/#otg.datasource.gwas_catalog.study_splitter.GWASCatalogStudySplitter.split","title":"split(studies: GWASCatalogStudyIndex, associations: GWASCatalogAssociations) -> Tuple[GWASCatalogStudyIndex, GWASCatalogAssociations] classmethod","text":"

Splitting multi-trait GWAS Catalog studies.

If the disease assigned to the study and to the association don't agree, we assume the study needs to be split. The disease EFOs, trait names and study IDs are then consolidated.

Parameters:

    studies (GWASCatalogStudyIndex): GWAS Catalog studies. Required.
    associations (GWASCatalogAssociations): GWAS Catalog associations. Required.

Returns:

    Tuple[GWASCatalogStudyIndex, GWASCatalogAssociations]: Split studies and associations.
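
A short sketch of tying the two datasets together, assuming `study_index` and `study_locus` were built as in the earlier sketches:

    from otg.datasource.gwas_catalog.study_splitter import GWASCatalogStudySplitter

    # Splits multi-trait studies, consolidates study IDs, traits and EFOs on both sides,
    # and flags associations that map ambiguously to more than one study.
    study_index, study_locus = GWASCatalogStudySplitter.split(study_index, study_locus)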

Source code in src/otg/datasource/gwas_catalog/study_splitter.py
@classmethod\ndef split(\n    cls: type[GWASCatalogStudySplitter],\n    studies: GWASCatalogStudyIndex,\n    associations: GWASCatalogAssociations,\n) -> Tuple[GWASCatalogStudyIndex, GWASCatalogAssociations]:\n    \"\"\"Splitting multi-trait GWAS Catalog studies.\n\n    If assigned disease of the study and the association don't agree, we assume the study needs to be split.\n    Then disease EFOs, trait names and study ID are consolidated\n\n    Args:\n        studies (GWASCatalogStudyIndex): GWAS Catalog studies.\n        associations (GWASCatalogAssociations): GWAS Catalog associations.\n\n    Returns:\n        Tuple[GWASCatalogStudyIndex, GWASCatalogAssociations]: Split studies and associations.\n    \"\"\"\n    # Composite of studies and associations to resolve scattered information\n    st_ass = (\n        associations.df.join(f.broadcast(studies.df), on=\"studyId\", how=\"inner\")\n        .select(\n            \"studyId\",\n            \"subStudyDescription\",\n            cls._resolve_study_id(\n                f.col(\"studyId\"), f.col(\"subStudyDescription\")\n            ).alias(\"updatedStudyId\"),\n            cls._resolve_trait(\n                f.col(\"traitFromSource\"),\n                f.split(\"subStudyDescription\", r\"\\|\").getItem(0),\n                f.split(\"subStudyDescription\", r\"\\|\").getItem(1),\n            ).alias(\"traitFromSource\"),\n            cls._resolve_efo(\n                f.split(\"subStudyDescription\", r\"\\|\").getItem(2),\n                f.col(\"traitFromSourceMappedIds\"),\n            ).alias(\"traitFromSourceMappedIds\"),\n        )\n        .persist()\n    )\n\n    return (\n        studies.update_study_id(\n            st_ass.select(\n                \"studyId\",\n                \"updatedStudyId\",\n                \"traitFromSource\",\n                \"traitFromSourceMappedIds\",\n            ).distinct()\n        ),\n        associations.update_study_id(\n            st_ass.select(\n                \"updatedStudyId\", \"studyId\", \"subStudyDescription\"\n            ).distinct()\n        )._qc_ambiguous_study(),\n    )\n
"},{"location":"python_api/datasource/gwas_catalog/summary_statistics/","title":"Summary statistics","text":""},{"location":"python_api/datasource/gwas_catalog/summary_statistics/#otg.datasource.gwas_catalog.summary_statistics.GWASCatalogSummaryStatistics","title":"otg.datasource.gwas_catalog.summary_statistics.GWASCatalogSummaryStatistics dataclass","text":"

Bases: SummaryStatistics

GWAS Catalog Summary Statistics reader.

Source code in src/otg/datasource/gwas_catalog/summary_statistics.py
@dataclass\nclass GWASCatalogSummaryStatistics(SummaryStatistics):\n    \"\"\"GWAS Catalog Summary Statistics reader.\"\"\"\n\n    @classmethod\n    def from_gwas_harmonized_summary_stats(\n        cls: type[GWASCatalogSummaryStatistics],\n        sumstats_df: DataFrame,\n        study_id: str,\n    ) -> GWASCatalogSummaryStatistics:\n        \"\"\"Create summary statistics object from summary statistics flatfile, harmonized by the GWAS Catalog.\n\n        Args:\n            sumstats_df (DataFrame): Harmonized dataset read as a spark dataframe from GWAS Catalog.\n            study_id (str): GWAS Catalog study accession.\n\n        Returns:\n            GWASCatalogSummaryStatistics: Summary statistics object.\n        \"\"\"\n        # The effect allele frequency is an optional column, we have to test if it is there:\n        allele_frequency_expression = (\n            f.col(\"hm_effect_allele_frequency\").cast(t.FloatType())\n            if \"hm_effect_allele_frequency\" in sumstats_df.columns\n            else f.lit(None)\n        )\n\n        # Processing columns of interest:\n        processed_sumstats_df = (\n            sumstats_df\n            # Dropping rows which doesn't have proper position:\n            .filter(f.col(\"hm_pos\").cast(t.IntegerType()).isNotNull())\n            .select(\n                # Adding study identifier:\n                f.lit(study_id).cast(t.StringType()).alias(\"studyId\"),\n                # Adding variant identifier:\n                f.col(\"hm_variant_id\").alias(\"variantId\"),\n                f.col(\"hm_chrom\").alias(\"chromosome\"),\n                f.col(\"hm_pos\").cast(t.IntegerType()).alias(\"position\"),\n                # Parsing p-value mantissa and exponent:\n                *parse_pvalue(f.col(\"p_value\")),\n                # Converting/calculating effect and confidence interval:\n                *convert_odds_ratio_to_beta(\n                    f.col(\"hm_beta\").cast(t.DoubleType()),\n                    f.col(\"hm_odds_ratio\").cast(t.DoubleType()),\n                    f.col(\"standard_error\").cast(t.DoubleType()),\n                ),\n                allele_frequency_expression.alias(\"effectAlleleFrequencyFromSource\"),\n            )\n            # The previous select expression generated the necessary fields for calculating the confidence intervals:\n            .select(\n                \"*\",\n                *calculate_confidence_interval(\n                    f.col(\"pValueMantissa\"),\n                    f.col(\"pValueExponent\"),\n                    f.col(\"beta\"),\n                    f.col(\"standardError\"),\n                ),\n            )\n            .repartition(200, \"chromosome\")\n            .sortWithinPartitions(\"position\")\n        )\n\n        # Initializing summary statistics object:\n        return cls(\n            _df=processed_sumstats_df,\n            _schema=cls.get_schema(),\n        )\n
"},{"location":"python_api/datasource/gwas_catalog/summary_statistics/#otg.datasource.gwas_catalog.summary_statistics.GWASCatalogSummaryStatistics.from_gwas_harmonized_summary_stats","title":"from_gwas_harmonized_summary_stats(sumstats_df: DataFrame, study_id: str) -> GWASCatalogSummaryStatistics classmethod","text":"

Create a summary statistics object from a summary statistics flat file harmonized by the GWAS Catalog.

Parameters:

    sumstats_df (DataFrame): Harmonized dataset read as a Spark dataframe from GWAS Catalog. Required.
    study_id (str): GWAS Catalog study accession. Required.

Returns:

    GWASCatalogSummaryStatistics: Summary statistics object.
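
A sketch of reading a single harmonised summary statistics file, assuming an active `spark` session; the path and study accession are placeholders:

    from otg.datasource.gwas_catalog.summary_statistics import GWASCatalogSummaryStatistics

    # Harmonised GWAS Catalog summary statistics are tab-separated files with a header row.
    sumstats_df = spark.read.csv(
        "GCST000001.h.tsv.gz",  # placeholder path to one harmonised file
        sep="\t",
        header=True,
    )

    summary_statistics = GWASCatalogSummaryStatistics.from_gwas_harmonized_summary_stats(
        sumstats_df=sumstats_df,
        study_id="GCST000001",  # placeholder accession
    )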

Source code in src/otg/datasource/gwas_catalog/summary_statistics.py
@classmethod\ndef from_gwas_harmonized_summary_stats(\n    cls: type[GWASCatalogSummaryStatistics],\n    sumstats_df: DataFrame,\n    study_id: str,\n) -> GWASCatalogSummaryStatistics:\n    \"\"\"Create summary statistics object from summary statistics flatfile, harmonized by the GWAS Catalog.\n\n    Args:\n        sumstats_df (DataFrame): Harmonized dataset read as a spark dataframe from GWAS Catalog.\n        study_id (str): GWAS Catalog study accession.\n\n    Returns:\n        GWASCatalogSummaryStatistics: Summary statistics object.\n    \"\"\"\n    # The effect allele frequency is an optional column, we have to test if it is there:\n    allele_frequency_expression = (\n        f.col(\"hm_effect_allele_frequency\").cast(t.FloatType())\n        if \"hm_effect_allele_frequency\" in sumstats_df.columns\n        else f.lit(None)\n    )\n\n    # Processing columns of interest:\n    processed_sumstats_df = (\n        sumstats_df\n        # Dropping rows which doesn't have proper position:\n        .filter(f.col(\"hm_pos\").cast(t.IntegerType()).isNotNull())\n        .select(\n            # Adding study identifier:\n            f.lit(study_id).cast(t.StringType()).alias(\"studyId\"),\n            # Adding variant identifier:\n            f.col(\"hm_variant_id\").alias(\"variantId\"),\n            f.col(\"hm_chrom\").alias(\"chromosome\"),\n            f.col(\"hm_pos\").cast(t.IntegerType()).alias(\"position\"),\n            # Parsing p-value mantissa and exponent:\n            *parse_pvalue(f.col(\"p_value\")),\n            # Converting/calculating effect and confidence interval:\n            *convert_odds_ratio_to_beta(\n                f.col(\"hm_beta\").cast(t.DoubleType()),\n                f.col(\"hm_odds_ratio\").cast(t.DoubleType()),\n                f.col(\"standard_error\").cast(t.DoubleType()),\n            ),\n            allele_frequency_expression.alias(\"effectAlleleFrequencyFromSource\"),\n        )\n        # The previous select expression generated the necessary fields for calculating the confidence intervals:\n        .select(\n            \"*\",\n            *calculate_confidence_interval(\n                f.col(\"pValueMantissa\"),\n                f.col(\"pValueExponent\"),\n                f.col(\"beta\"),\n                f.col(\"standardError\"),\n            ),\n        )\n        .repartition(200, \"chromosome\")\n        .sortWithinPartitions(\"position\")\n    )\n\n    # Initializing summary statistics object:\n    return cls(\n        _df=processed_sumstats_df,\n        _schema=cls.get_schema(),\n    )\n
"},{"location":"python_api/datasource/intervals/_intervals/","title":"Chromatin intervals","text":"

TBC

"},{"location":"python_api/datasource/intervals/andersson/","title":"Andersson et al.","text":""},{"location":"python_api/datasource/intervals/andersson/#otg.datasource.intervals.andersson.IntervalsAndersson","title":"otg.datasource.intervals.andersson.IntervalsAndersson","text":"

Bases: Intervals

Interval dataset from Andersson et al. 2014.

Source code in src/otg/datasource/intervals/andersson.py
class IntervalsAndersson(Intervals):\n    \"\"\"Interval dataset from Andersson et al. 2014.\"\"\"\n\n    @staticmethod\n    def read(spark: SparkSession, path: str) -> DataFrame:\n        \"\"\"Read andersson2014 dataset.\n\n        Args:\n            spark (SparkSession): Spark session\n            path (str): Path to the dataset\n\n        Returns:\n            DataFrame: Raw Andersson et al. dataframe\n        \"\"\"\n        input_schema = t.StructType.fromJson(\n            json.loads(\n                pkg_resources.read_text(schemas, \"andersson2014.json\", encoding=\"utf-8\")\n            )\n        )\n        return (\n            spark.read.option(\"delimiter\", \"\\t\")\n            .option(\"mode\", \"DROPMALFORMED\")\n            .option(\"header\", \"true\")\n            .schema(input_schema)\n            .csv(path)\n        )\n\n    @classmethod\n    def parse(\n        cls: type[IntervalsAndersson],\n        raw_anderson_df: DataFrame,\n        gene_index: GeneIndex,\n        lift: LiftOverSpark,\n    ) -> Intervals:\n        \"\"\"Parse Andersson et al. 2014 dataset.\n\n        Args:\n            raw_anderson_df (DataFrame): Raw Andersson et al. dataset\n            gene_index (GeneIndex): Gene index\n            lift (LiftOverSpark): LiftOverSpark instance\n\n        Returns:\n            Intervals: Intervals dataset\n        \"\"\"\n        # Constant values:\n        dataset_name = \"andersson2014\"\n        experiment_type = \"fantom5\"\n        pmid = \"24670763\"\n        bio_feature = \"aggregate\"\n        twosided_threshold = 2.45e6  # <-  this needs to phased out. Filter by percentile instead of absolute value.\n\n        # Read the anderson file:\n        parsed_anderson_df = (\n            raw_anderson_df\n            # Parsing score column and casting as float:\n            .withColumn(\"score\", f.col(\"score\").cast(\"float\") / f.lit(1000))\n            # Parsing the 'name' column:\n            .withColumn(\"parsedName\", f.split(f.col(\"name\"), \";\"))\n            .withColumn(\"gene_symbol\", f.col(\"parsedName\")[2])\n            .withColumn(\"location\", f.col(\"parsedName\")[0])\n            .withColumn(\n                \"chrom\",\n                f.regexp_replace(f.split(f.col(\"location\"), \":|-\")[0], \"chr\", \"\"),\n            )\n            .withColumn(\n                \"start\", f.split(f.col(\"location\"), \":|-\")[1].cast(t.IntegerType())\n            )\n            .withColumn(\n                \"end\", f.split(f.col(\"location\"), \":|-\")[2].cast(t.IntegerType())\n            )\n            # Select relevant columns:\n            .select(\"chrom\", \"start\", \"end\", \"gene_symbol\", \"score\")\n            # Drop rows with non-canonical chromosomes:\n            .filter(\n                f.col(\"chrom\").isin([str(x) for x in range(1, 23)] + [\"X\", \"Y\", \"MT\"])\n            )\n            # For each region/gene, keep only one row with the highest score:\n            .groupBy(\"chrom\", \"start\", \"end\", \"gene_symbol\")\n            .agg(f.max(\"score\").alias(\"resourceScore\"))\n            .orderBy(\"chrom\", \"start\")\n        )\n\n        return cls(\n            _df=(\n                # Lift over the intervals:\n                lift.convert_intervals(parsed_anderson_df, \"chrom\", \"start\", \"end\")\n                .drop(\"start\", \"end\")\n                .withColumnRenamed(\"mapped_start\", \"start\")\n                .withColumnRenamed(\"mapped_end\", \"end\")\n                .distinct()\n                # Joining 
with the gene index\n                .alias(\"intervals\")\n                .join(\n                    gene_index.symbols_lut().alias(\"genes\"),\n                    on=[\n                        f.col(\"intervals.gene_symbol\") == f.col(\"genes.geneSymbol\"),\n                        # Drop rows where the TSS is far from the start of the region\n                        f.abs(\n                            (f.col(\"intervals.start\") + f.col(\"intervals.end\")) / 2\n                            - f.col(\"tss\")\n                        )\n                        <= twosided_threshold,\n                    ],\n                    how=\"left\",\n                )\n                # Select relevant columns:\n                .select(\n                    f.col(\"chrom\").alias(\"chromosome\"),\n                    f.col(\"intervals.start\").alias(\"start\"),\n                    f.col(\"intervals.end\").alias(\"end\"),\n                    \"geneId\",\n                    \"resourceScore\",\n                    f.lit(dataset_name).alias(\"datasourceId\"),\n                    f.lit(experiment_type).alias(\"datatypeId\"),\n                    f.lit(pmid).alias(\"pmid\"),\n                    f.lit(bio_feature).alias(\"biofeature\"),\n                )\n            ),\n            _schema=Intervals.get_schema(),\n        )\n
"},{"location":"python_api/datasource/intervals/andersson/#otg.datasource.intervals.andersson.IntervalsAndersson.parse","title":"parse(raw_anderson_df: DataFrame, gene_index: GeneIndex, lift: LiftOverSpark) -> Intervals classmethod","text":"

Parse Andersson et al. 2014 dataset.

Parameters:

  • raw_anderson_df (DataFrame): Raw Andersson et al. dataset. Required.
  • gene_index (GeneIndex): Gene index. Required.
  • lift (LiftOverSpark): LiftOverSpark instance. Required.

Returns:

  Intervals: Intervals dataset.

Source code in src/otg/datasource/intervals/andersson.py
@classmethod\ndef parse(\n    cls: type[IntervalsAndersson],\n    raw_anderson_df: DataFrame,\n    gene_index: GeneIndex,\n    lift: LiftOverSpark,\n) -> Intervals:\n    \"\"\"Parse Andersson et al. 2014 dataset.\n\n    Args:\n        raw_anderson_df (DataFrame): Raw Andersson et al. dataset\n        gene_index (GeneIndex): Gene index\n        lift (LiftOverSpark): LiftOverSpark instance\n\n    Returns:\n        Intervals: Intervals dataset\n    \"\"\"\n    # Constant values:\n    dataset_name = \"andersson2014\"\n    experiment_type = \"fantom5\"\n    pmid = \"24670763\"\n    bio_feature = \"aggregate\"\n    twosided_threshold = 2.45e6  # <-  this needs to phased out. Filter by percentile instead of absolute value.\n\n    # Read the anderson file:\n    parsed_anderson_df = (\n        raw_anderson_df\n        # Parsing score column and casting as float:\n        .withColumn(\"score\", f.col(\"score\").cast(\"float\") / f.lit(1000))\n        # Parsing the 'name' column:\n        .withColumn(\"parsedName\", f.split(f.col(\"name\"), \";\"))\n        .withColumn(\"gene_symbol\", f.col(\"parsedName\")[2])\n        .withColumn(\"location\", f.col(\"parsedName\")[0])\n        .withColumn(\n            \"chrom\",\n            f.regexp_replace(f.split(f.col(\"location\"), \":|-\")[0], \"chr\", \"\"),\n        )\n        .withColumn(\n            \"start\", f.split(f.col(\"location\"), \":|-\")[1].cast(t.IntegerType())\n        )\n        .withColumn(\n            \"end\", f.split(f.col(\"location\"), \":|-\")[2].cast(t.IntegerType())\n        )\n        # Select relevant columns:\n        .select(\"chrom\", \"start\", \"end\", \"gene_symbol\", \"score\")\n        # Drop rows with non-canonical chromosomes:\n        .filter(\n            f.col(\"chrom\").isin([str(x) for x in range(1, 23)] + [\"X\", \"Y\", \"MT\"])\n        )\n        # For each region/gene, keep only one row with the highest score:\n        .groupBy(\"chrom\", \"start\", \"end\", \"gene_symbol\")\n        .agg(f.max(\"score\").alias(\"resourceScore\"))\n        .orderBy(\"chrom\", \"start\")\n    )\n\n    return cls(\n        _df=(\n            # Lift over the intervals:\n            lift.convert_intervals(parsed_anderson_df, \"chrom\", \"start\", \"end\")\n            .drop(\"start\", \"end\")\n            .withColumnRenamed(\"mapped_start\", \"start\")\n            .withColumnRenamed(\"mapped_end\", \"end\")\n            .distinct()\n            # Joining with the gene index\n            .alias(\"intervals\")\n            .join(\n                gene_index.symbols_lut().alias(\"genes\"),\n                on=[\n                    f.col(\"intervals.gene_symbol\") == f.col(\"genes.geneSymbol\"),\n                    # Drop rows where the TSS is far from the start of the region\n                    f.abs(\n                        (f.col(\"intervals.start\") + f.col(\"intervals.end\")) / 2\n                        - f.col(\"tss\")\n                    )\n                    <= twosided_threshold,\n                ],\n                how=\"left\",\n            )\n            # Select relevant columns:\n            .select(\n                f.col(\"chrom\").alias(\"chromosome\"),\n                f.col(\"intervals.start\").alias(\"start\"),\n                f.col(\"intervals.end\").alias(\"end\"),\n                \"geneId\",\n                \"resourceScore\",\n                f.lit(dataset_name).alias(\"datasourceId\"),\n                f.lit(experiment_type).alias(\"datatypeId\"),\n                f.lit(pmid).alias(\"pmid\"),\n      
          f.lit(bio_feature).alias(\"biofeature\"),\n            )\n        ),\n        _schema=Intervals.get_schema(),\n    )\n
"},{"location":"python_api/datasource/intervals/andersson/#otg.datasource.intervals.andersson.IntervalsAndersson.read","title":"read(spark: SparkSession, path: str) -> DataFrame staticmethod","text":"

Read andersson2014 dataset.

Parameters:

  • spark (SparkSession): Spark session. Required.
  • path (str): Path to the dataset. Required.

Returns:

  DataFrame: Raw Andersson et al. dataframe.

Source code in src/otg/datasource/intervals/andersson.py
@staticmethod\ndef read(spark: SparkSession, path: str) -> DataFrame:\n    \"\"\"Read andersson2014 dataset.\n\n    Args:\n        spark (SparkSession): Spark session\n        path (str): Path to the dataset\n\n    Returns:\n        DataFrame: Raw Andersson et al. dataframe\n    \"\"\"\n    input_schema = t.StructType.fromJson(\n        json.loads(\n            pkg_resources.read_text(schemas, \"andersson2014.json\", encoding=\"utf-8\")\n        )\n    )\n    return (\n        spark.read.option(\"delimiter\", \"\\t\")\n        .option(\"mode\", \"DROPMALFORMED\")\n        .option(\"header\", \"true\")\n        .schema(input_schema)\n        .csv(path)\n    )\n
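A minimal usage sketch combining `read` and `parse`. The input path is hypothetical, and `gene_index` (GeneIndex) and `lift` (LiftOverSpark) are assumed to have been built by upstream steps, so their construction is not shown:

```python
from pyspark.sql import SparkSession

from otg.datasource.intervals.andersson import IntervalsAndersson

spark = SparkSession.builder.getOrCreate()

# Hypothetical path to the Andersson et al. 2014 enhancer-TSS association file:
andersson_path = "gs://my-bucket/intervals/andersson2014.bed.gz"

raw_intervals = IntervalsAndersson.read(spark, andersson_path)

# `gene_index` and `lift` are assumed to come from earlier pipeline steps:
intervals = IntervalsAndersson.parse(raw_intervals, gene_index, lift)

intervals.df.show(5)
```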
"},{"location":"python_api/datasource/intervals/javierre/","title":"Javierre et al.","text":""},{"location":"python_api/datasource/intervals/javierre/#otg.datasource.intervals.javierre.IntervalsJavierre","title":"otg.datasource.intervals.javierre.IntervalsJavierre","text":"

Bases: Intervals

Interval dataset from Javierre et al. 2016.

Source code in src/otg/datasource/intervals/javierre.py
class IntervalsJavierre(Intervals):\n    \"\"\"Interval dataset from Javierre et al. 2016.\"\"\"\n\n    @staticmethod\n    def read(spark: SparkSession, path: str) -> DataFrame:\n        \"\"\"Read Javierre dataset.\n\n        Args:\n            spark (SparkSession): Spark session\n            path (str): Path to dataset\n\n        Returns:\n            DataFrame: Raw Javierre dataset\n        \"\"\"\n        return spark.read.parquet(path)\n\n    @classmethod\n    def parse(\n        cls: type[IntervalsJavierre],\n        javierre_raw: DataFrame,\n        gene_index: GeneIndex,\n        lift: LiftOverSpark,\n    ) -> Intervals:\n        \"\"\"Parse Javierre et al. 2016 dataset.\n\n        Args:\n            javierre_raw (DataFrame): Raw Javierre data\n            gene_index (GeneIndex): Gene index\n            lift (LiftOverSpark): LiftOverSpark instance\n\n        Returns:\n            Intervals: Javierre et al. 2016 interval data\n        \"\"\"\n        # Constant values:\n        dataset_name = \"javierre2016\"\n        experiment_type = \"pchic\"\n        pmid = \"27863249\"\n        twosided_threshold = 2.45e6\n\n        # Read Javierre data:\n        javierre_parsed = (\n            javierre_raw\n            # Splitting name column into chromosome, start, end, and score:\n            .withColumn(\"name_split\", f.split(f.col(\"name\"), r\":|-|,\"))\n            .withColumn(\n                \"name_chr\",\n                f.regexp_replace(f.col(\"name_split\")[0], \"chr\", \"\").cast(\n                    t.StringType()\n                ),\n            )\n            .withColumn(\"name_start\", f.col(\"name_split\")[1].cast(t.IntegerType()))\n            .withColumn(\"name_end\", f.col(\"name_split\")[2].cast(t.IntegerType()))\n            .withColumn(\"name_score\", f.col(\"name_split\")[3].cast(t.FloatType()))\n            # Cleaning up chromosome:\n            .withColumn(\n                \"chrom\",\n                f.regexp_replace(f.col(\"chrom\"), \"chr\", \"\").cast(t.StringType()),\n            )\n            .drop(\"name_split\", \"name\", \"annotation\")\n            # Keep canonical chromosomes and consistent chromosomes with scores:\n            .filter(\n                (f.col(\"name_score\").isNotNull())\n                & (f.col(\"chrom\") == f.col(\"name_chr\"))\n                & f.col(\"name_chr\").isin(\n                    [f\"{x}\" for x in range(1, 23)] + [\"X\", \"Y\", \"MT\"]\n                )\n            )\n        )\n\n        # Lifting over intervals:\n        javierre_remapped = (\n            javierre_parsed\n            # Lifting over to GRCh38 interval 1:\n            .transform(lambda df: lift.convert_intervals(df, \"chrom\", \"start\", \"end\"))\n            .drop(\"start\", \"end\")\n            .withColumnRenamed(\"mapped_chrom\", \"chrom\")\n            .withColumnRenamed(\"mapped_start\", \"start\")\n            .withColumnRenamed(\"mapped_end\", \"end\")\n            # Lifting over interval 2 to GRCh38:\n            .transform(\n                lambda df: lift.convert_intervals(\n                    df, \"name_chr\", \"name_start\", \"name_end\"\n                )\n            )\n            .drop(\"name_start\", \"name_end\")\n            .withColumnRenamed(\"mapped_name_chr\", \"name_chr\")\n            .withColumnRenamed(\"mapped_name_start\", \"name_start\")\n            .withColumnRenamed(\"mapped_name_end\", \"name_end\")\n        )\n\n        # Once the intervals are lifted, extracting the unique intervals:\n        
unique_intervals_with_genes = (\n            javierre_remapped.select(\n                f.col(\"chrom\"),\n                f.col(\"start\").cast(t.IntegerType()),\n                f.col(\"end\").cast(t.IntegerType()),\n            )\n            .distinct()\n            .alias(\"intervals\")\n            .join(\n                gene_index.locations_lut().alias(\"genes\"),\n                on=[\n                    f.col(\"intervals.chrom\") == f.col(\"genes.chromosome\"),\n                    (\n                        (f.col(\"intervals.start\") >= f.col(\"genes.start\"))\n                        & (f.col(\"intervals.start\") <= f.col(\"genes.end\"))\n                    )\n                    | (\n                        (f.col(\"intervals.end\") >= f.col(\"genes.start\"))\n                        & (f.col(\"intervals.end\") <= f.col(\"genes.end\"))\n                    ),\n                ],\n                how=\"left\",\n            )\n            .select(\n                f.col(\"intervals.chrom\").alias(\"chrom\"),\n                f.col(\"intervals.start\").alias(\"start\"),\n                f.col(\"intervals.end\").alias(\"end\"),\n                f.col(\"genes.geneId\").alias(\"geneId\"),\n                f.col(\"genes.tss\").alias(\"tss\"),\n            )\n        )\n\n        # Joining back the data:\n        return cls(\n            _df=(\n                javierre_remapped.join(\n                    unique_intervals_with_genes,\n                    on=[\"chrom\", \"start\", \"end\"],\n                    how=\"left\",\n                )\n                .filter(\n                    # Drop rows where the TSS is far from the start of the region\n                    f.abs((f.col(\"start\") + f.col(\"end\")) / 2 - f.col(\"tss\"))\n                    <= twosided_threshold\n                )\n                # For each gene, keep only the highest scoring interval:\n                .groupBy(\"name_chr\", \"name_start\", \"name_end\", \"geneId\", \"bio_feature\")\n                .agg(f.max(f.col(\"name_score\")).alias(\"resourceScore\"))\n                # Create the output:\n                .select(\n                    f.col(\"name_chr\").alias(\"chromosome\"),\n                    f.col(\"name_start\").alias(\"start\"),\n                    f.col(\"name_end\").alias(\"end\"),\n                    f.col(\"resourceScore\").cast(t.DoubleType()),\n                    f.col(\"geneId\"),\n                    f.col(\"bio_feature\").alias(\"biofeature\"),\n                    f.lit(dataset_name).alias(\"datasourceId\"),\n                    f.lit(experiment_type).alias(\"datatypeId\"),\n                    f.lit(pmid).alias(\"pmid\"),\n                )\n            ),\n            _schema=Intervals.get_schema(),\n        )\n
"},{"location":"python_api/datasource/intervals/javierre/#otg.datasource.intervals.javierre.IntervalsJavierre.parse","title":"parse(javierre_raw: DataFrame, gene_index: GeneIndex, lift: LiftOverSpark) -> Intervals classmethod","text":"

Parse Javierre et al. 2016 dataset.

Parameters:

  • javierre_raw (DataFrame): Raw Javierre data. Required.
  • gene_index (GeneIndex): Gene index. Required.
  • lift (LiftOverSpark): LiftOverSpark instance. Required.

Returns:

  Intervals: Javierre et al. 2016 interval data.

Source code in src/otg/datasource/intervals/javierre.py
@classmethod\ndef parse(\n    cls: type[IntervalsJavierre],\n    javierre_raw: DataFrame,\n    gene_index: GeneIndex,\n    lift: LiftOverSpark,\n) -> Intervals:\n    \"\"\"Parse Javierre et al. 2016 dataset.\n\n    Args:\n        javierre_raw (DataFrame): Raw Javierre data\n        gene_index (GeneIndex): Gene index\n        lift (LiftOverSpark): LiftOverSpark instance\n\n    Returns:\n        Intervals: Javierre et al. 2016 interval data\n    \"\"\"\n    # Constant values:\n    dataset_name = \"javierre2016\"\n    experiment_type = \"pchic\"\n    pmid = \"27863249\"\n    twosided_threshold = 2.45e6\n\n    # Read Javierre data:\n    javierre_parsed = (\n        javierre_raw\n        # Splitting name column into chromosome, start, end, and score:\n        .withColumn(\"name_split\", f.split(f.col(\"name\"), r\":|-|,\"))\n        .withColumn(\n            \"name_chr\",\n            f.regexp_replace(f.col(\"name_split\")[0], \"chr\", \"\").cast(\n                t.StringType()\n            ),\n        )\n        .withColumn(\"name_start\", f.col(\"name_split\")[1].cast(t.IntegerType()))\n        .withColumn(\"name_end\", f.col(\"name_split\")[2].cast(t.IntegerType()))\n        .withColumn(\"name_score\", f.col(\"name_split\")[3].cast(t.FloatType()))\n        # Cleaning up chromosome:\n        .withColumn(\n            \"chrom\",\n            f.regexp_replace(f.col(\"chrom\"), \"chr\", \"\").cast(t.StringType()),\n        )\n        .drop(\"name_split\", \"name\", \"annotation\")\n        # Keep canonical chromosomes and consistent chromosomes with scores:\n        .filter(\n            (f.col(\"name_score\").isNotNull())\n            & (f.col(\"chrom\") == f.col(\"name_chr\"))\n            & f.col(\"name_chr\").isin(\n                [f\"{x}\" for x in range(1, 23)] + [\"X\", \"Y\", \"MT\"]\n            )\n        )\n    )\n\n    # Lifting over intervals:\n    javierre_remapped = (\n        javierre_parsed\n        # Lifting over to GRCh38 interval 1:\n        .transform(lambda df: lift.convert_intervals(df, \"chrom\", \"start\", \"end\"))\n        .drop(\"start\", \"end\")\n        .withColumnRenamed(\"mapped_chrom\", \"chrom\")\n        .withColumnRenamed(\"mapped_start\", \"start\")\n        .withColumnRenamed(\"mapped_end\", \"end\")\n        # Lifting over interval 2 to GRCh38:\n        .transform(\n            lambda df: lift.convert_intervals(\n                df, \"name_chr\", \"name_start\", \"name_end\"\n            )\n        )\n        .drop(\"name_start\", \"name_end\")\n        .withColumnRenamed(\"mapped_name_chr\", \"name_chr\")\n        .withColumnRenamed(\"mapped_name_start\", \"name_start\")\n        .withColumnRenamed(\"mapped_name_end\", \"name_end\")\n    )\n\n    # Once the intervals are lifted, extracting the unique intervals:\n    unique_intervals_with_genes = (\n        javierre_remapped.select(\n            f.col(\"chrom\"),\n            f.col(\"start\").cast(t.IntegerType()),\n            f.col(\"end\").cast(t.IntegerType()),\n        )\n        .distinct()\n        .alias(\"intervals\")\n        .join(\n            gene_index.locations_lut().alias(\"genes\"),\n            on=[\n                f.col(\"intervals.chrom\") == f.col(\"genes.chromosome\"),\n                (\n                    (f.col(\"intervals.start\") >= f.col(\"genes.start\"))\n                    & (f.col(\"intervals.start\") <= f.col(\"genes.end\"))\n                )\n                | (\n                    (f.col(\"intervals.end\") >= f.col(\"genes.start\"))\n                    & 
(f.col(\"intervals.end\") <= f.col(\"genes.end\"))\n                ),\n            ],\n            how=\"left\",\n        )\n        .select(\n            f.col(\"intervals.chrom\").alias(\"chrom\"),\n            f.col(\"intervals.start\").alias(\"start\"),\n            f.col(\"intervals.end\").alias(\"end\"),\n            f.col(\"genes.geneId\").alias(\"geneId\"),\n            f.col(\"genes.tss\").alias(\"tss\"),\n        )\n    )\n\n    # Joining back the data:\n    return cls(\n        _df=(\n            javierre_remapped.join(\n                unique_intervals_with_genes,\n                on=[\"chrom\", \"start\", \"end\"],\n                how=\"left\",\n            )\n            .filter(\n                # Drop rows where the TSS is far from the start of the region\n                f.abs((f.col(\"start\") + f.col(\"end\")) / 2 - f.col(\"tss\"))\n                <= twosided_threshold\n            )\n            # For each gene, keep only the highest scoring interval:\n            .groupBy(\"name_chr\", \"name_start\", \"name_end\", \"geneId\", \"bio_feature\")\n            .agg(f.max(f.col(\"name_score\")).alias(\"resourceScore\"))\n            # Create the output:\n            .select(\n                f.col(\"name_chr\").alias(\"chromosome\"),\n                f.col(\"name_start\").alias(\"start\"),\n                f.col(\"name_end\").alias(\"end\"),\n                f.col(\"resourceScore\").cast(t.DoubleType()),\n                f.col(\"geneId\"),\n                f.col(\"bio_feature\").alias(\"biofeature\"),\n                f.lit(dataset_name).alias(\"datasourceId\"),\n                f.lit(experiment_type).alias(\"datatypeId\"),\n                f.lit(pmid).alias(\"pmid\"),\n            )\n        ),\n        _schema=Intervals.get_schema(),\n    )\n
"},{"location":"python_api/datasource/intervals/javierre/#otg.datasource.intervals.javierre.IntervalsJavierre.read","title":"read(spark: SparkSession, path: str) -> DataFrame staticmethod","text":"

Read Javierre dataset.

Parameters:

  • spark (SparkSession): Spark session. Required.
  • path (str): Path to dataset. Required.

Returns:

  DataFrame: Raw Javierre dataset.

Source code in src/otg/datasource/intervals/javierre.py
@staticmethod\ndef read(spark: SparkSession, path: str) -> DataFrame:\n    \"\"\"Read Javierre dataset.\n\n    Args:\n        spark (SparkSession): Spark session\n        path (str): Path to dataset\n\n    Returns:\n        DataFrame: Raw Javierre dataset\n    \"\"\"\n    return spark.read.parquet(path)\n
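As with the Andersson parser above, a minimal sketch of reading and parsing this source; the Parquet location is hypothetical and `gene_index`/`lift` are assumed to be built upstream:

```python
from pyspark.sql import SparkSession

from otg.datasource.intervals.javierre import IntervalsJavierre

spark = SparkSession.builder.getOrCreate()

# Hypothetical path to the pre-processed Javierre et al. 2016 PCHi-C data (Parquet):
javierre_path = "gs://my-bucket/intervals/javierre2016.parquet"

javierre_raw = IntervalsJavierre.read(spark, javierre_path)

# `gene_index` and `lift` are assumed to come from earlier pipeline steps:
intervals = IntervalsJavierre.parse(javierre_raw, gene_index, lift)

intervals.df.show(5)
```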
"},{"location":"python_api/datasource/intervals/jung/","title":"Jung et al.","text":""},{"location":"python_api/datasource/intervals/jung/#otg.datasource.intervals.jung.IntervalsJung","title":"otg.datasource.intervals.jung.IntervalsJung","text":"

Bases: Intervals

Interval dataset from Jung et al. 2019.

Source code in src/otg/datasource/intervals/jung.py
class IntervalsJung(Intervals):\n    \"\"\"Interval dataset from Jung et al. 2019.\"\"\"\n\n    @staticmethod\n    def read(spark: SparkSession, path: str) -> DataFrame:\n        \"\"\"Read jung dataset.\n\n        Args:\n            spark (SparkSession): Spark session\n            path (str): Path to dataset\n\n        Returns:\n            DataFrame: DataFrame with raw jung data\n        \"\"\"\n        return spark.read.csv(path, sep=\",\", header=True)\n\n    @classmethod\n    def parse(\n        cls: type[IntervalsJung],\n        jung_raw: DataFrame,\n        gene_index: GeneIndex,\n        lift: LiftOverSpark,\n    ) -> Intervals:\n        \"\"\"Parse the Jung et al. 2019 dataset.\n\n        Args:\n            jung_raw (DataFrame): raw Jung et al. 2019 dataset\n            gene_index (GeneIndex): gene index\n            lift (LiftOverSpark): LiftOverSpark instance\n\n        Returns:\n            Intervals: Interval dataset containing Jung et al. 2019 data\n        \"\"\"\n        dataset_name = \"jung2019\"\n        experiment_type = \"pchic\"\n        pmid = \"31501517\"\n\n        # Lifting over the coordinates:\n        return cls(\n            _df=(\n                jung_raw.withColumn(\n                    \"interval\", f.split(f.col(\"Interacting_fragment\"), r\"\\.\")\n                )\n                .select(\n                    # Parsing intervals:\n                    f.regexp_replace(f.col(\"interval\")[0], \"chr\", \"\").alias(\"chrom\"),\n                    f.col(\"interval\")[1].cast(t.IntegerType()).alias(\"start\"),\n                    f.col(\"interval\")[2].cast(t.IntegerType()).alias(\"end\"),\n                    # Extract other columns:\n                    f.col(\"Promoter\").alias(\"gene_name\"),\n                    f.col(\"Tissue_type\").alias(\"tissue\"),\n                )\n                # Lifting over to GRCh38 interval 1:\n                .transform(\n                    lambda df: lift.convert_intervals(df, \"chrom\", \"start\", \"end\")\n                )\n                .select(\n                    \"chrom\",\n                    f.col(\"mapped_start\").alias(\"start\"),\n                    f.col(\"mapped_end\").alias(\"end\"),\n                    f.explode(f.split(f.col(\"gene_name\"), \";\")).alias(\"gene_name\"),\n                    \"tissue\",\n                )\n                .alias(\"intervals\")\n                # Joining with genes:\n                .join(\n                    gene_index.symbols_lut().alias(\"genes\"),\n                    on=[f.col(\"intervals.gene_name\") == f.col(\"genes.geneSymbol\")],\n                    how=\"inner\",\n                )\n                # Finalize dataset:\n                .select(\n                    \"chromosome\",\n                    f.col(\"intervals.start\").alias(\"start\"),\n                    f.col(\"intervals.end\").alias(\"end\"),\n                    \"geneId\",\n                    f.col(\"tissue\").alias(\"biofeature\"),\n                    f.lit(1.0).alias(\"score\"),\n                    f.lit(dataset_name).alias(\"datasourceId\"),\n                    f.lit(experiment_type).alias(\"datatypeId\"),\n                    f.lit(pmid).alias(\"pmid\"),\n                )\n                .drop_duplicates()\n            ),\n            _schema=Intervals.get_schema(),\n        )\n
"},{"location":"python_api/datasource/intervals/jung/#otg.datasource.intervals.jung.IntervalsJung.parse","title":"parse(jung_raw: DataFrame, gene_index: GeneIndex, lift: LiftOverSpark) -> Intervals classmethod","text":"

Parse the Jung et al. 2019 dataset.

Parameters:

  • jung_raw (DataFrame): Raw Jung et al. 2019 dataset. Required.
  • gene_index (GeneIndex): Gene index. Required.
  • lift (LiftOverSpark): LiftOverSpark instance. Required.

Returns:

  Intervals: Interval dataset containing Jung et al. 2019 data.

Source code in src/otg/datasource/intervals/jung.py
@classmethod\ndef parse(\n    cls: type[IntervalsJung],\n    jung_raw: DataFrame,\n    gene_index: GeneIndex,\n    lift: LiftOverSpark,\n) -> Intervals:\n    \"\"\"Parse the Jung et al. 2019 dataset.\n\n    Args:\n        jung_raw (DataFrame): raw Jung et al. 2019 dataset\n        gene_index (GeneIndex): gene index\n        lift (LiftOverSpark): LiftOverSpark instance\n\n    Returns:\n        Intervals: Interval dataset containing Jung et al. 2019 data\n    \"\"\"\n    dataset_name = \"jung2019\"\n    experiment_type = \"pchic\"\n    pmid = \"31501517\"\n\n    # Lifting over the coordinates:\n    return cls(\n        _df=(\n            jung_raw.withColumn(\n                \"interval\", f.split(f.col(\"Interacting_fragment\"), r\"\\.\")\n            )\n            .select(\n                # Parsing intervals:\n                f.regexp_replace(f.col(\"interval\")[0], \"chr\", \"\").alias(\"chrom\"),\n                f.col(\"interval\")[1].cast(t.IntegerType()).alias(\"start\"),\n                f.col(\"interval\")[2].cast(t.IntegerType()).alias(\"end\"),\n                # Extract other columns:\n                f.col(\"Promoter\").alias(\"gene_name\"),\n                f.col(\"Tissue_type\").alias(\"tissue\"),\n            )\n            # Lifting over to GRCh38 interval 1:\n            .transform(\n                lambda df: lift.convert_intervals(df, \"chrom\", \"start\", \"end\")\n            )\n            .select(\n                \"chrom\",\n                f.col(\"mapped_start\").alias(\"start\"),\n                f.col(\"mapped_end\").alias(\"end\"),\n                f.explode(f.split(f.col(\"gene_name\"), \";\")).alias(\"gene_name\"),\n                \"tissue\",\n            )\n            .alias(\"intervals\")\n            # Joining with genes:\n            .join(\n                gene_index.symbols_lut().alias(\"genes\"),\n                on=[f.col(\"intervals.gene_name\") == f.col(\"genes.geneSymbol\")],\n                how=\"inner\",\n            )\n            # Finalize dataset:\n            .select(\n                \"chromosome\",\n                f.col(\"intervals.start\").alias(\"start\"),\n                f.col(\"intervals.end\").alias(\"end\"),\n                \"geneId\",\n                f.col(\"tissue\").alias(\"biofeature\"),\n                f.lit(1.0).alias(\"score\"),\n                f.lit(dataset_name).alias(\"datasourceId\"),\n                f.lit(experiment_type).alias(\"datatypeId\"),\n                f.lit(pmid).alias(\"pmid\"),\n            )\n            .drop_duplicates()\n        ),\n        _schema=Intervals.get_schema(),\n    )\n
"},{"location":"python_api/datasource/intervals/jung/#otg.datasource.intervals.jung.IntervalsJung.read","title":"read(spark: SparkSession, path: str) -> DataFrame staticmethod","text":"

Read jung dataset.

Parameters:

  • spark (SparkSession): Spark session. Required.
  • path (str): Path to dataset. Required.

Returns:

  DataFrame: DataFrame with raw Jung data.

Source code in src/otg/datasource/intervals/jung.py
@staticmethod\ndef read(spark: SparkSession, path: str) -> DataFrame:\n    \"\"\"Read jung dataset.\n\n    Args:\n        spark (SparkSession): Spark session\n        path (str): Path to dataset\n\n    Returns:\n        DataFrame: DataFrame with raw jung data\n    \"\"\"\n    return spark.read.csv(path, sep=\",\", header=True)\n
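A minimal sketch for this source, following the same pattern; the CSV path is hypothetical and `gene_index`/`lift` are assumed to be built upstream:

```python
from pyspark.sql import SparkSession

from otg.datasource.intervals.jung import IntervalsJung

spark = SparkSession.builder.getOrCreate()

# Hypothetical path to the Jung et al. 2019 promoter-interaction table (CSV):
jung_path = "gs://my-bucket/intervals/jung2019.csv"

jung_raw = IntervalsJung.read(spark, jung_path)

# `gene_index` and `lift` are assumed to come from earlier pipeline steps:
intervals = IntervalsJung.parse(jung_raw, gene_index, lift)

intervals.df.show(5)
```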
"},{"location":"python_api/datasource/intervals/thurman/","title":"Thurman et al.","text":""},{"location":"python_api/datasource/intervals/thurman/#otg.datasource.intervals.thurman.IntervalsThurman","title":"otg.datasource.intervals.thurman.IntervalsThurman","text":"

Bases: Intervals

Interval dataset from Thurman et al. 2012.

Source code in src/otg/datasource/intervals/thurman.py
class IntervalsThurman(Intervals):\n    \"\"\"Interval dataset from Thurman et al. 2012.\"\"\"\n\n    @staticmethod\n    def read(spark: SparkSession, path: str) -> DataFrame:\n        \"\"\"Read thurman dataset.\n\n        Args:\n            spark (SparkSession): Spark session\n            path (str): Path to dataset\n\n        Returns:\n            DataFrame: DataFrame with raw thurman data\n        \"\"\"\n        thurman_schema = t.StructType(\n            [\n                t.StructField(\"gene_chr\", t.StringType(), False),\n                t.StructField(\"gene_start\", t.IntegerType(), False),\n                t.StructField(\"gene_end\", t.IntegerType(), False),\n                t.StructField(\"gene_name\", t.StringType(), False),\n                t.StructField(\"chrom\", t.StringType(), False),\n                t.StructField(\"start\", t.IntegerType(), False),\n                t.StructField(\"end\", t.IntegerType(), False),\n                t.StructField(\"score\", t.FloatType(), False),\n            ]\n        )\n        return spark.read.csv(path, sep=\"\\t\", header=True, schema=thurman_schema)\n\n    @classmethod\n    def parse(\n        cls: type[IntervalsThurman],\n        thurman_raw: DataFrame,\n        gene_index: GeneIndex,\n        lift: LiftOverSpark,\n    ) -> Intervals:\n        \"\"\"Parse the Thurman et al. 2012 dataset.\n\n        Args:\n            thurman_raw (DataFrame): raw Thurman et al. 2019 dataset\n            gene_index (GeneIndex): gene index\n            lift (LiftOverSpark): LiftOverSpark instance\n\n        Returns:\n            Intervals: Interval dataset containing Thurman et al. 2012 data\n        \"\"\"\n        dataset_name = \"thurman2012\"\n        experiment_type = \"dhscor\"\n        pmid = \"22955617\"\n\n        return cls(\n            _df=(\n                thurman_raw.select(\n                    f.regexp_replace(f.col(\"chrom\"), \"chr\", \"\").alias(\"chrom\"),\n                    \"start\",\n                    \"end\",\n                    \"gene_name\",\n                    \"score\",\n                )\n                # Lift over to the GRCh38 build:\n                .transform(\n                    lambda df: lift.convert_intervals(df, \"chrom\", \"start\", \"end\")\n                )\n                .alias(\"intervals\")\n                # Map gene names to gene IDs:\n                .join(\n                    gene_index.symbols_lut().alias(\"genes\"),\n                    on=[\n                        f.col(\"intervals.gene_name\") == f.col(\"genes.geneSymbol\"),\n                        f.col(\"intervals.chrom\") == f.col(\"genes.chromosome\"),\n                    ],\n                    how=\"inner\",\n                )\n                # Select relevant columns and add constant columns:\n                .select(\n                    f.col(\"chrom\").alias(\"chromosome\"),\n                    f.col(\"mapped_start\").alias(\"start\"),\n                    f.col(\"mapped_end\").alias(\"end\"),\n                    \"geneId\",\n                    f.col(\"score\").cast(t.DoubleType()).alias(\"resourceScore\"),\n                    f.lit(dataset_name).alias(\"datasourceId\"),\n                    f.lit(experiment_type).alias(\"datatypeId\"),\n                    f.lit(pmid).alias(\"pmid\"),\n                )\n                .distinct()\n            ),\n            _schema=cls.get_schema(),\n        )\n
"},{"location":"python_api/datasource/intervals/thurman/#otg.datasource.intervals.thurman.IntervalsThurman.parse","title":"parse(thurman_raw: DataFrame, gene_index: GeneIndex, lift: LiftOverSpark) -> Intervals classmethod","text":"

Parse the Thurman et al. 2012 dataset.

Parameters:

  • thurman_raw (DataFrame): Raw Thurman et al. 2012 dataset. Required.
  • gene_index (GeneIndex): Gene index. Required.
  • lift (LiftOverSpark): LiftOverSpark instance. Required.

Returns:

  Intervals: Interval dataset containing Thurman et al. 2012 data.

Source code in src/otg/datasource/intervals/thurman.py
@classmethod\ndef parse(\n    cls: type[IntervalsThurman],\n    thurman_raw: DataFrame,\n    gene_index: GeneIndex,\n    lift: LiftOverSpark,\n) -> Intervals:\n    \"\"\"Parse the Thurman et al. 2012 dataset.\n\n    Args:\n        thurman_raw (DataFrame): raw Thurman et al. 2019 dataset\n        gene_index (GeneIndex): gene index\n        lift (LiftOverSpark): LiftOverSpark instance\n\n    Returns:\n        Intervals: Interval dataset containing Thurman et al. 2012 data\n    \"\"\"\n    dataset_name = \"thurman2012\"\n    experiment_type = \"dhscor\"\n    pmid = \"22955617\"\n\n    return cls(\n        _df=(\n            thurman_raw.select(\n                f.regexp_replace(f.col(\"chrom\"), \"chr\", \"\").alias(\"chrom\"),\n                \"start\",\n                \"end\",\n                \"gene_name\",\n                \"score\",\n            )\n            # Lift over to the GRCh38 build:\n            .transform(\n                lambda df: lift.convert_intervals(df, \"chrom\", \"start\", \"end\")\n            )\n            .alias(\"intervals\")\n            # Map gene names to gene IDs:\n            .join(\n                gene_index.symbols_lut().alias(\"genes\"),\n                on=[\n                    f.col(\"intervals.gene_name\") == f.col(\"genes.geneSymbol\"),\n                    f.col(\"intervals.chrom\") == f.col(\"genes.chromosome\"),\n                ],\n                how=\"inner\",\n            )\n            # Select relevant columns and add constant columns:\n            .select(\n                f.col(\"chrom\").alias(\"chromosome\"),\n                f.col(\"mapped_start\").alias(\"start\"),\n                f.col(\"mapped_end\").alias(\"end\"),\n                \"geneId\",\n                f.col(\"score\").cast(t.DoubleType()).alias(\"resourceScore\"),\n                f.lit(dataset_name).alias(\"datasourceId\"),\n                f.lit(experiment_type).alias(\"datatypeId\"),\n                f.lit(pmid).alias(\"pmid\"),\n            )\n            .distinct()\n        ),\n        _schema=cls.get_schema(),\n    )\n
"},{"location":"python_api/datasource/intervals/thurman/#otg.datasource.intervals.thurman.IntervalsThurman.read","title":"read(spark: SparkSession, path: str) -> DataFrame staticmethod","text":"

Read thurman dataset.

Parameters:

  • spark (SparkSession): Spark session. Required.
  • path (str): Path to dataset. Required.

Returns:

  DataFrame: DataFrame with raw Thurman data.

Source code in src/otg/datasource/intervals/thurman.py
@staticmethod\ndef read(spark: SparkSession, path: str) -> DataFrame:\n    \"\"\"Read thurman dataset.\n\n    Args:\n        spark (SparkSession): Spark session\n        path (str): Path to dataset\n\n    Returns:\n        DataFrame: DataFrame with raw thurman data\n    \"\"\"\n    thurman_schema = t.StructType(\n        [\n            t.StructField(\"gene_chr\", t.StringType(), False),\n            t.StructField(\"gene_start\", t.IntegerType(), False),\n            t.StructField(\"gene_end\", t.IntegerType(), False),\n            t.StructField(\"gene_name\", t.StringType(), False),\n            t.StructField(\"chrom\", t.StringType(), False),\n            t.StructField(\"start\", t.IntegerType(), False),\n            t.StructField(\"end\", t.IntegerType(), False),\n            t.StructField(\"score\", t.FloatType(), False),\n        ]\n    )\n    return spark.read.csv(path, sep=\"\\t\", header=True, schema=thurman_schema)\n
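A minimal sketch for this source; the TSV path is hypothetical and `gene_index`/`lift` are assumed to be built upstream:

```python
from pyspark.sql import SparkSession

from otg.datasource.intervals.thurman import IntervalsThurman

spark = SparkSession.builder.getOrCreate()

# Hypothetical path to the Thurman et al. 2012 DHS-promoter correlation file (TSV):
thurman_path = "gs://my-bucket/intervals/thurman2012.tsv.gz"

thurman_raw = IntervalsThurman.read(spark, thurman_path)

# `gene_index` and `lift` are assumed to come from earlier pipeline steps:
intervals = IntervalsThurman.parse(thurman_raw, gene_index, lift)

intervals.df.show(5)
```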
"},{"location":"python_api/datasource/open_targets/_open_targets/","title":"Open Targets","text":"

The Open Targets Platform is a comprehensive resource that aims to aggregate and harmonise various types of data to facilitate the identification, prioritisation, and validation of drug targets. By integrating publicly available datasets including data generated by the Open Targets consortium, the Platform builds and scores target-disease associations to assist in drug target identification and prioritisation. It also integrates relevant annotation information about targets, diseases, phenotypes, and drugs, as well as their most relevant relationships.

Genomic data from Open Targets integrates human genome-wide association studies (GWAS) and functional genomics data including gene expression, protein abundance, chromatin interaction and conformation data from a wide range of cell types and tissues to make robust connections between GWAS-associated loci, variants and likely causal genes.

"},{"location":"python_api/datasource/open_targets/target/","title":"Target","text":""},{"location":"python_api/datasource/open_targets/target/#otg.datasource.open_targets.target.OpenTargetsTarget","title":"otg.datasource.open_targets.target.OpenTargetsTarget","text":"

Parser for OTPlatform target dataset.

Genomic data from Open Targets provides gene identification and genomic coordinates that are integrated into the gene index of our ETL pipeline.

The EMBL-EBI Ensembl database is used as a source for human targets in the Platform, with the Ensembl gene ID as the primary identifier. The criteria for target inclusion are:

  • Genes from all biotypes encoded in canonical chromosomes
  • Genes in alternative assemblies encoding for a reviewed protein product.

Source code in src/otg/datasource/open_targets/target.py
class OpenTargetsTarget:\n    \"\"\"Parser for OTPlatform target dataset.\n\n    Genomic data from Open Targets provides gene identification and genomic coordinates that are integrated into the gene index of our ETL pipeline.\n\n    The EMBL-EBI Ensembl database is used as a source for human targets in the Platform, with the Ensembl gene ID as the primary identifier. The criteria for target inclusion is:\n    - Genes from all biotypes encoded in canonical chromosomes\n    - Genes in alternative assemblies encoding for a reviewed protein product.\n    \"\"\"\n\n    @staticmethod\n    def _get_gene_tss(strand_col: Column, start_col: Column, end_col: Column) -> Column:\n        \"\"\"Returns the TSS of a gene based on its orientation.\n\n        Args:\n            strand_col (Column): Column containing 1 if the coding strand of the gene is forward, and -1 if it is reverse.\n            start_col (Column): Column containing the start position of the gene.\n            end_col (Column): Column containing the end position of the gene.\n\n        Returns:\n            Column: Column containing the TSS of the gene.\n\n        Examples:\n            >>> df = spark.createDataFrame([{\"strand\": 1, \"start\": 100, \"end\": 200}, {\"strand\": -1, \"start\": 100, \"end\": 200}])\n            >>> df.withColumn(\"tss\", OpenTargetsTarget._get_gene_tss(f.col(\"strand\"), f.col(\"start\"), f.col(\"end\"))).show()\n            +---+-----+------+---+\n            |end|start|strand|tss|\n            +---+-----+------+---+\n            |200|  100|     1|100|\n            |200|  100|    -1|200|\n            +---+-----+------+---+\n            <BLANKLINE>\n\n        \"\"\"\n        return f.when(strand_col == 1, start_col).when(strand_col == -1, end_col)\n\n    @classmethod\n    def as_gene_index(cls: type[GeneIndex], target_index: DataFrame) -> GeneIndex:\n        \"\"\"Initialise GeneIndex from source dataset.\n\n        Args:\n            target_index (DataFrame): Target index dataframe\n\n        Returns:\n            GeneIndex: Gene index dataset\n        \"\"\"\n        return GeneIndex(\n            _df=target_index.select(\n                f.coalesce(f.col(\"id\"), f.lit(\"unknown\")).alias(\"geneId\"),\n                \"approvedSymbol\",\n                \"approvedName\",\n                \"biotype\",\n                f.col(\"obsoleteSymbols.label\").alias(\"obsoleteSymbols\"),\n                f.coalesce(f.col(\"genomicLocation.chromosome\"), f.lit(\"unknown\")).alias(\n                    \"chromosome\"\n                ),\n                OpenTargetsTarget._get_gene_tss(\n                    f.col(\"genomicLocation.strand\"),\n                    f.col(\"genomicLocation.start\"),\n                    f.col(\"genomicLocation.end\"),\n                ).alias(\"tss\"),\n                f.col(\"genomicLocation.start\").alias(\"start\"),\n                f.col(\"genomicLocation.end\").alias(\"end\"),\n                f.col(\"genomicLocation.strand\").alias(\"strand\"),\n            ),\n            _schema=GeneIndex.get_schema(),\n        )\n
"},{"location":"python_api/datasource/open_targets/target/#otg.datasource.open_targets.target.OpenTargetsTarget.as_gene_index","title":"as_gene_index(target_index: DataFrame) -> GeneIndex classmethod","text":"

Initialise GeneIndex from source dataset.

Parameters:

  • target_index (DataFrame): Target index dataframe. Required.

Returns:

  GeneIndex: Gene index dataset.

Source code in src/otg/datasource/open_targets/target.py
@classmethod\ndef as_gene_index(cls: type[GeneIndex], target_index: DataFrame) -> GeneIndex:\n    \"\"\"Initialise GeneIndex from source dataset.\n\n    Args:\n        target_index (DataFrame): Target index dataframe\n\n    Returns:\n        GeneIndex: Gene index dataset\n    \"\"\"\n    return GeneIndex(\n        _df=target_index.select(\n            f.coalesce(f.col(\"id\"), f.lit(\"unknown\")).alias(\"geneId\"),\n            \"approvedSymbol\",\n            \"approvedName\",\n            \"biotype\",\n            f.col(\"obsoleteSymbols.label\").alias(\"obsoleteSymbols\"),\n            f.coalesce(f.col(\"genomicLocation.chromosome\"), f.lit(\"unknown\")).alias(\n                \"chromosome\"\n            ),\n            OpenTargetsTarget._get_gene_tss(\n                f.col(\"genomicLocation.strand\"),\n                f.col(\"genomicLocation.start\"),\n                f.col(\"genomicLocation.end\"),\n            ).alias(\"tss\"),\n            f.col(\"genomicLocation.start\").alias(\"start\"),\n            f.col(\"genomicLocation.end\").alias(\"end\"),\n            f.col(\"genomicLocation.strand\").alias(\"strand\"),\n        ),\n        _schema=GeneIndex.get_schema(),\n    )\n
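A minimal sketch of building a gene index from a Platform target release; the Parquet location is hypothetical and the `df` accessor on the returned GeneIndex is assumed:

```python
from pyspark.sql import SparkSession

from otg.datasource.open_targets.target import OpenTargetsTarget

spark = SparkSession.builder.getOrCreate()

# Hypothetical location of an Open Targets Platform target index release (Parquet):
target_path = "gs://my-bucket/platform/targets/"

target_index = spark.read.parquet(target_path)
gene_index = OpenTargetsTarget.as_gene_index(target_index)

gene_index.df.select("geneId", "approvedSymbol", "chromosome", "tss").show(5)
```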
"},{"location":"python_api/datasource/ukbiobank/_ukbiobank/","title":"UK Biobank","text":"

The UK Biobank is a large-scale biomedical database and research resource that contains a diverse range of in-depth information from 500,000 volunteers in the United Kingdom. Its genomic data comprises whole-genome sequencing for a subset of participants, along with genotyping arrays for the entire cohort. The data has been a cornerstone for numerous genome-wide association studies (GWAS) and other genetic analyses, advancing our understanding of human health and disease.

Recent efforts to rapidly and systematically apply established GWAS methods to all available data fields in UK Biobank have made available large repositories of summary statistics. To leverage these data for disease locus discovery, we used full summary statistics from:

  • The Neale lab Round 2 (N=2139). These analyses applied GWAS (implemented in Hail) to all data fields using imputed genotypes from HRC as released by UK Biobank in May 2017, consisting of 337,199 individuals post-QC. Full details of the Neale lab GWAS implementation are available here. We have removed all ICD-10 related traits from the Neale data to reduce overlap with the SAIGE results. http://www.nealelab.is/uk-biobank/
  • The University of Michigan SAIGE analysis (N=1281). The SAIGE analysis uses PheCode derived phenotypes and applies a new method that "provides accurate P values even when case-control ratios are extremely unbalanced". See Zhou et al. (2018) for further details. https://pubmed.ncbi.nlm.nih.gov/30104761/

"},{"location":"python_api/datasource/ukbiobank/study_index/","title":"Study Index","text":""},{"location":"python_api/datasource/ukbiobank/study_index/#otg.datasource.ukbiobank.study_index.UKBiobankStudyIndex","title":"otg.datasource.ukbiobank.study_index.UKBiobankStudyIndex","text":"

Bases: StudyIndex

Study index dataset from UKBiobank.

The following information is extracted:

  • studyId
  • pubmedId
  • publicationDate
  • publicationJournal
  • publicationTitle
  • publicationFirstAuthor
  • traitFromSource
  • ancestry_discoverySamples
  • ancestry_replicationSamples
  • initialSampleSize
  • nCases
  • replicationSamples

Some fields are populated as constants, such as projectID, studyType, and initial sample size.

Source code in src/otg/datasource/ukbiobank/study_index.py
class UKBiobankStudyIndex(StudyIndex):\n    \"\"\"Study index dataset from UKBiobank.\n\n    The following information is extracted:\n\n    - studyId\n    - pubmedId\n    - publicationDate\n    - publicationJournal\n    - publicationTitle\n    - publicationFirstAuthor\n    - traitFromSource\n    - ancestry_discoverySamples\n    - ancestry_replicationSamples\n    - initialSampleSize\n    - nCases\n    - replicationSamples\n\n    Some fields are populated as constants, such as projectID, studyType, and initial sample size.\n    \"\"\"\n\n    @classmethod\n    def from_source(\n        cls: type[UKBiobankStudyIndex],\n        ukbiobank_studies: DataFrame,\n    ) -> UKBiobankStudyIndex:\n        \"\"\"This function ingests study level metadata from UKBiobank.\n\n        The University of Michigan SAIGE analysis (N=1281) utilized PheCode derived phenotypes and a novel method that ensures accurate P values, even with highly unbalanced case-control ratios (Zhou et al., 2018).\n\n        The Neale lab Round 2 study (N=2139) used GWAS with imputed genotypes from HRC to analyze all data fields in UK Biobank, excluding ICD-10 related traits to reduce overlap with the SAIGE results.\n\n        Args:\n            ukbiobank_studies (DataFrame): UKBiobank study manifest file loaded in spark session.\n\n        Returns:\n            UKBiobankStudyIndex: Annotated UKBiobank study table.\n        \"\"\"\n        return StudyIndex(\n            _df=(\n                ukbiobank_studies.select(\n                    f.col(\"code\").alias(\"studyId\"),\n                    f.lit(\"UKBiobank\").alias(\"projectId\"),\n                    f.lit(\"gwas\").alias(\"studyType\"),\n                    f.col(\"trait\").alias(\"traitFromSource\"),\n                    # Make publication and ancestry schema columns.\n                    f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"30104761\").alias(\n                        \"pubmedId\"\n                    ),\n                    f.when(\n                        f.col(\"code\").startswith(\"SAIGE_\"),\n                        \"Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies\",\n                    )\n                    .otherwise(None)\n                    .alias(\"publicationTitle\"),\n                    f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"Wei Zhou\").alias(\n                        \"publicationFirstAuthor\"\n                    ),\n                    f.when(f.col(\"code\").startswith(\"NEALE2_\"), \"2018-08-01\")\n                    .otherwise(\"2018-10-24\")\n                    .alias(\"publicationDate\"),\n                    f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"Nature Genetics\").alias(\n                        \"publicationJournal\"\n                    ),\n                    f.col(\"n_total\").cast(\"string\").alias(\"initialSampleSize\"),\n                    f.col(\"n_cases\").cast(\"long\").alias(\"nCases\"),\n                    f.array(\n                        f.struct(\n                            f.col(\"n_total\").cast(\"long\").alias(\"sampleSize\"),\n                            f.concat(f.lit(\"European=\"), f.col(\"n_total\")).alias(\n                                \"ancestry\"\n                            ),\n                        )\n                    ).alias(\"discoverySamples\"),\n                    f.col(\"in_path\").alias(\"summarystatsLocation\"),\n                    f.lit(True).alias(\"hasSumstats\"),\n                )\n               
 .withColumn(\n                    \"traitFromSource\",\n                    f.when(\n                        f.col(\"traitFromSource\").contains(\":\"),\n                        f.concat(\n                            f.initcap(\n                                f.split(f.col(\"traitFromSource\"), \": \").getItem(1)\n                            ),\n                            f.lit(\" | \"),\n                            f.lower(f.split(f.col(\"traitFromSource\"), \": \").getItem(0)),\n                        ),\n                    ).otherwise(f.col(\"traitFromSource\")),\n                )\n                .withColumn(\n                    \"ldPopulationStructure\",\n                    cls.aggregate_and_map_ancestries(f.col(\"discoverySamples\")),\n                )\n            ),\n            _schema=StudyIndex.get_schema(),\n        )\n
"},{"location":"python_api/datasource/ukbiobank/study_index/#otg.datasource.ukbiobank.study_index.UKBiobankStudyIndex.from_source","title":"from_source(ukbiobank_studies: DataFrame) -> UKBiobankStudyIndex classmethod","text":"

This function ingests study level metadata from UKBiobank.

The University of Michigan SAIGE analysis (N=1281) utilized PheCode derived phenotypes and a novel method that ensures accurate P values, even with highly unbalanced case-control ratios (Zhou et al., 2018).

The Neale lab Round 2 study (N=2139) used GWAS with imputed genotypes from HRC to analyze all data fields in UK Biobank, excluding ICD-10 related traits to reduce overlap with the SAIGE results.

Parameters:

  • ukbiobank_studies (DataFrame): UKBiobank study manifest file loaded in spark session. Required.

Returns:

  UKBiobankStudyIndex: Annotated UKBiobank study table.

Source code in src/otg/datasource/ukbiobank/study_index.py
@classmethod\ndef from_source(\n    cls: type[UKBiobankStudyIndex],\n    ukbiobank_studies: DataFrame,\n) -> UKBiobankStudyIndex:\n    \"\"\"This function ingests study level metadata from UKBiobank.\n\n    The University of Michigan SAIGE analysis (N=1281) utilized PheCode derived phenotypes and a novel method that ensures accurate P values, even with highly unbalanced case-control ratios (Zhou et al., 2018).\n\n    The Neale lab Round 2 study (N=2139) used GWAS with imputed genotypes from HRC to analyze all data fields in UK Biobank, excluding ICD-10 related traits to reduce overlap with the SAIGE results.\n\n    Args:\n        ukbiobank_studies (DataFrame): UKBiobank study manifest file loaded in spark session.\n\n    Returns:\n        UKBiobankStudyIndex: Annotated UKBiobank study table.\n    \"\"\"\n    return StudyIndex(\n        _df=(\n            ukbiobank_studies.select(\n                f.col(\"code\").alias(\"studyId\"),\n                f.lit(\"UKBiobank\").alias(\"projectId\"),\n                f.lit(\"gwas\").alias(\"studyType\"),\n                f.col(\"trait\").alias(\"traitFromSource\"),\n                # Make publication and ancestry schema columns.\n                f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"30104761\").alias(\n                    \"pubmedId\"\n                ),\n                f.when(\n                    f.col(\"code\").startswith(\"SAIGE_\"),\n                    \"Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies\",\n                )\n                .otherwise(None)\n                .alias(\"publicationTitle\"),\n                f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"Wei Zhou\").alias(\n                    \"publicationFirstAuthor\"\n                ),\n                f.when(f.col(\"code\").startswith(\"NEALE2_\"), \"2018-08-01\")\n                .otherwise(\"2018-10-24\")\n                .alias(\"publicationDate\"),\n                f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"Nature Genetics\").alias(\n                    \"publicationJournal\"\n                ),\n                f.col(\"n_total\").cast(\"string\").alias(\"initialSampleSize\"),\n                f.col(\"n_cases\").cast(\"long\").alias(\"nCases\"),\n                f.array(\n                    f.struct(\n                        f.col(\"n_total\").cast(\"long\").alias(\"sampleSize\"),\n                        f.concat(f.lit(\"European=\"), f.col(\"n_total\")).alias(\n                            \"ancestry\"\n                        ),\n                    )\n                ).alias(\"discoverySamples\"),\n                f.col(\"in_path\").alias(\"summarystatsLocation\"),\n                f.lit(True).alias(\"hasSumstats\"),\n            )\n            .withColumn(\n                \"traitFromSource\",\n                f.when(\n                    f.col(\"traitFromSource\").contains(\":\"),\n                    f.concat(\n                        f.initcap(\n                            f.split(f.col(\"traitFromSource\"), \": \").getItem(1)\n                        ),\n                        f.lit(\" | \"),\n                        f.lower(f.split(f.col(\"traitFromSource\"), \": \").getItem(0)),\n                    ),\n                ).otherwise(f.col(\"traitFromSource\")),\n            )\n            .withColumn(\n                \"ldPopulationStructure\",\n                cls.aggregate_and_map_ancestries(f.col(\"discoverySamples\")),\n            )\n        ),\n        
_schema=StudyIndex.get_schema(),\n    )\n
"},{"location":"python_api/method/_method/","title":"Method","text":"

TBC

"},{"location":"python_api/method/clumping/","title":"Clumping","text":"

Clumping is a commonly used post-processing method that allows for identification of independent association signals from GWAS summary statistics and curated associations. This process is critical because of the complex linkage disequilibrium (LD) structure in human populations, which can result in multiple statistically significant associations within the same genomic region. Clumping methods help reduce redundancy in GWAS results and ensure that each reported association represents an independent signal.

We have implemented two clumping methods: LD-based clumping and window-based clumping.

"},{"location":"python_api/method/clumping/#otg.method.clump.LDclumping","title":"otg.method.clump.LDclumping","text":"

LD clumping reports the most significant genetic associations in a region in terms of a smaller number of \u201cclumps\u201d of genetically linked SNPs.

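A minimal usage sketch (assuming `study_locus` is an already-built `StudyLocus` dataset produced by an upstream step; the variable name is illustrative):

```python
from otg.method.clump import LDclumping

# `study_locus` is assumed to be an existing StudyLocus dataset produced upstream.
clumped = LDclumping.clump(study_locus)

# Loci linked to a more significant lead in the same study are flagged
# and their locus information is removed.
clumped.df.show()
```
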
Source code in src/otg/method/clump.py
class LDclumping:\n    \"\"\"LD clumping reports the most significant genetic associations in a region in terms of a smaller number of \u201cclumps\u201d of genetically linked SNPs.\"\"\"\n\n    @staticmethod\n    def _is_lead_linked(\n        study_id: Column,\n        variant_id: Column,\n        p_value_exponent: Column,\n        p_value_mantissa: Column,\n        ld_set: Column,\n    ) -> Column:\n        \"\"\"Evaluates whether a lead variant is linked to a tag (with lowest p-value) in the same studyLocus dataset.\n\n        Args:\n            study_id (Column): studyId\n            variant_id (Column): Lead variant id\n            p_value_exponent (Column): p-value exponent\n            p_value_mantissa (Column): p-value mantissa\n            ld_set (Column): Array of variants in LD with the lead variant\n\n        Returns:\n            Column: Boolean in which True indicates that the lead is linked to another tag in the same dataset.\n        \"\"\"\n        leads_in_study = f.collect_set(variant_id).over(Window.partitionBy(study_id))\n        tags_in_studylocus = f.array_union(\n            # Get all tag variants from the credible set per studyLocusId\n            f.transform(ld_set, lambda x: x.tagVariantId),\n            # And append the lead variant so that the intersection is the same for all studyLocusIds in a study\n            f.array(variant_id),\n        )\n        intersect_lead_tags = f.array_sort(\n            f.array_intersect(leads_in_study, tags_in_studylocus)\n        )\n        return (\n            # If the lead is in the credible set, we rank the peaks by p-value\n            f.when(\n                f.size(intersect_lead_tags) > 0,\n                f.row_number().over(\n                    Window.partitionBy(study_id, intersect_lead_tags).orderBy(\n                        p_value_exponent, p_value_mantissa\n                    )\n                )\n                > 1,\n            )\n            # If the intersection is empty (lead is not in the credible set or cred set is empty), the association is not linked\n            .otherwise(f.lit(False))\n        )\n\n    @classmethod\n    def clump(cls: type[LDclumping], associations: StudyLocus) -> StudyLocus:\n        \"\"\"Perform clumping on studyLocus dataset.\n\n        Args:\n            associations (StudyLocus): StudyLocus dataset\n\n        Returns:\n            StudyLocus: including flag and removing locus information for LD clumped loci.\n        \"\"\"\n        return associations.clump()\n
"},{"location":"python_api/method/clumping/#otg.method.clump.LDclumping.clump","title":"clump(associations: StudyLocus) -> StudyLocus classmethod","text":"

Perform clumping on studyLocus dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `associations` | `StudyLocus` | StudyLocus dataset | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `StudyLocus` | `StudyLocus` | including flag and removing locus information for LD clumped loci. |

Source code in src/otg/method/clump.py
@classmethod\ndef clump(cls: type[LDclumping], associations: StudyLocus) -> StudyLocus:\n    \"\"\"Perform clumping on studyLocus dataset.\n\n    Args:\n        associations (StudyLocus): StudyLocus dataset\n\n    Returns:\n        StudyLocus: including flag and removing locus information for LD clumped loci.\n    \"\"\"\n    return associations.clump()\n
"},{"location":"python_api/method/coloc/","title":"Coloc","text":""},{"location":"python_api/method/coloc/#otg.method.colocalisation.Coloc","title":"otg.method.colocalisation.Coloc","text":"

Calculate Bayesian colocalisation based on overlapping signals from credible sets.

Based on the R COLOC package, which uses the Bayes factors from the credible set to estimate the posterior probability of colocalisation. This method makes the simplifying assumption that only one single causal variant exists for any given trait in any genomic region.

| Hypothesis | Description |
| --- | --- |
| H0 | no association with either trait in the region |
| H1 | association with trait 1 only |
| H2 | association with trait 2 only |
| H3 | both traits are associated, but have different single causal variants |
| H4 | both traits are associated and share the same single causal variant |

Approximate Bayes factors required

Coloc requires the availability of approximate Bayes factors (ABF) for each variant in the credible set (logABF column).

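To make the hypothesis scoring concrete, the following NumPy sketch reproduces, on toy logABF values, the same arithmetic the Spark implementation applies to each pair of overlapping loci (numbers are illustrative only):

```python
import numpy as np

def logsum(log_abf: np.ndarray) -> float:
    """Log of the sum of exponentiated logs, subtracting the max for numerical stability."""
    m = np.max(log_abf)
    return float(m + np.log(np.sum(np.exp(log_abf - m))))

# Toy log approximate Bayes factors for the variants shared by two credible sets.
left_logabf = np.array([2.0, 8.0, 3.0])   # trait 1
right_logabf = np.array([1.5, 7.5, 2.0])  # trait 2
priorc1, priorc2, priorc12 = 1e-4, 1e-4, 1e-5

logsum1, logsum2 = logsum(left_logabf), logsum(right_logabf)
logsum12 = logsum(left_logabf + right_logabf)

# Log Bayes factors for each hypothesis (H0 is the reference at 0).
lh0 = 0.0
lh1 = np.log(priorc1) + logsum1
lh2 = np.log(priorc2) + logsum2
sumlogsum = logsum1 + logsum2
m = max(sumlogsum, logsum12)
lh3 = (
    np.log(priorc1) + np.log(priorc2)
    + m + np.log(np.exp(sumlogsum - m) - np.exp(logsum12 - m))
)
lh4 = np.log(priorc12) + logsum12

# Posterior probabilities: normalised exponentiated Bayes factors.
all_abf = np.array([lh0, lh1, lh2, lh3, lh4])
posteriors = np.exp(all_abf - logsum(all_abf))
print(dict(zip(["h0", "h1", "h2", "h3", "h4"], posteriors.round(4))))
```

With these toy values most of the posterior mass falls on H4, i.e. a shared causal variant.
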
Source code in src/otg/method/colocalisation.py
class Coloc:\n    \"\"\"Calculate bayesian colocalisation based on overlapping signals from credible sets.\n\n    Based on the [R COLOC package](https://github.com/chr1swallace/coloc/blob/main/R/claudia.R), which uses the Bayes factors from the credible set to estimate the posterior probability of colocalisation. This method makes the simplifying assumption that **only one single causal variant** exists for any given trait in any genomic region.\n\n    | Hypothesis    | Description                                                           |\n    | ------------- | --------------------------------------------------------------------- |\n    | H<sub>0</sub> | no association with either trait in the region                        |\n    | H<sub>1</sub> | association with trait 1 only                                         |\n    | H<sub>2</sub> | association with trait 2 only                                         |\n    | H<sub>3</sub> | both traits are associated, but have different single causal variants |\n    | H<sub>4</sub> | both traits are associated and share the same single causal variant   |\n\n    !!! warning \"Approximate Bayes factors required\"\n        Coloc requires the availability of approximate Bayes factors (ABF) for each variant in the credible set (`logABF` column).\n\n    \"\"\"\n\n    @staticmethod\n    def _get_logsum(log_abf: ndarray) -> float:\n        \"\"\"Calculates logsum of vector.\n\n        This function calculates the log of the sum of the exponentiated\n        logs taking out the max, i.e. insuring that the sum is not Inf\n\n        Args:\n            log_abf (ndarray): log approximate bayes factor\n\n        Returns:\n            float: logsum\n\n        Example:\n            >>> l = [0.2, 0.1, 0.05, 0]\n            >>> round(Coloc._get_logsum(l), 6)\n            1.476557\n        \"\"\"\n        themax = np.max(log_abf)\n        result = themax + np.log(np.sum(np.exp(log_abf - themax)))\n        return float(result)\n\n    @staticmethod\n    def _get_posteriors(all_abfs: ndarray) -> DenseVector:\n        \"\"\"Calculate posterior probabilities for each hypothesis.\n\n        Args:\n            all_abfs (ndarray): h0-h4 bayes factors\n\n        Returns:\n            DenseVector: Posterior\n\n        Example:\n            >>> l = np.array([0.2, 0.1, 0.05, 0])\n            >>> Coloc._get_posteriors(l)\n            DenseVector([0.279, 0.2524, 0.2401, 0.2284])\n        \"\"\"\n        diff = all_abfs - Coloc._get_logsum(all_abfs)\n        abfs_posteriors = np.exp(diff)\n        return Vectors.dense(abfs_posteriors)\n\n    @classmethod\n    def colocalise(\n        cls: type[Coloc],\n        overlapping_signals: StudyLocusOverlap,\n        priorc1: float = 1e-4,\n        priorc2: float = 1e-4,\n        priorc12: float = 1e-5,\n    ) -> Colocalisation:\n        \"\"\"Calculate bayesian colocalisation based on overlapping signals.\n\n        Args:\n            overlapping_signals (StudyLocusOverlap): overlapping peaks\n            priorc1 (float): Prior on variant being causal for trait 1. Defaults to 1e-4.\n            priorc2 (float): Prior on variant being causal for trait 2. Defaults to 1e-4.\n            priorc12 (float): Prior on variant being causal for traits 1 and 2. 
Defaults to 1e-5.\n\n        Returns:\n            Colocalisation: Colocalisation results\n        \"\"\"\n        # register udfs\n        logsum = f.udf(Coloc._get_logsum, DoubleType())\n        posteriors = f.udf(Coloc._get_posteriors, VectorUDT())\n        return Colocalisation(\n            _df=(\n                overlapping_signals.df\n                # Before summing log_abf columns nulls need to be filled with 0:\n                .fillna(0, subset=[\"statistics.left_logABF\", \"statistics.right_logABF\"])\n                # Sum of log_abfs for each pair of signals\n                .withColumn(\n                    \"sum_log_abf\",\n                    f.col(\"statistics.left_logABF\") + f.col(\"statistics.right_logABF\"),\n                )\n                # Group by overlapping peak and generating dense vectors of log_abf:\n                .groupBy(\"chromosome\", \"leftStudyLocusId\", \"rightStudyLocusId\")\n                .agg(\n                    f.count(\"*\").alias(\"numberColocalisingVariants\"),\n                    fml.array_to_vector(\n                        f.collect_list(f.col(\"statistics.left_logABF\"))\n                    ).alias(\"left_logABF\"),\n                    fml.array_to_vector(\n                        f.collect_list(f.col(\"statistics.right_logABF\"))\n                    ).alias(\"right_logABF\"),\n                    fml.array_to_vector(f.collect_list(f.col(\"sum_log_abf\"))).alias(\n                        \"sum_log_abf\"\n                    ),\n                )\n                .withColumn(\"logsum1\", logsum(f.col(\"left_logABF\")))\n                .withColumn(\"logsum2\", logsum(f.col(\"right_logABF\")))\n                .withColumn(\"logsum12\", logsum(f.col(\"sum_log_abf\")))\n                .drop(\"left_logABF\", \"right_logABF\", \"sum_log_abf\")\n                # Add priors\n                # priorc1 Prior on variant being causal for trait 1\n                .withColumn(\"priorc1\", f.lit(priorc1))\n                # priorc2 Prior on variant being causal for trait 2\n                .withColumn(\"priorc2\", f.lit(priorc2))\n                # priorc12 Prior on variant being causal for traits 1 and 2\n                .withColumn(\"priorc12\", f.lit(priorc12))\n                # h0-h2\n                .withColumn(\"lH0abf\", f.lit(0))\n                .withColumn(\"lH1abf\", f.log(f.col(\"priorc1\")) + f.col(\"logsum1\"))\n                .withColumn(\"lH2abf\", f.log(f.col(\"priorc2\")) + f.col(\"logsum2\"))\n                # h3\n                .withColumn(\"sumlogsum\", f.col(\"logsum1\") + f.col(\"logsum2\"))\n                # exclude null H3/H4s: due to sumlogsum == logsum12\n                .filter(f.col(\"sumlogsum\") != f.col(\"logsum12\"))\n                .withColumn(\"max\", f.greatest(\"sumlogsum\", \"logsum12\"))\n                .withColumn(\n                    \"logdiff\",\n                    (\n                        f.col(\"max\")\n                        + f.log(\n                            f.exp(f.col(\"sumlogsum\") - f.col(\"max\"))\n                            - f.exp(f.col(\"logsum12\") - f.col(\"max\"))\n                        )\n                    ),\n                )\n                .withColumn(\n                    \"lH3abf\",\n                    f.log(f.col(\"priorc1\"))\n                    + f.log(f.col(\"priorc2\"))\n                    + f.col(\"logdiff\"),\n                )\n                .drop(\"right_logsum\", \"left_logsum\", \"sumlogsum\", \"max\", \"logdiff\")\n                # h4\n     
           .withColumn(\"lH4abf\", f.log(f.col(\"priorc12\")) + f.col(\"logsum12\"))\n                # cleaning\n                .drop(\n                    \"priorc1\", \"priorc2\", \"priorc12\", \"logsum1\", \"logsum2\", \"logsum12\"\n                )\n                # posteriors\n                .withColumn(\n                    \"allABF\",\n                    fml.array_to_vector(\n                        f.array(\n                            f.col(\"lH0abf\"),\n                            f.col(\"lH1abf\"),\n                            f.col(\"lH2abf\"),\n                            f.col(\"lH3abf\"),\n                            f.col(\"lH4abf\"),\n                        )\n                    ),\n                )\n                .withColumn(\n                    \"posteriors\", fml.vector_to_array(posteriors(f.col(\"allABF\")))\n                )\n                .withColumn(\"h0\", f.col(\"posteriors\").getItem(0))\n                .withColumn(\"h1\", f.col(\"posteriors\").getItem(1))\n                .withColumn(\"h2\", f.col(\"posteriors\").getItem(2))\n                .withColumn(\"h3\", f.col(\"posteriors\").getItem(3))\n                .withColumn(\"h4\", f.col(\"posteriors\").getItem(4))\n                .withColumn(\"h4h3\", f.col(\"h4\") / f.col(\"h3\"))\n                .withColumn(\"log2h4h3\", f.log2(f.col(\"h4h3\")))\n                # clean up\n                .drop(\n                    \"posteriors\",\n                    \"allABF\",\n                    \"h4h3\",\n                    \"lH0abf\",\n                    \"lH1abf\",\n                    \"lH2abf\",\n                    \"lH3abf\",\n                    \"lH4abf\",\n                )\n                .withColumn(\"colocalisationMethod\", f.lit(\"COLOC\"))\n            ),\n            _schema=Colocalisation.get_schema(),\n        )\n
"},{"location":"python_api/method/coloc/#otg.method.colocalisation.Coloc.colocalise","title":"colocalise(overlapping_signals: StudyLocusOverlap, priorc1: float = 0.0001, priorc2: float = 0.0001, priorc12: float = 1e-05) -> Colocalisation classmethod","text":"

Calculate Bayesian colocalisation based on overlapping signals.

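A hedged usage sketch, assuming `overlaps` is a pre-built `StudyLocusOverlap` dataset carrying `logABF` statistics:

```python
from otg.method.colocalisation import Coloc

# `overlaps` is assumed to be an existing StudyLocusOverlap dataset.
coloc_results = Coloc.colocalise(overlaps, priorc1=1e-4, priorc2=1e-4, priorc12=1e-5)

# Posterior probabilities for H0-H4 and log2(H4/H3) per pair of overlapping loci.
coloc_results.df.select(
    "leftStudyLocusId", "rightStudyLocusId", "h4", "log2h4h3"
).show()
```
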
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `overlapping_signals` | `StudyLocusOverlap` | overlapping peaks | required |
| `priorc1` | `float` | Prior on variant being causal for trait 1. Defaults to 1e-4. | `0.0001` |
| `priorc2` | `float` | Prior on variant being causal for trait 2. Defaults to 1e-4. | `0.0001` |
| `priorc12` | `float` | Prior on variant being causal for traits 1 and 2. Defaults to 1e-5. | `1e-05` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `Colocalisation` | `Colocalisation` | Colocalisation results |

Source code in src/otg/method/colocalisation.py
@classmethod\ndef colocalise(\n    cls: type[Coloc],\n    overlapping_signals: StudyLocusOverlap,\n    priorc1: float = 1e-4,\n    priorc2: float = 1e-4,\n    priorc12: float = 1e-5,\n) -> Colocalisation:\n    \"\"\"Calculate bayesian colocalisation based on overlapping signals.\n\n    Args:\n        overlapping_signals (StudyLocusOverlap): overlapping peaks\n        priorc1 (float): Prior on variant being causal for trait 1. Defaults to 1e-4.\n        priorc2 (float): Prior on variant being causal for trait 2. Defaults to 1e-4.\n        priorc12 (float): Prior on variant being causal for traits 1 and 2. Defaults to 1e-5.\n\n    Returns:\n        Colocalisation: Colocalisation results\n    \"\"\"\n    # register udfs\n    logsum = f.udf(Coloc._get_logsum, DoubleType())\n    posteriors = f.udf(Coloc._get_posteriors, VectorUDT())\n    return Colocalisation(\n        _df=(\n            overlapping_signals.df\n            # Before summing log_abf columns nulls need to be filled with 0:\n            .fillna(0, subset=[\"statistics.left_logABF\", \"statistics.right_logABF\"])\n            # Sum of log_abfs for each pair of signals\n            .withColumn(\n                \"sum_log_abf\",\n                f.col(\"statistics.left_logABF\") + f.col(\"statistics.right_logABF\"),\n            )\n            # Group by overlapping peak and generating dense vectors of log_abf:\n            .groupBy(\"chromosome\", \"leftStudyLocusId\", \"rightStudyLocusId\")\n            .agg(\n                f.count(\"*\").alias(\"numberColocalisingVariants\"),\n                fml.array_to_vector(\n                    f.collect_list(f.col(\"statistics.left_logABF\"))\n                ).alias(\"left_logABF\"),\n                fml.array_to_vector(\n                    f.collect_list(f.col(\"statistics.right_logABF\"))\n                ).alias(\"right_logABF\"),\n                fml.array_to_vector(f.collect_list(f.col(\"sum_log_abf\"))).alias(\n                    \"sum_log_abf\"\n                ),\n            )\n            .withColumn(\"logsum1\", logsum(f.col(\"left_logABF\")))\n            .withColumn(\"logsum2\", logsum(f.col(\"right_logABF\")))\n            .withColumn(\"logsum12\", logsum(f.col(\"sum_log_abf\")))\n            .drop(\"left_logABF\", \"right_logABF\", \"sum_log_abf\")\n            # Add priors\n            # priorc1 Prior on variant being causal for trait 1\n            .withColumn(\"priorc1\", f.lit(priorc1))\n            # priorc2 Prior on variant being causal for trait 2\n            .withColumn(\"priorc2\", f.lit(priorc2))\n            # priorc12 Prior on variant being causal for traits 1 and 2\n            .withColumn(\"priorc12\", f.lit(priorc12))\n            # h0-h2\n            .withColumn(\"lH0abf\", f.lit(0))\n            .withColumn(\"lH1abf\", f.log(f.col(\"priorc1\")) + f.col(\"logsum1\"))\n            .withColumn(\"lH2abf\", f.log(f.col(\"priorc2\")) + f.col(\"logsum2\"))\n            # h3\n            .withColumn(\"sumlogsum\", f.col(\"logsum1\") + f.col(\"logsum2\"))\n            # exclude null H3/H4s: due to sumlogsum == logsum12\n            .filter(f.col(\"sumlogsum\") != f.col(\"logsum12\"))\n            .withColumn(\"max\", f.greatest(\"sumlogsum\", \"logsum12\"))\n            .withColumn(\n                \"logdiff\",\n                (\n                    f.col(\"max\")\n                    + f.log(\n                        f.exp(f.col(\"sumlogsum\") - f.col(\"max\"))\n                        - f.exp(f.col(\"logsum12\") - f.col(\"max\"))\n                    )\n      
          ),\n            )\n            .withColumn(\n                \"lH3abf\",\n                f.log(f.col(\"priorc1\"))\n                + f.log(f.col(\"priorc2\"))\n                + f.col(\"logdiff\"),\n            )\n            .drop(\"right_logsum\", \"left_logsum\", \"sumlogsum\", \"max\", \"logdiff\")\n            # h4\n            .withColumn(\"lH4abf\", f.log(f.col(\"priorc12\")) + f.col(\"logsum12\"))\n            # cleaning\n            .drop(\n                \"priorc1\", \"priorc2\", \"priorc12\", \"logsum1\", \"logsum2\", \"logsum12\"\n            )\n            # posteriors\n            .withColumn(\n                \"allABF\",\n                fml.array_to_vector(\n                    f.array(\n                        f.col(\"lH0abf\"),\n                        f.col(\"lH1abf\"),\n                        f.col(\"lH2abf\"),\n                        f.col(\"lH3abf\"),\n                        f.col(\"lH4abf\"),\n                    )\n                ),\n            )\n            .withColumn(\n                \"posteriors\", fml.vector_to_array(posteriors(f.col(\"allABF\")))\n            )\n            .withColumn(\"h0\", f.col(\"posteriors\").getItem(0))\n            .withColumn(\"h1\", f.col(\"posteriors\").getItem(1))\n            .withColumn(\"h2\", f.col(\"posteriors\").getItem(2))\n            .withColumn(\"h3\", f.col(\"posteriors\").getItem(3))\n            .withColumn(\"h4\", f.col(\"posteriors\").getItem(4))\n            .withColumn(\"h4h3\", f.col(\"h4\") / f.col(\"h3\"))\n            .withColumn(\"log2h4h3\", f.log2(f.col(\"h4h3\")))\n            # clean up\n            .drop(\n                \"posteriors\",\n                \"allABF\",\n                \"h4h3\",\n                \"lH0abf\",\n                \"lH1abf\",\n                \"lH2abf\",\n                \"lH3abf\",\n                \"lH4abf\",\n            )\n            .withColumn(\"colocalisationMethod\", f.lit(\"COLOC\"))\n        ),\n        _schema=Colocalisation.get_schema(),\n    )\n
"},{"location":"python_api/method/ecaviar/","title":"eCAVIAR","text":""},{"location":"python_api/method/ecaviar/#otg.method.colocalisation.ECaviar","title":"otg.method.colocalisation.ECaviar","text":"

ECaviar-based colocalisation analysis.

It extends the CAVIAR framework to explicitly estimate the posterior probability that the same variant is causal in two studies while accounting for the uncertainty of LD. eCAVIAR computes the colocalization posterior probability (CLPP) by utilizing the marginal posterior probabilities. This framework allows for multiple variants to be causal in a single locus.

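A minimal PySpark sketch of the CLPP calculation on toy values (column names are simplified relative to the real `StudyLocusOverlap` schema):

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Toy posterior probabilities for variants shared by two overlapping credible sets.
overlaps = spark.createDataFrame(
    [
        ("locusA", "locusB", "var1", 0.50, 0.40),
        ("locusA", "locusB", "var2", 0.30, 0.35),
    ],
    ["leftStudyLocusId", "rightStudyLocusId", "variantId", "left_pp", "right_pp"],
)

(
    overlaps
    # Per-variant CLPP is the product of the two posterior probabilities.
    .withColumn("clpp", f.col("left_pp") * f.col("right_pp"))
    # The locus-level CLPP is the sum over all shared variants.
    .groupBy("leftStudyLocusId", "rightStudyLocusId")
    .agg(
        f.count("*").alias("numberColocalisingVariants"),
        f.sum("clpp").alias("clpp"),
    )
    .show()
)
```
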
Source code in src/otg/method/colocalisation.py
class ECaviar:\n    \"\"\"ECaviar-based colocalisation analysis.\n\n    It extends [CAVIAR](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5142122/#bib18)\u00a0framework to explicitly estimate the posterior probability that the same variant is causal in 2 studies while accounting for the uncertainty of LD. eCAVIAR computes the colocalization posterior probability (**CLPP**) by utilizing the marginal posterior probabilities. This framework allows for **multiple variants to be causal** in a single locus.\n    \"\"\"\n\n    @staticmethod\n    def _get_clpp(left_pp: Column, right_pp: Column) -> Column:\n        \"\"\"Calculate the colocalisation posterior probability (CLPP).\n\n        If the fact that the same variant is found causal for two studies are independent events,\n        CLPP is defined as the product of posterior porbabilities that a variant is causal in both studies.\n\n        Args:\n            left_pp (Column): left posterior probability\n            right_pp (Column): right posterior probability\n\n        Returns:\n            Column: CLPP\n\n        Examples:\n            >>> d = [{\"left_pp\": 0.5, \"right_pp\": 0.5}, {\"left_pp\": 0.25, \"right_pp\": 0.75}]\n            >>> df = spark.createDataFrame(d)\n            >>> df.withColumn(\"clpp\", ECaviar._get_clpp(f.col(\"left_pp\"), f.col(\"right_pp\"))).show()\n            +-------+--------+------+\n            |left_pp|right_pp|  clpp|\n            +-------+--------+------+\n            |    0.5|     0.5|  0.25|\n            |   0.25|    0.75|0.1875|\n            +-------+--------+------+\n            <BLANKLINE>\n\n        \"\"\"\n        return left_pp * right_pp\n\n    @classmethod\n    def colocalise(\n        cls: type[ECaviar], overlapping_signals: StudyLocusOverlap\n    ) -> Colocalisation:\n        \"\"\"Calculate bayesian colocalisation based on overlapping signals.\n\n        Args:\n            overlapping_signals (StudyLocusOverlap): overlapping signals.\n\n        Returns:\n            Colocalisation: colocalisation results based on eCAVIAR.\n        \"\"\"\n        return Colocalisation(\n            _df=(\n                overlapping_signals.df.withColumn(\n                    \"clpp\",\n                    ECaviar._get_clpp(\n                        f.col(\"statistics.left_posteriorProbability\"),\n                        f.col(\"statistics.right_posteriorProbability\"),\n                    ),\n                )\n                .groupBy(\"leftStudyLocusId\", \"rightStudyLocusId\", \"chromosome\")\n                .agg(\n                    f.count(\"*\").alias(\"numberColocalisingVariants\"),\n                    f.sum(f.col(\"clpp\")).alias(\"clpp\"),\n                )\n                .withColumn(\"colocalisationMethod\", f.lit(\"eCAVIAR\"))\n            ),\n            _schema=Colocalisation.get_schema(),\n        )\n
"},{"location":"python_api/method/ecaviar/#otg.method.colocalisation.ECaviar.colocalise","title":"colocalise(overlapping_signals: StudyLocusOverlap) -> Colocalisation classmethod","text":"

Calculate Bayesian colocalisation based on overlapping signals.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `overlapping_signals` | `StudyLocusOverlap` | overlapping signals. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `Colocalisation` | `Colocalisation` | colocalisation results based on eCAVIAR. |

Source code in src/otg/method/colocalisation.py
@classmethod\ndef colocalise(\n    cls: type[ECaviar], overlapping_signals: StudyLocusOverlap\n) -> Colocalisation:\n    \"\"\"Calculate bayesian colocalisation based on overlapping signals.\n\n    Args:\n        overlapping_signals (StudyLocusOverlap): overlapping signals.\n\n    Returns:\n        Colocalisation: colocalisation results based on eCAVIAR.\n    \"\"\"\n    return Colocalisation(\n        _df=(\n            overlapping_signals.df.withColumn(\n                \"clpp\",\n                ECaviar._get_clpp(\n                    f.col(\"statistics.left_posteriorProbability\"),\n                    f.col(\"statistics.right_posteriorProbability\"),\n                ),\n            )\n            .groupBy(\"leftStudyLocusId\", \"rightStudyLocusId\", \"chromosome\")\n            .agg(\n                f.count(\"*\").alias(\"numberColocalisingVariants\"),\n                f.sum(f.col(\"clpp\")).alias(\"clpp\"),\n            )\n            .withColumn(\"colocalisationMethod\", f.lit(\"eCAVIAR\"))\n        ),\n        _schema=Colocalisation.get_schema(),\n    )\n
"},{"location":"python_api/method/ld_annotator/","title":"LDAnnotator","text":""},{"location":"python_api/method/ld_annotator/#otg.method.ld.LDAnnotator","title":"otg.method.ld.LDAnnotator","text":"

Class to annotate study loci with linkage disequilibrium (LD) information from GnomAD.

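A hedged usage sketch, assuming `study_locus`, `study_index` and `ld_index` are pre-built `StudyLocus`, `StudyIndex` and `LDIndex` datasets:

```python
from otg.method.ld import LDAnnotator

# All three inputs are assumed to be existing datasets produced by upstream steps.
annotated = LDAnnotator.ld_annotate(
    associations=study_locus,
    studies=study_index,
    ld_index=ld_index,
)

# The result carries an `ldSet` column with tag variants and their
# ancestry-weighted r2Overall values.
annotated.df.select("studyLocusId", "ldSet").show(truncate=False)
```
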
Source code in src/otg/method/ld.py
class LDAnnotator:\n    \"\"\"Class to annotate linkage disequilibrium (LD) operations from GnomAD.\"\"\"\n\n    @staticmethod\n    def _calculate_weighted_r_overall(ld_set: Column) -> Column:\n        \"\"\"Aggregation of weighted R information using ancestry proportions.\n\n        Args:\n            ld_set (Column): LD set\n\n        Returns:\n            Column: LD set with added 'r2Overall' field\n        \"\"\"\n        return f.transform(\n            ld_set,\n            lambda x: f.struct(\n                x[\"tagVariantId\"].alias(\"tagVariantId\"),\n                # r2Overall is the accumulated sum of each r2 relative to the population size\n                f.aggregate(\n                    x[\"rValues\"],\n                    f.lit(0.0),\n                    lambda acc, y: acc\n                    + f.coalesce(\n                        f.pow(y[\"r\"], 2) * y[\"relativeSampleSize\"], f.lit(0.0)\n                    ),  # we use coalesce to avoid problems when r/relativeSampleSize is null\n                ).alias(\"r2Overall\"),\n            ),\n        )\n\n    @staticmethod\n    def _add_population_size(ld_set: Column, study_populations: Column) -> Column:\n        \"\"\"Add population size to each rValues entry in the ldSet.\n\n        Args:\n            ld_set (Column): LD set\n            study_populations (Column): Study populations\n\n        Returns:\n            Column: LD set with added 'relativeSampleSize' field\n        \"\"\"\n        # Create a population to relativeSampleSize map from the struct\n        populations_map = f.map_from_arrays(\n            study_populations[\"ldPopulation\"],\n            study_populations[\"relativeSampleSize\"],\n        )\n        return f.transform(\n            ld_set,\n            lambda x: f.struct(\n                x[\"tagVariantId\"].alias(\"tagVariantId\"),\n                f.transform(\n                    x[\"rValues\"],\n                    lambda y: f.struct(\n                        y[\"population\"].alias(\"population\"),\n                        y[\"r\"].alias(\"r\"),\n                        populations_map[y[\"population\"]].alias(\"relativeSampleSize\"),\n                    ),\n                ).alias(\"rValues\"),\n            ),\n        )\n\n    @classmethod\n    def ld_annotate(\n        cls: type[LDAnnotator],\n        associations: StudyLocus,\n        studies: StudyIndex,\n        ld_index: LDIndex,\n    ) -> StudyLocus:\n        \"\"\"Annotate linkage disequilibrium (LD) information to a set of studyLocus.\n\n        This function:\n            1. Annotates study locus with population structure information from the study index\n            2. Joins the LD index to the StudyLocus\n            3. Adds the population size of the study to each rValues entry in the ldSet\n            4. 
Calculates the overall R weighted by the ancestry proportions in every given study.\n\n        Args:\n            associations (StudyLocus): Dataset to be LD annotated\n            studies (StudyIndex): Dataset with study information\n            ld_index (LDIndex): Dataset with LD information for every variant present in LD matrix\n\n        Returns:\n            StudyLocus: including additional column with LD information.\n        \"\"\"\n        return (\n            StudyLocus(\n                _df=(\n                    associations.df\n                    # Drop ldSet column if already available\n                    .select(*[col for col in associations.df.columns if col != \"ldSet\"])\n                    # Annotate study locus with population structure from study index\n                    .join(\n                        studies.df.select(\"studyId\", \"ldPopulationStructure\"),\n                        on=\"studyId\",\n                        how=\"left\",\n                    )\n                    # Bring LD information from LD Index\n                    .join(\n                        ld_index.df,\n                        on=[\"variantId\", \"chromosome\"],\n                        how=\"left\",\n                    )\n                    # Add population size to each rValues entry in the ldSet if population structure available:\n                    .withColumn(\n                        \"ldSet\",\n                        f.when(\n                            f.col(\"ldPopulationStructure\").isNotNull(),\n                            cls._add_population_size(\n                                f.col(\"ldSet\"), f.col(\"ldPopulationStructure\")\n                            ),\n                        ),\n                    )\n                    # Aggregate weighted R information using ancestry proportions\n                    .withColumn(\n                        \"ldSet\",\n                        f.when(\n                            f.col(\"ldPopulationStructure\").isNotNull(),\n                            cls._calculate_weighted_r_overall(f.col(\"ldSet\")),\n                        ),\n                    ).drop(\"ldPopulationStructure\")\n                ),\n                _schema=StudyLocus.get_schema(),\n            )\n            ._qc_no_population()\n            ._qc_unresolved_ld()\n        )\n
"},{"location":"python_api/method/ld_annotator/#otg.method.ld.LDAnnotator.ld_annotate","title":"ld_annotate(associations: StudyLocus, studies: StudyIndex, ld_index: LDIndex) -> StudyLocus classmethod","text":"

Annotate linkage disequilibrium (LD) information to a set of studyLocus.

This function:
  1. Annotates study locus with population structure information from the study index
  2. Joins the LD index to the StudyLocus
  3. Adds the population size of the study to each rValues entry in the ldSet
  4. Calculates the overall R weighted by the ancestry proportions in every given study.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `associations` | `StudyLocus` | Dataset to be LD annotated | required |
| `studies` | `StudyIndex` | Dataset with study information | required |
| `ld_index` | `LDIndex` | Dataset with LD information for every variant present in LD matrix | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `StudyLocus` | `StudyLocus` | including additional column with LD information. |

Source code in src/otg/method/ld.py
@classmethod\ndef ld_annotate(\n    cls: type[LDAnnotator],\n    associations: StudyLocus,\n    studies: StudyIndex,\n    ld_index: LDIndex,\n) -> StudyLocus:\n    \"\"\"Annotate linkage disequilibrium (LD) information to a set of studyLocus.\n\n    This function:\n        1. Annotates study locus with population structure information from the study index\n        2. Joins the LD index to the StudyLocus\n        3. Adds the population size of the study to each rValues entry in the ldSet\n        4. Calculates the overall R weighted by the ancestry proportions in every given study.\n\n    Args:\n        associations (StudyLocus): Dataset to be LD annotated\n        studies (StudyIndex): Dataset with study information\n        ld_index (LDIndex): Dataset with LD information for every variant present in LD matrix\n\n    Returns:\n        StudyLocus: including additional column with LD information.\n    \"\"\"\n    return (\n        StudyLocus(\n            _df=(\n                associations.df\n                # Drop ldSet column if already available\n                .select(*[col for col in associations.df.columns if col != \"ldSet\"])\n                # Annotate study locus with population structure from study index\n                .join(\n                    studies.df.select(\"studyId\", \"ldPopulationStructure\"),\n                    on=\"studyId\",\n                    how=\"left\",\n                )\n                # Bring LD information from LD Index\n                .join(\n                    ld_index.df,\n                    on=[\"variantId\", \"chromosome\"],\n                    how=\"left\",\n                )\n                # Add population size to each rValues entry in the ldSet if population structure available:\n                .withColumn(\n                    \"ldSet\",\n                    f.when(\n                        f.col(\"ldPopulationStructure\").isNotNull(),\n                        cls._add_population_size(\n                            f.col(\"ldSet\"), f.col(\"ldPopulationStructure\")\n                        ),\n                    ),\n                )\n                # Aggregate weighted R information using ancestry proportions\n                .withColumn(\n                    \"ldSet\",\n                    f.when(\n                        f.col(\"ldPopulationStructure\").isNotNull(),\n                        cls._calculate_weighted_r_overall(f.col(\"ldSet\")),\n                    ),\n                ).drop(\"ldPopulationStructure\")\n            ),\n            _schema=StudyLocus.get_schema(),\n        )\n        ._qc_no_population()\n        ._qc_unresolved_ld()\n    )\n
"},{"location":"python_api/method/pics/","title":"PICS","text":""},{"location":"python_api/method/pics/#otg.method.pics.PICS","title":"otg.method.pics.PICS","text":"

Probabilistic Identification of Causal SNPs (PICS) is an algorithm that estimates the probability that an individual variant is causal, considering the haplotype structure and the observed pattern of association at the genetic locus.

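The per-variant arithmetic can be illustrated with plain Python/SciPy on toy values; this mirrors the `_pics_mu`, `_pics_standard_deviation` and posterior-scaling steps in the source below, and reproduces the values of the `_finemap` doctest:

```python
from __future__ import annotations

from scipy.stats import norm

def pics_mu(neglog_p: float, r2: float) -> float | None:
    # Expected -log10(p) of a tag variant; only defined when r2 >= 0.5.
    return neglog_p * r2 if r2 >= 0.5 else None

def pics_std(neglog_p: float, r2: float, k: float = 6.4) -> float | None:
    # Standard deviation of the tag's -log10(p); only defined when r2 >= 0.5.
    return (
        abs(((1 - (r2**0.5) ** k) ** 0.5) * (neglog_p**0.5) / 2)
        if r2 >= 0.5
        else None
    )

lead_neglog_p = 10.0
tags = {"var1": 0.8, "var2": 1.0}  # tag variant -> r2Overall with the lead

relative = {}
for variant, r2 in tags.items():
    mu, std = pics_mu(lead_neglog_p, r2), pics_std(lead_neglog_p, r2)
    std = 0.001 if std == 0 else std
    # Two-sided tail probability of the lead's signal under this tag's distribution.
    relative[variant] = norm(mu, std).sf(lead_neglog_p) * 2

# Scale the relative probabilities so they sum to 1 across the locus.
total = sum(relative.values())
print({variant: p / total for variant, p in relative.items()})
# {'var1': ~0.071, 'var2': ~0.929}
```
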
Source code in src/otg/method/pics.py
class PICS:\n    \"\"\"Probabilistic Identification of Causal SNPs (PICS), an algorithm estimating the probability that an individual variant is causal considering the haplotype structure and observed pattern of association at the genetic locus.\"\"\"\n\n    @staticmethod\n    def _pics_relative_posterior_probability(\n        neglog_p: float, pics_snp_mu: float, pics_snp_std: float\n    ) -> float:\n        \"\"\"Compute the PICS posterior probability for a given SNP.\n\n        !!! info \"This probability needs to be scaled to take into account the probabilities of the other variants in the locus.\"\n\n        Args:\n            neglog_p (float): Negative log p-value of the lead variant\n            pics_snp_mu (float): Mean P value of the association between a SNP and a trait\n            pics_snp_std (float): Standard deviation for the P value of the association between a SNP and a trait\n\n        Returns:\n            float: Posterior probability of the association between a SNP and a trait\n\n        Examples:\n            >>> rel_prob = PICS._pics_relative_posterior_probability(neglog_p=10.0, pics_snp_mu=1.0, pics_snp_std=10.0)\n            >>> round(rel_prob, 3)\n            0.368\n        \"\"\"\n        return float(norm(pics_snp_mu, pics_snp_std).sf(neglog_p) * 2)\n\n    @staticmethod\n    def _pics_standard_deviation(neglog_p: float, r2: float, k: float) -> float | None:\n        \"\"\"Compute the PICS standard deviation.\n\n        This distribution is obtained after a series of permutation tests described in the PICS method, and it is only\n        valid when the SNP is highly linked with the lead (r2 > 0.5).\n\n        Args:\n            neglog_p (float): Negative log p-value of the lead variant\n            r2 (float): LD score between a given SNP and the lead variant\n            k (float): Empiric constant that can be adjusted to fit the curve, 6.4 recommended.\n\n        Returns:\n            float | None: Standard deviation for the P value of the association between a SNP and a trait\n\n        Examples:\n            >>> PICS._pics_standard_deviation(neglog_p=1.0, r2=1.0, k=6.4)\n            0.0\n            >>> round(PICS._pics_standard_deviation(neglog_p=10.0, r2=0.5, k=6.4), 3)\n            1.493\n            >>> print(PICS._pics_standard_deviation(neglog_p=1.0, r2=0.0, k=6.4))\n            None\n        \"\"\"\n        return (\n            abs(((1 - (r2**0.5) ** k) ** 0.5) * (neglog_p**0.5) / 2)\n            if r2 >= 0.5\n            else None\n        )\n\n    @staticmethod\n    def _pics_mu(neglog_p: float, r2: float) -> float | None:\n        \"\"\"Compute the PICS mu that estimates the probability of association between a given SNP and the trait.\n\n        This distribution is obtained after a series of permutation tests described in the PICS method, and it is only\n        valid when the SNP is highly linked with the lead (r2 > 0.5).\n\n        Args:\n            neglog_p (float): Negative log p-value of the lead variant\n            r2 (float): LD score between a given SNP and the lead variant\n\n        Returns:\n            float | None: Mean P value of the association between a SNP and a trait\n\n        Examples:\n            >>> PICS._pics_mu(neglog_p=1.0, r2=1.0)\n            1.0\n            >>> PICS._pics_mu(neglog_p=10.0, r2=0.5)\n            5.0\n            >>> print(PICS._pics_mu(neglog_p=10.0, r2=0.3))\n            None\n        \"\"\"\n        return neglog_p * r2 if r2 >= 0.5 else None\n\n    @staticmethod\n    def _finemap(ld_set: list[Row], 
lead_neglog_p: float, k: float) -> list | None:\n        \"\"\"Calculates the probability of a variant being causal in a study-locus context by applying the PICS method.\n\n        It is intended to be applied as an UDF in `PICS.finemap`, where each row is a StudyLocus association.\n        The function iterates over every SNP in the `ldSet` array, and it returns an updated locus with\n        its association signal and causality probability as of PICS.\n\n        Args:\n            ld_set (list[Row]): list of tagging variants after expanding the locus\n            lead_neglog_p (float): P value of the association signal between the lead variant and the study in the form of -log10.\n            k (float): Empiric constant that can be adjusted to fit the curve, 6.4 recommended.\n\n        Returns:\n            list | None: List of tagging variants with an estimation of the association signal and their posterior probability as of PICS.\n\n        Examples:\n            >>> from pyspark.sql import Row\n            >>> ld_set = [\n            ...     Row(variantId=\"var1\", r2Overall=0.8),\n            ...     Row(variantId=\"var2\", r2Overall=1),\n            ... ]\n            >>> PICS._finemap(ld_set, lead_neglog_p=10.0, k=6.4)\n            [{'variantId': 'var1', 'r2Overall': 0.8, 'standardError': 0.07420896512708416, 'posteriorProbability': 0.07116959886882368}, {'variantId': 'var2', 'r2Overall': 1, 'standardError': 0.9977000638225533, 'posteriorProbability': 0.9288304011311763}]\n            >>> empty_ld_set = []\n            >>> PICS._finemap(empty_ld_set, lead_neglog_p=10.0, k=6.4)\n            []\n            >>> ld_set_with_no_r2 = [\n            ...     Row(variantId=\"var1\", r2Overall=None),\n            ...     Row(variantId=\"var2\", r2Overall=None),\n            ... 
]\n            >>> PICS._finemap(ld_set_with_no_r2, lead_neglog_p=10.0, k=6.4)\n            [{'variantId': 'var1', 'r2Overall': None}, {'variantId': 'var2', 'r2Overall': None}]\n        \"\"\"\n        if ld_set is None:\n            return None\n        elif not ld_set:\n            return []\n        tmp_credible_set = []\n        new_credible_set = []\n        # First iteration: calculation of mu, standard deviation, and the relative posterior probability\n        for tag_struct in ld_set:\n            tag_dict = (\n                tag_struct.asDict()\n            )  # tag_struct is of type pyspark.Row, we'll represent it as a dict\n            if (\n                not tag_dict[\"r2Overall\"]\n                or tag_dict[\"r2Overall\"] < 0.5\n                or not lead_neglog_p\n            ):\n                # If PICS cannot be calculated, we'll return the original credible set\n                new_credible_set.append(tag_dict)\n                continue\n\n            pics_snp_mu = PICS._pics_mu(lead_neglog_p, tag_dict[\"r2Overall\"])\n            pics_snp_std = PICS._pics_standard_deviation(\n                lead_neglog_p, tag_dict[\"r2Overall\"], k\n            )\n            pics_snp_std = 0.001 if pics_snp_std == 0 else pics_snp_std\n            if pics_snp_mu is not None and pics_snp_std is not None:\n                posterior_probability = PICS._pics_relative_posterior_probability(\n                    lead_neglog_p, pics_snp_mu, pics_snp_std\n                )\n                tag_dict[\"standardError\"] = 10**-pics_snp_std\n                tag_dict[\"relativePosteriorProbability\"] = posterior_probability\n\n                tmp_credible_set.append(tag_dict)\n\n        # Second iteration: calculation of the sum of all the posteriors in each study-locus, so that we scale them between 0-1\n        total_posteriors = sum(\n            tag_dict.get(\"relativePosteriorProbability\", 0)\n            for tag_dict in tmp_credible_set\n        )\n\n        # Third iteration: calculation of the final posteriorProbability\n        for tag_dict in tmp_credible_set:\n            if total_posteriors != 0:\n                tag_dict[\"posteriorProbability\"] = float(\n                    tag_dict.get(\"relativePosteriorProbability\", 0) / total_posteriors\n                )\n            tag_dict.pop(\"relativePosteriorProbability\")\n            new_credible_set.append(tag_dict)\n        return new_credible_set\n\n    @classmethod\n    def finemap(\n        cls: type[PICS], associations: StudyLocus, k: float = 6.4\n    ) -> StudyLocus:\n        \"\"\"Run PICS on a study locus.\n\n        !!! 
info \"Study locus needs to be LD annotated\"\n            The study locus needs to be LD annotated before PICS can be calculated.\n\n        Args:\n            associations (StudyLocus): Study locus to finemap using PICS\n            k (float): Empiric constant that can be adjusted to fit the curve, 6.4 recommended.\n\n        Returns:\n            StudyLocus: Study locus with PICS results\n        \"\"\"\n        # Register UDF by defining the structure of the output locus array of structs\n        # it also renames tagVariantId to variantId\n\n        picsed_ldset_schema = t.ArrayType(\n            t.StructType(\n                [\n                    t.StructField(\"tagVariantId\", t.StringType(), True),\n                    t.StructField(\"r2Overall\", t.DoubleType(), True),\n                    t.StructField(\"posteriorProbability\", t.DoubleType(), True),\n                    t.StructField(\"standardError\", t.DoubleType(), True),\n                ]\n            )\n        )\n        picsed_study_locus_schema = t.ArrayType(\n            t.StructType(\n                [\n                    t.StructField(\"variantId\", t.StringType(), True),\n                    t.StructField(\"r2Overall\", t.DoubleType(), True),\n                    t.StructField(\"posteriorProbability\", t.DoubleType(), True),\n                    t.StructField(\"standardError\", t.DoubleType(), True),\n                ]\n            )\n        )\n        _finemap_udf = f.udf(\n            lambda locus, neglog_p: PICS._finemap(locus, neglog_p, k),\n            picsed_ldset_schema,\n        )\n        return StudyLocus(\n            _df=(\n                associations.df\n                # Old locus column will be dropped if available\n                .select(*[col for col in associations.df.columns if col != \"locus\"])\n                # Estimate neglog_pvalue for the lead variant\n                .withColumn(\"neglog_pvalue\", associations.neglog_pvalue())\n                # New locus containing the PICS results\n                .withColumn(\n                    \"locus\",\n                    f.when(\n                        f.col(\"ldSet\").isNotNull(),\n                        _finemap_udf(f.col(\"ldSet\"), f.col(\"neglog_pvalue\")).cast(\n                            picsed_study_locus_schema\n                        ),\n                    ),\n                )\n                # Rename tagVariantId to variantId\n                .drop(\"neglog_pvalue\")\n            ),\n            _schema=StudyLocus.get_schema(),\n        )\n
"},{"location":"python_api/method/pics/#otg.method.pics.PICS.finemap","title":"finemap(associations: StudyLocus, k: float = 6.4) -> StudyLocus classmethod","text":"

Run PICS on a study locus.

Study locus needs to be LD annotated

The study locus needs to be LD annotated before PICS can be calculated.

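A hedged usage sketch, assuming `ld_annotated` is a `StudyLocus` that has already been through `LDAnnotator.ld_annotate`:

```python
from otg.method.pics import PICS

# `ld_annotated` must carry an `ldSet` column, i.e. be LD annotated beforehand.
finemapped = PICS.finemap(ld_annotated, k=6.4)

# Each row now has a `locus` array with per-variant posteriorProbability
# and standardError estimates.
finemapped.df.select("studyLocusId", "locus").show(truncate=False)
```
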
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `associations` | `StudyLocus` | Study locus to finemap using PICS | required |
| `k` | `float` | Empiric constant that can be adjusted to fit the curve, 6.4 recommended. | `6.4` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `StudyLocus` | `StudyLocus` | Study locus with PICS results |

Source code in src/otg/method/pics.py
@classmethod\ndef finemap(\n    cls: type[PICS], associations: StudyLocus, k: float = 6.4\n) -> StudyLocus:\n    \"\"\"Run PICS on a study locus.\n\n    !!! info \"Study locus needs to be LD annotated\"\n        The study locus needs to be LD annotated before PICS can be calculated.\n\n    Args:\n        associations (StudyLocus): Study locus to finemap using PICS\n        k (float): Empiric constant that can be adjusted to fit the curve, 6.4 recommended.\n\n    Returns:\n        StudyLocus: Study locus with PICS results\n    \"\"\"\n    # Register UDF by defining the structure of the output locus array of structs\n    # it also renames tagVariantId to variantId\n\n    picsed_ldset_schema = t.ArrayType(\n        t.StructType(\n            [\n                t.StructField(\"tagVariantId\", t.StringType(), True),\n                t.StructField(\"r2Overall\", t.DoubleType(), True),\n                t.StructField(\"posteriorProbability\", t.DoubleType(), True),\n                t.StructField(\"standardError\", t.DoubleType(), True),\n            ]\n        )\n    )\n    picsed_study_locus_schema = t.ArrayType(\n        t.StructType(\n            [\n                t.StructField(\"variantId\", t.StringType(), True),\n                t.StructField(\"r2Overall\", t.DoubleType(), True),\n                t.StructField(\"posteriorProbability\", t.DoubleType(), True),\n                t.StructField(\"standardError\", t.DoubleType(), True),\n            ]\n        )\n    )\n    _finemap_udf = f.udf(\n        lambda locus, neglog_p: PICS._finemap(locus, neglog_p, k),\n        picsed_ldset_schema,\n    )\n    return StudyLocus(\n        _df=(\n            associations.df\n            # Old locus column will be dropped if available\n            .select(*[col for col in associations.df.columns if col != \"locus\"])\n            # Estimate neglog_pvalue for the lead variant\n            .withColumn(\"neglog_pvalue\", associations.neglog_pvalue())\n            # New locus containing the PICS results\n            .withColumn(\n                \"locus\",\n                f.when(\n                    f.col(\"ldSet\").isNotNull(),\n                    _finemap_udf(f.col(\"ldSet\"), f.col(\"neglog_pvalue\")).cast(\n                        picsed_study_locus_schema\n                    ),\n                ),\n            )\n            # Rename tagVariantId to variantId\n            .drop(\"neglog_pvalue\")\n        ),\n        _schema=StudyLocus.get_schema(),\n    )\n
"},{"location":"python_api/method/window_based_clumping/","title":"Window-based clumping","text":""},{"location":"python_api/method/window_based_clumping/#otg.method.window_based_clumping.WindowBasedClumping","title":"otg.method.window_based_clumping.WindowBasedClumping","text":"

Get semi-lead SNPs from summary statistics using a window-based function.

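The peak-pruning step at the heart of the method can be sketched in plain Python/NumPy; it mirrors the greedy logic of `_prune_peak` in the source below, using the same toy positions as its doctest:

```python
import numpy as np

def prune_peak(positions: np.ndarray, window_size: int) -> np.ndarray:
    """Mark lead SNPs among positions already ordered by ascending p-value.

    A SNP becomes a lead only if it lies at least `window_size` bp away from
    every previously selected lead; otherwise it is clumped into that lead.
    """
    is_lead = np.zeros(len(positions))
    lead_positions: list = []
    for index, position in enumerate(positions):
        if all(abs(position - lead) >= window_size for lead in lead_positions):
            lead_positions.append(position)
            is_lead[index] = 1
    return is_lead

# Positions of genome-wide significant SNPs in one cluster, ordered by significance.
print(prune_peak(np.array([3, 9, 8, 4, 6]), window_size=2))
# [1. 1. 0. 0. 1.]
```
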
Source code in src/otg/method/window_based_clumping.py
class WindowBasedClumping:\n    \"\"\"Get semi-lead snps from summary statistics using a window based function.\"\"\"\n\n    @staticmethod\n    def _cluster_peaks(\n        study: Column, chromosome: Column, position: Column, window_length: int\n    ) -> Column:\n        \"\"\"Cluster GWAS significant variants, were clusters are separated by a defined distance.\n\n        !! Important to note that the length of the clusters can be arbitrarily big.\n\n        Args:\n            study (Column): study identifier\n            chromosome (Column): chromosome identifier\n            position (Column): position of the variant\n            window_length (int): window length in basepair\n\n        Returns:\n            Column: containing cluster identifier\n\n        Examples:\n            >>> data = [\n            ...     # Cluster 1:\n            ...     ('s1', 'chr1', 2),\n            ...     ('s1', 'chr1', 4),\n            ...     ('s1', 'chr1', 12),\n            ...     # Cluster 2 - Same chromosome:\n            ...     ('s1', 'chr1', 31),\n            ...     ('s1', 'chr1', 38),\n            ...     ('s1', 'chr1', 42),\n            ...     # Cluster 3 - New chromosome:\n            ...     ('s1', 'chr2', 41),\n            ...     ('s1', 'chr2', 44),\n            ...     ('s1', 'chr2', 50),\n            ...     # Cluster 4 - other study:\n            ...     ('s2', 'chr2', 55),\n            ...     ('s2', 'chr2', 62),\n            ...     ('s2', 'chr2', 70),\n            ... ]\n            >>> window_length = 10\n            >>> (\n            ...     spark.createDataFrame(data, ['studyId', 'chromosome', 'position'])\n            ...     .withColumn(\"cluster_id\",\n            ...         WindowBasedClumping._cluster_peaks(\n            ...             f.col('studyId'),\n            ...             f.col('chromosome'),\n            ...             f.col('position'),\n            ...             window_length\n            ...         )\n            ...     ).show()\n            ... 
)\n            +-------+----------+--------+----------+\n            |studyId|chromosome|position|cluster_id|\n            +-------+----------+--------+----------+\n            |     s1|      chr1|       2| s1_chr1_2|\n            |     s1|      chr1|       4| s1_chr1_2|\n            |     s1|      chr1|      12| s1_chr1_2|\n            |     s1|      chr1|      31|s1_chr1_31|\n            |     s1|      chr1|      38|s1_chr1_31|\n            |     s1|      chr1|      42|s1_chr1_31|\n            |     s1|      chr2|      41|s1_chr2_41|\n            |     s1|      chr2|      44|s1_chr2_41|\n            |     s1|      chr2|      50|s1_chr2_41|\n            |     s2|      chr2|      55|s2_chr2_55|\n            |     s2|      chr2|      62|s2_chr2_55|\n            |     s2|      chr2|      70|s2_chr2_55|\n            +-------+----------+--------+----------+\n            <BLANKLINE>\n\n        \"\"\"\n        # By adding previous position, the cluster boundary can be identified:\n        previous_position = f.lag(position).over(\n            Window.partitionBy(study, chromosome).orderBy(position)\n        )\n        # We consider a cluster boudary if subsequent snps are further than the defined window:\n        cluster_id = f.when(\n            (previous_position.isNull())\n            | (position - previous_position > window_length),\n            f.concat_ws(\"_\", study, chromosome, position),\n        )\n        # The cluster identifier is propagated across every variant of the cluster:\n        return f.when(\n            cluster_id.isNull(),\n            f.last(cluster_id, ignorenulls=True).over(\n                Window.partitionBy(study, chromosome)\n                .orderBy(position)\n                .rowsBetween(Window.unboundedPreceding, Window.currentRow)\n            ),\n        ).otherwise(cluster_id)\n\n    @staticmethod\n    def _prune_peak(position: ndarray, window_size: int) -> DenseVector:\n        \"\"\"Establish lead snps based on their positions listed by p-value.\n\n        The function `find_peak` assigns lead SNPs based on their positions listed by p-value within a specified window size.\n\n        Args:\n            position (ndarray): positions of the SNPs sorted by p-value.\n            window_size (int): the distance in bp within which associations are clumped together around the lead snp.\n\n        Returns:\n            DenseVector: binary vector where 1 indicates a lead SNP and 0 indicates a non-lead SNP.\n\n        Examples:\n            >>> from pyspark.ml import functions as fml\n            >>> from pyspark.ml.linalg import DenseVector\n            >>> WindowBasedClumping._prune_peak(np.array((3, 9, 8, 4, 6)), 2)\n            DenseVector([1.0, 1.0, 0.0, 0.0, 1.0])\n\n        \"\"\"\n        # Initializing the lead list with zeroes:\n        is_lead: ndarray = np.zeros(len(position))\n\n        # List containing indices of leads:\n        lead_indices: list = []\n\n        # Looping through all positions:\n        for index in range(len(position)):\n            # Looping through leads to find out if they are within a window:\n            for lead_index in lead_indices:\n                # If any of the leads within the window:\n                if abs(position[lead_index] - position[index]) < window_size:\n                    # Skipping further checks:\n                    break\n            else:\n                # None of the leads were within the window:\n                lead_indices.append(index)\n                is_lead[index] = 1\n\n        return 
DenseVector(is_lead)\n\n    @classmethod\n    def clump(\n        cls: type[WindowBasedClumping],\n        summary_stats: SummaryStatistics,\n        window_length: int,\n        p_value_significance: float = 5e-8,\n    ) -> StudyLocus:\n        \"\"\"Clump summary statistics by distance.\n\n        Args:\n            summary_stats (SummaryStatistics): summary statistics to clump\n            window_length (int): window length in basepair\n            p_value_significance (float): only more significant variants are considered\n\n        Returns:\n            StudyLocus: clumped summary statistics\n        \"\"\"\n        # Create window for locus clusters\n        # - variants where the distance between subsequent variants is below the defined threshold.\n        # - Variants are sorted by descending significance\n        cluster_window = Window.partitionBy(\n            \"studyId\", \"chromosome\", \"cluster_id\"\n        ).orderBy(f.col(\"pValueExponent\").asc(), f.col(\"pValueMantissa\").asc())\n\n        return StudyLocus(\n            _df=(\n                summary_stats\n                # Dropping snps below significance - all subsequent steps are done on significant variants:\n                .pvalue_filter(p_value_significance)\n                .df\n                # Clustering summary variants for efficient windowing (complexity reduction):\n                .withColumn(\n                    \"cluster_id\",\n                    WindowBasedClumping._cluster_peaks(\n                        f.col(\"studyId\"),\n                        f.col(\"chromosome\"),\n                        f.col(\"position\"),\n                        window_length,\n                    ),\n                )\n                # Within each cluster variants are ranked by significance:\n                .withColumn(\"pvRank\", f.row_number().over(cluster_window))\n                # Collect positions in cluster for the most significant variant (complexity reduction):\n                .withColumn(\n                    \"collectedPositions\",\n                    f.when(\n                        f.col(\"pvRank\") == 1,\n                        f.collect_list(f.col(\"position\")).over(\n                            cluster_window.rowsBetween(\n                                Window.currentRow, Window.unboundedFollowing\n                            )\n                        ),\n                    ).otherwise(f.array()),\n                )\n                # Get semi indices only ONCE per cluster:\n                .withColumn(\n                    \"semiIndices\",\n                    f.when(\n                        f.size(f.col(\"collectedPositions\")) > 0,\n                        fml.vector_to_array(\n                            f.udf(WindowBasedClumping._prune_peak, VectorUDT())(\n                                fml.array_to_vector(f.col(\"collectedPositions\")),\n                                f.lit(window_length),\n                            )\n                        ),\n                    ),\n                )\n                # Propagating the result of the above calculation for all rows:\n                .withColumn(\n                    \"semiIndices\",\n                    f.when(\n                        f.col(\"semiIndices\").isNull(),\n                        f.first(f.col(\"semiIndices\"), ignorenulls=True).over(\n                            cluster_window\n                        ),\n                    ).otherwise(f.col(\"semiIndices\")),\n                )\n                # Keeping semi indices 
only:\n                .filter(f.col(\"semiIndices\")[f.col(\"pvRank\") - 1] > 0)\n                .drop(\"pvRank\", \"collectedPositions\", \"semiIndices\", \"cluster_id\")\n                # Adding study-locus id:\n                .withColumn(\n                    \"studyLocusId\",\n                    StudyLocus.assign_study_locus_id(\"studyId\", \"variantId\"),\n                )\n                # Initialize QC column as array of strings:\n                .withColumn(\n                    \"qualityControls\", f.array().cast(t.ArrayType(t.StringType()))\n                )\n            ),\n            _schema=StudyLocus.get_schema(),\n        )\n\n    @classmethod\n    def clump_with_locus(\n        cls: type[WindowBasedClumping],\n        summary_stats: SummaryStatistics,\n        window_length: int,\n        p_value_significance: float = 5e-8,\n        p_value_baseline: float = 0.05,\n        locus_window_length: int | None = None,\n    ) -> StudyLocus:\n        \"\"\"Clump significant associations while collecting locus around them.\n\n        Args:\n            summary_stats (SummaryStatistics): Input summary statistics dataset\n            window_length (int): Window size in  bp, used for distance based clumping.\n            p_value_significance (float): GWAS significance threshold used to filter peaks. Defaults to 5e-8.\n            p_value_baseline (float): Least significant threshold. Below this, all snps are dropped. Defaults to 0.05.\n            locus_window_length (int | None): The distance for collecting locus around the semi indices. Defaults to None.\n\n        Returns:\n            StudyLocus: StudyLocus after clumping with information about the `locus`\n        \"\"\"\n        # If no locus window provided, using the same value:\n        if locus_window_length is None:\n            locus_window_length = window_length\n\n        # Run distance based clumping on the summary stats:\n        clumped_dataframe = WindowBasedClumping.clump(\n            summary_stats,\n            window_length=window_length,\n            p_value_significance=p_value_significance,\n        ).df.alias(\"clumped\")\n\n        # Get list of columns from clumped dataset for further propagation:\n        clumped_columns = clumped_dataframe.columns\n\n        # Dropping variants not meeting the baseline criteria:\n        sumstats_baseline = summary_stats.pvalue_filter(p_value_baseline).df\n\n        # Renaming columns:\n        sumstats_baseline_renamed = sumstats_baseline.selectExpr(\n            *[f\"{col} as tag_{col}\" for col in sumstats_baseline.columns]\n        ).alias(\"sumstat\")\n\n        study_locus_df = (\n            sumstats_baseline_renamed\n            # Joining the two datasets together:\n            .join(\n                f.broadcast(clumped_dataframe),\n                on=[\n                    (f.col(\"sumstat.tag_studyId\") == f.col(\"clumped.studyId\"))\n                    & (f.col(\"sumstat.tag_chromosome\") == f.col(\"clumped.chromosome\"))\n                    & (\n                        f.col(\"sumstat.tag_position\")\n                        >= (f.col(\"clumped.position\") - locus_window_length)\n                    )\n                    & (\n                        f.col(\"sumstat.tag_position\")\n                        <= (f.col(\"clumped.position\") + locus_window_length)\n                    )\n                ],\n                how=\"right\",\n            )\n            .withColumn(\n                \"locus\",\n                f.struct(\n                    
f.col(\"tag_variantId\").alias(\"variantId\"),\n                    f.col(\"tag_beta\").alias(\"beta\"),\n                    f.col(\"tag_pValueMantissa\").alias(\"pValueMantissa\"),\n                    f.col(\"tag_pValueExponent\").alias(\"pValueExponent\"),\n                    f.col(\"tag_standardError\").alias(\"standardError\"),\n                ),\n            )\n            .groupby(\"studyLocusId\")\n            .agg(\n                *[\n                    f.first(col).alias(col)\n                    for col in clumped_columns\n                    if col != \"studyLocusId\"\n                ],\n                f.collect_list(f.col(\"locus\")).alias(\"locus\"),\n            )\n        )\n\n        return StudyLocus(\n            _df=study_locus_df,\n            _schema=StudyLocus.get_schema(),\n        )\n
"},{"location":"python_api/method/window_based_clumping/#otg.method.window_based_clumping.WindowBasedClumping.clump","title":"clump(summary_stats: SummaryStatistics, window_length: int, p_value_significance: float = 5e-08) -> StudyLocus classmethod","text":"

Clump summary statistics by distance.

Parameters:

Name Type Description Default summary_stats SummaryStatistics

summary statistics to clump

required window_length int

window length in basepair

required p_value_significance float

only more significant variants are considered

5e-08

Returns:

Name Type Description StudyLocus StudyLocus

clumped summary statistics

Source code in src/otg/method/window_based_clumping.py
@classmethod\ndef clump(\n    cls: type[WindowBasedClumping],\n    summary_stats: SummaryStatistics,\n    window_length: int,\n    p_value_significance: float = 5e-8,\n) -> StudyLocus:\n    \"\"\"Clump summary statistics by distance.\n\n    Args:\n        summary_stats (SummaryStatistics): summary statistics to clump\n        window_length (int): window length in basepair\n        p_value_significance (float): only more significant variants are considered\n\n    Returns:\n        StudyLocus: clumped summary statistics\n    \"\"\"\n    # Create window for locus clusters\n    # - variants where the distance between subsequent variants is below the defined threshold.\n    # - Variants are sorted by descending significance\n    cluster_window = Window.partitionBy(\n        \"studyId\", \"chromosome\", \"cluster_id\"\n    ).orderBy(f.col(\"pValueExponent\").asc(), f.col(\"pValueMantissa\").asc())\n\n    return StudyLocus(\n        _df=(\n            summary_stats\n            # Dropping snps below significance - all subsequent steps are done on significant variants:\n            .pvalue_filter(p_value_significance)\n            .df\n            # Clustering summary variants for efficient windowing (complexity reduction):\n            .withColumn(\n                \"cluster_id\",\n                WindowBasedClumping._cluster_peaks(\n                    f.col(\"studyId\"),\n                    f.col(\"chromosome\"),\n                    f.col(\"position\"),\n                    window_length,\n                ),\n            )\n            # Within each cluster variants are ranked by significance:\n            .withColumn(\"pvRank\", f.row_number().over(cluster_window))\n            # Collect positions in cluster for the most significant variant (complexity reduction):\n            .withColumn(\n                \"collectedPositions\",\n                f.when(\n                    f.col(\"pvRank\") == 1,\n                    f.collect_list(f.col(\"position\")).over(\n                        cluster_window.rowsBetween(\n                            Window.currentRow, Window.unboundedFollowing\n                        )\n                    ),\n                ).otherwise(f.array()),\n            )\n            # Get semi indices only ONCE per cluster:\n            .withColumn(\n                \"semiIndices\",\n                f.when(\n                    f.size(f.col(\"collectedPositions\")) > 0,\n                    fml.vector_to_array(\n                        f.udf(WindowBasedClumping._prune_peak, VectorUDT())(\n                            fml.array_to_vector(f.col(\"collectedPositions\")),\n                            f.lit(window_length),\n                        )\n                    ),\n                ),\n            )\n            # Propagating the result of the above calculation for all rows:\n            .withColumn(\n                \"semiIndices\",\n                f.when(\n                    f.col(\"semiIndices\").isNull(),\n                    f.first(f.col(\"semiIndices\"), ignorenulls=True).over(\n                        cluster_window\n                    ),\n                ).otherwise(f.col(\"semiIndices\")),\n            )\n            # Keeping semi indices only:\n            .filter(f.col(\"semiIndices\")[f.col(\"pvRank\") - 1] > 0)\n            .drop(\"pvRank\", \"collectedPositions\", \"semiIndices\", \"cluster_id\")\n            # Adding study-locus id:\n            .withColumn(\n                \"studyLocusId\",\n                
StudyLocus.assign_study_locus_id(\"studyId\", \"variantId\"),\n            )\n            # Initialize QC column as array of strings:\n            .withColumn(\n                \"qualityControls\", f.array().cast(t.ArrayType(t.StringType()))\n            )\n        ),\n        _schema=StudyLocus.get_schema(),\n    )\n
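
For orientation, a brief usage sketch follows (not part of the generated documentation). It assumes an already-loaded SummaryStatistics instance named summary_stats; the window size is an illustrative choice, not a recommendation.

# Hypothetical usage sketch: distance-based clumping of a SummaryStatistics dataset.
# `summary_stats` is assumed to be an existing SummaryStatistics instance.
clumped = WindowBasedClumping.clump(
    summary_stats,
    window_length=500_000,      # 500 kb clumping window (illustrative value)
    p_value_significance=5e-8,  # genome-wide significance threshold (default)
)
clumped.df.show(5)  # the resulting StudyLocus wraps a Spark DataFrame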
"},{"location":"python_api/method/window_based_clumping/#otg.method.window_based_clumping.WindowBasedClumping.clump_with_locus","title":"clump_with_locus(summary_stats: SummaryStatistics, window_length: int, p_value_significance: float = 5e-08, p_value_baseline: float = 0.05, locus_window_length: int | None = None) -> StudyLocus classmethod","text":"

Clump significant associations while collecting locus around them.

Parameters:

Name Type Description Default summary_stats SummaryStatistics

Input summary statistics dataset

required window_length int

Window size in bp, used for distance-based clumping.

required p_value_significance float

GWAS significance threshold used to filter peaks. Defaults to 5e-8.

5e-08 p_value_baseline float

Baseline significance threshold; SNPs that do not reach this p-value threshold are dropped. Defaults to 0.05.

0.05 locus_window_length int | None

The distance for collecting locus around the semi indices. Defaults to None.

None

Returns:

Name Type Description StudyLocus StudyLocus

StudyLocus after clumping with information about the locus

Source code in src/otg/method/window_based_clumping.py
@classmethod\ndef clump_with_locus(\n    cls: type[WindowBasedClumping],\n    summary_stats: SummaryStatistics,\n    window_length: int,\n    p_value_significance: float = 5e-8,\n    p_value_baseline: float = 0.05,\n    locus_window_length: int | None = None,\n) -> StudyLocus:\n    \"\"\"Clump significant associations while collecting locus around them.\n\n    Args:\n        summary_stats (SummaryStatistics): Input summary statistics dataset\n        window_length (int): Window size in  bp, used for distance based clumping.\n        p_value_significance (float): GWAS significance threshold used to filter peaks. Defaults to 5e-8.\n        p_value_baseline (float): Least significant threshold. Below this, all snps are dropped. Defaults to 0.05.\n        locus_window_length (int | None): The distance for collecting locus around the semi indices. Defaults to None.\n\n    Returns:\n        StudyLocus: StudyLocus after clumping with information about the `locus`\n    \"\"\"\n    # If no locus window provided, using the same value:\n    if locus_window_length is None:\n        locus_window_length = window_length\n\n    # Run distance based clumping on the summary stats:\n    clumped_dataframe = WindowBasedClumping.clump(\n        summary_stats,\n        window_length=window_length,\n        p_value_significance=p_value_significance,\n    ).df.alias(\"clumped\")\n\n    # Get list of columns from clumped dataset for further propagation:\n    clumped_columns = clumped_dataframe.columns\n\n    # Dropping variants not meeting the baseline criteria:\n    sumstats_baseline = summary_stats.pvalue_filter(p_value_baseline).df\n\n    # Renaming columns:\n    sumstats_baseline_renamed = sumstats_baseline.selectExpr(\n        *[f\"{col} as tag_{col}\" for col in sumstats_baseline.columns]\n    ).alias(\"sumstat\")\n\n    study_locus_df = (\n        sumstats_baseline_renamed\n        # Joining the two datasets together:\n        .join(\n            f.broadcast(clumped_dataframe),\n            on=[\n                (f.col(\"sumstat.tag_studyId\") == f.col(\"clumped.studyId\"))\n                & (f.col(\"sumstat.tag_chromosome\") == f.col(\"clumped.chromosome\"))\n                & (\n                    f.col(\"sumstat.tag_position\")\n                    >= (f.col(\"clumped.position\") - locus_window_length)\n                )\n                & (\n                    f.col(\"sumstat.tag_position\")\n                    <= (f.col(\"clumped.position\") + locus_window_length)\n                )\n            ],\n            how=\"right\",\n        )\n        .withColumn(\n            \"locus\",\n            f.struct(\n                f.col(\"tag_variantId\").alias(\"variantId\"),\n                f.col(\"tag_beta\").alias(\"beta\"),\n                f.col(\"tag_pValueMantissa\").alias(\"pValueMantissa\"),\n                f.col(\"tag_pValueExponent\").alias(\"pValueExponent\"),\n                f.col(\"tag_standardError\").alias(\"standardError\"),\n            ),\n        )\n        .groupby(\"studyLocusId\")\n        .agg(\n            *[\n                f.first(col).alias(col)\n                for col in clumped_columns\n                if col != \"studyLocusId\"\n            ],\n            f.collect_list(f.col(\"locus\")).alias(\"locus\"),\n        )\n    )\n\n    return StudyLocus(\n        _df=study_locus_df,\n        _schema=StudyLocus.get_schema(),\n    )\n
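
A similarly hedged sketch for this variant, again assuming the summary_stats object from the previous example; the locus window below is illustrative only:

# Hypothetical usage sketch: clumping while also collecting the surrounding locus.
study_locus = WindowBasedClumping.clump_with_locus(
    summary_stats,
    window_length=500_000,        # clumping window (illustrative)
    p_value_significance=5e-8,    # significance threshold for peaks (default)
    p_value_baseline=0.05,        # variants less significant than this are dropped (default)
    locus_window_length=250_000,  # distance for collecting the locus (illustrative)
)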
"},{"location":"python_api/step/_step/","title":"Step","text":"

TBC

"},{"location":"python_api/step/colocalisation/","title":"Colocalisation","text":""},{"location":"python_api/step/colocalisation/#otg.colocalisation.ColocalisationStep","title":"otg.colocalisation.ColocalisationStep dataclass","text":"

Colocalisation step.

This workflow runs colocalization analyses that assess the degree to which independent signals of the association share the same causal variant in a region of the genome, typically limited by linkage disequilibrium (LD).

Attributes:

Name Type Description study_locus_path str

Input Study-locus path.

coloc_path str

Output Colocalisation path.

priorc1 float

Prior on variant being causal for trait 1.

priorc2 float

Prior on variant being causal for trait 2.

priorc12 float

Prior on variant being causal for traits 1 and 2.

Source code in src/otg/colocalisation.py
@dataclass\nclass ColocalisationStep:\n    \"\"\"Colocalisation step.\n\n    This workflow runs colocalization analyses that assess the degree to which independent signals of the association share the same causal variant in a region of the genome, typically limited by linkage disequilibrium (LD).\n\n    Attributes:\n        study_locus_path (DictConfig): Input Study-locus path.\n        coloc_path (DictConfig): Output Colocalisation path.\n        priorc1 (float): Prior on variant being causal for trait 1.\n        priorc2 (float): Prior on variant being causal for trait 2.\n        priorc12 (float): Prior on variant being causal for traits 1 and 2.\n    \"\"\"\n\n    session: Session = Session()\n\n    study_locus_path: str = MISSING\n    study_index_path: str = MISSING\n    coloc_path: str = MISSING\n    priorc1: float = 1e-4\n    priorc2: float = 1e-4\n    priorc12: float = 1e-5\n\n    def __post_init__(self: ColocalisationStep) -> None:\n        \"\"\"Run step.\"\"\"\n        # Study-locus information\n        sl = StudyLocus.from_parquet(self.session, self.study_locus_path)\n        si = StudyIndex.from_parquet(self.session, self.study_index_path)\n\n        # Study-locus overlaps for 95% credible sets\n        sl_overlaps = sl.credible_set(CredibleInterval.IS95).overlaps(si)\n\n        coloc_results = Coloc.colocalise(\n            sl_overlaps, self.priorc1, self.priorc2, self.priorc12\n        )\n        ecaviar_results = ECaviar.colocalise(sl_overlaps)\n\n        coloc_results.df.unionByName(ecaviar_results.df, allowMissingColumns=True)\n\n        coloc_results.df.write.mode(self.session.write_mode).parquet(self.coloc_path)\n
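
As an illustration only (the paths below are placeholders, and real runs are normally driven by the step configuration described in the contributing guidelines), the dataclass interface shown above can be exercised directly; instantiation triggers __post_init__, which runs the step:

# Hypothetical sketch: running the colocalisation step directly.
# All paths are placeholders and must point at real datasets in practice.
ColocalisationStep(
    session=Session(),
    study_locus_path="gs://my-bucket/study_locus",   # placeholder input
    study_index_path="gs://my-bucket/study_index",   # placeholder input
    coloc_path="gs://my-bucket/colocalisation",      # placeholder output
)
# __post_init__ runs the colocalisation analysis and writes the results to coloc_path as parquet.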
"},{"location":"python_api/step/finngen/","title":"FinnGen","text":""},{"location":"python_api/step/finngen/#otg.finngen.FinnGenStep","title":"otg.finngen.FinnGenStep dataclass","text":"

FinnGen ingestion step.

Attributes:

Name Type Description finngen_phenotype_table_url str

FinnGen API for fetching the list of studies.

finngen_release_prefix str

Release prefix pattern.

finngen_sumstat_url_prefix str

URL prefix for summary statistics location.

finngen_sumstat_url_suffix str

URL suffix for summary statistics location.

finngen_study_index_out str

Output path for the FinnGen study index dataset.

finngen_summary_stats_out str

Output path for the FinnGen summary statistics.

Source code in src/otg/finngen.py
@dataclass\nclass FinnGenStep:\n    \"\"\"FinnGen ingestion step.\n\n    Attributes:\n        finngen_phenotype_table_url (str): FinnGen API for fetching the list of studies.\n        finngen_release_prefix (str): Release prefix pattern.\n        finngen_sumstat_url_prefix (str): URL prefix for summary statistics location.\n        finngen_sumstat_url_suffix (str): URL prefix suffix for summary statistics location.\n        finngen_study_index_out (str): Output path for the FinnGen study index dataset.\n        finngen_summary_stats_out (str): Output path for the FinnGen summary statistics.\n    \"\"\"\n\n    session: Session = Session()\n\n    finngen_phenotype_table_url: str = MISSING\n    finngen_release_prefix: str = MISSING\n    finngen_sumstat_url_prefix: str = MISSING\n    finngen_sumstat_url_suffix: str = MISSING\n    finngen_study_index_out: str = MISSING\n    finngen_summary_stats_out: str = MISSING\n\n    def __post_init__(self: FinnGenStep) -> None:\n        \"\"\"Run step.\"\"\"\n        # Read the JSON data from the URL.\n        json_data = urlopen(self.finngen_phenotype_table_url).read().decode(\"utf-8\")\n        rdd = self.session.spark.sparkContext.parallelize([json_data])\n        df = self.session.spark.read.json(rdd)\n\n        # Parse the study index data.\n        finngen_studies = FinnGenStudyIndex.from_source(\n            df,\n            self.finngen_release_prefix,\n            self.finngen_sumstat_url_prefix,\n            self.finngen_sumstat_url_suffix,\n        )\n\n        # Write the study index output.\n        finngen_studies.df.write.mode(self.session.write_mode).parquet(\n            self.finngen_study_index_out\n        )\n\n        # Prepare list of files for ingestion.\n        input_filenames = [\n            row.summarystatsLocation for row in finngen_studies.collect()\n        ]\n        summary_stats_df = self.session.spark.read.option(\"delimiter\", \"\\t\").csv(\n            input_filenames, header=True\n        )\n\n        # Specify data processing instructions.\n        summary_stats_df = FinnGenSummaryStats.from_finngen_harmonized_summary_stats(\n            summary_stats_df\n        ).df\n\n        # Sort and partition for output.\n        summary_stats_df.sortWithinPartitions(\"position\").write.partitionBy(\n            \"studyId\", \"chromosome\"\n        ).mode(self.session.write_mode).parquet(self.finngen_summary_stats_out)\n
"},{"location":"python_api/step/gene_index/","title":"Gene Index","text":""},{"location":"python_api/step/gene_index/#otg.gene_index.GeneIndexStep","title":"otg.gene_index.GeneIndexStep dataclass","text":"

Gene index step.

This step generates a gene index dataset from an Open Targets Platform target dataset.

Attributes:

Name Type Description target_path str

Open Targets Platform target dataset path.

gene_index_path str

Output gene index path.

Source code in src/otg/gene_index.py
@dataclass\nclass GeneIndexStep:\n    \"\"\"Gene index step.\n\n    This step generates a gene index dataset from an Open Targets Platform target dataset.\n\n    Attributes:\n        target_path (str): Open targets Platform target dataset path.\n        gene_index_path (str): Output gene index path.\n    \"\"\"\n\n    session: Session = Session()\n\n    target_path: str = MISSING\n    gene_index_path: str = MISSING\n\n    def __post_init__(self: GeneIndexStep) -> None:\n        \"\"\"Run step.\"\"\"\n        # Extract\n        platform_target = self.session.spark.read.parquet(self.target_path)\n        # Transform\n        gene_index = OpenTargetsTarget.as_gene_index(platform_target)\n        # Load\n        gene_index.df.write.mode(self.session.write_mode).parquet(self.gene_index_path)\n
"},{"location":"python_api/step/gwas_catalog/","title":"GWAS Catalog","text":""},{"location":"python_api/step/gwas_catalog/#otg.gwas_catalog.GWASCatalogStep","title":"otg.gwas_catalog.GWASCatalogStep dataclass","text":"

GWAS Catalog ingestion step to extract GWASCatalog Study and StudyLocus tables.

Attributes:

Name Type Description catalog_studies_file str

Raw GWAS catalog studies file.

catalog_ancestry_file str

Ancestry annotations file from GWAS Catalog.

catalog_sumstats_lut str

GWAS Catalog summary statistics lookup table.

catalog_associations_file str

Raw GWAS catalog associations file.

variant_annotation_path str

Input variant annotation path.

ld_index_path str

Input LD index path.

min_r2 float

Minimum r2 threshold for variants within a window.

catalog_studies_out str

Output GWAS catalog studies path.

catalog_associations_out str

Output GWAS catalog associations path.

Source code in src/otg/gwas_catalog.py
@dataclass\nclass GWASCatalogStep:\n    \"\"\"GWAS Catalog ingestion step to extract GWASCatalog Study and StudyLocus tables.\n\n    Attributes:\n        catalog_studies_file (str): Raw GWAS catalog studies file.\n        catalog_ancestry_file (str): Ancestry annotations file from GWAS Catalog.\n        catalog_sumstats_lut (str): GWAS Catalog summary statistics lookup table.\n        catalog_associations_file (str): Raw GWAS catalog associations file.\n        variant_annotation_path (str): Input variant annotation path.\n        ld_populations (list): List of populations to include.\n        min_r2 (float): Minimum r2 to consider when considering variants within a window.\n        catalog_studies_out (str): Output GWAS catalog studies path.\n        catalog_associations_out (str): Output GWAS catalog associations path.\n    \"\"\"\n\n    session: Session = Session()\n\n    catalog_studies_file: str = MISSING\n    catalog_ancestry_file: str = MISSING\n    catalog_sumstats_lut: str = MISSING\n    catalog_associations_file: str = MISSING\n    variant_annotation_path: str = MISSING\n    ld_index_path: str = MISSING\n    min_r2: float = 0.5\n    catalog_studies_out: str = MISSING\n    catalog_associations_out: str = MISSING\n\n    def __post_init__(self: GWASCatalogStep) -> None:\n        \"\"\"Run step.\"\"\"\n        hl.init(sc=self.session.spark.sparkContext, log=\"/dev/null\")\n        # All inputs:\n        # Variant annotation dataset\n        va = VariantAnnotation.from_parquet(self.session, self.variant_annotation_path)\n        # GWAS Catalog raw study information\n        catalog_studies = self.session.spark.read.csv(\n            self.catalog_studies_file, sep=\"\\t\", header=True\n        )\n        # GWAS Catalog ancestry information\n        ancestry_lut = self.session.spark.read.csv(\n            self.catalog_ancestry_file, sep=\"\\t\", header=True\n        )\n        # GWAS Catalog summary statistics information\n        sumstats_lut = self.session.spark.read.csv(\n            self.catalog_sumstats_lut, sep=\"\\t\", header=False\n        )\n        # GWAS Catalog raw association information\n        catalog_associations = self.session.spark.read.csv(\n            self.catalog_associations_file, sep=\"\\t\", header=True\n        )\n        # LD index dataset\n        ld_index = LDIndex.from_parquet(self.session, self.ld_index_path)\n\n        # Transform:\n        # GWAS Catalog study index and study-locus splitted\n        study_index, study_locus = GWASCatalogStudySplitter.split(\n            GWASCatalogStudyIndex.from_source(\n                catalog_studies, ancestry_lut, sumstats_lut\n            ),\n            GWASCatalogAssociations.from_source(catalog_associations, va),\n        )\n\n        # Annotate LD information and clump associations dataset\n        study_locus_ld = LDAnnotator.ld_annotate(study_locus, study_index, ld_index)\n\n        # Fine-mapping LD-clumped study-locus using PICS\n        finemapped_study_locus = PICS.finemap(study_locus_ld).annotate_credible_sets()\n\n        # Write:\n        study_index.df.write.mode(self.session.write_mode).parquet(\n            self.catalog_studies_out\n        )\n        finemapped_study_locus.df.write.mode(self.session.write_mode).parquet(\n            self.catalog_associations_out\n        )\n
"},{"location":"python_api/step/gwas_catalog_sumstat_preprocess/","title":"GWAS Catalog sumstat preprocess","text":""},{"location":"python_api/step/gwas_catalog_sumstat_preprocess/#otg.gwas_catalog_sumstat_preprocess.GWASCatalogSumstatsPreprocessStep","title":"otg.gwas_catalog_sumstat_preprocess.GWASCatalogSumstatsPreprocessStep dataclass","text":"

Step to preprocess GWAS Catalog harmonised summary stats.

Attributes:

Name Type Description raw_sumstats_path str

Input raw GWAS Catalog summary statistics path.

out_sumstats_path str

Output GWAS Catalog summary statistics path.

study_id str

GWAS Catalog study identifier.

Source code in src/otg/gwas_catalog_sumstat_preprocess.py
@dataclass\nclass GWASCatalogSumstatsPreprocessStep:\n    \"\"\"Step to preprocess GWAS Catalog harmonised summary stats.\n\n    Attributes:\n        raw_sumstats_path (str): Input raw GWAS Catalog summary statistics path.\n        out_sumstats_path (str): Output GWAS Catalog summary statistics path.\n        study_id (str): GWAS Catalog study identifier.\n    \"\"\"\n\n    session: Session = Session()\n\n    raw_sumstats_path: str = MISSING\n    out_sumstats_path: str = MISSING\n    study_id: str = MISSING\n\n    def __post_init__(self: GWASCatalogSumstatsPreprocessStep) -> None:\n        \"\"\"Run step.\"\"\"\n        # Extract\n        self.session.logger.info(self.raw_sumstats_path)\n        self.session.logger.info(self.out_sumstats_path)\n        self.session.logger.info(self.study_id)\n\n        # Reading dataset:\n        raw_dataset = self.session.spark.read.csv(\n            self.raw_sumstats_path, header=True, sep=\"\\t\"\n        )\n        self.session.logger.info(\n            f\"Number of single point associations: {raw_dataset.count()}\"\n        )\n\n        # Processing dataset:\n        GWASCatalogSummaryStatistics.from_gwas_harmonized_summary_stats(\n            raw_dataset, self.study_id\n        ).df.write.mode(self.session.write_mode).parquet(self.out_sumstats_path)\n        self.session.logger.info(\"Processing dataset successfully completed.\")\n
"},{"location":"python_api/step/ld_index/","title":"LD Index","text":""},{"location":"python_api/step/ld_index/#otg.ld_index.LDIndexStep","title":"otg.ld_index.LDIndexStep dataclass","text":"

LD index step.

This step is resource intensive

Suggested params: high memory machine, 5TB of boot disk, no SSDs.

Attributes:

Name Type Description ld_matrix_template str

Template path for LD matrix from gnomAD.

ld_index_raw_template str

Template path for the variant indices correspondence in the LD matrix from gnomAD.

min_r2 float

Minimum r2 threshold for variants within a window.

grch37_to_grch38_chain_path str

Path to GRCh37 to GRCh38 chain file.

ld_populations List[str]

List of population-specific LD matrices to process.

ld_index_out str

Output LD index path.

Source code in src/otg/ld_index.py
@dataclass\nclass LDIndexStep:\n    \"\"\"LD index step.\n\n    !!! warning \"This step is resource intensive\"\n        Suggested params: high memory machine, 5TB of boot disk, no SSDs.\n\n    Attributes:\n        ld_matrix_template (str): Template path for LD matrix from gnomAD.\n        ld_index_raw_template (str): Template path for the variant indices correspondance in the LD Matrix from gnomAD.\n        min_r2 (float): Minimum r2 to consider when considering variants within a window.\n        grch37_to_grch38_chain_path (str): Path to GRCh37 to GRCh38 chain file.\n        ld_populations (List[str]): List of population-specific LD matrices to process.\n        ld_index_out (str): Output LD index path.\n    \"\"\"\n\n    session: Session = Session()\n\n    ld_matrix_template: str = \"gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.adj.ld.bm\"\n    ld_index_raw_template: str = \"gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.ld.variant_indices.ht\"\n    min_r2: float = 0.5\n    grch37_to_grch38_chain_path: str = (\n        \"gs://hail-common/references/grch37_to_grch38.over.chain.gz\"\n    )\n    ld_populations: List[str] = field(\n        default_factory=lambda: [\n            \"afr\",  # African-American\n            \"amr\",  # American Admixed/Latino\n            \"asj\",  # Ashkenazi Jewish\n            \"eas\",  # East Asian\n            \"fin\",  # Finnish\n            \"nfe\",  # Non-Finnish European\n            \"nwe\",  # Northwestern European\n            \"seu\",  # Southeastern European\n        ]\n    )\n    ld_index_out: str = MISSING\n\n    def __post_init__(self: LDIndexStep) -> None:\n        \"\"\"Run step.\"\"\"\n        hl.init(sc=self.session.spark.sparkContext, log=\"/dev/null\")\n        ld_index = GnomADLDMatrix.as_ld_index(\n            self.ld_populations,\n            self.ld_matrix_template,\n            self.ld_index_raw_template,\n            self.grch37_to_grch38_chain_path,\n            self.min_r2,\n        )\n        self.session.logger.info(f\"Writing LD index to: {self.ld_index_out}\")\n        (\n            ld_index.df.write.partitionBy(\"chromosome\")\n            .mode(self.session.write_mode)\n            .parquet(f\"{self.ld_index_out}\")\n        )\n
"},{"location":"python_api/step/ukbiobank/","title":"UK Biobank","text":""},{"location":"python_api/step/ukbiobank/#otg.ukbiobank.UKBiobankStep","title":"otg.ukbiobank.UKBiobankStep dataclass","text":"

UKBiobank study table ingestion step.

Attributes:

Name Type Description ukbiobank_manifest str

UKBiobank manifest of studies.

ukbiobank_study_index_out str

Output path for the UKBiobank study index dataset.

Source code in src/otg/ukbiobank.py
@dataclass\nclass UKBiobankStep:\n    \"\"\"UKBiobank study table ingestion step.\n\n    Attributes:\n        ukbiobank_manifest (str): UKBiobank manifest of studies.\n        ukbiobank_study_index_out (str): Output path for the UKBiobank study index dataset.\n    \"\"\"\n\n    session: Session = Session()\n\n    ukbiobank_manifest: str = MISSING\n    ukbiobank_study_index_out: str = MISSING\n\n    def __post_init__(self: UKBiobankStep) -> None:\n        \"\"\"Run step.\"\"\"\n        # Read in the UKBiobank manifest tsv file.\n        df = self.session.spark.read.csv(\n            self.ukbiobank_manifest, sep=\"\\t\", header=True, inferSchema=True\n        )\n\n        # Parse the study index data.\n        ukbiobank_study_index = UKBiobankStudyIndex.from_source(df)\n\n        # Write the output.\n        ukbiobank_study_index.df.write.mode(self.session.write_mode).parquet(\n            self.ukbiobank_study_index_out\n        )\n
"},{"location":"python_api/step/variant_annotation_step/","title":"Variant Annotation","text":""},{"location":"python_api/step/variant_annotation_step/#otg.variant_annotation.VariantAnnotationStep","title":"otg.variant_annotation.VariantAnnotationStep dataclass","text":"

Variant annotation step.

The variant annotation step produces a dataset of the type VariantAnnotation derived from gnomAD's gnomad.genomes.vX.X.X.sites.ht Hail table. This dataset is used to validate variants and as a source of annotation.

Attributes:

Name Type Description gnomad_genomes str

Path to gnomAD genomes hail table.

chain_38_to_37 str

Path to GRCh38 to GRCh37 chain file.

variant_annotation_path str

Output variant annotation path.

populations List[str]

List of populations to include.

Source code in src/otg/variant_annotation.py
@dataclass\nclass VariantAnnotationStep:\n    \"\"\"Variant annotation step.\n\n    Variant annotation step produces a dataset of the type `VariantAnnotation` derived from gnomADs `gnomad.genomes.vX.X.X.sites.ht` Hail's table. This dataset is used to validate variants and as a source of annotation.\n\n    Attributes:\n        gnomad_genomes (str): Path to gnomAD genomes hail table.\n        chain_38_to_37 (str): Path to GRCh38 to GRCh37 chain file.\n        variant_annotation_path (str): Output variant annotation path.\n        populations (List[str]): List of populations to include.\n    \"\"\"\n\n    session: Session = Session()\n\n    gnomad_genomes: str = MISSING\n    chain_38_to_37: str = MISSING\n    variant_annotation_path: str = MISSING\n    populations: List[str] = field(\n        default_factory=lambda: [\n            \"afr\",  # African-American\n            \"amr\",  # American Admixed/Latino\n            \"ami\",  # Amish ancestry\n            \"asj\",  # Ashkenazi Jewish\n            \"eas\",  # East Asian\n            \"fin\",  # Finnish\n            \"nfe\",  # Non-Finnish European\n            \"mid\",  # Middle Eastern\n            \"sas\",  # South Asian\n            \"oth\",  # Other\n        ]\n    )\n\n    def __post_init__(self: VariantAnnotationStep) -> None:\n        \"\"\"Run step.\"\"\"\n        # Initialise hail session.\n        hl.init(sc=self.session.spark.sparkContext, log=\"/dev/null\")\n        # Run variant annotation.\n        variant_annotation = GnomADVariants.as_variant_annotation(\n            self.gnomad_genomes,\n            self.chain_38_to_37,\n            self.populations,\n        )\n        # Write data partitioned by chromosome and position.\n        (\n            variant_annotation.df.repartition(400, \"chromosome\")\n            .sortWithinPartitions(\"chromosome\", \"position\")\n            .write.partitionBy(\"chromosome\")\n            .mode(self.session.write_mode)\n            .parquet(self.variant_annotation_path)\n        )\n
"},{"location":"python_api/step/variant_index_step/","title":"Variant Index","text":""},{"location":"python_api/step/variant_index_step/#otg.variant_index.VariantIndexStep","title":"otg.variant_index.VariantIndexStep dataclass","text":"

Run the variant index step to restrict variants to those present in study-locus sets.

Using a VariantAnnotation dataset as a reference, this step creates and writes a dataset of the type VariantIndex that includes only variants with disease-association data, carrying a reduced set of annotations.

Attributes:

Name Type Description variant_annotation_path str

Input variant annotation path.

study_locus_path str

Input study-locus path.

variant_index_path str

Output variant index path.

Source code in src/otg/variant_index.py
@dataclass\nclass VariantIndexStep:\n    \"\"\"Run variant index step to only variants in study-locus sets.\n\n    Using a `VariantAnnotation` dataset as a reference, this step creates and writes a dataset of the type `VariantIndex` that includes only variants that have disease-association data with a reduced set of annotations.\n\n    Attributes:\n        variant_annotation_path (str): Input variant annotation path.\n        study_locus_path (str): Input study-locus path.\n        variant_index_path (str): Output variant index path.\n    \"\"\"\n\n    session: Session = Session()\n\n    variant_annotation_path: str = MISSING\n    study_locus_path: str = MISSING\n    variant_index_path: str = MISSING\n\n    def __post_init__(self: VariantIndexStep) -> None:\n        \"\"\"Run step.\"\"\"\n        # Extract\n        va = VariantAnnotation.from_parquet(self.session, self.variant_annotation_path)\n        study_locus = StudyLocus.from_parquet(\n            self.session, self.study_locus_path, recursiveFileLookup=True\n        )\n\n        # Transform\n        vi = VariantIndex.from_variant_annotation(va, study_locus)\n\n        # Load\n        self.session.logger.info(f\"Writing variant index to: {self.variant_index_path}\")\n        (\n            vi.df.write.partitionBy(\"chromosome\")\n            .mode(self.session.write_mode)\n            .parquet(self.variant_index_path)\n        )\n
"},{"location":"python_api/step/variant_to_gene_step/","title":"Variant-to-gene","text":""},{"location":"python_api/step/variant_to_gene_step/#otg.v2g.V2GStep","title":"otg.v2g.V2GStep dataclass","text":"

Variant-to-gene (V2G) step.

This step aims to generate a dataset that contains multiple pieces of evidence supporting the functional association of specific variants with genes. Some of the evidence types include:

  1. Chromatin interaction experiments, e.g. Promoter Capture Hi-C (PCHi-C).
  2. In silico functional predictions, e.g. Variant Effect Predictor (VEP) from Ensembl.
  3. Distance between the variant and each gene's canonical transcription start site (TSS).

Attributes:

Name Type Description variant_index_path str

Input variant index path.

variant_annotation_path str

Input variant annotation path.

gene_index_path str

Input gene index path.

vep_consequences_path str

Input VEP consequences path.

liftover_chain_file_path str

Path to GRCh37 to GRCh38 chain file.

liftover_max_length_difference int

Maximum length difference for liftover.

max_distance int

Maximum variant-to-gene TSS distance (in bp) to consider.

approved_biotypes list[str]

List of approved biotypes.

intervals dict

Dictionary of interval sources.

v2g_path str

Output V2G path.

Source code in src/otg/v2g.py
@dataclass\nclass V2GStep:\n    \"\"\"Variant-to-gene (V2G) step.\n\n    This step aims to generate a dataset that contains multiple pieces of evidence supporting the functional association of specific variants with genes. Some of the evidence types include:\n\n    1. Chromatin interaction experiments, e.g. Promoter Capture Hi-C (PCHi-C).\n    2. In silico functional predictions, e.g. Variant Effect Predictor (VEP) from Ensembl.\n    3. Distance between the variant and each gene's canonical transcription start site (TSS).\n\n    Attributes:\n        variant_index_path (str): Input variant index path.\n        variant_annotation_path (str): Input variant annotation path.\n        gene_index_path (str): Input gene index path.\n        vep_consequences_path (str): Input VEP consequences path.\n        liftover_chain_file_path (str): Path to GRCh37 to GRCh38 chain file.\n        liftover_max_length_difference: Maximum length difference for liftover.\n        max_distance (int): Maximum distance to consider.\n        approved_biotypes (list[str]): List of approved biotypes.\n        intervals (dict): Dictionary of interval sources.\n        v2g_path (str): Output V2G path.\n    \"\"\"\n\n    session: Session = Session()\n\n    variant_index_path: str = MISSING\n    variant_annotation_path: str = MISSING\n    gene_index_path: str = MISSING\n    vep_consequences_path: str = MISSING\n    liftover_chain_file_path: str = MISSING\n    liftover_max_length_difference: int = 100\n    max_distance: int = 500_000\n    approved_biotypes: List[str] = field(\n        default_factory=lambda: [\n            \"protein_coding\",\n            \"3prime_overlapping_ncRNA\",\n            \"antisense\",\n            \"bidirectional_promoter_lncRNA\",\n            \"IG_C_gene\",\n            \"IG_D_gene\",\n            \"IG_J_gene\",\n            \"IG_V_gene\",\n            \"lincRNA\",\n            \"macro_lncRNA\",\n            \"non_coding\",\n            \"sense_intronic\",\n            \"sense_overlapping\",\n        ]\n    )\n    intervals: Dict[str, str] = field(default_factory=dict)\n    v2g_path: str = MISSING\n\n    def __post_init__(self: V2GStep) -> None:\n        \"\"\"Run step.\"\"\"\n        # Read\n        gene_index = GeneIndex.from_parquet(self.session, self.gene_index_path)\n        vi = VariantIndex.from_parquet(self.session, self.variant_index_path).persist()\n        va = VariantAnnotation.from_parquet(self.session, self.variant_annotation_path)\n        vep_consequences = self.session.spark.read.csv(\n            self.vep_consequences_path, sep=\"\\t\", header=True\n        ).select(\n            f.element_at(f.split(\"Accession\", r\"/\"), -1).alias(\n                \"variantFunctionalConsequenceId\"\n            ),\n            f.col(\"Term\").alias(\"label\"),\n            f.col(\"v2g_score\").cast(\"double\").alias(\"score\"),\n        )\n\n        # Transform\n        lift = LiftOverSpark(\n            # lift over variants to hg38\n            self.liftover_chain_file_path,\n            self.liftover_max_length_difference,\n        )\n        gene_index_filtered = gene_index.filter_by_biotypes(\n            # Filter gene index by approved biotypes to define V2G gene universe\n            list(self.approved_biotypes)\n        )\n        va_slimmed = va.filter_by_variant_df(\n            # Variant annotation reduced to the variant index to define V2G variant universe\n            vi.df\n        ).persist()\n        intervals = Intervals(\n            _df=reduce(\n                lambda x, y: 
x.unionByName(y, allowMissingColumns=True),\n                # create interval instances by parsing each source\n                [\n                    Intervals.from_source(\n                        self.session.spark, source_name, source_path, gene_index, lift\n                    ).df\n                    for source_name, source_path in self.intervals.items()\n                ],\n            ),\n            _schema=Intervals.get_schema(),\n        )\n        v2g_datasets = [\n            va_slimmed.get_distance_to_tss(gene_index_filtered, self.max_distance),\n            va_slimmed.get_most_severe_vep_v2g(vep_consequences, gene_index_filtered),\n            va_slimmed.get_polyphen_v2g(gene_index_filtered),\n            va_slimmed.get_sift_v2g(gene_index_filtered),\n            va_slimmed.get_plof_v2g(gene_index_filtered),\n            intervals.v2g(vi),\n        ]\n        v2g = V2G(\n            _df=reduce(\n                lambda x, y: x.unionByName(y, allowMissingColumns=True),\n                [dataset.df for dataset in v2g_datasets],\n            ).repartition(\"chromosome\"),\n            _schema=V2G.get_schema(),\n        )\n\n        # Load\n        (\n            v2g.df.write.partitionBy(\"chromosome\")\n            .mode(self.session.write_mode)\n            .parquet(self.v2g_path)\n        )\n
"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Open Targets Genetics","text":"

Ingestion and analysis of genetic and functional genomic data for the identification and prioritisation of drug targets.

This project is still in an experimental phase. Please refer to the roadmap section for more information.

For all development information, including running the code, troubleshooting, or contributing, see the development section.

"},{"location":"installation/","title":"Installation","text":"

TBC

"},{"location":"roadmap/","title":"Roadmap","text":"

The Open Targets core team is working on refactoring Open Targets Genetics, aiming to:

  • Re-focus the product around Target ID
  • Create a gold standard toolkit for post-GWAS analysis
  • Enable faster and more robust addition of new datasets and datatypes
  • Reduce computational and financial cost

See here for a list of open issues for this project.

Schematic diagram representing the drafted process:

"},{"location":"usage/","title":"How-to","text":"

TBC

"},{"location":"development/_development/","title":"Development","text":"

This section contains various technical information on how to develop and run the code.

"},{"location":"development/airflow/","title":"Running Airflow workflows","text":"

Airflow code is located in src/airflow. Make sure to execute all of the instructions from that directory, unless stated otherwise.

"},{"location":"development/airflow/#set-up-docker","title":"Set up Docker","text":"

We will be running a local Airflow setup using Docker Compose. First, make sure it is installed (this and subsequent commands are tested on Ubuntu):

sudo apt install docker-compose\n

Next, verify that you can run Docker. This should say \"Hello from Docker\":

docker run hello-world\n

If the command above raises a permission error, fix it and reboot:

sudo usermod -a -G docker $USER\nnewgrp docker\n
"},{"location":"development/airflow/#set-up-airflow","title":"Set up Airflow","text":"

This section is adapted from the instructions at https://airflow.apache.org/docs/apache-airflow/stable/tutorial/pipeline.html. When you run the commands, make sure your current working directory is src/airflow.

# Download the latest docker-compose.yaml file.\ncurl -sLfO https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml\n\n# Make expected directories.\nmkdir -p ./config ./dags ./logs ./plugins\n\n# Construct the modified Docker image with additional PIP dependencies.\ndocker build . --tag opentargets-airflow:2.7.1\n\n# Set environment variables.\ncat << EOF > .env\nAIRFLOW_UID=$(id -u)\nAIRFLOW_IMAGE_NAME=opentargets-airflow:2.7.1\nEOF\n

Now modify docker-compose.yaml and add the following to the x-airflow-common → environment section:

GOOGLE_APPLICATION_CREDENTIALS: '/opt/airflow/config/application_default_credentials.json'\nAIRFLOW__CELERY__WORKER_CONCURRENCY: 32\nAIRFLOW__CORE__PARALLELISM: 32\nAIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG: 32\nAIRFLOW__SCHEDULER__MAX_TIS_PER_QUERY: 16\nAIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG: 1\n

"},{"location":"development/airflow/#start-airflow","title":"Start Airflow","text":"
docker-compose up\n

Airflow UI will now be available at http://localhost:8080/home. Default username and password are both airflow.

"},{"location":"development/airflow/#configure-google-cloud-access","title":"Configure Google Cloud access","text":"

In order to access Google Cloud and work with Dataproc, Airflow needs to be configured. First, obtain Google default application credentials by running this command and following the instructions:

gcloud auth application-default login\n

Next, copy the file into the config/ subdirectory which we created above:

cp ~/.config/gcloud/application_default_credentials.json config/\n

Now open the Airflow UI and:

  • Navigate to Admin → Connections.
  • Click on \"Add new record\".
  • Set \"Connection type\" to `Google Cloud``.
  • Set \"Connection ID\" to google_cloud_default.
  • Set \"Credential Configuration File\" to /opt/airflow/config/application_default_credentials.json.
  • Click on \"Save\".
"},{"location":"development/airflow/#run-a-workflow","title":"Run a workflow","text":"

Workflows, which must be placed under the dags/ directory, will appear in the \"DAGs\" section of the UI, which is also the main page. They can be triggered manually by opening a workflow and clicking on the \"Play\" button in the upper right corner.

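For orientation, a workflow file is simply a Python module that defines a DAG. A minimal, purely hypothetical sketch (not part of this repository) placed under dags/ would look roughly like this:

# dags/example_dag.py - hypothetical minimal workflow, for illustration only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_dag",
    start_date=datetime(2023, 1, 1),
    schedule=None,  # no schedule; trigger manually from the UI
    catchup=False,
) as dag:
    BashOperator(task_id="say_hello", bash_command="echo 'Hello from Airflow'")
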
In order to restart a failed task, click on it and then click on \"Clear task\".

"},{"location":"development/airflow/#troubleshooting","title":"Troubleshooting","text":"

Note that when you add a new workflow under dags/, Airflow will not pick it up immediately. By default, the filesystem is only scanned for new DAGs every 300 seconds. However, once the DAG is added, updates are applied nearly instantaneously.

Also, if you edit a DAG while an instance of it is running, it might cause problems with that run, as Airflow will try to update the tasks and their properties in the DAG according to the file changes.

"},{"location":"development/contributing/","title":"Contributing guidelines","text":""},{"location":"development/contributing/#one-time-configuration","title":"One-time configuration","text":"

The steps in this section only ever need to be done once on any particular system.

Google Cloud configuration: 1. Install Google Cloud SDK: https://cloud.google.com/sdk/docs/install. 1. Log in to your work Google Account: run gcloud auth login and follow instructions. 1. Obtain Google application credentials: run gcloud auth application-default login and follow instructions.

Check that you have the make utility installed, and if not (which is unlikely), install it using your system package manager.

Check that you have java installed.

"},{"location":"development/contributing/#environment-configuration","title":"Environment configuration","text":"

Run make setup-dev to install/update the necessary packages and activate the development environment. You need to do this every time you open a new shell.

It is recommended to use VS Code as an IDE for development.

"},{"location":"development/contributing/#how-to-run-the-code","title":"How to run the code","text":"

All pipelines in this repository are intended to be run in Google Dataproc. Running them locally is not currently supported.

In order to run the code:

  1. Manually edit your local workflow/dag.yaml file and comment out the steps you do not want to run.

  2. Manually edit your local pyproject.toml file and modify the version of the code.

    • This must be different from the version used by any other people working on the repository to avoid any deployment conflicts, so it's a good idea to use your name, for example: 1.2.3+jdoe.
    • You can also add a brief branch description, for example: 1.2.3+jdoe.myfeature.
    • Note that the version must comply with PEP440 conventions, otherwise Poetry will not allow it to be deployed.
    • Do not use underscores or hyphens in your version name. When building the WHL file, they will be automatically converted to dots, which means the file name will no longer match the version and the build will fail. Use dots instead.
  3. Run make build.

    • This will create a bundle containing the necessary code, configuration and dependencies to run the ETL pipeline, and then upload this bundle to Google Cloud.
    • A version specific subpath is used, so uploading the code will not affect any branches but your own.
    • If there was already a code bundle uploaded with the same version number, it will be replaced.
  4. Submit the Dataproc job with poetry run python workflow/workflow_template.py

    • You will need to specify additional parameters, some are mandatory and some are optional. Run with --help to see usage.
    • The script will provision the cluster and submit the job.
    • The cluster will take a few minutes to get provisioned and running, during which the script will not output anything; this is normal.
    • Once submitted, you can monitor the progress of your job on this page: https://console.cloud.google.com/dataproc/jobs?project=open-targets-genetics-dev.
    • On completion (whether successful or a failure), the cluster will be automatically removed, so you don't have to worry about shutting it down to avoid incurring charges.
"},{"location":"development/contributing/#contributing-checklist","title":"Contributing checklist","text":"

When making changes, and especially when implementing a new module or feature, it's essential to ensure that all relevant sections of the code base are modified.

  • Run make check. This will run the linter and formatter to ensure that the code is compliant with the project conventions.
  • Develop unit tests for your code and run make test. This will run all unit tests in the repository, including the examples appended in the docstrings of some methods.
  • Update the configuration if necessary.
  • Update the documentation and check it with make build-documentation. This will start a local server to browse it (the URL will be printed, usually http://127.0.0.1:8000/).

For more details on each of these steps, see the sections below.

"},{"location":"development/contributing/#documentation","title":"Documentation","text":"
  • If during development you had a question which wasn't covered in the documentation, and someone explained it to you, add it to the documentation. The same applies if you encountered any instructions in the documentation which were obsolete or incorrect.
  • Documentation autogeneration expressions start with :::. They will automatically generate sections of the documentation based on class and method docstrings. Be sure to update them for:
  • Dataset definitions in docs/reference/dataset (example: docs/reference/dataset/study_index/study_index_finngen.md)
  • Step definition in docs/reference/step (example: docs/reference/step/finngen.md)
"},{"location":"development/contributing/#configuration","title":"Configuration","text":"
  • Input and output paths in config/datasets/gcp.yaml
  • Step configuration in config/step/STEP.yaml (example: config/step/finngen.yaml)
"},{"location":"development/contributing/#classes","title":"Classes","text":"
  • Dataset class in src/otg/dataset/ (example: src/otg/dataset/study_index.py → StudyIndexFinnGen)
  • Step main class in src/otg/STEP.py (example: src/otg/finngen.py)
"},{"location":"development/contributing/#tests","title":"Tests","text":"
  • Test study fixture in tests/conftest.py (example: mock_study_index_finngen in that module)
  • Test sample data in tests/data_samples (example: tests/data_samples/finngen_studies_sample.json)
  • Test definition in tests/ (example: tests/dataset/test_study_index.py \u2192 test_study_index_finngen_creation)
"},{"location":"development/troubleshooting/","title":"Troubleshooting","text":""},{"location":"development/troubleshooting/#blaslapack","title":"BLAS/LAPACK","text":"

If you see errors related to BLAS/LAPACK libraries, see this StackOverflow post for guidance.

"},{"location":"development/troubleshooting/#pyenv-and-poetry","title":"Pyenv and Poetry","text":"

If you see various errors thrown by Pyenv or Poetry, they can be hard to specifically diagnose and resolve. In this case, it often helps to remove those tools from the system completely. Follow these steps:

  1. Close your currently activated environment, if any: exit
  2. Uninstall Poetry: curl -sSL https://install.python-poetry.org | python3 - --uninstall
  3. Clear Poetry cache: rm -rf ~/.cache/pypoetry
  4. Clear pre-commit cache: rm -rf ~/.cache/pre-commit
  5. Switch to system Python shell: pyenv shell system
  6. Edit ~/.bashrc to remove the lines related to Pyenv configuration
  7. Remove Pyenv configuration and cache: rm -rf ~/.pyenv

After that, open a fresh shell session and run make setup-dev again.

"},{"location":"development/troubleshooting/#java","title":"Java","text":"

Officially, PySpark requires Java version 8 (a.k.a. 1.8) or above to work. However, if you have a very recent version of Java, you may experience issues, as it may introduce breaking changes that PySpark hasn't had time to integrate. For example, as of May 2023, PySpark did not work with Java 20.

If you are encountering problems with initialising a Spark session, try using Java 11.

"},{"location":"development/troubleshooting/#pre-commit","title":"Pre-commit","text":"

If you see an error message thrown by pre-commit, which looks like this (SyntaxError: Unexpected token '?'), followed by a JavaScript traceback, the issue is likely with your system NodeJS version.

One solution which can help in this case is to upgrade your system NodeJS version. However, this may not always be possible. For example, the Ubuntu repository is several major versions behind the latest version as of July 2023.

Another solution which helps is to remove Node, NodeJS, and npm from your system entirely. In this case, pre-commit will not try to rely on a system version of NodeJS and will install its own, suitable one.

On Ubuntu, this can be done using sudo apt remove node nodejs npm, followed by sudo apt autoremove. But in some cases, depending on your existing installation, you may need to also manually remove some files. See this StackOverflow answer for guidance.

After running these commands, you are advised to open a fresh shell, and then also reinstall Pyenv and Poetry to make sure they pick up the changes (see relevant section above).

"},{"location":"python_api/dataset/_dataset/","title":"Dataset","text":""},{"location":"python_api/dataset/_dataset/#otg.dataset.dataset.Dataset","title":"otg.dataset.dataset.Dataset dataclass","text":"

Bases: ABC

Open Targets Genetics Dataset.

Dataset is a wrapper around a Spark DataFrame with a predefined schema. Schemas for each child dataset are described in the schemas module.

Source code in src/otg/dataset/dataset.py
@dataclass\nclass Dataset(ABC):\n    \"\"\"Open Targets Genetics Dataset.\n\n    `Dataset` is a wrapper around a Spark DataFrame with a predefined schema. Schemas for each child dataset are described in the `schemas` module.\n    \"\"\"\n\n    _df: DataFrame\n    _schema: StructType\n\n    def __post_init__(self: Dataset) -> None:\n        \"\"\"Post init.\"\"\"\n        self.validate_schema()\n\n    @property\n    def df(self: Dataset) -> DataFrame:\n        \"\"\"Dataframe included in the Dataset.\n\n        Returns:\n            DataFrame: Dataframe included in the Dataset\n        \"\"\"\n        return self._df\n\n    @df.setter\n    def df(self: Dataset, new_df: DataFrame) -> None:  # noqa: CCE001\n        \"\"\"Dataframe setter.\n\n        Args:\n            new_df (DataFrame): New dataframe to be included in the Dataset\n        \"\"\"\n        self._df: DataFrame = new_df\n        self.validate_schema()\n\n    @property\n    def schema(self: Dataset) -> StructType:\n        \"\"\"Dataframe expected schema.\n\n        Returns:\n            StructType: Dataframe expected schema\n        \"\"\"\n        return self._schema\n\n    @classmethod\n    @abstractmethod\n    def get_schema(cls: type[Dataset]) -> StructType:\n        \"\"\"Abstract method to get the schema. Must be implemented by child classes.\n\n        Returns:\n            StructType: Schema for the Dataset\n        \"\"\"\n        pass\n\n    @classmethod\n    def from_parquet(\n        cls: type[Dataset], session: Session, path: str, **kwargs: dict[str, Any]\n    ) -> Dataset:\n        \"\"\"Reads a parquet file into a Dataset with a given schema.\n\n        Args:\n            session (Session): Spark session\n            path (str): Path to the parquet file\n            **kwargs (dict[str, Any]): Additional arguments to pass to spark.read.parquet\n\n        Returns:\n            Dataset: Dataset with the parquet file contents\n        \"\"\"\n        schema = cls.get_schema()\n        df = session.read_parquet(path=path, schema=schema, **kwargs)\n        return cls(_df=df, _schema=schema)\n\n    def validate_schema(self: Dataset) -> None:  # sourcery skip: invert-any-all\n        \"\"\"Validate DataFrame schema against expected class schema.\n\n        Raises:\n            ValueError: DataFrame schema is not valid\n        \"\"\"\n        expected_schema = self._schema\n        expected_fields = flatten_schema(expected_schema)\n        observed_schema = self._df.schema\n        observed_fields = flatten_schema(observed_schema)\n\n        # Unexpected fields in dataset\n        if unexpected_field_names := [\n            x.name\n            for x in observed_fields\n            if x.name not in [y.name for y in expected_fields]\n        ]:\n            raise ValueError(\n                f\"The {unexpected_field_names} fields are not included in DataFrame schema: {expected_fields}\"\n            )\n\n        # Required fields not in dataset\n        required_fields = [x.name for x in expected_schema if not x.nullable]\n        if missing_required_fields := [\n            req\n            for req in required_fields\n            if not any(field.name == req for field in observed_fields)\n        ]:\n            raise ValueError(\n                f\"The {missing_required_fields} fields are required but missing: {required_fields}\"\n            )\n\n        # Fields with duplicated names\n        if duplicated_fields := [\n            x for x in set(observed_fields) if observed_fields.count(x) > 1\n        ]:\n            
raise ValueError(\n                f\"The following fields are duplicated in DataFrame schema: {duplicated_fields}\"\n            )\n\n        # Fields with different datatype\n        observed_field_types = {\n            field.name: type(field.dataType) for field in observed_fields\n        }\n        expected_field_types = {\n            field.name: type(field.dataType) for field in expected_fields\n        }\n        if fields_with_different_observed_datatype := [\n            name\n            for name, observed_type in observed_field_types.items()\n            if name in expected_field_types\n            and observed_type != expected_field_types[name]\n        ]:\n            raise ValueError(\n                f\"The following fields present differences in their datatypes: {fields_with_different_observed_datatype}.\"\n            )\n\n    def persist(self: Dataset) -> Dataset:\n        \"\"\"Persist in memory the DataFrame included in the Dataset.\n\n        Returns:\n            Dataset: Persisted Dataset\n        \"\"\"\n        self.df = self._df.persist()\n        return self\n\n    def unpersist(self: Dataset) -> Dataset:\n        \"\"\"Remove the persisted DataFrame from memory.\n\n        Returns:\n            Dataset: Unpersisted Dataset\n        \"\"\"\n        self.df = self._df.unpersist()\n        return self\n
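For illustration, here is a minimal sketch of how a child dataset satisfies this contract. The ToyDataset class, its one-column schema and the example values are hypothetical and only meant to show that validate_schema runs at construction time; real datasets load their schema with parse_spark_schema.

from __future__ import annotations

from dataclasses import dataclass

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

from otg.dataset.dataset import Dataset


@dataclass
class ToyDataset(Dataset):
    """Hypothetical subclass used only for illustration."""

    @classmethod
    def get_schema(cls: type[ToyDataset]) -> StructType:
        # Real datasets parse their schema from a JSON file in the schemas module.
        return StructType([StructField("geneId", StringType(), nullable=False)])


spark = SparkSession.builder.getOrCreate()

# Valid: the DataFrame matches the expected schema, so __post_init__ passes silently.
ok = ToyDataset(
    _df=spark.createDataFrame([("ENSG00000157764",)], ToyDataset.get_schema()),
    _schema=ToyDataset.get_schema(),
)

# Invalid: an unexpected column (and the missing required geneId) raises ValueError.
try:
    ToyDataset(
        _df=spark.createDataFrame([(1,)], "unexpectedColumn INT"),
        _schema=ToyDataset.get_schema(),
    )
except ValueError as error:
    print(error)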
"},{"location":"python_api/dataset/_dataset/#otg.dataset.dataset.Dataset.df","title":"df: DataFrame property writable","text":"

Dataframe included in the Dataset.

Returns:

Name Type Description DataFrame DataFrame

Dataframe included in the Dataset

"},{"location":"python_api/dataset/_dataset/#otg.dataset.dataset.Dataset.schema","title":"schema: StructType property","text":"

Dataframe expected schema.

Returns:

Name Type Description StructType StructType

Dataframe expected schema

"},{"location":"python_api/dataset/_dataset/#otg.dataset.dataset.Dataset.from_parquet","title":"from_parquet(session: Session, path: str, **kwargs: dict[str, Any]) -> Dataset classmethod","text":"

Reads a parquet file into a Dataset with a given schema.

Parameters:

Name Type Description Default session Session

Spark session

required path str

Path to the parquet file

required **kwargs dict[str, Any]

Additional arguments to pass to spark.read.parquet

{}

Returns:

Name Type Description Dataset Dataset

Dataset with the parquet file contents

Source code in src/otg/dataset/dataset.py
@classmethod\ndef from_parquet(\n    cls: type[Dataset], session: Session, path: str, **kwargs: dict[str, Any]\n) -> Dataset:\n    \"\"\"Reads a parquet file into a Dataset with a given schema.\n\n    Args:\n        session (Session): Spark session\n        path (str): Path to the parquet file\n        **kwargs (dict[str, Any]): Additional arguments to pass to spark.read.parquet\n\n    Returns:\n        Dataset: Dataset with the parquet file contents\n    \"\"\"\n    schema = cls.get_schema()\n    df = session.read_parquet(path=path, schema=schema, **kwargs)\n    return cls(_df=df, _schema=schema)\n
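A minimal usage sketch, assuming the Session class is importable from otg.common.session (the exact module path may differ) and that a GeneIndex parquet dataset exists at the hypothetical location shown:

from otg.common.session import Session  # assumed import path
from otg.dataset.gene_index import GeneIndex

session = Session()  # construction details depend on your Spark environment

# The expected schema comes from GeneIndex.get_schema(), so the parquet files
# are read and validated against the GeneIndex schema on load.
gene_index = GeneIndex.from_parquet(session, "gs://my-bucket/outputs/gene_index")
gene_index.df.show(5, truncate=False)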
"},{"location":"python_api/dataset/_dataset/#otg.dataset.dataset.Dataset.get_schema","title":"get_schema() -> StructType abstractmethod classmethod","text":"

Abstract method to get the schema. Must be implemented by child classes.

Returns:

Name Type Description StructType StructType

Schema for the Dataset

Source code in src/otg/dataset/dataset.py
@classmethod\n@abstractmethod\ndef get_schema(cls: type[Dataset]) -> StructType:\n    \"\"\"Abstract method to get the schema. Must be implemented by child classes.\n\n    Returns:\n        StructType: Schema for the Dataset\n    \"\"\"\n    pass\n
"},{"location":"python_api/dataset/_dataset/#otg.dataset.dataset.Dataset.persist","title":"persist() -> Dataset","text":"

Persist in memory the DataFrame included in the Dataset.

Returns:

Name Type Description Dataset Dataset

Persisted Dataset

Source code in src/otg/dataset/dataset.py
def persist(self: Dataset) -> Dataset:\n    \"\"\"Persist in memory the DataFrame included in the Dataset.\n\n    Returns:\n        Dataset: Persisted Dataset\n    \"\"\"\n    self.df = self._df.persist()\n    return self\n
"},{"location":"python_api/dataset/_dataset/#otg.dataset.dataset.Dataset.unpersist","title":"unpersist() -> Dataset","text":"

Remove the persisted DataFrame from memory.

Returns:

Name Type Description Dataset Dataset

Unpersisted Dataset

Source code in src/otg/dataset/dataset.py
def unpersist(self: Dataset) -> Dataset:\n    \"\"\"Remove the persisted DataFrame from memory.\n\n    Returns:\n        Dataset: Unpersisted Dataset\n    \"\"\"\n    self.df = self._df.unpersist()\n    return self\n
"},{"location":"python_api/dataset/_dataset/#otg.dataset.dataset.Dataset.validate_schema","title":"validate_schema() -> None","text":"

Validate DataFrame schema against expected class schema.

Raises:

Type Description ValueError

DataFrame schema is not valid

Source code in src/otg/dataset/dataset.py
def validate_schema(self: Dataset) -> None:  # sourcery skip: invert-any-all\n    \"\"\"Validate DataFrame schema against expected class schema.\n\n    Raises:\n        ValueError: DataFrame schema is not valid\n    \"\"\"\n    expected_schema = self._schema\n    expected_fields = flatten_schema(expected_schema)\n    observed_schema = self._df.schema\n    observed_fields = flatten_schema(observed_schema)\n\n    # Unexpected fields in dataset\n    if unexpected_field_names := [\n        x.name\n        for x in observed_fields\n        if x.name not in [y.name for y in expected_fields]\n    ]:\n        raise ValueError(\n            f\"The {unexpected_field_names} fields are not included in DataFrame schema: {expected_fields}\"\n        )\n\n    # Required fields not in dataset\n    required_fields = [x.name for x in expected_schema if not x.nullable]\n    if missing_required_fields := [\n        req\n        for req in required_fields\n        if not any(field.name == req for field in observed_fields)\n    ]:\n        raise ValueError(\n            f\"The {missing_required_fields} fields are required but missing: {required_fields}\"\n        )\n\n    # Fields with duplicated names\n    if duplicated_fields := [\n        x for x in set(observed_fields) if observed_fields.count(x) > 1\n    ]:\n        raise ValueError(\n            f\"The following fields are duplicated in DataFrame schema: {duplicated_fields}\"\n        )\n\n    # Fields with different datatype\n    observed_field_types = {\n        field.name: type(field.dataType) for field in observed_fields\n    }\n    expected_field_types = {\n        field.name: type(field.dataType) for field in expected_fields\n    }\n    if fields_with_different_observed_datatype := [\n        name\n        for name, observed_type in observed_field_types.items()\n        if name in expected_field_types\n        and observed_type != expected_field_types[name]\n    ]:\n        raise ValueError(\n            f\"The following fields present differences in their datatypes: {fields_with_different_observed_datatype}.\"\n        )\n
"},{"location":"python_api/dataset/colocalisation/","title":"Colocalisation","text":""},{"location":"python_api/dataset/colocalisation/#otg.dataset.colocalisation.Colocalisation","title":"otg.dataset.colocalisation.Colocalisation dataclass","text":"

Bases: Dataset

Colocalisation results for pairs of overlapping study-locus.

Source code in src/otg/dataset/colocalisation.py
@dataclass\nclass Colocalisation(Dataset):\n    \"\"\"Colocalisation results for pairs of overlapping study-locus.\"\"\"\n\n    @classmethod\n    def get_schema(cls: type[Colocalisation]) -> StructType:\n        \"\"\"Provides the schema for the Colocalisation dataset.\n\n        Returns:\n            StructType: Schema for the Colocalisation dataset\n        \"\"\"\n        return parse_spark_schema(\"colocalisation.json\")\n
"},{"location":"python_api/dataset/colocalisation/#otg.dataset.colocalisation.Colocalisation.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the Colocalisation dataset.

Returns:

Name Type Description StructType StructType

Schema for the Colocalisation dataset

Source code in src/otg/dataset/colocalisation.py
@classmethod\ndef get_schema(cls: type[Colocalisation]) -> StructType:\n    \"\"\"Provides the schema for the Colocalisation dataset.\n\n    Returns:\n        StructType: Schema for the Colocalisation dataset\n    \"\"\"\n    return parse_spark_schema(\"colocalisation.json\")\n
"},{"location":"python_api/dataset/colocalisation/#schema","title":"Schema","text":"
root\n |-- leftStudyLocusId: long (nullable = false)\n |-- rightStudyLocusId: long (nullable = false)\n |-- chromosome: string (nullable = false)\n |-- colocalisationMethod: string (nullable = false)\n |-- numberColocalisingVariants: long (nullable = false)\n |-- h0: double (nullable = true)\n |-- h1: double (nullable = true)\n |-- h2: double (nullable = true)\n |-- h3: double (nullable = true)\n |-- h4: double (nullable = true)\n |-- log2h4h3: double (nullable = true)\n |-- clpp: double (nullable = true)\n
"},{"location":"python_api/dataset/gene_index/","title":"Gene Index","text":""},{"location":"python_api/dataset/gene_index/#otg.dataset.gene_index.GeneIndex","title":"otg.dataset.gene_index.GeneIndex dataclass","text":"

Bases: Dataset

Gene index dataset.

Gene-based annotation.

Source code in src/otg/dataset/gene_index.py
@dataclass\nclass GeneIndex(Dataset):\n    \"\"\"Gene index dataset.\n\n    Gene-based annotation.\n    \"\"\"\n\n    @classmethod\n    def get_schema(cls: type[GeneIndex]) -> StructType:\n        \"\"\"Provides the schema for the GeneIndex dataset.\n\n        Returns:\n            StructType: Schema for the GeneIndex dataset\n        \"\"\"\n        return parse_spark_schema(\"gene_index.json\")\n\n    def filter_by_biotypes(self: GeneIndex, biotypes: list) -> GeneIndex:\n        \"\"\"Filter by approved biotypes.\n\n        Args:\n            biotypes (list): List of Ensembl biotypes to keep.\n\n        Returns:\n            GeneIndex: Gene index dataset filtered by biotypes.\n        \"\"\"\n        self.df = self._df.filter(f.col(\"biotype\").isin(biotypes))\n        return self\n\n    def locations_lut(self: GeneIndex) -> DataFrame:\n        \"\"\"Gene location information.\n\n        Returns:\n            DataFrame: Gene LUT including genomic location information.\n        \"\"\"\n        return self.df.select(\n            \"geneId\",\n            \"chromosome\",\n            \"start\",\n            \"end\",\n            \"strand\",\n            \"tss\",\n        )\n\n    def symbols_lut(self: GeneIndex) -> DataFrame:\n        \"\"\"Gene symbol lookup table.\n\n        Pre-processess gene/target dataset to create lookup table of gene symbols, including\n        obsoleted gene symbols.\n\n        Returns:\n            DataFrame: Gene LUT for symbol mapping containing `geneId` and `geneSymbol` columns.\n        \"\"\"\n        return self.df.select(\n            f.explode(\n                f.array_union(f.array(\"approvedSymbol\"), f.col(\"obsoleteSymbols\"))\n            ).alias(\"geneSymbol\"),\n            \"*\",\n        )\n
"},{"location":"python_api/dataset/gene_index/#otg.dataset.gene_index.GeneIndex.filter_by_biotypes","title":"filter_by_biotypes(biotypes: list) -> GeneIndex","text":"

Filter by approved biotypes.

Parameters:

Name Type Description Default biotypes list

List of Ensembl biotypes to keep.

required

Returns:

Name Type Description GeneIndex GeneIndex

Gene index dataset filtered by biotypes.

Source code in src/otg/dataset/gene_index.py
def filter_by_biotypes(self: GeneIndex, biotypes: list) -> GeneIndex:\n    \"\"\"Filter by approved biotypes.\n\n    Args:\n        biotypes (list): List of Ensembl biotypes to keep.\n\n    Returns:\n        GeneIndex: Gene index dataset filtered by biotypes.\n    \"\"\"\n    self.df = self._df.filter(f.col(\"biotype\").isin(biotypes))\n    return self\n
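A short usage sketch, assuming gene_index is an already-loaded GeneIndex instance. Note that the filter mutates the wrapped dataframe in place and returns self, so the original object is filtered as well:

protein_coding = gene_index.filter_by_biotypes(["protein_coding"])
assert protein_coding is gene_index  # same object, now restricted to protein-coding genes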
"},{"location":"python_api/dataset/gene_index/#otg.dataset.gene_index.GeneIndex.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the GeneIndex dataset.

Returns:

Name Type Description StructType StructType

Schema for the GeneIndex dataset

Source code in src/otg/dataset/gene_index.py
@classmethod\ndef get_schema(cls: type[GeneIndex]) -> StructType:\n    \"\"\"Provides the schema for the GeneIndex dataset.\n\n    Returns:\n        StructType: Schema for the GeneIndex dataset\n    \"\"\"\n    return parse_spark_schema(\"gene_index.json\")\n
"},{"location":"python_api/dataset/gene_index/#otg.dataset.gene_index.GeneIndex.locations_lut","title":"locations_lut() -> DataFrame","text":"

Gene location information.

Returns:

Name Type Description DataFrame DataFrame

Gene LUT including genomic location information.

Source code in src/otg/dataset/gene_index.py
def locations_lut(self: GeneIndex) -> DataFrame:\n    \"\"\"Gene location information.\n\n    Returns:\n        DataFrame: Gene LUT including genomic location information.\n    \"\"\"\n    return self.df.select(\n        \"geneId\",\n        \"chromosome\",\n        \"start\",\n        \"end\",\n        \"strand\",\n        \"tss\",\n    )\n
"},{"location":"python_api/dataset/gene_index/#otg.dataset.gene_index.GeneIndex.symbols_lut","title":"symbols_lut() -> DataFrame","text":"

Gene symbol lookup table.

Pre-processes the gene/target dataset to create a lookup table of gene symbols, including obsolete gene symbols.

Returns:

Name Type Description DataFrame DataFrame

Gene LUT for symbol mapping containing geneId and geneSymbol columns.

Source code in src/otg/dataset/gene_index.py
def symbols_lut(self: GeneIndex) -> DataFrame:\n    \"\"\"Gene symbol lookup table.\n\n    Pre-processes gene/target dataset to create lookup table of gene symbols, including\n    obsoleted gene symbols.\n\n    Returns:\n        DataFrame: Gene LUT for symbol mapping containing `geneId` and `geneSymbol` columns.\n    \"\"\"\n    return self.df.select(\n        f.explode(\n            f.array_union(f.array(\"approvedSymbol\"), f.col(\"obsoleteSymbols\"))\n        ).alias(\"geneSymbol\"),\n        \"*\",\n    )\n
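A short sketch of how the lookup table is typically used: mapping a column of gene symbols (current or obsolete) to Ensembl gene identifiers. Here gene_index and my_df are assumed to exist, and my_df is assumed to carry a geneSymbol column:

symbols = gene_index.symbols_lut().select("geneSymbol", "geneId")
mapped = my_df.join(symbols, on="geneSymbol", how="left")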
"},{"location":"python_api/dataset/gene_index/#schema","title":"Schema","text":"
root\n |-- geneId: string (nullable = false)\n |-- chromosome: string (nullable = false)\n |-- approvedSymbol: string (nullable = true)\n |-- biotype: string (nullable = true)\n |-- approvedName: string (nullable = true)\n |-- obsoleteSymbols: array (nullable = true)\n |    |-- element: string (containsNull = true)\n |-- tss: long (nullable = true)\n |-- start: long (nullable = true)\n |-- end: long (nullable = true)\n |-- strand: integer (nullable = true)\n
"},{"location":"python_api/dataset/intervals/","title":"Intervals","text":""},{"location":"python_api/dataset/intervals/#otg.dataset.intervals.Intervals","title":"otg.dataset.intervals.Intervals dataclass","text":"

Bases: Dataset

Intervals dataset links genes to genomic regions based on genome interaction studies.

Source code in src/otg/dataset/intervals.py
@dataclass\nclass Intervals(Dataset):\n    \"\"\"Intervals dataset links genes to genomic regions based on genome interaction studies.\"\"\"\n\n    @classmethod\n    def get_schema(cls: type[Intervals]) -> StructType:\n        \"\"\"Provides the schema for the Intervals dataset.\n\n        Returns:\n            StructType: Schema for the Intervals dataset\n        \"\"\"\n        return parse_spark_schema(\"intervals.json\")\n\n    @classmethod\n    def from_source(\n        cls: type[Intervals],\n        spark: SparkSession,\n        source_name: str,\n        source_path: str,\n        gene_index: GeneIndex,\n        lift: LiftOverSpark,\n    ) -> Intervals:\n        \"\"\"Collect interval data for a particular source.\n\n        Args:\n            spark (SparkSession): Spark session\n            source_name (str): Name of the interval source\n            source_path (str): Path to the interval source file\n            gene_index (GeneIndex): Gene index\n            lift (LiftOverSpark): LiftOverSpark instance to convert coordinats from hg37 to hg38\n\n        Returns:\n            Intervals: Intervals dataset\n\n        Raises:\n            ValueError: If the source name is not recognised\n        \"\"\"\n        from otg.datasource.intervals.andersson import IntervalsAndersson\n        from otg.datasource.intervals.javierre import IntervalsJavierre\n        from otg.datasource.intervals.jung import IntervalsJung\n        from otg.datasource.intervals.thurman import IntervalsThurman\n\n        source_to_class = {\n            \"andersson\": IntervalsAndersson,\n            \"javierre\": IntervalsJavierre,\n            \"jung\": IntervalsJung,\n            \"thurman\": IntervalsThurman,\n        }\n\n        if source_name not in source_to_class:\n            raise ValueError(f\"Unknown interval source: {source_name}\")\n\n        source_class = source_to_class[source_name]\n        data = source_class.read(spark, source_path)\n        return source_class.parse(data, gene_index, lift)\n\n    def v2g(self: Intervals, variant_index: VariantIndex) -> V2G:\n        \"\"\"Convert intervals into V2G by intersecting with a variant index.\n\n        Args:\n            variant_index (VariantIndex): Variant index dataset\n\n        Returns:\n            V2G: Variant-to-gene evidence dataset\n        \"\"\"\n        return V2G(\n            _df=(\n                self.df.alias(\"interval\")\n                .join(\n                    variant_index.df.selectExpr(\n                        \"chromosome as vi_chromosome\", \"variantId\", \"position\"\n                    ).alias(\"vi\"),\n                    on=[\n                        f.col(\"vi.vi_chromosome\") == f.col(\"interval.chromosome\"),\n                        f.col(\"vi.position\").between(\n                            f.col(\"interval.start\"), f.col(\"interval.end\")\n                        ),\n                    ],\n                    how=\"inner\",\n                )\n                .drop(\"start\", \"end\", \"vi_chromosome\", \"position\")\n            ),\n            _schema=V2G.get_schema(),\n        )\n
"},{"location":"python_api/dataset/intervals/#otg.dataset.intervals.Intervals.from_source","title":"from_source(spark: SparkSession, source_name: str, source_path: str, gene_index: GeneIndex, lift: LiftOverSpark) -> Intervals classmethod","text":"

Collect interval data for a particular source.

Parameters:

Name Type Description Default spark SparkSession

Spark session

required source_name str

Name of the interval source

required source_path str

Path to the interval source file

required gene_index GeneIndex

Gene index

required lift LiftOverSpark

LiftOverSpark instance to convert coordinates from hg37 to hg38

required

Returns:

Name Type Description Intervals Intervals

Intervals dataset

Raises:

Type Description ValueError

If the source name is not recognised

Source code in src/otg/dataset/intervals.py
@classmethod\ndef from_source(\n    cls: type[Intervals],\n    spark: SparkSession,\n    source_name: str,\n    source_path: str,\n    gene_index: GeneIndex,\n    lift: LiftOverSpark,\n) -> Intervals:\n    \"\"\"Collect interval data for a particular source.\n\n    Args:\n        spark (SparkSession): Spark session\n        source_name (str): Name of the interval source\n        source_path (str): Path to the interval source file\n        gene_index (GeneIndex): Gene index\n        lift (LiftOverSpark): LiftOverSpark instance to convert coordinats from hg37 to hg38\n\n    Returns:\n        Intervals: Intervals dataset\n\n    Raises:\n        ValueError: If the source name is not recognised\n    \"\"\"\n    from otg.datasource.intervals.andersson import IntervalsAndersson\n    from otg.datasource.intervals.javierre import IntervalsJavierre\n    from otg.datasource.intervals.jung import IntervalsJung\n    from otg.datasource.intervals.thurman import IntervalsThurman\n\n    source_to_class = {\n        \"andersson\": IntervalsAndersson,\n        \"javierre\": IntervalsJavierre,\n        \"jung\": IntervalsJung,\n        \"thurman\": IntervalsThurman,\n    }\n\n    if source_name not in source_to_class:\n        raise ValueError(f\"Unknown interval source: {source_name}\")\n\n    source_class = source_to_class[source_name]\n    data = source_class.read(spark, source_path)\n    return source_class.parse(data, gene_index, lift)\n
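A usage sketch: spark is assumed to be an active SparkSession, gene_index a loaded GeneIndex, lift a configured LiftOverSpark instance, and the source path is hypothetical:

from otg.dataset.intervals import Intervals

intervals = Intervals.from_source(
    spark=spark,
    source_name="andersson",  # must be one of: andersson, javierre, jung, thurman
    source_path="gs://my-bucket/raw/andersson2014.bed",  # hypothetical location
    gene_index=gene_index,
    lift=lift,
)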
"},{"location":"python_api/dataset/intervals/#otg.dataset.intervals.Intervals.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the Intervals dataset.

Returns:

Name Type Description StructType StructType

Schema for the Intervals dataset

Source code in src/otg/dataset/intervals.py
@classmethod\ndef get_schema(cls: type[Intervals]) -> StructType:\n    \"\"\"Provides the schema for the Intervals dataset.\n\n    Returns:\n        StructType: Schema for the Intervals dataset\n    \"\"\"\n    return parse_spark_schema(\"intervals.json\")\n
"},{"location":"python_api/dataset/intervals/#otg.dataset.intervals.Intervals.v2g","title":"v2g(variant_index: VariantIndex) -> V2G","text":"

Convert intervals into V2G by intersecting with a variant index.

Parameters:

Name Type Description Default variant_index VariantIndex

Variant index dataset

required

Returns:

Name Type Description V2G V2G

Variant-to-gene evidence dataset

Source code in src/otg/dataset/intervals.py
def v2g(self: Intervals, variant_index: VariantIndex) -> V2G:\n    \"\"\"Convert intervals into V2G by intersecting with a variant index.\n\n    Args:\n        variant_index (VariantIndex): Variant index dataset\n\n    Returns:\n        V2G: Variant-to-gene evidence dataset\n    \"\"\"\n    return V2G(\n        _df=(\n            self.df.alias(\"interval\")\n            .join(\n                variant_index.df.selectExpr(\n                    \"chromosome as vi_chromosome\", \"variantId\", \"position\"\n                ).alias(\"vi\"),\n                on=[\n                    f.col(\"vi.vi_chromosome\") == f.col(\"interval.chromosome\"),\n                    f.col(\"vi.position\").between(\n                        f.col(\"interval.start\"), f.col(\"interval.end\")\n                    ),\n                ],\n                how=\"inner\",\n            )\n            .drop(\"start\", \"end\", \"vi_chromosome\", \"position\")\n        ),\n        _schema=V2G.get_schema(),\n    )\n
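A usage sketch, assuming intervals and variant_index are loaded datasets. Each output row links a variant that falls inside an interval to the interval's gene:

v2g = intervals.v2g(variant_index)
v2g.df.select("variantId", "geneId", "datasourceId", "score").show(5)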
"},{"location":"python_api/dataset/intervals/#schema","title":"Schema","text":"
root\n |-- chromosome: string (nullable = false)\n |-- start: string (nullable = false)\n |-- end: string (nullable = false)\n |-- geneId: string (nullable = false)\n |-- resourceScore: double (nullable = true)\n |-- score: double (nullable = true)\n |-- datasourceId: string (nullable = false)\n |-- datatypeId: string (nullable = false)\n |-- pmid: string (nullable = true)\n |-- biofeature: string (nullable = true)\n
"},{"location":"python_api/dataset/l2g_feature_matrix/","title":"L2G Feature Matrix","text":""},{"location":"python_api/dataset/l2g_feature_matrix/#otg.dataset.l2g_feature_matrix.L2GFeatureMatrix","title":"otg.dataset.l2g_feature_matrix.L2GFeatureMatrix dataclass","text":"

Bases: Dataset

Dataset with features for Locus to Gene prediction.

Source code in src/otg/dataset/l2g_feature_matrix.py
@dataclass\nclass L2GFeatureMatrix(Dataset):\n    \"\"\"Dataset with features for Locus to Gene prediction.\"\"\"\n\n    @classmethod\n    def generate_features(\n        cls: Type[L2GFeatureMatrix],\n        study_locus: StudyLocus,\n        study_index: StudyIndex,\n        variant_gene: V2G,\n        # colocalisation: Colocalisation,\n    ) -> L2GFeatureMatrix:\n        \"\"\"Generate features from the OTG datasets.\n\n        Args:\n            study_locus (StudyLocus): Study locus dataset\n            study_index (StudyIndex): Study index dataset\n            variant_gene (V2G): Variant to gene dataset\n\n        Returns:\n            L2GFeatureMatrix: L2G feature matrix dataset\n\n        Raises:\n            ValueError: If the feature matrix is empty\n        \"\"\"\n        if features_dfs := [\n            # Extract features\n            # ColocalisationFactory._get_coloc_features(\n            #     study_locus, study_index, colocalisation\n            # ).df,\n            StudyLocusFactory._get_tss_distance_features(study_locus, variant_gene).df,\n        ]:\n            fm = reduce(\n                lambda x, y: x.unionByName(y),\n                features_dfs,\n            )\n        else:\n            raise ValueError(\"No features found\")\n\n        # raise error if the feature matrix is empty\n        if fm.limit(1).count() != 0:\n            return cls(\n                _df=_convert_from_long_to_wide(\n                    fm, [\"studyLocusId\", \"geneId\"], \"featureName\", \"featureValue\"\n                ),\n                _schema=cls.get_schema(),\n            )\n        raise ValueError(\"L2G Feature matrix is empty\")\n\n    @classmethod\n    def get_schema(cls: type[L2GFeatureMatrix]) -> StructType:\n        \"\"\"Provides the schema for the L2gFeatureMatrix dataset.\n\n        Returns:\n            StructType: Schema for the L2gFeatureMatrix dataset\n        \"\"\"\n        return parse_spark_schema(\"l2g_feature_matrix.json\")\n\n    def fill_na(\n        self: L2GFeatureMatrix, value: float = 0.0, subset: list[str] | None = None\n    ) -> L2GFeatureMatrix:\n        \"\"\"Fill missing values in a column with a given value.\n\n        Args:\n            value (float): Value to replace missing values with. Defaults to 0.0.\n            subset (list[str] | None): Subset of columns to consider. 
Defaults to None.\n\n        Returns:\n            L2GFeatureMatrix: L2G feature matrix dataset\n        \"\"\"\n        self.df = self._df.fillna(value, subset=subset)\n        return self\n\n    def select_features(\n        self: L2GFeatureMatrix, features_list: list[str]\n    ) -> L2GFeatureMatrix:\n        \"\"\"Select a subset of features from the feature matrix.\n\n        Args:\n            features_list (list[str]): List of features to select\n\n        Returns:\n            L2GFeatureMatrix: L2G feature matrix dataset\n        \"\"\"\n        fixed_rows = [\"studyLocusId\", \"geneId\", \"goldStandardSet\"]\n        self.df = self._df.select(fixed_rows + features_list)\n        return self\n\n    def train_test_split(\n        self: L2GFeatureMatrix, fraction: float\n    ) -> tuple[L2GFeatureMatrix, L2GFeatureMatrix]:\n        \"\"\"Split the dataset into training and test sets.\n\n        Args:\n            fraction (float): Fraction of the dataset to use for training\n\n        Returns:\n            tuple[L2GFeatureMatrix, L2GFeatureMatrix]: Training and test datasets\n        \"\"\"\n        train, test = self._df.randomSplit([fraction, 1 - fraction], seed=42)\n        return (\n            L2GFeatureMatrix(\n                _df=train, _schema=L2GFeatureMatrix.get_schema()\n            ).persist(),\n            L2GFeatureMatrix(_df=test, _schema=L2GFeatureMatrix.get_schema()).persist(),\n        )\n
"},{"location":"python_api/dataset/l2g_feature_matrix/#otg.dataset.l2g_feature_matrix.L2GFeatureMatrix.fill_na","title":"fill_na(value: float = 0.0, subset: list[str] | None = None) -> L2GFeatureMatrix","text":"

Fill missing values in a column with a given value.

Parameters:

Name Type Description Default value float

Value to replace missing values with. Defaults to 0.0.

0.0 subset list[str] | None

Subset of columns to consider. Defaults to None.

None

Returns:

Name Type Description L2GFeatureMatrix L2GFeatureMatrix

L2G feature matrix dataset

Source code in src/otg/dataset/l2g_feature_matrix.py
def fill_na(\n    self: L2GFeatureMatrix, value: float = 0.0, subset: list[str] | None = None\n) -> L2GFeatureMatrix:\n    \"\"\"Fill missing values in a column with a given value.\n\n    Args:\n        value (float): Value to replace missing values with. Defaults to 0.0.\n        subset (list[str] | None): Subset of columns to consider. Defaults to None.\n\n    Returns:\n        L2GFeatureMatrix: L2G feature matrix dataset\n    \"\"\"\n    self.df = self._df.fillna(value, subset=subset)\n    return self\n
"},{"location":"python_api/dataset/l2g_feature_matrix/#otg.dataset.l2g_feature_matrix.L2GFeatureMatrix.generate_features","title":"generate_features(study_locus: StudyLocus, study_index: StudyIndex, variant_gene: V2G) -> L2GFeatureMatrix classmethod","text":"

Generate features from the OTG datasets.

Parameters:

Name Type Description Default study_locus StudyLocus

Study locus dataset

required study_index StudyIndex

Study index dataset

required variant_gene V2G

Variant to gene dataset

required

Returns:

Name Type Description L2GFeatureMatrix L2GFeatureMatrix

L2G feature matrix dataset

Raises:

Type Description ValueError

If the feature matrix is empty

Source code in src/otg/dataset/l2g_feature_matrix.py
@classmethod\ndef generate_features(\n    cls: Type[L2GFeatureMatrix],\n    study_locus: StudyLocus,\n    study_index: StudyIndex,\n    variant_gene: V2G,\n    # colocalisation: Colocalisation,\n) -> L2GFeatureMatrix:\n    \"\"\"Generate features from the OTG datasets.\n\n    Args:\n        study_locus (StudyLocus): Study locus dataset\n        study_index (StudyIndex): Study index dataset\n        variant_gene (V2G): Variant to gene dataset\n\n    Returns:\n        L2GFeatureMatrix: L2G feature matrix dataset\n\n    Raises:\n        ValueError: If the feature matrix is empty\n    \"\"\"\n    if features_dfs := [\n        # Extract features\n        # ColocalisationFactory._get_coloc_features(\n        #     study_locus, study_index, colocalisation\n        # ).df,\n        StudyLocusFactory._get_tss_distance_features(study_locus, variant_gene).df,\n    ]:\n        fm = reduce(\n            lambda x, y: x.unionByName(y),\n            features_dfs,\n        )\n    else:\n        raise ValueError(\"No features found\")\n\n    # raise error if the feature matrix is empty\n    if fm.limit(1).count() != 0:\n        return cls(\n            _df=_convert_from_long_to_wide(\n                fm, [\"studyLocusId\", \"geneId\"], \"featureName\", \"featureValue\"\n            ),\n            _schema=cls.get_schema(),\n        )\n    raise ValueError(\"L2G Feature matrix is empty\")\n
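A usage sketch, assuming study_locus, study_index and v2g are loaded datasets. In this version only the TSS-distance features are computed (the colocalisation features are commented out), and fill_na() imputes missing feature values with 0.0:

from otg.dataset.l2g_feature_matrix import L2GFeatureMatrix

fm = L2GFeatureMatrix.generate_features(
    study_locus=study_locus,
    study_index=study_index,
    variant_gene=v2g,
).fill_na()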
"},{"location":"python_api/dataset/l2g_feature_matrix/#otg.dataset.l2g_feature_matrix.L2GFeatureMatrix.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the L2gFeatureMatrix dataset.

Returns:

Name Type Description StructType StructType

Schema for the L2gFeatureMatrix dataset

Source code in src/otg/dataset/l2g_feature_matrix.py
@classmethod\ndef get_schema(cls: type[L2GFeatureMatrix]) -> StructType:\n    \"\"\"Provides the schema for the L2gFeatureMatrix dataset.\n\n    Returns:\n        StructType: Schema for the L2gFeatureMatrix dataset\n    \"\"\"\n    return parse_spark_schema(\"l2g_feature_matrix.json\")\n
"},{"location":"python_api/dataset/l2g_feature_matrix/#otg.dataset.l2g_feature_matrix.L2GFeatureMatrix.select_features","title":"select_features(features_list: list[str]) -> L2GFeatureMatrix","text":"

Select a subset of features from the feature matrix.

Parameters:

Name Type Description Default features_list list[str]

List of features to select

required

Returns:

Name Type Description L2GFeatureMatrix L2GFeatureMatrix

L2G feature matrix dataset

Source code in src/otg/dataset/l2g_feature_matrix.py
def select_features(\n    self: L2GFeatureMatrix, features_list: list[str]\n) -> L2GFeatureMatrix:\n    \"\"\"Select a subset of features from the feature matrix.\n\n    Args:\n        features_list (list[str]): List of features to select\n\n    Returns:\n        L2GFeatureMatrix: L2G feature matrix dataset\n    \"\"\"\n    fixed_rows = [\"studyLocusId\", \"geneId\", \"goldStandardSet\"]\n    self.df = self._df.select(fixed_rows + features_list)\n    return self\n
"},{"location":"python_api/dataset/l2g_feature_matrix/#otg.dataset.l2g_feature_matrix.L2GFeatureMatrix.train_test_split","title":"train_test_split(fraction: float) -> tuple[L2GFeatureMatrix, L2GFeatureMatrix]","text":"

Split the dataset into training and test sets.

Parameters:

Name Type Description Default fraction float

Fraction of the dataset to use for training

required

Returns:

Type Description tuple[L2GFeatureMatrix, L2GFeatureMatrix]

tuple[L2GFeatureMatrix, L2GFeatureMatrix]: Training and test datasets

Source code in src/otg/dataset/l2g_feature_matrix.py
def train_test_split(\n    self: L2GFeatureMatrix, fraction: float\n) -> tuple[L2GFeatureMatrix, L2GFeatureMatrix]:\n    \"\"\"Split the dataset into training and test sets.\n\n    Args:\n        fraction (float): Fraction of the dataset to use for training\n\n    Returns:\n        tuple[L2GFeatureMatrix, L2GFeatureMatrix]: Training and test datasets\n    \"\"\"\n    train, test = self._df.randomSplit([fraction, 1 - fraction], seed=42)\n    return (\n        L2GFeatureMatrix(\n            _df=train, _schema=L2GFeatureMatrix.get_schema()\n        ).persist(),\n        L2GFeatureMatrix(_df=test, _schema=L2GFeatureMatrix.get_schema()).persist(),\n    )\n
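A usage sketch of an 80/20 split; both halves are persisted by train_test_split and the seed is fixed, so repeated calls give the same partition:

train, test = fm.train_test_split(fraction=0.8)
print(train.df.count(), test.df.count())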
"},{"location":"python_api/dataset/l2g_feature_matrix/#schema","title":"Schema","text":"
root\n |-- studyLocusId: long (nullable = false)\n |-- geneId: string (nullable = false)\n |-- goldStandardSet: string (nullable = true)\n |-- distanceTssMean: float (nullable = true)\n |-- distanceTssMinimum: float (nullable = true)\n |-- eqtlColocClppLocalMaximum: double (nullable = true)\n |-- eqtlColocClppNeighborhoodMaximum: double (nullable = true)\n |-- eqtlColocLlrLocalMaximum: double (nullable = true)\n |-- eqtlColocLlrNeighborhoodMaximum: double (nullable = true)\n |-- pqtlColocClppLocalMaximum: double (nullable = true)\n |-- pqtlColocClppNeighborhoodMaximum: double (nullable = true)\n |-- pqtlColocLlrLocalMaximum: double (nullable = true)\n |-- pqtlColocLlrNeighborhoodMaximum: double (nullable = true)\n |-- sqtlColocClppLocalMaximum: double (nullable = true)\n |-- sqtlColocClppNeighborhoodMaximum: double (nullable = true)\n |-- sqtlColocLlrLocalMaximum: double (nullable = true)\n |-- sqtlColocLlrNeighborhoodMaximum: double (nullable = true)\n
"},{"location":"python_api/dataset/l2g_gold_standard/","title":"L2G Gold Standard","text":""},{"location":"python_api/dataset/l2g_gold_standard/#otg.dataset.l2g_gold_standard.L2GGoldStandard","title":"otg.dataset.l2g_gold_standard.L2GGoldStandard dataclass","text":"

Bases: Dataset

L2G gold standard dataset.

Source code in src/otg/dataset/l2g_gold_standard.py
@dataclass\nclass L2GGoldStandard(Dataset):\n    \"\"\"L2G gold standard dataset.\"\"\"\n\n    @classmethod\n    def from_otg_curation(\n        cls: type[L2GGoldStandard],\n        gold_standard_curation: DataFrame,\n        v2g: V2G,\n        study_locus_overlap: StudyLocusOverlap,\n        interactions: DataFrame,\n    ) -> L2GGoldStandard:\n        \"\"\"Initialise L2GGoldStandard from source dataset.\n\n        Args:\n            gold_standard_curation (DataFrame): Gold standard curation dataframe, extracted from\n            v2g (V2G): Variant to gene dataset to bring distance between a variant and a gene's TSS\n            study_locus_overlap (StudyLocusOverlap): Study locus overlap dataset to remove duplicated loci\n            interactions (DataFrame): Gene-gene interactions dataset to remove negative cases where the gene interacts with a positive gene\n\n        Returns:\n            L2GGoldStandard: L2G Gold Standard dataset\n        \"\"\"\n        from otg.datasource.open_targets.l2g_gold_standard import (\n            OpenTargetsL2GGoldStandard,\n        )\n\n        return OpenTargetsL2GGoldStandard.as_l2g_gold_standard(\n            gold_standard_curation, v2g, study_locus_overlap, interactions\n        )\n\n    @classmethod\n    def get_schema(cls: type[L2GGoldStandard]) -> StructType:\n        \"\"\"Provides the schema for the L2GGoldStandard dataset.\n\n        Returns:\n            StructType: Spark schema for the L2GGoldStandard dataset\n        \"\"\"\n        return parse_spark_schema(\"l2g_gold_standard.json\")\n
"},{"location":"python_api/dataset/l2g_gold_standard/#otg.dataset.l2g_gold_standard.L2GGoldStandard.from_otg_curation","title":"from_otg_curation(gold_standard_curation: DataFrame, v2g: V2G, study_locus_overlap: StudyLocusOverlap, interactions: DataFrame) -> L2GGoldStandard classmethod","text":"

Initialise L2GGoldStandard from source dataset.

Parameters:

Name Type Description Default gold_standard_curation DataFrame

Gold standard curation dataframe, extracted from the Open Targets gold standard curation

required v2g V2G

Variant-to-gene dataset used to bring in the distance between a variant and a gene's TSS

required study_locus_overlap StudyLocusOverlap

Study locus overlap dataset to remove duplicated loci

required interactions DataFrame

Gene-gene interactions dataset to remove negative cases where the gene interacts with a positive gene

required

Returns:

Name Type Description L2GGoldStandard L2GGoldStandard

L2G Gold Standard dataset

Source code in src/otg/dataset/l2g_gold_standard.py
@classmethod\ndef from_otg_curation(\n    cls: type[L2GGoldStandard],\n    gold_standard_curation: DataFrame,\n    v2g: V2G,\n    study_locus_overlap: StudyLocusOverlap,\n    interactions: DataFrame,\n) -> L2GGoldStandard:\n    \"\"\"Initialise L2GGoldStandard from source dataset.\n\n    Args:\n        gold_standard_curation (DataFrame): Gold standard curation dataframe, extracted from\n        v2g (V2G): Variant to gene dataset to bring distance between a variant and a gene's TSS\n        study_locus_overlap (StudyLocusOverlap): Study locus overlap dataset to remove duplicated loci\n        interactions (DataFrame): Gene-gene interactions dataset to remove negative cases where the gene interacts with a positive gene\n\n    Returns:\n        L2GGoldStandard: L2G Gold Standard dataset\n    \"\"\"\n    from otg.datasource.open_targets.l2g_gold_standard import (\n        OpenTargetsL2GGoldStandard,\n    )\n\n    return OpenTargetsL2GGoldStandard.as_l2g_gold_standard(\n        gold_standard_curation, v2g, study_locus_overlap, interactions\n    )\n
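A usage sketch, assuming spark is an active SparkSession, the curation path and file format are hypothetical, and the supporting datasets are already loaded:

from otg.dataset.l2g_gold_standard import L2GGoldStandard

gold_standard = L2GGoldStandard.from_otg_curation(
    gold_standard_curation=spark.read.json("gs://my-bucket/gold_standard_curation.json"),
    v2g=v2g,                                   # supplies variant-to-gene TSS distances
    study_locus_overlap=study_locus_overlap,   # used to drop duplicated loci
    interactions=interactions_df,              # used to remove interacting negative cases
)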
"},{"location":"python_api/dataset/l2g_gold_standard/#otg.dataset.l2g_gold_standard.L2GGoldStandard.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the L2GGoldStandard dataset.

Returns:

Name Type Description StructType StructType

Spark schema for the L2GGoldStandard dataset

Source code in src/otg/dataset/l2g_gold_standard.py
@classmethod\ndef get_schema(cls: type[L2GGoldStandard]) -> StructType:\n    \"\"\"Provides the schema for the L2GGoldStandard dataset.\n\n    Returns:\n        StructType: Spark schema for the L2GGoldStandard dataset\n    \"\"\"\n    return parse_spark_schema(\"l2g_gold_standard.json\")\n
"},{"location":"python_api/dataset/l2g_gold_standard/#schema","title":"Schema","text":"
root\n |-- studyLocusId: long (nullable = false)\n |-- geneId: string (nullable = false)\n |-- goldStandardSet: string (nullable = false)\n |-- sources: array (nullable = false)\n |    |-- element: string (containsNull = true)\n
"},{"location":"python_api/dataset/l2g_prediction/","title":"L2G Prediction","text":""},{"location":"python_api/dataset/l2g_prediction/#otg.dataset.l2g_prediction.L2GPrediction","title":"otg.dataset.l2g_prediction.L2GPrediction dataclass","text":"

Bases: Dataset

Dataset that contains the Locus to Gene predictions.

It is the result of applying the L2G model to a feature matrix that contains all the study/locus pairs and their functional annotations. The score column reflects the confidence of the prediction that a gene is causal for an association.

Source code in src/otg/dataset/l2g_prediction.py
@dataclass\nclass L2GPrediction(Dataset):\n    \"\"\"Dataset that contains the Locus to Gene predictions.\n\n    It is the result of applying the L2G model on a feature matrix, which contains all\n    the study/locus pairs and their functional annotations. The score column informs the\n    confidence of the prediction that a gene is causal to an association.\n    \"\"\"\n\n    @classmethod\n    def get_schema(cls: type[L2GPrediction]) -> StructType:\n        \"\"\"Provides the schema for the L2GPrediction dataset.\n\n        Returns:\n            StructType: Schema for the L2GPrediction dataset\n        \"\"\"\n        return parse_spark_schema(\"l2g_predictions.json\")\n\n    @classmethod\n    def from_study_locus(\n        cls: Type[L2GPrediction],\n        model_path: str,\n        study_locus: StudyLocus,\n        study_index: StudyIndex,\n        v2g: V2G,\n        # coloc: Colocalisation,\n    ) -> L2GPrediction:\n        \"\"\"Initialise L2G from feature matrix.\n\n        Args:\n            model_path (str): Path to the fitted model\n            study_locus (StudyLocus): Study locus dataset\n            study_index (StudyIndex): Study index dataset\n            v2g (V2G): Variant to gene dataset\n\n        Returns:\n            L2GPrediction: L2G dataset\n        \"\"\"\n        fm = L2GFeatureMatrix.generate_features(\n            study_locus=study_locus,\n            study_index=StudyIndex,\n            variant_gene=v2g,\n            # colocalisation=coloc,\n        ).fill_na()\n        return L2GPrediction(\n            # Load and apply fitted model\n            _df=(\n                LocusToGeneModel.load_from_disk(\n                    model_path,\n                    features_list=fm.df.drop(\"studyLocusId\", \"geneId\").columns,\n                ).predict(fm)\n                # the probability of the positive class is the second element inside the probability array\n                # - this is selected as the L2G probability\n                .select(\n                    \"studyLocusId\",\n                    \"geneId\",\n                    vector_to_array(\"probability\")[1].alias(\"score\"),\n                )\n            ),\n            _schema=cls.get_schema(),\n        )\n
"},{"location":"python_api/dataset/l2g_prediction/#otg.dataset.l2g_prediction.L2GPrediction.from_study_locus","title":"from_study_locus(model_path: str, study_locus: StudyLocus, study_index: StudyIndex, v2g: V2G) -> L2GPrediction classmethod","text":"

Initialise L2G from feature matrix.

Parameters:

Name Type Description Default model_path str

Path to the fitted model

required study_locus StudyLocus

Study locus dataset

required study_index StudyIndex

Study index dataset

required v2g V2G

Variant to gene dataset

required

Returns:

Name Type Description L2GPrediction L2GPrediction

L2G dataset

Source code in src/otg/dataset/l2g_prediction.py
@classmethod\ndef from_study_locus(\n    cls: Type[L2GPrediction],\n    model_path: str,\n    study_locus: StudyLocus,\n    study_index: StudyIndex,\n    v2g: V2G,\n    # coloc: Colocalisation,\n) -> L2GPrediction:\n    \"\"\"Initialise L2G from feature matrix.\n\n    Args:\n        model_path (str): Path to the fitted model\n        study_locus (StudyLocus): Study locus dataset\n        study_index (StudyIndex): Study index dataset\n        v2g (V2G): Variant to gene dataset\n\n    Returns:\n        L2GPrediction: L2G dataset\n    \"\"\"\n    fm = L2GFeatureMatrix.generate_features(\n        study_locus=study_locus,\n        study_index=StudyIndex,\n        variant_gene=v2g,\n        # colocalisation=coloc,\n    ).fill_na()\n    return L2GPrediction(\n        # Load and apply fitted model\n        _df=(\n            LocusToGeneModel.load_from_disk(\n                model_path,\n                features_list=fm.df.drop(\"studyLocusId\", \"geneId\").columns,\n            ).predict(fm)\n            # the probability of the positive class is the second element inside the probability array\n            # - this is selected as the L2G probability\n            .select(\n                \"studyLocusId\",\n                \"geneId\",\n                vector_to_array(\"probability\")[1].alias(\"score\"),\n            )\n        ),\n        _schema=cls.get_schema(),\n    )\n
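A usage sketch; the model path is hypothetical and study_locus, study_index and v2g are assumed to be loaded datasets. The resulting dataframe holds one score per study-locus/gene pair:

from otg.dataset.l2g_prediction import L2GPrediction

predictions = L2GPrediction.from_study_locus(
    model_path="gs://my-bucket/models/l2g_classifier",  # hypothetical path
    study_locus=study_locus,
    study_index=study_index,
    v2g=v2g,
)
predictions.df.orderBy("score", ascending=False).show(10)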
"},{"location":"python_api/dataset/l2g_prediction/#otg.dataset.l2g_prediction.L2GPrediction.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the L2GPrediction dataset.

Returns:

Name Type Description StructType StructType

Schema for the L2GPrediction dataset

Source code in src/otg/dataset/l2g_prediction.py
@classmethod\ndef get_schema(cls: type[L2GPrediction]) -> StructType:\n    \"\"\"Provides the schema for the L2GPrediction dataset.\n\n    Returns:\n        StructType: Schema for the L2GPrediction dataset\n    \"\"\"\n    return parse_spark_schema(\"l2g_predictions.json\")\n
"},{"location":"python_api/dataset/l2g_prediction/#schema","title":"Schema","text":""},{"location":"python_api/dataset/ld_index/","title":"LD Index","text":""},{"location":"python_api/dataset/ld_index/#otg.dataset.ld_index.LDIndex","title":"otg.dataset.ld_index.LDIndex dataclass","text":"

Bases: Dataset

Dataset containing linkage disequilibrium information between variants.

Source code in src/otg/dataset/ld_index.py
@dataclass\nclass LDIndex(Dataset):\n    \"\"\"Dataset containing linkage disequilibrium information between variants.\"\"\"\n\n    @classmethod\n    def get_schema(cls: type[LDIndex]) -> StructType:\n        \"\"\"Provides the schema for the LDIndex dataset.\n\n        Returns:\n            StructType: Schema for the LDIndex dataset\n        \"\"\"\n        return parse_spark_schema(\"ld_index.json\")\n
"},{"location":"python_api/dataset/ld_index/#otg.dataset.ld_index.LDIndex.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the LDIndex dataset.

Returns:

Name Type Description StructType StructType

Schema for the LDIndex dataset

Source code in src/otg/dataset/ld_index.py
@classmethod\ndef get_schema(cls: type[LDIndex]) -> StructType:\n    \"\"\"Provides the schema for the LDIndex dataset.\n\n    Returns:\n        StructType: Schema for the LDIndex dataset\n    \"\"\"\n    return parse_spark_schema(\"ld_index.json\")\n
"},{"location":"python_api/dataset/ld_index/#schema","title":"Schema","text":"
root\n |-- variantId: string (nullable = false)\n |-- chromosome: string (nullable = false)\n |-- ldSet: array (nullable = false)\n |    |-- element: struct (containsNull = false)\n |    |    |-- tagVariantId: string (nullable = false)\n |    |    |-- rValues: array (nullable = false)\n |    |    |    |-- element: struct (containsNull = false)\n |    |    |    |    |-- population: string (nullable = false)\n |    |    |    |    |-- r: double (nullable = false)\n
"},{"location":"python_api/dataset/study_index/","title":"Study Index","text":""},{"location":"python_api/dataset/study_index/#otg.dataset.study_index.StudyIndex","title":"otg.dataset.study_index.StudyIndex dataclass","text":"

Bases: Dataset

Study index dataset.

A study index dataset captures all the metadata for all studies including GWAS and Molecular QTL.

Source code in src/otg/dataset/study_index.py
@dataclass\nclass StudyIndex(Dataset):\n    \"\"\"Study index dataset.\n\n    A study index dataset captures all the metadata for all studies including GWAS and Molecular QTL.\n    \"\"\"\n\n    @staticmethod\n    def _aggregate_samples_by_ancestry(merged: Column, ancestry: Column) -> Column:\n        \"\"\"Aggregate sample counts by ancestry in a list of struct colmns.\n\n        Args:\n            merged (Column): A column representing merged data (list of structs).\n            ancestry (Column): The `ancestry` parameter is a column that represents the ancestry of each\n                sample. (a struct)\n\n        Returns:\n            Column: the modified \"merged\" column after aggregating the samples by ancestry.\n        \"\"\"\n        # Iterating over the list of ancestries and adding the sample size if label matches:\n        return f.transform(\n            merged,\n            lambda a: f.when(\n                a.ancestry == ancestry.ancestry,\n                f.struct(\n                    a.ancestry.alias(\"ancestry\"),\n                    (a.sampleSize + ancestry.sampleSize).alias(\"sampleSize\"),\n                ),\n            ).otherwise(a),\n        )\n\n    @staticmethod\n    def _map_ancestries_to_ld_population(gwas_ancestry_label: Column) -> Column:\n        \"\"\"Normalise ancestry column from GWAS studies into reference LD panel based on a pre-defined map.\n\n        This function assumes all possible ancestry categories have a corresponding\n        LD panel in the LD index. It is very important to have the ancestry labels\n        moved to the LD panel map.\n\n        Args:\n            gwas_ancestry_label (Column): A struct column with ancestry label like Finnish,\n                European, African etc. and the corresponding sample size.\n\n        Returns:\n            Column: Struct column with the mapped LD population label and the sample size.\n        \"\"\"\n        # Loading ancestry label to LD population label:\n        json_dict = json.loads(\n            pkg_resources.read_text(\n                data, \"gwas_population_2_LD_panel_map.json\", encoding=\"utf-8\"\n            )\n        )\n        map_expr = f.create_map(*[f.lit(x) for x in chain(*json_dict.items())])\n\n        return f.struct(\n            map_expr[gwas_ancestry_label.ancestry].alias(\"ancestry\"),\n            gwas_ancestry_label.sampleSize.alias(\"sampleSize\"),\n        )\n\n    @classmethod\n    def get_schema(cls: type[StudyIndex]) -> StructType:\n        \"\"\"Provide the schema for the StudyIndex dataset.\n\n        Returns:\n            StructType: The schema of the StudyIndex dataset.\n        \"\"\"\n        return parse_spark_schema(\"study_index.json\")\n\n    @classmethod\n    def aggregate_and_map_ancestries(\n        cls: type[StudyIndex], discovery_samples: Column\n    ) -> Column:\n        \"\"\"Map ancestries to populations in the LD reference and calculate relative sample size.\n\n        Args:\n            discovery_samples (Column): A list of struct column. 
Has an `ancestry` column and a `sampleSize` columns\n\n        Returns:\n            Column: A list of struct with mapped LD population and their relative sample size.\n        \"\"\"\n        # Map ancestry categories to population labels of the LD index:\n        mapped_ancestries = f.transform(\n            discovery_samples, cls._map_ancestries_to_ld_population\n        )\n\n        # Aggregate sample sizes belonging to the same LD population:\n        aggregated_counts = f.aggregate(\n            mapped_ancestries,\n            f.array_distinct(\n                f.transform(\n                    mapped_ancestries,\n                    lambda x: f.struct(\n                        x.ancestry.alias(\"ancestry\"), f.lit(0.0).alias(\"sampleSize\")\n                    ),\n                )\n            ),\n            cls._aggregate_samples_by_ancestry,\n        )\n        # Getting total sample count:\n        total_sample_count = f.aggregate(\n            aggregated_counts, f.lit(0.0), lambda total, pop: total + pop.sampleSize\n        ).alias(\"sampleSize\")\n\n        # Calculating relative sample size for each LD population:\n        return f.transform(\n            aggregated_counts,\n            lambda ld_population: f.struct(\n                ld_population.ancestry.alias(\"ldPopulation\"),\n                (ld_population.sampleSize / total_sample_count).alias(\n                    \"relativeSampleSize\"\n                ),\n            ),\n        )\n\n    def study_type_lut(self: StudyIndex) -> DataFrame:\n        \"\"\"Return a lookup table of study type.\n\n        Returns:\n            DataFrame: A dataframe containing `studyId` and `studyType` columns.\n        \"\"\"\n        return self.df.select(\"studyId\", \"studyType\")\n
"},{"location":"python_api/dataset/study_index/#otg.dataset.study_index.StudyIndex.aggregate_and_map_ancestries","title":"aggregate_and_map_ancestries(discovery_samples: Column) -> Column classmethod","text":"

Map ancestries to populations in the LD reference and calculate relative sample size.

Parameters:

Name Type Description Default discovery_samples Column

A list of struct column. Has an ancestry column and a sampleSize column

required

Returns:

Name Type Description Column Column

A list of struct with mapped LD population and their relative sample size.

Source code in src/otg/dataset/study_index.py
@classmethod\ndef aggregate_and_map_ancestries(\n    cls: type[StudyIndex], discovery_samples: Column\n) -> Column:\n    \"\"\"Map ancestries to populations in the LD reference and calculate relative sample size.\n\n    Args:\n        discovery_samples (Column): A list of struct column. Has an `ancestry` column and a `sampleSize` columns\n\n    Returns:\n        Column: A list of struct with mapped LD population and their relative sample size.\n    \"\"\"\n    # Map ancestry categories to population labels of the LD index:\n    mapped_ancestries = f.transform(\n        discovery_samples, cls._map_ancestries_to_ld_population\n    )\n\n    # Aggregate sample sizes belonging to the same LD population:\n    aggregated_counts = f.aggregate(\n        mapped_ancestries,\n        f.array_distinct(\n            f.transform(\n                mapped_ancestries,\n                lambda x: f.struct(\n                    x.ancestry.alias(\"ancestry\"), f.lit(0.0).alias(\"sampleSize\")\n                ),\n            )\n        ),\n        cls._aggregate_samples_by_ancestry,\n    )\n    # Getting total sample count:\n    total_sample_count = f.aggregate(\n        aggregated_counts, f.lit(0.0), lambda total, pop: total + pop.sampleSize\n    ).alias(\"sampleSize\")\n\n    # Calculating relative sample size for each LD population:\n    return f.transform(\n        aggregated_counts,\n        lambda ld_population: f.struct(\n            ld_population.ancestry.alias(\"ldPopulation\"),\n            (ld_population.sampleSize / total_sample_count).alias(\n                \"relativeSampleSize\"\n            ),\n        ),\n    )\n
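A usage sketch, assuming study_df is a dataframe with a discoverySamples column that follows the StudyIndex schema. The classmethod works at the Column level, so it can be applied with withColumn:

import pyspark.sql.functions as f

from otg.dataset.study_index import StudyIndex

with_ld = study_df.withColumn(
    "ldPopulationStructure",
    StudyIndex.aggregate_and_map_ancestries(f.col("discoverySamples")),
)
with_ld.select("studyId", "ldPopulationStructure").show(truncate=False)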
"},{"location":"python_api/dataset/study_index/#otg.dataset.study_index.StudyIndex.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provide the schema for the StudyIndex dataset.

Returns:

Name Type Description StructType StructType

The schema of the StudyIndex dataset.

Source code in src/otg/dataset/study_index.py
@classmethod\ndef get_schema(cls: type[StudyIndex]) -> StructType:\n    \"\"\"Provide the schema for the StudyIndex dataset.\n\n    Returns:\n        StructType: The schema of the StudyIndex dataset.\n    \"\"\"\n    return parse_spark_schema(\"study_index.json\")\n
"},{"location":"python_api/dataset/study_index/#otg.dataset.study_index.StudyIndex.study_type_lut","title":"study_type_lut() -> DataFrame","text":"

Return a lookup table of study type.

Returns:

Name Type Description DataFrame DataFrame

A dataframe containing studyId and studyType columns.

Source code in src/otg/dataset/study_index.py
def study_type_lut(self: StudyIndex) -> DataFrame:\n    \"\"\"Return a lookup table of study type.\n\n    Returns:\n        DataFrame: A dataframe containing `studyId` and `studyType` columns.\n    \"\"\"\n    return self.df.select(\"studyId\", \"studyType\")\n
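A usage sketch, assuming study_index and study_locus are loaded datasets and that the study-locus dataframe carries a studyId column:

annotated = study_locus.df.join(
    study_index.study_type_lut(),  # columns: studyId, studyType
    on="studyId",
    how="left",
)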
"},{"location":"python_api/dataset/study_index/#schema","title":"Schema","text":"
root\n |-- studyId: string (nullable = false)\n |-- projectId: string (nullable = false)\n |-- studyType: string (nullable = false)\n |-- traitFromSource: string (nullable = false)\n |-- traitFromSourceMappedIds: array (nullable = true)\n |    |-- element: string (containsNull = true)\n |-- geneId: string (nullable = true)\n |-- pubmedId: string (nullable = true)\n |-- publicationTitle: string (nullable = true)\n |-- publicationFirstAuthor: string (nullable = true)\n |-- publicationDate: string (nullable = true)\n |-- publicationJournal: string (nullable = true)\n |-- backgroundTraitFromSourceMappedIds: array (nullable = true)\n |    |-- element: string (containsNull = true)\n |-- initialSampleSize: string (nullable = true)\n |-- nCases: long (nullable = true)\n |-- nControls: long (nullable = true)\n |-- nSamples: long (nullable = true)\n |-- ldPopulationStructure: array (nullable = true)\n |    |-- element: struct (containsNull = false)\n |    |    |-- ldPopulation: string (nullable = true)\n |    |    |-- relativeSampleSize: double (nullable = true)\n |-- discoverySamples: array (nullable = true)\n |    |-- element: struct (containsNull = false)\n |    |    |-- sampleSize: long (nullable = true)\n |    |    |-- ancestry: string (nullable = true)\n |-- replicationSamples: array (nullable = true)\n |    |-- element: struct (containsNull = false)\n |    |    |-- sampleSize: long (nullable = true)\n |    |    |-- ancestry: string (nullable = true)\n |-- summarystatsLocation: string (nullable = true)\n |-- hasSumstats: boolean (nullable = true)\n
"},{"location":"python_api/dataset/study_locus/","title":"Study Locus","text":""},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocus","title":"otg.dataset.study_locus.StudyLocus dataclass","text":"

Bases: Dataset

Study-Locus dataset.

This dataset captures associations between studies/traits and genetic loci, as provided by fine-mapping methods.

Source code in src/otg/dataset/study_locus.py
@dataclass\nclass StudyLocus(Dataset):\n    \"\"\"Study-Locus dataset.\n\n    This dataset captures associations between study/traits and a genetic loci as provided by finemapping methods.\n    \"\"\"\n\n    @staticmethod\n    def _overlapping_peaks(credset_to_overlap: DataFrame) -> DataFrame:\n        \"\"\"Calculate overlapping signals (study-locus) between GWAS-GWAS and GWAS-Molecular trait.\n\n        Args:\n            credset_to_overlap (DataFrame): DataFrame containing at least `studyLocusId`, `studyType`, `chromosome` and `tagVariantId` columns.\n\n        Returns:\n            DataFrame: containing `leftStudyLocusId`, `rightStudyLocusId` and `chromosome` columns.\n        \"\"\"\n        # Reduce columns to the minimum to reduce the size of the dataframe\n        credset_to_overlap = credset_to_overlap.select(\n            \"studyLocusId\", \"studyType\", \"chromosome\", \"tagVariantId\"\n        )\n        return (\n            credset_to_overlap.alias(\"left\")\n            .filter(f.col(\"studyType\") == \"gwas\")\n            # Self join with complex condition. Left it's all gwas and right can be gwas or molecular trait\n            .join(\n                credset_to_overlap.alias(\"right\"),\n                on=[\n                    f.col(\"left.chromosome\") == f.col(\"right.chromosome\"),\n                    f.col(\"left.tagVariantId\") == f.col(\"right.tagVariantId\"),\n                    (f.col(\"right.studyType\") != \"gwas\")\n                    | (f.col(\"left.studyLocusId\") > f.col(\"right.studyLocusId\")),\n                ],\n                how=\"inner\",\n            )\n            .select(\n                f.col(\"left.studyLocusId\").alias(\"leftStudyLocusId\"),\n                f.col(\"right.studyLocusId\").alias(\"rightStudyLocusId\"),\n                f.col(\"left.chromosome\").alias(\"chromosome\"),\n            )\n            .distinct()\n            .repartition(\"chromosome\")\n            .persist()\n        )\n\n    @staticmethod\n    def _align_overlapping_tags(\n        loci_to_overlap: DataFrame, peak_overlaps: DataFrame\n    ) -> StudyLocusOverlap:\n        \"\"\"Align overlapping tags in pairs of overlapping study-locus, keeping all tags in both loci.\n\n        Args:\n            loci_to_overlap (DataFrame): containing `studyLocusId`, `studyType`, `chromosome`, `tagVariantId`, `logABF` and `posteriorProbability` columns.\n            peak_overlaps (DataFrame): containing `leftStudyLocusId`, `rightStudyLocusId` and `chromosome` columns.\n\n        Returns:\n            StudyLocusOverlap: Pairs of overlapping study-locus with aligned tags.\n        \"\"\"\n        # Complete information about all tags in the left study-locus of the overlap\n        stats_cols = [\n            \"logABF\",\n            \"posteriorProbability\",\n            \"beta\",\n            \"pValueMantissa\",\n            \"pValueExponent\",\n        ]\n        overlapping_left = loci_to_overlap.select(\n            f.col(\"chromosome\"),\n            f.col(\"tagVariantId\"),\n            f.col(\"studyLocusId\").alias(\"leftStudyLocusId\"),\n            *[f.col(col).alias(f\"left_{col}\") for col in stats_cols],\n        ).join(peak_overlaps, on=[\"chromosome\", \"leftStudyLocusId\"], how=\"inner\")\n\n        # Complete information about all tags in the right study-locus of the overlap\n        overlapping_right = loci_to_overlap.select(\n            f.col(\"chromosome\"),\n            f.col(\"tagVariantId\"),\n            
f.col(\"studyLocusId\").alias(\"rightStudyLocusId\"),\n            *[f.col(col).alias(f\"right_{col}\") for col in stats_cols],\n        ).join(peak_overlaps, on=[\"chromosome\", \"rightStudyLocusId\"], how=\"inner\")\n\n        # Include information about all tag variants in both study-locus aligned by tag variant id\n        overlaps = overlapping_left.join(\n            overlapping_right,\n            on=[\n                \"chromosome\",\n                \"rightStudyLocusId\",\n                \"leftStudyLocusId\",\n                \"tagVariantId\",\n            ],\n            how=\"outer\",\n        ).select(\n            \"leftStudyLocusId\",\n            \"rightStudyLocusId\",\n            \"chromosome\",\n            \"tagVariantId\",\n            f.struct(\n                *[f\"left_{e}\" for e in stats_cols] + [f\"right_{e}\" for e in stats_cols]\n            ).alias(\"statistics\"),\n        )\n        return StudyLocusOverlap(\n            _df=overlaps,\n            _schema=StudyLocusOverlap.get_schema(),\n        )\n\n    @staticmethod\n    def _update_quality_flag(\n        qc: Column, flag_condition: Column, flag_text: StudyLocusQualityCheck\n    ) -> Column:\n        \"\"\"Update the provided quality control list with a new flag if condition is met.\n\n        Args:\n            qc (Column): Array column with the current list of qc flags.\n            flag_condition (Column): This is a column of booleans, signing which row should be flagged\n            flag_text (StudyLocusQualityCheck): Text for the new quality control flag\n\n        Returns:\n            Column: Array column with the updated list of qc flags.\n        \"\"\"\n        qc = f.when(qc.isNull(), f.array()).otherwise(qc)\n        return f.when(\n            flag_condition,\n            f.array_union(qc, f.array(f.lit(flag_text.value))),\n        ).otherwise(qc)\n\n    @staticmethod\n    def assign_study_locus_id(study_id_col: Column, variant_id_col: Column) -> Column:\n        \"\"\"Hashes a column with a variant ID and a study ID to extract a consistent studyLocusId.\n\n        Args:\n            study_id_col (Column): column name with a study ID\n            variant_id_col (Column): column name with a variant ID\n\n        Returns:\n            Column: column with a study locus ID\n\n        Examples:\n            >>> df = spark.createDataFrame([(\"GCST000001\", \"1_1000_A_C\"), (\"GCST000002\", \"1_1000_A_C\")]).toDF(\"studyId\", \"variantId\")\n            >>> df.withColumn(\"study_locus_id\", StudyLocus.assign_study_locus_id(*[f.col(\"variantId\"), f.col(\"studyId\")])).show()\n            +----------+----------+--------------------+\n            |   studyId| variantId|      study_locus_id|\n            +----------+----------+--------------------+\n            |GCST000001|1_1000_A_C| 7437284926964690765|\n            |GCST000002|1_1000_A_C|-7653912547667845377|\n            +----------+----------+--------------------+\n            <BLANKLINE>\n        \"\"\"\n        return f.xxhash64(*[study_id_col, variant_id_col]).alias(\"studyLocusId\")\n\n    @classmethod\n    def get_schema(cls: type[StudyLocus]) -> StructType:\n        \"\"\"Provides the schema for the StudyLocus dataset.\n\n        Returns:\n            StructType: schema for the StudyLocus dataset.\n        \"\"\"\n        return parse_spark_schema(\"study_locus.json\")\n\n    def filter_credible_set(\n        self: StudyLocus,\n        credible_interval: CredibleInterval,\n    ) -> StudyLocus:\n        \"\"\"Filter study-locus tag variants 
based on given credible interval.\n\n        Args:\n            credible_interval (CredibleInterval): Credible interval to filter for.\n\n        Returns:\n            StudyLocus: Filtered study-locus dataset.\n        \"\"\"\n        self.df = self._df.withColumn(\n            \"locus\",\n            f.filter(\n                f.col(\"locus\"),\n                lambda tag: (tag[credible_interval.value]),\n            ),\n        )\n        return self\n\n    def find_overlaps(self: StudyLocus, study_index: StudyIndex) -> StudyLocusOverlap:\n        \"\"\"Calculate overlapping study-locus.\n\n        Find overlapping study-locus that share at least one tagging variant. All GWAS-GWAS and all GWAS-Molecular traits are computed with the Molecular traits always\n        appearing on the right side.\n\n        Args:\n            study_index (StudyIndex): Study index to resolve study types.\n\n        Returns:\n            StudyLocusOverlap: Pairs of overlapping study-locus with aligned tags.\n        \"\"\"\n        loci_to_overlap = (\n            self.df.join(study_index.study_type_lut(), on=\"studyId\", how=\"inner\")\n            .withColumn(\"locus\", f.explode(\"locus\"))\n            .select(\n                \"studyLocusId\",\n                \"studyType\",\n                \"chromosome\",\n                f.col(\"locus.variantId\").alias(\"tagVariantId\"),\n                f.col(\"locus.logABF\").alias(\"logABF\"),\n                f.col(\"locus.posteriorProbability\").alias(\"posteriorProbability\"),\n                f.col(\"locus.pValueMantissa\").alias(\"pValueMantissa\"),\n                f.col(\"locus.pValueExponent\").alias(\"pValueExponent\"),\n                f.col(\"locus.beta\").alias(\"beta\"),\n            )\n            .persist()\n        )\n\n        # overlapping study-locus\n        peak_overlaps = self._overlapping_peaks(loci_to_overlap)\n\n        # study-locus overlap by aligning overlapping variants\n        return self._align_overlapping_tags(loci_to_overlap, peak_overlaps)\n\n    def unique_variants_in_locus(self: StudyLocus) -> DataFrame:\n        \"\"\"All unique variants collected in a `StudyLocus` dataframe.\n\n        Returns:\n            DataFrame: A dataframe containing `variantId` and `chromosome` columns.\n        \"\"\"\n        return (\n            self.df.withColumn(\n                \"variantId\",\n                # Joint array of variants in that studylocus. 
Locus can be null\n                f.explode(\n                    f.array_union(\n                        f.array(f.col(\"variantId\")),\n                        f.coalesce(f.col(\"locus.variantId\"), f.array()),\n                    )\n                ),\n            )\n            .select(\n                \"variantId\", f.split(f.col(\"variantId\"), \"_\")[0].alias(\"chromosome\")\n            )\n            .distinct()\n        )\n\n    def neglog_pvalue(self: StudyLocus) -> Column:\n        \"\"\"Returns the negative log p-value.\n\n        Returns:\n            Column: Negative log p-value\n        \"\"\"\n        return calculate_neglog_pvalue(\n            self.df.pValueMantissa,\n            self.df.pValueExponent,\n        )\n\n    def annotate_credible_sets(self: StudyLocus) -> StudyLocus:\n        \"\"\"Annotate study-locus dataset with credible set flags.\n\n        Sorts the array in the `locus` column elements by their `posteriorProbability` values in descending order and adds\n        `is95CredibleSet` and `is99CredibleSet` fields to the elements, indicating which are the tagging variants whose cumulative sum\n        of their `posteriorProbability` values is below 0.95 and 0.99, respectively.\n\n        Returns:\n            StudyLocus: including annotation on `is95CredibleSet` and `is99CredibleSet`.\n\n        Raises:\n            ValueError: If `locus` column is not available.\n        \"\"\"\n        if \"locus\" not in self.df.columns:\n            raise ValueError(\"Locus column not available.\")\n\n        self.df = self.df.withColumn(\n            # Sort credible set by posterior probability in descending order\n            \"locus\",\n            f.when(\n                f.col(\"locus\").isNotNull() & (f.size(f.col(\"locus\")) > 0),\n                order_array_of_structs_by_field(\"locus\", \"posteriorProbability\"),\n            ),\n        ).withColumn(\n            # Calculate array of cumulative sums of posterior probabilities to determine which variants are in the 95% and 99% credible sets\n            # and zip the cumulative sums array with the credible set array to add the flags\n            \"locus\",\n            f.when(\n                f.col(\"locus\").isNotNull() & (f.size(f.col(\"locus\")) > 0),\n                f.zip_with(\n                    f.col(\"locus\"),\n                    f.transform(\n                        f.sequence(f.lit(1), f.size(f.col(\"locus\"))),\n                        lambda index: f.aggregate(\n                            f.slice(\n                                # By using `index - 1` we introduce a value of `0.0` in the cumulative sums array. 
to ensure that the last variant\n                                # that exceeds the 0.95 threshold is included in the cumulative sum, as its probability is necessary to satisfy the threshold.\n                                f.col(\"locus.posteriorProbability\"),\n                                1,\n                                index - 1,\n                            ),\n                            f.lit(0.0),\n                            lambda acc, el: acc + el,\n                        ),\n                    ),\n                    lambda struct_e, acc: struct_e.withField(\n                        CredibleInterval.IS95.value, (acc < 0.95) & acc.isNotNull()\n                    ).withField(\n                        CredibleInterval.IS99.value, (acc < 0.99) & acc.isNotNull()\n                    ),\n                ),\n            ),\n        )\n        return self\n\n    def clump(self: StudyLocus) -> StudyLocus:\n        \"\"\"Perform LD clumping of the studyLocus.\n\n        Evaluates whether a lead variant is linked to a tag (with lowest p-value) in the same studyLocus dataset.\n\n        Returns:\n            StudyLocus: with empty credible sets for linked variants and QC flag.\n        \"\"\"\n        self.df = (\n            self.df.withColumn(\n                \"is_lead_linked\",\n                LDclumping._is_lead_linked(\n                    self.df.studyId,\n                    self.df.variantId,\n                    self.df.pValueExponent,\n                    self.df.pValueMantissa,\n                    self.df.ldSet,\n                ),\n            )\n            .withColumn(\n                \"ldSet\",\n                f.when(f.col(\"is_lead_linked\"), f.array()).otherwise(f.col(\"ldSet\")),\n            )\n            .withColumn(\n                \"qualityControls\",\n                StudyLocus._update_quality_flag(\n                    f.col(\"qualityControls\"),\n                    f.col(\"is_lead_linked\"),\n                    StudyLocusQualityCheck.LD_CLUMPED,\n                ),\n            )\n            .drop(\"is_lead_linked\")\n        )\n        return self\n\n    def _qc_unresolved_ld(\n        self: StudyLocus,\n    ) -> StudyLocus:\n        \"\"\"Flag associations with variants that are not found in the LD reference.\n\n        Returns:\n            StudyLocus: Updated study locus.\n        \"\"\"\n        self.df = self.df.withColumn(\n            \"qualityControls\",\n            self._update_quality_flag(\n                f.col(\"qualityControls\"),\n                f.col(\"ldSet\").isNull(),\n                StudyLocusQualityCheck.UNRESOLVED_LD,\n            ),\n        )\n        return self\n\n    def _qc_no_population(self: StudyLocus) -> StudyLocus:\n        \"\"\"Flag associations where the study doesn't have population information to resolve LD.\n\n        Returns:\n            StudyLocus: Updated study locus.\n        \"\"\"\n        # If the tested column is not present, return self unchanged:\n        if \"ldPopulationStructure\" not in self.df.columns:\n            return self\n\n        self.df = self.df.withColumn(\n            \"qualityControls\",\n            self._update_quality_flag(\n                f.col(\"qualityControls\"),\n                f.col(\"ldPopulationStructure\").isNull(),\n                StudyLocusQualityCheck.NO_POPULATION,\n            ),\n        )\n        return self\n
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocus.annotate_credible_sets","title":"annotate_credible_sets() -> StudyLocus","text":"

Annotate study-locus dataset with credible set flags.

Sorts the elements of the locus array by their posteriorProbability values in descending order and adds is95CredibleSet and is99CredibleSet fields to the elements, indicating which tagging variants have a cumulative posteriorProbability below 0.95 and 0.99, respectively.

Returns:

StudyLocus: including annotation on is95CredibleSet and is99CredibleSet.

Raises:

ValueError: If locus column is not available.

Source code in src/otg/dataset/study_locus.py
def annotate_credible_sets(self: StudyLocus) -> StudyLocus:\n    \"\"\"Annotate study-locus dataset with credible set flags.\n\n    Sorts the array in the `locus` column elements by their `posteriorProbability` values in descending order and adds\n    `is95CredibleSet` and `is99CredibleSet` fields to the elements, indicating which are the tagging variants whose cumulative sum\n    of their `posteriorProbability` values is below 0.95 and 0.99, respectively.\n\n    Returns:\n        StudyLocus: including annotation on `is95CredibleSet` and `is99CredibleSet`.\n\n    Raises:\n        ValueError: If `locus` column is not available.\n    \"\"\"\n    if \"locus\" not in self.df.columns:\n        raise ValueError(\"Locus column not available.\")\n\n    self.df = self.df.withColumn(\n        # Sort credible set by posterior probability in descending order\n        \"locus\",\n        f.when(\n            f.col(\"locus\").isNotNull() & (f.size(f.col(\"locus\")) > 0),\n            order_array_of_structs_by_field(\"locus\", \"posteriorProbability\"),\n        ),\n    ).withColumn(\n        # Calculate array of cumulative sums of posterior probabilities to determine which variants are in the 95% and 99% credible sets\n        # and zip the cumulative sums array with the credible set array to add the flags\n        \"locus\",\n        f.when(\n            f.col(\"locus\").isNotNull() & (f.size(f.col(\"locus\")) > 0),\n            f.zip_with(\n                f.col(\"locus\"),\n                f.transform(\n                    f.sequence(f.lit(1), f.size(f.col(\"locus\"))),\n                    lambda index: f.aggregate(\n                        f.slice(\n                            # By using `index - 1` we introduce a value of `0.0` in the cumulative sums array. to ensure that the last variant\n                            # that exceeds the 0.95 threshold is included in the cumulative sum, as its probability is necessary to satisfy the threshold.\n                            f.col(\"locus.posteriorProbability\"),\n                            1,\n                            index - 1,\n                        ),\n                        f.lit(0.0),\n                        lambda acc, el: acc + el,\n                    ),\n                ),\n                lambda struct_e, acc: struct_e.withField(\n                    CredibleInterval.IS95.value, (acc < 0.95) & acc.isNotNull()\n                ).withField(\n                    CredibleInterval.IS99.value, (acc < 0.99) & acc.isNotNull()\n                ),\n            ),\n        ),\n    )\n    return self\n
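The shifted cumulative sum can be illustrated with a plain-Python sketch on made-up posterior probabilities (already sorted in descending order):

posteriors = [0.60, 0.25, 0.10, 0.04, 0.01]

flags = []
cumulative = 0.0
for probability in posteriors:
    # The running sum excludes the current element (the `index - 1` slice above),
    # so the variant that pushes the sum past a threshold is still included in that set.
    flags.append(
        {"is95CredibleSet": cumulative < 0.95, "is99CredibleSet": cumulative < 0.99}
    )
    cumulative += probability

# Cumulative sums seen by each element: 0.00, 0.60, 0.85, 0.95, 0.99
# -> the first three variants are in the 95% set, the first four in the 99% set.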
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocus.assign_study_locus_id","title":"assign_study_locus_id(study_id_col: Column, variant_id_col: Column) -> Column staticmethod","text":"

Hashes a column with a variant ID and a study ID to extract a consistent studyLocusId.

Parameters:

study_id_col (Column): column name with a study ID (required)

variant_id_col (Column): column name with a variant ID (required)

Returns:

Column: column with a study locus ID

Examples:

>>> df = spark.createDataFrame([(\"GCST000001\", \"1_1000_A_C\"), (\"GCST000002\", \"1_1000_A_C\")]).toDF(\"studyId\", \"variantId\")\n>>> df.withColumn(\"study_locus_id\", StudyLocus.assign_study_locus_id(*[f.col(\"variantId\"), f.col(\"studyId\")])).show()\n+----------+----------+--------------------+\n|   studyId| variantId|      study_locus_id|\n+----------+----------+--------------------+\n|GCST000001|1_1000_A_C| 7437284926964690765|\n|GCST000002|1_1000_A_C|-7653912547667845377|\n+----------+----------+--------------------+\n
Source code in src/otg/dataset/study_locus.py
@staticmethod\ndef assign_study_locus_id(study_id_col: Column, variant_id_col: Column) -> Column:\n    \"\"\"Hashes a column with a variant ID and a study ID to extract a consistent studyLocusId.\n\n    Args:\n        study_id_col (Column): column name with a study ID\n        variant_id_col (Column): column name with a variant ID\n\n    Returns:\n        Column: column with a study locus ID\n\n    Examples:\n        >>> df = spark.createDataFrame([(\"GCST000001\", \"1_1000_A_C\"), (\"GCST000002\", \"1_1000_A_C\")]).toDF(\"studyId\", \"variantId\")\n        >>> df.withColumn(\"study_locus_id\", StudyLocus.assign_study_locus_id(*[f.col(\"variantId\"), f.col(\"studyId\")])).show()\n        +----------+----------+--------------------+\n        |   studyId| variantId|      study_locus_id|\n        +----------+----------+--------------------+\n        |GCST000001|1_1000_A_C| 7437284926964690765|\n        |GCST000002|1_1000_A_C|-7653912547667845377|\n        +----------+----------+--------------------+\n        <BLANKLINE>\n    \"\"\"\n    return f.xxhash64(*[study_id_col, variant_id_col]).alias(\"studyLocusId\")\n
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocus.clump","title":"clump() -> StudyLocus","text":"

Perform LD clumping of the studyLocus.

Evaluates whether a lead variant is linked to a tag (with lowest p-value) in the same studyLocus dataset.

Returns:

StudyLocus: with empty credible sets for linked variants and QC flag.

Source code in src/otg/dataset/study_locus.py
def clump(self: StudyLocus) -> StudyLocus:\n    \"\"\"Perform LD clumping of the studyLocus.\n\n    Evaluates whether a lead variant is linked to a tag (with lowest p-value) in the same studyLocus dataset.\n\n    Returns:\n        StudyLocus: with empty credible sets for linked variants and QC flag.\n    \"\"\"\n    self.df = (\n        self.df.withColumn(\n            \"is_lead_linked\",\n            LDclumping._is_lead_linked(\n                self.df.studyId,\n                self.df.variantId,\n                self.df.pValueExponent,\n                self.df.pValueMantissa,\n                self.df.ldSet,\n            ),\n        )\n        .withColumn(\n            \"ldSet\",\n            f.when(f.col(\"is_lead_linked\"), f.array()).otherwise(f.col(\"ldSet\")),\n        )\n        .withColumn(\n            \"qualityControls\",\n            StudyLocus._update_quality_flag(\n                f.col(\"qualityControls\"),\n                f.col(\"is_lead_linked\"),\n                StudyLocusQualityCheck.LD_CLUMPED,\n            ),\n        )\n        .drop(\"is_lead_linked\")\n    )\n    return self\n
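A minimal usage sketch, assuming `study_locus` already carries an `ldSet` annotation; the import path is inferred from src/otg/dataset/study_locus.py:

from pyspark.sql import functions as f
from otg.dataset.study_locus import StudyLocusQualityCheck  # module path assumed

clumped = study_locus.clump()
# Linked leads keep their row but receive an empty ldSet and an LD_CLUMPED quality flag:
flagged = clumped.df.filter(
    f.array_contains("qualityControls", StudyLocusQualityCheck.LD_CLUMPED.value)
)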
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocus.filter_credible_set","title":"filter_credible_set(credible_interval: CredibleInterval) -> StudyLocus","text":"

Filter study-locus tag variants based on given credible interval.

Parameters:

credible_interval (CredibleInterval): Credible interval to filter for. (required)

Returns:

StudyLocus: Filtered study-locus dataset.

Source code in src/otg/dataset/study_locus.py
def filter_credible_set(\n    self: StudyLocus,\n    credible_interval: CredibleInterval,\n) -> StudyLocus:\n    \"\"\"Filter study-locus tag variants based on given credible interval.\n\n    Args:\n        credible_interval (CredibleInterval): Credible interval to filter for.\n\n    Returns:\n        StudyLocus: Filtered study-locus dataset.\n    \"\"\"\n    self.df = self._df.withColumn(\n        \"locus\",\n        f.filter(\n            f.col(\"locus\"),\n            lambda tag: (tag[credible_interval.value]),\n        ),\n    )\n    return self\n
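A minimal usage sketch; the credible-set flags must already exist, i.e. annotate_credible_sets should be run first (the `study_locus` object and the module path are assumed):

from otg.dataset.study_locus import CredibleInterval  # module path inferred from src/otg/dataset/study_locus.py

# Keep only the tag variants flagged as part of the 95% credible set:
filtered = study_locus.annotate_credible_sets().filter_credible_set(CredibleInterval.IS95)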
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocus.find_overlaps","title":"find_overlaps(study_index: StudyIndex) -> StudyLocusOverlap","text":"

Calculate overlapping study-locus.

Find overlapping study-locus that share at least one tagging variant. All GWAS-GWAS and GWAS-molecular trait overlaps are computed, with molecular traits always appearing on the right side.

Parameters:

study_index (StudyIndex): Study index to resolve study types. (required)

Returns:

StudyLocusOverlap: Pairs of overlapping study-locus with aligned tags.

Source code in src/otg/dataset/study_locus.py
def find_overlaps(self: StudyLocus, study_index: StudyIndex) -> StudyLocusOverlap:\n    \"\"\"Calculate overlapping study-locus.\n\n    Find overlapping study-locus that share at least one tagging variant. All GWAS-GWAS and all GWAS-Molecular traits are computed with the Molecular traits always\n    appearing on the right side.\n\n    Args:\n        study_index (StudyIndex): Study index to resolve study types.\n\n    Returns:\n        StudyLocusOverlap: Pairs of overlapping study-locus with aligned tags.\n    \"\"\"\n    loci_to_overlap = (\n        self.df.join(study_index.study_type_lut(), on=\"studyId\", how=\"inner\")\n        .withColumn(\"locus\", f.explode(\"locus\"))\n        .select(\n            \"studyLocusId\",\n            \"studyType\",\n            \"chromosome\",\n            f.col(\"locus.variantId\").alias(\"tagVariantId\"),\n            f.col(\"locus.logABF\").alias(\"logABF\"),\n            f.col(\"locus.posteriorProbability\").alias(\"posteriorProbability\"),\n            f.col(\"locus.pValueMantissa\").alias(\"pValueMantissa\"),\n            f.col(\"locus.pValueExponent\").alias(\"pValueExponent\"),\n            f.col(\"locus.beta\").alias(\"beta\"),\n        )\n        .persist()\n    )\n\n    # overlapping study-locus\n    peak_overlaps = self._overlapping_peaks(loci_to_overlap)\n\n    # study-locus overlap by aligning overlapping variants\n    return self._align_overlapping_tags(loci_to_overlap, peak_overlaps)\n
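A minimal usage sketch with assumed `study_locus` and `study_index` objects:

overlaps = study_locus.find_overlaps(study_index)
overlaps.df.select("leftStudyLocusId", "rightStudyLocusId", "chromosome", "tagVariantId").show(5)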
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocus.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the StudyLocus dataset.

Returns:

StructType: schema for the StudyLocus dataset.

Source code in src/otg/dataset/study_locus.py
@classmethod\ndef get_schema(cls: type[StudyLocus]) -> StructType:\n    \"\"\"Provides the schema for the StudyLocus dataset.\n\n    Returns:\n        StructType: schema for the StudyLocus dataset.\n    \"\"\"\n    return parse_spark_schema(\"study_locus.json\")\n
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocus.neglog_pvalue","title":"neglog_pvalue() -> Column","text":"

Returns the negative log p-value.

Returns:

Column: Negative log p-value

Source code in src/otg/dataset/study_locus.py
def neglog_pvalue(self: StudyLocus) -> Column:\n    \"\"\"Returns the negative log p-value.\n\n    Returns:\n        Column: Negative log p-value\n    \"\"\"\n    return calculate_neglog_pvalue(\n        self.df.pValueMantissa,\n        self.df.pValueExponent,\n    )\n
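Assuming calculate_neglog_pvalue applies the usual decomposition -log10(m × 10^e) = -(log10(m) + e), the value for a hypothetical p-value of 3.2e-8 can be checked in plain Python:

import math

mantissa, exponent = 3.2, -8  # p = 3.2e-8
neglog_p = -(math.log10(mantissa) + exponent)
print(round(neglog_p, 3))  # 7.495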
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocus.unique_variants_in_locus","title":"unique_variants_in_locus() -> DataFrame","text":"

All unique variants collected in a StudyLocus dataframe.

Returns:

DataFrame: A dataframe containing variantId and chromosome columns.

Source code in src/otg/dataset/study_locus.py
def unique_variants_in_locus(self: StudyLocus) -> DataFrame:\n    \"\"\"All unique variants collected in a `StudyLocus` dataframe.\n\n    Returns:\n        DataFrame: A dataframe containing `variantId` and `chromosome` columns.\n    \"\"\"\n    return (\n        self.df.withColumn(\n            \"variantId\",\n            # Joint array of variants in that studylocus. Locus can be null\n            f.explode(\n                f.array_union(\n                    f.array(f.col(\"variantId\")),\n                    f.coalesce(f.col(\"locus.variantId\"), f.array()),\n                )\n            ),\n        )\n        .select(\n            \"variantId\", f.split(f.col(\"variantId\"), \"_\")[0].alias(\"chromosome\")\n        )\n        .distinct()\n    )\n
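The same union and chromosome extraction in plain Python, on made-up identifiers (the real method works on the exploded Spark arrays):

lead_variant = "1_1000_A_C"
locus_variants = ["1_1000_A_C", "1_1200_G_T", "1_1500_C_T"]  # may be empty or null in practice

unique_variants = sorted(set([lead_variant] + locus_variants))
rows = [(variant, variant.split("_")[0]) for variant in unique_variants]  # (variantId, chromosome)
print(rows)
# [('1_1000_A_C', '1'), ('1_1200_G_T', '1'), ('1_1500_C_T', '1')]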
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.StudyLocusQualityCheck","title":"otg.dataset.study_locus.StudyLocusQualityCheck","text":"

Bases: Enum

Study-Locus quality control options listing concerns on the quality of the association.

Attributes:

SUBSIGNIFICANT_FLAG (str): p-value below significance threshold

NO_GENOMIC_LOCATION_FLAG (str): Incomplete genomic mapping

COMPOSITE_FLAG (str): Composite association due to variant x variant interactions

INCONSISTENCY_FLAG (str): Inconsistencies in the reported variants

NON_MAPPED_VARIANT_FLAG (str): Variant not mapped to GnomAd

PALINDROMIC_ALLELE_FLAG (str): Alleles are palindromic - cannot harmonize

AMBIGUOUS_STUDY (str): Association with ambiguous study

UNRESOLVED_LD (str): Variant not found in LD reference

LD_CLUMPED (str): Explained by a more significant variant in high LD (clumped)

NO_POPULATION (str): Study does not have population annotation to resolve LD

Source code in src/otg/dataset/study_locus.py
class StudyLocusQualityCheck(Enum):\n    \"\"\"Study-Locus quality control options listing concerns on the quality of the association.\n\n    Attributes:\n        SUBSIGNIFICANT_FLAG (str): p-value below significance threshold\n        NO_GENOMIC_LOCATION_FLAG (str): Incomplete genomic mapping\n        COMPOSITE_FLAG (str): Composite association due to variant x variant interactions\n        INCONSISTENCY_FLAG (str): Inconsistencies in the reported variants\n        NON_MAPPED_VARIANT_FLAG (str): Variant not mapped to GnomAd\n        PALINDROMIC_ALLELE_FLAG (str): Alleles are palindromic - cannot harmonize\n        AMBIGUOUS_STUDY (str): Association with ambiguous study\n        UNRESOLVED_LD (str): Variant not found in LD reference\n        LD_CLUMPED (str): Explained by a more significant variant in high LD (clumped)\n        NO_POPULATION (str): Study does not have population annotation to resolve LD\n    \"\"\"\n\n    SUBSIGNIFICANT_FLAG = \"Subsignificant p-value\"\n    NO_GENOMIC_LOCATION_FLAG = \"Incomplete genomic mapping\"\n    COMPOSITE_FLAG = \"Composite association\"\n    INCONSISTENCY_FLAG = \"Variant inconsistency\"\n    NON_MAPPED_VARIANT_FLAG = \"No mapping in GnomAd\"\n    PALINDROMIC_ALLELE_FLAG = \"Palindrome alleles - cannot harmonize\"\n    AMBIGUOUS_STUDY = \"Association with ambiguous study\"\n    UNRESOLVED_LD = \"Variant not found in LD reference\"\n    LD_CLUMPED = \"Explained by a more significant variant in high LD (clumped)\"\n    NO_POPULATION = \"Study does not have population annotation to resolve LD\"\n
"},{"location":"python_api/dataset/study_locus/#otg.dataset.study_locus.CredibleInterval","title":"otg.dataset.study_locus.CredibleInterval","text":"

Bases: Enum

Credible interval enum.

Interval within which an unobserved parameter value falls with a particular probability.

Attributes:

IS95 (str): 95% credible interval

IS99 (str): 99% credible interval

Source code in src/otg/dataset/study_locus.py
class CredibleInterval(Enum):\n    \"\"\"Credible interval enum.\n\n    Interval within which an unobserved parameter value falls with a particular probability.\n\n    Attributes:\n        IS95 (str): 95% credible interval\n        IS99 (str): 99% credible interval\n    \"\"\"\n\n    IS95 = \"is95CredibleSet\"\n    IS99 = \"is99CredibleSet\"\n
"},{"location":"python_api/dataset/study_locus/#schema","title":"Schema","text":"
root\n |-- studyLocusId: long (nullable = false)\n |-- variantId: string (nullable = false)\n |-- chromosome: string (nullable = true)\n |-- position: integer (nullable = true)\n |-- studyId: string (nullable = false)\n |-- beta: double (nullable = true)\n |-- oddsRatio: double (nullable = true)\n |-- oddsRatioConfidenceIntervalLower: double (nullable = true)\n |-- oddsRatioConfidenceIntervalUpper: double (nullable = true)\n |-- betaConfidenceIntervalLower: double (nullable = true)\n |-- betaConfidenceIntervalUpper: double (nullable = true)\n |-- pValueMantissa: float (nullable = true)\n |-- pValueExponent: integer (nullable = true)\n |-- effectAlleleFrequencyFromSource: float (nullable = true)\n |-- standardError: double (nullable = true)\n |-- subStudyDescription: string (nullable = true)\n |-- qualityControls: array (nullable = true)\n |    |-- element: string (containsNull = false)\n |-- finemappingMethod: string (nullable = true)\n |-- ldSet: array (nullable = true)\n |    |-- element: struct (containsNull = true)\n |    |    |-- tagVariantId: string (nullable = true)\n |    |    |-- r2Overall: double (nullable = true)\n |-- locus: array (nullable = true)\n |    |-- element: struct (containsNull = true)\n |    |    |-- is95CredibleSet: boolean (nullable = true)\n |    |    |-- is99CredibleSet: boolean (nullable = true)\n |    |    |-- logABF: double (nullable = true)\n |    |    |-- posteriorProbability: double (nullable = true)\n |    |    |-- variantId: string (nullable = true)\n |    |    |-- pValueMantissa: float (nullable = true)\n |    |    |-- pValueExponent: integer (nullable = true)\n |    |    |-- pValueMantissaConditioned: float (nullable = true)\n |    |    |-- pValueExponentConditioned: integer (nullable = true)\n |    |    |-- beta: double (nullable = true)\n |    |    |-- standardError: double (nullable = true)\n |    |    |-- betaConditioned: double (nullable = true)\n |    |    |-- standardErrorConditioned: double (nullable = true)\n |    |    |-- r2Overall: double (nullable = true)\n
"},{"location":"python_api/dataset/study_locus_overlap/","title":"Study Locus Overlap","text":""},{"location":"python_api/dataset/study_locus_overlap/#otg.dataset.study_locus_overlap.StudyLocusOverlap","title":"otg.dataset.study_locus_overlap.StudyLocusOverlap dataclass","text":"

Bases: Dataset

Study-Locus overlap.

This dataset captures pairs of overlapping StudyLocus: that is associations whose credible sets share at least one tagging variant.

Note

This is a helpful dataset for other downstream analyses, such as colocalisation. This dataset will contain the overlapping signals between studyLocus associations once they have been clumped and fine-mapped.

Source code in src/otg/dataset/study_locus_overlap.py
@dataclass\nclass StudyLocusOverlap(Dataset):\n    \"\"\"Study-Locus overlap.\n\n    This dataset captures pairs of overlapping `StudyLocus`: that is associations whose credible sets share at least one tagging variant.\n\n    !!! note\n        This is a helpful dataset for other downstream analyses, such as colocalisation. This dataset will contain the overlapping signals between studyLocus associations once they have been clumped and fine-mapped.\n    \"\"\"\n\n    @classmethod\n    def get_schema(cls: type[StudyLocusOverlap]) -> StructType:\n        \"\"\"Provides the schema for the StudyLocusOverlap dataset.\n\n        Returns:\n            StructType: Schema for the StudyLocusOverlap dataset\n        \"\"\"\n        return parse_spark_schema(\"study_locus_overlap.json\")\n\n    @classmethod\n    def from_associations(\n        cls: type[StudyLocusOverlap], study_locus: StudyLocus, study_index: StudyIndex\n    ) -> StudyLocusOverlap:\n        \"\"\"Find the overlapping signals in a particular set of associations (StudyLocus dataset).\n\n        Args:\n            study_locus (StudyLocus): Study-locus associations to find the overlapping signals\n            study_index (StudyIndex): Study index to find the overlapping signals\n\n        Returns:\n            StudyLocusOverlap: Study-locus overlap dataset\n        \"\"\"\n        return study_locus.find_overlaps(study_index)\n
"},{"location":"python_api/dataset/study_locus_overlap/#otg.dataset.study_locus_overlap.StudyLocusOverlap.from_associations","title":"from_associations(study_locus: StudyLocus, study_index: StudyIndex) -> StudyLocusOverlap classmethod","text":"

Find the overlapping signals in a particular set of associations (StudyLocus dataset).

Parameters:

study_locus (StudyLocus): Study-locus associations to find the overlapping signals (required)

study_index (StudyIndex): Study index to find the overlapping signals (required)

Returns:

StudyLocusOverlap: Study-locus overlap dataset

Source code in src/otg/dataset/study_locus_overlap.py
@classmethod\ndef from_associations(\n    cls: type[StudyLocusOverlap], study_locus: StudyLocus, study_index: StudyIndex\n) -> StudyLocusOverlap:\n    \"\"\"Find the overlapping signals in a particular set of associations (StudyLocus dataset).\n\n    Args:\n        study_locus (StudyLocus): Study-locus associations to find the overlapping signals\n        study_index (StudyIndex): Study index to find the overlapping signals\n\n    Returns:\n        StudyLocusOverlap: Study-locus overlap dataset\n    \"\"\"\n    return study_locus.find_overlaps(study_index)\n
"},{"location":"python_api/dataset/study_locus_overlap/#otg.dataset.study_locus_overlap.StudyLocusOverlap.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the StudyLocusOverlap dataset.

Returns:

StructType: Schema for the StudyLocusOverlap dataset

Source code in src/otg/dataset/study_locus_overlap.py
@classmethod\ndef get_schema(cls: type[StudyLocusOverlap]) -> StructType:\n    \"\"\"Provides the schema for the StudyLocusOverlap dataset.\n\n    Returns:\n        StructType: Schema for the StudyLocusOverlap dataset\n    \"\"\"\n    return parse_spark_schema(\"study_locus_overlap.json\")\n
"},{"location":"python_api/dataset/study_locus_overlap/#schema","title":"Schema","text":"
root\n |-- leftStudyLocusId: long (nullable = false)\n |-- rightStudyLocusId: long (nullable = false)\n |-- chromosome: string (nullable = false)\n |-- tagVariantId: string (nullable = false)\n |-- statistics: struct (nullable = false)\n |    |-- left_pValueMantissa: float (nullable = true)\n |    |-- left_pValueExponent: integer (nullable = true)\n |    |-- right_pValueMantissa: float (nullable = true)\n |    |-- right_pValueExponent: integer (nullable = true)\n |    |-- left_beta: double (nullable = true)\n |    |-- right_beta: double (nullable = true)\n |    |-- left_logABF: double (nullable = true)\n |    |-- right_logABF: double (nullable = true)\n |    |-- left_posteriorProbability: double (nullable = true)\n |    |-- right_posteriorProbability: double (nullable = true)\n
"},{"location":"python_api/dataset/summary_statistics/","title":"Summary Statistics","text":""},{"location":"python_api/dataset/summary_statistics/#otg.dataset.summary_statistics.SummaryStatistics","title":"otg.dataset.summary_statistics.SummaryStatistics dataclass","text":"

Bases: Dataset

Summary Statistics dataset.

A summary statistics dataset contains all single point statistics resulting from a GWAS.

Source code in src/otg/dataset/summary_statistics.py
@dataclass\nclass SummaryStatistics(Dataset):\n    \"\"\"Summary Statistics dataset.\n\n    A summary statistics dataset contains all single point statistics resulting from a GWAS.\n    \"\"\"\n\n    @classmethod\n    def get_schema(cls: type[SummaryStatistics]) -> StructType:\n        \"\"\"Provides the schema for the SummaryStatistics dataset.\n\n        Returns:\n            StructType: Schema for the SummaryStatistics dataset\n        \"\"\"\n        return parse_spark_schema(\"summary_statistics.json\")\n\n    def pvalue_filter(self: SummaryStatistics, pvalue: float) -> SummaryStatistics:\n        \"\"\"Filter summary statistics based on the provided p-value threshold.\n\n        Args:\n            pvalue (float): upper limit of the p-value to be filtered upon.\n\n        Returns:\n            SummaryStatistics: summary statistics object containing single point associations with p-values at least as significant as the provided threshold.\n        \"\"\"\n        # Converting p-value to mantissa and exponent:\n        (mantissa, exponent) = split_pvalue(pvalue)\n\n        # Applying filter:\n        df = self._df.filter(\n            (f.col(\"pValueExponent\") < exponent)\n            | (\n                (f.col(\"pValueExponent\") == exponent)\n                & (f.col(\"pValueMantissa\") <= mantissa)\n            )\n        )\n        return SummaryStatistics(_df=df, _schema=self._schema)\n\n    def window_based_clumping(\n        self: SummaryStatistics,\n        distance: int,\n        gwas_significance: float = 5e-8,\n        baseline_significance: float = 0.05,\n        locus_collect_distance: int | None = None,\n    ) -> StudyLocus:\n        \"\"\"Generate study-locus from summary statistics by distance based clumping + collect locus.\n\n        Args:\n            distance (int): Distance in base pairs to be used for clumping.\n            gwas_significance (float, optional): GWAS significance threshold. Defaults to 5e-8.\n            baseline_significance (float, optional): Baseline significance threshold for inclusion in the locus. Defaults to 0.05.\n            locus_collect_distance (int | None): The distance to collect locus around semi-indices. 
If not provided, defaults to `distance`.\n\n        Returns:\n            StudyLocus: Clumped study-locus containing variants based on window.\n        \"\"\"\n        # If locus collect distance is present, collect locus with the provided distance:\n        if locus_collect_distance:\n            clumped_df = WindowBasedClumping.clump_with_locus(\n                self,\n                window_length=distance,\n                p_value_significance=gwas_significance,\n                p_value_baseline=baseline_significance,\n                locus_window_length=locus_collect_distance,\n            )\n        else:\n            clumped_df = WindowBasedClumping.clump(\n                self, window_length=distance, p_value_significance=gwas_significance\n            )\n\n        return clumped_df\n\n    def exclude_region(self: SummaryStatistics, region: str) -> SummaryStatistics:\n        \"\"\"Exclude a region from the summary stats dataset.\n\n        Args:\n            region (str): region given in \"chr##:#####-####\" format\n\n        Returns:\n            SummaryStatistics: filtered summary statistics.\n        \"\"\"\n        (chromosome, start_position, end_position) = parse_region(region)\n\n        return SummaryStatistics(\n            _df=(\n                self.df.filter(\n                    ~(\n                        (f.col(\"chromosome\") == chromosome)\n                        & (\n                            (f.col(\"position\") >= start_position)\n                            & (f.col(\"position\") <= end_position)\n                        )\n                    )\n                )\n            ),\n            _schema=SummaryStatistics.get_schema(),\n        )\n
"},{"location":"python_api/dataset/summary_statistics/#otg.dataset.summary_statistics.SummaryStatistics.exclude_region","title":"exclude_region(region: str) -> SummaryStatistics","text":"

Exclude a region from the summary stats dataset.

Parameters:

region (str): region given in "chr##:#####-####" format (required)

Returns:

SummaryStatistics: filtered summary statistics.

Source code in src/otg/dataset/summary_statistics.py
def exclude_region(self: SummaryStatistics, region: str) -> SummaryStatistics:\n    \"\"\"Exclude a region from the summary stats dataset.\n\n    Args:\n        region (str): region given in \"chr##:#####-####\" format\n\n    Returns:\n        SummaryStatistics: filtered summary statistics.\n    \"\"\"\n    (chromosome, start_position, end_position) = parse_region(region)\n\n    return SummaryStatistics(\n        _df=(\n            self.df.filter(\n                ~(\n                    (f.col(\"chromosome\") == chromosome)\n                    & (\n                        (f.col(\"position\") >= start_position)\n                        & (f.col(\"position\") <= end_position)\n                    )\n                )\n            )\n        ),\n        _schema=SummaryStatistics.get_schema(),\n    )\n
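A minimal usage sketch; the `summary_stats` object and the coordinates are illustrative (the extended MHC region is a common exclusion):

# Drop all summary statistics falling inside the given interval:
filtered_sumstats = summary_stats.exclude_region("chr6:28510120-33480577")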
"},{"location":"python_api/dataset/summary_statistics/#otg.dataset.summary_statistics.SummaryStatistics.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the SummaryStatistics dataset.

Returns:

StructType: Schema for the SummaryStatistics dataset

Source code in src/otg/dataset/summary_statistics.py
@classmethod\ndef get_schema(cls: type[SummaryStatistics]) -> StructType:\n    \"\"\"Provides the schema for the SummaryStatistics dataset.\n\n    Returns:\n        StructType: Schema for the SummaryStatistics dataset\n    \"\"\"\n    return parse_spark_schema(\"summary_statistics.json\")\n
"},{"location":"python_api/dataset/summary_statistics/#otg.dataset.summary_statistics.SummaryStatistics.pvalue_filter","title":"pvalue_filter(pvalue: float) -> SummaryStatistics","text":"

Filter summary statistics based on the provided p-value threshold.

Parameters:

pvalue (float): upper limit of the p-value to be filtered upon. (required)

Returns:

SummaryStatistics: summary statistics object containing single point associations with p-values at least as significant as the provided threshold.

Source code in src/otg/dataset/summary_statistics.py
def pvalue_filter(self: SummaryStatistics, pvalue: float) -> SummaryStatistics:\n    \"\"\"Filter summary statistics based on the provided p-value threshold.\n\n    Args:\n        pvalue (float): upper limit of the p-value to be filtered upon.\n\n    Returns:\n        SummaryStatistics: summary statistics object containing single point associations with p-values at least as significant as the provided threshold.\n    \"\"\"\n    # Converting p-value to mantissa and exponent:\n    (mantissa, exponent) = split_pvalue(pvalue)\n\n    # Applying filter:\n    df = self._df.filter(\n        (f.col(\"pValueExponent\") < exponent)\n        | (\n            (f.col(\"pValueExponent\") == exponent)\n            & (f.col(\"pValueMantissa\") <= mantissa)\n        )\n    )\n    return SummaryStatistics(_df=df, _schema=self._schema)\n
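The filter compares the mantissa/exponent representation directly rather than reconstructing the float. A plain-Python sketch of the same comparison, assuming split_pvalue decomposes a p-value into (mantissa, exponent):

def passes(p_mantissa: float, p_exponent: int, mantissa: float, exponent: int) -> bool:
    """True if p_mantissa x 10^p_exponent is at most mantissa x 10^exponent."""
    return p_exponent < exponent or (p_exponent == exponent and p_mantissa <= mantissa)

# Threshold p <= 5e-8, i.e. mantissa 5.0 and exponent -8:
print(passes(3.0, -9, 5.0, -8))  # True: 3e-9 is more significant than the threshold
print(passes(6.0, -8, 5.0, -8))  # False: 6e-8 is less significant than the threshold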
"},{"location":"python_api/dataset/summary_statistics/#otg.dataset.summary_statistics.SummaryStatistics.window_based_clumping","title":"window_based_clumping(distance: int, gwas_significance: float = 5e-08, baseline_significance: float = 0.05, locus_collect_distance: int | None = None) -> StudyLocus","text":"

Generate study-locus from summary statistics by distance based clumping + collect locus.

Parameters:

distance (int): Distance in base pairs to be used for clumping. (required)

gwas_significance (float): GWAS significance threshold. Defaults to 5e-08.

baseline_significance (float): Baseline significance threshold for inclusion in the locus. Defaults to 0.05.

locus_collect_distance (int | None): The distance to collect locus around semi-indices. If not provided, defaults to distance.

Returns:

StudyLocus: Clumped study-locus containing variants based on window.

Source code in src/otg/dataset/summary_statistics.py
def window_based_clumping(\n    self: SummaryStatistics,\n    distance: int,\n    gwas_significance: float = 5e-8,\n    baseline_significance: float = 0.05,\n    locus_collect_distance: int | None = None,\n) -> StudyLocus:\n    \"\"\"Generate study-locus from summary statistics by distance based clumping + collect locus.\n\n    Args:\n        distance (int): Distance in base pairs to be used for clumping.\n        gwas_significance (float, optional): GWAS significance threshold. Defaults to 5e-8.\n        baseline_significance (float, optional): Baseline significance threshold for inclusion in the locus. Defaults to 0.05.\n        locus_collect_distance (int | None): The distance to collect locus around semi-indices. If not provided, defaults to `distance`.\n\n    Returns:\n        StudyLocus: Clumped study-locus containing variants based on window.\n    \"\"\"\n    # If locus collect distance is present, collect locus with the provided distance:\n    if locus_collect_distance:\n        clumped_df = WindowBasedClumping.clump_with_locus(\n            self,\n            window_length=distance,\n            p_value_significance=gwas_significance,\n            p_value_baseline=baseline_significance,\n            locus_window_length=locus_collect_distance,\n        )\n    else:\n        clumped_df = WindowBasedClumping.clump(\n            self, window_length=distance, p_value_significance=gwas_significance\n        )\n\n    return clumped_df\n
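A minimal usage sketch with an assumed `summary_stats` object; the window sizes are illustrative:

study_locus = summary_stats.window_based_clumping(
    distance=500_000,                # 500 kb clumping window
    gwas_significance=5e-8,
    locus_collect_distance=250_000,  # also collect a locus around each semi-index
)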
"},{"location":"python_api/dataset/summary_statistics/#schema","title":"Schema","text":"
root\n |-- studyId: string (nullable = false)\n |-- variantId: string (nullable = false)\n |-- chromosome: string (nullable = false)\n |-- position: integer (nullable = false)\n |-- beta: double (nullable = false)\n |-- betaConfidenceIntervalLower: double (nullable = true)\n |-- betaConfidenceIntervalUpper: double (nullable = true)\n |-- pValueMantissa: float (nullable = false)\n |-- pValueExponent: integer (nullable = false)\n |-- effectAlleleFrequencyFromSource: float (nullable = true)\n |-- standardError: double (nullable = true)\n
"},{"location":"python_api/dataset/variant_annotation/","title":"Variant annotation","text":""},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation","title":"otg.dataset.variant_annotation.VariantAnnotation dataclass","text":"

Bases: Dataset

Dataset with variant-level annotations.

Source code in src/otg/dataset/variant_annotation.py
@dataclass\nclass VariantAnnotation(Dataset):\n    \"\"\"Dataset with variant-level annotations.\"\"\"\n\n    @classmethod\n    def get_schema(cls: type[VariantAnnotation]) -> StructType:\n        \"\"\"Provides the schema for the VariantAnnotation dataset.\n\n        Returns:\n            StructType: Schema for the VariantAnnotation dataset\n        \"\"\"\n        return parse_spark_schema(\"variant_annotation.json\")\n\n    def max_maf(self: VariantAnnotation) -> Column:\n        \"\"\"Maximum minor allele frequency accross all populations.\n\n        Returns:\n            Column: Maximum minor allele frequency accross all populations.\n        \"\"\"\n        return f.array_max(\n            f.transform(\n                self.df.alleleFrequencies,\n                lambda af: f.when(\n                    af.alleleFrequency > 0.5, 1 - af.alleleFrequency\n                ).otherwise(af.alleleFrequency),\n            )\n        )\n\n    def filter_by_variant_df(\n        self: VariantAnnotation, df: DataFrame\n    ) -> VariantAnnotation:\n        \"\"\"Filter variant annotation dataset by a variant dataframe.\n\n        Args:\n            df (DataFrame): A dataframe of variants\n\n        Returns:\n            VariantAnnotation: A filtered variant annotation dataset\n        \"\"\"\n        self.df = self._df.join(\n            f.broadcast(df.select(\"variantId\", \"chromosome\")),\n            on=[\"variantId\", \"chromosome\"],\n            how=\"inner\",\n        )\n        return self\n\n    def get_transcript_consequence_df(\n        self: VariantAnnotation, gene_index: GeneIndex | None = None\n    ) -> DataFrame:\n        \"\"\"Dataframe of exploded transcript consequences.\n\n        Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n        Args:\n            gene_index (GeneIndex | None): A gene index. Defaults to None.\n\n        Returns:\n            DataFrame: A dataframe exploded by transcript consequences with the columns variantId, chromosome, transcriptConsequence\n        \"\"\"\n        # exploding the array removes records without VEP annotation\n        transript_consequences = self.df.withColumn(\n            \"transcriptConsequence\", f.explode(\"vep.transcriptConsequences\")\n        ).select(\n            \"variantId\",\n            \"chromosome\",\n            \"position\",\n            \"transcriptConsequence\",\n            f.col(\"transcriptConsequence.geneId\").alias(\"geneId\"),\n        )\n        if gene_index:\n            transript_consequences = transript_consequences.join(\n                f.broadcast(gene_index.df),\n                on=[\"chromosome\", \"geneId\"],\n            )\n        return transript_consequences.persist()\n\n    def get_most_severe_vep_v2g(\n        self: VariantAnnotation,\n        vep_consequences: DataFrame,\n        gene_index: GeneIndex,\n    ) -> V2G:\n        \"\"\"Creates a dataset with variant to gene assignments based on VEP's predicted consequence of the transcript.\n\n        Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n        Args:\n            vep_consequences (DataFrame): A dataframe of VEP consequences\n            gene_index (GeneIndex): A gene index to filter by. 
Defaults to None.\n\n        Returns:\n            V2G: High and medium severity variant to gene assignments\n        \"\"\"\n        return V2G(\n            _df=self.get_transcript_consequence_df(gene_index)\n            .select(\n                \"variantId\",\n                \"chromosome\",\n                f.col(\"transcriptConsequence.geneId\").alias(\"geneId\"),\n                f.explode(\"transcriptConsequence.consequenceTerms\").alias(\"label\"),\n                f.lit(\"vep\").alias(\"datatypeId\"),\n                f.lit(\"variantConsequence\").alias(\"datasourceId\"),\n            )\n            .join(\n                f.broadcast(vep_consequences),\n                on=\"label\",\n                how=\"inner\",\n            )\n            .drop(\"label\")\n            .filter(f.col(\"score\") != 0)\n            # A variant can have multiple predicted consequences on a transcript, the most severe one is selected\n            .transform(\n                lambda df: get_record_with_maximum_value(\n                    df, [\"variantId\", \"geneId\"], \"score\"\n                )\n            ),\n            _schema=V2G.get_schema(),\n        )\n\n    def get_polyphen_v2g(\n        self: VariantAnnotation, gene_index: GeneIndex | None = None\n    ) -> V2G:\n        \"\"\"Creates a dataset with variant to gene assignments with a PolyPhen's predicted score on the transcript.\n\n        Polyphen informs about the probability that a substitution is damaging.The score can be interpreted as follows:\n            - 0.0 to 0.15 -- Predicted to be benign.\n            - 0.15 to 1.0 -- Possibly damaging.\n            - 0.85 to 1.0 -- Predicted to be damaging.\n\n        Args:\n            gene_index (GeneIndex | None): A gene index to filter by. Defaults to None.\n\n        Returns:\n            V2G: variant to gene assignments with their polyphen scores\n        \"\"\"\n        return V2G(\n            _df=(\n                self.get_transcript_consequence_df(gene_index)\n                .filter(f.col(\"transcriptConsequence.polyphenScore\").isNotNull())\n                .select(\n                    \"variantId\",\n                    \"chromosome\",\n                    \"geneId\",\n                    f.col(\"transcriptConsequence.polyphenScore\").alias(\"score\"),\n                    f.lit(\"vep\").alias(\"datatypeId\"),\n                    f.lit(\"polyphen\").alias(\"datasourceId\"),\n                )\n            ),\n            _schema=V2G.get_schema(),\n        )\n\n    def get_sift_v2g(self: VariantAnnotation, gene_index: GeneIndex) -> V2G:\n        \"\"\"Creates a dataset with variant to gene assignments with a SIFT's predicted score on the transcript.\n\n        SIFT informs about the probability that a substitution is tolerated. 
The score can be interpreted as follows:\n            - 0.0 to 0.05 -- Likely to be deleterious.\n            - 0.05 to 1.0 -- Likely to be tolerated.\n\n        Args:\n            gene_index (GeneIndex): A gene index to filter by.\n\n        Returns:\n            V2G: variant to gene assignments with their SIFT scores\n        \"\"\"\n        return V2G(\n            _df=(\n                self.get_transcript_consequence_df(gene_index)\n                .filter(f.col(\"transcriptConsequence.siftScore\").isNotNull())\n                .select(\n                    \"variantId\",\n                    \"chromosome\",\n                    \"geneId\",\n                    f.expr(\"1 - transcriptConsequence.siftScore\").alias(\"score\"),\n                    f.lit(\"vep\").alias(\"datatypeId\"),\n                    f.lit(\"sift\").alias(\"datasourceId\"),\n                )\n            ),\n            _schema=V2G.get_schema(),\n        )\n\n    def get_plof_v2g(self: VariantAnnotation, gene_index: GeneIndex) -> V2G:\n        \"\"\"Creates a dataset with variant to gene assignments with a flag indicating if the variant is predicted to be a loss-of-function variant by the LOFTEE algorithm.\n\n        Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n        Args:\n            gene_index (GeneIndex): A gene index to filter by.\n\n        Returns:\n            V2G: variant to gene assignments from the LOFTEE algorithm\n        \"\"\"\n        return V2G(\n            _df=(\n                self.get_transcript_consequence_df(gene_index)\n                .filter(f.col(\"transcriptConsequence.lof\").isNotNull())\n                .withColumn(\n                    \"isHighQualityPlof\",\n                    f.when(f.col(\"transcriptConsequence.lof\") == \"HC\", True).when(\n                        f.col(\"transcriptConsequence.lof\") == \"LC\", False\n                    ),\n                )\n                .withColumn(\n                    \"score\",\n                    f.when(f.col(\"isHighQualityPlof\"), 1.0).when(\n                        ~f.col(\"isHighQualityPlof\"), 0\n                    ),\n                )\n                .select(\n                    \"variantId\",\n                    \"chromosome\",\n                    \"geneId\",\n                    \"isHighQualityPlof\",\n                    f.col(\"score\"),\n                    f.lit(\"vep\").alias(\"datatypeId\"),\n                    f.lit(\"loftee\").alias(\"datasourceId\"),\n                )\n            ),\n            _schema=V2G.get_schema(),\n        )\n\n    def get_distance_to_tss(\n        self: VariantAnnotation,\n        gene_index: GeneIndex,\n        max_distance: int = 500_000,\n    ) -> V2G:\n        \"\"\"Extracts variant to gene assignments for variants falling within a window of a gene's TSS.\n\n        Args:\n            gene_index (GeneIndex): A gene index to filter by.\n            max_distance (int): The maximum distance from the TSS to consider. 
Defaults to 500_000.\n\n        Returns:\n            V2G: variant to gene assignments with their distance to the TSS\n        \"\"\"\n        return V2G(\n            _df=(\n                self.df.alias(\"variant\")\n                .join(\n                    f.broadcast(gene_index.locations_lut()).alias(\"gene\"),\n                    on=[\n                        f.col(\"variant.chromosome\") == f.col(\"gene.chromosome\"),\n                        f.abs(f.col(\"variant.position\") - f.col(\"gene.tss\"))\n                        <= max_distance,\n                    ],\n                    how=\"inner\",\n                )\n                .withColumn(\n                    \"distance\", f.abs(f.col(\"variant.position\") - f.col(\"gene.tss\"))\n                )\n                .withColumn(\n                    \"inverse_distance\",\n                    max_distance - f.col(\"distance\"),\n                )\n                .transform(lambda df: normalise_column(df, \"inverse_distance\", \"score\"))\n                .select(\n                    \"variantId\",\n                    f.col(\"variant.chromosome\").alias(\"chromosome\"),\n                    \"distance\",\n                    \"geneId\",\n                    \"score\",\n                    f.lit(\"distance\").alias(\"datatypeId\"),\n                    f.lit(\"canonical_tss\").alias(\"datasourceId\"),\n                )\n            ),\n            _schema=V2G.get_schema(),\n        )\n
"},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation.filter_by_variant_df","title":"filter_by_variant_df(df: DataFrame) -> VariantAnnotation","text":"

Filter variant annotation dataset by a variant dataframe.

Parameters:

Name Type Description Default df DataFrame

A dataframe of variants

required

Returns:

Name Type Description VariantAnnotation VariantAnnotation

A filtered variant annotation dataset

Source code in src/otg/dataset/variant_annotation.py
def filter_by_variant_df(\n    self: VariantAnnotation, df: DataFrame\n) -> VariantAnnotation:\n    \"\"\"Filter variant annotation dataset by a variant dataframe.\n\n    Args:\n        df (DataFrame): A dataframe of variants\n\n    Returns:\n        VariantAnnotation: A filtered variant annotation dataset\n    \"\"\"\n    self.df = self._df.join(\n        f.broadcast(df.select(\"variantId\", \"chromosome\")),\n        on=[\"variantId\", \"chromosome\"],\n        how=\"inner\",\n    )\n    return self\n
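Illustrative usage (a minimal sketch; `variant_annotation` is assumed to be an existing VariantAnnotation instance and `credible_set_variants` a Spark DataFrame with `variantId` and `chromosome` columns; neither is constructed here):

    # Keep only the annotation rows for variants present in the provided dataframe.
    # The method updates the dataset in place and returns the same object.
    filtered_annotation = variant_annotation.filter_by_variant_df(
        credible_set_variants.select("variantId", "chromosome")
    )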
"},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation.get_distance_to_tss","title":"get_distance_to_tss(gene_index: GeneIndex, max_distance: int = 500000) -> V2G","text":"

Extracts variant to gene assignments for variants falling within a window of a gene's TSS.

Parameters:

Name Type Description Default gene_index GeneIndex

A gene index to filter by.

required max_distance int

The maximum distance from the TSS to consider. Defaults to 500_000.

500000

Returns:

Name Type Description V2G V2G

variant to gene assignments with their distance to the TSS

Source code in src/otg/dataset/variant_annotation.py
def get_distance_to_tss(\n    self: VariantAnnotation,\n    gene_index: GeneIndex,\n    max_distance: int = 500_000,\n) -> V2G:\n    \"\"\"Extracts variant to gene assignments for variants falling within a window of a gene's TSS.\n\n    Args:\n        gene_index (GeneIndex): A gene index to filter by.\n        max_distance (int): The maximum distance from the TSS to consider. Defaults to 500_000.\n\n    Returns:\n        V2G: variant to gene assignments with their distance to the TSS\n    \"\"\"\n    return V2G(\n        _df=(\n            self.df.alias(\"variant\")\n            .join(\n                f.broadcast(gene_index.locations_lut()).alias(\"gene\"),\n                on=[\n                    f.col(\"variant.chromosome\") == f.col(\"gene.chromosome\"),\n                    f.abs(f.col(\"variant.position\") - f.col(\"gene.tss\"))\n                    <= max_distance,\n                ],\n                how=\"inner\",\n            )\n            .withColumn(\n                \"distance\", f.abs(f.col(\"variant.position\") - f.col(\"gene.tss\"))\n            )\n            .withColumn(\n                \"inverse_distance\",\n                max_distance - f.col(\"distance\"),\n            )\n            .transform(lambda df: normalise_column(df, \"inverse_distance\", \"score\"))\n            .select(\n                \"variantId\",\n                f.col(\"variant.chromosome\").alias(\"chromosome\"),\n                \"distance\",\n                \"geneId\",\n                \"score\",\n                f.lit(\"distance\").alias(\"datatypeId\"),\n                f.lit(\"canonical_tss\").alias(\"datasourceId\"),\n            )\n        ),\n        _schema=V2G.get_schema(),\n    )\n
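Illustrative usage (a sketch only; `variant_annotation` and `gene_index` are assumed to be pre-built VariantAnnotation and GeneIndex instances):

    # Assign variants to genes within 250 kb of the canonical TSS instead of the 500 kb default.
    distance_v2g = variant_annotation.get_distance_to_tss(
        gene_index=gene_index,
        max_distance=250_000,
    )
    # The score is the normalised inverse distance, so closer variants score higher.
    distance_v2g.df.select("variantId", "geneId", "distance", "score").show(5)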
"},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation.get_most_severe_vep_v2g","title":"get_most_severe_vep_v2g(vep_consequences: DataFrame, gene_index: GeneIndex) -> V2G","text":"

Creates a dataset with variant to gene assignments based on VEP's predicted consequence of the transcript.

Optionally, the transcript consequences can be reduced to the universe of a gene index.

Parameters:

Name Type Description Default vep_consequences DataFrame

A dataframe of VEP consequences

required gene_index GeneIndex

A gene index to filter by.

required

Returns:

Name Type Description V2G V2G

High and medium severity variant to gene assignments

Source code in src/otg/dataset/variant_annotation.py
def get_most_severe_vep_v2g(\n    self: VariantAnnotation,\n    vep_consequences: DataFrame,\n    gene_index: GeneIndex,\n) -> V2G:\n    \"\"\"Creates a dataset with variant to gene assignments based on VEP's predicted consequence of the transcript.\n\n    Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n    Args:\n        vep_consequences (DataFrame): A dataframe of VEP consequences\n        gene_index (GeneIndex): A gene index to filter by. Defaults to None.\n\n    Returns:\n        V2G: High and medium severity variant to gene assignments\n    \"\"\"\n    return V2G(\n        _df=self.get_transcript_consequence_df(gene_index)\n        .select(\n            \"variantId\",\n            \"chromosome\",\n            f.col(\"transcriptConsequence.geneId\").alias(\"geneId\"),\n            f.explode(\"transcriptConsequence.consequenceTerms\").alias(\"label\"),\n            f.lit(\"vep\").alias(\"datatypeId\"),\n            f.lit(\"variantConsequence\").alias(\"datasourceId\"),\n        )\n        .join(\n            f.broadcast(vep_consequences),\n            on=\"label\",\n            how=\"inner\",\n        )\n        .drop(\"label\")\n        .filter(f.col(\"score\") != 0)\n        # A variant can have multiple predicted consequences on a transcript, the most severe one is selected\n        .transform(\n            lambda df: get_record_with_maximum_value(\n                df, [\"variantId\", \"geneId\"], \"score\"\n            )\n        ),\n        _schema=V2G.get_schema(),\n    )\n
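Illustrative usage (a sketch under assumptions: `variant_annotation` and `gene_index` already exist, and `vep_consequences` is a Spark DataFrame mapping each consequence term to a severity score, i.e. it carries at least the `label` and `score` columns used in the join above):

    # Variant-to-gene evidence from the most severe VEP consequence per variant/gene pair.
    vep_v2g = variant_annotation.get_most_severe_vep_v2g(
        vep_consequences=vep_consequences,
        gene_index=gene_index,
    )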
"},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation.get_plof_v2g","title":"get_plof_v2g(gene_index: GeneIndex) -> V2G","text":"

Creates a dataset with variant to gene assignments with a flag indicating whether the variant is predicted to be a loss-of-function variant by the LOFTEE algorithm.

Optionally, the transcript consequences can be reduced to the universe of a gene index.

Parameters:

Name Type Description Default gene_index GeneIndex

A gene index to filter by.

required

Returns:

Name Type Description V2G V2G

variant to gene assignments from the LOFTEE algorithm

Source code in src/otg/dataset/variant_annotation.py
def get_plof_v2g(self: VariantAnnotation, gene_index: GeneIndex) -> V2G:\n    \"\"\"Creates a dataset with variant to gene assignments with a flag indicating if the variant is predicted to be a loss-of-function variant by the LOFTEE algorithm.\n\n    Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n    Args:\n        gene_index (GeneIndex): A gene index to filter by.\n\n    Returns:\n        V2G: variant to gene assignments from the LOFTEE algorithm\n    \"\"\"\n    return V2G(\n        _df=(\n            self.get_transcript_consequence_df(gene_index)\n            .filter(f.col(\"transcriptConsequence.lof\").isNotNull())\n            .withColumn(\n                \"isHighQualityPlof\",\n                f.when(f.col(\"transcriptConsequence.lof\") == \"HC\", True).when(\n                    f.col(\"transcriptConsequence.lof\") == \"LC\", False\n                ),\n            )\n            .withColumn(\n                \"score\",\n                f.when(f.col(\"isHighQualityPlof\"), 1.0).when(\n                    ~f.col(\"isHighQualityPlof\"), 0\n                ),\n            )\n            .select(\n                \"variantId\",\n                \"chromosome\",\n                \"geneId\",\n                \"isHighQualityPlof\",\n                f.col(\"score\"),\n                f.lit(\"vep\").alias(\"datatypeId\"),\n                f.lit(\"loftee\").alias(\"datasourceId\"),\n            )\n        ),\n        _schema=V2G.get_schema(),\n    )\n
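Illustrative usage (a sketch only; `variant_annotation` and `gene_index` are assumed to exist):

    # LOFTEE-based evidence: high-confidence (HC) calls get isHighQualityPlof = True and score 1.0,
    # low-confidence (LC) calls get isHighQualityPlof = False and score 0.
    plof_v2g = variant_annotation.get_plof_v2g(gene_index)
    plof_v2g.df.select("variantId", "geneId", "isHighQualityPlof", "score").show(5)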
"},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation.get_polyphen_v2g","title":"get_polyphen_v2g(gene_index: GeneIndex | None = None) -> V2G","text":"

Creates a dataset with variant to gene assignments with PolyPhen's predicted score on the transcript.

PolyPhen informs about the probability that a substitution is damaging. The score can be interpreted as follows:

  • 0.0 to 0.15 -- Predicted to be benign.
  • 0.15 to 1.0 -- Possibly damaging.
  • 0.85 to 1.0 -- Predicted to be damaging.

Parameters:

Name Type Description Default gene_index GeneIndex | None

A gene index to filter by. Defaults to None.

None

Returns:

Name Type Description V2G V2G

variant to gene assignments with their polyphen scores

Source code in src/otg/dataset/variant_annotation.py
def get_polyphen_v2g(\n    self: VariantAnnotation, gene_index: GeneIndex | None = None\n) -> V2G:\n    \"\"\"Creates a dataset with variant to gene assignments with a PolyPhen's predicted score on the transcript.\n\n    Polyphen informs about the probability that a substitution is damaging.The score can be interpreted as follows:\n        - 0.0 to 0.15 -- Predicted to be benign.\n        - 0.15 to 1.0 -- Possibly damaging.\n        - 0.85 to 1.0 -- Predicted to be damaging.\n\n    Args:\n        gene_index (GeneIndex | None): A gene index to filter by. Defaults to None.\n\n    Returns:\n        V2G: variant to gene assignments with their polyphen scores\n    \"\"\"\n    return V2G(\n        _df=(\n            self.get_transcript_consequence_df(gene_index)\n            .filter(f.col(\"transcriptConsequence.polyphenScore\").isNotNull())\n            .select(\n                \"variantId\",\n                \"chromosome\",\n                \"geneId\",\n                f.col(\"transcriptConsequence.polyphenScore\").alias(\"score\"),\n                f.lit(\"vep\").alias(\"datatypeId\"),\n                f.lit(\"polyphen\").alias(\"datasourceId\"),\n            )\n        ),\n        _schema=V2G.get_schema(),\n    )\n
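Illustrative usage (a sketch only; `variant_annotation` is assumed to exist, and `gene_index`, if used, is a pre-built GeneIndex; the gene index is optional for this method):

    # Without a gene index the assignments cover all genes with a PolyPhen-annotated transcript consequence.
    polyphen_v2g = variant_annotation.get_polyphen_v2g()
    # Restricting to the universe of a gene index is also possible:
    polyphen_v2g_filtered = variant_annotation.get_polyphen_v2g(gene_index=gene_index)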
"},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the VariantAnnotation dataset.

Returns:

Name Type Description StructType StructType

Schema for the VariantAnnotation dataset

Source code in src/otg/dataset/variant_annotation.py
@classmethod\ndef get_schema(cls: type[VariantAnnotation]) -> StructType:\n    \"\"\"Provides the schema for the VariantAnnotation dataset.\n\n    Returns:\n        StructType: Schema for the VariantAnnotation dataset\n    \"\"\"\n    return parse_spark_schema(\"variant_annotation.json\")\n
"},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation.get_sift_v2g","title":"get_sift_v2g(gene_index: GeneIndex) -> V2G","text":"

Creates a dataset with variant to gene assignments with SIFT's predicted score on the transcript.

SIFT informs about the probability that a substitution is tolerated. The score can be interpreted as follows:

  • 0.0 to 0.05 -- Likely to be deleterious.
  • 0.05 to 1.0 -- Likely to be tolerated.

Parameters:

Name Type Description Default gene_index GeneIndex

A gene index to filter by.

required

Returns:

Name Type Description V2G V2G

variant to gene assignments with their SIFT scores

Source code in src/otg/dataset/variant_annotation.py
def get_sift_v2g(self: VariantAnnotation, gene_index: GeneIndex) -> V2G:\n    \"\"\"Creates a dataset with variant to gene assignments with a SIFT's predicted score on the transcript.\n\n    SIFT informs about the probability that a substitution is tolerated. The score can be interpreted as follows:\n        - 0.0 to 0.05 -- Likely to be deleterious.\n        - 0.05 to 1.0 -- Likely to be tolerated.\n\n    Args:\n        gene_index (GeneIndex): A gene index to filter by.\n\n    Returns:\n        V2G: variant to gene assignments with their SIFT scores\n    \"\"\"\n    return V2G(\n        _df=(\n            self.get_transcript_consequence_df(gene_index)\n            .filter(f.col(\"transcriptConsequence.siftScore\").isNotNull())\n            .select(\n                \"variantId\",\n                \"chromosome\",\n                \"geneId\",\n                f.expr(\"1 - transcriptConsequence.siftScore\").alias(\"score\"),\n                f.lit(\"vep\").alias(\"datatypeId\"),\n                f.lit(\"sift\").alias(\"datasourceId\"),\n            )\n        ),\n        _schema=V2G.get_schema(),\n    )\n
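Illustrative usage (a sketch only; `variant_annotation` and `gene_index` are assumed to exist):

    # The stored score is 1 - siftScore, so higher values correspond to more deleterious substitutions.
    sift_v2g = variant_annotation.get_sift_v2g(gene_index)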
"},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation.get_transcript_consequence_df","title":"get_transcript_consequence_df(gene_index: GeneIndex | None = None) -> DataFrame","text":"

Dataframe of exploded transcript consequences.

Optionally, the transcript consequences can be reduced to the universe of a gene index.

Parameters:

Name Type Description Default gene_index GeneIndex | None

A gene index. Defaults to None.

None

Returns:

Name Type Description DataFrame DataFrame

A dataframe exploded by transcript consequences, with the columns variantId, chromosome, position, transcriptConsequence and geneId.

Source code in src/otg/dataset/variant_annotation.py
def get_transcript_consequence_df(\n    self: VariantAnnotation, gene_index: GeneIndex | None = None\n) -> DataFrame:\n    \"\"\"Dataframe of exploded transcript consequences.\n\n    Optionally the trancript consequences can be reduced to the universe of a gene index.\n\n    Args:\n        gene_index (GeneIndex | None): A gene index. Defaults to None.\n\n    Returns:\n        DataFrame: A dataframe exploded by transcript consequences with the columns variantId, chromosome, transcriptConsequence\n    \"\"\"\n    # exploding the array removes records without VEP annotation\n    transript_consequences = self.df.withColumn(\n        \"transcriptConsequence\", f.explode(\"vep.transcriptConsequences\")\n    ).select(\n        \"variantId\",\n        \"chromosome\",\n        \"position\",\n        \"transcriptConsequence\",\n        f.col(\"transcriptConsequence.geneId\").alias(\"geneId\"),\n    )\n    if gene_index:\n        transript_consequences = transript_consequences.join(\n            f.broadcast(gene_index.df),\n            on=[\"chromosome\", \"geneId\"],\n        )\n    return transript_consequences.persist()\n
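Illustrative usage (a sketch only; `variant_annotation` and `gene_index` are assumed to exist):

    # One row per (variant, canonical transcript consequence); the result is persisted by the method.
    tc_df = variant_annotation.get_transcript_consequence_df(gene_index)
    tc_df.select("variantId", "geneId", "transcriptConsequence.consequenceTerms").show(5, truncate=False)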
"},{"location":"python_api/dataset/variant_annotation/#otg.dataset.variant_annotation.VariantAnnotation.max_maf","title":"max_maf() -> Column","text":"

Maximum minor allele frequency across all populations.

Returns:

Name Type Description Column Column

Maximum minor allele frequency across all populations.

Source code in src/otg/dataset/variant_annotation.py
def max_maf(self: VariantAnnotation) -> Column:\n    \"\"\"Maximum minor allele frequency accross all populations.\n\n    Returns:\n        Column: Maximum minor allele frequency accross all populations.\n    \"\"\"\n    return f.array_max(\n        f.transform(\n            self.df.alleleFrequencies,\n            lambda af: f.when(\n                af.alleleFrequency > 0.5, 1 - af.alleleFrequency\n            ).otherwise(af.alleleFrequency),\n        )\n    )\n
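Illustrative usage (a sketch only; `variant_annotation` is assumed to exist). `max_maf` returns a Column expression rather than a new dataset, so it is typically attached with `withColumn`:

    # Add the maximum minor allele frequency across populations as a new column.
    maf_df = variant_annotation.df.withColumn("maxMaf", variant_annotation.max_maf())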
"},{"location":"python_api/dataset/variant_annotation/#schema","title":"Schema","text":"
root\n |-- variantId: string (nullable = false)\n |-- chromosome: string (nullable = false)\n |-- position: integer (nullable = false)\n |-- gnomad3VariantId: string (nullable = false)\n |-- referenceAllele: string (nullable = false)\n |-- alternateAllele: string (nullable = false)\n |-- chromosomeB37: string (nullable = true)\n |-- positionB37: integer (nullable = true)\n |-- alleleType: string (nullable = true)\n |-- rsIds: array (nullable = true)\n |    |-- element: string (containsNull = true)\n |-- alleleFrequencies: array (nullable = false)\n |    |-- element: struct (containsNull = true)\n |    |    |-- populationName: string (nullable = true)\n |    |    |-- alleleFrequency: double (nullable = true)\n |-- cadd: struct (nullable = true)\n |    |-- phred: float (nullable = true)\n |    |-- raw: float (nullable = true)\n |-- vep: struct (nullable = false)\n |    |-- mostSevereConsequence: string (nullable = true)\n |    |-- transcriptConsequences: array (nullable = true)\n |    |    |-- element: struct (containsNull = true)\n |    |    |    |-- aminoAcids: string (nullable = true)\n |    |    |    |-- consequenceTerms: array (nullable = true)\n |    |    |    |    |-- element: string (containsNull = true)\n |    |    |    |-- geneId: string (nullable = true)\n |    |    |    |-- lof: string (nullable = true)\n |    |    |    |-- polyphenScore: double (nullable = true)\n |    |    |    |-- polyphenPrediction: string (nullable = true)\n |    |    |    |-- siftScore: double (nullable = true)\n |    |    |    |-- siftPrediction: string (nullable = true)\n
"},{"location":"python_api/dataset/variant_index/","title":"Variant index","text":""},{"location":"python_api/dataset/variant_index/#otg.dataset.variant_index.VariantIndex","title":"otg.dataset.variant_index.VariantIndex dataclass","text":"

Bases: Dataset

Variant index dataset.

The variant index dataset is the result of intersecting the variant annotation dataset with the variants for which V2D information is available.

Source code in src/otg/dataset/variant_index.py
@dataclass\nclass VariantIndex(Dataset):\n    \"\"\"Variant index dataset.\n\n    Variant index dataset is the result of intersecting the variant annotation dataset with the variants with V2D available information.\n    \"\"\"\n\n    @classmethod\n    def get_schema(cls: type[VariantIndex]) -> StructType:\n        \"\"\"Provides the schema for the VariantIndex dataset.\n\n        Returns:\n            StructType: Schema for the VariantIndex dataset\n        \"\"\"\n        return parse_spark_schema(\"variant_index.json\")\n\n    @classmethod\n    def from_variant_annotation(\n        cls: type[VariantIndex],\n        variant_annotation: VariantAnnotation,\n        study_locus: StudyLocus,\n    ) -> VariantIndex:\n        \"\"\"Initialise VariantIndex from pre-existing variant annotation dataset.\n\n        Args:\n            variant_annotation (VariantAnnotation): Variant annotation dataset\n            study_locus (StudyLocus): Study locus dataset with the variants to intersect with the variant annotation dataset\n\n        Returns:\n            VariantIndex: Variant index dataset\n        \"\"\"\n        unchanged_cols = [\n            \"variantId\",\n            \"chromosome\",\n            \"position\",\n            \"referenceAllele\",\n            \"alternateAllele\",\n            \"chromosomeB37\",\n            \"positionB37\",\n            \"alleleType\",\n            \"alleleFrequencies\",\n            \"cadd\",\n        ]\n        va_slimmed = variant_annotation.filter_by_variant_df(\n            study_locus.unique_variants_in_locus()\n        )\n        return cls(\n            _df=(\n                va_slimmed.df.select(\n                    *unchanged_cols,\n                    f.col(\"vep.mostSevereConsequence\").alias(\"mostSevereConsequence\"),\n                    # filters/rsid are arrays that can be empty, in this case we convert them to null\n                    nullify_empty_array(f.col(\"rsIds\")).alias(\"rsIds\"),\n                )\n                .repartition(400, \"chromosome\")\n                .sortWithinPartitions(\"chromosome\", \"position\")\n            ),\n            _schema=cls.get_schema(),\n        )\n
"},{"location":"python_api/dataset/variant_index/#otg.dataset.variant_index.VariantIndex.from_variant_annotation","title":"from_variant_annotation(variant_annotation: VariantAnnotation, study_locus: StudyLocus) -> VariantIndex classmethod","text":"

Initialise VariantIndex from pre-existing variant annotation dataset.

Parameters:

Name Type Description Default variant_annotation VariantAnnotation

Variant annotation dataset

required study_locus StudyLocus

Study locus dataset with the variants to intersect with the variant annotation dataset

required

Returns:

Name Type Description VariantIndex VariantIndex

Variant index dataset

Source code in src/otg/dataset/variant_index.py
@classmethod\ndef from_variant_annotation(\n    cls: type[VariantIndex],\n    variant_annotation: VariantAnnotation,\n    study_locus: StudyLocus,\n) -> VariantIndex:\n    \"\"\"Initialise VariantIndex from pre-existing variant annotation dataset.\n\n    Args:\n        variant_annotation (VariantAnnotation): Variant annotation dataset\n        study_locus (StudyLocus): Study locus dataset with the variants to intersect with the variant annotation dataset\n\n    Returns:\n        VariantIndex: Variant index dataset\n    \"\"\"\n    unchanged_cols = [\n        \"variantId\",\n        \"chromosome\",\n        \"position\",\n        \"referenceAllele\",\n        \"alternateAllele\",\n        \"chromosomeB37\",\n        \"positionB37\",\n        \"alleleType\",\n        \"alleleFrequencies\",\n        \"cadd\",\n    ]\n    va_slimmed = variant_annotation.filter_by_variant_df(\n        study_locus.unique_variants_in_locus()\n    )\n    return cls(\n        _df=(\n            va_slimmed.df.select(\n                *unchanged_cols,\n                f.col(\"vep.mostSevereConsequence\").alias(\"mostSevereConsequence\"),\n                # filters/rsid are arrays that can be empty, in this case we convert them to null\n                nullify_empty_array(f.col(\"rsIds\")).alias(\"rsIds\"),\n            )\n            .repartition(400, \"chromosome\")\n            .sortWithinPartitions(\"chromosome\", \"position\")\n        ),\n        _schema=cls.get_schema(),\n    )\n
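Illustrative usage (a sketch only; `variant_annotation` and `study_locus` are assumed to be pre-built VariantAnnotation and StudyLocus instances):

    from otg.dataset.variant_index import VariantIndex

    # Restrict the variant annotation to the variants tagged in the study loci and
    # keep only the columns defined by the VariantIndex schema.
    variant_index = VariantIndex.from_variant_annotation(
        variant_annotation=variant_annotation,
        study_locus=study_locus,
    )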
"},{"location":"python_api/dataset/variant_index/#otg.dataset.variant_index.VariantIndex.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the VariantIndex dataset.

Returns:

Name Type Description StructType StructType

Schema for the VariantIndex dataset

Source code in src/otg/dataset/variant_index.py
@classmethod\ndef get_schema(cls: type[VariantIndex]) -> StructType:\n    \"\"\"Provides the schema for the VariantIndex dataset.\n\n    Returns:\n        StructType: Schema for the VariantIndex dataset\n    \"\"\"\n    return parse_spark_schema(\"variant_index.json\")\n
"},{"location":"python_api/dataset/variant_index/#schema","title":"Schema","text":"
root\n |-- variantId: string (nullable = false)\n |-- chromosome: string (nullable = false)\n |-- position: integer (nullable = false)\n |-- referenceAllele: string (nullable = false)\n |-- alternateAllele: string (nullable = false)\n |-- chromosomeB37: string (nullable = true)\n |-- positionB37: integer (nullable = true)\n |-- alleleType: string (nullable = false)\n |-- alleleFrequencies: array (nullable = false)\n |    |-- element: struct (containsNull = true)\n |    |    |-- populationName: string (nullable = true)\n |    |    |-- alleleFrequency: double (nullable = true)\n |-- cadd: struct (nullable = true)\n |    |-- phred: float (nullable = true)\n |    |-- raw: float (nullable = true)\n |-- mostSevereConsequence: string (nullable = true)\n |-- rsIds: array (nullable = true)\n |    |-- element: string (containsNull = true)\n
"},{"location":"python_api/dataset/variant_to_gene/","title":"Variant-to-gene","text":""},{"location":"python_api/dataset/variant_to_gene/#otg.dataset.v2g.V2G","title":"otg.dataset.v2g.V2G dataclass","text":"

Bases: Dataset

Variant-to-gene (V2G) evidence dataset.

Variant-to-gene (V2G) evidence is understood as any piece of evidence that supports the association of a variant with a likely causal gene. The evidence can sometimes be context-specific and refer to specific biofeatures (e.g. cell types).

Source code in src/otg/dataset/v2g.py
@dataclass\nclass V2G(Dataset):\n    \"\"\"Variant-to-gene (V2G) evidence dataset.\n\n    A variant-to-gene (V2G) evidence is understood as any piece of evidence that supports the association of a variant with a likely causal gene. The evidence can sometimes be context-specific and refer to specific `biofeatures` (e.g. cell types)\n    \"\"\"\n\n    @classmethod\n    def get_schema(cls: type[V2G]) -> StructType:\n        \"\"\"Provides the schema for the V2G dataset.\n\n        Returns:\n            StructType: Schema for the V2G dataset\n        \"\"\"\n        return parse_spark_schema(\"v2g.json\")\n\n    def filter_by_genes(self: V2G, genes: GeneIndex) -> V2G:\n        \"\"\"Filter V2G dataset by genes.\n\n        Args:\n            genes (GeneIndex): Gene index dataset to filter by\n\n        Returns:\n            V2G: V2G dataset filtered by genes\n        \"\"\"\n        self.df = self._df.join(genes.df.select(\"geneId\"), on=\"geneId\", how=\"inner\")\n        return self\n\n    def extract_distance_tss_minimum(self: V2G) -> None:\n        \"\"\"Extract minimum distance to TSS.\"\"\"\n        self.df = self._df.filter(f.col(\"distance\")).withColumn(\n            \"distanceTssMinimum\",\n            f.expr(\"min(distTss) OVER (PARTITION BY studyLocusId)\"),\n        )\n
"},{"location":"python_api/dataset/variant_to_gene/#otg.dataset.v2g.V2G.extract_distance_tss_minimum","title":"extract_distance_tss_minimum() -> None","text":"

Extract minimum distance to TSS.

Source code in src/otg/dataset/v2g.py
def extract_distance_tss_minimum(self: V2G) -> None:\n    \"\"\"Extract minimum distance to TSS.\"\"\"\n    self.df = self._df.filter(f.col(\"distance\")).withColumn(\n        \"distanceTssMinimum\",\n        f.expr(\"min(distTss) OVER (PARTITION BY studyLocusId)\"),\n    )\n
"},{"location":"python_api/dataset/variant_to_gene/#otg.dataset.v2g.V2G.filter_by_genes","title":"filter_by_genes(genes: GeneIndex) -> V2G","text":"

Filter V2G dataset by genes.

Parameters:

Name Type Description Default genes GeneIndex

Gene index dataset to filter by

required

Returns:

Name Type Description V2G V2G

V2G dataset filtered by genes

Source code in src/otg/dataset/v2g.py
def filter_by_genes(self: V2G, genes: GeneIndex) -> V2G:\n    \"\"\"Filter V2G dataset by genes.\n\n    Args:\n        genes (GeneIndex): Gene index dataset to filter by\n\n    Returns:\n        V2G: V2G dataset filtered by genes\n    \"\"\"\n    self.df = self._df.join(genes.df.select(\"geneId\"), on=\"geneId\", how=\"inner\")\n    return self\n
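Illustrative usage (a sketch only; `v2g` is an existing V2G dataset and `gene_index` a GeneIndex, for example one restricted to protein-coding genes):

    # Keep only V2G rows whose geneId appears in the gene index; the dataset is updated in place.
    v2g_filtered = v2g.filter_by_genes(gene_index)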
"},{"location":"python_api/dataset/variant_to_gene/#otg.dataset.v2g.V2G.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the V2G dataset.

Returns:

Name Type Description StructType StructType

Schema for the V2G dataset

Source code in src/otg/dataset/v2g.py
@classmethod\ndef get_schema(cls: type[V2G]) -> StructType:\n    \"\"\"Provides the schema for the V2G dataset.\n\n    Returns:\n        StructType: Schema for the V2G dataset\n    \"\"\"\n    return parse_spark_schema(\"v2g.json\")\n
"},{"location":"python_api/dataset/variant_to_gene/#schema","title":"Schema","text":"
root\n |-- geneId: string (nullable = false)\n |-- variantId: string (nullable = false)\n |-- distance: long (nullable = true)\n |-- chromosome: string (nullable = false)\n |-- datatypeId: string (nullable = false)\n |-- datasourceId: string (nullable = false)\n |-- score: double (nullable = true)\n |-- resourceScore: double (nullable = true)\n |-- pmid: string (nullable = true)\n |-- biofeature: string (nullable = true)\n |-- variantFunctionalConsequenceId: string (nullable = true)\n |-- isHighQualityPlof: boolean (nullable = true)\n
"},{"location":"python_api/datasource/_datasource/","title":"Data Source","text":"

TBC

"},{"location":"python_api/datasource/finngen/_finngen/","title":"FinnGen","text":""},{"location":"python_api/datasource/finngen/study_index/","title":"Study Index","text":""},{"location":"python_api/datasource/finngen/study_index/#otg.datasource.finngen.study_index.FinnGenStudyIndex","title":"otg.datasource.finngen.study_index.FinnGenStudyIndex","text":"

Bases: StudyIndex

Study index dataset from FinnGen.

The following information is aggregated/extracted:

  • Study ID in the special format (FINNGEN_R9_*)
  • Trait name (for example, Amoebiasis)
  • Number of cases and controls
  • Link to the summary statistics location

Some fields are also populated as constants, such as study type and the initial sample size.

Source code in src/otg/datasource/finngen/study_index.py
class FinnGenStudyIndex(StudyIndex):\n    \"\"\"Study index dataset from FinnGen.\n\n    The following information is aggregated/extracted:\n\n    - Study ID in the special format (FINNGEN_R9_*)\n    - Trait name (for example, Amoebiasis)\n    - Number of cases and controls\n    - Link to the summary statistics location\n\n    Some fields are also populated as constants, such as study type and the initial sample size.\n    \"\"\"\n\n    @classmethod\n    def from_source(\n        cls: type[FinnGenStudyIndex],\n        finngen_studies: DataFrame,\n        finngen_release_prefix: str,\n        finngen_summary_stats_url_prefix: str,\n        finngen_summary_stats_url_suffix: str,\n    ) -> FinnGenStudyIndex:\n        \"\"\"This function ingests study level metadata from FinnGen.\n\n        Args:\n            finngen_studies (DataFrame): FinnGen raw study table\n            finngen_release_prefix (str): Release prefix pattern.\n            finngen_summary_stats_url_prefix (str): URL prefix for summary statistics location.\n            finngen_summary_stats_url_suffix (str): URL prefix suffix for summary statistics location.\n\n        Returns:\n            FinnGenStudyIndex: Parsed and annotated FinnGen study table.\n        \"\"\"\n        return FinnGenStudyIndex(\n            _df=finngen_studies.select(\n                f.concat(f.lit(f\"{finngen_release_prefix}_\"), f.col(\"phenocode\")).alias(\n                    \"studyId\"\n                ),\n                f.col(\"phenostring\").alias(\"traitFromSource\"),\n                f.col(\"num_cases\").alias(\"nCases\"),\n                f.col(\"num_controls\").alias(\"nControls\"),\n                (f.col(\"num_cases\") + f.col(\"num_controls\")).alias(\"nSamples\"),\n                f.lit(finngen_release_prefix).alias(\"projectId\"),\n                f.lit(\"gwas\").alias(\"studyType\"),\n                f.lit(True).alias(\"hasSumstats\"),\n                f.lit(\"377,277 (210,870 females and 166,407 males)\").alias(\n                    \"initialSampleSize\"\n                ),\n                f.array(\n                    f.struct(\n                        f.lit(377277).cast(\"long\").alias(\"sampleSize\"),\n                        f.lit(\"Finnish\").alias(\"ancestry\"),\n                    )\n                ).alias(\"discoverySamples\"),\n                f.concat(\n                    f.lit(finngen_summary_stats_url_prefix),\n                    f.col(\"phenocode\"),\n                    f.lit(finngen_summary_stats_url_suffix),\n                ).alias(\"summarystatsLocation\"),\n            ).withColumn(\n                \"ldPopulationStructure\",\n                cls.aggregate_and_map_ancestries(f.col(\"discoverySamples\")),\n            ),\n            _schema=cls.get_schema(),\n        )\n
"},{"location":"python_api/datasource/finngen/study_index/#otg.datasource.finngen.study_index.FinnGenStudyIndex.from_source","title":"from_source(finngen_studies: DataFrame, finngen_release_prefix: str, finngen_summary_stats_url_prefix: str, finngen_summary_stats_url_suffix: str) -> FinnGenStudyIndex classmethod","text":"

This function ingests study level metadata from FinnGen.

Parameters:

Name Type Description Default finngen_studies DataFrame

FinnGen raw study table

required finngen_release_prefix str

Release prefix pattern.

required finngen_summary_stats_url_prefix str

URL prefix for summary statistics location.

required finngen_summary_stats_url_suffix str

URL suffix for summary statistics location.

required

Returns:

Name Type Description FinnGenStudyIndex FinnGenStudyIndex

Parsed and annotated FinnGen study table.

Source code in src/otg/datasource/finngen/study_index.py
@classmethod\ndef from_source(\n    cls: type[FinnGenStudyIndex],\n    finngen_studies: DataFrame,\n    finngen_release_prefix: str,\n    finngen_summary_stats_url_prefix: str,\n    finngen_summary_stats_url_suffix: str,\n) -> FinnGenStudyIndex:\n    \"\"\"This function ingests study level metadata from FinnGen.\n\n    Args:\n        finngen_studies (DataFrame): FinnGen raw study table\n        finngen_release_prefix (str): Release prefix pattern.\n        finngen_summary_stats_url_prefix (str): URL prefix for summary statistics location.\n        finngen_summary_stats_url_suffix (str): URL prefix suffix for summary statistics location.\n\n    Returns:\n        FinnGenStudyIndex: Parsed and annotated FinnGen study table.\n    \"\"\"\n    return FinnGenStudyIndex(\n        _df=finngen_studies.select(\n            f.concat(f.lit(f\"{finngen_release_prefix}_\"), f.col(\"phenocode\")).alias(\n                \"studyId\"\n            ),\n            f.col(\"phenostring\").alias(\"traitFromSource\"),\n            f.col(\"num_cases\").alias(\"nCases\"),\n            f.col(\"num_controls\").alias(\"nControls\"),\n            (f.col(\"num_cases\") + f.col(\"num_controls\")).alias(\"nSamples\"),\n            f.lit(finngen_release_prefix).alias(\"projectId\"),\n            f.lit(\"gwas\").alias(\"studyType\"),\n            f.lit(True).alias(\"hasSumstats\"),\n            f.lit(\"377,277 (210,870 females and 166,407 males)\").alias(\n                \"initialSampleSize\"\n            ),\n            f.array(\n                f.struct(\n                    f.lit(377277).cast(\"long\").alias(\"sampleSize\"),\n                    f.lit(\"Finnish\").alias(\"ancestry\"),\n                )\n            ).alias(\"discoverySamples\"),\n            f.concat(\n                f.lit(finngen_summary_stats_url_prefix),\n                f.col(\"phenocode\"),\n                f.lit(finngen_summary_stats_url_suffix),\n            ).alias(\"summarystatsLocation\"),\n        ).withColumn(\n            \"ldPopulationStructure\",\n            cls.aggregate_and_map_ancestries(f.col(\"discoverySamples\")),\n        ),\n        _schema=cls.get_schema(),\n    )\n
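Illustrative usage (a sketch only; `finngen_studies_df` is assumed to be a Spark DataFrame read from the raw FinnGen study table, and the prefix/suffix values below are placeholders rather than verified locations):

    from otg.datasource.finngen.study_index import FinnGenStudyIndex

    finngen_study_index = FinnGenStudyIndex.from_source(
        finngen_studies=finngen_studies_df,
        finngen_release_prefix="FINNGEN_R9",
        finngen_summary_stats_url_prefix="https://example.org/finngen_R9_",  # placeholder
        finngen_summary_stats_url_suffix=".gz",  # placeholder
    )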
"},{"location":"python_api/datasource/gnomad/_gnomad/","title":"GnomAD","text":""},{"location":"python_api/datasource/gnomad/gnomad_ld/","title":"LD Matrix","text":""},{"location":"python_api/datasource/gnomad/gnomad_ld/#otg.datasource.gnomad.ld.GnomADLDMatrix","title":"otg.datasource.gnomad.ld.GnomADLDMatrix","text":"

Importer of LD information from GnomAD.

The information comes from LD matrices made available by GnomAD in Hail's native format. We aggregate the LD information across 8 ancestries. The basic steps to generate the LDIndex are:

  1. Convert an LD matrix to a Spark DataFrame.
  2. Resolve the matrix indices to variant IDs by lifting over the coordinates to GRCh38.
  3. Aggregate the LD information across populations.
Source code in src/otg/datasource/gnomad/ld.py
class GnomADLDMatrix:\n    \"\"\"Importer of LD information from GnomAD.\n\n    The information comes from LD matrices [made available by GnomAD](https://gnomad.broadinstitute.org/downloads/#v2-linkage-disequilibrium) in Hail's native format. We aggregate the LD information across 8 ancestries.\n    The basic steps to generate the LDIndex are:\n\n    1. Convert a LD matrix to a Spark DataFrame.\n    2. Resolve the matrix indices to variant IDs by lifting over the coordinates to GRCh38.\n    3. Aggregate the LD information across populations.\n\n    \"\"\"\n\n    @staticmethod\n    def _aggregate_ld_index_across_populations(\n        unaggregated_ld_index: DataFrame,\n    ) -> DataFrame:\n        \"\"\"Aggregate LDIndex across populations.\n\n        Args:\n            unaggregated_ld_index (DataFrame): Unaggregate LDIndex index dataframe  each row is a variant pair in a population\n\n        Returns:\n            DataFrame: Aggregated LDIndex index dataframe  each row is a variant with the LD set across populations\n\n        Examples:\n            >>> data = [(\"1.0\", \"var1\", \"X\", \"var1\", \"pop1\"), (\"1.0\", \"X\", \"var2\", \"var2\", \"pop1\"),\n            ...         (\"0.5\", \"var1\", \"X\", \"var2\", \"pop1\"), (\"0.5\", \"var1\", \"X\", \"var2\", \"pop2\"),\n            ...         (\"0.5\", \"var2\", \"X\", \"var1\", \"pop1\"), (\"0.5\", \"X\", \"var2\", \"var1\", \"pop2\")]\n            >>> df = spark.createDataFrame(data, [\"r\", \"variantId\", \"chromosome\", \"tagvariantId\", \"population\"])\n            >>> GnomADLDMatrix._aggregate_ld_index_across_populations(df).printSchema()\n            root\n             |-- variantId: string (nullable = true)\n             |-- chromosome: string (nullable = true)\n             |-- ldSet: array (nullable = false)\n             |    |-- element: struct (containsNull = false)\n             |    |    |-- tagVariantId: string (nullable = true)\n             |    |    |-- rValues: array (nullable = false)\n             |    |    |    |-- element: struct (containsNull = false)\n             |    |    |    |    |-- population: string (nullable = true)\n             |    |    |    |    |-- r: string (nullable = true)\n            <BLANKLINE>\n        \"\"\"\n        return (\n            unaggregated_ld_index\n            # First level of aggregation: get r/population for each variant/tagVariant pair\n            .withColumn(\"r_pop_struct\", f.struct(\"population\", \"r\"))\n            .groupBy(\"chromosome\", \"variantId\", \"tagVariantId\")\n            .agg(\n                f.collect_set(\"r_pop_struct\").alias(\"rValues\"),\n            )\n            # Second level of aggregation: get r/population for each variant\n            .withColumn(\"r_pop_tag_struct\", f.struct(\"tagVariantId\", \"rValues\"))\n            .groupBy(\"variantId\", \"chromosome\")\n            .agg(\n                f.collect_set(\"r_pop_tag_struct\").alias(\"ldSet\"),\n            )\n        )\n\n    @staticmethod\n    def _convert_ld_matrix_to_table(\n        block_matrix: BlockMatrix, min_r2: float\n    ) -> DataFrame:\n        \"\"\"Convert LD matrix to table.\n\n        Args:\n            block_matrix (BlockMatrix): LD matrix\n            min_r2 (float): Minimum r2 value to keep in the table\n\n        Returns:\n            DataFrame: LD matrix as a Spark DataFrame\n        \"\"\"\n        table = block_matrix.entries(keyed=False)\n        return (\n            table.filter(hl.abs(table.entry) >= min_r2**0.5)\n            .to_spark()\n            
.withColumnRenamed(\"entry\", \"r\")\n        )\n\n    @staticmethod\n    def _create_ldindex_for_population(\n        population_id: str,\n        ld_matrix_path: str,\n        ld_index_raw_path: str,\n        grch37_to_grch38_chain_path: str,\n        min_r2: float,\n    ) -> DataFrame:\n        \"\"\"Create LDIndex for a specific population.\n\n        Args:\n            population_id (str): Population ID\n            ld_matrix_path (str): Path to the LD matrix\n            ld_index_raw_path (str): Path to the LD index\n            grch37_to_grch38_chain_path (str): Path to the chain file used to lift over the coordinates\n            min_r2 (float): Minimum r2 value to keep in the table\n\n        Returns:\n            DataFrame: LDIndex for a specific population\n        \"\"\"\n        # Prepare LD Block matrix\n        ld_matrix = GnomADLDMatrix._convert_ld_matrix_to_table(\n            BlockMatrix.read(ld_matrix_path), min_r2\n        )\n\n        # Prepare table with variant indices\n        ld_index = GnomADLDMatrix._process_variant_indices(\n            hl.read_table(ld_index_raw_path),\n            grch37_to_grch38_chain_path,\n        )\n\n        return GnomADLDMatrix._resolve_variant_indices(ld_index, ld_matrix).select(\n            \"*\",\n            f.lit(population_id).alias(\"population\"),\n        )\n\n    @staticmethod\n    def _process_variant_indices(\n        ld_index_raw: hl.Table, grch37_to_grch38_chain_path: str\n    ) -> DataFrame:\n        \"\"\"Creates a look up table between variants and their coordinates in the LD Matrix.\n\n        !!! info \"Gnomad's LD Matrix and Index are based on GRCh37 coordinates. This function will lift over the coordinates to GRCh38 to build the lookup table.\"\n\n        Args:\n            ld_index_raw (hl.Table): LD index table from GnomAD\n            grch37_to_grch38_chain_path (str): Path to the chain file used to lift over the coordinates\n\n        Returns:\n            DataFrame: Look up table between variants in build hg38 and their coordinates in the LD Matrix\n        \"\"\"\n        ld_index_38 = _liftover_loci(\n            ld_index_raw, grch37_to_grch38_chain_path, \"GRCh38\"\n        )\n\n        return (\n            ld_index_38.to_spark()\n            # Filter out variants where the liftover failed\n            .filter(f.col(\"`locus_GRCh38.position`\").isNotNull())\n            .withColumn(\n                \"chromosome\", f.regexp_replace(\"`locus_GRCh38.contig`\", \"chr\", \"\")\n            )\n            .withColumn(\n                \"position\",\n                convert_gnomad_position_to_ensembl(\n                    f.col(\"`locus_GRCh38.position`\"),\n                    f.col(\"`alleles`\").getItem(0),\n                    f.col(\"`alleles`\").getItem(1),\n                ),\n            )\n            .select(\n                \"chromosome\",\n                f.concat_ws(\n                    \"_\",\n                    f.col(\"chromosome\"),\n                    f.col(\"position\"),\n                    f.col(\"`alleles`\").getItem(0),\n                    f.col(\"`alleles`\").getItem(1),\n                ).alias(\"variantId\"),\n                f.col(\"idx\"),\n            )\n            # Filter out ambiguous liftover results: multiple indices for the same variant\n            .withColumn(\"count\", f.count(\"*\").over(Window.partitionBy([\"variantId\"])))\n            .filter(f.col(\"count\") == 1)\n            .drop(\"count\")\n        )\n\n    @staticmethod\n    def _resolve_variant_indices(\n    
    ld_index: DataFrame, ld_matrix: DataFrame\n    ) -> DataFrame:\n        \"\"\"Resolve the `i` and `j` indices of the block matrix to variant IDs (build 38).\n\n        Args:\n            ld_index (DataFrame): Dataframe with resolved variant indices\n            ld_matrix (DataFrame): Dataframe with the filtered LD matrix\n\n        Returns:\n            DataFrame: Dataframe with variant IDs instead of `i` and `j` indices\n        \"\"\"\n        ld_index_i = ld_index.selectExpr(\n            \"idx as i\", \"variantId as variantId_i\", \"chromosome\"\n        )\n        ld_index_j = ld_index.selectExpr(\"idx as j\", \"variantId as variantId_j\")\n        return (\n            ld_matrix.join(ld_index_i, on=\"i\", how=\"inner\")\n            .join(ld_index_j, on=\"j\", how=\"inner\")\n            .drop(\"i\", \"j\")\n        )\n\n    @staticmethod\n    def _transpose_ld_matrix(ld_matrix: DataFrame) -> DataFrame:\n        \"\"\"Transpose LD matrix to a square matrix format.\n\n        Args:\n            ld_matrix (DataFrame): Triangular LD matrix converted to a Spark DataFrame\n\n        Returns:\n            DataFrame: Square LD matrix without diagonal duplicates\n\n        Examples:\n            >>> df = spark.createDataFrame(\n            ...     [\n            ...         (1, 1, 1.0, \"1\", \"AFR\"),\n            ...         (1, 2, 0.5, \"1\", \"AFR\"),\n            ...         (2, 2, 1.0, \"1\", \"AFR\"),\n            ...     ],\n            ...     [\"variantId_i\", \"variantId_j\", \"r\", \"chromosome\", \"population\"],\n            ... )\n            >>> GnomADLDMatrix._transpose_ld_matrix(df).show()\n            +-----------+-----------+---+----------+----------+\n            |variantId_i|variantId_j|  r|chromosome|population|\n            +-----------+-----------+---+----------+----------+\n            |          1|          2|0.5|         1|       AFR|\n            |          1|          1|1.0|         1|       AFR|\n            |          2|          1|0.5|         1|       AFR|\n            |          2|          2|1.0|         1|       AFR|\n            +-----------+-----------+---+----------+----------+\n            <BLANKLINE>\n        \"\"\"\n        ld_matrix_transposed = ld_matrix.selectExpr(\n            \"variantId_i as variantId_j\",\n            \"variantId_j as variantId_i\",\n            \"r\",\n            \"chromosome\",\n            \"population\",\n        )\n        return ld_matrix.filter(\n            f.col(\"variantId_i\") != f.col(\"variantId_j\")\n        ).unionByName(ld_matrix_transposed)\n\n    @classmethod\n    def as_ld_index(\n        cls: type[GnomADLDMatrix],\n        ld_populations: list[str],\n        ld_matrix_template: str,\n        ld_index_raw_template: str,\n        grch37_to_grch38_chain_path: str,\n        min_r2: float,\n    ) -> LDIndex:\n        \"\"\"Create LDIndex dataset aggregating the LD information across a set of populations.\n\n        Args:\n            ld_populations (list[str]): List of populations to aggregate\n            ld_matrix_template (str): Template path to the LD matrix\n            ld_index_raw_template (str): Template path to the LD variants index\n            grch37_to_grch38_chain_path (str): Path to the chain file used to lift over the coordinates\n            min_r2 (float): Minimum r2 value to keep in the table\n\n        Returns:\n            LDIndex: LDIndex dataset\n        \"\"\"\n        ld_indices_unaggregated = []\n        for pop in ld_populations:\n            try:\n                ld_matrix_path = 
ld_matrix_template.format(POP=pop)\n                ld_index_raw_path = ld_index_raw_template.format(POP=pop)\n                pop_ld_index = cls._create_ldindex_for_population(\n                    pop,\n                    ld_matrix_path,\n                    ld_index_raw_path.format(pop),\n                    grch37_to_grch38_chain_path,\n                    min_r2,\n                )\n                ld_indices_unaggregated.append(pop_ld_index)\n            except Exception as e:\n                print(f\"Failed to create LDIndex for population {pop}: {e}\")\n                sys.exit(1)\n\n        ld_index_unaggregated = (\n            GnomADLDMatrix._transpose_ld_matrix(\n                reduce(lambda df1, df2: df1.unionByName(df2), ld_indices_unaggregated)\n            )\n            .withColumnRenamed(\"variantId_i\", \"variantId\")\n            .withColumnRenamed(\"variantId_j\", \"tagVariantId\")\n        )\n        return LDIndex(\n            _df=cls._aggregate_ld_index_across_populations(ld_index_unaggregated),\n            _schema=LDIndex.get_schema(),\n        )\n
"},{"location":"python_api/datasource/gnomad/gnomad_ld/#otg.datasource.gnomad.ld.GnomADLDMatrix.as_ld_index","title":"as_ld_index(ld_populations: list[str], ld_matrix_template: str, ld_index_raw_template: str, grch37_to_grch38_chain_path: str, min_r2: float) -> LDIndex classmethod","text":"

Create LDIndex dataset aggregating the LD information across a set of populations.

Parameters:

Name Type Description Default ld_populations list[str]

List of populations to aggregate

required ld_matrix_template str

Template path to the LD matrix

required ld_index_raw_template str

Template path to the LD variants index

required grch37_to_grch38_chain_path str

Path to the chain file used to lift over the coordinates

required min_r2 float

Minimum r2 value to keep in the table

required

Returns:

Name Type Description LDIndex LDIndex

LDIndex dataset

Source code in src/otg/datasource/gnomad/ld.py
@classmethod\ndef as_ld_index(\n    cls: type[GnomADLDMatrix],\n    ld_populations: list[str],\n    ld_matrix_template: str,\n    ld_index_raw_template: str,\n    grch37_to_grch38_chain_path: str,\n    min_r2: float,\n) -> LDIndex:\n    \"\"\"Create LDIndex dataset aggregating the LD information across a set of populations.\n\n    Args:\n        ld_populations (list[str]): List of populations to aggregate\n        ld_matrix_template (str): Template path to the LD matrix\n        ld_index_raw_template (str): Template path to the LD variants index\n        grch37_to_grch38_chain_path (str): Path to the chain file used to lift over the coordinates\n        min_r2 (float): Minimum r2 value to keep in the table\n\n    Returns:\n        LDIndex: LDIndex dataset\n    \"\"\"\n    ld_indices_unaggregated = []\n    for pop in ld_populations:\n        try:\n            ld_matrix_path = ld_matrix_template.format(POP=pop)\n            ld_index_raw_path = ld_index_raw_template.format(POP=pop)\n            pop_ld_index = cls._create_ldindex_for_population(\n                pop,\n                ld_matrix_path,\n                ld_index_raw_path.format(pop),\n                grch37_to_grch38_chain_path,\n                min_r2,\n            )\n            ld_indices_unaggregated.append(pop_ld_index)\n        except Exception as e:\n            print(f\"Failed to create LDIndex for population {pop}: {e}\")\n            sys.exit(1)\n\n    ld_index_unaggregated = (\n        GnomADLDMatrix._transpose_ld_matrix(\n            reduce(lambda df1, df2: df1.unionByName(df2), ld_indices_unaggregated)\n        )\n        .withColumnRenamed(\"variantId_i\", \"variantId\")\n        .withColumnRenamed(\"variantId_j\", \"tagVariantId\")\n    )\n    return LDIndex(\n        _df=cls._aggregate_ld_index_across_populations(ld_index_unaggregated),\n        _schema=LDIndex.get_schema(),\n    )\n
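Illustrative usage (a sketch only; the paths below are placeholders, not verified GnomAD locations; `{POP}` is substituted for each population in the list):

    from otg.datasource.gnomad.ld import GnomADLDMatrix

    ld_index = GnomADLDMatrix.as_ld_index(
        ld_populations=["afr", "nfe"],  # illustrative subset of the 8 ancestries
        ld_matrix_template="gs://bucket/ld/{POP}.ld.bm",  # placeholder path
        ld_index_raw_template="gs://bucket/ld/{POP}.ld.variant_indices.ht",  # placeholder path
        grch37_to_grch38_chain_path="gs://bucket/grch37_to_grch38.over.chain.gz",  # placeholder path
        min_r2=0.5,
    )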
"},{"location":"python_api/datasource/gnomad/gnomad_variants/","title":"Variants","text":""},{"location":"python_api/datasource/gnomad/gnomad_variants/#otg.datasource.gnomad.variants.GnomADVariants","title":"otg.datasource.gnomad.variants.GnomADVariants","text":"

GnomAD variants included in the GnomAD genomes dataset.

Source code in src/otg/datasource/gnomad/variants.py
class GnomADVariants:\n    \"\"\"GnomAD variants included in the GnomAD genomes dataset.\"\"\"\n\n    @staticmethod\n    def _convert_gnomad_position_to_ensembl_hail(\n        position: Int32Expression,\n        reference: StringExpression,\n        alternate: StringExpression,\n    ) -> Int32Expression:\n        \"\"\"Convert GnomAD variant position to Ensembl variant position in hail table.\n\n        For indels (the reference or alternate allele is longer than 1), then adding 1 to the position, for SNPs, the position is unchanged.\n        More info about the problem: https://www.biostars.org/p/84686/\n\n        Args:\n            position (Int32Expression): Position of the variant in the GnomAD genome.\n            reference (StringExpression): The reference allele.\n            alternate (StringExpression): The alternate allele\n\n        Returns:\n            Int32Expression: The position of the variant according to Ensembl genome.\n        \"\"\"\n        return hl.if_else(\n            (reference.length() > 1) | (alternate.length() > 1), position + 1, position\n        )\n\n    @classmethod\n    def as_variant_annotation(\n        cls: type[GnomADVariants],\n        gnomad_file: str,\n        grch38_to_grch37_chain: str,\n        populations: list,\n    ) -> VariantAnnotation:\n        \"\"\"Generate variant annotation dataset from gnomAD.\n\n        Some relevant modifications to the original dataset are:\n\n        1. The transcript consequences features provided by VEP are filtered to only refer to the Ensembl canonical transcript.\n        2. Genome coordinates are liftovered from GRCh38 to GRCh37 to keep as annotation.\n        3. Field names are converted to camel case to follow the convention.\n\n        Args:\n            gnomad_file (str): Path to `gnomad.genomes.vX.X.X.sites.ht` gnomAD dataset\n            grch38_to_grch37_chain (str): Path to chain file for liftover\n            populations (list): List of populations to include in the dataset\n\n        Returns:\n            VariantAnnotation: Variant annotation dataset\n        \"\"\"\n        # Load variants dataset\n        ht = hl.read_table(\n            gnomad_file,\n            _load_refs=False,\n        )\n\n        # Liftover\n        grch37 = hl.get_reference(\"GRCh37\")\n        grch38 = hl.get_reference(\"GRCh38\")\n        grch38.add_liftover(grch38_to_grch37_chain, grch37)\n\n        # Drop non biallelic variants\n        ht = ht.filter(ht.alleles.length() == 2)\n        # Liftover\n        ht = ht.annotate(locus_GRCh37=hl.liftover(ht.locus, \"GRCh37\"))\n        # Select relevant fields and nested records to create class\n        return VariantAnnotation(\n            _df=(\n                ht.select(\n                    gnomad3VariantId=hl.str(\"-\").join(\n                        [\n                            ht.locus.contig.replace(\"chr\", \"\"),\n                            hl.str(ht.locus.position),\n                            ht.alleles[0],\n                            ht.alleles[1],\n                        ]\n                    ),\n                    chromosome=ht.locus.contig.replace(\"chr\", \"\"),\n                    position=GnomADVariants._convert_gnomad_position_to_ensembl_hail(\n                        ht.locus.position, ht.alleles[0], ht.alleles[1]\n                    ),\n                    variantId=hl.str(\"_\").join(\n                        [\n                            ht.locus.contig.replace(\"chr\", \"\"),\n                            hl.str(\n                              
  GnomADVariants._convert_gnomad_position_to_ensembl_hail(\n                                    ht.locus.position, ht.alleles[0], ht.alleles[1]\n                                )\n                            ),\n                            ht.alleles[0],\n                            ht.alleles[1],\n                        ]\n                    ),\n                    chromosomeB37=ht.locus_GRCh37.contig.replace(\"chr\", \"\"),\n                    positionB37=ht.locus_GRCh37.position,\n                    referenceAllele=ht.alleles[0],\n                    alternateAllele=ht.alleles[1],\n                    rsIds=ht.rsid,\n                    alleleType=ht.allele_info.allele_type,\n                    cadd=hl.struct(\n                        phred=ht.cadd.phred,\n                        raw=ht.cadd.raw_score,\n                    ),\n                    alleleFrequencies=hl.set([f\"{pop}-adj\" for pop in populations]).map(\n                        lambda p: hl.struct(\n                            populationName=p,\n                            alleleFrequency=ht.freq[ht.globals.freq_index_dict[p]].AF,\n                        )\n                    ),\n                    vep=hl.struct(\n                        mostSevereConsequence=ht.vep.most_severe_consequence,\n                        transcriptConsequences=hl.map(\n                            lambda x: hl.struct(\n                                aminoAcids=x.amino_acids,\n                                consequenceTerms=x.consequence_terms,\n                                geneId=x.gene_id,\n                                lof=x.lof,\n                                polyphenScore=x.polyphen_score,\n                                polyphenPrediction=x.polyphen_prediction,\n                                siftScore=x.sift_score,\n                                siftPrediction=x.sift_prediction,\n                            ),\n                            # Only keeping canonical transcripts\n                            ht.vep.transcript_consequences.filter(\n                                lambda x: (x.canonical == 1)\n                                & (x.gene_symbol_source == \"HGNC\")\n                            ),\n                        ),\n                    ),\n                )\n                .key_by(\"chromosome\", \"position\")\n                .drop(\"locus\", \"alleles\")\n                .select_globals()\n                .to_spark(flatten=False)\n            ),\n            _schema=VariantAnnotation.get_schema(),\n        )\n
"},{"location":"python_api/datasource/gnomad/gnomad_variants/#otg.datasource.gnomad.variants.GnomADVariants.as_variant_annotation","title":"as_variant_annotation(gnomad_file: str, grch38_to_grch37_chain: str, populations: list) -> VariantAnnotation classmethod","text":"

Generate variant annotation dataset from gnomAD.

Some relevant modifications to the original dataset are:

  1. The transcript consequence features provided by VEP are filtered to only refer to the Ensembl canonical transcript.
  2. Genome coordinates are lifted over from GRCh38 to GRCh37 and kept as annotation.
  3. Field names are converted to camel case to follow the convention.

Parameters:

  gnomad_file (str): Path to gnomad.genomes.vX.X.X.sites.ht gnomAD dataset. Required.
  grch38_to_grch37_chain (str): Path to chain file for liftover. Required.
  populations (list): List of populations to include in the dataset. Required.

Returns:

  VariantAnnotation: Variant annotation dataset.
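
A minimal usage sketch is shown below; the gnomAD table path, chain file path, and population codes are hypothetical placeholders that only illustrate the call signature, and a working Hail/Spark environment is assumed.

from otg.datasource.gnomad.variants import GnomADVariants

# Hypothetical inputs: substitute the real gnomAD Hail table and liftover chain file.
variant_annotation = GnomADVariants.as_variant_annotation(
    gnomad_file="gs://my-bucket/gnomad.genomes.v3.1.2.sites.ht",  # hypothetical path
    grch38_to_grch37_chain="gs://my-bucket/grch38_to_grch37.over.chain.gz",  # hypothetical path
    populations=["afr", "nfe", "eas"],  # example gnomAD population labels
)

# The resulting VariantAnnotation dataset is backed by a Spark DataFrame.
variant_annotation.df.printSchema()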

Source code in src/otg/datasource/gnomad/variants.py
@classmethod\ndef as_variant_annotation(\n    cls: type[GnomADVariants],\n    gnomad_file: str,\n    grch38_to_grch37_chain: str,\n    populations: list,\n) -> VariantAnnotation:\n    \"\"\"Generate variant annotation dataset from gnomAD.\n\n    Some relevant modifications to the original dataset are:\n\n    1. The transcript consequences features provided by VEP are filtered to only refer to the Ensembl canonical transcript.\n    2. Genome coordinates are liftovered from GRCh38 to GRCh37 to keep as annotation.\n    3. Field names are converted to camel case to follow the convention.\n\n    Args:\n        gnomad_file (str): Path to `gnomad.genomes.vX.X.X.sites.ht` gnomAD dataset\n        grch38_to_grch37_chain (str): Path to chain file for liftover\n        populations (list): List of populations to include in the dataset\n\n    Returns:\n        VariantAnnotation: Variant annotation dataset\n    \"\"\"\n    # Load variants dataset\n    ht = hl.read_table(\n        gnomad_file,\n        _load_refs=False,\n    )\n\n    # Liftover\n    grch37 = hl.get_reference(\"GRCh37\")\n    grch38 = hl.get_reference(\"GRCh38\")\n    grch38.add_liftover(grch38_to_grch37_chain, grch37)\n\n    # Drop non biallelic variants\n    ht = ht.filter(ht.alleles.length() == 2)\n    # Liftover\n    ht = ht.annotate(locus_GRCh37=hl.liftover(ht.locus, \"GRCh37\"))\n    # Select relevant fields and nested records to create class\n    return VariantAnnotation(\n        _df=(\n            ht.select(\n                gnomad3VariantId=hl.str(\"-\").join(\n                    [\n                        ht.locus.contig.replace(\"chr\", \"\"),\n                        hl.str(ht.locus.position),\n                        ht.alleles[0],\n                        ht.alleles[1],\n                    ]\n                ),\n                chromosome=ht.locus.contig.replace(\"chr\", \"\"),\n                position=GnomADVariants._convert_gnomad_position_to_ensembl_hail(\n                    ht.locus.position, ht.alleles[0], ht.alleles[1]\n                ),\n                variantId=hl.str(\"_\").join(\n                    [\n                        ht.locus.contig.replace(\"chr\", \"\"),\n                        hl.str(\n                            GnomADVariants._convert_gnomad_position_to_ensembl_hail(\n                                ht.locus.position, ht.alleles[0], ht.alleles[1]\n                            )\n                        ),\n                        ht.alleles[0],\n                        ht.alleles[1],\n                    ]\n                ),\n                chromosomeB37=ht.locus_GRCh37.contig.replace(\"chr\", \"\"),\n                positionB37=ht.locus_GRCh37.position,\n                referenceAllele=ht.alleles[0],\n                alternateAllele=ht.alleles[1],\n                rsIds=ht.rsid,\n                alleleType=ht.allele_info.allele_type,\n                cadd=hl.struct(\n                    phred=ht.cadd.phred,\n                    raw=ht.cadd.raw_score,\n                ),\n                alleleFrequencies=hl.set([f\"{pop}-adj\" for pop in populations]).map(\n                    lambda p: hl.struct(\n                        populationName=p,\n                        alleleFrequency=ht.freq[ht.globals.freq_index_dict[p]].AF,\n                    )\n                ),\n                vep=hl.struct(\n                    mostSevereConsequence=ht.vep.most_severe_consequence,\n                    transcriptConsequences=hl.map(\n                        lambda x: hl.struct(\n                           
 aminoAcids=x.amino_acids,\n                            consequenceTerms=x.consequence_terms,\n                            geneId=x.gene_id,\n                            lof=x.lof,\n                            polyphenScore=x.polyphen_score,\n                            polyphenPrediction=x.polyphen_prediction,\n                            siftScore=x.sift_score,\n                            siftPrediction=x.sift_prediction,\n                        ),\n                        # Only keeping canonical transcripts\n                        ht.vep.transcript_consequences.filter(\n                            lambda x: (x.canonical == 1)\n                            & (x.gene_symbol_source == \"HGNC\")\n                        ),\n                    ),\n                ),\n            )\n            .key_by(\"chromosome\", \"position\")\n            .drop(\"locus\", \"alleles\")\n            .select_globals()\n            .to_spark(flatten=False)\n        ),\n        _schema=VariantAnnotation.get_schema(),\n    )\n
"},{"location":"python_api/datasource/gwas_catalog/_gwas_catalog/","title":"GWAS Catalog","text":"GWAS Catalog"},{"location":"python_api/datasource/gwas_catalog/associations/","title":"Associations","text":""},{"location":"python_api/datasource/gwas_catalog/associations/#otg.datasource.gwas_catalog.associations.GWASCatalogAssociations","title":"otg.datasource.gwas_catalog.associations.GWASCatalogAssociations dataclass","text":"

Bases: StudyLocus

Study-locus dataset derived from GWAS Catalog.

Source code in src/otg/datasource/gwas_catalog/associations.py
@dataclass\nclass GWASCatalogAssociations(StudyLocus):\n    \"\"\"Study-locus dataset derived from GWAS Catalog.\"\"\"\n\n    @staticmethod\n    def _parse_pvalue(pvalue: Column) -> tuple[Column, Column]:\n        \"\"\"Parse p-value column.\n\n        Args:\n            pvalue (Column): p-value [string]\n\n        Returns:\n            tuple[Column, Column]: p-value mantissa and exponent\n\n        Example:\n            >>> import pyspark.sql.types as t\n            >>> d = [(\"1.0\"), (\"0.5\"), (\"1E-20\"), (\"3E-3\"), (\"1E-1000\")]\n            >>> df = spark.createDataFrame(d, t.StringType())\n            >>> df.select('value',*GWASCatalogAssociations._parse_pvalue(f.col('value'))).show()\n            +-------+--------------+--------------+\n            |  value|pValueMantissa|pValueExponent|\n            +-------+--------------+--------------+\n            |    1.0|           1.0|             1|\n            |    0.5|           0.5|             1|\n            |  1E-20|           1.0|           -20|\n            |   3E-3|           3.0|            -3|\n            |1E-1000|           1.0|         -1000|\n            +-------+--------------+--------------+\n            <BLANKLINE>\n\n        \"\"\"\n        split = f.split(pvalue, \"E\")\n        return split.getItem(0).cast(\"float\").alias(\"pValueMantissa\"), f.coalesce(\n            split.getItem(1).cast(\"integer\"), f.lit(1)\n        ).alias(\"pValueExponent\")\n\n    @staticmethod\n    def _normalise_pvaluetext(p_value_text: Column) -> Column:\n        \"\"\"Normalised p-value text column to a standardised format.\n\n        For cases where there is no mapping, the value is set to null.\n\n        Args:\n            p_value_text (Column): `pValueText` column from GWASCatalog\n\n        Returns:\n            Column: Array column after using GWAS Catalog mappings. 
There might be multiple mappings for a single p-value text.\n\n        Example:\n            >>> import pyspark.sql.types as t\n            >>> d = [(\"European Ancestry\"), (\"African ancestry\"), (\"Alzheimer\u2019s Disease\"), (\"(progression)\"), (\"\"), (None)]\n            >>> df = spark.createDataFrame(d, t.StringType())\n            >>> df.withColumn('normalised', GWASCatalogAssociations._normalise_pvaluetext(f.col('value'))).show()\n            +-------------------+----------+\n            |              value|normalised|\n            +-------------------+----------+\n            |  European Ancestry|      [EA]|\n            |   African ancestry|      [AA]|\n            |Alzheimer\u2019s Disease|      [AD]|\n            |      (progression)|      null|\n            |                   |      null|\n            |               null|      null|\n            +-------------------+----------+\n            <BLANKLINE>\n\n        \"\"\"\n        # GWAS Catalog to p-value mapping\n        json_dict = json.loads(\n            pkg_resources.read_text(data, \"gwas_pValueText_map.json\", encoding=\"utf-8\")\n        )\n        map_expr = f.create_map(*[f.lit(x) for x in chain(*json_dict.items())])\n\n        splitted_col = f.split(f.regexp_replace(p_value_text, r\"[\\(\\)]\", \"\"), \",\")\n        mapped_col = f.transform(splitted_col, lambda x: map_expr[x])\n        return f.when(f.forall(mapped_col, lambda x: x.isNull()), None).otherwise(\n            mapped_col\n        )\n\n    @staticmethod\n    def _normalise_risk_allele(risk_allele: Column) -> Column:\n        \"\"\"Normalised risk allele column to a standardised format.\n\n        If multiple risk alleles are present, the first one is returned.\n\n        Args:\n            risk_allele (Column): `riskAllele` column from GWASCatalog\n\n        Returns:\n            Column: mapped using GWAS Catalog mapping\n\n        Example:\n            >>> import pyspark.sql.types as t\n            >>> d = [(\"rs1234-A-G\"), (\"rs1234-A\"), (\"rs1234-A; rs1235-G\")]\n            >>> df = spark.createDataFrame(d, t.StringType())\n            >>> df.withColumn('normalised', GWASCatalogAssociations._normalise_risk_allele(f.col('value'))).show()\n            +------------------+----------+\n            |             value|normalised|\n            +------------------+----------+\n            |        rs1234-A-G|         A|\n            |          rs1234-A|         A|\n            |rs1234-A; rs1235-G|         A|\n            +------------------+----------+\n            <BLANKLINE>\n\n        \"\"\"\n        # GWAS Catalog to risk allele mapping\n        return f.split(f.split(risk_allele, \"; \").getItem(0), \"-\").getItem(1)\n\n    @staticmethod\n    def _collect_rsids(\n        snp_id: Column, snp_id_current: Column, risk_allele: Column\n    ) -> Column:\n        \"\"\"It takes three columns, and returns an array of distinct values from those columns.\n\n        Args:\n            snp_id (Column): The original snp id from the GWAS catalog.\n            snp_id_current (Column): The current snp id field is just a number at the moment (stored as a string). Adding 'rs' prefix if looks good.\n            risk_allele (Column): The risk allele for the SNP.\n\n        Returns:\n            Column: An array of distinct values.\n        \"\"\"\n        # The current snp id field is just a number at the moment (stored as a string). 
Adding 'rs' prefix if looks good.\n        snp_id_current = f.when(\n            snp_id_current.rlike(\"^[0-9]*$\"),\n            f.format_string(\"rs%s\", snp_id_current),\n        )\n        # Cleaning risk allele:\n        risk_allele = f.split(risk_allele, \"-\").getItem(0)\n\n        # Collecting all values:\n        return f.array_distinct(f.array(snp_id, snp_id_current, risk_allele))\n\n    @staticmethod\n    def _map_to_variant_annotation_variants(\n        gwas_associations: DataFrame, variant_annotation: VariantAnnotation\n    ) -> DataFrame:\n        \"\"\"Add variant metadata in associations.\n\n        Args:\n            gwas_associations (DataFrame): raw GWAS Catalog associations\n            variant_annotation (VariantAnnotation): variant annotation dataset\n\n        Returns:\n            DataFrame: GWAS Catalog associations data including `variantId`, `referenceAllele`,\n            `alternateAllele`, `chromosome`, `position` with variant metadata\n        \"\"\"\n        # Subset of GWAS Catalog associations required for resolving variant IDs:\n        gwas_associations_subset = gwas_associations.select(\n            \"studyLocusId\",\n            f.col(\"CHR_ID\").alias(\"chromosome\"),\n            f.col(\"CHR_POS\").cast(IntegerType()).alias(\"position\"),\n            # List of all SNPs associated with the variant\n            GWASCatalogAssociations._collect_rsids(\n                f.split(f.col(\"SNPS\"), \"; \").getItem(0),\n                f.col(\"SNP_ID_CURRENT\"),\n                f.split(f.col(\"STRONGEST SNP-RISK ALLELE\"), \"; \").getItem(0),\n            ).alias(\"rsIdsGwasCatalog\"),\n            GWASCatalogAssociations._normalise_risk_allele(\n                f.col(\"STRONGEST SNP-RISK ALLELE\")\n            ).alias(\"riskAllele\"),\n        )\n\n        # Subset of variant annotation required for GWAS Catalog annotations:\n        va_subset = variant_annotation.df.select(\n            \"variantId\",\n            \"chromosome\",\n            \"position\",\n            f.col(\"rsIds\").alias(\"rsIdsGnomad\"),\n            \"referenceAllele\",\n            \"alternateAllele\",\n            \"alleleFrequencies\",\n            variant_annotation.max_maf().alias(\"maxMaf\"),\n        ).join(\n            f.broadcast(\n                gwas_associations_subset.select(\"chromosome\", \"position\").distinct()\n            ),\n            on=[\"chromosome\", \"position\"],\n            how=\"inner\",\n        )\n\n        # Semi-resolved ids (still contains duplicates when conclusion was not possible to make\n        # based on rsIds or allele concordance)\n        filtered_associations = (\n            gwas_associations_subset.join(\n                f.broadcast(va_subset),\n                on=[\"chromosome\", \"position\"],\n                how=\"left\",\n            )\n            .withColumn(\n                \"rsIdFilter\",\n                GWASCatalogAssociations._flag_mappings_to_retain(\n                    f.col(\"studyLocusId\"),\n                    GWASCatalogAssociations._compare_rsids(\n                        f.col(\"rsIdsGnomad\"), f.col(\"rsIdsGwasCatalog\")\n                    ),\n                ),\n            )\n            .withColumn(\n                \"concordanceFilter\",\n                GWASCatalogAssociations._flag_mappings_to_retain(\n                    f.col(\"studyLocusId\"),\n                    GWASCatalogAssociations._check_concordance(\n                        f.col(\"riskAllele\"),\n                        
f.col(\"referenceAllele\"),\n                        f.col(\"alternateAllele\"),\n                    ),\n                ),\n            )\n            .filter(\n                # Filter out rows where GWAS Catalog rsId does not match with GnomAD rsId,\n                # but there is corresponding variant for the same association\n                f.col(\"rsIdFilter\")\n                # or filter out rows where GWAS Catalog alleles are not concordant with GnomAD alleles,\n                # but there is corresponding variant for the same association\n                | f.col(\"concordanceFilter\")\n            )\n        )\n\n        # Keep only highest maxMaf variant per studyLocusId\n        fully_mapped_associations = get_record_with_maximum_value(\n            filtered_associations, grouping_col=\"studyLocusId\", sorting_col=\"maxMaf\"\n        ).select(\n            \"studyLocusId\",\n            \"variantId\",\n            \"referenceAllele\",\n            \"alternateAllele\",\n            \"chromosome\",\n            \"position\",\n        )\n\n        return gwas_associations.join(\n            fully_mapped_associations, on=\"studyLocusId\", how=\"left\"\n        )\n\n    @staticmethod\n    def _compare_rsids(gnomad: Column, gwas: Column) -> Column:\n        \"\"\"If the intersection of the two arrays is greater than 0, return True, otherwise return False.\n\n        Args:\n            gnomad (Column): rsids from gnomad\n            gwas (Column): rsids from the GWAS Catalog\n\n        Returns:\n            Column: A boolean column that is true if the GnomAD rsIDs can be found in the GWAS rsIDs.\n\n        Examples:\n            >>> d = [\n            ...    (1, [\"rs123\", \"rs523\"], [\"rs123\"]),\n            ...    (2, [], [\"rs123\"]),\n            ...    (3, [\"rs123\", \"rs523\"], []),\n            ...    (4, [], []),\n            ... ]\n            >>> df = spark.createDataFrame(d, ['associationId', 'gnomad', 'gwas'])\n            >>> df.withColumn(\"rsid_matches\", GWASCatalogAssociations._compare_rsids(f.col(\"gnomad\"),f.col('gwas'))).show()\n            +-------------+--------------+-------+------------+\n            |associationId|        gnomad|   gwas|rsid_matches|\n            +-------------+--------------+-------+------------+\n            |            1|[rs123, rs523]|[rs123]|        true|\n            |            2|            []|[rs123]|       false|\n            |            3|[rs123, rs523]|     []|       false|\n            |            4|            []|     []|       false|\n            +-------------+--------------+-------+------------+\n            <BLANKLINE>\n\n        \"\"\"\n        return f.when(f.size(f.array_intersect(gnomad, gwas)) > 0, True).otherwise(\n            False\n        )\n\n    @staticmethod\n    def _flag_mappings_to_retain(\n        association_id: Column, filter_column: Column\n    ) -> Column:\n        \"\"\"Flagging mappings to drop for each association.\n\n        Some associations have multiple mappings. Some has matching rsId others don't. We only\n        want to drop the non-matching mappings, when a matching is available for the given association.\n        This logic can be generalised for other measures eg. allele concordance.\n\n        Args:\n            association_id (Column): association identifier column\n            filter_column (Column): boolean col indicating to keep a mapping\n\n        Returns:\n            Column: A column with a boolean value.\n\n        Examples:\n        >>> d = [\n        ...    
(1, False),\n        ...    (1, False),\n        ...    (2, False),\n        ...    (2, True),\n        ...    (3, True),\n        ...    (3, True),\n        ... ]\n        >>> df = spark.createDataFrame(d, ['associationId', 'filter'])\n        >>> df.withColumn(\"isConcordant\", GWASCatalogAssociations._flag_mappings_to_retain(f.col(\"associationId\"),f.col('filter'))).show()\n        +-------------+------+------------+\n        |associationId|filter|isConcordant|\n        +-------------+------+------------+\n        |            1| false|        true|\n        |            1| false|        true|\n        |            2| false|       false|\n        |            2|  true|        true|\n        |            3|  true|        true|\n        |            3|  true|        true|\n        +-------------+------+------------+\n        <BLANKLINE>\n\n        \"\"\"\n        w = Window.partitionBy(association_id)\n\n        # Generating a boolean column informing if the filter column contains true anywhere for the association:\n        aggregated_filter = f.when(\n            f.array_contains(f.collect_set(filter_column).over(w), True), True\n        ).otherwise(False)\n\n        # Generate a filter column:\n        return f.when(aggregated_filter & (~filter_column), False).otherwise(True)\n\n    @staticmethod\n    def _check_concordance(\n        risk_allele: Column, reference_allele: Column, alternate_allele: Column\n    ) -> Column:\n        \"\"\"A function to check if the risk allele is concordant with the alt or ref allele.\n\n        If the risk allele is the same as the reference or alternate allele, or if the reverse complement of\n        the risk allele is the same as the reference or alternate allele, then the allele is concordant.\n        If no mapping is available (ref/alt is null), the function returns True.\n\n        Args:\n            risk_allele (Column): The allele that is associated with the risk of the disease.\n            reference_allele (Column): The reference allele from the GWAS catalog\n            alternate_allele (Column): The alternate allele of the variant.\n\n        Returns:\n            Column: A boolean column that is True if the risk allele is the same as the reference or alternate allele,\n            or if the reverse complement of the risk allele is the same as the reference or alternate allele.\n\n        Examples:\n            >>> d = [\n            ...     ('A', 'A', 'G'),\n            ...     ('A', 'T', 'G'),\n            ...     ('A', 'C', 'G'),\n            ...     ('A', 'A', '?'),\n            ...     (None, None, 'A'),\n            ... 
]\n            >>> df = spark.createDataFrame(d, ['riskAllele', 'referenceAllele', 'alternateAllele'])\n            >>> df.withColumn(\"isConcordant\", GWASCatalogAssociations._check_concordance(f.col(\"riskAllele\"),f.col('referenceAllele'), f.col('alternateAllele'))).show()\n            +----------+---------------+---------------+------------+\n            |riskAllele|referenceAllele|alternateAllele|isConcordant|\n            +----------+---------------+---------------+------------+\n            |         A|              A|              G|        true|\n            |         A|              T|              G|        true|\n            |         A|              C|              G|       false|\n            |         A|              A|              ?|        true|\n            |      null|           null|              A|        true|\n            +----------+---------------+---------------+------------+\n            <BLANKLINE>\n\n        \"\"\"\n        # Calculating the reverse complement of the risk allele:\n        risk_allele_reverse_complement = f.when(\n            risk_allele.rlike(r\"^[ACTG]+$\"),\n            f.reverse(f.translate(risk_allele, \"ACTG\", \"TGAC\")),\n        ).otherwise(risk_allele)\n\n        # OK, is the risk allele or the reverse complent is the same as the mapped alleles:\n        return (\n            f.when(\n                (risk_allele == reference_allele) | (risk_allele == alternate_allele),\n                True,\n            )\n            # If risk allele is found on the negative strand:\n            .when(\n                (risk_allele_reverse_complement == reference_allele)\n                | (risk_allele_reverse_complement == alternate_allele),\n                True,\n            )\n            # If risk allele is ambiguous, still accepted: < This condition could be reconsidered\n            .when(risk_allele == \"?\", True)\n            # If the association could not be mapped we keep it:\n            .when(reference_allele.isNull(), True)\n            # Allele is discordant:\n            .otherwise(False)\n        )\n\n    @staticmethod\n    def _get_reverse_complement(allele_col: Column) -> Column:\n        \"\"\"A function to return the reverse complement of an allele column.\n\n        It takes a string and returns the reverse complement of that string if it's a DNA sequence,\n        otherwise it returns the original string. 
Assumes alleles in upper case.\n\n        Args:\n            allele_col (Column): The column containing the allele to reverse complement.\n\n        Returns:\n            Column: A column that is the reverse complement of the allele column.\n\n        Examples:\n            >>> d = [{\"allele\": 'A'}, {\"allele\": 'T'},{\"allele\": 'G'}, {\"allele\": 'C'},{\"allele\": 'AC'}, {\"allele\": 'GTaatc'},{\"allele\": '?'}, {\"allele\": None}]\n            >>> df = spark.createDataFrame(d)\n            >>> df.withColumn(\"revcom_allele\", GWASCatalogAssociations._get_reverse_complement(f.col(\"allele\"))).show()\n            +------+-------------+\n            |allele|revcom_allele|\n            +------+-------------+\n            |     A|            T|\n            |     T|            A|\n            |     G|            C|\n            |     C|            G|\n            |    AC|           GT|\n            |GTaatc|       GATTAC|\n            |     ?|            ?|\n            |  null|         null|\n            +------+-------------+\n            <BLANKLINE>\n\n        \"\"\"\n        allele_col = f.upper(allele_col)\n        return f.when(\n            allele_col.rlike(\"[ACTG]+\"),\n            f.reverse(f.translate(allele_col, \"ACTG\", \"TGAC\")),\n        ).otherwise(allele_col)\n\n    @staticmethod\n    def _effect_needs_harmonisation(\n        risk_allele: Column, reference_allele: Column\n    ) -> Column:\n        \"\"\"A function to check if the effect allele needs to be harmonised.\n\n        Args:\n            risk_allele (Column): Risk allele column\n            reference_allele (Column): Effect allele column\n\n        Returns:\n            Column: A boolean column indicating if the effect allele needs to be harmonised.\n\n        Examples:\n            >>> d = [{\"risk\": 'A', \"reference\": 'A'}, {\"risk\": 'A', \"reference\": 'T'}, {\"risk\": 'AT', \"reference\": 'TA'}, {\"risk\": 'AT', \"reference\": 'AT'}]\n            >>> df = spark.createDataFrame(d)\n            >>> df.withColumn(\"needs_harmonisation\", GWASCatalogAssociations._effect_needs_harmonisation(f.col(\"risk\"), f.col(\"reference\"))).show()\n            +---------+----+-------------------+\n            |reference|risk|needs_harmonisation|\n            +---------+----+-------------------+\n            |        A|   A|               true|\n            |        T|   A|               true|\n            |       TA|  AT|              false|\n            |       AT|  AT|               true|\n            +---------+----+-------------------+\n            <BLANKLINE>\n\n        \"\"\"\n        return (risk_allele == reference_allele) | (\n            risk_allele\n            == GWASCatalogAssociations._get_reverse_complement(reference_allele)\n        )\n\n    @staticmethod\n    def _are_alleles_palindromic(\n        reference_allele: Column, alternate_allele: Column\n    ) -> Column:\n        \"\"\"A function to check if the alleles are palindromic.\n\n        Args:\n            reference_allele (Column): Reference allele column\n            alternate_allele (Column): Alternate allele column\n\n        Returns:\n            Column: A boolean column indicating if the alleles are palindromic.\n\n        Examples:\n            >>> d = [{\"reference\": 'A', \"alternate\": 'T'}, {\"reference\": 'AT', \"alternate\": 'AG'}, {\"reference\": 'AT', \"alternate\": 'AT'}, {\"reference\": 'CATATG', \"alternate\": 'CATATG'}, {\"reference\": '-', \"alternate\": None}]\n            >>> df = spark.createDataFrame(d)\n            >>> 
df.withColumn(\"is_palindromic\", GWASCatalogAssociations._are_alleles_palindromic(f.col(\"reference\"), f.col(\"alternate\"))).show()\n            +---------+---------+--------------+\n            |alternate|reference|is_palindromic|\n            +---------+---------+--------------+\n            |        T|        A|          true|\n            |       AG|       AT|         false|\n            |       AT|       AT|          true|\n            |   CATATG|   CATATG|          true|\n            |     null|        -|         false|\n            +---------+---------+--------------+\n            <BLANKLINE>\n\n        \"\"\"\n        revcomp = GWASCatalogAssociations._get_reverse_complement(alternate_allele)\n        return (\n            f.when(reference_allele == revcomp, True)\n            .when(revcomp.isNull(), False)\n            .otherwise(False)\n        )\n\n    @staticmethod\n    def _harmonise_beta(\n        risk_allele: Column,\n        reference_allele: Column,\n        alternate_allele: Column,\n        effect_size: Column,\n        confidence_interval: Column,\n    ) -> Column:\n        \"\"\"A function to extract the beta value from the effect size and confidence interval.\n\n        If the confidence interval contains the word \"increase\" or \"decrease\" it indicates, we are dealing with betas.\n        If it's \"increase\" and the effect size needs to be harmonized, then multiply the effect size by -1\n\n        Args:\n            risk_allele (Column): Risk allele column\n            reference_allele (Column): Reference allele column\n            alternate_allele (Column): Alternate allele column\n            effect_size (Column): GWAS Catalog effect size column\n            confidence_interval (Column): GWAS Catalog confidence interval column\n\n        Returns:\n            Column: A column containing the beta value.\n        \"\"\"\n        return (\n            f.when(\n                GWASCatalogAssociations._are_alleles_palindromic(\n                    reference_allele, alternate_allele\n                ),\n                None,\n            )\n            .when(\n                (\n                    GWASCatalogAssociations._effect_needs_harmonisation(\n                        risk_allele, reference_allele\n                    )\n                    & confidence_interval.contains(\"increase\")\n                )\n                | (\n                    ~GWASCatalogAssociations._effect_needs_harmonisation(\n                        risk_allele, reference_allele\n                    )\n                    & confidence_interval.contains(\"decrease\")\n                ),\n                -effect_size,\n            )\n            .otherwise(effect_size)\n            .cast(DoubleType())\n        )\n\n    @staticmethod\n    def _harmonise_beta_ci(\n        risk_allele: Column,\n        reference_allele: Column,\n        alternate_allele: Column,\n        effect_size: Column,\n        confidence_interval: Column,\n        p_value: Column,\n        direction: str,\n    ) -> Column:\n        \"\"\"Calculating confidence intervals for beta values.\n\n        Args:\n            risk_allele (Column): Risk allele column\n            reference_allele (Column): Reference allele column\n            alternate_allele (Column): Alternate allele column\n            effect_size (Column): GWAS Catalog effect size column\n            confidence_interval (Column): GWAS Catalog confidence interval column\n            p_value (Column): GWAS Catalog p-value column\n            direction (str): 
This is the direction of the confidence interval. It can be either \"upper\" or \"lower\".\n\n        Returns:\n            Column: The upper and lower bounds of the confidence interval for the beta coefficient.\n        \"\"\"\n        zscore_95 = f.lit(1.96)\n        beta = GWASCatalogAssociations._harmonise_beta(\n            risk_allele,\n            reference_allele,\n            alternate_allele,\n            effect_size,\n            confidence_interval,\n        )\n        zscore = pvalue_to_zscore(p_value)\n        return (\n            f.when(f.lit(direction) == \"upper\", beta + f.abs(zscore_95 * beta) / zscore)\n            .when(f.lit(direction) == \"lower\", beta - f.abs(zscore_95 * beta) / zscore)\n            .otherwise(None)\n        )\n\n    @staticmethod\n    def _harmonise_odds_ratio(\n        risk_allele: Column,\n        reference_allele: Column,\n        alternate_allele: Column,\n        effect_size: Column,\n        confidence_interval: Column,\n    ) -> Column:\n        \"\"\"Harmonizing odds ratio.\n\n        Args:\n            risk_allele (Column): Risk allele column\n            reference_allele (Column): Reference allele column\n            alternate_allele (Column): Alternate allele column\n            effect_size (Column): GWAS Catalog effect size column\n            confidence_interval (Column): GWAS Catalog confidence interval column\n\n        Returns:\n            Column: A column with the odds ratio, or 1/odds_ratio if harmonization required.\n        \"\"\"\n        return (\n            f.when(\n                GWASCatalogAssociations._are_alleles_palindromic(\n                    reference_allele, alternate_allele\n                ),\n                None,\n            )\n            .when(\n                (\n                    GWASCatalogAssociations._effect_needs_harmonisation(\n                        risk_allele, reference_allele\n                    )\n                    & ~confidence_interval.rlike(\"|\".join([\"decrease\", \"increase\"]))\n                ),\n                1 / effect_size,\n            )\n            .otherwise(effect_size)\n            .cast(DoubleType())\n        )\n\n    @staticmethod\n    def _harmonise_odds_ratio_ci(\n        risk_allele: Column,\n        reference_allele: Column,\n        alternate_allele: Column,\n        effect_size: Column,\n        confidence_interval: Column,\n        p_value: Column,\n        direction: str,\n    ) -> Column:\n        \"\"\"Calculating confidence intervals for beta values.\n\n        Args:\n            risk_allele (Column): Risk allele column\n            reference_allele (Column): Reference allele column\n            alternate_allele (Column): Alternate allele column\n            effect_size (Column): GWAS Catalog effect size column\n            confidence_interval (Column): GWAS Catalog confidence interval column\n            p_value (Column): GWAS Catalog p-value column\n            direction (str): This is the direction of the confidence interval. 
It can be either \"upper\" or \"lower\".\n\n        Returns:\n            Column: The upper and lower bounds of the 95% confidence interval for the odds ratio.\n        \"\"\"\n        zscore_95 = f.lit(1.96)\n        odds_ratio = GWASCatalogAssociations._harmonise_odds_ratio(\n            risk_allele,\n            reference_allele,\n            alternate_allele,\n            effect_size,\n            confidence_interval,\n        )\n        odds_ratio_estimate = f.log(odds_ratio)\n        zscore = pvalue_to_zscore(p_value)\n        odds_ratio_se = odds_ratio_estimate / zscore\n        return f.when(\n            f.lit(direction) == \"upper\",\n            f.exp(odds_ratio_estimate + f.abs(zscore_95 * odds_ratio_se)),\n        ).when(\n            f.lit(direction) == \"lower\",\n            f.exp(odds_ratio_estimate - f.abs(zscore_95 * odds_ratio_se)),\n        )\n\n    @staticmethod\n    def _concatenate_substudy_description(\n        association_trait: Column, pvalue_text: Column, mapped_trait_uri: Column\n    ) -> Column:\n        \"\"\"Substudy description parsing. Complex string containing metadata about the substudy (e.g. QTL, specific EFO, etc.).\n\n        Args:\n            association_trait (Column): GWAS Catalog association trait column\n            pvalue_text (Column): GWAS Catalog p-value text column\n            mapped_trait_uri (Column): GWAS Catalog mapped trait URI column\n\n        Returns:\n            Column: A column with the substudy description in the shape trait|pvaluetext1_pvaluetext2|EFO1_EFO2.\n\n        Examples:\n        >>> df = spark.createDataFrame([\n        ...    (\"Height\", \"http://www.ebi.ac.uk/efo/EFO_0000001,http://www.ebi.ac.uk/efo/EFO_0000002\", \"European Ancestry\"),\n        ...    (\"Schizophrenia\", \"http://www.ebi.ac.uk/efo/MONDO_0005090\", None)],\n        ...    [\"association_trait\", \"mapped_trait_uri\", \"pvalue_text\"]\n        ... 
)\n        >>> df.withColumn('substudy_description', GWASCatalogAssociations._concatenate_substudy_description(df.association_trait, df.pvalue_text, df.mapped_trait_uri)).show(truncate=False)\n        +-----------------+-------------------------------------------------------------------------+-----------------+------------------------------------------+\n        |association_trait|mapped_trait_uri                                                         |pvalue_text      |substudy_description                      |\n        +-----------------+-------------------------------------------------------------------------+-----------------+------------------------------------------+\n        |Height           |http://www.ebi.ac.uk/efo/EFO_0000001,http://www.ebi.ac.uk/efo/EFO_0000002|European Ancestry|Height|EA|EFO_0000001/EFO_0000002         |\n        |Schizophrenia    |http://www.ebi.ac.uk/efo/MONDO_0005090                                   |null             |Schizophrenia|no_pvalue_text|MONDO_0005090|\n        +-----------------+-------------------------------------------------------------------------+-----------------+------------------------------------------+\n        <BLANKLINE>\n        \"\"\"\n        p_value_text = f.coalesce(\n            GWASCatalogAssociations._normalise_pvaluetext(pvalue_text),\n            f.array(f.lit(\"no_pvalue_text\")),\n        )\n        return f.concat_ws(\n            \"|\",\n            association_trait,\n            f.concat_ws(\n                \"/\",\n                p_value_text,\n            ),\n            f.concat_ws(\n                \"/\",\n                parse_efos(mapped_trait_uri),\n            ),\n        )\n\n    @staticmethod\n    def _qc_all(\n        qc: Column,\n        chromosome: Column,\n        position: Column,\n        reference_allele: Column,\n        alternate_allele: Column,\n        strongest_snp_risk_allele: Column,\n        p_value_mantissa: Column,\n        p_value_exponent: Column,\n        p_value_cutoff: float,\n    ) -> Column:\n        \"\"\"Flag associations that fail any QC.\n\n        Args:\n            qc (Column): QC column\n            chromosome (Column): Chromosome column\n            position (Column): Position column\n            reference_allele (Column): Reference allele column\n            alternate_allele (Column): Alternate allele column\n            strongest_snp_risk_allele (Column): Strongest SNP risk allele column\n            p_value_mantissa (Column): P-value mantissa column\n            p_value_exponent (Column): P-value exponent column\n            p_value_cutoff (float): P-value cutoff\n\n        Returns:\n            Column: Updated QC column with flag.\n        \"\"\"\n        qc = GWASCatalogAssociations._qc_variant_interactions(\n            qc, strongest_snp_risk_allele\n        )\n        qc = GWASCatalogAssociations._qc_subsignificant_associations(\n            qc, p_value_mantissa, p_value_exponent, p_value_cutoff\n        )\n        qc = GWASCatalogAssociations._qc_genomic_location(qc, chromosome, position)\n        qc = GWASCatalogAssociations._qc_variant_inconsistencies(\n            qc, chromosome, position, strongest_snp_risk_allele\n        )\n        qc = GWASCatalogAssociations._qc_unmapped_variants(qc, alternate_allele)\n        qc = GWASCatalogAssociations._qc_palindromic_alleles(\n            qc, reference_allele, alternate_allele\n        )\n        return qc\n\n    @staticmethod\n    def _qc_variant_interactions(\n        qc: Column, strongest_snp_risk_allele: Column\n    ) 
-> Column:\n        \"\"\"Flag associations based on variant x variant interactions.\n\n        Args:\n            qc (Column): QC column\n            strongest_snp_risk_allele (Column): Column with the strongest SNP risk allele\n\n        Returns:\n            Column: Updated QC column with flag.\n        \"\"\"\n        return GWASCatalogAssociations._update_quality_flag(\n            qc,\n            strongest_snp_risk_allele.contains(\";\"),\n            StudyLocusQualityCheck.COMPOSITE_FLAG,\n        )\n\n    @staticmethod\n    def _qc_subsignificant_associations(\n        qc: Column,\n        p_value_mantissa: Column,\n        p_value_exponent: Column,\n        pvalue_cutoff: float,\n    ) -> Column:\n        \"\"\"Flag associations below significant threshold.\n\n        Args:\n            qc (Column): QC column\n            p_value_mantissa (Column): P-value mantissa column\n            p_value_exponent (Column): P-value exponent column\n            pvalue_cutoff (float): association p-value cut-off\n\n        Returns:\n            Column: Updated QC column with flag.\n\n        Examples:\n            >>> import pyspark.sql.types as t\n            >>> d = [{'qc': None, 'p_value_mantissa': 1, 'p_value_exponent': -7}, {'qc': None, 'p_value_mantissa': 1, 'p_value_exponent': -8}, {'qc': None, 'p_value_mantissa': 5, 'p_value_exponent': -8}, {'qc': None, 'p_value_mantissa': 1, 'p_value_exponent': -9}]\n            >>> df = spark.createDataFrame(d, t.StructType([t.StructField('qc', t.ArrayType(t.StringType()), True), t.StructField('p_value_mantissa', t.IntegerType()), t.StructField('p_value_exponent', t.IntegerType())]))\n            >>> df.withColumn('qc', GWASCatalogAssociations._qc_subsignificant_associations(f.col(\"qc\"), f.col(\"p_value_mantissa\"), f.col(\"p_value_exponent\"), 5e-8)).show(truncate = False)\n            +------------------------+----------------+----------------+\n            |qc                      |p_value_mantissa|p_value_exponent|\n            +------------------------+----------------+----------------+\n            |[Subsignificant p-value]|1               |-7              |\n            |[]                      |1               |-8              |\n            |[]                      |5               |-8              |\n            |[]                      |1               |-9              |\n            +------------------------+----------------+----------------+\n            <BLANKLINE>\n\n        \"\"\"\n        return StudyLocus._update_quality_flag(\n            qc,\n            calculate_neglog_pvalue(p_value_mantissa, p_value_exponent)\n            < f.lit(-np.log10(pvalue_cutoff)),\n            StudyLocusQualityCheck.SUBSIGNIFICANT_FLAG,\n        )\n\n    @staticmethod\n    def _qc_genomic_location(\n        qc: Column, chromosome: Column, position: Column\n    ) -> Column:\n        \"\"\"Flag associations without genomic location in GWAS Catalog.\n\n        Args:\n            qc (Column): QC column\n            chromosome (Column): Chromosome column in GWAS Catalog\n            position (Column): Position column in GWAS Catalog\n\n        Returns:\n            Column: Updated QC column with flag.\n\n        Examples:\n            >>> import pyspark.sql.types as t\n            >>> d = [{'qc': None, 'chromosome': None, 'position': None}, {'qc': None, 'chromosome': '1', 'position': None}, {'qc': None, 'chromosome': None, 'position': 1}, {'qc': None, 'chromosome': '1', 'position': 1}]\n            >>> df = spark.createDataFrame(d, 
schema=t.StructType([t.StructField('qc', t.ArrayType(t.StringType()), True), t.StructField('chromosome', t.StringType()), t.StructField('position', t.IntegerType())]))\n            >>> df.withColumn('qc', GWASCatalogAssociations._qc_genomic_location(df.qc, df.chromosome, df.position)).show(truncate=False)\n            +----------------------------+----------+--------+\n            |qc                          |chromosome|position|\n            +----------------------------+----------+--------+\n            |[Incomplete genomic mapping]|null      |null    |\n            |[Incomplete genomic mapping]|1         |null    |\n            |[Incomplete genomic mapping]|null      |1       |\n            |[]                          |1         |1       |\n            +----------------------------+----------+--------+\n            <BLANKLINE>\n\n        \"\"\"\n        return StudyLocus._update_quality_flag(\n            qc,\n            position.isNull() | chromosome.isNull(),\n            StudyLocusQualityCheck.NO_GENOMIC_LOCATION_FLAG,\n        )\n\n    @staticmethod\n    def _qc_variant_inconsistencies(\n        qc: Column,\n        chromosome: Column,\n        position: Column,\n        strongest_snp_risk_allele: Column,\n    ) -> Column:\n        \"\"\"Flag associations with inconsistencies in the variant annotation.\n\n        Args:\n            qc (Column): QC column\n            chromosome (Column): Chromosome column in GWAS Catalog\n            position (Column): Position column in GWAS Catalog\n            strongest_snp_risk_allele (Column): Strongest SNP risk allele column in GWAS Catalog\n\n        Returns:\n            Column: Updated QC column with flag.\n        \"\"\"\n        return GWASCatalogAssociations._update_quality_flag(\n            qc,\n            # Number of chromosomes does not correspond to the number of positions:\n            (f.size(f.split(chromosome, \";\")) != f.size(f.split(position, \";\")))\n            # Number of chromosome values different from riskAllele values:\n            | (\n                f.size(f.split(chromosome, \";\"))\n                != f.size(f.split(strongest_snp_risk_allele, \";\"))\n            ),\n            StudyLocusQualityCheck.INCONSISTENCY_FLAG,\n        )\n\n    @staticmethod\n    def _qc_unmapped_variants(qc: Column, alternate_allele: Column) -> Column:\n        \"\"\"Flag associations with variants not mapped to variantAnnotation.\n\n        Args:\n            qc (Column): QC column\n            alternate_allele (Column): alternate allele\n\n        Returns:\n            Column: Updated QC column with flag.\n\n        Example:\n            >>> import pyspark.sql.types as t\n            >>> d = [{'alternate_allele': 'A', 'qc': None}, {'alternate_allele': None, 'qc': None}]\n            >>> schema = t.StructType([t.StructField('alternate_allele', t.StringType(), True), t.StructField('qc', t.ArrayType(t.StringType()), True)])\n            >>> df = spark.createDataFrame(data=d, schema=schema)\n            >>> df.withColumn(\"new_qc\", GWASCatalogAssociations._qc_unmapped_variants(f.col(\"qc\"), f.col(\"alternate_allele\"))).show()\n            +----------------+----+--------------------+\n            |alternate_allele|  qc|              new_qc|\n            +----------------+----+--------------------+\n            |               A|null|                  []|\n            |            null|null|[No mapping in Gn...|\n            +----------------+----+--------------------+\n            <BLANKLINE>\n\n        \"\"\"\n        return 
GWASCatalogAssociations._update_quality_flag(\n            qc,\n            alternate_allele.isNull(),\n            StudyLocusQualityCheck.NON_MAPPED_VARIANT_FLAG,\n        )\n\n    @staticmethod\n    def _qc_palindromic_alleles(\n        qc: Column, reference_allele: Column, alternate_allele: Column\n    ) -> Column:\n        \"\"\"Flag associations with palindromic variants which effects can not be harmonised.\n\n        Args:\n            qc (Column): QC column\n            reference_allele (Column): reference allele\n            alternate_allele (Column): alternate allele\n\n        Returns:\n            Column: Updated QC column with flag.\n\n        Example:\n            >>> import pyspark.sql.types as t\n            >>> schema = t.StructType([t.StructField('reference_allele', t.StringType(), True), t.StructField('alternate_allele', t.StringType(), True), t.StructField('qc', t.ArrayType(t.StringType()), True)])\n            >>> d = [{'reference_allele': 'A', 'alternate_allele': 'T', 'qc': None}, {'reference_allele': 'AT', 'alternate_allele': 'TA', 'qc': None}, {'reference_allele': 'AT', 'alternate_allele': 'AT', 'qc': None}]\n            >>> df = spark.createDataFrame(data=d, schema=schema)\n            >>> df.withColumn(\"qc\", GWASCatalogAssociations._qc_palindromic_alleles(f.col(\"qc\"), f.col(\"reference_allele\"), f.col(\"alternate_allele\"))).show(truncate=False)\n            +----------------+----------------+---------------------------------------+\n            |reference_allele|alternate_allele|qc                                     |\n            +----------------+----------------+---------------------------------------+\n            |A               |T               |[Palindrome alleles - cannot harmonize]|\n            |AT              |TA              |[]                                     |\n            |AT              |AT              |[Palindrome alleles - cannot harmonize]|\n            +----------------+----------------+---------------------------------------+\n            <BLANKLINE>\n\n        \"\"\"\n        return StudyLocus._update_quality_flag(\n            qc,\n            GWASCatalogAssociations._are_alleles_palindromic(\n                reference_allele, alternate_allele\n            ),\n            StudyLocusQualityCheck.PALINDROMIC_ALLELE_FLAG,\n        )\n\n    @classmethod\n    def from_source(\n        cls: type[GWASCatalogAssociations],\n        gwas_associations: DataFrame,\n        variant_annotation: VariantAnnotation,\n        pvalue_threshold: float = 5e-8,\n    ) -> GWASCatalogAssociations:\n        \"\"\"Read GWASCatalog associations.\n\n        It reads the GWAS Catalog association dataset, selects and renames columns, casts columns, and\n        applies some pre-defined filters on the data:\n\n        Args:\n            gwas_associations (DataFrame): GWAS Catalog raw associations dataset\n            variant_annotation (VariantAnnotation): Variant annotation dataset\n            pvalue_threshold (float): P-value threshold for flagging associations\n\n        Returns:\n            GWASCatalogAssociations: GWASCatalogAssociations dataset\n        \"\"\"\n        return GWASCatalogAssociations(\n            _df=gwas_associations.withColumn(\n                \"studyLocusId\", f.monotonically_increasing_id().cast(LongType())\n            )\n            .transform(\n                # Map/harmonise variants to variant annotation dataset:\n                # This function adds columns: variantId, referenceAllele, alternateAllele, chromosome, 
position\n                lambda df: GWASCatalogAssociations._map_to_variant_annotation_variants(\n                    df, variant_annotation\n                )\n            )\n            .withColumn(\n                # Perform all quality control checks:\n                \"qualityControls\",\n                GWASCatalogAssociations._qc_all(\n                    f.array().alias(\"qualityControls\"),\n                    f.col(\"CHR_ID\"),\n                    f.col(\"CHR_POS\").cast(IntegerType()),\n                    f.col(\"referenceAllele\"),\n                    f.col(\"alternateAllele\"),\n                    f.col(\"STRONGEST SNP-RISK ALLELE\"),\n                    *GWASCatalogAssociations._parse_pvalue(f.col(\"P-VALUE\")),\n                    pvalue_threshold,\n                ),\n            )\n            .select(\n                # INSIDE STUDY-LOCUS SCHEMA:\n                \"studyLocusId\",\n                \"variantId\",\n                # Mapped genomic location of the variant (; separated list)\n                \"chromosome\",\n                \"position\",\n                f.col(\"STUDY ACCESSION\").alias(\"studyId\"),\n                # beta value of the association\n                GWASCatalogAssociations._harmonise_beta(\n                    GWASCatalogAssociations._normalise_risk_allele(\n                        f.col(\"STRONGEST SNP-RISK ALLELE\")\n                    ),\n                    f.col(\"referenceAllele\"),\n                    f.col(\"alternateAllele\"),\n                    f.col(\"OR or BETA\"),\n                    f.col(\"95% CI (TEXT)\"),\n                ).alias(\"beta\"),\n                # odds ratio of the association\n                GWASCatalogAssociations._harmonise_odds_ratio(\n                    GWASCatalogAssociations._normalise_risk_allele(\n                        f.col(\"STRONGEST SNP-RISK ALLELE\")\n                    ),\n                    f.col(\"referenceAllele\"),\n                    f.col(\"alternateAllele\"),\n                    f.col(\"OR or BETA\"),\n                    f.col(\"95% CI (TEXT)\"),\n                ).alias(\"oddsRatio\"),\n                # CI lower of the beta value\n                GWASCatalogAssociations._harmonise_beta_ci(\n                    GWASCatalogAssociations._normalise_risk_allele(\n                        f.col(\"STRONGEST SNP-RISK ALLELE\")\n                    ),\n                    f.col(\"referenceAllele\"),\n                    f.col(\"alternateAllele\"),\n                    f.col(\"OR or BETA\"),\n                    f.col(\"95% CI (TEXT)\"),\n                    f.col(\"P-VALUE\"),\n                    \"lower\",\n                ).alias(\"betaConfidenceIntervalLower\"),\n                # CI upper for the beta value\n                GWASCatalogAssociations._harmonise_beta_ci(\n                    GWASCatalogAssociations._normalise_risk_allele(\n                        f.col(\"STRONGEST SNP-RISK ALLELE\")\n                    ),\n                    f.col(\"referenceAllele\"),\n                    f.col(\"alternateAllele\"),\n                    f.col(\"OR or BETA\"),\n                    f.col(\"95% CI (TEXT)\"),\n                    f.col(\"P-VALUE\"),\n                    \"upper\",\n                ).alias(\"betaConfidenceIntervalUpper\"),\n                # CI lower of the odds ratio value\n                GWASCatalogAssociations._harmonise_odds_ratio_ci(\n                    GWASCatalogAssociations._normalise_risk_allele(\n                        f.col(\"STRONGEST SNP-RISK 
ALLELE\")\n                    ),\n                    f.col(\"referenceAllele\"),\n                    f.col(\"alternateAllele\"),\n                    f.col(\"OR or BETA\"),\n                    f.col(\"95% CI (TEXT)\"),\n                    f.col(\"P-VALUE\"),\n                    \"lower\",\n                ).alias(\"oddsRatioConfidenceIntervalLower\"),\n                # CI upper of the odds ratio value\n                GWASCatalogAssociations._harmonise_odds_ratio_ci(\n                    GWASCatalogAssociations._normalise_risk_allele(\n                        f.col(\"STRONGEST SNP-RISK ALLELE\")\n                    ),\n                    f.col(\"referenceAllele\"),\n                    f.col(\"alternateAllele\"),\n                    f.col(\"OR or BETA\"),\n                    f.col(\"95% CI (TEXT)\"),\n                    f.col(\"P-VALUE\"),\n                    \"upper\",\n                ).alias(\"oddsRatioConfidenceIntervalUpper\"),\n                # p-value of the association, string: split into exponent and mantissa.\n                *GWASCatalogAssociations._parse_pvalue(f.col(\"P-VALUE\")),\n                # Capturing phenotype granularity at the association level\n                GWASCatalogAssociations._concatenate_substudy_description(\n                    f.col(\"DISEASE/TRAIT\"),\n                    f.col(\"P-VALUE (TEXT)\"),\n                    f.col(\"MAPPED_TRAIT_URI\"),\n                ).alias(\"subStudyDescription\"),\n                # Quality controls (array of strings)\n                \"qualityControls\",\n            ),\n            _schema=GWASCatalogAssociations.get_schema(),\n        )\n\n    def update_study_id(\n        self: GWASCatalogAssociations, study_annotation: DataFrame\n    ) -> GWASCatalogAssociations:\n        \"\"\"Update final studyId and studyLocusId with a dataframe containing study annotation.\n\n        Args:\n            study_annotation (DataFrame): Dataframe containing `updatedStudyId` and key columns `studyId` and `subStudyDescription`.\n\n        Returns:\n            GWASCatalogAssociations: Updated study locus with new `studyId` and `studyLocusId`.\n        \"\"\"\n        self.df = (\n            self._df.join(\n                study_annotation, on=[\"studyId\", \"subStudyDescription\"], how=\"left\"\n            )\n            .withColumn(\"studyId\", f.coalesce(\"updatedStudyId\", \"studyId\"))\n            .drop(\"subStudyDescription\", \"updatedStudyId\")\n        ).withColumn(\n            \"studyLocusId\",\n            StudyLocus.assign_study_locus_id(f.col(\"studyId\"), f.col(\"variantId\")),\n        )\n        return self\n\n    def _qc_ambiguous_study(self: GWASCatalogAssociations) -> GWASCatalogAssociations:\n        \"\"\"Flag associations with variants that can not be unambiguously associated with one study.\n\n        Returns:\n            GWASCatalogAssociations: Updated study locus.\n        \"\"\"\n        assoc_ambiguity_window = Window.partitionBy(\n            f.col(\"studyId\"), f.col(\"variantId\")\n        )\n\n        self._df.withColumn(\n            \"qualityControls\",\n            StudyLocus._update_quality_flag(\n                f.col(\"qualityControls\"),\n                f.count(f.col(\"variantId\")).over(assoc_ambiguity_window) > 1,\n                StudyLocusQualityCheck.AMBIGUOUS_STUDY,\n            ),\n        )\n        return self\n
"},{"location":"python_api/datasource/gwas_catalog/associations/#otg.datasource.gwas_catalog.associations.GWASCatalogAssociations.from_source","title":"from_source(gwas_associations: DataFrame, variant_annotation: VariantAnnotation, pvalue_threshold: float = 5e-08) -> GWASCatalogAssociations classmethod","text":"

Read GWASCatalog associations.

It reads the GWAS Catalog association dataset, selects and renames columns, casts columns, and applies some pre-defined filters on the data:

Parameters:

  • gwas_associations (DataFrame): GWAS Catalog raw associations dataset [required]
  • variant_annotation (VariantAnnotation): Variant annotation dataset [required]
  • pvalue_threshold (float): P-value threshold for flagging associations [default: 5e-08]

Returns:

  • GWASCatalogAssociations: GWASCatalogAssociations dataset
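
Example (a minimal sketch): the Spark session set-up, the input path and the pre-loaded variant_annotation object below are illustrative assumptions rather than part of the documented API; only the from_source call itself is taken from this page.

from pyspark.sql import SparkSession

from otg.datasource.gwas_catalog.associations import GWASCatalogAssociations

spark = SparkSession.builder.getOrCreate()

# Raw GWAS Catalog association export, tab-separated with a header row (hypothetical path):
gwas_associations = spark.read.csv(
    "gwas_catalog_associations.tsv", sep="\t", header=True
)

# `variant_annotation` is assumed to be a VariantAnnotation dataset prepared elsewhere.
study_locus = GWASCatalogAssociations.from_source(
    gwas_associations=gwas_associations,
    variant_annotation=variant_annotation,
    pvalue_threshold=5e-8,
)
study_locus.df.printSchema()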

Source code in src/otg/datasource/gwas_catalog/associations.py
@classmethod\ndef from_source(\n    cls: type[GWASCatalogAssociations],\n    gwas_associations: DataFrame,\n    variant_annotation: VariantAnnotation,\n    pvalue_threshold: float = 5e-8,\n) -> GWASCatalogAssociations:\n    \"\"\"Read GWASCatalog associations.\n\n    It reads the GWAS Catalog association dataset, selects and renames columns, casts columns, and\n    applies some pre-defined filters on the data:\n\n    Args:\n        gwas_associations (DataFrame): GWAS Catalog raw associations dataset\n        variant_annotation (VariantAnnotation): Variant annotation dataset\n        pvalue_threshold (float): P-value threshold for flagging associations\n\n    Returns:\n        GWASCatalogAssociations: GWASCatalogAssociations dataset\n    \"\"\"\n    return GWASCatalogAssociations(\n        _df=gwas_associations.withColumn(\n            \"studyLocusId\", f.monotonically_increasing_id().cast(LongType())\n        )\n        .transform(\n            # Map/harmonise variants to variant annotation dataset:\n            # This function adds columns: variantId, referenceAllele, alternateAllele, chromosome, position\n            lambda df: GWASCatalogAssociations._map_to_variant_annotation_variants(\n                df, variant_annotation\n            )\n        )\n        .withColumn(\n            # Perform all quality control checks:\n            \"qualityControls\",\n            GWASCatalogAssociations._qc_all(\n                f.array().alias(\"qualityControls\"),\n                f.col(\"CHR_ID\"),\n                f.col(\"CHR_POS\").cast(IntegerType()),\n                f.col(\"referenceAllele\"),\n                f.col(\"alternateAllele\"),\n                f.col(\"STRONGEST SNP-RISK ALLELE\"),\n                *GWASCatalogAssociations._parse_pvalue(f.col(\"P-VALUE\")),\n                pvalue_threshold,\n            ),\n        )\n        .select(\n            # INSIDE STUDY-LOCUS SCHEMA:\n            \"studyLocusId\",\n            \"variantId\",\n            # Mapped genomic location of the variant (; separated list)\n            \"chromosome\",\n            \"position\",\n            f.col(\"STUDY ACCESSION\").alias(\"studyId\"),\n            # beta value of the association\n            GWASCatalogAssociations._harmonise_beta(\n                GWASCatalogAssociations._normalise_risk_allele(\n                    f.col(\"STRONGEST SNP-RISK ALLELE\")\n                ),\n                f.col(\"referenceAllele\"),\n                f.col(\"alternateAllele\"),\n                f.col(\"OR or BETA\"),\n                f.col(\"95% CI (TEXT)\"),\n            ).alias(\"beta\"),\n            # odds ratio of the association\n            GWASCatalogAssociations._harmonise_odds_ratio(\n                GWASCatalogAssociations._normalise_risk_allele(\n                    f.col(\"STRONGEST SNP-RISK ALLELE\")\n                ),\n                f.col(\"referenceAllele\"),\n                f.col(\"alternateAllele\"),\n                f.col(\"OR or BETA\"),\n                f.col(\"95% CI (TEXT)\"),\n            ).alias(\"oddsRatio\"),\n            # CI lower of the beta value\n            GWASCatalogAssociations._harmonise_beta_ci(\n                GWASCatalogAssociations._normalise_risk_allele(\n                    f.col(\"STRONGEST SNP-RISK ALLELE\")\n                ),\n                f.col(\"referenceAllele\"),\n                f.col(\"alternateAllele\"),\n                f.col(\"OR or BETA\"),\n                f.col(\"95% CI (TEXT)\"),\n                f.col(\"P-VALUE\"),\n                \"lower\",\n 
           ).alias(\"betaConfidenceIntervalLower\"),\n            # CI upper for the beta value\n            GWASCatalogAssociations._harmonise_beta_ci(\n                GWASCatalogAssociations._normalise_risk_allele(\n                    f.col(\"STRONGEST SNP-RISK ALLELE\")\n                ),\n                f.col(\"referenceAllele\"),\n                f.col(\"alternateAllele\"),\n                f.col(\"OR or BETA\"),\n                f.col(\"95% CI (TEXT)\"),\n                f.col(\"P-VALUE\"),\n                \"upper\",\n            ).alias(\"betaConfidenceIntervalUpper\"),\n            # CI lower of the odds ratio value\n            GWASCatalogAssociations._harmonise_odds_ratio_ci(\n                GWASCatalogAssociations._normalise_risk_allele(\n                    f.col(\"STRONGEST SNP-RISK ALLELE\")\n                ),\n                f.col(\"referenceAllele\"),\n                f.col(\"alternateAllele\"),\n                f.col(\"OR or BETA\"),\n                f.col(\"95% CI (TEXT)\"),\n                f.col(\"P-VALUE\"),\n                \"lower\",\n            ).alias(\"oddsRatioConfidenceIntervalLower\"),\n            # CI upper of the odds ratio value\n            GWASCatalogAssociations._harmonise_odds_ratio_ci(\n                GWASCatalogAssociations._normalise_risk_allele(\n                    f.col(\"STRONGEST SNP-RISK ALLELE\")\n                ),\n                f.col(\"referenceAllele\"),\n                f.col(\"alternateAllele\"),\n                f.col(\"OR or BETA\"),\n                f.col(\"95% CI (TEXT)\"),\n                f.col(\"P-VALUE\"),\n                \"upper\",\n            ).alias(\"oddsRatioConfidenceIntervalUpper\"),\n            # p-value of the association, string: split into exponent and mantissa.\n            *GWASCatalogAssociations._parse_pvalue(f.col(\"P-VALUE\")),\n            # Capturing phenotype granularity at the association level\n            GWASCatalogAssociations._concatenate_substudy_description(\n                f.col(\"DISEASE/TRAIT\"),\n                f.col(\"P-VALUE (TEXT)\"),\n                f.col(\"MAPPED_TRAIT_URI\"),\n            ).alias(\"subStudyDescription\"),\n            # Quality controls (array of strings)\n            \"qualityControls\",\n        ),\n        _schema=GWASCatalogAssociations.get_schema(),\n    )\n
"},{"location":"python_api/datasource/gwas_catalog/associations/#otg.datasource.gwas_catalog.associations.GWASCatalogAssociations.update_study_id","title":"update_study_id(study_annotation: DataFrame) -> GWASCatalogAssociations","text":"

Update final studyId and studyLocusId with a dataframe containing study annotation.

Parameters:

  • study_annotation (DataFrame): Dataframe containing updatedStudyId and key columns studyId and subStudyDescription [required]

Returns:

  • GWASCatalogAssociations: Updated study locus with new studyId and studyLocusId.
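
Example (a minimal sketch): the annotation dataframe below is built inline with purely illustrative values; in the pipeline it is normally produced by GWASCatalogStudySplitter. `associations` is assumed to be an existing GWASCatalogAssociations instance.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Mapping from the original studyId and subStudyDescription to the split study identifier:
study_annotation = spark.createDataFrame(
    [("GCST001234", "trait A|no_pvalue_text|EFO_0000001", "GCST001234_1")],
    ["studyId", "subStudyDescription", "updatedStudyId"],
)

associations = associations.update_study_id(study_annotation)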

Source code in src/otg/datasource/gwas_catalog/associations.py
def update_study_id(\n    self: GWASCatalogAssociations, study_annotation: DataFrame\n) -> GWASCatalogAssociations:\n    \"\"\"Update final studyId and studyLocusId with a dataframe containing study annotation.\n\n    Args:\n        study_annotation (DataFrame): Dataframe containing `updatedStudyId` and key columns `studyId` and `subStudyDescription`.\n\n    Returns:\n        GWASCatalogAssociations: Updated study locus with new `studyId` and `studyLocusId`.\n    \"\"\"\n    self.df = (\n        self._df.join(\n            study_annotation, on=[\"studyId\", \"subStudyDescription\"], how=\"left\"\n        )\n        .withColumn(\"studyId\", f.coalesce(\"updatedStudyId\", \"studyId\"))\n        .drop(\"subStudyDescription\", \"updatedStudyId\")\n    ).withColumn(\n        \"studyLocusId\",\n        StudyLocus.assign_study_locus_id(f.col(\"studyId\"), f.col(\"variantId\")),\n    )\n    return self\n
"},{"location":"python_api/datasource/gwas_catalog/study_index/","title":"Study Index","text":""},{"location":"python_api/datasource/gwas_catalog/study_index/#otg.datasource.gwas_catalog.study_index.GWASCatalogStudyIndex","title":"otg.datasource.gwas_catalog.study_index.GWASCatalogStudyIndex dataclass","text":"

Bases: StudyIndex

Study index from GWAS Catalog.

The following information is harmonised from the GWAS Catalog:

  • All publication-related information is retained.
  • Mapped measured and background traits are parsed.
  • Studies are flagged if harmonised summary statistics datasets are available.
  • If available, the FTP path to these files is provided.
  • Ancestries from the discovery and replication stages are structured with sample counts.
  • Case/control counts are extracted.
  • The number of samples with European ancestry is extracted.
Source code in src/otg/datasource/gwas_catalog/study_index.py
@dataclass\nclass GWASCatalogStudyIndex(StudyIndex):\n    \"\"\"Study index from GWAS Catalog.\n\n    The following information is harmonised from the GWAS Catalog:\n\n    - All publication related information retained.\n    - Mapped measured and background traits parsed.\n    - Flagged if harmonized summary statistics datasets available.\n    - If available, the ftp path to these files presented.\n    - Ancestries from the discovery and replication stages are structured with sample counts.\n    - Case/control counts extracted.\n    - The number of samples with European ancestry extracted.\n\n    \"\"\"\n\n    @staticmethod\n    def _parse_discovery_samples(discovery_samples: Column) -> Column:\n        \"\"\"Parse discovery sample sizes from GWAS Catalog.\n\n        This is a curated field. From publication sometimes it is not clear how the samples were split\n        across the reported ancestries. In such cases we are assuming the ancestries were evenly presented\n        and the total sample size is split:\n\n        [\"European, African\", 100] -> [\"European, 50], [\"African\", 50]\n\n        Args:\n            discovery_samples (Column): Raw discovery sample sizes\n\n        Returns:\n            Column: Parsed and de-duplicated list of discovery ancestries with sample size.\n\n        Examples:\n            >>> data = [('s1', \"European\", 10), ('s1', \"African\", 10), ('s2', \"European, African, Asian\", 100), ('s2', \"European\", 50)]\n            >>> df = (\n            ...    spark.createDataFrame(data, ['studyId', 'ancestry', 'sampleSize'])\n            ...    .groupBy('studyId')\n            ...    .agg(\n            ...        f.collect_set(\n            ...            f.struct('ancestry', 'sampleSize')\n            ...        ).alias('discoverySampleSize')\n            ...    )\n            ...    .orderBy('studyId')\n            ...    .withColumn('discoverySampleSize', GWASCatalogStudyIndex._parse_discovery_samples(f.col('discoverySampleSize')))\n            ...    .select('discoverySampleSize')\n            ...    .show(truncate=False)\n            ... 
)\n            +--------------------------------------------+\n            |discoverySampleSize                         |\n            +--------------------------------------------+\n            |[{African, 10}, {European, 10}]             |\n            |[{European, 83}, {African, 33}, {Asian, 33}]|\n            +--------------------------------------------+\n            <BLANKLINE>\n        \"\"\"\n        # To initialize return objects for aggregate functions, schema has to be definied:\n        schema = t.ArrayType(\n            t.StructType(\n                [\n                    t.StructField(\"ancestry\", t.StringType(), True),\n                    t.StructField(\"sampleSize\", t.IntegerType(), True),\n                ]\n            )\n        )\n\n        # Splitting comma separated ancestries:\n        exploded_ancestries = f.transform(\n            discovery_samples,\n            lambda sample: f.split(sample.ancestry, r\",\\s(?![^()]*\\))\"),\n        )\n\n        # Initialize discoverySample object from unique list of ancestries:\n        unique_ancestries = f.transform(\n            f.aggregate(\n                exploded_ancestries,\n                f.array().cast(t.ArrayType(t.StringType())),\n                lambda x, y: f.array_union(x, y),\n                f.array_distinct,\n            ),\n            lambda ancestry: f.struct(\n                ancestry.alias(\"ancestry\"),\n                f.lit(0).cast(t.LongType()).alias(\"sampleSize\"),\n            ),\n        )\n\n        # Computing sample sizes for ancestries when splitting is needed:\n        resolved_sample_count = f.transform(\n            f.arrays_zip(\n                f.transform(exploded_ancestries, lambda pop: f.size(pop)).alias(\n                    \"pop_size\"\n                ),\n                f.transform(discovery_samples, lambda pop: pop.sampleSize).alias(\n                    \"pop_count\"\n                ),\n            ),\n            lambda pop: (pop.pop_count / pop.pop_size).cast(t.IntegerType()),\n        )\n\n        # Flattening out ancestries with sample sizes:\n        parsed_sample_size = f.aggregate(\n            f.transform(\n                f.arrays_zip(\n                    exploded_ancestries.alias(\"ancestries\"),\n                    resolved_sample_count.alias(\"sample_count\"),\n                ),\n                GWASCatalogStudyIndex._merge_ancestries_and_counts,\n            ),\n            f.array().cast(schema),\n            lambda x, y: f.array_union(x, y),\n        )\n\n        # Normalize ancestries:\n        return f.aggregate(\n            parsed_sample_size,\n            unique_ancestries,\n            GWASCatalogStudyIndex._normalize_ancestries,\n        )\n\n    @staticmethod\n    def _normalize_ancestries(merged: Column, ancestry: Column) -> Column:\n        \"\"\"Normalize ancestries from a list of structs.\n\n        As some ancestry label might be repeated with different sample counts,\n        these counts need to be collected.\n\n        Args:\n            merged (Column): Resulting list of struct with unique ancestries.\n            ancestry (Column): One ancestry object coming from raw.\n\n        Returns:\n            Column: Unique list of ancestries with the sample counts.\n        \"\"\"\n        # Iterating over the list of unique ancestries and adding the sample size if label matches:\n        return f.transform(\n            merged,\n            lambda a: f.when(\n                a.ancestry == ancestry.ancestry,\n                f.struct(\n           
         a.ancestry.alias(\"ancestry\"),\n                    (a.sampleSize + ancestry.sampleSize)\n                    .cast(t.LongType())\n                    .alias(\"sampleSize\"),\n                ),\n            ).otherwise(a),\n        )\n\n    @staticmethod\n    def _merge_ancestries_and_counts(ancestry_group: Column) -> Column:\n        \"\"\"Merge ancestries with sample sizes.\n\n        After splitting ancestry annotations, all resulting ancestries needs to be assigned\n        with the proper sample size.\n\n        Args:\n            ancestry_group (Column): Each element is a struct with `sample_count` (int) and `ancestries` (list)\n\n        Returns:\n            Column: a list of structs with `ancestry` and `sampleSize` fields.\n\n        Examples:\n            >>> data = [(12, ['African', 'European']),(12, ['African'])]\n            >>> (\n            ...     spark.createDataFrame(data, ['sample_count', 'ancestries'])\n            ...     .select(GWASCatalogStudyIndex._merge_ancestries_and_counts(f.struct('sample_count', 'ancestries')).alias('test'))\n            ...     .show(truncate=False)\n            ... )\n            +-------------------------------+\n            |test                           |\n            +-------------------------------+\n            |[{African, 12}, {European, 12}]|\n            |[{African, 12}]                |\n            +-------------------------------+\n            <BLANKLINE>\n        \"\"\"\n        # Extract sample size for the ancestry group:\n        count = ancestry_group.sample_count\n\n        # We need to loop through the ancestries:\n        return f.transform(\n            ancestry_group.ancestries,\n            lambda ancestry: f.struct(\n                ancestry.alias(\"ancestry\"),\n                count.alias(\"sampleSize\"),\n            ),\n        )\n\n    @classmethod\n    def _parse_study_table(\n        cls: type[GWASCatalogStudyIndex], catalog_studies: DataFrame\n    ) -> GWASCatalogStudyIndex:\n        \"\"\"Harmonise GWASCatalog study table with `StudyIndex` schema.\n\n        Args:\n            catalog_studies (DataFrame): GWAS Catalog study table\n\n        Returns:\n            GWASCatalogStudyIndex: Parsed and annotated GWAS Catalog study table.\n        \"\"\"\n        return GWASCatalogStudyIndex(\n            _df=catalog_studies.select(\n                f.coalesce(\n                    f.col(\"STUDY ACCESSION\"), f.monotonically_increasing_id()\n                ).alias(\"studyId\"),\n                f.lit(\"GCST\").alias(\"projectId\"),\n                f.lit(\"gwas\").alias(\"studyType\"),\n                f.col(\"PUBMED ID\").alias(\"pubmedId\"),\n                f.col(\"FIRST AUTHOR\").alias(\"publicationFirstAuthor\"),\n                f.col(\"DATE\").alias(\"publicationDate\"),\n                f.col(\"JOURNAL\").alias(\"publicationJournal\"),\n                f.col(\"STUDY\").alias(\"publicationTitle\"),\n                f.coalesce(f.col(\"DISEASE/TRAIT\"), f.lit(\"Unreported\")).alias(\n                    \"traitFromSource\"\n                ),\n                f.col(\"INITIAL SAMPLE SIZE\").alias(\"initialSampleSize\"),\n                parse_efos(f.col(\"MAPPED_TRAIT_URI\")).alias(\"traitFromSourceMappedIds\"),\n                parse_efos(f.col(\"MAPPED BACKGROUND TRAIT URI\")).alias(\n                    \"backgroundTraitFromSourceMappedIds\"\n                ),\n            ),\n            _schema=GWASCatalogStudyIndex.get_schema(),\n        )\n\n    @classmethod\n    def from_source(\n        
cls: type[GWASCatalogStudyIndex],\n        catalog_studies: DataFrame,\n        ancestry_file: DataFrame,\n        sumstats_lut: DataFrame,\n    ) -> StudyIndex:\n        \"\"\"Ingests study level metadata from the GWAS Catalog.\n\n        Args:\n            catalog_studies (DataFrame): GWAS Catalog raw study table\n            ancestry_file (DataFrame): GWAS Catalog ancestry table.\n            sumstats_lut (DataFrame): GWAS Catalog summary statistics list.\n\n        Returns:\n            StudyIndex: Parsed and annotated GWAS Catalog study table.\n        \"\"\"\n        # Read GWAS Catalogue raw data\n        return (\n            cls._parse_study_table(catalog_studies)\n            ._annotate_ancestries(ancestry_file)\n            ._annotate_sumstats_info(sumstats_lut)\n            ._annotate_discovery_sample_sizes()\n        )\n\n    def update_study_id(\n        self: GWASCatalogStudyIndex, study_annotation: DataFrame\n    ) -> GWASCatalogStudyIndex:\n        \"\"\"Update studyId with a dataframe containing study.\n\n        Args:\n            study_annotation (DataFrame): Dataframe containing `updatedStudyId`, `traitFromSource`, `traitFromSourceMappedIds` and key column `studyId`.\n\n        Returns:\n            GWASCatalogStudyIndex: Updated study table.\n        \"\"\"\n        self.df = (\n            self._df.join(\n                study_annotation.select(\n                    *[\n                        f.col(c).alias(f\"updated{c}\")\n                        if c not in [\"studyId\", \"updatedStudyId\"]\n                        else f.col(c)\n                        for c in study_annotation.columns\n                    ]\n                ),\n                on=\"studyId\",\n                how=\"left\",\n            )\n            .withColumn(\n                \"studyId\",\n                f.coalesce(f.col(\"updatedStudyId\"), f.col(\"studyId\")),\n            )\n            .withColumn(\n                \"traitFromSource\",\n                f.coalesce(f.col(\"updatedtraitFromSource\"), f.col(\"traitFromSource\")),\n            )\n            .withColumn(\n                \"traitFromSourceMappedIds\",\n                f.coalesce(\n                    f.col(\"updatedtraitFromSourceMappedIds\"),\n                    f.col(\"traitFromSourceMappedIds\"),\n                ),\n            )\n            .select(self._df.columns)\n        )\n\n        return self\n\n    def _annotate_ancestries(\n        self: GWASCatalogStudyIndex, ancestry_lut: DataFrame\n    ) -> GWASCatalogStudyIndex:\n        \"\"\"Extracting sample sizes and ancestry information.\n\n        This function parses the ancestry data. 
Also get counts for the europeans in the same\n        discovery stage.\n\n        Args:\n            ancestry_lut (DataFrame): Ancestry table as downloaded from the GWAS Catalog\n\n        Returns:\n            GWASCatalogStudyIndex: Slimmed and cleaned version of the ancestry annotation.\n        \"\"\"\n        ancestry = (\n            ancestry_lut\n            # Convert column headers to camelcase:\n            .transform(\n                lambda df: df.select(\n                    *[f.expr(column2camel_case(x)) for x in df.columns]\n                )\n            ).withColumnRenamed(\n                \"studyAccession\", \"studyId\"\n            )  # studyId has not been split yet\n        )\n\n        # Get a high resolution dataset on experimental stage:\n        ancestry_stages = (\n            ancestry.groupBy(\"studyId\")\n            .pivot(\"stage\")\n            .agg(\n                f.collect_set(\n                    f.struct(\n                        f.col(\"broadAncestralCategory\").alias(\"ancestry\"),\n                        f.col(\"numberOfIndividuals\")\n                        .cast(t.LongType())\n                        .alias(\"sampleSize\"),\n                    )\n                )\n            )\n            .withColumn(\n                \"discoverySamples\", self._parse_discovery_samples(f.col(\"initial\"))\n            )\n            .withColumnRenamed(\"replication\", \"replicationSamples\")\n            # Mapping discovery stage ancestries to LD reference:\n            .withColumn(\n                \"ldPopulationStructure\",\n                self.aggregate_and_map_ancestries(f.col(\"discoverySamples\")),\n            )\n            .drop(\"initial\")\n            .persist()\n        )\n\n        # Generate information on the ancestry composition of the discovery stage, and calculate\n        # the proportion of the Europeans:\n        europeans_deconvoluted = (\n            ancestry\n            # Focus on discovery stage:\n            .filter(f.col(\"stage\") == \"initial\")\n            # Sorting ancestries if European:\n            .withColumn(\n                \"ancestryFlag\",\n                # Excluding finnish:\n                f.when(\n                    f.col(\"initialSampleDescription\").contains(\"Finnish\"),\n                    f.lit(\"other\"),\n                )\n                # Excluding Icelandic population:\n                .when(\n                    f.col(\"initialSampleDescription\").contains(\"Icelandic\"),\n                    f.lit(\"other\"),\n                )\n                # Including European ancestry:\n                .when(f.col(\"broadAncestralCategory\") == \"European\", f.lit(\"european\"))\n                # Exclude all other population:\n                .otherwise(\"other\"),\n            )\n            # Grouping by study accession and initial sample description:\n            .groupBy(\"studyId\")\n            .pivot(\"ancestryFlag\")\n            .agg(\n                # Summarizing sample sizes for all ancestries:\n                f.sum(f.col(\"numberOfIndividuals\"))\n            )\n            # Do arithmetics to make sure we have the right proportion of european in the set:\n            .withColumn(\n                \"initialSampleCountEuropean\",\n                f.when(f.col(\"european\").isNull(), f.lit(0)).otherwise(\n                    f.col(\"european\")\n                ),\n            )\n            .withColumn(\n                \"initialSampleCountOther\",\n                
f.when(f.col(\"other\").isNull(), f.lit(0)).otherwise(f.col(\"other\")),\n            )\n            .withColumn(\n                \"initialSampleCount\",\n                f.col(\"initialSampleCountEuropean\") + f.col(\"other\"),\n            )\n            .drop(\n                \"european\",\n                \"other\",\n                \"initialSampleCount\",\n                \"initialSampleCountEuropean\",\n                \"initialSampleCountOther\",\n            )\n        )\n\n        parsed_ancestry_lut = ancestry_stages.join(\n            europeans_deconvoluted, on=\"studyId\", how=\"outer\"\n        )\n\n        self.df = self.df.join(parsed_ancestry_lut, on=\"studyId\", how=\"left\")\n        return self\n\n    def _annotate_sumstats_info(\n        self: GWASCatalogStudyIndex, sumstats_lut: DataFrame\n    ) -> GWASCatalogStudyIndex:\n        \"\"\"Annotate summary stat locations.\n\n        Args:\n            sumstats_lut (DataFrame): listing GWAS Catalog summary stats paths\n\n        Returns:\n            GWASCatalogStudyIndex: including `summarystatsLocation` and `hasSumstats` columns\n        \"\"\"\n        gwas_sumstats_base_uri = (\n            \"ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/\"\n        )\n\n        parsed_sumstats_lut = sumstats_lut.withColumn(\n            \"summarystatsLocation\",\n            f.concat(\n                f.lit(gwas_sumstats_base_uri),\n                f.regexp_replace(f.col(\"_c0\"), r\"^\\.\\/\", \"\"),\n            ),\n        ).select(\n            f.regexp_extract(f.col(\"summarystatsLocation\"), r\"\\/(GCST\\d+)\\/\", 1).alias(\n                \"studyId\"\n            ),\n            \"summarystatsLocation\",\n            f.lit(True).alias(\"hasSumstats\"),\n        )\n\n        self.df = (\n            self.df.drop(\"hasSumstats\")\n            .join(parsed_sumstats_lut, on=\"studyId\", how=\"left\")\n            .withColumn(\"hasSumstats\", f.coalesce(f.col(\"hasSumstats\"), f.lit(False)))\n        )\n        return self\n\n    def _annotate_discovery_sample_sizes(\n        self: GWASCatalogStudyIndex,\n    ) -> GWASCatalogStudyIndex:\n        \"\"\"Extract the sample size of the discovery stage of the study as annotated in the GWAS Catalog.\n\n        For some studies that measure quantitative traits, nCases and nControls can't be extracted. 
Therefore, we assume these are 0.\n\n        Returns:\n            GWASCatalogStudyIndex: object with columns `nCases`, `nControls`, and `nSamples` per `studyId` correctly extracted.\n        \"\"\"\n        sample_size_lut = (\n            self.df.select(\n                \"studyId\",\n                f.explode_outer(f.split(f.col(\"initialSampleSize\"), r\",\\s+\")).alias(\n                    \"samples\"\n                ),\n            )\n            # Extracting the sample size from the string:\n            .withColumn(\n                \"sampleSize\",\n                f.regexp_extract(\n                    f.regexp_replace(f.col(\"samples\"), \",\", \"\"), r\"[0-9,]+\", 0\n                ).cast(t.IntegerType()),\n            )\n            .select(\n                \"studyId\",\n                \"sampleSize\",\n                f.when(f.col(\"samples\").contains(\"cases\"), f.col(\"sampleSize\"))\n                .otherwise(f.lit(0))\n                .alias(\"nCases\"),\n                f.when(f.col(\"samples\").contains(\"controls\"), f.col(\"sampleSize\"))\n                .otherwise(f.lit(0))\n                .alias(\"nControls\"),\n            )\n            # Aggregating sample sizes for all ancestries:\n            .groupBy(\"studyId\")  # studyId has not been split yet\n            .agg(\n                f.sum(\"nCases\").alias(\"nCases\"),\n                f.sum(\"nControls\").alias(\"nControls\"),\n                f.sum(\"sampleSize\").alias(\"nSamples\"),\n            )\n        )\n        self.df = self.df.join(sample_size_lut, on=\"studyId\", how=\"left\")\n        return self\n
"},{"location":"python_api/datasource/gwas_catalog/study_index/#otg.datasource.gwas_catalog.study_index.GWASCatalogStudyIndex.from_source","title":"from_source(catalog_studies: DataFrame, ancestry_file: DataFrame, sumstats_lut: DataFrame) -> StudyIndex classmethod","text":"

Ingests study level metadata from the GWAS Catalog.

Parameters:

  • catalog_studies (DataFrame): GWAS Catalog raw study table [required]
  • ancestry_file (DataFrame): GWAS Catalog ancestry table [required]
  • sumstats_lut (DataFrame): GWAS Catalog summary statistics list [required]

Returns:

  • StudyIndex: Parsed and annotated GWAS Catalog study table.
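
Example (a minimal sketch): the file paths and read options below are assumptions; the only documented requirement is that all three inputs arrive as Spark DataFrames.

from pyspark.sql import SparkSession

from otg.datasource.gwas_catalog.study_index import GWASCatalogStudyIndex

spark = SparkSession.builder.getOrCreate()

catalog_studies = spark.read.csv("gwas_catalog_studies.tsv", sep="\t", header=True)
ancestry_file = spark.read.csv("gwas_catalog_ancestries.tsv", sep="\t", header=True)
# The summary statistics listing is read without a header, so the path column arrives as `_c0`:
sumstats_lut = spark.read.csv("harmonised_sumstats_list.txt", header=False)

study_index = GWASCatalogStudyIndex.from_source(catalog_studies, ancestry_file, sumstats_lut)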

Source code in src/otg/datasource/gwas_catalog/study_index.py
@classmethod\ndef from_source(\n    cls: type[GWASCatalogStudyIndex],\n    catalog_studies: DataFrame,\n    ancestry_file: DataFrame,\n    sumstats_lut: DataFrame,\n) -> StudyIndex:\n    \"\"\"Ingests study level metadata from the GWAS Catalog.\n\n    Args:\n        catalog_studies (DataFrame): GWAS Catalog raw study table\n        ancestry_file (DataFrame): GWAS Catalog ancestry table.\n        sumstats_lut (DataFrame): GWAS Catalog summary statistics list.\n\n    Returns:\n        StudyIndex: Parsed and annotated GWAS Catalog study table.\n    \"\"\"\n    # Read GWAS Catalogue raw data\n    return (\n        cls._parse_study_table(catalog_studies)\n        ._annotate_ancestries(ancestry_file)\n        ._annotate_sumstats_info(sumstats_lut)\n        ._annotate_discovery_sample_sizes()\n    )\n
"},{"location":"python_api/datasource/gwas_catalog/study_index/#otg.datasource.gwas_catalog.study_index.GWASCatalogStudyIndex.update_study_id","title":"update_study_id(study_annotation: DataFrame) -> GWASCatalogStudyIndex","text":"

Update studyId with a dataframe containing study annotation.

Parameters:

  • study_annotation (DataFrame): Dataframe containing updatedStudyId, traitFromSource, traitFromSourceMappedIds and key column studyId [required]

Returns:

  • GWASCatalogStudyIndex: Updated study table.
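
Example (a minimal sketch): the annotation dataframe is built inline with illustrative values; `study_index` is assumed to be an existing GWASCatalogStudyIndex instance.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One study is renamed and receives an updated trait label and EFO mapping (illustrative values):
study_annotation = spark.createDataFrame(
    [("GCST001234", "GCST001234_1", "trait A [subset]", ["EFO_0000001"])],
    ["studyId", "updatedStudyId", "traitFromSource", "traitFromSourceMappedIds"],
)

study_index = study_index.update_study_id(study_annotation)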

Source code in src/otg/datasource/gwas_catalog/study_index.py
def update_study_id(\n    self: GWASCatalogStudyIndex, study_annotation: DataFrame\n) -> GWASCatalogStudyIndex:\n    \"\"\"Update studyId with a dataframe containing study.\n\n    Args:\n        study_annotation (DataFrame): Dataframe containing `updatedStudyId`, `traitFromSource`, `traitFromSourceMappedIds` and key column `studyId`.\n\n    Returns:\n        GWASCatalogStudyIndex: Updated study table.\n    \"\"\"\n    self.df = (\n        self._df.join(\n            study_annotation.select(\n                *[\n                    f.col(c).alias(f\"updated{c}\")\n                    if c not in [\"studyId\", \"updatedStudyId\"]\n                    else f.col(c)\n                    for c in study_annotation.columns\n                ]\n            ),\n            on=\"studyId\",\n            how=\"left\",\n        )\n        .withColumn(\n            \"studyId\",\n            f.coalesce(f.col(\"updatedStudyId\"), f.col(\"studyId\")),\n        )\n        .withColumn(\n            \"traitFromSource\",\n            f.coalesce(f.col(\"updatedtraitFromSource\"), f.col(\"traitFromSource\")),\n        )\n        .withColumn(\n            \"traitFromSourceMappedIds\",\n            f.coalesce(\n                f.col(\"updatedtraitFromSourceMappedIds\"),\n                f.col(\"traitFromSourceMappedIds\"),\n            ),\n        )\n        .select(self._df.columns)\n    )\n\n    return self\n
"},{"location":"python_api/datasource/gwas_catalog/study_splitter/","title":"Study Splitter","text":""},{"location":"python_api/datasource/gwas_catalog/study_splitter/#otg.datasource.gwas_catalog.study_splitter.GWASCatalogStudySplitter","title":"otg.datasource.gwas_catalog.study_splitter.GWASCatalogStudySplitter","text":"

Splitting multi-trait GWAS Catalog studies.

Source code in src/otg/datasource/gwas_catalog/study_splitter.py
class GWASCatalogStudySplitter:\n    \"\"\"Splitting multi-trait GWAS Catalog studies.\"\"\"\n\n    @staticmethod\n    def _resolve_trait(\n        study_trait: Column, association_trait: Column, p_value_text: Column\n    ) -> Column:\n        \"\"\"Resolve trait names by consolidating association-level and study-level trait names.\n\n        Args:\n            study_trait (Column): Study-level trait name.\n            association_trait (Column): Association-level trait name.\n            p_value_text (Column): P-value text.\n\n        Returns:\n            Column: Resolved trait name.\n        \"\"\"\n        return (\n            f.when(\n                (p_value_text.isNotNull()) & (p_value_text != (\"no_pvalue_text\")),\n                f.concat(\n                    association_trait,\n                    f.lit(\" [\"),\n                    p_value_text,\n                    f.lit(\"]\"),\n                ),\n            )\n            .when(\n                association_trait.isNotNull(),\n                association_trait,\n            )\n            .otherwise(study_trait)\n        )\n\n    @staticmethod\n    def _resolve_efo(association_efo: Column, study_efo: Column) -> Column:\n        \"\"\"Resolve EFOs by consolidating association-level and study-level EFOs.\n\n        Args:\n            association_efo (Column): EFO column from the association table.\n            study_efo (Column): EFO column from the study table.\n\n        Returns:\n            Column: Consolidated EFO column.\n        \"\"\"\n        return f.coalesce(f.split(association_efo, r\"\\/\"), study_efo)\n\n    @staticmethod\n    def _resolve_study_id(study_id: Column, sub_study_description: Column) -> Column:\n        \"\"\"Resolve study IDs by exploding association-level information (e.g. 
pvalue_text, EFO).\n\n        Args:\n            study_id (Column): Study ID column.\n            sub_study_description (Column): Sub-study description column from the association table.\n\n        Returns:\n            Column: Resolved study ID column.\n        \"\"\"\n        split_w = Window.partitionBy(study_id).orderBy(sub_study_description)\n        row_number = f.dense_rank().over(split_w)\n        substudy_count = f.count(row_number).over(split_w)\n        return f.when(substudy_count == 1, study_id).otherwise(\n            f.concat_ws(\"_\", study_id, row_number)\n        )\n\n    @classmethod\n    def split(\n        cls: type[GWASCatalogStudySplitter],\n        studies: GWASCatalogStudyIndex,\n        associations: GWASCatalogAssociations,\n    ) -> Tuple[GWASCatalogStudyIndex, GWASCatalogAssociations]:\n        \"\"\"Splitting multi-trait GWAS Catalog studies.\n\n        If assigned disease of the study and the association don't agree, we assume the study needs to be split.\n        Then disease EFOs, trait names and study ID are consolidated\n\n        Args:\n            studies (GWASCatalogStudyIndex): GWAS Catalog studies.\n            associations (GWASCatalogAssociations): GWAS Catalog associations.\n\n        Returns:\n            Tuple[GWASCatalogStudyIndex, GWASCatalogAssociations]: Split studies and associations.\n        \"\"\"\n        # Composite of studies and associations to resolve scattered information\n        st_ass = (\n            associations.df.join(f.broadcast(studies.df), on=\"studyId\", how=\"inner\")\n            .select(\n                \"studyId\",\n                \"subStudyDescription\",\n                cls._resolve_study_id(\n                    f.col(\"studyId\"), f.col(\"subStudyDescription\")\n                ).alias(\"updatedStudyId\"),\n                cls._resolve_trait(\n                    f.col(\"traitFromSource\"),\n                    f.split(\"subStudyDescription\", r\"\\|\").getItem(0),\n                    f.split(\"subStudyDescription\", r\"\\|\").getItem(1),\n                ).alias(\"traitFromSource\"),\n                cls._resolve_efo(\n                    f.split(\"subStudyDescription\", r\"\\|\").getItem(2),\n                    f.col(\"traitFromSourceMappedIds\"),\n                ).alias(\"traitFromSourceMappedIds\"),\n            )\n            .persist()\n        )\n\n        return (\n            studies.update_study_id(\n                st_ass.select(\n                    \"studyId\",\n                    \"updatedStudyId\",\n                    \"traitFromSource\",\n                    \"traitFromSourceMappedIds\",\n                ).distinct()\n            ),\n            associations.update_study_id(\n                st_ass.select(\n                    \"updatedStudyId\", \"studyId\", \"subStudyDescription\"\n                ).distinct()\n            )._qc_ambiguous_study(),\n        )\n
"},{"location":"python_api/datasource/gwas_catalog/study_splitter/#otg.datasource.gwas_catalog.study_splitter.GWASCatalogStudySplitter.split","title":"split(studies: GWASCatalogStudyIndex, associations: GWASCatalogAssociations) -> Tuple[GWASCatalogStudyIndex, GWASCatalogAssociations] classmethod","text":"

Splitting multi-trait GWAS Catalog studies.

If the disease assigned to the study and to the association don't agree, we assume the study needs to be split. Disease EFOs, trait names and study IDs are then consolidated.

Parameters:

  • studies (GWASCatalogStudyIndex): GWAS Catalog studies [required]
  • associations (GWASCatalogAssociations): GWAS Catalog associations [required]

Returns:

  • Tuple[GWASCatalogStudyIndex, GWASCatalogAssociations]: Split studies and associations.
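
Example (a minimal sketch): `studies` and `associations` are assumed to be GWASCatalogStudyIndex and GWASCatalogAssociations instances produced by the respective from_source methods.

from otg.datasource.gwas_catalog.study_splitter import GWASCatalogStudySplitter

# Returns the updated study index and associations, with ambiguous associations flagged:
studies, associations = GWASCatalogStudySplitter.split(studies, associations)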

Source code in src/otg/datasource/gwas_catalog/study_splitter.py
@classmethod\ndef split(\n    cls: type[GWASCatalogStudySplitter],\n    studies: GWASCatalogStudyIndex,\n    associations: GWASCatalogAssociations,\n) -> Tuple[GWASCatalogStudyIndex, GWASCatalogAssociations]:\n    \"\"\"Splitting multi-trait GWAS Catalog studies.\n\n    If assigned disease of the study and the association don't agree, we assume the study needs to be split.\n    Then disease EFOs, trait names and study ID are consolidated\n\n    Args:\n        studies (GWASCatalogStudyIndex): GWAS Catalog studies.\n        associations (GWASCatalogAssociations): GWAS Catalog associations.\n\n    Returns:\n        Tuple[GWASCatalogStudyIndex, GWASCatalogAssociations]: Split studies and associations.\n    \"\"\"\n    # Composite of studies and associations to resolve scattered information\n    st_ass = (\n        associations.df.join(f.broadcast(studies.df), on=\"studyId\", how=\"inner\")\n        .select(\n            \"studyId\",\n            \"subStudyDescription\",\n            cls._resolve_study_id(\n                f.col(\"studyId\"), f.col(\"subStudyDescription\")\n            ).alias(\"updatedStudyId\"),\n            cls._resolve_trait(\n                f.col(\"traitFromSource\"),\n                f.split(\"subStudyDescription\", r\"\\|\").getItem(0),\n                f.split(\"subStudyDescription\", r\"\\|\").getItem(1),\n            ).alias(\"traitFromSource\"),\n            cls._resolve_efo(\n                f.split(\"subStudyDescription\", r\"\\|\").getItem(2),\n                f.col(\"traitFromSourceMappedIds\"),\n            ).alias(\"traitFromSourceMappedIds\"),\n        )\n        .persist()\n    )\n\n    return (\n        studies.update_study_id(\n            st_ass.select(\n                \"studyId\",\n                \"updatedStudyId\",\n                \"traitFromSource\",\n                \"traitFromSourceMappedIds\",\n            ).distinct()\n        ),\n        associations.update_study_id(\n            st_ass.select(\n                \"updatedStudyId\", \"studyId\", \"subStudyDescription\"\n            ).distinct()\n        )._qc_ambiguous_study(),\n    )\n
"},{"location":"python_api/datasource/gwas_catalog/summary_statistics/","title":"Summary statistics","text":""},{"location":"python_api/datasource/gwas_catalog/summary_statistics/#otg.datasource.gwas_catalog.summary_statistics.GWASCatalogSummaryStatistics","title":"otg.datasource.gwas_catalog.summary_statistics.GWASCatalogSummaryStatistics dataclass","text":"

Bases: SummaryStatistics

GWAS Catalog Summary Statistics reader.

Source code in src/otg/datasource/gwas_catalog/summary_statistics.py
@dataclass\nclass GWASCatalogSummaryStatistics(SummaryStatistics):\n    \"\"\"GWAS Catalog Summary Statistics reader.\"\"\"\n\n    @classmethod\n    def from_gwas_harmonized_summary_stats(\n        cls: type[GWASCatalogSummaryStatistics],\n        sumstats_df: DataFrame,\n        study_id: str,\n    ) -> GWASCatalogSummaryStatistics:\n        \"\"\"Create summary statistics object from summary statistics flatfile, harmonized by the GWAS Catalog.\n\n        Args:\n            sumstats_df (DataFrame): Harmonized dataset read as a spark dataframe from GWAS Catalog.\n            study_id (str): GWAS Catalog study accession.\n\n        Returns:\n            GWASCatalogSummaryStatistics: Summary statistics object.\n        \"\"\"\n        # The effect allele frequency is an optional column, we have to test if it is there:\n        allele_frequency_expression = (\n            f.col(\"hm_effect_allele_frequency\").cast(t.FloatType())\n            if \"hm_effect_allele_frequency\" in sumstats_df.columns\n            else f.lit(None)\n        )\n\n        # Processing columns of interest:\n        processed_sumstats_df = (\n            sumstats_df\n            # Dropping rows which doesn't have proper position:\n            .filter(f.col(\"hm_pos\").cast(t.IntegerType()).isNotNull())\n            .select(\n                # Adding study identifier:\n                f.lit(study_id).cast(t.StringType()).alias(\"studyId\"),\n                # Adding variant identifier:\n                f.col(\"hm_variant_id\").alias(\"variantId\"),\n                f.col(\"hm_chrom\").alias(\"chromosome\"),\n                f.col(\"hm_pos\").cast(t.IntegerType()).alias(\"position\"),\n                # Parsing p-value mantissa and exponent:\n                *parse_pvalue(f.col(\"p_value\")),\n                # Converting/calculating effect and confidence interval:\n                *convert_odds_ratio_to_beta(\n                    f.col(\"hm_beta\").cast(t.DoubleType()),\n                    f.col(\"hm_odds_ratio\").cast(t.DoubleType()),\n                    f.col(\"standard_error\").cast(t.DoubleType()),\n                ),\n                allele_frequency_expression.alias(\"effectAlleleFrequencyFromSource\"),\n            )\n            # The previous select expression generated the necessary fields for calculating the confidence intervals:\n            .select(\n                \"*\",\n                *calculate_confidence_interval(\n                    f.col(\"pValueMantissa\"),\n                    f.col(\"pValueExponent\"),\n                    f.col(\"beta\"),\n                    f.col(\"standardError\"),\n                ),\n            )\n            .repartition(200, \"chromosome\")\n            .sortWithinPartitions(\"position\")\n        )\n\n        # Initializing summary statistics object:\n        return cls(\n            _df=processed_sumstats_df,\n            _schema=cls.get_schema(),\n        )\n
"},{"location":"python_api/datasource/gwas_catalog/summary_statistics/#otg.datasource.gwas_catalog.summary_statistics.GWASCatalogSummaryStatistics.from_gwas_harmonized_summary_stats","title":"from_gwas_harmonized_summary_stats(sumstats_df: DataFrame, study_id: str) -> GWASCatalogSummaryStatistics classmethod","text":"

Create summary statistics object from summary statistics flatfile, harmonized by the GWAS Catalog.

Parameters:

  • sumstats_df (DataFrame): Harmonized dataset read as a Spark dataframe from GWAS Catalog [required]
  • study_id (str): GWAS Catalog study accession [required]

Returns:

  • GWASCatalogSummaryStatistics: Summary statistics object.
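
Example (a minimal sketch): the path and read options below are assumptions about how the harmonised, tab-separated flat file is loaded; only the from_gwas_harmonized_summary_stats call is taken from this page.

from pyspark.sql import SparkSession

from otg.datasource.gwas_catalog.summary_statistics import GWASCatalogSummaryStatistics

spark = SparkSession.builder.getOrCreate()

# Harmonised summary statistics flat file for one study (hypothetical path):
sumstats_df = spark.read.csv("GCST001234.h.tsv.gz", sep="\t", header=True)

sumstats = GWASCatalogSummaryStatistics.from_gwas_harmonized_summary_stats(
    sumstats_df, study_id="GCST001234"
)
sumstats.df.show(5)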

Source code in src/otg/datasource/gwas_catalog/summary_statistics.py
@classmethod\ndef from_gwas_harmonized_summary_stats(\n    cls: type[GWASCatalogSummaryStatistics],\n    sumstats_df: DataFrame,\n    study_id: str,\n) -> GWASCatalogSummaryStatistics:\n    \"\"\"Create summary statistics object from summary statistics flatfile, harmonized by the GWAS Catalog.\n\n    Args:\n        sumstats_df (DataFrame): Harmonized dataset read as a spark dataframe from GWAS Catalog.\n        study_id (str): GWAS Catalog study accession.\n\n    Returns:\n        GWASCatalogSummaryStatistics: Summary statistics object.\n    \"\"\"\n    # The effect allele frequency is an optional column, we have to test if it is there:\n    allele_frequency_expression = (\n        f.col(\"hm_effect_allele_frequency\").cast(t.FloatType())\n        if \"hm_effect_allele_frequency\" in sumstats_df.columns\n        else f.lit(None)\n    )\n\n    # Processing columns of interest:\n    processed_sumstats_df = (\n        sumstats_df\n        # Dropping rows which doesn't have proper position:\n        .filter(f.col(\"hm_pos\").cast(t.IntegerType()).isNotNull())\n        .select(\n            # Adding study identifier:\n            f.lit(study_id).cast(t.StringType()).alias(\"studyId\"),\n            # Adding variant identifier:\n            f.col(\"hm_variant_id\").alias(\"variantId\"),\n            f.col(\"hm_chrom\").alias(\"chromosome\"),\n            f.col(\"hm_pos\").cast(t.IntegerType()).alias(\"position\"),\n            # Parsing p-value mantissa and exponent:\n            *parse_pvalue(f.col(\"p_value\")),\n            # Converting/calculating effect and confidence interval:\n            *convert_odds_ratio_to_beta(\n                f.col(\"hm_beta\").cast(t.DoubleType()),\n                f.col(\"hm_odds_ratio\").cast(t.DoubleType()),\n                f.col(\"standard_error\").cast(t.DoubleType()),\n            ),\n            allele_frequency_expression.alias(\"effectAlleleFrequencyFromSource\"),\n        )\n        # The previous select expression generated the necessary fields for calculating the confidence intervals:\n        .select(\n            \"*\",\n            *calculate_confidence_interval(\n                f.col(\"pValueMantissa\"),\n                f.col(\"pValueExponent\"),\n                f.col(\"beta\"),\n                f.col(\"standardError\"),\n            ),\n        )\n        .repartition(200, \"chromosome\")\n        .sortWithinPartitions(\"position\")\n    )\n\n    # Initializing summary statistics object:\n    return cls(\n        _df=processed_sumstats_df,\n        _schema=cls.get_schema(),\n    )\n
"},{"location":"python_api/datasource/intervals/_intervals/","title":"Chromatin intervals","text":"

TBC

"},{"location":"python_api/datasource/intervals/andersson/","title":"Andersson et al.","text":""},{"location":"python_api/datasource/intervals/andersson/#otg.datasource.intervals.andersson.IntervalsAndersson","title":"otg.datasource.intervals.andersson.IntervalsAndersson","text":"

Bases: Intervals

Interval dataset from Andersson et al. 2014.

Source code in src/otg/datasource/intervals/andersson.py
class IntervalsAndersson(Intervals):\n    \"\"\"Interval dataset from Andersson et al. 2014.\"\"\"\n\n    @staticmethod\n    def read(spark: SparkSession, path: str) -> DataFrame:\n        \"\"\"Read andersson2014 dataset.\n\n        Args:\n            spark (SparkSession): Spark session\n            path (str): Path to the dataset\n\n        Returns:\n            DataFrame: Raw Andersson et al. dataframe\n        \"\"\"\n        input_schema = t.StructType.fromJson(\n            json.loads(\n                pkg_resources.read_text(schemas, \"andersson2014.json\", encoding=\"utf-8\")\n            )\n        )\n        return (\n            spark.read.option(\"delimiter\", \"\\t\")\n            .option(\"mode\", \"DROPMALFORMED\")\n            .option(\"header\", \"true\")\n            .schema(input_schema)\n            .csv(path)\n        )\n\n    @classmethod\n    def parse(\n        cls: type[IntervalsAndersson],\n        raw_anderson_df: DataFrame,\n        gene_index: GeneIndex,\n        lift: LiftOverSpark,\n    ) -> Intervals:\n        \"\"\"Parse Andersson et al. 2014 dataset.\n\n        Args:\n            raw_anderson_df (DataFrame): Raw Andersson et al. dataset\n            gene_index (GeneIndex): Gene index\n            lift (LiftOverSpark): LiftOverSpark instance\n\n        Returns:\n            Intervals: Intervals dataset\n        \"\"\"\n        # Constant values:\n        dataset_name = \"andersson2014\"\n        experiment_type = \"fantom5\"\n        pmid = \"24670763\"\n        bio_feature = \"aggregate\"\n        twosided_threshold = 2.45e6  # <-  this needs to phased out. Filter by percentile instead of absolute value.\n\n        # Read the anderson file:\n        parsed_anderson_df = (\n            raw_anderson_df\n            # Parsing score column and casting as float:\n            .withColumn(\"score\", f.col(\"score\").cast(\"float\") / f.lit(1000))\n            # Parsing the 'name' column:\n            .withColumn(\"parsedName\", f.split(f.col(\"name\"), \";\"))\n            .withColumn(\"gene_symbol\", f.col(\"parsedName\")[2])\n            .withColumn(\"location\", f.col(\"parsedName\")[0])\n            .withColumn(\n                \"chrom\",\n                f.regexp_replace(f.split(f.col(\"location\"), \":|-\")[0], \"chr\", \"\"),\n            )\n            .withColumn(\n                \"start\", f.split(f.col(\"location\"), \":|-\")[1].cast(t.IntegerType())\n            )\n            .withColumn(\n                \"end\", f.split(f.col(\"location\"), \":|-\")[2].cast(t.IntegerType())\n            )\n            # Select relevant columns:\n            .select(\"chrom\", \"start\", \"end\", \"gene_symbol\", \"score\")\n            # Drop rows with non-canonical chromosomes:\n            .filter(\n                f.col(\"chrom\").isin([str(x) for x in range(1, 23)] + [\"X\", \"Y\", \"MT\"])\n            )\n            # For each region/gene, keep only one row with the highest score:\n            .groupBy(\"chrom\", \"start\", \"end\", \"gene_symbol\")\n            .agg(f.max(\"score\").alias(\"resourceScore\"))\n            .orderBy(\"chrom\", \"start\")\n        )\n\n        return cls(\n            _df=(\n                # Lift over the intervals:\n                lift.convert_intervals(parsed_anderson_df, \"chrom\", \"start\", \"end\")\n                .drop(\"start\", \"end\")\n                .withColumnRenamed(\"mapped_start\", \"start\")\n                .withColumnRenamed(\"mapped_end\", \"end\")\n                .distinct()\n                # Joining 
with the gene index\n                .alias(\"intervals\")\n                .join(\n                    gene_index.symbols_lut().alias(\"genes\"),\n                    on=[\n                        f.col(\"intervals.gene_symbol\") == f.col(\"genes.geneSymbol\"),\n                        # Drop rows where the TSS is far from the start of the region\n                        f.abs(\n                            (f.col(\"intervals.start\") + f.col(\"intervals.end\")) / 2\n                            - f.col(\"tss\")\n                        )\n                        <= twosided_threshold,\n                    ],\n                    how=\"left\",\n                )\n                # Select relevant columns:\n                .select(\n                    f.col(\"chrom\").alias(\"chromosome\"),\n                    f.col(\"intervals.start\").alias(\"start\"),\n                    f.col(\"intervals.end\").alias(\"end\"),\n                    \"geneId\",\n                    \"resourceScore\",\n                    f.lit(dataset_name).alias(\"datasourceId\"),\n                    f.lit(experiment_type).alias(\"datatypeId\"),\n                    f.lit(pmid).alias(\"pmid\"),\n                    f.lit(bio_feature).alias(\"biofeature\"),\n                )\n            ),\n            _schema=Intervals.get_schema(),\n        )\n
"},{"location":"python_api/datasource/intervals/andersson/#otg.datasource.intervals.andersson.IntervalsAndersson.parse","title":"parse(raw_anderson_df: DataFrame, gene_index: GeneIndex, lift: LiftOverSpark) -> Intervals classmethod","text":"

Parse Andersson et al. 2014 dataset.

Parameters:

  • raw_anderson_df (DataFrame): Raw Andersson et al. dataset [required]
  • gene_index (GeneIndex): Gene index [required]
  • lift (LiftOverSpark): LiftOverSpark instance [required]

Returns:

  • Intervals: Intervals dataset
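
Example (a minimal sketch): `raw_anderson_df` is assumed to have been loaded with IntervalsAndersson.read (documented below), and `gene_index` (GeneIndex) and `lift` (LiftOverSpark) are assumed to be prepared elsewhere.

from otg.datasource.intervals.andersson import IntervalsAndersson

andersson_intervals = IntervalsAndersson.parse(raw_anderson_df, gene_index, lift)
andersson_intervals.df.show(5)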

Source code in src/otg/datasource/intervals/andersson.py
@classmethod\ndef parse(\n    cls: type[IntervalsAndersson],\n    raw_anderson_df: DataFrame,\n    gene_index: GeneIndex,\n    lift: LiftOverSpark,\n) -> Intervals:\n    \"\"\"Parse Andersson et al. 2014 dataset.\n\n    Args:\n        raw_anderson_df (DataFrame): Raw Andersson et al. dataset\n        gene_index (GeneIndex): Gene index\n        lift (LiftOverSpark): LiftOverSpark instance\n\n    Returns:\n        Intervals: Intervals dataset\n    \"\"\"\n    # Constant values:\n    dataset_name = \"andersson2014\"\n    experiment_type = \"fantom5\"\n    pmid = \"24670763\"\n    bio_feature = \"aggregate\"\n    twosided_threshold = 2.45e6  # <-  this needs to phased out. Filter by percentile instead of absolute value.\n\n    # Read the anderson file:\n    parsed_anderson_df = (\n        raw_anderson_df\n        # Parsing score column and casting as float:\n        .withColumn(\"score\", f.col(\"score\").cast(\"float\") / f.lit(1000))\n        # Parsing the 'name' column:\n        .withColumn(\"parsedName\", f.split(f.col(\"name\"), \";\"))\n        .withColumn(\"gene_symbol\", f.col(\"parsedName\")[2])\n        .withColumn(\"location\", f.col(\"parsedName\")[0])\n        .withColumn(\n            \"chrom\",\n            f.regexp_replace(f.split(f.col(\"location\"), \":|-\")[0], \"chr\", \"\"),\n        )\n        .withColumn(\n            \"start\", f.split(f.col(\"location\"), \":|-\")[1].cast(t.IntegerType())\n        )\n        .withColumn(\n            \"end\", f.split(f.col(\"location\"), \":|-\")[2].cast(t.IntegerType())\n        )\n        # Select relevant columns:\n        .select(\"chrom\", \"start\", \"end\", \"gene_symbol\", \"score\")\n        # Drop rows with non-canonical chromosomes:\n        .filter(\n            f.col(\"chrom\").isin([str(x) for x in range(1, 23)] + [\"X\", \"Y\", \"MT\"])\n        )\n        # For each region/gene, keep only one row with the highest score:\n        .groupBy(\"chrom\", \"start\", \"end\", \"gene_symbol\")\n        .agg(f.max(\"score\").alias(\"resourceScore\"))\n        .orderBy(\"chrom\", \"start\")\n    )\n\n    return cls(\n        _df=(\n            # Lift over the intervals:\n            lift.convert_intervals(parsed_anderson_df, \"chrom\", \"start\", \"end\")\n            .drop(\"start\", \"end\")\n            .withColumnRenamed(\"mapped_start\", \"start\")\n            .withColumnRenamed(\"mapped_end\", \"end\")\n            .distinct()\n            # Joining with the gene index\n            .alias(\"intervals\")\n            .join(\n                gene_index.symbols_lut().alias(\"genes\"),\n                on=[\n                    f.col(\"intervals.gene_symbol\") == f.col(\"genes.geneSymbol\"),\n                    # Drop rows where the TSS is far from the start of the region\n                    f.abs(\n                        (f.col(\"intervals.start\") + f.col(\"intervals.end\")) / 2\n                        - f.col(\"tss\")\n                    )\n                    <= twosided_threshold,\n                ],\n                how=\"left\",\n            )\n            # Select relevant columns:\n            .select(\n                f.col(\"chrom\").alias(\"chromosome\"),\n                f.col(\"intervals.start\").alias(\"start\"),\n                f.col(\"intervals.end\").alias(\"end\"),\n                \"geneId\",\n                \"resourceScore\",\n                f.lit(dataset_name).alias(\"datasourceId\"),\n                f.lit(experiment_type).alias(\"datatypeId\"),\n                f.lit(pmid).alias(\"pmid\"),\n      
          f.lit(bio_feature).alias(\"biofeature\"),\n            )\n        ),\n        _schema=Intervals.get_schema(),\n    )\n
"},{"location":"python_api/datasource/intervals/andersson/#otg.datasource.intervals.andersson.IntervalsAndersson.read","title":"read(spark: SparkSession, path: str) -> DataFrame staticmethod","text":"

Read andersson2014 dataset.

Parameters:

  • spark (SparkSession): Spark session (required)
  • path (str): Path to the dataset (required)

Returns:

  • DataFrame: Raw Andersson et al. dataframe

Source code in src/otg/datasource/intervals/andersson.py
@staticmethod\ndef read(spark: SparkSession, path: str) -> DataFrame:\n    \"\"\"Read andersson2014 dataset.\n\n    Args:\n        spark (SparkSession): Spark session\n        path (str): Path to the dataset\n\n    Returns:\n        DataFrame: Raw Andersson et al. dataframe\n    \"\"\"\n    input_schema = t.StructType.fromJson(\n        json.loads(\n            pkg_resources.read_text(schemas, \"andersson2014.json\", encoding=\"utf-8\")\n        )\n    )\n    return (\n        spark.read.option(\"delimiter\", \"\\t\")\n        .option(\"mode\", \"DROPMALFORMED\")\n        .option(\"header\", \"true\")\n        .schema(input_schema)\n        .csv(path)\n    )\n
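Putting `read` and `parse` together, an end-to-end call could look like the sketch below. The path is a placeholder, and the `gene_index` (GeneIndex) and `lift` (LiftOverSpark) objects are assumed to have been built earlier in the pipeline, so treat this as an illustration rather than a ready-to-run recipe:

```python
from pyspark.sql import SparkSession

from otg.datasource.intervals.andersson import IntervalsAndersson

spark = SparkSession.builder.getOrCreate()

# `gene_index` and `lift` are assumed to exist already; the path is a placeholder.
raw_andersson = IntervalsAndersson.read(spark, "gs://some-bucket/andersson2014.tsv")
intervals = IntervalsAndersson.parse(raw_andersson, gene_index, lift)
intervals.df.show(5)
```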
"},{"location":"python_api/datasource/intervals/javierre/","title":"Javierre et al.","text":""},{"location":"python_api/datasource/intervals/javierre/#otg.datasource.intervals.javierre.IntervalsJavierre","title":"otg.datasource.intervals.javierre.IntervalsJavierre","text":"

Bases: Intervals

Interval dataset from Javierre et al. 2016.

Source code in src/otg/datasource/intervals/javierre.py
class IntervalsJavierre(Intervals):\n    \"\"\"Interval dataset from Javierre et al. 2016.\"\"\"\n\n    @staticmethod\n    def read(spark: SparkSession, path: str) -> DataFrame:\n        \"\"\"Read Javierre dataset.\n\n        Args:\n            spark (SparkSession): Spark session\n            path (str): Path to dataset\n\n        Returns:\n            DataFrame: Raw Javierre dataset\n        \"\"\"\n        return spark.read.parquet(path)\n\n    @classmethod\n    def parse(\n        cls: type[IntervalsJavierre],\n        javierre_raw: DataFrame,\n        gene_index: GeneIndex,\n        lift: LiftOverSpark,\n    ) -> Intervals:\n        \"\"\"Parse Javierre et al. 2016 dataset.\n\n        Args:\n            javierre_raw (DataFrame): Raw Javierre data\n            gene_index (GeneIndex): Gene index\n            lift (LiftOverSpark): LiftOverSpark instance\n\n        Returns:\n            Intervals: Javierre et al. 2016 interval data\n        \"\"\"\n        # Constant values:\n        dataset_name = \"javierre2016\"\n        experiment_type = \"pchic\"\n        pmid = \"27863249\"\n        twosided_threshold = 2.45e6\n\n        # Read Javierre data:\n        javierre_parsed = (\n            javierre_raw\n            # Splitting name column into chromosome, start, end, and score:\n            .withColumn(\"name_split\", f.split(f.col(\"name\"), r\":|-|,\"))\n            .withColumn(\n                \"name_chr\",\n                f.regexp_replace(f.col(\"name_split\")[0], \"chr\", \"\").cast(\n                    t.StringType()\n                ),\n            )\n            .withColumn(\"name_start\", f.col(\"name_split\")[1].cast(t.IntegerType()))\n            .withColumn(\"name_end\", f.col(\"name_split\")[2].cast(t.IntegerType()))\n            .withColumn(\"name_score\", f.col(\"name_split\")[3].cast(t.FloatType()))\n            # Cleaning up chromosome:\n            .withColumn(\n                \"chrom\",\n                f.regexp_replace(f.col(\"chrom\"), \"chr\", \"\").cast(t.StringType()),\n            )\n            .drop(\"name_split\", \"name\", \"annotation\")\n            # Keep canonical chromosomes and consistent chromosomes with scores:\n            .filter(\n                (f.col(\"name_score\").isNotNull())\n                & (f.col(\"chrom\") == f.col(\"name_chr\"))\n                & f.col(\"name_chr\").isin(\n                    [f\"{x}\" for x in range(1, 23)] + [\"X\", \"Y\", \"MT\"]\n                )\n            )\n        )\n\n        # Lifting over intervals:\n        javierre_remapped = (\n            javierre_parsed\n            # Lifting over to GRCh38 interval 1:\n            .transform(lambda df: lift.convert_intervals(df, \"chrom\", \"start\", \"end\"))\n            .drop(\"start\", \"end\")\n            .withColumnRenamed(\"mapped_chrom\", \"chrom\")\n            .withColumnRenamed(\"mapped_start\", \"start\")\n            .withColumnRenamed(\"mapped_end\", \"end\")\n            # Lifting over interval 2 to GRCh38:\n            .transform(\n                lambda df: lift.convert_intervals(\n                    df, \"name_chr\", \"name_start\", \"name_end\"\n                )\n            )\n            .drop(\"name_start\", \"name_end\")\n            .withColumnRenamed(\"mapped_name_chr\", \"name_chr\")\n            .withColumnRenamed(\"mapped_name_start\", \"name_start\")\n            .withColumnRenamed(\"mapped_name_end\", \"name_end\")\n        )\n\n        # Once the intervals are lifted, extracting the unique intervals:\n        
unique_intervals_with_genes = (\n            javierre_remapped.select(\n                f.col(\"chrom\"),\n                f.col(\"start\").cast(t.IntegerType()),\n                f.col(\"end\").cast(t.IntegerType()),\n            )\n            .distinct()\n            .alias(\"intervals\")\n            .join(\n                gene_index.locations_lut().alias(\"genes\"),\n                on=[\n                    f.col(\"intervals.chrom\") == f.col(\"genes.chromosome\"),\n                    (\n                        (f.col(\"intervals.start\") >= f.col(\"genes.start\"))\n                        & (f.col(\"intervals.start\") <= f.col(\"genes.end\"))\n                    )\n                    | (\n                        (f.col(\"intervals.end\") >= f.col(\"genes.start\"))\n                        & (f.col(\"intervals.end\") <= f.col(\"genes.end\"))\n                    ),\n                ],\n                how=\"left\",\n            )\n            .select(\n                f.col(\"intervals.chrom\").alias(\"chrom\"),\n                f.col(\"intervals.start\").alias(\"start\"),\n                f.col(\"intervals.end\").alias(\"end\"),\n                f.col(\"genes.geneId\").alias(\"geneId\"),\n                f.col(\"genes.tss\").alias(\"tss\"),\n            )\n        )\n\n        # Joining back the data:\n        return cls(\n            _df=(\n                javierre_remapped.join(\n                    unique_intervals_with_genes,\n                    on=[\"chrom\", \"start\", \"end\"],\n                    how=\"left\",\n                )\n                .filter(\n                    # Drop rows where the TSS is far from the start of the region\n                    f.abs((f.col(\"start\") + f.col(\"end\")) / 2 - f.col(\"tss\"))\n                    <= twosided_threshold\n                )\n                # For each gene, keep only the highest scoring interval:\n                .groupBy(\"name_chr\", \"name_start\", \"name_end\", \"geneId\", \"bio_feature\")\n                .agg(f.max(f.col(\"name_score\")).alias(\"resourceScore\"))\n                # Create the output:\n                .select(\n                    f.col(\"name_chr\").alias(\"chromosome\"),\n                    f.col(\"name_start\").alias(\"start\"),\n                    f.col(\"name_end\").alias(\"end\"),\n                    f.col(\"resourceScore\").cast(t.DoubleType()),\n                    f.col(\"geneId\"),\n                    f.col(\"bio_feature\").alias(\"biofeature\"),\n                    f.lit(dataset_name).alias(\"datasourceId\"),\n                    f.lit(experiment_type).alias(\"datatypeId\"),\n                    f.lit(pmid).alias(\"pmid\"),\n                )\n            ),\n            _schema=Intervals.get_schema(),\n        )\n
"},{"location":"python_api/datasource/intervals/javierre/#otg.datasource.intervals.javierre.IntervalsJavierre.parse","title":"parse(javierre_raw: DataFrame, gene_index: GeneIndex, lift: LiftOverSpark) -> Intervals classmethod","text":"

Parse Javierre et al. 2016 dataset.

Parameters:

  • javierre_raw (DataFrame): Raw Javierre data (required)
  • gene_index (GeneIndex): Gene index (required)
  • lift (LiftOverSpark): LiftOverSpark instance (required)

Returns:

  • Intervals: Javierre et al. 2016 interval data

Source code in src/otg/datasource/intervals/javierre.py
@classmethod\ndef parse(\n    cls: type[IntervalsJavierre],\n    javierre_raw: DataFrame,\n    gene_index: GeneIndex,\n    lift: LiftOverSpark,\n) -> Intervals:\n    \"\"\"Parse Javierre et al. 2016 dataset.\n\n    Args:\n        javierre_raw (DataFrame): Raw Javierre data\n        gene_index (GeneIndex): Gene index\n        lift (LiftOverSpark): LiftOverSpark instance\n\n    Returns:\n        Intervals: Javierre et al. 2016 interval data\n    \"\"\"\n    # Constant values:\n    dataset_name = \"javierre2016\"\n    experiment_type = \"pchic\"\n    pmid = \"27863249\"\n    twosided_threshold = 2.45e6\n\n    # Read Javierre data:\n    javierre_parsed = (\n        javierre_raw\n        # Splitting name column into chromosome, start, end, and score:\n        .withColumn(\"name_split\", f.split(f.col(\"name\"), r\":|-|,\"))\n        .withColumn(\n            \"name_chr\",\n            f.regexp_replace(f.col(\"name_split\")[0], \"chr\", \"\").cast(\n                t.StringType()\n            ),\n        )\n        .withColumn(\"name_start\", f.col(\"name_split\")[1].cast(t.IntegerType()))\n        .withColumn(\"name_end\", f.col(\"name_split\")[2].cast(t.IntegerType()))\n        .withColumn(\"name_score\", f.col(\"name_split\")[3].cast(t.FloatType()))\n        # Cleaning up chromosome:\n        .withColumn(\n            \"chrom\",\n            f.regexp_replace(f.col(\"chrom\"), \"chr\", \"\").cast(t.StringType()),\n        )\n        .drop(\"name_split\", \"name\", \"annotation\")\n        # Keep canonical chromosomes and consistent chromosomes with scores:\n        .filter(\n            (f.col(\"name_score\").isNotNull())\n            & (f.col(\"chrom\") == f.col(\"name_chr\"))\n            & f.col(\"name_chr\").isin(\n                [f\"{x}\" for x in range(1, 23)] + [\"X\", \"Y\", \"MT\"]\n            )\n        )\n    )\n\n    # Lifting over intervals:\n    javierre_remapped = (\n        javierre_parsed\n        # Lifting over to GRCh38 interval 1:\n        .transform(lambda df: lift.convert_intervals(df, \"chrom\", \"start\", \"end\"))\n        .drop(\"start\", \"end\")\n        .withColumnRenamed(\"mapped_chrom\", \"chrom\")\n        .withColumnRenamed(\"mapped_start\", \"start\")\n        .withColumnRenamed(\"mapped_end\", \"end\")\n        # Lifting over interval 2 to GRCh38:\n        .transform(\n            lambda df: lift.convert_intervals(\n                df, \"name_chr\", \"name_start\", \"name_end\"\n            )\n        )\n        .drop(\"name_start\", \"name_end\")\n        .withColumnRenamed(\"mapped_name_chr\", \"name_chr\")\n        .withColumnRenamed(\"mapped_name_start\", \"name_start\")\n        .withColumnRenamed(\"mapped_name_end\", \"name_end\")\n    )\n\n    # Once the intervals are lifted, extracting the unique intervals:\n    unique_intervals_with_genes = (\n        javierre_remapped.select(\n            f.col(\"chrom\"),\n            f.col(\"start\").cast(t.IntegerType()),\n            f.col(\"end\").cast(t.IntegerType()),\n        )\n        .distinct()\n        .alias(\"intervals\")\n        .join(\n            gene_index.locations_lut().alias(\"genes\"),\n            on=[\n                f.col(\"intervals.chrom\") == f.col(\"genes.chromosome\"),\n                (\n                    (f.col(\"intervals.start\") >= f.col(\"genes.start\"))\n                    & (f.col(\"intervals.start\") <= f.col(\"genes.end\"))\n                )\n                | (\n                    (f.col(\"intervals.end\") >= f.col(\"genes.start\"))\n                    & 
(f.col(\"intervals.end\") <= f.col(\"genes.end\"))\n                ),\n            ],\n            how=\"left\",\n        )\n        .select(\n            f.col(\"intervals.chrom\").alias(\"chrom\"),\n            f.col(\"intervals.start\").alias(\"start\"),\n            f.col(\"intervals.end\").alias(\"end\"),\n            f.col(\"genes.geneId\").alias(\"geneId\"),\n            f.col(\"genes.tss\").alias(\"tss\"),\n        )\n    )\n\n    # Joining back the data:\n    return cls(\n        _df=(\n            javierre_remapped.join(\n                unique_intervals_with_genes,\n                on=[\"chrom\", \"start\", \"end\"],\n                how=\"left\",\n            )\n            .filter(\n                # Drop rows where the TSS is far from the start of the region\n                f.abs((f.col(\"start\") + f.col(\"end\")) / 2 - f.col(\"tss\"))\n                <= twosided_threshold\n            )\n            # For each gene, keep only the highest scoring interval:\n            .groupBy(\"name_chr\", \"name_start\", \"name_end\", \"geneId\", \"bio_feature\")\n            .agg(f.max(f.col(\"name_score\")).alias(\"resourceScore\"))\n            # Create the output:\n            .select(\n                f.col(\"name_chr\").alias(\"chromosome\"),\n                f.col(\"name_start\").alias(\"start\"),\n                f.col(\"name_end\").alias(\"end\"),\n                f.col(\"resourceScore\").cast(t.DoubleType()),\n                f.col(\"geneId\"),\n                f.col(\"bio_feature\").alias(\"biofeature\"),\n                f.lit(dataset_name).alias(\"datasourceId\"),\n                f.lit(experiment_type).alias(\"datatypeId\"),\n                f.lit(pmid).alias(\"pmid\"),\n            )\n        ),\n        _schema=Intervals.get_schema(),\n    )\n
"},{"location":"python_api/datasource/intervals/javierre/#otg.datasource.intervals.javierre.IntervalsJavierre.read","title":"read(spark: SparkSession, path: str) -> DataFrame staticmethod","text":"

Read Javierre dataset.

Parameters:

  • spark (SparkSession): Spark session (required)
  • path (str): Path to dataset (required)

Returns:

  • DataFrame: Raw Javierre dataset

Source code in src/otg/datasource/intervals/javierre.py
@staticmethod\ndef read(spark: SparkSession, path: str) -> DataFrame:\n    \"\"\"Read Javierre dataset.\n\n    Args:\n        spark (SparkSession): Spark session\n        path (str): Path to dataset\n\n    Returns:\n        DataFrame: Raw Javierre dataset\n    \"\"\"\n    return spark.read.parquet(path)\n
"},{"location":"python_api/datasource/intervals/jung/","title":"Jung et al.","text":""},{"location":"python_api/datasource/intervals/jung/#otg.datasource.intervals.jung.IntervalsJung","title":"otg.datasource.intervals.jung.IntervalsJung","text":"

Bases: Intervals

Interval dataset from Jung et al. 2019.

Source code in src/otg/datasource/intervals/jung.py
class IntervalsJung(Intervals):\n    \"\"\"Interval dataset from Jung et al. 2019.\"\"\"\n\n    @staticmethod\n    def read(spark: SparkSession, path: str) -> DataFrame:\n        \"\"\"Read jung dataset.\n\n        Args:\n            spark (SparkSession): Spark session\n            path (str): Path to dataset\n\n        Returns:\n            DataFrame: DataFrame with raw jung data\n        \"\"\"\n        return spark.read.csv(path, sep=\",\", header=True)\n\n    @classmethod\n    def parse(\n        cls: type[IntervalsJung],\n        jung_raw: DataFrame,\n        gene_index: GeneIndex,\n        lift: LiftOverSpark,\n    ) -> Intervals:\n        \"\"\"Parse the Jung et al. 2019 dataset.\n\n        Args:\n            jung_raw (DataFrame): raw Jung et al. 2019 dataset\n            gene_index (GeneIndex): gene index\n            lift (LiftOverSpark): LiftOverSpark instance\n\n        Returns:\n            Intervals: Interval dataset containing Jung et al. 2019 data\n        \"\"\"\n        dataset_name = \"jung2019\"\n        experiment_type = \"pchic\"\n        pmid = \"31501517\"\n\n        # Lifting over the coordinates:\n        return cls(\n            _df=(\n                jung_raw.withColumn(\n                    \"interval\", f.split(f.col(\"Interacting_fragment\"), r\"\\.\")\n                )\n                .select(\n                    # Parsing intervals:\n                    f.regexp_replace(f.col(\"interval\")[0], \"chr\", \"\").alias(\"chrom\"),\n                    f.col(\"interval\")[1].cast(t.IntegerType()).alias(\"start\"),\n                    f.col(\"interval\")[2].cast(t.IntegerType()).alias(\"end\"),\n                    # Extract other columns:\n                    f.col(\"Promoter\").alias(\"gene_name\"),\n                    f.col(\"Tissue_type\").alias(\"tissue\"),\n                )\n                # Lifting over to GRCh38 interval 1:\n                .transform(\n                    lambda df: lift.convert_intervals(df, \"chrom\", \"start\", \"end\")\n                )\n                .select(\n                    \"chrom\",\n                    f.col(\"mapped_start\").alias(\"start\"),\n                    f.col(\"mapped_end\").alias(\"end\"),\n                    f.explode(f.split(f.col(\"gene_name\"), \";\")).alias(\"gene_name\"),\n                    \"tissue\",\n                )\n                .alias(\"intervals\")\n                # Joining with genes:\n                .join(\n                    gene_index.symbols_lut().alias(\"genes\"),\n                    on=[f.col(\"intervals.gene_name\") == f.col(\"genes.geneSymbol\")],\n                    how=\"inner\",\n                )\n                # Finalize dataset:\n                .select(\n                    \"chromosome\",\n                    f.col(\"intervals.start\").alias(\"start\"),\n                    f.col(\"intervals.end\").alias(\"end\"),\n                    \"geneId\",\n                    f.col(\"tissue\").alias(\"biofeature\"),\n                    f.lit(1.0).alias(\"score\"),\n                    f.lit(dataset_name).alias(\"datasourceId\"),\n                    f.lit(experiment_type).alias(\"datatypeId\"),\n                    f.lit(pmid).alias(\"pmid\"),\n                )\n                .drop_duplicates()\n            ),\n            _schema=Intervals.get_schema(),\n        )\n
"},{"location":"python_api/datasource/intervals/jung/#otg.datasource.intervals.jung.IntervalsJung.parse","title":"parse(jung_raw: DataFrame, gene_index: GeneIndex, lift: LiftOverSpark) -> Intervals classmethod","text":"

Parse the Jung et al. 2019 dataset.

Parameters:

  • jung_raw (DataFrame): Raw Jung et al. 2019 dataset (required)
  • gene_index (GeneIndex): Gene index (required)
  • lift (LiftOverSpark): LiftOverSpark instance (required)

Returns:

  • Intervals: Interval dataset containing Jung et al. 2019 data

Source code in src/otg/datasource/intervals/jung.py
@classmethod\ndef parse(\n    cls: type[IntervalsJung],\n    jung_raw: DataFrame,\n    gene_index: GeneIndex,\n    lift: LiftOverSpark,\n) -> Intervals:\n    \"\"\"Parse the Jung et al. 2019 dataset.\n\n    Args:\n        jung_raw (DataFrame): raw Jung et al. 2019 dataset\n        gene_index (GeneIndex): gene index\n        lift (LiftOverSpark): LiftOverSpark instance\n\n    Returns:\n        Intervals: Interval dataset containing Jung et al. 2019 data\n    \"\"\"\n    dataset_name = \"jung2019\"\n    experiment_type = \"pchic\"\n    pmid = \"31501517\"\n\n    # Lifting over the coordinates:\n    return cls(\n        _df=(\n            jung_raw.withColumn(\n                \"interval\", f.split(f.col(\"Interacting_fragment\"), r\"\\.\")\n            )\n            .select(\n                # Parsing intervals:\n                f.regexp_replace(f.col(\"interval\")[0], \"chr\", \"\").alias(\"chrom\"),\n                f.col(\"interval\")[1].cast(t.IntegerType()).alias(\"start\"),\n                f.col(\"interval\")[2].cast(t.IntegerType()).alias(\"end\"),\n                # Extract other columns:\n                f.col(\"Promoter\").alias(\"gene_name\"),\n                f.col(\"Tissue_type\").alias(\"tissue\"),\n            )\n            # Lifting over to GRCh38 interval 1:\n            .transform(\n                lambda df: lift.convert_intervals(df, \"chrom\", \"start\", \"end\")\n            )\n            .select(\n                \"chrom\",\n                f.col(\"mapped_start\").alias(\"start\"),\n                f.col(\"mapped_end\").alias(\"end\"),\n                f.explode(f.split(f.col(\"gene_name\"), \";\")).alias(\"gene_name\"),\n                \"tissue\",\n            )\n            .alias(\"intervals\")\n            # Joining with genes:\n            .join(\n                gene_index.symbols_lut().alias(\"genes\"),\n                on=[f.col(\"intervals.gene_name\") == f.col(\"genes.geneSymbol\")],\n                how=\"inner\",\n            )\n            # Finalize dataset:\n            .select(\n                \"chromosome\",\n                f.col(\"intervals.start\").alias(\"start\"),\n                f.col(\"intervals.end\").alias(\"end\"),\n                \"geneId\",\n                f.col(\"tissue\").alias(\"biofeature\"),\n                f.lit(1.0).alias(\"score\"),\n                f.lit(dataset_name).alias(\"datasourceId\"),\n                f.lit(experiment_type).alias(\"datatypeId\"),\n                f.lit(pmid).alias(\"pmid\"),\n            )\n            .drop_duplicates()\n        ),\n        _schema=Intervals.get_schema(),\n    )\n
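A single interacting fragment in the Jung data can be linked to several promoters, so the semicolon-separated `Promoter` field is exploded into one row per gene before joining to the gene index. A small runnable illustration of that step with invented values:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.master("local[1]").getOrCreate()

df = spark.createDataFrame(
    [("chr1.100.200", "GENE1;GENE2", "Liver")],
    ["Interacting_fragment", "Promoter", "Tissue_type"],
)

exploded = df.select(
    f.explode(f.split(f.col("Promoter"), ";")).alias("gene_name"),
    f.col("Tissue_type").alias("tissue"),
)
exploded.show()  # one row for GENE1 and one for GENE2
```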
"},{"location":"python_api/datasource/intervals/jung/#otg.datasource.intervals.jung.IntervalsJung.read","title":"read(spark: SparkSession, path: str) -> DataFrame staticmethod","text":"

Read jung dataset.

Parameters:

  • spark (SparkSession): Spark session (required)
  • path (str): Path to dataset (required)

Returns:

  • DataFrame: DataFrame with raw Jung data

Source code in src/otg/datasource/intervals/jung.py
@staticmethod\ndef read(spark: SparkSession, path: str) -> DataFrame:\n    \"\"\"Read jung dataset.\n\n    Args:\n        spark (SparkSession): Spark session\n        path (str): Path to dataset\n\n    Returns:\n        DataFrame: DataFrame with raw jung data\n    \"\"\"\n    return spark.read.csv(path, sep=\",\", header=True)\n
"},{"location":"python_api/datasource/intervals/thurman/","title":"Thurman et al.","text":""},{"location":"python_api/datasource/intervals/thurman/#otg.datasource.intervals.thurman.IntervalsThurman","title":"otg.datasource.intervals.thurman.IntervalsThurman","text":"

Bases: Intervals

Interval dataset from Thurman et al. 2012.

Source code in src/otg/datasource/intervals/thurman.py
class IntervalsThurman(Intervals):\n    \"\"\"Interval dataset from Thurman et al. 2012.\"\"\"\n\n    @staticmethod\n    def read(spark: SparkSession, path: str) -> DataFrame:\n        \"\"\"Read thurman dataset.\n\n        Args:\n            spark (SparkSession): Spark session\n            path (str): Path to dataset\n\n        Returns:\n            DataFrame: DataFrame with raw thurman data\n        \"\"\"\n        thurman_schema = t.StructType(\n            [\n                t.StructField(\"gene_chr\", t.StringType(), False),\n                t.StructField(\"gene_start\", t.IntegerType(), False),\n                t.StructField(\"gene_end\", t.IntegerType(), False),\n                t.StructField(\"gene_name\", t.StringType(), False),\n                t.StructField(\"chrom\", t.StringType(), False),\n                t.StructField(\"start\", t.IntegerType(), False),\n                t.StructField(\"end\", t.IntegerType(), False),\n                t.StructField(\"score\", t.FloatType(), False),\n            ]\n        )\n        return spark.read.csv(path, sep=\"\\t\", header=True, schema=thurman_schema)\n\n    @classmethod\n    def parse(\n        cls: type[IntervalsThurman],\n        thurman_raw: DataFrame,\n        gene_index: GeneIndex,\n        lift: LiftOverSpark,\n    ) -> Intervals:\n        \"\"\"Parse the Thurman et al. 2012 dataset.\n\n        Args:\n            thurman_raw (DataFrame): raw Thurman et al. 2019 dataset\n            gene_index (GeneIndex): gene index\n            lift (LiftOverSpark): LiftOverSpark instance\n\n        Returns:\n            Intervals: Interval dataset containing Thurman et al. 2012 data\n        \"\"\"\n        dataset_name = \"thurman2012\"\n        experiment_type = \"dhscor\"\n        pmid = \"22955617\"\n\n        return cls(\n            _df=(\n                thurman_raw.select(\n                    f.regexp_replace(f.col(\"chrom\"), \"chr\", \"\").alias(\"chrom\"),\n                    \"start\",\n                    \"end\",\n                    \"gene_name\",\n                    \"score\",\n                )\n                # Lift over to the GRCh38 build:\n                .transform(\n                    lambda df: lift.convert_intervals(df, \"chrom\", \"start\", \"end\")\n                )\n                .alias(\"intervals\")\n                # Map gene names to gene IDs:\n                .join(\n                    gene_index.symbols_lut().alias(\"genes\"),\n                    on=[\n                        f.col(\"intervals.gene_name\") == f.col(\"genes.geneSymbol\"),\n                        f.col(\"intervals.chrom\") == f.col(\"genes.chromosome\"),\n                    ],\n                    how=\"inner\",\n                )\n                # Select relevant columns and add constant columns:\n                .select(\n                    f.col(\"chrom\").alias(\"chromosome\"),\n                    f.col(\"mapped_start\").alias(\"start\"),\n                    f.col(\"mapped_end\").alias(\"end\"),\n                    \"geneId\",\n                    f.col(\"score\").cast(t.DoubleType()).alias(\"resourceScore\"),\n                    f.lit(dataset_name).alias(\"datasourceId\"),\n                    f.lit(experiment_type).alias(\"datatypeId\"),\n                    f.lit(pmid).alias(\"pmid\"),\n                )\n                .distinct()\n            ),\n            _schema=cls.get_schema(),\n        )\n
"},{"location":"python_api/datasource/intervals/thurman/#otg.datasource.intervals.thurman.IntervalsThurman.parse","title":"parse(thurman_raw: DataFrame, gene_index: GeneIndex, lift: LiftOverSpark) -> Intervals classmethod","text":"

Parse the Thurman et al. 2012 dataset.

Parameters:

  • thurman_raw (DataFrame): Raw Thurman et al. 2012 dataset (required)
  • gene_index (GeneIndex): Gene index (required)
  • lift (LiftOverSpark): LiftOverSpark instance (required)

Returns:

  • Intervals: Interval dataset containing Thurman et al. 2012 data

Source code in src/otg/datasource/intervals/thurman.py
@classmethod\ndef parse(\n    cls: type[IntervalsThurman],\n    thurman_raw: DataFrame,\n    gene_index: GeneIndex,\n    lift: LiftOverSpark,\n) -> Intervals:\n    \"\"\"Parse the Thurman et al. 2012 dataset.\n\n    Args:\n        thurman_raw (DataFrame): raw Thurman et al. 2019 dataset\n        gene_index (GeneIndex): gene index\n        lift (LiftOverSpark): LiftOverSpark instance\n\n    Returns:\n        Intervals: Interval dataset containing Thurman et al. 2012 data\n    \"\"\"\n    dataset_name = \"thurman2012\"\n    experiment_type = \"dhscor\"\n    pmid = \"22955617\"\n\n    return cls(\n        _df=(\n            thurman_raw.select(\n                f.regexp_replace(f.col(\"chrom\"), \"chr\", \"\").alias(\"chrom\"),\n                \"start\",\n                \"end\",\n                \"gene_name\",\n                \"score\",\n            )\n            # Lift over to the GRCh38 build:\n            .transform(\n                lambda df: lift.convert_intervals(df, \"chrom\", \"start\", \"end\")\n            )\n            .alias(\"intervals\")\n            # Map gene names to gene IDs:\n            .join(\n                gene_index.symbols_lut().alias(\"genes\"),\n                on=[\n                    f.col(\"intervals.gene_name\") == f.col(\"genes.geneSymbol\"),\n                    f.col(\"intervals.chrom\") == f.col(\"genes.chromosome\"),\n                ],\n                how=\"inner\",\n            )\n            # Select relevant columns and add constant columns:\n            .select(\n                f.col(\"chrom\").alias(\"chromosome\"),\n                f.col(\"mapped_start\").alias(\"start\"),\n                f.col(\"mapped_end\").alias(\"end\"),\n                \"geneId\",\n                f.col(\"score\").cast(t.DoubleType()).alias(\"resourceScore\"),\n                f.lit(dataset_name).alias(\"datasourceId\"),\n                f.lit(experiment_type).alias(\"datatypeId\"),\n                f.lit(pmid).alias(\"pmid\"),\n            )\n            .distinct()\n        ),\n        _schema=cls.get_schema(),\n    )\n
"},{"location":"python_api/datasource/intervals/thurman/#otg.datasource.intervals.thurman.IntervalsThurman.read","title":"read(spark: SparkSession, path: str) -> DataFrame staticmethod","text":"

Read thurman dataset.

Parameters:

  • spark (SparkSession): Spark session (required)
  • path (str): Path to dataset (required)

Returns:

  • DataFrame: DataFrame with raw Thurman data

Source code in src/otg/datasource/intervals/thurman.py
@staticmethod\ndef read(spark: SparkSession, path: str) -> DataFrame:\n    \"\"\"Read thurman dataset.\n\n    Args:\n        spark (SparkSession): Spark session\n        path (str): Path to dataset\n\n    Returns:\n        DataFrame: DataFrame with raw thurman data\n    \"\"\"\n    thurman_schema = t.StructType(\n        [\n            t.StructField(\"gene_chr\", t.StringType(), False),\n            t.StructField(\"gene_start\", t.IntegerType(), False),\n            t.StructField(\"gene_end\", t.IntegerType(), False),\n            t.StructField(\"gene_name\", t.StringType(), False),\n            t.StructField(\"chrom\", t.StringType(), False),\n            t.StructField(\"start\", t.IntegerType(), False),\n            t.StructField(\"end\", t.IntegerType(), False),\n            t.StructField(\"score\", t.FloatType(), False),\n        ]\n    )\n    return spark.read.csv(path, sep=\"\\t\", header=True, schema=thurman_schema)\n
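Because the file is read with an explicit schema rather than schema inference, the columns come back already typed. A usage sketch (the path is a placeholder):

```python
from pyspark.sql import SparkSession

from otg.datasource.intervals.thurman import IntervalsThurman

spark = SparkSession.builder.getOrCreate()

raw_thurman = IntervalsThurman.read(spark, "gs://some-bucket/thurman2012.tsv")  # placeholder path
raw_thurman.printSchema()  # gene_chr, gene_start, ..., score with the declared types
```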
"},{"location":"python_api/datasource/open_targets/_open_targets/","title":"Open Targets","text":"

The Open Targets Platform is a comprehensive resource that aims to aggregate and harmonise various types of data to facilitate the identification, prioritisation, and validation of drug targets. By integrating publicly available datasets including data generated by the Open Targets consortium, the Platform builds and scores target-disease associations to assist in drug target identification and prioritisation. It also integrates relevant annotation information about targets, diseases, phenotypes, and drugs, as well as their most relevant relationships.

Genomic data from Open Targets integrates human genome-wide association studies (GWAS) and functional genomics data including gene expression, protein abundance, chromatin interaction and conformation data from a wide range of cell types and tissues to make robust connections between GWAS-associated loci, variants and likely causal genes.

"},{"location":"python_api/datasource/open_targets/l2g_gold_standard/","title":"L2G Gold Standard","text":""},{"location":"python_api/datasource/open_targets/l2g_gold_standard/#otg.datasource.open_targets.l2g_gold_standard.OpenTargetsL2GGoldStandard","title":"otg.datasource.open_targets.l2g_gold_standard.OpenTargetsL2GGoldStandard","text":"

Parser for OTGenetics locus to gene gold standards curation.

The curation is processed to generate a dataset with 2 labels:
  • Gold Standard Positive (GSP): Variant is within 500kb of gene
  • Gold Standard Negative (GSN): Variant is not within 500kb of gene
Source code in src/otg/datasource/open_targets/l2g_gold_standard.py
class OpenTargetsL2GGoldStandard:\n    \"\"\"Parser for OTGenetics locus to gene gold standards curation.\n\n    The curation is processed to generate a dataset with 2 labels:\n        - Gold Standard Positive (GSP): Variant is within 500kb of gene\n        - Gold Standard Negative (GSN): Variant is not within 500kb of gene\n    \"\"\"\n\n    @staticmethod\n    def process_gene_interactions(interactions: DataFrame) -> DataFrame:\n        \"\"\"Extract top scoring gene-gene interaction from the interactions dataset of the Platform.\n\n        Args:\n            interactions (DataFrame): Gene-gene interactions dataset\n\n        Returns:\n            DataFrame: Top scoring gene-gene interaction per pair of genes\n        \"\"\"\n        return get_record_with_maximum_value(\n            interactions,\n            [\"targetA\", \"targetB\"],\n            \"scoring\",\n        ).selectExpr(\n            \"targetA as geneIdA\",\n            \"targetB as geneIdB\",\n            \"scoring as score\",\n        )\n\n    @classmethod\n    def as_l2g_gold_standard(\n        cls: type[OpenTargetsL2GGoldStandard],\n        gold_standard_curation: DataFrame,\n        v2g: V2G,\n        study_locus_overlap: StudyLocusOverlap,\n        interactions: DataFrame,\n    ) -> L2GGoldStandard:\n        \"\"\"Initialise L2GGoldStandard from source dataset.\n\n        Args:\n            gold_standard_curation (DataFrame): Gold standard curation dataframe, extracted from https://github.com/opentargets/genetics-gold-standards\n            v2g (V2G): Variant to gene dataset to bring distance between a variant and a gene's TSS\n            study_locus_overlap (StudyLocusOverlap): Study locus overlap dataset to remove duplicated loci\n            interactions (DataFrame): Gene-gene interactions dataset to remove negative cases where the gene interacts with a positive gene\n\n        Returns:\n            L2GGoldStandard: L2G Gold Standard dataset\n        \"\"\"\n        overlaps_df = study_locus_overlap._df.select(\n            \"leftStudyLocusId\", \"rightStudyLocusId\"\n        )\n        interactions_df = cls.process_gene_interactions(interactions)\n        return L2GGoldStandard(\n            _df=(\n                gold_standard_curation.filter(\n                    f.col(\"gold_standard_info.highest_confidence\").isin(\n                        [\"High\", \"Medium\"]\n                    )\n                )\n                .select(\n                    f.col(\"association_info.otg_id\").alias(\"studyId\"),\n                    f.col(\"gold_standard_info.gene_id\").alias(\"geneId\"),\n                    f.concat_ws(\n                        \"_\",\n                        f.col(\"sentinel_variant.locus_GRCh38.chromosome\"),\n                        f.col(\"sentinel_variant.locus_GRCh38.position\"),\n                        f.col(\"sentinel_variant.alleles.reference\"),\n                        f.col(\"sentinel_variant.alleles.alternative\"),\n                    ).alias(\"variantId\"),\n                    f.col(\"metadata.set_label\").alias(\"source\"),\n                )\n                .withColumn(\n                    \"studyLocusId\",\n                    StudyLocus.assign_study_locus_id(\"studyId\", \"variantId\"),\n                )\n                .groupBy(\"studyLocusId\", \"studyId\", \"variantId\", \"geneId\")\n                .agg(\n                    f.collect_set(\"source\").alias(\"sources\"),\n                )\n                # Assign Positive or Negative Status based on confidence\n         
       .join(\n                    v2g.df.filter(f.col(\"distance\").isNotNull()).select(\n                        \"variantId\", \"geneId\", \"distance\"\n                    ),\n                    on=[\"variantId\", \"geneId\"],\n                    how=\"inner\",\n                )\n                .withColumn(\n                    \"goldStandardSet\",\n                    f.when(f.col(\"distance\") <= 500_000, f.lit(\"positive\")).otherwise(\n                        f.lit(\"negative\")\n                    ),\n                )\n                # Remove redundant loci by testing they are truly independent\n                .alias(\"left\")\n                .join(\n                    overlaps_df.alias(\"right\"),\n                    (f.col(\"left.variantId\") == f.col(\"right.leftStudyLocusId\"))\n                    | (f.col(\"left.variantId\") == f.col(\"right.rightStudyLocusId\")),\n                    how=\"left\",\n                )\n                .distinct()\n                # Remove redundant genes by testing they do not interact with a positive gene\n                .join(\n                    interactions_df.alias(\"interactions\"),\n                    (f.col(\"left.geneId\") == f.col(\"interactions.geneIdA\"))\n                    | (f.col(\"left.geneId\") == f.col(\"interactions.geneIdB\")),\n                    how=\"left\",\n                )\n                .withColumn(\"interacting\", (f.col(\"score\") > 0.7))\n                # filter out genes where geneIdA has goldStandardSet negative but geneIdA and gene IdB are interacting\n                .filter(\n                    ~(\n                        (f.col(\"goldStandardSet\") == 0)\n                        & (f.col(\"interacting\"))\n                        & (\n                            (f.col(\"left.geneId\") == f.col(\"interactions.geneIdA\"))\n                            | (f.col(\"left.geneId\") == f.col(\"interactions.geneIdB\"))\n                        )\n                    )\n                )\n                .select(\"studyLocusId\", \"geneId\", \"goldStandardSet\", \"sources\")\n            ),\n            _schema=L2GGoldStandard.get_schema(),\n        )\n
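The positive/negative label is derived purely from the variant-to-gene distance: within 500 kb is positive, otherwise negative. A minimal runnable sketch of that labelling rule on toy data (the column names follow the class above; the values are invented):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.master("local[1]").getOrCreate()

df = spark.createDataFrame(
    [("v1", "ENSG0001", 120_000), ("v1", "ENSG0002", 750_000)],
    ["variantId", "geneId", "distance"],
)

labelled = df.withColumn(
    "goldStandardSet",
    f.when(f.col("distance") <= 500_000, f.lit("positive")).otherwise(f.lit("negative")),
)
labelled.show()  # ENSG0001 -> positive, ENSG0002 -> negative
```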
"},{"location":"python_api/datasource/open_targets/l2g_gold_standard/#otg.datasource.open_targets.l2g_gold_standard.OpenTargetsL2GGoldStandard.as_l2g_gold_standard","title":"as_l2g_gold_standard(gold_standard_curation: DataFrame, v2g: V2G, study_locus_overlap: StudyLocusOverlap, interactions: DataFrame) -> L2GGoldStandard classmethod","text":"

Initialise L2GGoldStandard from source dataset.

Parameters:

  • gold_standard_curation (DataFrame): Gold standard curation dataframe, extracted from https://github.com/opentargets/genetics-gold-standards (required)
  • v2g (V2G): Variant-to-gene dataset providing the distance between a variant and a gene's TSS (required)
  • study_locus_overlap (StudyLocusOverlap): Study locus overlap dataset used to remove duplicated loci (required)
  • interactions (DataFrame): Gene-gene interactions dataset used to remove negative cases where the gene interacts with a positive gene (required)

Returns:

  • L2GGoldStandard: L2G Gold Standard dataset

Source code in src/otg/datasource/open_targets/l2g_gold_standard.py
@classmethod\ndef as_l2g_gold_standard(\n    cls: type[OpenTargetsL2GGoldStandard],\n    gold_standard_curation: DataFrame,\n    v2g: V2G,\n    study_locus_overlap: StudyLocusOverlap,\n    interactions: DataFrame,\n) -> L2GGoldStandard:\n    \"\"\"Initialise L2GGoldStandard from source dataset.\n\n    Args:\n        gold_standard_curation (DataFrame): Gold standard curation dataframe, extracted from https://github.com/opentargets/genetics-gold-standards\n        v2g (V2G): Variant to gene dataset to bring distance between a variant and a gene's TSS\n        study_locus_overlap (StudyLocusOverlap): Study locus overlap dataset to remove duplicated loci\n        interactions (DataFrame): Gene-gene interactions dataset to remove negative cases where the gene interacts with a positive gene\n\n    Returns:\n        L2GGoldStandard: L2G Gold Standard dataset\n    \"\"\"\n    overlaps_df = study_locus_overlap._df.select(\n        \"leftStudyLocusId\", \"rightStudyLocusId\"\n    )\n    interactions_df = cls.process_gene_interactions(interactions)\n    return L2GGoldStandard(\n        _df=(\n            gold_standard_curation.filter(\n                f.col(\"gold_standard_info.highest_confidence\").isin(\n                    [\"High\", \"Medium\"]\n                )\n            )\n            .select(\n                f.col(\"association_info.otg_id\").alias(\"studyId\"),\n                f.col(\"gold_standard_info.gene_id\").alias(\"geneId\"),\n                f.concat_ws(\n                    \"_\",\n                    f.col(\"sentinel_variant.locus_GRCh38.chromosome\"),\n                    f.col(\"sentinel_variant.locus_GRCh38.position\"),\n                    f.col(\"sentinel_variant.alleles.reference\"),\n                    f.col(\"sentinel_variant.alleles.alternative\"),\n                ).alias(\"variantId\"),\n                f.col(\"metadata.set_label\").alias(\"source\"),\n            )\n            .withColumn(\n                \"studyLocusId\",\n                StudyLocus.assign_study_locus_id(\"studyId\", \"variantId\"),\n            )\n            .groupBy(\"studyLocusId\", \"studyId\", \"variantId\", \"geneId\")\n            .agg(\n                f.collect_set(\"source\").alias(\"sources\"),\n            )\n            # Assign Positive or Negative Status based on confidence\n            .join(\n                v2g.df.filter(f.col(\"distance\").isNotNull()).select(\n                    \"variantId\", \"geneId\", \"distance\"\n                ),\n                on=[\"variantId\", \"geneId\"],\n                how=\"inner\",\n            )\n            .withColumn(\n                \"goldStandardSet\",\n                f.when(f.col(\"distance\") <= 500_000, f.lit(\"positive\")).otherwise(\n                    f.lit(\"negative\")\n                ),\n            )\n            # Remove redundant loci by testing they are truly independent\n            .alias(\"left\")\n            .join(\n                overlaps_df.alias(\"right\"),\n                (f.col(\"left.variantId\") == f.col(\"right.leftStudyLocusId\"))\n                | (f.col(\"left.variantId\") == f.col(\"right.rightStudyLocusId\")),\n                how=\"left\",\n            )\n            .distinct()\n            # Remove redundant genes by testing they do not interact with a positive gene\n            .join(\n                interactions_df.alias(\"interactions\"),\n                (f.col(\"left.geneId\") == f.col(\"interactions.geneIdA\"))\n                | (f.col(\"left.geneId\") == 
f.col(\"interactions.geneIdB\")),\n                how=\"left\",\n            )\n            .withColumn(\"interacting\", (f.col(\"score\") > 0.7))\n            # filter out genes where geneIdA has goldStandardSet negative but geneIdA and gene IdB are interacting\n            .filter(\n                ~(\n                    (f.col(\"goldStandardSet\") == 0)\n                    & (f.col(\"interacting\"))\n                    & (\n                        (f.col(\"left.geneId\") == f.col(\"interactions.geneIdA\"))\n                        | (f.col(\"left.geneId\") == f.col(\"interactions.geneIdB\"))\n                    )\n                )\n            )\n            .select(\"studyLocusId\", \"geneId\", \"goldStandardSet\", \"sources\")\n        ),\n        _schema=L2GGoldStandard.get_schema(),\n    )\n
"},{"location":"python_api/datasource/open_targets/l2g_gold_standard/#otg.datasource.open_targets.l2g_gold_standard.OpenTargetsL2GGoldStandard.process_gene_interactions","title":"process_gene_interactions(interactions: DataFrame) -> DataFrame staticmethod","text":"

Extract top scoring gene-gene interaction from the interactions dataset of the Platform.

Parameters:

  • interactions (DataFrame): Gene-gene interactions dataset (required)

Returns:

  • DataFrame: Top scoring gene-gene interaction per pair of genes

Source code in src/otg/datasource/open_targets/l2g_gold_standard.py
@staticmethod\ndef process_gene_interactions(interactions: DataFrame) -> DataFrame:\n    \"\"\"Extract top scoring gene-gene interaction from the interactions dataset of the Platform.\n\n    Args:\n        interactions (DataFrame): Gene-gene interactions dataset\n\n    Returns:\n        DataFrame: Top scoring gene-gene interaction per pair of genes\n    \"\"\"\n    return get_record_with_maximum_value(\n        interactions,\n        [\"targetA\", \"targetB\"],\n        \"scoring\",\n    ).selectExpr(\n        \"targetA as geneIdA\",\n        \"targetB as geneIdB\",\n        \"scoring as score\",\n    )\n
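`get_record_with_maximum_value` is a shared helper; an equivalent selection of the top-scoring interaction per gene pair can be written with a plain window function, which may help clarify what the helper does. This is an illustrative alternative on toy data, not the helper's actual implementation:

```python
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as f

spark = SparkSession.builder.master("local[1]").getOrCreate()

interactions = spark.createDataFrame(
    [("ENSG01", "ENSG02", 0.4), ("ENSG01", "ENSG02", 0.9), ("ENSG03", "ENSG04", 0.6)],
    ["targetA", "targetB", "scoring"],
)

w = Window.partitionBy("targetA", "targetB").orderBy(f.col("scoring").desc())
top_interactions = (
    interactions.withColumn("rank", f.row_number().over(w))
    .filter(f.col("rank") == 1)
    .selectExpr("targetA as geneIdA", "targetB as geneIdB", "scoring as score")
)
top_interactions.show()  # keeps the 0.9 row for the ENSG01/ENSG02 pair
```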
"},{"location":"python_api/datasource/open_targets/target/","title":"Target","text":""},{"location":"python_api/datasource/open_targets/target/#otg.datasource.open_targets.target.OpenTargetsTarget","title":"otg.datasource.open_targets.target.OpenTargetsTarget","text":"

Parser for OTPlatform target dataset.

Genomic data from Open Targets provides gene identification and genomic coordinates that are integrated into the gene index of our ETL pipeline.

The EMBL-EBI Ensembl database is used as a source for human targets in the Platform, with the Ensembl gene ID as the primary identifier. The criteria for target inclusion are:

  • Genes from all biotypes encoded in canonical chromosomes
  • Genes in alternative assemblies encoding for a reviewed protein product

Source code in src/otg/datasource/open_targets/target.py
class OpenTargetsTarget:\n    \"\"\"Parser for OTPlatform target dataset.\n\n    Genomic data from Open Targets provides gene identification and genomic coordinates that are integrated into the gene index of our ETL pipeline.\n\n    The EMBL-EBI Ensembl database is used as a source for human targets in the Platform, with the Ensembl gene ID as the primary identifier. The criteria for target inclusion is:\n    - Genes from all biotypes encoded in canonical chromosomes\n    - Genes in alternative assemblies encoding for a reviewed protein product.\n    \"\"\"\n\n    @staticmethod\n    def _get_gene_tss(strand_col: Column, start_col: Column, end_col: Column) -> Column:\n        \"\"\"Returns the TSS of a gene based on its orientation.\n\n        Args:\n            strand_col (Column): Column containing 1 if the coding strand of the gene is forward, and -1 if it is reverse.\n            start_col (Column): Column containing the start position of the gene.\n            end_col (Column): Column containing the end position of the gene.\n\n        Returns:\n            Column: Column containing the TSS of the gene.\n\n        Examples:\n            >>> df = spark.createDataFrame([{\"strand\": 1, \"start\": 100, \"end\": 200}, {\"strand\": -1, \"start\": 100, \"end\": 200}])\n            >>> df.withColumn(\"tss\", OpenTargetsTarget._get_gene_tss(f.col(\"strand\"), f.col(\"start\"), f.col(\"end\"))).show()\n            +---+-----+------+---+\n            |end|start|strand|tss|\n            +---+-----+------+---+\n            |200|  100|     1|100|\n            |200|  100|    -1|200|\n            +---+-----+------+---+\n            <BLANKLINE>\n\n        \"\"\"\n        return f.when(strand_col == 1, start_col).when(strand_col == -1, end_col)\n\n    @classmethod\n    def as_gene_index(cls: type[GeneIndex], target_index: DataFrame) -> GeneIndex:\n        \"\"\"Initialise GeneIndex from source dataset.\n\n        Args:\n            target_index (DataFrame): Target index dataframe\n\n        Returns:\n            GeneIndex: Gene index dataset\n        \"\"\"\n        return GeneIndex(\n            _df=target_index.select(\n                f.coalesce(f.col(\"id\"), f.lit(\"unknown\")).alias(\"geneId\"),\n                \"approvedSymbol\",\n                \"approvedName\",\n                \"biotype\",\n                f.col(\"obsoleteSymbols.label\").alias(\"obsoleteSymbols\"),\n                f.coalesce(f.col(\"genomicLocation.chromosome\"), f.lit(\"unknown\")).alias(\n                    \"chromosome\"\n                ),\n                OpenTargetsTarget._get_gene_tss(\n                    f.col(\"genomicLocation.strand\"),\n                    f.col(\"genomicLocation.start\"),\n                    f.col(\"genomicLocation.end\"),\n                ).alias(\"tss\"),\n                f.col(\"genomicLocation.start\").alias(\"start\"),\n                f.col(\"genomicLocation.end\").alias(\"end\"),\n                f.col(\"genomicLocation.strand\").alias(\"strand\"),\n            ),\n            _schema=GeneIndex.get_schema(),\n        )\n
"},{"location":"python_api/datasource/open_targets/target/#otg.datasource.open_targets.target.OpenTargetsTarget.as_gene_index","title":"as_gene_index(target_index: DataFrame) -> GeneIndex classmethod","text":"

Initialise GeneIndex from source dataset.

Parameters:

  • target_index (DataFrame): Target index dataframe (required)

Returns:

  • GeneIndex: Gene index dataset

Source code in src/otg/datasource/open_targets/target.py
@classmethod\ndef as_gene_index(cls: type[GeneIndex], target_index: DataFrame) -> GeneIndex:\n    \"\"\"Initialise GeneIndex from source dataset.\n\n    Args:\n        target_index (DataFrame): Target index dataframe\n\n    Returns:\n        GeneIndex: Gene index dataset\n    \"\"\"\n    return GeneIndex(\n        _df=target_index.select(\n            f.coalesce(f.col(\"id\"), f.lit(\"unknown\")).alias(\"geneId\"),\n            \"approvedSymbol\",\n            \"approvedName\",\n            \"biotype\",\n            f.col(\"obsoleteSymbols.label\").alias(\"obsoleteSymbols\"),\n            f.coalesce(f.col(\"genomicLocation.chromosome\"), f.lit(\"unknown\")).alias(\n                \"chromosome\"\n            ),\n            OpenTargetsTarget._get_gene_tss(\n                f.col(\"genomicLocation.strand\"),\n                f.col(\"genomicLocation.start\"),\n                f.col(\"genomicLocation.end\"),\n            ).alias(\"tss\"),\n            f.col(\"genomicLocation.start\").alias(\"start\"),\n            f.col(\"genomicLocation.end\").alias(\"end\"),\n            f.col(\"genomicLocation.strand\").alias(\"strand\"),\n        ),\n        _schema=GeneIndex.get_schema(),\n    )\n
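Assuming the Platform target index has already been loaded (for instance from its Parquet release; the path below is a placeholder), building the gene index is a single call:

```python
from pyspark.sql import SparkSession

from otg.datasource.open_targets.target import OpenTargetsTarget

spark = SparkSession.builder.getOrCreate()

target_index = spark.read.parquet("gs://some-bucket/targets/")  # placeholder path
gene_index = OpenTargetsTarget.as_gene_index(target_index)
gene_index.df.select("geneId", "chromosome", "tss").show(5)
```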
"},{"location":"python_api/datasource/ukbiobank/_ukbiobank/","title":"UK Biobank","text":"

The UK Biobank is a large-scale biomedical database and research resource that contains a diverse range of in-depth information from 500,000 volunteers in the United Kingdom. Its genomic data comprises whole-genome sequencing for a subset of participants, along with genotyping arrays for the entire cohort. The data has been a cornerstone for numerous genome-wide association studies (GWAS) and other genetic analyses, advancing our understanding of human health and disease.

Recent efforts to rapidly and systematically apply established GWAS methods to all available data fields in UK Biobank have made available large repositories of summary statistics. To leverage these data for disease locus discovery, we used full summary statistics from:

  • The Neale lab Round 2 (N=2139). These analyses applied GWAS (implemented in Hail) to all data fields using imputed genotypes from HRC as released by UK Biobank in May 2017, consisting of 337,199 individuals post-QC. Full details of the Neale lab GWAS implementation are available at http://www.nealelab.is/uk-biobank/. We have removed all ICD-10 related traits from the Neale data to reduce overlap with the SAIGE results.
  • The University of Michigan SAIGE analysis (N=1281). The SAIGE analysis uses PheCode-derived phenotypes and applies a new method that "provides accurate P values even when case-control ratios are extremely unbalanced". See Zhou et al. (2018) for further details: https://pubmed.ncbi.nlm.nih.gov/30104761/

"},{"location":"python_api/datasource/ukbiobank/study_index/","title":"Study Index","text":""},{"location":"python_api/datasource/ukbiobank/study_index/#otg.datasource.ukbiobank.study_index.UKBiobankStudyIndex","title":"otg.datasource.ukbiobank.study_index.UKBiobankStudyIndex","text":"

Bases: StudyIndex

Study index dataset from UKBiobank.

The following information is extracted:

  • studyId
  • pubmedId
  • publicationDate
  • publicationJournal
  • publicationTitle
  • publicationFirstAuthor
  • traitFromSource
  • ancestry_discoverySamples
  • ancestry_replicationSamples
  • initialSampleSize
  • nCases
  • replicationSamples

Some fields are populated as constants, such as projectId, studyType, and initialSampleSize.

Source code in src/otg/datasource/ukbiobank/study_index.py
class UKBiobankStudyIndex(StudyIndex):\n    \"\"\"Study index dataset from UKBiobank.\n\n    The following information is extracted:\n\n    - studyId\n    - pubmedId\n    - publicationDate\n    - publicationJournal\n    - publicationTitle\n    - publicationFirstAuthor\n    - traitFromSource\n    - ancestry_discoverySamples\n    - ancestry_replicationSamples\n    - initialSampleSize\n    - nCases\n    - replicationSamples\n\n    Some fields are populated as constants, such as projectID, studyType, and initial sample size.\n    \"\"\"\n\n    @classmethod\n    def from_source(\n        cls: type[UKBiobankStudyIndex],\n        ukbiobank_studies: DataFrame,\n    ) -> UKBiobankStudyIndex:\n        \"\"\"This function ingests study level metadata from UKBiobank.\n\n        The University of Michigan SAIGE analysis (N=1281) utilized PheCode derived phenotypes and a novel method that ensures accurate P values, even with highly unbalanced case-control ratios (Zhou et al., 2018).\n\n        The Neale lab Round 2 study (N=2139) used GWAS with imputed genotypes from HRC to analyze all data fields in UK Biobank, excluding ICD-10 related traits to reduce overlap with the SAIGE results.\n\n        Args:\n            ukbiobank_studies (DataFrame): UKBiobank study manifest file loaded in spark session.\n\n        Returns:\n            UKBiobankStudyIndex: Annotated UKBiobank study table.\n        \"\"\"\n        return StudyIndex(\n            _df=(\n                ukbiobank_studies.select(\n                    f.col(\"code\").alias(\"studyId\"),\n                    f.lit(\"UKBiobank\").alias(\"projectId\"),\n                    f.lit(\"gwas\").alias(\"studyType\"),\n                    f.col(\"trait\").alias(\"traitFromSource\"),\n                    # Make publication and ancestry schema columns.\n                    f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"30104761\").alias(\n                        \"pubmedId\"\n                    ),\n                    f.when(\n                        f.col(\"code\").startswith(\"SAIGE_\"),\n                        \"Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies\",\n                    )\n                    .otherwise(None)\n                    .alias(\"publicationTitle\"),\n                    f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"Wei Zhou\").alias(\n                        \"publicationFirstAuthor\"\n                    ),\n                    f.when(f.col(\"code\").startswith(\"NEALE2_\"), \"2018-08-01\")\n                    .otherwise(\"2018-10-24\")\n                    .alias(\"publicationDate\"),\n                    f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"Nature Genetics\").alias(\n                        \"publicationJournal\"\n                    ),\n                    f.col(\"n_total\").cast(\"string\").alias(\"initialSampleSize\"),\n                    f.col(\"n_cases\").cast(\"long\").alias(\"nCases\"),\n                    f.array(\n                        f.struct(\n                            f.col(\"n_total\").cast(\"long\").alias(\"sampleSize\"),\n                            f.concat(f.lit(\"European=\"), f.col(\"n_total\")).alias(\n                                \"ancestry\"\n                            ),\n                        )\n                    ).alias(\"discoverySamples\"),\n                    f.col(\"in_path\").alias(\"summarystatsLocation\"),\n                    f.lit(True).alias(\"hasSumstats\"),\n                )\n               
 .withColumn(\n                    \"traitFromSource\",\n                    f.when(\n                        f.col(\"traitFromSource\").contains(\":\"),\n                        f.concat(\n                            f.initcap(\n                                f.split(f.col(\"traitFromSource\"), \": \").getItem(1)\n                            ),\n                            f.lit(\" | \"),\n                            f.lower(f.split(f.col(\"traitFromSource\"), \": \").getItem(0)),\n                        ),\n                    ).otherwise(f.col(\"traitFromSource\")),\n                )\n                .withColumn(\n                    \"ldPopulationStructure\",\n                    cls.aggregate_and_map_ancestries(f.col(\"discoverySamples\")),\n                )\n            ),\n            _schema=StudyIndex.get_schema(),\n        )\n
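One of the less obvious steps above is the reshaping of traitFromSource: when the raw trait contains a coded prefix followed by ": ", the description part is capitalised and the prefix is lower-cased and moved behind a " | " separator. A small runnable illustration with invented trait strings:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.master("local[1]").getOrCreate()

df = spark.createDataFrame(
    [("J40: bronchitis",), ("Standing height",)], ["traitFromSource"]
)

reshaped = df.withColumn(
    "traitFromSource",
    f.when(
        f.col("traitFromSource").contains(":"),
        f.concat(
            f.initcap(f.split(f.col("traitFromSource"), ": ").getItem(1)),
            f.lit(" | "),
            f.lower(f.split(f.col("traitFromSource"), ": ").getItem(0)),
        ),
    ).otherwise(f.col("traitFromSource")),
)
reshaped.show(truncate=False)  # "Bronchitis | j40" and "Standing height"
```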
"},{"location":"python_api/datasource/ukbiobank/study_index/#otg.datasource.ukbiobank.study_index.UKBiobankStudyIndex.from_source","title":"from_source(ukbiobank_studies: DataFrame) -> UKBiobankStudyIndex classmethod","text":"

This function ingests study level metadata from UKBiobank.

The University of Michigan SAIGE analysis (N=1281) utilized PheCode derived phenotypes and a novel method that ensures accurate P values, even with highly unbalanced case-control ratios (Zhou et al., 2018).

The Neale lab Round 2 study (N=2139) used GWAS with imputed genotypes from HRC to analyze all data fields in UK Biobank, excluding ICD-10 related traits to reduce overlap with the SAIGE results.

Parameters:

Name Type Description Default ukbiobank_studies DataFrame

UKBiobank study manifest file loaded in spark session.

required

Returns:

Name Type Description UKBiobankStudyIndex UKBiobankStudyIndex

Annotated UKBiobank study table.
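
A minimal usage sketch, not part of the documented source: it assumes an active Spark session and that the UK Biobank study manifest is a tab-separated file; the path, separator and selected columns shown are illustrative.

from otg.datasource.ukbiobank.study_index import UKBiobankStudyIndex

# Hypothetical manifest location and format; adjust to the real manifest.
ukbiobank_studies = spark.read.csv(
    "path/to/ukbiobank_manifest.tsv", sep="\t", header=True
)

study_index = UKBiobankStudyIndex.from_source(ukbiobank_studies)
study_index.df.select("studyId", "traitFromSource", "nCases").show(5)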

Source code in src/otg/datasource/ukbiobank/study_index.py
@classmethod\ndef from_source(\n    cls: type[UKBiobankStudyIndex],\n    ukbiobank_studies: DataFrame,\n) -> UKBiobankStudyIndex:\n    \"\"\"This function ingests study level metadata from UKBiobank.\n\n    The University of Michigan SAIGE analysis (N=1281) utilized PheCode derived phenotypes and a novel method that ensures accurate P values, even with highly unbalanced case-control ratios (Zhou et al., 2018).\n\n    The Neale lab Round 2 study (N=2139) used GWAS with imputed genotypes from HRC to analyze all data fields in UK Biobank, excluding ICD-10 related traits to reduce overlap with the SAIGE results.\n\n    Args:\n        ukbiobank_studies (DataFrame): UKBiobank study manifest file loaded in spark session.\n\n    Returns:\n        UKBiobankStudyIndex: Annotated UKBiobank study table.\n    \"\"\"\n    return StudyIndex(\n        _df=(\n            ukbiobank_studies.select(\n                f.col(\"code\").alias(\"studyId\"),\n                f.lit(\"UKBiobank\").alias(\"projectId\"),\n                f.lit(\"gwas\").alias(\"studyType\"),\n                f.col(\"trait\").alias(\"traitFromSource\"),\n                # Make publication and ancestry schema columns.\n                f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"30104761\").alias(\n                    \"pubmedId\"\n                ),\n                f.when(\n                    f.col(\"code\").startswith(\"SAIGE_\"),\n                    \"Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies\",\n                )\n                .otherwise(None)\n                .alias(\"publicationTitle\"),\n                f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"Wei Zhou\").alias(\n                    \"publicationFirstAuthor\"\n                ),\n                f.when(f.col(\"code\").startswith(\"NEALE2_\"), \"2018-08-01\")\n                .otherwise(\"2018-10-24\")\n                .alias(\"publicationDate\"),\n                f.when(f.col(\"code\").startswith(\"SAIGE_\"), \"Nature Genetics\").alias(\n                    \"publicationJournal\"\n                ),\n                f.col(\"n_total\").cast(\"string\").alias(\"initialSampleSize\"),\n                f.col(\"n_cases\").cast(\"long\").alias(\"nCases\"),\n                f.array(\n                    f.struct(\n                        f.col(\"n_total\").cast(\"long\").alias(\"sampleSize\"),\n                        f.concat(f.lit(\"European=\"), f.col(\"n_total\")).alias(\n                            \"ancestry\"\n                        ),\n                    )\n                ).alias(\"discoverySamples\"),\n                f.col(\"in_path\").alias(\"summarystatsLocation\"),\n                f.lit(True).alias(\"hasSumstats\"),\n            )\n            .withColumn(\n                \"traitFromSource\",\n                f.when(\n                    f.col(\"traitFromSource\").contains(\":\"),\n                    f.concat(\n                        f.initcap(\n                            f.split(f.col(\"traitFromSource\"), \": \").getItem(1)\n                        ),\n                        f.lit(\" | \"),\n                        f.lower(f.split(f.col(\"traitFromSource\"), \": \").getItem(0)),\n                    ),\n                ).otherwise(f.col(\"traitFromSource\")),\n            )\n            .withColumn(\n                \"ldPopulationStructure\",\n                cls.aggregate_and_map_ancestries(f.col(\"discoverySamples\")),\n            )\n        ),\n        
_schema=StudyIndex.get_schema(),\n    )\n
"},{"location":"python_api/method/_method/","title":"Method","text":"

TBC

"},{"location":"python_api/method/clumping/","title":"Clumping","text":"

Clumping is a commonly used post-processing method that allows for identification of independent association signals from GWAS summary statistics and curated associations. This process is critical because of the complex linkage disequilibrium (LD) structure in human populations, which can result in multiple statistically significant associations within the same genomic region. Clumping methods help reduce redundancy in GWAS results and ensure that each reported association represents an independent signal.

We have implemented two clumping methods: LD clumping and window-based clumping.

"},{"location":"python_api/method/clumping/#otg.method.clump.LDclumping","title":"otg.method.clump.LDclumping","text":"

LD clumping reports the most significant genetic associations in a region in terms of a smaller number of “clumps” of genetically linked SNPs.

Source code in src/otg/method/clump.py
class LDclumping:\n    \"\"\"LD clumping reports the most significant genetic associations in a region in terms of a smaller number of \u201cclumps\u201d of genetically linked SNPs.\"\"\"\n\n    @staticmethod\n    def _is_lead_linked(\n        study_id: Column,\n        variant_id: Column,\n        p_value_exponent: Column,\n        p_value_mantissa: Column,\n        ld_set: Column,\n    ) -> Column:\n        \"\"\"Evaluates whether a lead variant is linked to a tag (with lowest p-value) in the same studyLocus dataset.\n\n        Args:\n            study_id (Column): studyId\n            variant_id (Column): Lead variant id\n            p_value_exponent (Column): p-value exponent\n            p_value_mantissa (Column): p-value mantissa\n            ld_set (Column): Array of variants in LD with the lead variant\n\n        Returns:\n            Column: Boolean in which True indicates that the lead is linked to another tag in the same dataset.\n        \"\"\"\n        leads_in_study = f.collect_set(variant_id).over(Window.partitionBy(study_id))\n        tags_in_studylocus = f.array_union(\n            # Get all tag variants from the credible set per studyLocusId\n            f.transform(ld_set, lambda x: x.tagVariantId),\n            # And append the lead variant so that the intersection is the same for all studyLocusIds in a study\n            f.array(variant_id),\n        )\n        intersect_lead_tags = f.array_sort(\n            f.array_intersect(leads_in_study, tags_in_studylocus)\n        )\n        return (\n            # If the lead is in the credible set, we rank the peaks by p-value\n            f.when(\n                f.size(intersect_lead_tags) > 0,\n                f.row_number().over(\n                    Window.partitionBy(study_id, intersect_lead_tags).orderBy(\n                        p_value_exponent, p_value_mantissa\n                    )\n                )\n                > 1,\n            )\n            # If the intersection is empty (lead is not in the credible set or cred set is empty), the association is not linked\n            .otherwise(f.lit(False))\n        )\n\n    @classmethod\n    def clump(cls: type[LDclumping], associations: StudyLocus) -> StudyLocus:\n        \"\"\"Perform clumping on studyLocus dataset.\n\n        Args:\n            associations (StudyLocus): StudyLocus dataset\n\n        Returns:\n            StudyLocus: including flag and removing locus information for LD clumped loci.\n        \"\"\"\n        return associations.clump()\n
"},{"location":"python_api/method/clumping/#otg.method.clump.LDclumping.clump","title":"clump(associations: StudyLocus) -> StudyLocus classmethod","text":"

Perform clumping on studyLocus dataset.

Parameters:

Name Type Description Default associations StudyLocus

StudyLocus dataset

required

Returns:

Name Type Description StudyLocus StudyLocus

including flag and removing locus information for LD clumped loci.
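
A minimal usage sketch, assuming study_locus is an LD-annotated StudyLocus produced by earlier pipeline steps:

from otg.method.clump import LDclumping

# Equivalent to calling study_locus.clump(): leads linked to a more significant
# lead in the same study are flagged and their locus information is removed.
clumped = LDclumping.clump(study_locus)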

Source code in src/otg/method/clump.py
@classmethod\ndef clump(cls: type[LDclumping], associations: StudyLocus) -> StudyLocus:\n    \"\"\"Perform clumping on studyLocus dataset.\n\n    Args:\n        associations (StudyLocus): StudyLocus dataset\n\n    Returns:\n        StudyLocus: including flag and removing locus information for LD clumped loci.\n    \"\"\"\n    return associations.clump()\n
"},{"location":"python_api/method/coloc/","title":"Coloc","text":""},{"location":"python_api/method/coloc/#otg.method.colocalisation.Coloc","title":"otg.method.colocalisation.Coloc","text":"

Calculate Bayesian colocalisation based on overlapping signals from credible sets.

Based on the R COLOC package, which uses the Bayes factors from the credible set to estimate the posterior probability of colocalisation. This method makes the simplifying assumption that only one single causal variant exists for any given trait in any genomic region.

Hypothesis | Description
H0 | no association with either trait in the region
H1 | association with trait 1 only
H2 | association with trait 2 only
H3 | both traits are associated, but have different single causal variants
H4 | both traits are associated and share the same single causal variant

Approximate Bayes factors required

Coloc requires the availability of approximate Bayes factors (ABF) for each variant in the credible set (logABF column).
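
The following standalone sketch reproduces the log-sum-exp arithmetic used when combining per-variant log ABFs and turning hypothesis-level log Bayes factors into posteriors; the input values are illustrative and match the doctest examples in the source.

import numpy as np

log_abf = np.array([0.2, 0.1, 0.05, 0.0])

# Stable log(sum(exp(x))): factor out the maximum before exponentiating.
themax = log_abf.max()
logsum = themax + np.log(np.sum(np.exp(log_abf - themax)))
print(round(float(logsum), 6))  # 1.476557

# Normalising log Bayes factors into posterior probabilities:
posteriors = np.exp(log_abf - logsum)
print(posteriors.round(4))  # [0.279  0.2524 0.2401 0.2284]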

Source code in src/otg/method/colocalisation.py
class Coloc:\n    \"\"\"Calculate bayesian colocalisation based on overlapping signals from credible sets.\n\n    Based on the [R COLOC package](https://github.com/chr1swallace/coloc/blob/main/R/claudia.R), which uses the Bayes factors from the credible set to estimate the posterior probability of colocalisation. This method makes the simplifying assumption that **only one single causal variant** exists for any given trait in any genomic region.\n\n    | Hypothesis    | Description                                                           |\n    | ------------- | --------------------------------------------------------------------- |\n    | H<sub>0</sub> | no association with either trait in the region                        |\n    | H<sub>1</sub> | association with trait 1 only                                         |\n    | H<sub>2</sub> | association with trait 2 only                                         |\n    | H<sub>3</sub> | both traits are associated, but have different single causal variants |\n    | H<sub>4</sub> | both traits are associated and share the same single causal variant   |\n\n    !!! warning \"Approximate Bayes factors required\"\n        Coloc requires the availability of approximate Bayes factors (ABF) for each variant in the credible set (`logABF` column).\n\n    \"\"\"\n\n    @staticmethod\n    def _get_logsum(log_abf: ndarray) -> float:\n        \"\"\"Calculates logsum of vector.\n\n        This function calculates the log of the sum of the exponentiated\n        logs taking out the max, i.e. insuring that the sum is not Inf\n\n        Args:\n            log_abf (ndarray): log approximate bayes factor\n\n        Returns:\n            float: logsum\n\n        Example:\n            >>> l = [0.2, 0.1, 0.05, 0]\n            >>> round(Coloc._get_logsum(l), 6)\n            1.476557\n        \"\"\"\n        themax = np.max(log_abf)\n        result = themax + np.log(np.sum(np.exp(log_abf - themax)))\n        return float(result)\n\n    @staticmethod\n    def _get_posteriors(all_abfs: ndarray) -> DenseVector:\n        \"\"\"Calculate posterior probabilities for each hypothesis.\n\n        Args:\n            all_abfs (ndarray): h0-h4 bayes factors\n\n        Returns:\n            DenseVector: Posterior\n\n        Example:\n            >>> l = np.array([0.2, 0.1, 0.05, 0])\n            >>> Coloc._get_posteriors(l)\n            DenseVector([0.279, 0.2524, 0.2401, 0.2284])\n        \"\"\"\n        diff = all_abfs - Coloc._get_logsum(all_abfs)\n        abfs_posteriors = np.exp(diff)\n        return Vectors.dense(abfs_posteriors)\n\n    @classmethod\n    def colocalise(\n        cls: type[Coloc],\n        overlapping_signals: StudyLocusOverlap,\n        priorc1: float = 1e-4,\n        priorc2: float = 1e-4,\n        priorc12: float = 1e-5,\n    ) -> Colocalisation:\n        \"\"\"Calculate bayesian colocalisation based on overlapping signals.\n\n        Args:\n            overlapping_signals (StudyLocusOverlap): overlapping peaks\n            priorc1 (float): Prior on variant being causal for trait 1. Defaults to 1e-4.\n            priorc2 (float): Prior on variant being causal for trait 2. Defaults to 1e-4.\n            priorc12 (float): Prior on variant being causal for traits 1 and 2. 
Defaults to 1e-5.\n\n        Returns:\n            Colocalisation: Colocalisation results\n        \"\"\"\n        # register udfs\n        logsum = f.udf(Coloc._get_logsum, DoubleType())\n        posteriors = f.udf(Coloc._get_posteriors, VectorUDT())\n        return Colocalisation(\n            _df=(\n                overlapping_signals.df\n                # Before summing log_abf columns nulls need to be filled with 0:\n                .fillna(0, subset=[\"statistics.left_logABF\", \"statistics.right_logABF\"])\n                # Sum of log_abfs for each pair of signals\n                .withColumn(\n                    \"sum_log_abf\",\n                    f.col(\"statistics.left_logABF\") + f.col(\"statistics.right_logABF\"),\n                )\n                # Group by overlapping peak and generating dense vectors of log_abf:\n                .groupBy(\"chromosome\", \"leftStudyLocusId\", \"rightStudyLocusId\")\n                .agg(\n                    f.count(\"*\").alias(\"numberColocalisingVariants\"),\n                    fml.array_to_vector(\n                        f.collect_list(f.col(\"statistics.left_logABF\"))\n                    ).alias(\"left_logABF\"),\n                    fml.array_to_vector(\n                        f.collect_list(f.col(\"statistics.right_logABF\"))\n                    ).alias(\"right_logABF\"),\n                    fml.array_to_vector(f.collect_list(f.col(\"sum_log_abf\"))).alias(\n                        \"sum_log_abf\"\n                    ),\n                )\n                .withColumn(\"logsum1\", logsum(f.col(\"left_logABF\")))\n                .withColumn(\"logsum2\", logsum(f.col(\"right_logABF\")))\n                .withColumn(\"logsum12\", logsum(f.col(\"sum_log_abf\")))\n                .drop(\"left_logABF\", \"right_logABF\", \"sum_log_abf\")\n                # Add priors\n                # priorc1 Prior on variant being causal for trait 1\n                .withColumn(\"priorc1\", f.lit(priorc1))\n                # priorc2 Prior on variant being causal for trait 2\n                .withColumn(\"priorc2\", f.lit(priorc2))\n                # priorc12 Prior on variant being causal for traits 1 and 2\n                .withColumn(\"priorc12\", f.lit(priorc12))\n                # h0-h2\n                .withColumn(\"lH0abf\", f.lit(0))\n                .withColumn(\"lH1abf\", f.log(f.col(\"priorc1\")) + f.col(\"logsum1\"))\n                .withColumn(\"lH2abf\", f.log(f.col(\"priorc2\")) + f.col(\"logsum2\"))\n                # h3\n                .withColumn(\"sumlogsum\", f.col(\"logsum1\") + f.col(\"logsum2\"))\n                # exclude null H3/H4s: due to sumlogsum == logsum12\n                .filter(f.col(\"sumlogsum\") != f.col(\"logsum12\"))\n                .withColumn(\"max\", f.greatest(\"sumlogsum\", \"logsum12\"))\n                .withColumn(\n                    \"logdiff\",\n                    (\n                        f.col(\"max\")\n                        + f.log(\n                            f.exp(f.col(\"sumlogsum\") - f.col(\"max\"))\n                            - f.exp(f.col(\"logsum12\") - f.col(\"max\"))\n                        )\n                    ),\n                )\n                .withColumn(\n                    \"lH3abf\",\n                    f.log(f.col(\"priorc1\"))\n                    + f.log(f.col(\"priorc2\"))\n                    + f.col(\"logdiff\"),\n                )\n                .drop(\"right_logsum\", \"left_logsum\", \"sumlogsum\", \"max\", \"logdiff\")\n                # h4\n     
           .withColumn(\"lH4abf\", f.log(f.col(\"priorc12\")) + f.col(\"logsum12\"))\n                # cleaning\n                .drop(\n                    \"priorc1\", \"priorc2\", \"priorc12\", \"logsum1\", \"logsum2\", \"logsum12\"\n                )\n                # posteriors\n                .withColumn(\n                    \"allABF\",\n                    fml.array_to_vector(\n                        f.array(\n                            f.col(\"lH0abf\"),\n                            f.col(\"lH1abf\"),\n                            f.col(\"lH2abf\"),\n                            f.col(\"lH3abf\"),\n                            f.col(\"lH4abf\"),\n                        )\n                    ),\n                )\n                .withColumn(\n                    \"posteriors\", fml.vector_to_array(posteriors(f.col(\"allABF\")))\n                )\n                .withColumn(\"h0\", f.col(\"posteriors\").getItem(0))\n                .withColumn(\"h1\", f.col(\"posteriors\").getItem(1))\n                .withColumn(\"h2\", f.col(\"posteriors\").getItem(2))\n                .withColumn(\"h3\", f.col(\"posteriors\").getItem(3))\n                .withColumn(\"h4\", f.col(\"posteriors\").getItem(4))\n                .withColumn(\"h4h3\", f.col(\"h4\") / f.col(\"h3\"))\n                .withColumn(\"log2h4h3\", f.log2(f.col(\"h4h3\")))\n                # clean up\n                .drop(\n                    \"posteriors\",\n                    \"allABF\",\n                    \"h4h3\",\n                    \"lH0abf\",\n                    \"lH1abf\",\n                    \"lH2abf\",\n                    \"lH3abf\",\n                    \"lH4abf\",\n                )\n                .withColumn(\"colocalisationMethod\", f.lit(\"COLOC\"))\n            ),\n            _schema=Colocalisation.get_schema(),\n        )\n
"},{"location":"python_api/method/coloc/#otg.method.colocalisation.Coloc.colocalise","title":"colocalise(overlapping_signals: StudyLocusOverlap, priorc1: float = 0.0001, priorc2: float = 0.0001, priorc12: float = 1e-05) -> Colocalisation classmethod","text":"

Calculate Bayesian colocalisation based on overlapping signals.

Parameters:

Name Type Description Default overlapping_signals StudyLocusOverlap

overlapping peaks

required priorc1 float

Prior on variant being causal for trait 1. Defaults to 1e-4.

0.0001 priorc2 float

Prior on variant being causal for trait 2. Defaults to 1e-4.

0.0001 priorc12 float

Prior on variant being causal for traits 1 and 2. Defaults to 1e-5.

1e-05

Returns:

Name Type Description Colocalisation Colocalisation

Colocalisation results
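
A minimal usage sketch, assuming overlaps is a StudyLocusOverlap dataset carrying the required logABF statistics:

from otg.method.colocalisation import Coloc

coloc = Coloc.colocalise(overlaps, priorc1=1e-4, priorc2=1e-4, priorc12=1e-5)
coloc.df.select(
    "leftStudyLocusId", "rightStudyLocusId", "h3", "h4", "log2h4h3"
).show(5)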

Source code in src/otg/method/colocalisation.py
@classmethod\ndef colocalise(\n    cls: type[Coloc],\n    overlapping_signals: StudyLocusOverlap,\n    priorc1: float = 1e-4,\n    priorc2: float = 1e-4,\n    priorc12: float = 1e-5,\n) -> Colocalisation:\n    \"\"\"Calculate bayesian colocalisation based on overlapping signals.\n\n    Args:\n        overlapping_signals (StudyLocusOverlap): overlapping peaks\n        priorc1 (float): Prior on variant being causal for trait 1. Defaults to 1e-4.\n        priorc2 (float): Prior on variant being causal for trait 2. Defaults to 1e-4.\n        priorc12 (float): Prior on variant being causal for traits 1 and 2. Defaults to 1e-5.\n\n    Returns:\n        Colocalisation: Colocalisation results\n    \"\"\"\n    # register udfs\n    logsum = f.udf(Coloc._get_logsum, DoubleType())\n    posteriors = f.udf(Coloc._get_posteriors, VectorUDT())\n    return Colocalisation(\n        _df=(\n            overlapping_signals.df\n            # Before summing log_abf columns nulls need to be filled with 0:\n            .fillna(0, subset=[\"statistics.left_logABF\", \"statistics.right_logABF\"])\n            # Sum of log_abfs for each pair of signals\n            .withColumn(\n                \"sum_log_abf\",\n                f.col(\"statistics.left_logABF\") + f.col(\"statistics.right_logABF\"),\n            )\n            # Group by overlapping peak and generating dense vectors of log_abf:\n            .groupBy(\"chromosome\", \"leftStudyLocusId\", \"rightStudyLocusId\")\n            .agg(\n                f.count(\"*\").alias(\"numberColocalisingVariants\"),\n                fml.array_to_vector(\n                    f.collect_list(f.col(\"statistics.left_logABF\"))\n                ).alias(\"left_logABF\"),\n                fml.array_to_vector(\n                    f.collect_list(f.col(\"statistics.right_logABF\"))\n                ).alias(\"right_logABF\"),\n                fml.array_to_vector(f.collect_list(f.col(\"sum_log_abf\"))).alias(\n                    \"sum_log_abf\"\n                ),\n            )\n            .withColumn(\"logsum1\", logsum(f.col(\"left_logABF\")))\n            .withColumn(\"logsum2\", logsum(f.col(\"right_logABF\")))\n            .withColumn(\"logsum12\", logsum(f.col(\"sum_log_abf\")))\n            .drop(\"left_logABF\", \"right_logABF\", \"sum_log_abf\")\n            # Add priors\n            # priorc1 Prior on variant being causal for trait 1\n            .withColumn(\"priorc1\", f.lit(priorc1))\n            # priorc2 Prior on variant being causal for trait 2\n            .withColumn(\"priorc2\", f.lit(priorc2))\n            # priorc12 Prior on variant being causal for traits 1 and 2\n            .withColumn(\"priorc12\", f.lit(priorc12))\n            # h0-h2\n            .withColumn(\"lH0abf\", f.lit(0))\n            .withColumn(\"lH1abf\", f.log(f.col(\"priorc1\")) + f.col(\"logsum1\"))\n            .withColumn(\"lH2abf\", f.log(f.col(\"priorc2\")) + f.col(\"logsum2\"))\n            # h3\n            .withColumn(\"sumlogsum\", f.col(\"logsum1\") + f.col(\"logsum2\"))\n            # exclude null H3/H4s: due to sumlogsum == logsum12\n            .filter(f.col(\"sumlogsum\") != f.col(\"logsum12\"))\n            .withColumn(\"max\", f.greatest(\"sumlogsum\", \"logsum12\"))\n            .withColumn(\n                \"logdiff\",\n                (\n                    f.col(\"max\")\n                    + f.log(\n                        f.exp(f.col(\"sumlogsum\") - f.col(\"max\"))\n                        - f.exp(f.col(\"logsum12\") - f.col(\"max\"))\n                    )\n      
          ),\n            )\n            .withColumn(\n                \"lH3abf\",\n                f.log(f.col(\"priorc1\"))\n                + f.log(f.col(\"priorc2\"))\n                + f.col(\"logdiff\"),\n            )\n            .drop(\"right_logsum\", \"left_logsum\", \"sumlogsum\", \"max\", \"logdiff\")\n            # h4\n            .withColumn(\"lH4abf\", f.log(f.col(\"priorc12\")) + f.col(\"logsum12\"))\n            # cleaning\n            .drop(\n                \"priorc1\", \"priorc2\", \"priorc12\", \"logsum1\", \"logsum2\", \"logsum12\"\n            )\n            # posteriors\n            .withColumn(\n                \"allABF\",\n                fml.array_to_vector(\n                    f.array(\n                        f.col(\"lH0abf\"),\n                        f.col(\"lH1abf\"),\n                        f.col(\"lH2abf\"),\n                        f.col(\"lH3abf\"),\n                        f.col(\"lH4abf\"),\n                    )\n                ),\n            )\n            .withColumn(\n                \"posteriors\", fml.vector_to_array(posteriors(f.col(\"allABF\")))\n            )\n            .withColumn(\"h0\", f.col(\"posteriors\").getItem(0))\n            .withColumn(\"h1\", f.col(\"posteriors\").getItem(1))\n            .withColumn(\"h2\", f.col(\"posteriors\").getItem(2))\n            .withColumn(\"h3\", f.col(\"posteriors\").getItem(3))\n            .withColumn(\"h4\", f.col(\"posteriors\").getItem(4))\n            .withColumn(\"h4h3\", f.col(\"h4\") / f.col(\"h3\"))\n            .withColumn(\"log2h4h3\", f.log2(f.col(\"h4h3\")))\n            # clean up\n            .drop(\n                \"posteriors\",\n                \"allABF\",\n                \"h4h3\",\n                \"lH0abf\",\n                \"lH1abf\",\n                \"lH2abf\",\n                \"lH3abf\",\n                \"lH4abf\",\n            )\n            .withColumn(\"colocalisationMethod\", f.lit(\"COLOC\"))\n        ),\n        _schema=Colocalisation.get_schema(),\n    )\n
"},{"location":"python_api/method/ecaviar/","title":"eCAVIAR","text":""},{"location":"python_api/method/ecaviar/#otg.method.colocalisation.ECaviar","title":"otg.method.colocalisation.ECaviar","text":"

ECaviar-based colocalisation analysis.

It extends the CAVIAR framework to explicitly estimate the posterior probability that the same variant is causal in two studies while accounting for the uncertainty of LD. eCAVIAR computes the colocalization posterior probability (CLPP) by utilizing the marginal posterior probabilities. This framework allows for multiple variants to be causal in a single locus.
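
A toy illustration of the CLPP arithmetic with made-up posterior probabilities: per shared variant the CLPP is the product of the left and right posteriors, and the per-locus-pair value is their sum over all colocalising variants.

left_pp = [0.5, 0.25]
right_pp = [0.5, 0.75]

clpp_per_variant = [l * r for l, r in zip(left_pp, right_pp)]  # [0.25, 0.1875]
clpp = sum(clpp_per_variant)                                   # 0.4375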

Source code in src/otg/method/colocalisation.py
class ECaviar:\n    \"\"\"ECaviar-based colocalisation analysis.\n\n    It extends [CAVIAR](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5142122/#bib18)\u00a0framework to explicitly estimate the posterior probability that the same variant is causal in 2 studies while accounting for the uncertainty of LD. eCAVIAR computes the colocalization posterior probability (**CLPP**) by utilizing the marginal posterior probabilities. This framework allows for **multiple variants to be causal** in a single locus.\n    \"\"\"\n\n    @staticmethod\n    def _get_clpp(left_pp: Column, right_pp: Column) -> Column:\n        \"\"\"Calculate the colocalisation posterior probability (CLPP).\n\n        If the fact that the same variant is found causal for two studies are independent events,\n        CLPP is defined as the product of posterior porbabilities that a variant is causal in both studies.\n\n        Args:\n            left_pp (Column): left posterior probability\n            right_pp (Column): right posterior probability\n\n        Returns:\n            Column: CLPP\n\n        Examples:\n            >>> d = [{\"left_pp\": 0.5, \"right_pp\": 0.5}, {\"left_pp\": 0.25, \"right_pp\": 0.75}]\n            >>> df = spark.createDataFrame(d)\n            >>> df.withColumn(\"clpp\", ECaviar._get_clpp(f.col(\"left_pp\"), f.col(\"right_pp\"))).show()\n            +-------+--------+------+\n            |left_pp|right_pp|  clpp|\n            +-------+--------+------+\n            |    0.5|     0.5|  0.25|\n            |   0.25|    0.75|0.1875|\n            +-------+--------+------+\n            <BLANKLINE>\n\n        \"\"\"\n        return left_pp * right_pp\n\n    @classmethod\n    def colocalise(\n        cls: type[ECaviar], overlapping_signals: StudyLocusOverlap\n    ) -> Colocalisation:\n        \"\"\"Calculate bayesian colocalisation based on overlapping signals.\n\n        Args:\n            overlapping_signals (StudyLocusOverlap): overlapping signals.\n\n        Returns:\n            Colocalisation: colocalisation results based on eCAVIAR.\n        \"\"\"\n        return Colocalisation(\n            _df=(\n                overlapping_signals.df.withColumn(\n                    \"clpp\",\n                    ECaviar._get_clpp(\n                        f.col(\"statistics.left_posteriorProbability\"),\n                        f.col(\"statistics.right_posteriorProbability\"),\n                    ),\n                )\n                .groupBy(\"leftStudyLocusId\", \"rightStudyLocusId\", \"chromosome\")\n                .agg(\n                    f.count(\"*\").alias(\"numberColocalisingVariants\"),\n                    f.sum(f.col(\"clpp\")).alias(\"clpp\"),\n                )\n                .withColumn(\"colocalisationMethod\", f.lit(\"eCAVIAR\"))\n            ),\n            _schema=Colocalisation.get_schema(),\n        )\n
"},{"location":"python_api/method/ecaviar/#otg.method.colocalisation.ECaviar.colocalise","title":"colocalise(overlapping_signals: StudyLocusOverlap) -> Colocalisation classmethod","text":"

Calculate Bayesian colocalisation based on overlapping signals.

Parameters:

Name Type Description Default overlapping_signals StudyLocusOverlap

overlapping signals.

required

Returns:

Name Type Description Colocalisation Colocalisation

colocalisation results based on eCAVIAR.

Source code in src/otg/method/colocalisation.py
@classmethod\ndef colocalise(\n    cls: type[ECaviar], overlapping_signals: StudyLocusOverlap\n) -> Colocalisation:\n    \"\"\"Calculate bayesian colocalisation based on overlapping signals.\n\n    Args:\n        overlapping_signals (StudyLocusOverlap): overlapping signals.\n\n    Returns:\n        Colocalisation: colocalisation results based on eCAVIAR.\n    \"\"\"\n    return Colocalisation(\n        _df=(\n            overlapping_signals.df.withColumn(\n                \"clpp\",\n                ECaviar._get_clpp(\n                    f.col(\"statistics.left_posteriorProbability\"),\n                    f.col(\"statistics.right_posteriorProbability\"),\n                ),\n            )\n            .groupBy(\"leftStudyLocusId\", \"rightStudyLocusId\", \"chromosome\")\n            .agg(\n                f.count(\"*\").alias(\"numberColocalisingVariants\"),\n                f.sum(f.col(\"clpp\")).alias(\"clpp\"),\n            )\n            .withColumn(\"colocalisationMethod\", f.lit(\"eCAVIAR\"))\n        ),\n        _schema=Colocalisation.get_schema(),\n    )\n
"},{"location":"python_api/method/ld_annotator/","title":"LDAnnotator","text":""},{"location":"python_api/method/ld_annotator/#otg.method.ld.LDAnnotator","title":"otg.method.ld.LDAnnotator","text":"

Class to annotate study loci with linkage disequilibrium (LD) information from GnomAD.
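
A plain-Python sketch of the ancestry-weighted aggregation that produces r2Overall for each tag variant; the population labels and weights below are illustrative.

# r2Overall accumulates r**2 weighted by each population's relative sample size.
r_values = [
    {"population": "nfe", "r": 0.9, "relativeSampleSize": 0.8},
    {"population": "afr", "r": 0.4, "relativeSampleSize": 0.2},
]
r2_overall = sum(v["r"] ** 2 * v["relativeSampleSize"] for v in r_values)
print(round(r2_overall, 3))  # 0.68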

Source code in src/otg/method/ld.py
class LDAnnotator:\n    \"\"\"Class to annotate linkage disequilibrium (LD) operations from GnomAD.\"\"\"\n\n    @staticmethod\n    def _calculate_weighted_r_overall(ld_set: Column) -> Column:\n        \"\"\"Aggregation of weighted R information using ancestry proportions.\n\n        Args:\n            ld_set (Column): LD set\n\n        Returns:\n            Column: LD set with added 'r2Overall' field\n        \"\"\"\n        return f.transform(\n            ld_set,\n            lambda x: f.struct(\n                x[\"tagVariantId\"].alias(\"tagVariantId\"),\n                # r2Overall is the accumulated sum of each r2 relative to the population size\n                f.aggregate(\n                    x[\"rValues\"],\n                    f.lit(0.0),\n                    lambda acc, y: acc\n                    + f.coalesce(\n                        f.pow(y[\"r\"], 2) * y[\"relativeSampleSize\"], f.lit(0.0)\n                    ),  # we use coalesce to avoid problems when r/relativeSampleSize is null\n                ).alias(\"r2Overall\"),\n            ),\n        )\n\n    @staticmethod\n    def _add_population_size(ld_set: Column, study_populations: Column) -> Column:\n        \"\"\"Add population size to each rValues entry in the ldSet.\n\n        Args:\n            ld_set (Column): LD set\n            study_populations (Column): Study populations\n\n        Returns:\n            Column: LD set with added 'relativeSampleSize' field\n        \"\"\"\n        # Create a population to relativeSampleSize map from the struct\n        populations_map = f.map_from_arrays(\n            study_populations[\"ldPopulation\"],\n            study_populations[\"relativeSampleSize\"],\n        )\n        return f.transform(\n            ld_set,\n            lambda x: f.struct(\n                x[\"tagVariantId\"].alias(\"tagVariantId\"),\n                f.transform(\n                    x[\"rValues\"],\n                    lambda y: f.struct(\n                        y[\"population\"].alias(\"population\"),\n                        y[\"r\"].alias(\"r\"),\n                        populations_map[y[\"population\"]].alias(\"relativeSampleSize\"),\n                    ),\n                ).alias(\"rValues\"),\n            ),\n        )\n\n    @classmethod\n    def ld_annotate(\n        cls: type[LDAnnotator],\n        associations: StudyLocus,\n        studies: StudyIndex,\n        ld_index: LDIndex,\n    ) -> StudyLocus:\n        \"\"\"Annotate linkage disequilibrium (LD) information to a set of studyLocus.\n\n        This function:\n            1. Annotates study locus with population structure information from the study index\n            2. Joins the LD index to the StudyLocus\n            3. Adds the population size of the study to each rValues entry in the ldSet\n            4. 
Calculates the overall R weighted by the ancestry proportions in every given study.\n\n        Args:\n            associations (StudyLocus): Dataset to be LD annotated\n            studies (StudyIndex): Dataset with study information\n            ld_index (LDIndex): Dataset with LD information for every variant present in LD matrix\n\n        Returns:\n            StudyLocus: including additional column with LD information.\n        \"\"\"\n        return (\n            StudyLocus(\n                _df=(\n                    associations.df\n                    # Drop ldSet column if already available\n                    .select(*[col for col in associations.df.columns if col != \"ldSet\"])\n                    # Annotate study locus with population structure from study index\n                    .join(\n                        studies.df.select(\"studyId\", \"ldPopulationStructure\"),\n                        on=\"studyId\",\n                        how=\"left\",\n                    )\n                    # Bring LD information from LD Index\n                    .join(\n                        ld_index.df,\n                        on=[\"variantId\", \"chromosome\"],\n                        how=\"left\",\n                    )\n                    # Add population size to each rValues entry in the ldSet if population structure available:\n                    .withColumn(\n                        \"ldSet\",\n                        f.when(\n                            f.col(\"ldPopulationStructure\").isNotNull(),\n                            cls._add_population_size(\n                                f.col(\"ldSet\"), f.col(\"ldPopulationStructure\")\n                            ),\n                        ),\n                    )\n                    # Aggregate weighted R information using ancestry proportions\n                    .withColumn(\n                        \"ldSet\",\n                        f.when(\n                            f.col(\"ldPopulationStructure\").isNotNull(),\n                            cls._calculate_weighted_r_overall(f.col(\"ldSet\")),\n                        ),\n                    ).drop(\"ldPopulationStructure\")\n                ),\n                _schema=StudyLocus.get_schema(),\n            )\n            ._qc_no_population()\n            ._qc_unresolved_ld()\n        )\n
"},{"location":"python_api/method/ld_annotator/#otg.method.ld.LDAnnotator.ld_annotate","title":"ld_annotate(associations: StudyLocus, studies: StudyIndex, ld_index: LDIndex) -> StudyLocus classmethod","text":"

Annotate linkage disequilibrium (LD) information to a set of studyLocus.

This function:
  1. Annotates study locus with population structure information from the study index
  2. Joins the LD index to the StudyLocus
  3. Adds the population size of the study to each rValues entry in the ldSet
  4. Calculates the overall R weighted by the ancestry proportions in every given study.

Parameters:

Name Type Description Default associations StudyLocus

Dataset to be LD annotated

required studies StudyIndex

Dataset with study information

required ld_index LDIndex

Dataset with LD information for every variant present in LD matrix

required

Returns:

Name Type Description StudyLocus StudyLocus

including additional column with LD information.
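
A minimal usage sketch, assuming the three inputs were produced by earlier pipeline steps:

from otg.method.ld import LDAnnotator

annotated = LDAnnotator.ld_annotate(
    associations=study_locus,  # StudyLocus to annotate
    studies=study_index,       # StudyIndex carrying ldPopulationStructure
    ld_index=ld_index,         # LDIndex with GnomAD LD information
)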

Source code in src/otg/method/ld.py
@classmethod\ndef ld_annotate(\n    cls: type[LDAnnotator],\n    associations: StudyLocus,\n    studies: StudyIndex,\n    ld_index: LDIndex,\n) -> StudyLocus:\n    \"\"\"Annotate linkage disequilibrium (LD) information to a set of studyLocus.\n\n    This function:\n        1. Annotates study locus with population structure information from the study index\n        2. Joins the LD index to the StudyLocus\n        3. Adds the population size of the study to each rValues entry in the ldSet\n        4. Calculates the overall R weighted by the ancestry proportions in every given study.\n\n    Args:\n        associations (StudyLocus): Dataset to be LD annotated\n        studies (StudyIndex): Dataset with study information\n        ld_index (LDIndex): Dataset with LD information for every variant present in LD matrix\n\n    Returns:\n        StudyLocus: including additional column with LD information.\n    \"\"\"\n    return (\n        StudyLocus(\n            _df=(\n                associations.df\n                # Drop ldSet column if already available\n                .select(*[col for col in associations.df.columns if col != \"ldSet\"])\n                # Annotate study locus with population structure from study index\n                .join(\n                    studies.df.select(\"studyId\", \"ldPopulationStructure\"),\n                    on=\"studyId\",\n                    how=\"left\",\n                )\n                # Bring LD information from LD Index\n                .join(\n                    ld_index.df,\n                    on=[\"variantId\", \"chromosome\"],\n                    how=\"left\",\n                )\n                # Add population size to each rValues entry in the ldSet if population structure available:\n                .withColumn(\n                    \"ldSet\",\n                    f.when(\n                        f.col(\"ldPopulationStructure\").isNotNull(),\n                        cls._add_population_size(\n                            f.col(\"ldSet\"), f.col(\"ldPopulationStructure\")\n                        ),\n                    ),\n                )\n                # Aggregate weighted R information using ancestry proportions\n                .withColumn(\n                    \"ldSet\",\n                    f.when(\n                        f.col(\"ldPopulationStructure\").isNotNull(),\n                        cls._calculate_weighted_r_overall(f.col(\"ldSet\")),\n                    ),\n                ).drop(\"ldPopulationStructure\")\n            ),\n            _schema=StudyLocus.get_schema(),\n        )\n        ._qc_no_population()\n        ._qc_unresolved_ld()\n    )\n
"},{"location":"python_api/method/pics/","title":"PICS","text":""},{"location":"python_api/method/pics/#otg.method.pics.PICS","title":"otg.method.pics.PICS","text":"

Probabilistic Identification of Causal SNPs (PICS), an algorithm estimating the probability that an individual variant is causal considering the haplotype structure and observed pattern of association at the genetic locus.
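
A standalone sketch of the per-SNP arithmetic behind PICS, reproducing the doctest values from the source for a tag variant with r2Overall = 0.5 linked to a lead with -log10(p) = 10 and k = 6.4:

from scipy.stats import norm

neglog_p, r2, k = 10.0, 0.5, 6.4

mu = neglog_p * r2                                              # 5.0
std = abs(((1 - (r2**0.5) ** k) ** 0.5) * (neglog_p**0.5) / 2)  # ~1.493

# Relative posterior for this SNP; PICS later rescales these values so that the
# posteriors of all tags in the locus sum to one.
relative_pp = norm(mu, std).sf(neglog_p) * 2
print(round(mu, 3), round(std, 3), round(relative_pp, 5))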

Source code in src/otg/method/pics.py
class PICS:\n    \"\"\"Probabilistic Identification of Causal SNPs (PICS), an algorithm estimating the probability that an individual variant is causal considering the haplotype structure and observed pattern of association at the genetic locus.\"\"\"\n\n    @staticmethod\n    def _pics_relative_posterior_probability(\n        neglog_p: float, pics_snp_mu: float, pics_snp_std: float\n    ) -> float:\n        \"\"\"Compute the PICS posterior probability for a given SNP.\n\n        !!! info \"This probability needs to be scaled to take into account the probabilities of the other variants in the locus.\"\n\n        Args:\n            neglog_p (float): Negative log p-value of the lead variant\n            pics_snp_mu (float): Mean P value of the association between a SNP and a trait\n            pics_snp_std (float): Standard deviation for the P value of the association between a SNP and a trait\n\n        Returns:\n            float: Posterior probability of the association between a SNP and a trait\n\n        Examples:\n            >>> rel_prob = PICS._pics_relative_posterior_probability(neglog_p=10.0, pics_snp_mu=1.0, pics_snp_std=10.0)\n            >>> round(rel_prob, 3)\n            0.368\n        \"\"\"\n        return float(norm(pics_snp_mu, pics_snp_std).sf(neglog_p) * 2)\n\n    @staticmethod\n    def _pics_standard_deviation(neglog_p: float, r2: float, k: float) -> float | None:\n        \"\"\"Compute the PICS standard deviation.\n\n        This distribution is obtained after a series of permutation tests described in the PICS method, and it is only\n        valid when the SNP is highly linked with the lead (r2 > 0.5).\n\n        Args:\n            neglog_p (float): Negative log p-value of the lead variant\n            r2 (float): LD score between a given SNP and the lead variant\n            k (float): Empiric constant that can be adjusted to fit the curve, 6.4 recommended.\n\n        Returns:\n            float | None: Standard deviation for the P value of the association between a SNP and a trait\n\n        Examples:\n            >>> PICS._pics_standard_deviation(neglog_p=1.0, r2=1.0, k=6.4)\n            0.0\n            >>> round(PICS._pics_standard_deviation(neglog_p=10.0, r2=0.5, k=6.4), 3)\n            1.493\n            >>> print(PICS._pics_standard_deviation(neglog_p=1.0, r2=0.0, k=6.4))\n            None\n        \"\"\"\n        return (\n            abs(((1 - (r2**0.5) ** k) ** 0.5) * (neglog_p**0.5) / 2)\n            if r2 >= 0.5\n            else None\n        )\n\n    @staticmethod\n    def _pics_mu(neglog_p: float, r2: float) -> float | None:\n        \"\"\"Compute the PICS mu that estimates the probability of association between a given SNP and the trait.\n\n        This distribution is obtained after a series of permutation tests described in the PICS method, and it is only\n        valid when the SNP is highly linked with the lead (r2 > 0.5).\n\n        Args:\n            neglog_p (float): Negative log p-value of the lead variant\n            r2 (float): LD score between a given SNP and the lead variant\n\n        Returns:\n            float | None: Mean P value of the association between a SNP and a trait\n\n        Examples:\n            >>> PICS._pics_mu(neglog_p=1.0, r2=1.0)\n            1.0\n            >>> PICS._pics_mu(neglog_p=10.0, r2=0.5)\n            5.0\n            >>> print(PICS._pics_mu(neglog_p=10.0, r2=0.3))\n            None\n        \"\"\"\n        return neglog_p * r2 if r2 >= 0.5 else None\n\n    @staticmethod\n    def _finemap(ld_set: list[Row], 
lead_neglog_p: float, k: float) -> list | None:\n        \"\"\"Calculates the probability of a variant being causal in a study-locus context by applying the PICS method.\n\n        It is intended to be applied as an UDF in `PICS.finemap`, where each row is a StudyLocus association.\n        The function iterates over every SNP in the `ldSet` array, and it returns an updated locus with\n        its association signal and causality probability as of PICS.\n\n        Args:\n            ld_set (list[Row]): list of tagging variants after expanding the locus\n            lead_neglog_p (float): P value of the association signal between the lead variant and the study in the form of -log10.\n            k (float): Empiric constant that can be adjusted to fit the curve, 6.4 recommended.\n\n        Returns:\n            list | None: List of tagging variants with an estimation of the association signal and their posterior probability as of PICS.\n\n        Examples:\n            >>> from pyspark.sql import Row\n            >>> ld_set = [\n            ...     Row(variantId=\"var1\", r2Overall=0.8),\n            ...     Row(variantId=\"var2\", r2Overall=1),\n            ... ]\n            >>> PICS._finemap(ld_set, lead_neglog_p=10.0, k=6.4)\n            [{'variantId': 'var1', 'r2Overall': 0.8, 'standardError': 0.07420896512708416, 'posteriorProbability': 0.07116959886882368}, {'variantId': 'var2', 'r2Overall': 1, 'standardError': 0.9977000638225533, 'posteriorProbability': 0.9288304011311763}]\n            >>> empty_ld_set = []\n            >>> PICS._finemap(empty_ld_set, lead_neglog_p=10.0, k=6.4)\n            []\n            >>> ld_set_with_no_r2 = [\n            ...     Row(variantId=\"var1\", r2Overall=None),\n            ...     Row(variantId=\"var2\", r2Overall=None),\n            ... 
]\n            >>> PICS._finemap(ld_set_with_no_r2, lead_neglog_p=10.0, k=6.4)\n            [{'variantId': 'var1', 'r2Overall': None}, {'variantId': 'var2', 'r2Overall': None}]\n        \"\"\"\n        if ld_set is None:\n            return None\n        elif not ld_set:\n            return []\n        tmp_credible_set = []\n        new_credible_set = []\n        # First iteration: calculation of mu, standard deviation, and the relative posterior probability\n        for tag_struct in ld_set:\n            tag_dict = (\n                tag_struct.asDict()\n            )  # tag_struct is of type pyspark.Row, we'll represent it as a dict\n            if (\n                not tag_dict[\"r2Overall\"]\n                or tag_dict[\"r2Overall\"] < 0.5\n                or not lead_neglog_p\n            ):\n                # If PICS cannot be calculated, we'll return the original credible set\n                new_credible_set.append(tag_dict)\n                continue\n\n            pics_snp_mu = PICS._pics_mu(lead_neglog_p, tag_dict[\"r2Overall\"])\n            pics_snp_std = PICS._pics_standard_deviation(\n                lead_neglog_p, tag_dict[\"r2Overall\"], k\n            )\n            pics_snp_std = 0.001 if pics_snp_std == 0 else pics_snp_std\n            if pics_snp_mu is not None and pics_snp_std is not None:\n                posterior_probability = PICS._pics_relative_posterior_probability(\n                    lead_neglog_p, pics_snp_mu, pics_snp_std\n                )\n                tag_dict[\"standardError\"] = 10**-pics_snp_std\n                tag_dict[\"relativePosteriorProbability\"] = posterior_probability\n\n                tmp_credible_set.append(tag_dict)\n\n        # Second iteration: calculation of the sum of all the posteriors in each study-locus, so that we scale them between 0-1\n        total_posteriors = sum(\n            tag_dict.get(\"relativePosteriorProbability\", 0)\n            for tag_dict in tmp_credible_set\n        )\n\n        # Third iteration: calculation of the final posteriorProbability\n        for tag_dict in tmp_credible_set:\n            if total_posteriors != 0:\n                tag_dict[\"posteriorProbability\"] = float(\n                    tag_dict.get(\"relativePosteriorProbability\", 0) / total_posteriors\n                )\n            tag_dict.pop(\"relativePosteriorProbability\")\n            new_credible_set.append(tag_dict)\n        return new_credible_set\n\n    @classmethod\n    def finemap(\n        cls: type[PICS], associations: StudyLocus, k: float = 6.4\n    ) -> StudyLocus:\n        \"\"\"Run PICS on a study locus.\n\n        !!! 
info \"Study locus needs to be LD annotated\"\n            The study locus needs to be LD annotated before PICS can be calculated.\n\n        Args:\n            associations (StudyLocus): Study locus to finemap using PICS\n            k (float): Empiric constant that can be adjusted to fit the curve, 6.4 recommended.\n\n        Returns:\n            StudyLocus: Study locus with PICS results\n        \"\"\"\n        # Register UDF by defining the structure of the output locus array of structs\n        # it also renames tagVariantId to variantId\n\n        picsed_ldset_schema = t.ArrayType(\n            t.StructType(\n                [\n                    t.StructField(\"tagVariantId\", t.StringType(), True),\n                    t.StructField(\"r2Overall\", t.DoubleType(), True),\n                    t.StructField(\"posteriorProbability\", t.DoubleType(), True),\n                    t.StructField(\"standardError\", t.DoubleType(), True),\n                ]\n            )\n        )\n        picsed_study_locus_schema = t.ArrayType(\n            t.StructType(\n                [\n                    t.StructField(\"variantId\", t.StringType(), True),\n                    t.StructField(\"r2Overall\", t.DoubleType(), True),\n                    t.StructField(\"posteriorProbability\", t.DoubleType(), True),\n                    t.StructField(\"standardError\", t.DoubleType(), True),\n                ]\n            )\n        )\n        _finemap_udf = f.udf(\n            lambda locus, neglog_p: PICS._finemap(locus, neglog_p, k),\n            picsed_ldset_schema,\n        )\n        return StudyLocus(\n            _df=(\n                associations.df\n                # Old locus column will be dropped if available\n                .select(*[col for col in associations.df.columns if col != \"locus\"])\n                # Estimate neglog_pvalue for the lead variant\n                .withColumn(\"neglog_pvalue\", associations.neglog_pvalue())\n                # New locus containing the PICS results\n                .withColumn(\n                    \"locus\",\n                    f.when(\n                        f.col(\"ldSet\").isNotNull(),\n                        _finemap_udf(f.col(\"ldSet\"), f.col(\"neglog_pvalue\")).cast(\n                            picsed_study_locus_schema\n                        ),\n                    ),\n                )\n                # Rename tagVariantId to variantId\n                .drop(\"neglog_pvalue\")\n            ),\n            _schema=StudyLocus.get_schema(),\n        )\n
"},{"location":"python_api/method/pics/#otg.method.pics.PICS.finemap","title":"finemap(associations: StudyLocus, k: float = 6.4) -> StudyLocus classmethod","text":"

Run PICS on a study locus.

Study locus needs to be LD annotated

The study locus needs to be LD annotated before PICS can be calculated.

Parameters:

Name Type Description Default associations StudyLocus

Study locus to finemap using PICS

required k float

Empiric constant that can be adjusted to fit the curve, 6.4 recommended.

6.4

Returns:

Name Type Description StudyLocus StudyLocus

Study locus with PICS results
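
A minimal usage sketch, assuming study_locus has already been LD annotated:

from otg.method.pics import PICS

finemapped = PICS.finemap(study_locus, k=6.4)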

Source code in src/otg/method/pics.py
@classmethod\ndef finemap(\n    cls: type[PICS], associations: StudyLocus, k: float = 6.4\n) -> StudyLocus:\n    \"\"\"Run PICS on a study locus.\n\n    !!! info \"Study locus needs to be LD annotated\"\n        The study locus needs to be LD annotated before PICS can be calculated.\n\n    Args:\n        associations (StudyLocus): Study locus to finemap using PICS\n        k (float): Empiric constant that can be adjusted to fit the curve, 6.4 recommended.\n\n    Returns:\n        StudyLocus: Study locus with PICS results\n    \"\"\"\n    # Register UDF by defining the structure of the output locus array of structs\n    # it also renames tagVariantId to variantId\n\n    picsed_ldset_schema = t.ArrayType(\n        t.StructType(\n            [\n                t.StructField(\"tagVariantId\", t.StringType(), True),\n                t.StructField(\"r2Overall\", t.DoubleType(), True),\n                t.StructField(\"posteriorProbability\", t.DoubleType(), True),\n                t.StructField(\"standardError\", t.DoubleType(), True),\n            ]\n        )\n    )\n    picsed_study_locus_schema = t.ArrayType(\n        t.StructType(\n            [\n                t.StructField(\"variantId\", t.StringType(), True),\n                t.StructField(\"r2Overall\", t.DoubleType(), True),\n                t.StructField(\"posteriorProbability\", t.DoubleType(), True),\n                t.StructField(\"standardError\", t.DoubleType(), True),\n            ]\n        )\n    )\n    _finemap_udf = f.udf(\n        lambda locus, neglog_p: PICS._finemap(locus, neglog_p, k),\n        picsed_ldset_schema,\n    )\n    return StudyLocus(\n        _df=(\n            associations.df\n            # Old locus column will be dropped if available\n            .select(*[col for col in associations.df.columns if col != \"locus\"])\n            # Estimate neglog_pvalue for the lead variant\n            .withColumn(\"neglog_pvalue\", associations.neglog_pvalue())\n            # New locus containing the PICS results\n            .withColumn(\n                \"locus\",\n                f.when(\n                    f.col(\"ldSet\").isNotNull(),\n                    _finemap_udf(f.col(\"ldSet\"), f.col(\"neglog_pvalue\")).cast(\n                        picsed_study_locus_schema\n                    ),\n                ),\n            )\n            # Rename tagVariantId to variantId\n            .drop(\"neglog_pvalue\")\n        ),\n        _schema=StudyLocus.get_schema(),\n    )\n
"},{"location":"python_api/method/window_based_clumping/","title":"Window-based clumping","text":""},{"location":"python_api/method/window_based_clumping/#otg.method.window_based_clumping.WindowBasedClumping","title":"otg.method.window_based_clumping.WindowBasedClumping","text":"

Get semi-lead SNPs from summary statistics using a window-based function.
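
A minimal usage sketch, assuming summary_stats is a SummaryStatistics dataset; the window length below is an illustrative value.

from otg.method.window_based_clumping import WindowBasedClumping

clumped = WindowBasedClumping.clump(
    summary_stats,
    window_length=500_000,      # distance in base pairs defining a cluster
    p_value_significance=5e-8,  # only variants at least this significant are clumped
)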

Source code in src/otg/method/window_based_clumping.py
class WindowBasedClumping:\n    \"\"\"Get semi-lead snps from summary statistics using a window based function.\"\"\"\n\n    @staticmethod\n    def _cluster_peaks(\n        study: Column, chromosome: Column, position: Column, window_length: int\n    ) -> Column:\n        \"\"\"Cluster GWAS significant variants, were clusters are separated by a defined distance.\n\n        !! Important to note that the length of the clusters can be arbitrarily big.\n\n        Args:\n            study (Column): study identifier\n            chromosome (Column): chromosome identifier\n            position (Column): position of the variant\n            window_length (int): window length in basepair\n\n        Returns:\n            Column: containing cluster identifier\n\n        Examples:\n            >>> data = [\n            ...     # Cluster 1:\n            ...     ('s1', 'chr1', 2),\n            ...     ('s1', 'chr1', 4),\n            ...     ('s1', 'chr1', 12),\n            ...     # Cluster 2 - Same chromosome:\n            ...     ('s1', 'chr1', 31),\n            ...     ('s1', 'chr1', 38),\n            ...     ('s1', 'chr1', 42),\n            ...     # Cluster 3 - New chromosome:\n            ...     ('s1', 'chr2', 41),\n            ...     ('s1', 'chr2', 44),\n            ...     ('s1', 'chr2', 50),\n            ...     # Cluster 4 - other study:\n            ...     ('s2', 'chr2', 55),\n            ...     ('s2', 'chr2', 62),\n            ...     ('s2', 'chr2', 70),\n            ... ]\n            >>> window_length = 10\n            >>> (\n            ...     spark.createDataFrame(data, ['studyId', 'chromosome', 'position'])\n            ...     .withColumn(\"cluster_id\",\n            ...         WindowBasedClumping._cluster_peaks(\n            ...             f.col('studyId'),\n            ...             f.col('chromosome'),\n            ...             f.col('position'),\n            ...             window_length\n            ...         )\n            ...     ).show()\n            ... 
)\n            +-------+----------+--------+----------+\n            |studyId|chromosome|position|cluster_id|\n            +-------+----------+--------+----------+\n            |     s1|      chr1|       2| s1_chr1_2|\n            |     s1|      chr1|       4| s1_chr1_2|\n            |     s1|      chr1|      12| s1_chr1_2|\n            |     s1|      chr1|      31|s1_chr1_31|\n            |     s1|      chr1|      38|s1_chr1_31|\n            |     s1|      chr1|      42|s1_chr1_31|\n            |     s1|      chr2|      41|s1_chr2_41|\n            |     s1|      chr2|      44|s1_chr2_41|\n            |     s1|      chr2|      50|s1_chr2_41|\n            |     s2|      chr2|      55|s2_chr2_55|\n            |     s2|      chr2|      62|s2_chr2_55|\n            |     s2|      chr2|      70|s2_chr2_55|\n            +-------+----------+--------+----------+\n            <BLANKLINE>\n\n        \"\"\"\n        # By adding previous position, the cluster boundary can be identified:\n        previous_position = f.lag(position).over(\n            Window.partitionBy(study, chromosome).orderBy(position)\n        )\n        # We consider a cluster boudary if subsequent snps are further than the defined window:\n        cluster_id = f.when(\n            (previous_position.isNull())\n            | (position - previous_position > window_length),\n            f.concat_ws(\"_\", study, chromosome, position),\n        )\n        # The cluster identifier is propagated across every variant of the cluster:\n        return f.when(\n            cluster_id.isNull(),\n            f.last(cluster_id, ignorenulls=True).over(\n                Window.partitionBy(study, chromosome)\n                .orderBy(position)\n                .rowsBetween(Window.unboundedPreceding, Window.currentRow)\n            ),\n        ).otherwise(cluster_id)\n\n    @staticmethod\n    def _prune_peak(position: ndarray, window_size: int) -> DenseVector:\n        \"\"\"Establish lead snps based on their positions listed by p-value.\n\n        The function `find_peak` assigns lead SNPs based on their positions listed by p-value within a specified window size.\n\n        Args:\n            position (ndarray): positions of the SNPs sorted by p-value.\n            window_size (int): the distance in bp within which associations are clumped together around the lead snp.\n\n        Returns:\n            DenseVector: binary vector where 1 indicates a lead SNP and 0 indicates a non-lead SNP.\n\n        Examples:\n            >>> from pyspark.ml import functions as fml\n            >>> from pyspark.ml.linalg import DenseVector\n            >>> WindowBasedClumping._prune_peak(np.array((3, 9, 8, 4, 6)), 2)\n            DenseVector([1.0, 1.0, 0.0, 0.0, 1.0])\n\n        \"\"\"\n        # Initializing the lead list with zeroes:\n        is_lead: ndarray = np.zeros(len(position))\n\n        # List containing indices of leads:\n        lead_indices: list = []\n\n        # Looping through all positions:\n        for index in range(len(position)):\n            # Looping through leads to find out if they are within a window:\n            for lead_index in lead_indices:\n                # If any of the leads within the window:\n                if abs(position[lead_index] - position[index]) < window_size:\n                    # Skipping further checks:\n                    break\n            else:\n                # None of the leads were within the window:\n                lead_indices.append(index)\n                is_lead[index] = 1\n\n        return 
DenseVector(is_lead)\n\n    @classmethod\n    def clump(\n        cls: type[WindowBasedClumping],\n        summary_stats: SummaryStatistics,\n        window_length: int,\n        p_value_significance: float = 5e-8,\n    ) -> StudyLocus:\n        \"\"\"Clump summary statistics by distance.\n\n        Args:\n            summary_stats (SummaryStatistics): summary statistics to clump\n            window_length (int): window length in basepair\n            p_value_significance (float): only more significant variants are considered\n\n        Returns:\n            StudyLocus: clumped summary statistics\n        \"\"\"\n        # Create window for locus clusters\n        # - variants where the distance between subsequent variants is below the defined threshold.\n        # - Variants are sorted by descending significance\n        cluster_window = Window.partitionBy(\n            \"studyId\", \"chromosome\", \"cluster_id\"\n        ).orderBy(f.col(\"pValueExponent\").asc(), f.col(\"pValueMantissa\").asc())\n\n        return StudyLocus(\n            _df=(\n                summary_stats\n                # Dropping snps below significance - all subsequent steps are done on significant variants:\n                .pvalue_filter(p_value_significance)\n                .df\n                # Clustering summary variants for efficient windowing (complexity reduction):\n                .withColumn(\n                    \"cluster_id\",\n                    WindowBasedClumping._cluster_peaks(\n                        f.col(\"studyId\"),\n                        f.col(\"chromosome\"),\n                        f.col(\"position\"),\n                        window_length,\n                    ),\n                )\n                # Within each cluster variants are ranked by significance:\n                .withColumn(\"pvRank\", f.row_number().over(cluster_window))\n                # Collect positions in cluster for the most significant variant (complexity reduction):\n                .withColumn(\n                    \"collectedPositions\",\n                    f.when(\n                        f.col(\"pvRank\") == 1,\n                        f.collect_list(f.col(\"position\")).over(\n                            cluster_window.rowsBetween(\n                                Window.currentRow, Window.unboundedFollowing\n                            )\n                        ),\n                    ).otherwise(f.array()),\n                )\n                # Get semi indices only ONCE per cluster:\n                .withColumn(\n                    \"semiIndices\",\n                    f.when(\n                        f.size(f.col(\"collectedPositions\")) > 0,\n                        fml.vector_to_array(\n                            f.udf(WindowBasedClumping._prune_peak, VectorUDT())(\n                                fml.array_to_vector(f.col(\"collectedPositions\")),\n                                f.lit(window_length),\n                            )\n                        ),\n                    ),\n                )\n                # Propagating the result of the above calculation for all rows:\n                .withColumn(\n                    \"semiIndices\",\n                    f.when(\n                        f.col(\"semiIndices\").isNull(),\n                        f.first(f.col(\"semiIndices\"), ignorenulls=True).over(\n                            cluster_window\n                        ),\n                    ).otherwise(f.col(\"semiIndices\")),\n                )\n                # Keeping semi indices 
only:\n                .filter(f.col(\"semiIndices\")[f.col(\"pvRank\") - 1] > 0)\n                .drop(\"pvRank\", \"collectedPositions\", \"semiIndices\", \"cluster_id\")\n                # Adding study-locus id:\n                .withColumn(\n                    \"studyLocusId\",\n                    StudyLocus.assign_study_locus_id(\"studyId\", \"variantId\"),\n                )\n                # Initialize QC column as array of strings:\n                .withColumn(\n                    \"qualityControls\", f.array().cast(t.ArrayType(t.StringType()))\n                )\n            ),\n            _schema=StudyLocus.get_schema(),\n        )\n\n    @classmethod\n    def clump_with_locus(\n        cls: type[WindowBasedClumping],\n        summary_stats: SummaryStatistics,\n        window_length: int,\n        p_value_significance: float = 5e-8,\n        p_value_baseline: float = 0.05,\n        locus_window_length: int | None = None,\n    ) -> StudyLocus:\n        \"\"\"Clump significant associations while collecting locus around them.\n\n        Args:\n            summary_stats (SummaryStatistics): Input summary statistics dataset\n            window_length (int): Window size in  bp, used for distance based clumping.\n            p_value_significance (float): GWAS significance threshold used to filter peaks. Defaults to 5e-8.\n            p_value_baseline (float): Least significant threshold. Below this, all snps are dropped. Defaults to 0.05.\n            locus_window_length (int | None): The distance for collecting locus around the semi indices. Defaults to None.\n\n        Returns:\n            StudyLocus: StudyLocus after clumping with information about the `locus`\n        \"\"\"\n        # If no locus window provided, using the same value:\n        if locus_window_length is None:\n            locus_window_length = window_length\n\n        # Run distance based clumping on the summary stats:\n        clumped_dataframe = WindowBasedClumping.clump(\n            summary_stats,\n            window_length=window_length,\n            p_value_significance=p_value_significance,\n        ).df.alias(\"clumped\")\n\n        # Get list of columns from clumped dataset for further propagation:\n        clumped_columns = clumped_dataframe.columns\n\n        # Dropping variants not meeting the baseline criteria:\n        sumstats_baseline = summary_stats.pvalue_filter(p_value_baseline).df\n\n        # Renaming columns:\n        sumstats_baseline_renamed = sumstats_baseline.selectExpr(\n            *[f\"{col} as tag_{col}\" for col in sumstats_baseline.columns]\n        ).alias(\"sumstat\")\n\n        study_locus_df = (\n            sumstats_baseline_renamed\n            # Joining the two datasets together:\n            .join(\n                f.broadcast(clumped_dataframe),\n                on=[\n                    (f.col(\"sumstat.tag_studyId\") == f.col(\"clumped.studyId\"))\n                    & (f.col(\"sumstat.tag_chromosome\") == f.col(\"clumped.chromosome\"))\n                    & (\n                        f.col(\"sumstat.tag_position\")\n                        >= (f.col(\"clumped.position\") - locus_window_length)\n                    )\n                    & (\n                        f.col(\"sumstat.tag_position\")\n                        <= (f.col(\"clumped.position\") + locus_window_length)\n                    )\n                ],\n                how=\"right\",\n            )\n            .withColumn(\n                \"locus\",\n                f.struct(\n                    
f.col(\"tag_variantId\").alias(\"variantId\"),\n                    f.col(\"tag_beta\").alias(\"beta\"),\n                    f.col(\"tag_pValueMantissa\").alias(\"pValueMantissa\"),\n                    f.col(\"tag_pValueExponent\").alias(\"pValueExponent\"),\n                    f.col(\"tag_standardError\").alias(\"standardError\"),\n                ),\n            )\n            .groupby(\"studyLocusId\")\n            .agg(\n                *[\n                    f.first(col).alias(col)\n                    for col in clumped_columns\n                    if col != \"studyLocusId\"\n                ],\n                f.collect_list(f.col(\"locus\")).alias(\"locus\"),\n            )\n        )\n\n        return StudyLocus(\n            _df=study_locus_df,\n            _schema=StudyLocus.get_schema(),\n        )\n
"},{"location":"python_api/method/window_based_clumping/#otg.method.window_based_clumping.WindowBasedClumping.clump","title":"clump(summary_stats: SummaryStatistics, window_length: int, p_value_significance: float = 5e-08) -> StudyLocus classmethod","text":"

Clump summary statistics by distance.

Parameters:

Name Type Description Default summary_stats SummaryStatistics

summary statistics to clump

required window_length int

window length in base pairs

required p_value_significance float

only variants more significant than this threshold are considered

5e-08

Returns:

Name Type Description StudyLocus StudyLocus

clumped summary statistics

Source code in src/otg/method/window_based_clumping.py
@classmethod\ndef clump(\n    cls: type[WindowBasedClumping],\n    summary_stats: SummaryStatistics,\n    window_length: int,\n    p_value_significance: float = 5e-8,\n) -> StudyLocus:\n    \"\"\"Clump summary statistics by distance.\n\n    Args:\n        summary_stats (SummaryStatistics): summary statistics to clump\n        window_length (int): window length in basepair\n        p_value_significance (float): only more significant variants are considered\n\n    Returns:\n        StudyLocus: clumped summary statistics\n    \"\"\"\n    # Create window for locus clusters\n    # - variants where the distance between subsequent variants is below the defined threshold.\n    # - Variants are sorted by descending significance\n    cluster_window = Window.partitionBy(\n        \"studyId\", \"chromosome\", \"cluster_id\"\n    ).orderBy(f.col(\"pValueExponent\").asc(), f.col(\"pValueMantissa\").asc())\n\n    return StudyLocus(\n        _df=(\n            summary_stats\n            # Dropping snps below significance - all subsequent steps are done on significant variants:\n            .pvalue_filter(p_value_significance)\n            .df\n            # Clustering summary variants for efficient windowing (complexity reduction):\n            .withColumn(\n                \"cluster_id\",\n                WindowBasedClumping._cluster_peaks(\n                    f.col(\"studyId\"),\n                    f.col(\"chromosome\"),\n                    f.col(\"position\"),\n                    window_length,\n                ),\n            )\n            # Within each cluster variants are ranked by significance:\n            .withColumn(\"pvRank\", f.row_number().over(cluster_window))\n            # Collect positions in cluster for the most significant variant (complexity reduction):\n            .withColumn(\n                \"collectedPositions\",\n                f.when(\n                    f.col(\"pvRank\") == 1,\n                    f.collect_list(f.col(\"position\")).over(\n                        cluster_window.rowsBetween(\n                            Window.currentRow, Window.unboundedFollowing\n                        )\n                    ),\n                ).otherwise(f.array()),\n            )\n            # Get semi indices only ONCE per cluster:\n            .withColumn(\n                \"semiIndices\",\n                f.when(\n                    f.size(f.col(\"collectedPositions\")) > 0,\n                    fml.vector_to_array(\n                        f.udf(WindowBasedClumping._prune_peak, VectorUDT())(\n                            fml.array_to_vector(f.col(\"collectedPositions\")),\n                            f.lit(window_length),\n                        )\n                    ),\n                ),\n            )\n            # Propagating the result of the above calculation for all rows:\n            .withColumn(\n                \"semiIndices\",\n                f.when(\n                    f.col(\"semiIndices\").isNull(),\n                    f.first(f.col(\"semiIndices\"), ignorenulls=True).over(\n                        cluster_window\n                    ),\n                ).otherwise(f.col(\"semiIndices\")),\n            )\n            # Keeping semi indices only:\n            .filter(f.col(\"semiIndices\")[f.col(\"pvRank\") - 1] > 0)\n            .drop(\"pvRank\", \"collectedPositions\", \"semiIndices\", \"cluster_id\")\n            # Adding study-locus id:\n            .withColumn(\n                \"studyLocusId\",\n                
StudyLocus.assign_study_locus_id(\"studyId\", \"variantId\"),\n            )\n            # Initialize QC column as array of strings:\n            .withColumn(\n                \"qualityControls\", f.array().cast(t.ArrayType(t.StringType()))\n            )\n        ),\n        _schema=StudyLocus.get_schema(),\n    )\n
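A minimal usage sketch of clump. The import path is inferred from the source location shown above; summary_stats is assumed to be an already-loaded SummaryStatistics dataset and the window length is only an example value.

from otg.method.window_based_clumping import WindowBasedClumping

# Distance-based clumping of genome-wide significant variants (sketch):
study_locus = WindowBasedClumping.clump(
    summary_stats,            # hypothetical SummaryStatistics instance
    window_length=500_000,    # example window length in base pairs
    p_value_significance=5e-8,
)
study_locus.df.show(5)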
"},{"location":"python_api/method/window_based_clumping/#otg.method.window_based_clumping.WindowBasedClumping.clump_with_locus","title":"clump_with_locus(summary_stats: SummaryStatistics, window_length: int, p_value_significance: float = 5e-08, p_value_baseline: float = 0.05, locus_window_length: int | None = None) -> StudyLocus classmethod","text":"

Clump significant associations while collecting locus around them.

Parameters:

Name Type Description Default summary_stats SummaryStatistics

Input summary statistics dataset

required window_length int

Window size in bp, used for distance-based clumping.

required p_value_significance float

GWAS significance threshold used to filter peaks. Defaults to 5e-8.

5e-08 p_value_baseline float

Baseline significance threshold: SNPs less significant than this are dropped. Defaults to 0.05.

0.05 locus_window_length int | None

The distance in bp used to collect the locus around the semi-indices (lead variants). Defaults to None.

None

Returns:

Name Type Description StudyLocus StudyLocus

StudyLocus after clumping with information about the locus

Source code in src/otg/method/window_based_clumping.py
@classmethod\ndef clump_with_locus(\n    cls: type[WindowBasedClumping],\n    summary_stats: SummaryStatistics,\n    window_length: int,\n    p_value_significance: float = 5e-8,\n    p_value_baseline: float = 0.05,\n    locus_window_length: int | None = None,\n) -> StudyLocus:\n    \"\"\"Clump significant associations while collecting locus around them.\n\n    Args:\n        summary_stats (SummaryStatistics): Input summary statistics dataset\n        window_length (int): Window size in  bp, used for distance based clumping.\n        p_value_significance (float): GWAS significance threshold used to filter peaks. Defaults to 5e-8.\n        p_value_baseline (float): Least significant threshold. Below this, all snps are dropped. Defaults to 0.05.\n        locus_window_length (int | None): The distance for collecting locus around the semi indices. Defaults to None.\n\n    Returns:\n        StudyLocus: StudyLocus after clumping with information about the `locus`\n    \"\"\"\n    # If no locus window provided, using the same value:\n    if locus_window_length is None:\n        locus_window_length = window_length\n\n    # Run distance based clumping on the summary stats:\n    clumped_dataframe = WindowBasedClumping.clump(\n        summary_stats,\n        window_length=window_length,\n        p_value_significance=p_value_significance,\n    ).df.alias(\"clumped\")\n\n    # Get list of columns from clumped dataset for further propagation:\n    clumped_columns = clumped_dataframe.columns\n\n    # Dropping variants not meeting the baseline criteria:\n    sumstats_baseline = summary_stats.pvalue_filter(p_value_baseline).df\n\n    # Renaming columns:\n    sumstats_baseline_renamed = sumstats_baseline.selectExpr(\n        *[f\"{col} as tag_{col}\" for col in sumstats_baseline.columns]\n    ).alias(\"sumstat\")\n\n    study_locus_df = (\n        sumstats_baseline_renamed\n        # Joining the two datasets together:\n        .join(\n            f.broadcast(clumped_dataframe),\n            on=[\n                (f.col(\"sumstat.tag_studyId\") == f.col(\"clumped.studyId\"))\n                & (f.col(\"sumstat.tag_chromosome\") == f.col(\"clumped.chromosome\"))\n                & (\n                    f.col(\"sumstat.tag_position\")\n                    >= (f.col(\"clumped.position\") - locus_window_length)\n                )\n                & (\n                    f.col(\"sumstat.tag_position\")\n                    <= (f.col(\"clumped.position\") + locus_window_length)\n                )\n            ],\n            how=\"right\",\n        )\n        .withColumn(\n            \"locus\",\n            f.struct(\n                f.col(\"tag_variantId\").alias(\"variantId\"),\n                f.col(\"tag_beta\").alias(\"beta\"),\n                f.col(\"tag_pValueMantissa\").alias(\"pValueMantissa\"),\n                f.col(\"tag_pValueExponent\").alias(\"pValueExponent\"),\n                f.col(\"tag_standardError\").alias(\"standardError\"),\n            ),\n        )\n        .groupby(\"studyLocusId\")\n        .agg(\n            *[\n                f.first(col).alias(col)\n                for col in clumped_columns\n                if col != \"studyLocusId\"\n            ],\n            f.collect_list(f.col(\"locus\")).alias(\"locus\"),\n        )\n    )\n\n    return StudyLocus(\n        _df=study_locus_df,\n        _schema=StudyLocus.get_schema(),\n    )\n
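A corresponding sketch for clump_with_locus, which additionally collects the locus (tag variants) around each clumped association. Dataset names and window sizes are assumptions for illustration.

from otg.method.window_based_clumping import WindowBasedClumping

study_locus_with_locus = WindowBasedClumping.clump_with_locus(
    summary_stats,                # hypothetical SummaryStatistics instance
    window_length=500_000,        # example clumping window in bp
    p_value_significance=5e-8,    # peak significance threshold
    p_value_baseline=0.05,        # baseline threshold for tag variants
    locus_window_length=250_000,  # example locus collection window in bp
)
study_locus_with_locus.df.select("studyLocusId", "locus").show(5, truncate=False)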
"},{"location":"python_api/method/l2g/_l2g/","title":"Locus to Gene (L2G) classifier","text":"

TBC

"},{"location":"python_api/method/l2g/evaluator/","title":"W&B evaluator","text":""},{"location":"python_api/method/l2g/evaluator/#otg.method.l2g.evaluator.WandbEvaluator","title":"otg.method.l2g.evaluator.WandbEvaluator","text":"

Bases: Evaluator

Wrapper for pyspark Evaluators. The user is expected to provide an Evaluator, and this wrapper logs metrics from that evaluator to W&B.

Source code in src/otg/method/l2g/evaluator.py
class WandbEvaluator(Evaluator):\n    \"\"\"Wrapper for pyspark Evaluators. It is expected that the user will provide an Evaluators, and this wrapper will log metrics from said evaluator to W&B.\"\"\"\n\n    spark_ml_evaluator: Param = Param(\n        Params._dummy(), \"spark_ml_evaluator\", \"evaluator from pyspark.ml.evaluation\"  # type: ignore\n    )\n\n    wandb_run: Param = Param(\n        Params._dummy(),  # type: ignore\n        \"wandb_run\",\n        \"wandb run.  Expects an already initialized run.  You should set this, or wandb_run_kwargs, NOT BOTH\",\n    )\n\n    wandb_run_kwargs: Param = Param(\n        Params._dummy(),\n        \"wandb_run_kwargs\",\n        \"kwargs to be passed to wandb.init.  You should set this, or wandb_runId, NOT BOTH.  Setting this is useful when using with WandbCrossValdidator\",\n    )\n\n    wandb_runId: Param = Param(  # noqa: N815\n        Params._dummy(),  # type: ignore\n        \"wandb_runId\",\n        \"wandb run id.  if not providing an intialized run to wandb_run, a run with id wandb_runId will be resumed\",\n    )\n\n    wandb_project_name: Param = Param(\n        Params._dummy(),\n        \"wandb_project_name\",\n        \"name of W&B project\",\n        typeConverter=TypeConverters.toString,\n    )\n\n    label_values: Param = Param(\n        Params._dummy(),\n        \"label_values\",\n        \"for classification and multiclass classification, this is a list of values the label can assume\\nIf provided Multiclass or Multilabel evaluator without label_values, we'll figure it out from dataset passed through to evaluate.\",\n    )\n\n    _input_kwargs: Dict[str, Any]\n\n    @keyword_only\n    def __init__(\n        self: WandbEvaluator,\n        *,\n        label_values: list,\n        wandb_run: wandb.sdk.wandb_run.Run | None = None,\n        spark_ml_evaluator: Evaluator | None = None,\n    ) -> None:\n        \"\"\"Initialize a WandbEvaluator.\n\n        Args:\n            label_values (list): List of label values.\n            wandb_run (wandb.sdk.wandb_run.Run | None): Wandb run object. Defaults to None.\n            spark_ml_evaluator (Evaluator | None): Spark ML evaluator. 
Defaults to None.\n        \"\"\"\n        if label_values is None:\n            label_values = []\n        super(Evaluator, self).__init__()\n\n        self.metrics = {\n            MulticlassClassificationEvaluator: [\n                \"f1\",\n                \"accuracy\",\n                \"weightedPrecision\",\n                \"weightedRecall\",\n                \"weightedTruePositiveRate\",\n                \"weightedFalsePositiveRate\",\n                \"weightedFMeasure\",\n                \"truePositiveRateByLabel\",\n                \"falsePositiveRateByLabel\",\n                \"precisionByLabel\",\n                \"recallByLabel\",\n                \"fMeasureByLabel\",\n                \"logLoss\",\n                \"hammingLoss\",\n            ],\n            BinaryClassificationEvaluator: [\"areaUnderROC\", \"areaUnderPR\"],\n        }\n\n        self._setDefault(label_values=[])\n        kwargs = self._input_kwargs\n        self._set(**kwargs)\n\n    def setspark_ml_evaluator(self: WandbEvaluator, value: Evaluator) -> None:\n        \"\"\"Set the spark_ml_evaluator parameter.\n\n        Args:\n            value (Evaluator): Spark ML evaluator.\n        \"\"\"\n        self._set(spark_ml_evaluator=value)\n\n    def setlabel_values(self: WandbEvaluator, value: list) -> None:\n        \"\"\"Set the label_values parameter.\n\n        Args:\n            value (list): List of label values.\n        \"\"\"\n        self._set(label_values=value)\n\n    def getspark_ml_evaluator(self: WandbEvaluator) -> Evaluator:\n        \"\"\"Get the spark_ml_evaluator parameter.\n\n        Returns:\n            Evaluator: Spark ML evaluator.\n        \"\"\"\n        return self.getOrDefault(self.spark_ml_evaluator)\n\n    def getwandb_run(self: WandbEvaluator) -> wandb.sdk.wandb_run.Run:\n        \"\"\"Get the wandb_run parameter.\n\n        Returns:\n            wandb.sdk.wandb_run.Run: Wandb run object.\n        \"\"\"\n        return self.getOrDefault(self.wandb_run)\n\n    def getwandb_project_name(self: WandbEvaluator) -> str:\n        \"\"\"Get the wandb_project_name parameter.\n\n        Returns:\n            str: Name of the W&B project.\n        \"\"\"\n        return self.getOrDefault(self.wandb_project_name)\n\n    def getlabel_values(self: WandbEvaluator) -> list:\n        \"\"\"Get the label_values parameter.\n\n        Returns:\n            list: List of label values.\n        \"\"\"\n        return self.getOrDefault(self.label_values)\n\n    def _evaluate(self: WandbEvaluator, dataset: DataFrame) -> float:\n        \"\"\"Evaluate the model on the given dataset.\n\n        Args:\n            dataset (DataFrame): Dataset to evaluate the model on.\n\n        Returns:\n            float: Metric value.\n        \"\"\"\n        dataset.persist()\n        metric_values = []\n        label_values = self.getlabel_values()\n        spark_ml_evaluator = self.getspark_ml_evaluator()\n        run = self.getwandb_run()\n        evaluator_type = type(spark_ml_evaluator)\n        if isinstance(spark_ml_evaluator, RankingEvaluator):\n            metric_values.append((\"k\", spark_ml_evaluator.getK()))\n        for metric in self.metrics[evaluator_type]:\n            if \"ByLabel\" in metric and label_values == []:\n                print(\n                    \"no label_values for the target have been provided and will be determined by the dataset.  
This could take some time\"\n                )\n                label_values = [\n                    r[spark_ml_evaluator.getLabelCol()]\n                    for r in dataset.select(spark_ml_evaluator.getLabelCol())\n                    .distinct()\n                    .collect()\n                ]\n                if isinstance(label_values[0], list):\n                    merged = list(itertools.chain(*label_values))\n                    label_values = list(dict.fromkeys(merged).keys())\n                    self.setlabel_values(label_values)\n            for label in label_values:\n                out = spark_ml_evaluator.evaluate(\n                    dataset,\n                    {\n                        spark_ml_evaluator.metricLabel: label,\n                        spark_ml_evaluator.metricName: metric,\n                    },\n                )\n                metric_values.append((f\"{metric}:{label}\", out))\n            out = spark_ml_evaluator.evaluate(\n                dataset, {spark_ml_evaluator.metricName: metric}\n            )\n            metric_values.append((f\"{metric}\", out))\n        run.log(dict(metric_values))\n        config = [\n            (f\"{k.parent.split('_')[0]}.{k.name}\", v)\n            for k, v in spark_ml_evaluator.extractParamMap().items()\n            if \"metric\" not in k.name\n        ]\n        run.config.update(dict(config))\n        return_metric = spark_ml_evaluator.evaluate(dataset)\n        dataset.unpersist()\n        return return_metric\n
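A usage sketch mirroring how LocusToGeneModel.log_to_wandb (documented further down) wires this wrapper up; the project name, run name and predictions DataFrame are assumptions for illustration.

import wandb
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from otg.method.l2g.evaluator import WandbEvaluator  # module path inferred from the source location above

run = wandb.init(project="otg_l2g", name="example-run")  # hypothetical W&B run
binary_evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="label")
wandb_evaluator = WandbEvaluator(spark_ml_evaluator=binary_evaluator, wandb_run=run)
wandb_evaluator.evaluate(predictions_df)  # hypothetical DataFrame with rawPrediction/label columns
run.finish()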
"},{"location":"python_api/method/l2g/evaluator/#otg.method.l2g.evaluator.WandbEvaluator.getlabel_values","title":"getlabel_values() -> list","text":"

Get the label_values parameter.

Returns:

Name Type Description list list

List of label values.

Source code in src/otg/method/l2g/evaluator.py
def getlabel_values(self: WandbEvaluator) -> list:\n    \"\"\"Get the label_values parameter.\n\n    Returns:\n        list: List of label values.\n    \"\"\"\n    return self.getOrDefault(self.label_values)\n
"},{"location":"python_api/method/l2g/evaluator/#otg.method.l2g.evaluator.WandbEvaluator.getspark_ml_evaluator","title":"getspark_ml_evaluator() -> Evaluator","text":"

Get the spark_ml_evaluator parameter.

Returns:

Name Type Description Evaluator Evaluator

Spark ML evaluator.

Source code in src/otg/method/l2g/evaluator.py
def getspark_ml_evaluator(self: WandbEvaluator) -> Evaluator:\n    \"\"\"Get the spark_ml_evaluator parameter.\n\n    Returns:\n        Evaluator: Spark ML evaluator.\n    \"\"\"\n    return self.getOrDefault(self.spark_ml_evaluator)\n
"},{"location":"python_api/method/l2g/evaluator/#otg.method.l2g.evaluator.WandbEvaluator.getwandb_project_name","title":"getwandb_project_name() -> str","text":"

Get the wandb_project_name parameter.

Returns:

Name Type Description str str

Name of the W&B project.

Source code in src/otg/method/l2g/evaluator.py
def getwandb_project_name(self: WandbEvaluator) -> str:\n    \"\"\"Get the wandb_project_name parameter.\n\n    Returns:\n        str: Name of the W&B project.\n    \"\"\"\n    return self.getOrDefault(self.wandb_project_name)\n
"},{"location":"python_api/method/l2g/evaluator/#otg.method.l2g.evaluator.WandbEvaluator.getwandb_run","title":"getwandb_run() -> wandb.sdk.wandb_run.Run","text":"

Get the wandb_run parameter.

Returns:

Type Description Run

wandb.sdk.wandb_run.Run: Wandb run object.

Source code in src/otg/method/l2g/evaluator.py
def getwandb_run(self: WandbEvaluator) -> wandb.sdk.wandb_run.Run:\n    \"\"\"Get the wandb_run parameter.\n\n    Returns:\n        wandb.sdk.wandb_run.Run: Wandb run object.\n    \"\"\"\n    return self.getOrDefault(self.wandb_run)\n
"},{"location":"python_api/method/l2g/evaluator/#otg.method.l2g.evaluator.WandbEvaluator.setlabel_values","title":"setlabel_values(value: list) -> None","text":"

Set the label_values parameter.

Parameters:

Name Type Description Default value list

List of label values.

required Source code in src/otg/method/l2g/evaluator.py
def setlabel_values(self: WandbEvaluator, value: list) -> None:\n    \"\"\"Set the label_values parameter.\n\n    Args:\n        value (list): List of label values.\n    \"\"\"\n    self._set(label_values=value)\n
"},{"location":"python_api/method/l2g/evaluator/#otg.method.l2g.evaluator.WandbEvaluator.setspark_ml_evaluator","title":"setspark_ml_evaluator(value: Evaluator) -> None","text":"

Set the spark_ml_evaluator parameter.

Parameters:

Name Type Description Default value Evaluator

Spark ML evaluator.

required Source code in src/otg/method/l2g/evaluator.py
def setspark_ml_evaluator(self: WandbEvaluator, value: Evaluator) -> None:\n    \"\"\"Set the spark_ml_evaluator parameter.\n\n    Args:\n        value (Evaluator): Spark ML evaluator.\n    \"\"\"\n    self._set(spark_ml_evaluator=value)\n
"},{"location":"python_api/method/l2g/feature_factory/","title":"L2G Feature Factory","text":""},{"location":"python_api/method/l2g/feature_factory/#otg.method.l2g.feature_factory.L2GFeature","title":"otg.method.l2g.feature_factory.L2GFeature dataclass","text":"

Bases: Dataset

Property of a study locus pair.

Source code in src/otg/method/l2g/feature_factory.py
@dataclass\nclass L2GFeature(Dataset):\n    \"\"\"Property of a study locus pair.\"\"\"\n\n    @classmethod\n    def get_schema(cls: type[L2GFeature]) -> StructType:\n        \"\"\"Provides the schema for the L2GFeature dataset.\n\n        Returns:\n            StructType: Schema for the L2GFeature dataset\n        \"\"\"\n        return parse_spark_schema(\"l2g_feature.json\")\n
"},{"location":"python_api/method/l2g/feature_factory/#otg.method.l2g.feature_factory.L2GFeature.get_schema","title":"get_schema() -> StructType classmethod","text":"

Provides the schema for the L2GFeature dataset.

Returns:

Name Type Description StructType StructType

Schema for the L2GFeature dataset

Source code in src/otg/method/l2g/feature_factory.py
@classmethod\ndef get_schema(cls: type[L2GFeature]) -> StructType:\n    \"\"\"Provides the schema for the L2GFeature dataset.\n\n    Returns:\n        StructType: Schema for the L2GFeature dataset\n    \"\"\"\n    return parse_spark_schema(\"l2g_feature.json\")\n
"},{"location":"python_api/method/l2g/feature_factory/#otg.method.l2g.feature_factory.ColocalisationFactory","title":"otg.method.l2g.feature_factory.ColocalisationFactory","text":"

Feature extraction in colocalisation.

Source code in src/otg/method/l2g/feature_factory.py
class ColocalisationFactory:\n    \"\"\"Feature extraction in colocalisation.\"\"\"\n\n    @staticmethod\n    def _get_max_coloc_per_study_locus(\n        study_locus: StudyLocus,\n        studies: StudyIndex,\n        colocalisation: Colocalisation,\n        colocalisation_method: str,\n    ) -> L2GFeature:\n        \"\"\"Get the maximum colocalisation posterior probability for each pair of overlapping study-locus per type of colocalisation method and QTL type.\n\n        Args:\n            study_locus (StudyLocus): Study locus dataset\n            studies (StudyIndex): Study index dataset\n            colocalisation (Colocalisation): Colocalisation dataset\n            colocalisation_method (str): Colocalisation method to extract the max from\n\n        Returns:\n            L2GFeature: Stores the features with the max coloc probabilities for each pair of study-locus\n\n        Raises:\n            ValueError: If the colocalisation method is not supported\n        \"\"\"\n        if colocalisation_method not in [\"COLOC\", \"eCAVIAR\"]:\n            raise ValueError(\n                f\"Colocalisation method {colocalisation_method} not supported\"\n            )\n        if colocalisation_method == \"COLOC\":\n            coloc_score_col_name = \"log2h4h3\"\n            coloc_feature_col_template = \"max_coloc_llr\"\n\n        elif colocalisation_method == \"eCAVIAR\":\n            coloc_score_col_name = \"clpp\"\n            coloc_feature_col_template = \"max_coloc_clpp\"\n\n        colocalising_study_locus = (\n            study_locus.df.select(\"studyLocusId\", \"studyId\")\n            # annotate studyLoci with overlapping IDs on the left - to just keep GWAS associations\n            .join(\n                colocalisation._df.selectExpr(\n                    \"leftStudyLocusId as studyLocusId\",\n                    \"rightStudyLocusId\",\n                    \"colocalisationMethod\",\n                    f\"{coloc_score_col_name} as coloc_score\",\n                ),\n                on=\"studyLocusId\",\n                how=\"inner\",\n            )\n            # bring study metadata to just keep QTL studies on the right\n            .join(\n                study_locus.df.selectExpr(\n                    \"studyLocusId as rightStudyLocusId\", \"studyId as right_studyId\"\n                ),\n                on=\"rightStudyLocusId\",\n                how=\"inner\",\n            )\n            .join(\n                f.broadcast(\n                    studies._df.selectExpr(\n                        \"studyId as right_studyId\",\n                        \"studyType as right_studyType\",\n                        \"geneId\",\n                    )\n                ),\n                on=\"right_studyId\",\n                how=\"inner\",\n            )\n            .filter(\n                (f.col(\"colocalisationMethod\") == colocalisation_method)\n                & (f.col(\"right_studyType\") != \"gwas\")\n            )\n            .select(\"studyLocusId\", \"right_studyType\", \"geneId\", \"coloc_score\")\n        )\n\n        # Max LLR calculation per studyLocus AND type of QTL\n        local_max = get_record_with_maximum_value(\n            colocalising_study_locus,\n            [\"studyLocusId\", \"right_studyType\", \"geneId\"],\n            \"coloc_score\",\n        )\n        neighbourhood_max = (\n            get_record_with_maximum_value(\n                colocalising_study_locus,\n                [\"studyLocusId\", \"right_studyType\"],\n                \"coloc_score\",\n   
         )\n            .join(\n                local_max.selectExpr(\"studyLocusId\", \"coloc_score as coloc_local_max\"),\n                on=\"studyLocusId\",\n                how=\"inner\",\n            )\n            .withColumn(\n                f\"{coloc_feature_col_template}_nbh\",\n                f.col(\"coloc_local_max\") - f.col(\"coloc_score\"),\n            )\n        )\n\n        # Split feature per molQTL\n        local_dfs = []\n        nbh_dfs = []\n        study_types = (\n            colocalising_study_locus.select(\"right_studyType\").distinct().collect()\n        )\n\n        for qtl_type in study_types:\n            local_max = local_max.filter(\n                f.col(\"right_studyType\") == qtl_type\n            ).withColumnRenamed(\n                \"coloc_score\", f\"{qtl_type}_{coloc_feature_col_template}_local\"\n            )\n            local_dfs.append(local_max)\n\n            neighbourhood_max = neighbourhood_max.filter(\n                f.col(\"right_studyType\") == qtl_type\n            ).withColumnRenamed(\n                f\"{coloc_feature_col_template}_nbh\",\n                f\"{qtl_type}_{coloc_feature_col_template}_nbh\",\n            )\n            nbh_dfs.append(neighbourhood_max)\n\n        wide_dfs = reduce(\n            lambda x, y: x.unionByName(y, allowMissingColumns=True),\n            local_dfs + nbh_dfs,\n            colocalising_study_locus.limit(0),\n        )\n\n        return L2GFeature(\n            _df=_convert_from_wide_to_long(\n                wide_dfs,\n                id_vars=(\"studyLocusId\", \"geneId\"),\n                var_name=\"featureName\",\n                value_name=\"featureValue\",\n            ),\n            _schema=L2GFeature.get_schema(),\n        )\n\n    @staticmethod\n    def _get_coloc_features(\n        study_locus: StudyLocus, studies: StudyIndex, colocalisation: Colocalisation\n    ) -> L2GFeature:\n        \"\"\"Calls _get_max_coloc_per_study_locus for both methods and concatenates the results.\n\n        Args:\n            study_locus (StudyLocus): Study locus dataset\n            studies (StudyIndex): Study index dataset\n            colocalisation (Colocalisation): Colocalisation dataset\n\n        Returns:\n            L2GFeature: Stores the features with the max coloc probabilities for each pair of study-locus\n        \"\"\"\n        coloc_llr = ColocalisationFactory._get_max_coloc_per_study_locus(\n            study_locus,\n            studies,\n            colocalisation,\n            \"COLOC\",\n        )\n        coloc_clpp = ColocalisationFactory._get_max_coloc_per_study_locus(\n            study_locus,\n            studies,\n            colocalisation,\n            \"eCAVIAR\",\n        )\n\n        return L2GFeature(\n            _df=coloc_llr.df.unionByName(coloc_clpp.df, allowMissingColumns=True),\n            _schema=L2GFeature.get_schema(),\n        )\n
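For orientation, a sketch of producing the colocalisation features via _get_coloc_features; the leading-underscore call and the dataset variables are assumptions for illustration, not part of the reference above.

from otg.method.l2g.feature_factory import ColocalisationFactory

coloc_features = ColocalisationFactory._get_coloc_features(
    study_locus,     # hypothetical StudyLocus dataset
    studies,         # hypothetical StudyIndex dataset
    colocalisation,  # hypothetical Colocalisation dataset
)
coloc_features.df.show(5, truncate=False)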
"},{"location":"python_api/method/l2g/feature_factory/#otg.method.l2g.feature_factory.StudyLocusFactory","title":"otg.method.l2g.feature_factory.StudyLocusFactory","text":"

Bases: StudyLocus

Feature extraction in study locus.

Source code in src/otg/method/l2g/feature_factory.py
class StudyLocusFactory(StudyLocus):\n    \"\"\"Feature extraction in study locus.\"\"\"\n\n    @staticmethod\n    def _get_tss_distance_features(\n        study_locus: StudyLocus, distances: V2G\n    ) -> L2GFeature:\n        \"\"\"Joins StudyLocus with the V2G to extract the minimum distance to a gene TSS of all variants in a StudyLocus credible set.\n\n        Args:\n            study_locus (StudyLocus): Study locus dataset\n            distances (V2G): Dataframe containing the distances of all variants to all genes TSS within a region\n\n        Returns:\n            L2GFeature: Stores the features with the minimum distance among all variants in the credible set and a gene TSS.\n\n        \"\"\"\n        wide_df = (\n            study_locus.filter_credible_set(CredibleInterval.IS95)\n            .df.select(\n                \"studyLocusId\",\n                \"variantId\",\n                f.explode(\"locus.variantId\").alias(\"tagVariantId\"),\n            )\n            .join(\n                distances.df.selectExpr(\n                    \"variantId as tagVariantId\", \"geneId\", \"distance\"\n                ),\n                on=\"tagVariantId\",\n                how=\"inner\",\n            )\n            .groupBy(\"studyLocusId\", \"geneId\")\n            .agg(\n                f.min(\"distance\").alias(\"distanceTssMinimum\"),\n                f.mean(\"distance\").alias(\"distanceTssMean\"),\n            )\n        )\n\n        return L2GFeature(\n            _df=_convert_from_wide_to_long(\n                wide_df,\n                id_vars=(\"studyLocusId\", \"geneId\"),\n                var_name=\"featureName\",\n                value_name=\"featureValue\",\n            ),\n            _schema=L2GFeature.get_schema(),\n        )\n
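Similarly, a sketch for the TSS-distance features; study_locus and v2g are assumed pre-loaded datasets and the direct call to the leading-underscore method is only illustrative.

from otg.method.l2g.feature_factory import StudyLocusFactory

tss_features = StudyLocusFactory._get_tss_distance_features(
    study_locus,  # hypothetical StudyLocus dataset
    v2g,          # hypothetical V2G dataset with variant-to-gene TSS distances
)
tss_features.df.filter("featureName = 'distanceTssMinimum'").show(5)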
"},{"location":"python_api/method/l2g/model/","title":"L2G Model","text":""},{"location":"python_api/method/l2g/model/#otg.method.l2g.model.LocusToGeneModel","title":"otg.method.l2g.model.LocusToGeneModel dataclass","text":"

Wrapper for the Locus to Gene classifier.

Source code in src/otg/method/l2g/model.py
@dataclass\nclass LocusToGeneModel:\n    \"\"\"Wrapper for the Locus to Gene classifier.\"\"\"\n\n    features_list: list[str]\n    estimator: Any = None\n    pipeline: Pipeline = Pipeline(stages=[])\n    model: PipelineModel | None = None\n\n    def __post_init__(self: LocusToGeneModel) -> None:\n        \"\"\"Post init that adds the model to the ML pipeline.\"\"\"\n        label_indexer = StringIndexer(\n            inputCol=\"goldStandardSet\", outputCol=\"label\", handleInvalid=\"keep\"\n        )\n        vector_assembler = LocusToGeneModel.features_vector_assembler(\n            self.features_list\n        )\n\n        self.pipeline = Pipeline(\n            stages=[\n                label_indexer,\n                vector_assembler,\n            ]\n        )\n\n    def save(self: LocusToGeneModel, path: str) -> None:\n        \"\"\"Saves fitted pipeline model to disk.\n\n        Args:\n            path (str): Path to save the model to\n\n        Raises:\n            ValueError: If the model has not been fitted yet\n        \"\"\"\n        if self.model is None:\n            raise ValueError(\"Model has not been fitted yet.\")\n        self.model.write().overwrite().save(path)\n\n    @property\n    def classifier(self: LocusToGeneModel) -> Any:\n        \"\"\"Return the model.\n\n        Returns:\n            Any: An estimator object from Spark ML\n        \"\"\"\n        return self.estimator\n\n    @staticmethod\n    def features_vector_assembler(features_cols: list[str]) -> VectorAssembler:\n        \"\"\"Spark transformer to assemble the feature columns into a vector.\n\n        Args:\n            features_cols (list[str]): List of feature columns to assemble\n\n        Returns:\n            VectorAssembler: Spark transformer to assemble the feature columns into a vector\n\n        Examples:\n            >>> from pyspark.ml.feature import VectorAssembler\n            >>> df = spark.createDataFrame([(5.2, 3.5)], schema=\"feature_1 FLOAT, feature_2 FLOAT\")\n            >>> assembler = LocusToGeneModel.features_vector_assembler([\"feature_1\", \"feature_2\"])\n            >>> assembler.transform(df).show()\n            +---------+---------+--------------------+\n            |feature_1|feature_2|            features|\n            +---------+---------+--------------------+\n            |      5.2|      3.5|[5.19999980926513...|\n            +---------+---------+--------------------+\n            <BLANKLINE>\n        \"\"\"\n        return (\n            VectorAssembler(handleInvalid=\"error\")\n            .setInputCols(features_cols)\n            .setOutputCol(\"features\")\n        )\n\n    @staticmethod\n    def log_to_wandb(\n        results: DataFrame,\n        binary_evaluator: BinaryClassificationEvaluator,\n        multi_evaluator: MulticlassClassificationEvaluator,\n        wandb_run: Run,\n    ) -> None:\n        \"\"\"Perform evaluation of the model by applying it to a test set and tracking the results with W&B.\n\n        Args:\n            results (DataFrame): Dataframe containing the predictions\n            binary_evaluator (BinaryClassificationEvaluator): Binary evaluator\n            multi_evaluator (MulticlassClassificationEvaluator): Multiclass evaluator\n            wandb_run (Run): W&B run to log the results to\n        \"\"\"\n        binary_wandb_evaluator = WandbEvaluator(\n            spark_ml_evaluator=binary_evaluator, wandb_run=wandb_run\n        )\n        binary_wandb_evaluator.evaluate(results)\n        multi_wandb_evaluator = WandbEvaluator(\n            
spark_ml_evaluator=multi_evaluator, wandb_run=wandb_run\n        )\n        multi_wandb_evaluator.evaluate(results)\n\n    @classmethod\n    def load_from_disk(\n        cls: Type[LocusToGeneModel], path: str, features_list: list[str]\n    ) -> LocusToGeneModel:\n        \"\"\"Load a fitted pipeline model from disk.\n\n        Args:\n            path (str): Path to the model\n            features_list (list[str]): List of features used for the model\n\n        Returns:\n            LocusToGeneModel: L2G model loaded from disk\n        \"\"\"\n        return cls(model=PipelineModel.load(path), features_list=features_list)\n\n    @classifier.setter  # type: ignore\n    def classifier(self: LocusToGeneModel, new_estimator: Any) -> None:\n        \"\"\"Set the model.\n\n        Args:\n            new_estimator (Any): An estimator object from Spark ML\n        \"\"\"\n        self.estimator = new_estimator\n\n    def get_param_grid(self: LocusToGeneModel) -> list:\n        \"\"\"Return the parameter grid for the model.\n\n        Returns:\n            list: List of parameter maps to use for cross validation\n        \"\"\"\n        return (\n            ParamGridBuilder()\n            .addGrid(self.estimator.max_depth, [3, 5, 7])\n            .addGrid(self.estimator.learning_rate, [0.01, 0.1, 1.0])\n            .build()\n        )\n\n    def add_pipeline_stage(\n        self: LocusToGeneModel, transformer: Transformer\n    ) -> LocusToGeneModel:\n        \"\"\"Adds a stage to the L2G pipeline.\n\n        Args:\n            transformer (Transformer): Spark transformer to add to the pipeline\n\n        Returns:\n            LocusToGeneModel: L2G model with the new transformer\n\n        Examples:\n            >>> from pyspark.ml.regression import LinearRegression\n            >>> estimator = LinearRegression()\n            >>> test_model = LocusToGeneModel(features_list=[\"a\", \"b\"])\n            >>> print(len(test_model.pipeline.getStages()))\n            2\n            >>> print(len(test_model.add_pipeline_stage(estimator).pipeline.getStages()))\n            3\n        \"\"\"\n        pipeline_stages = self.pipeline.getStages()\n        new_stages = pipeline_stages + [transformer]\n        self.pipeline = Pipeline(stages=new_stages)\n        return self\n\n    def evaluate(\n        self: LocusToGeneModel,\n        results: DataFrame,\n        hyperparameters: dict,\n        wandb_run_name: str | None,\n    ) -> None:\n        \"\"\"Perform evaluation of the model by applying it to a test set and tracking the results with W&B.\n\n        Args:\n            results (DataFrame): Dataframe containing the predictions\n            hyperparameters (dict): Hyperparameters used for the model\n            wandb_run_name (str | None): Descriptive name for the run to be tracked with W&B\n        \"\"\"\n        binary_evaluator = BinaryClassificationEvaluator(\n            rawPredictionCol=\"rawPrediction\", labelCol=\"label\"\n        )\n        multi_evaluator = MulticlassClassificationEvaluator(\n            labelCol=\"label\", predictionCol=\"prediction\"\n        )\n\n        print(\"Evaluating model...\")\n        print(\n            \"... Area under ROC curve:\",\n            binary_evaluator.evaluate(\n                results, {binary_evaluator.metricName: \"areaUnderROC\"}\n            ),\n        )\n        print(\n            \"... 
Area under Precision-Recall curve:\",\n            binary_evaluator.evaluate(\n                results, {binary_evaluator.metricName: \"areaUnderPR\"}\n            ),\n        )\n        print(\n            \"... Accuracy:\",\n            multi_evaluator.evaluate(results, {multi_evaluator.metricName: \"accuracy\"}),\n        )\n        print(\n            \"... F1 score:\",\n            multi_evaluator.evaluate(results, {multi_evaluator.metricName: \"f1\"}),\n        )\n\n        if wandb_run_name:\n            print(\"Logging to W&B...\")\n            run = wandb.init(\n                project=\"otg_l2g\", config=hyperparameters, name=wandb_run_name\n            )\n            if isinstance(run, Run):\n                LocusToGeneModel.log_to_wandb(\n                    results, binary_evaluator, multi_evaluator, run\n                )\n                run.finish()\n\n    def plot_importance(self: LocusToGeneModel) -> None:\n        \"\"\"Plot the feature importance of the model.\"\"\"\n        # xgb_plot_importance(self)  # FIXME: What is the attribute that stores the model?\n\n    def fit(\n        self: LocusToGeneModel,\n        feature_matrix: L2GFeatureMatrix,\n    ) -> LocusToGeneModel:\n        \"\"\"Fit the pipeline to the feature matrix dataframe.\n\n        Args:\n            feature_matrix (L2GFeatureMatrix): Feature matrix dataframe to fit the model to\n\n        Returns:\n            LocusToGeneModel: Fitted model\n        \"\"\"\n        self.model = self.pipeline.fit(feature_matrix.df)\n        return self\n\n    def predict(\n        self: LocusToGeneModel,\n        feature_matrix: L2GFeatureMatrix,\n    ) -> DataFrame:\n        \"\"\"Apply the model to a given feature matrix dataframe. The feature matrix needs to be preprocessed first.\n\n        Args:\n            feature_matrix (L2GFeatureMatrix): Feature matrix dataframe to apply the model to\n\n        Returns:\n            DataFrame: Dataframe with predictions\n\n        Raises:\n            ValueError: If the model has not been fitted yet\n        \"\"\"\n        if not self.model:\n            raise ValueError(\"Model not fitted yet. `fit()` has to be called first.\")\n        return self.model.transform(feature_matrix.df)\n
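An end-to-end sketch of the wrapper: the estimator choice (SparkXGBClassifier), the feature names, the feature matrix and the output path are all assumptions for illustration; only the LocusToGeneModel calls come from the reference above.

from xgboost.spark import SparkXGBClassifier  # hypothetical estimator choice
from otg.method.l2g.model import LocusToGeneModel

model = LocusToGeneModel(features_list=["distanceTssMinimum", "distanceTssMean"])  # example features
model.classifier = SparkXGBClassifier(features_col="features", label_col="label")
model = model.add_pipeline_stage(model.classifier)   # pipeline: indexer + assembler + classifier
model = model.fit(feature_matrix)                    # hypothetical L2GFeatureMatrix
predictions = model.predict(feature_matrix)
model.save("gs://my-bucket/models/l2g")              # hypothetical output path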
"},{"location":"python_api/method/l2g/model/#otg.method.l2g.model.LocusToGeneModel.classifier","title":"classifier: Any property writable","text":"

Return the model.

Returns:

Name Type Description Any Any

An estimator object from Spark ML

"},{"location":"python_api/method/l2g/model/#otg.method.l2g.model.LocusToGeneModel.add_pipeline_stage","title":"add_pipeline_stage(transformer: Transformer) -> LocusToGeneModel","text":"

Adds a stage to the L2G pipeline.

Parameters:

Name Type Description Default transformer Transformer

Spark transformer to add to the pipeline

required

Returns:

Name Type Description LocusToGeneModel LocusToGeneModel

L2G model with the new transformer

Examples:

>>> from pyspark.ml.regression import LinearRegression\n>>> estimator = LinearRegression()\n>>> test_model = LocusToGeneModel(features_list=[\"a\", \"b\"])\n>>> print(len(test_model.pipeline.getStages()))\n2\n>>> print(len(test_model.add_pipeline_stage(estimator).pipeline.getStages()))\n3\n
Source code in src/otg/method/l2g/model.py
def add_pipeline_stage(\n    self: LocusToGeneModel, transformer: Transformer\n) -> LocusToGeneModel:\n    \"\"\"Adds a stage to the L2G pipeline.\n\n    Args:\n        transformer (Transformer): Spark transformer to add to the pipeline\n\n    Returns:\n        LocusToGeneModel: L2G model with the new transformer\n\n    Examples:\n        >>> from pyspark.ml.regression import LinearRegression\n        >>> estimator = LinearRegression()\n        >>> test_model = LocusToGeneModel(features_list=[\"a\", \"b\"])\n        >>> print(len(test_model.pipeline.getStages()))\n        2\n        >>> print(len(test_model.add_pipeline_stage(estimator).pipeline.getStages()))\n        3\n    \"\"\"\n    pipeline_stages = self.pipeline.getStages()\n    new_stages = pipeline_stages + [transformer]\n    self.pipeline = Pipeline(stages=new_stages)\n    return self\n
"},{"location":"python_api/method/l2g/model/#otg.method.l2g.model.LocusToGeneModel.evaluate","title":"evaluate(results: DataFrame, hyperparameters: dict, wandb_run_name: str | None) -> None","text":"

Perform evaluation of the model by applying it to a test set and tracking the results with W&B.

Parameters:

Name Type Description Default results DataFrame

Dataframe containing the predictions

required hyperparameters dict

Hyperparameters used for the model

required wandb_run_name str | None

Descriptive name for the run to be tracked with W&B

required Source code in src/otg/method/l2g/model.py
def evaluate(\n    self: LocusToGeneModel,\n    results: DataFrame,\n    hyperparameters: dict,\n    wandb_run_name: str | None,\n) -> None:\n    \"\"\"Perform evaluation of the model by applying it to a test set and tracking the results with W&B.\n\n    Args:\n        results (DataFrame): Dataframe containing the predictions\n        hyperparameters (dict): Hyperparameters used for the model\n        wandb_run_name (str | None): Descriptive name for the run to be tracked with W&B\n    \"\"\"\n    binary_evaluator = BinaryClassificationEvaluator(\n        rawPredictionCol=\"rawPrediction\", labelCol=\"label\"\n    )\n    multi_evaluator = MulticlassClassificationEvaluator(\n        labelCol=\"label\", predictionCol=\"prediction\"\n    )\n\n    print(\"Evaluating model...\")\n    print(\n        \"... Area under ROC curve:\",\n        binary_evaluator.evaluate(\n            results, {binary_evaluator.metricName: \"areaUnderROC\"}\n        ),\n    )\n    print(\n        \"... Area under Precision-Recall curve:\",\n        binary_evaluator.evaluate(\n            results, {binary_evaluator.metricName: \"areaUnderPR\"}\n        ),\n    )\n    print(\n        \"... Accuracy:\",\n        multi_evaluator.evaluate(results, {multi_evaluator.metricName: \"accuracy\"}),\n    )\n    print(\n        \"... F1 score:\",\n        multi_evaluator.evaluate(results, {multi_evaluator.metricName: \"f1\"}),\n    )\n\n    if wandb_run_name:\n        print(\"Logging to W&B...\")\n        run = wandb.init(\n            project=\"otg_l2g\", config=hyperparameters, name=wandb_run_name\n        )\n        if isinstance(run, Run):\n            LocusToGeneModel.log_to_wandb(\n                results, binary_evaluator, multi_evaluator, run\n            )\n            run.finish()\n
"},{"location":"python_api/method/l2g/model/#otg.method.l2g.model.LocusToGeneModel.features_vector_assembler","title":"features_vector_assembler(features_cols: list[str]) -> VectorAssembler staticmethod","text":"

Spark transformer to assemble the feature columns into a vector.

Parameters:

Name Type Description Default features_cols list[str]

List of feature columns to assemble

required

Returns:

Name Type Description VectorAssembler VectorAssembler

Spark transformer to assemble the feature columns into a vector

Examples:

>>> from pyspark.ml.feature import VectorAssembler\n>>> df = spark.createDataFrame([(5.2, 3.5)], schema=\"feature_1 FLOAT, feature_2 FLOAT\")\n>>> assembler = LocusToGeneModel.features_vector_assembler([\"feature_1\", \"feature_2\"])\n>>> assembler.transform(df).show()\n+---------+---------+--------------------+\n|feature_1|feature_2|            features|\n+---------+---------+--------------------+\n|      5.2|      3.5|[5.19999980926513...|\n+---------+---------+--------------------+\n
Source code in src/otg/method/l2g/model.py
@staticmethod\ndef features_vector_assembler(features_cols: list[str]) -> VectorAssembler:\n    \"\"\"Spark transformer to assemble the feature columns into a vector.\n\n    Args:\n        features_cols (list[str]): List of feature columns to assemble\n\n    Returns:\n        VectorAssembler: Spark transformer to assemble the feature columns into a vector\n\n    Examples:\n        >>> from pyspark.ml.feature import VectorAssembler\n        >>> df = spark.createDataFrame([(5.2, 3.5)], schema=\"feature_1 FLOAT, feature_2 FLOAT\")\n        >>> assembler = LocusToGeneModel.features_vector_assembler([\"feature_1\", \"feature_2\"])\n        >>> assembler.transform(df).show()\n        +---------+---------+--------------------+\n        |feature_1|feature_2|            features|\n        +---------+---------+--------------------+\n        |      5.2|      3.5|[5.19999980926513...|\n        +---------+---------+--------------------+\n        <BLANKLINE>\n    \"\"\"\n    return (\n        VectorAssembler(handleInvalid=\"error\")\n        .setInputCols(features_cols)\n        .setOutputCol(\"features\")\n    )\n
"},{"location":"python_api/method/l2g/model/#otg.method.l2g.model.LocusToGeneModel.fit","title":"fit(feature_matrix: L2GFeatureMatrix) -> LocusToGeneModel","text":"

Fit the pipeline to the feature matrix dataframe.

Parameters:

Name Type Description Default feature_matrix L2GFeatureMatrix

Feature matrix dataframe to fit the model to

required

Returns:

Name Type Description LocusToGeneModel LocusToGeneModel

Fitted model

Source code in src/otg/method/l2g/model.py
def fit(\n    self: LocusToGeneModel,\n    feature_matrix: L2GFeatureMatrix,\n) -> LocusToGeneModel:\n    \"\"\"Fit the pipeline to the feature matrix dataframe.\n\n    Args:\n        feature_matrix (L2GFeatureMatrix): Feature matrix dataframe to fit the model to\n\n    Returns:\n        LocusToGeneModel: Fitted model\n    \"\"\"\n    self.model = self.pipeline.fit(feature_matrix.df)\n    return self\n
"},{"location":"python_api/method/l2g/model/#otg.method.l2g.model.LocusToGeneModel.get_param_grid","title":"get_param_grid() -> list","text":"

Return the parameter grid for the model.

Returns:

Name Type Description list list

List of parameter maps to use for cross validation

Source code in src/otg/method/l2g/model.py
def get_param_grid(self: LocusToGeneModel) -> list:\n    \"\"\"Return the parameter grid for the model.\n\n    Returns:\n        list: List of parameter maps to use for cross validation\n    \"\"\"\n    return (\n        ParamGridBuilder()\n        .addGrid(self.estimator.max_depth, [3, 5, 7])\n        .addGrid(self.estimator.learning_rate, [0.01, 0.1, 1.0])\n        .build()\n    )\n
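The grid above is intended for hyperparameter search; a minimal sketch of pairing it with Spark's CrossValidator follows (the surrounding objects are assumptions for illustration).

from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

cv = CrossValidator(
    estimator=model.pipeline,                  # hypothetical LocusToGeneModel with a classifier stage added
    estimatorParamMaps=model.get_param_grid(),
    evaluator=BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="label"),
    numFolds=3,
)
cv_model = cv.fit(feature_matrix.df)           # hypothetical L2GFeatureMatrix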
"},{"location":"python_api/method/l2g/model/#otg.method.l2g.model.LocusToGeneModel.load_from_disk","title":"load_from_disk(path: str, features_list: list[str]) -> LocusToGeneModel classmethod","text":"

Load a fitted pipeline model from disk.

Parameters:

Name Type Description Default path str

Path to the model

required features_list list[str]

List of features used for the model

required

Returns:

Name Type Description LocusToGeneModel LocusToGeneModel

L2G model loaded from disk

Source code in src/otg/method/l2g/model.py
@classmethod\ndef load_from_disk(\n    cls: Type[LocusToGeneModel], path: str, features_list: list[str]\n) -> LocusToGeneModel:\n    \"\"\"Load a fitted pipeline model from disk.\n\n    Args:\n        path (str): Path to the model\n        features_list (list[str]): List of features used for the model\n\n    Returns:\n        LocusToGeneModel: L2G model loaded from disk\n    \"\"\"\n    return cls(model=PipelineModel.load(path), features_list=features_list)\n
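A brief loading sketch; the path and feature list are assumptions and must match whatever was used at training time.

from otg.method.l2g.model import LocusToGeneModel

l2g = LocusToGeneModel.load_from_disk(
    "gs://my-bucket/models/l2g",                              # hypothetical model path
    features_list=["distanceTssMinimum", "distanceTssMean"],  # must match the training feature set
)
predictions = l2g.predict(feature_matrix)  # hypothetical L2GFeatureMatrix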
"},{"location":"python_api/method/l2g/model/#otg.method.l2g.model.LocusToGeneModel.log_to_wandb","title":"log_to_wandb(results: DataFrame, binary_evaluator: BinaryClassificationEvaluator, multi_evaluator: MulticlassClassificationEvaluator, wandb_run: Run) -> None staticmethod","text":"

Perform evaluation of the model by applying it to a test set and tracking the results with W&B.

Parameters:

- results (DataFrame): Dataframe containing the predictions. Required.
- binary_evaluator (BinaryClassificationEvaluator): Binary evaluator. Required.
- multi_evaluator (MulticlassClassificationEvaluator): Multiclass evaluator. Required.
- wandb_run (Run): W&B run to log the results to. Required.

Source code in src/otg/method/l2g/model.py
@staticmethod\ndef log_to_wandb(\n    results: DataFrame,\n    binary_evaluator: BinaryClassificationEvaluator,\n    multi_evaluator: MulticlassClassificationEvaluator,\n    wandb_run: Run,\n) -> None:\n    \"\"\"Perform evaluation of the model by applying it to a test set and tracking the results with W&B.\n\n    Args:\n        results (DataFrame): Dataframe containing the predictions\n        binary_evaluator (BinaryClassificationEvaluator): Binary evaluator\n        multi_evaluator (MulticlassClassificationEvaluator): Multiclass evaluator\n        wandb_run (Run): W&B run to log the results to\n    \"\"\"\n    binary_wandb_evaluator = WandbEvaluator(\n        spark_ml_evaluator=binary_evaluator, wandb_run=wandb_run\n    )\n    binary_wandb_evaluator.evaluate(results)\n    multi_wandb_evaluator = WandbEvaluator(\n        spark_ml_evaluator=multi_evaluator, wandb_run=wandb_run\n    )\n    multi_wandb_evaluator.evaluate(results)\n
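A hedged sketch of how the evaluators and the W&B run could be wired together; the project and run names are illustrative, and results is assumed to be a predictions DataFrame produced by predict().

    import wandb
    from pyspark.ml.evaluation import (
        BinaryClassificationEvaluator,
        MulticlassClassificationEvaluator,
    )

    run = wandb.init(project="otg-l2g", name="example-run")  # illustrative names
    LocusToGeneModel.log_to_wandb(
        results=results,
        binary_evaluator=BinaryClassificationEvaluator(rawPredictionCol="rawPrediction"),
        multi_evaluator=MulticlassClassificationEvaluator(labelCol="label"),
        wandb_run=run,
    )
    run.finish()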
"},{"location":"python_api/method/l2g/model/#otg.method.l2g.model.LocusToGeneModel.plot_importance","title":"plot_importance() -> None","text":"

Plot the feature importance of the model.

Source code in src/otg/method/l2g/model.py
def plot_importance(self: LocusToGeneModel) -> None:\n    \"\"\"Plot the feature importance of the model.\"\"\"\n
"},{"location":"python_api/method/l2g/model/#otg.method.l2g.model.LocusToGeneModel.predict","title":"predict(feature_matrix: L2GFeatureMatrix) -> DataFrame","text":"

Apply the model to a given feature matrix dataframe. The feature matrix needs to be preprocessed first.

Parameters:

- feature_matrix (L2GFeatureMatrix): Feature matrix dataframe to apply the model to. Required.

Returns:

- DataFrame: Dataframe with predictions.

Raises:

- ValueError: If the model has not been fitted yet.

Source code in src/otg/method/l2g/model.py
def predict(\n    self: LocusToGeneModel,\n    feature_matrix: L2GFeatureMatrix,\n) -> DataFrame:\n    \"\"\"Apply the model to a given feature matrix dataframe. The feature matrix needs to be preprocessed first.\n\n    Args:\n        feature_matrix (L2GFeatureMatrix): Feature matrix dataframe to apply the model to\n\n    Returns:\n        DataFrame: Dataframe with predictions\n\n    Raises:\n        ValueError: If the model has not been fitted yet\n    \"\"\"\n    if not self.model:\n        raise ValueError(\"Model not fitted yet. `fit()` has to be called first.\")\n    return self.model.transform(feature_matrix.df)\n
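A short usage sketch; fitted_model and feature_matrix are placeholders for an already fitted (or loaded) model and a preprocessed L2GFeatureMatrix, and the selected column names are assumptions.

    predictions = fitted_model.predict(feature_matrix)
    predictions.select("studyLocusId", "geneId", "prediction").show(5)  # column names assumed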
"},{"location":"python_api/method/l2g/model/#otg.method.l2g.model.LocusToGeneModel.save","title":"save(path: str) -> None","text":"

Saves the fitted pipeline model to disk.

Parameters:

- path (str): Path to save the model to. Required.

Raises:

- ValueError: If the model has not been fitted yet.

Source code in src/otg/method/l2g/model.py
def save(self: LocusToGeneModel, path: str) -> None:\n    \"\"\"Saves fitted pipeline model to disk.\n\n    Args:\n        path (str): Path to save the model to\n\n    Raises:\n        ValueError: If the model has not been fitted yet\n    \"\"\"\n    if self.model is None:\n        raise ValueError(\"Model has not been fitted yet.\")\n    self.model.write().overwrite().save(path)\n
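For example (placeholder path); note that the shown implementation overwrites any existing model at that location.

    fitted_model.save("gs://bucket/l2g/model")  # placeholder output path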
"},{"location":"python_api/method/l2g/trainer/","title":"L2G Trainer","text":""},{"location":"python_api/method/l2g/trainer/#otg.method.l2g.trainer.LocusToGeneTrainer","title":"otg.method.l2g.trainer.LocusToGeneTrainer dataclass","text":"

Modelling of the most likely causal gene associated with a given locus.

Source code in src/otg/method/l2g/trainer.py
@dataclass\nclass LocusToGeneTrainer:\n    \"\"\"Modelling of what is the most likely causal gene associated with a given locus.\"\"\"\n\n    _model: LocusToGeneModel\n    train_set: L2GFeatureMatrix\n\n    @classmethod\n    def train(\n        cls: type[LocusToGeneTrainer],\n        data: L2GFeatureMatrix,\n        l2g_model: LocusToGeneModel,\n        features_list: list[str],\n        evaluate: bool,\n        wandb_run_name: str | None = None,\n        model_path: str | None = None,\n        **hyperparams: dict,\n    ) -> LocusToGeneModel:\n        \"\"\"Train the Locus to Gene model.\n\n        Args:\n            data (L2GFeatureMatrix): Feature matrix containing the data\n            l2g_model (LocusToGeneModel): Model to fit to the data on\n            features_list (list[str]): List of features to use for the model\n            evaluate (bool): Whether to evaluate the model on a test set\n            wandb_run_name (str | None): Descriptive name for the run to be tracked with W&B\n            model_path (str | None): Path to save the model to\n            **hyperparams (dict): Hyperparameters to use for the model\n\n        Returns:\n            LocusToGeneModel: Trained model\n        \"\"\"\n        train, test = data.select_features(features_list).train_test_split(fraction=0.8)\n\n        model = l2g_model.add_pipeline_stage(l2g_model.estimator).fit(train)\n\n        if evaluate:\n            l2g_model.evaluate(\n                results=model.predict(test),\n                hyperparameters=hyperparams,\n                wandb_run_name=wandb_run_name,\n            )\n        if model_path:\n            l2g_model.save(model_path)\n        return l2g_model\n\n    @classmethod\n    def cross_validate(\n        cls: type[LocusToGeneTrainer],\n        l2g_model: LocusToGeneModel,\n        data: L2GFeatureMatrix,\n        num_folds: int,\n        param_grid: Optional[list] = None,\n    ) -> LocusToGeneModel:\n        \"\"\"Perform k-fold cross validation on the model.\n\n        By providing a model with a parameter grid, this method will perform k-fold cross validation on the model for each\n        combination of parameters and return the best model.\n\n        Args:\n            l2g_model (LocusToGeneModel): Model to fit to the data on\n            data (L2GFeatureMatrix): Data to perform cross validation on\n            num_folds (int): Number of folds to use for cross validation\n            param_grid (Optional[list]): List of parameter maps to use for cross validation\n\n        Returns:\n            LocusToGeneModel: Trained model fitted with the best hyperparameters\n\n        Raises:\n            ValueError: Parameter grid is empty. Cannot perform cross-validation.\n            ValueError: Unable to retrieve the best model.\n        \"\"\"\n        evaluator = MulticlassClassificationEvaluator()\n        params_grid = param_grid or l2g_model.get_param_grid()\n        if not param_grid:\n            raise ValueError(\n                \"Parameter grid is empty. 
Cannot perform cross-validation.\"\n            )\n        cv = CrossValidator(\n            numFolds=num_folds,\n            estimator=l2g_model.estimator,\n            estimatorParamMaps=params_grid,\n            evaluator=evaluator,\n            parallelism=2,\n            collectSubModels=False,\n            seed=42,\n        )\n\n        l2g_model.add_pipeline_stage(cv)\n\n        # Integrate the best model from the last stage of the pipeline\n        if (full_pipeline_model := l2g_model.fit(data).model) is None or not hasattr(\n            full_pipeline_model, \"stages\"\n        ):\n            raise ValueError(\"Unable to retrieve the best model.\")\n        l2g_model.model = full_pipeline_model.stages[-1].bestModel\n\n        return l2g_model\n
"},{"location":"python_api/method/l2g/trainer/#otg.method.l2g.trainer.LocusToGeneTrainer.cross_validate","title":"cross_validate(l2g_model: LocusToGeneModel, data: L2GFeatureMatrix, num_folds: int, param_grid: Optional[list] = None) -> LocusToGeneModel classmethod","text":"

Perform k-fold cross validation on the model.

Given a model and a parameter grid, this method performs k-fold cross validation for each combination of parameters and returns the best model.

Parameters:

- l2g_model (LocusToGeneModel): Model to fit to the data. Required.
- data (L2GFeatureMatrix): Data to perform cross validation on. Required.
- num_folds (int): Number of folds to use for cross validation. Required.
- param_grid (Optional[list]): List of parameter maps to use for cross validation. Defaults to None.

Returns:

- LocusToGeneModel: Trained model fitted with the best hyperparameters.

Raises:

- ValueError: Parameter grid is empty. Cannot perform cross-validation.
- ValueError: Unable to retrieve the best model.

Source code in src/otg/method/l2g/trainer.py
@classmethod\ndef cross_validate(\n    cls: type[LocusToGeneTrainer],\n    l2g_model: LocusToGeneModel,\n    data: L2GFeatureMatrix,\n    num_folds: int,\n    param_grid: Optional[list] = None,\n) -> LocusToGeneModel:\n    \"\"\"Perform k-fold cross validation on the model.\n\n    By providing a model with a parameter grid, this method will perform k-fold cross validation on the model for each\n    combination of parameters and return the best model.\n\n    Args:\n        l2g_model (LocusToGeneModel): Model to fit to the data on\n        data (L2GFeatureMatrix): Data to perform cross validation on\n        num_folds (int): Number of folds to use for cross validation\n        param_grid (Optional[list]): List of parameter maps to use for cross validation\n\n    Returns:\n        LocusToGeneModel: Trained model fitted with the best hyperparameters\n\n    Raises:\n        ValueError: Parameter grid is empty. Cannot perform cross-validation.\n        ValueError: Unable to retrieve the best model.\n    \"\"\"\n    evaluator = MulticlassClassificationEvaluator()\n    params_grid = param_grid or l2g_model.get_param_grid()\n    if not param_grid:\n        raise ValueError(\n            \"Parameter grid is empty. Cannot perform cross-validation.\"\n        )\n    cv = CrossValidator(\n        numFolds=num_folds,\n        estimator=l2g_model.estimator,\n        estimatorParamMaps=params_grid,\n        evaluator=evaluator,\n        parallelism=2,\n        collectSubModels=False,\n        seed=42,\n    )\n\n    l2g_model.add_pipeline_stage(cv)\n\n    # Integrate the best model from the last stage of the pipeline\n    if (full_pipeline_model := l2g_model.fit(data).model) is None or not hasattr(\n        full_pipeline_model, \"stages\"\n    ):\n        raise ValueError(\"Unable to retrieve the best model.\")\n    l2g_model.model = full_pipeline_model.stages[-1].bestModel\n\n    return l2g_model\n
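A usage sketch with placeholder objects; the parameter grid is passed explicitly here because the validation in the code above checks the param_grid argument directly.

    # `l2g_model` is a LocusToGeneModel with an estimator; `data` is an L2GFeatureMatrix.
    best_model = LocusToGeneTrainer.cross_validate(
        l2g_model=l2g_model,
        data=data,
        num_folds=5,
        param_grid=l2g_model.get_param_grid(),
    )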
"},{"location":"python_api/method/l2g/trainer/#otg.method.l2g.trainer.LocusToGeneTrainer.train","title":"train(data: L2GFeatureMatrix, l2g_model: LocusToGeneModel, features_list: list[str], evaluate: bool, wandb_run_name: str | None = None, model_path: str | None = None, **hyperparams: dict) -> LocusToGeneModel classmethod","text":"

Train the Locus to Gene model.

Parameters:

- data (L2GFeatureMatrix): Feature matrix containing the data. Required.
- l2g_model (LocusToGeneModel): Model to fit to the data. Required.
- features_list (list[str]): List of features to use for the model. Required.
- evaluate (bool): Whether to evaluate the model on a test set. Required.
- wandb_run_name (str | None): Descriptive name for the run to be tracked with W&B. Defaults to None.
- model_path (str | None): Path to save the model to. Defaults to None.
- **hyperparams (dict): Hyperparameters to use for the model. Defaults to {}.

Returns:

- LocusToGeneModel: Trained model.

Source code in src/otg/method/l2g/trainer.py
@classmethod\ndef train(\n    cls: type[LocusToGeneTrainer],\n    data: L2GFeatureMatrix,\n    l2g_model: LocusToGeneModel,\n    features_list: list[str],\n    evaluate: bool,\n    wandb_run_name: str | None = None,\n    model_path: str | None = None,\n    **hyperparams: dict,\n) -> LocusToGeneModel:\n    \"\"\"Train the Locus to Gene model.\n\n    Args:\n        data (L2GFeatureMatrix): Feature matrix containing the data\n        l2g_model (LocusToGeneModel): Model to fit to the data on\n        features_list (list[str]): List of features to use for the model\n        evaluate (bool): Whether to evaluate the model on a test set\n        wandb_run_name (str | None): Descriptive name for the run to be tracked with W&B\n        model_path (str | None): Path to save the model to\n        **hyperparams (dict): Hyperparameters to use for the model\n\n    Returns:\n        LocusToGeneModel: Trained model\n    \"\"\"\n    train, test = data.select_features(features_list).train_test_split(fraction=0.8)\n\n    model = l2g_model.add_pipeline_stage(l2g_model.estimator).fit(train)\n\n    if evaluate:\n        l2g_model.evaluate(\n            results=model.predict(test),\n            hyperparameters=hyperparams,\n            wandb_run_name=wandb_run_name,\n        )\n    if model_path:\n        l2g_model.save(model_path)\n    return l2g_model\n
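A usage sketch with placeholder paths and run names; extra keyword arguments are forwarded as hyperparameters for logging.

    # `data` is an L2GFeatureMatrix; `l2g_model` a LocusToGeneModel with an estimator.
    trained = LocusToGeneTrainer.train(
        data=data,
        l2g_model=l2g_model,
        features_list=["distanceTssMean"],
        evaluate=True,
        wandb_run_name="example-run",        # illustrative run name
        model_path="gs://bucket/l2g/model",  # placeholder output path
        max_depth=5,                         # forwarded as a logged hyperparameter
    )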
"},{"location":"python_api/step/_step/","title":"Step","text":"

TBC

"},{"location":"python_api/step/colocalisation/","title":"Colocalisation","text":""},{"location":"python_api/step/colocalisation/#otg.colocalisation.ColocalisationStep","title":"otg.colocalisation.ColocalisationStep dataclass","text":"

Colocalisation step.

This workflow runs colocalization analyses that assess the degree to which independent signals of the association share the same causal variant in a region of the genome, typically limited by linkage disequilibrium (LD).

Attributes:

- session (Session): Session object.
- study_locus_path (str): Input study-locus path.
- study_index_path (str): Input study index path.
- coloc_path (str): Output colocalisation path.
- priorc1 (float): Prior on variant being causal for trait 1.
- priorc2 (float): Prior on variant being causal for trait 2.
- priorc12 (float): Prior on variant being causal for traits 1 and 2.

Source code in src/otg/colocalisation.py
@dataclass\nclass ColocalisationStep:\n    \"\"\"Colocalisation step.\n\n    This workflow runs colocalization analyses that assess the degree to which independent signals of the association share the same causal variant in a region of the genome, typically limited by linkage disequilibrium (LD).\n\n    Attributes:\n        session (Session): Session object.\n        study_locus_path (DictConfig): Input Study-locus path.\n        coloc_path (DictConfig): Output Colocalisation path.\n        priorc1 (float): Prior on variant being causal for trait 1.\n        priorc2 (float): Prior on variant being causal for trait 2.\n        priorc12 (float): Prior on variant being causal for traits 1 and 2.\n    \"\"\"\n\n    session: Session = Session()\n\n    study_locus_path: str = MISSING\n    study_index_path: str = MISSING\n    coloc_path: str = MISSING\n    priorc1: float = 1e-4\n    priorc2: float = 1e-4\n    priorc12: float = 1e-5\n\n    def __post_init__(self: ColocalisationStep) -> None:\n        \"\"\"Run step.\"\"\"\n        # Study-locus information\n        sl = StudyLocus.from_parquet(self.session, self.study_locus_path)\n        si = StudyIndex.from_parquet(self.session, self.study_index_path)\n\n        # Study-locus overlaps for 95% credible sets\n        sl_overlaps = sl.credible_set(CredibleInterval.IS95).overlaps(si)\n\n        coloc_results = Coloc.colocalise(\n            sl_overlaps, self.priorc1, self.priorc2, self.priorc12\n        )\n        ecaviar_results = ECaviar.colocalise(sl_overlaps)\n\n        coloc_results.df.unionByName(ecaviar_results.df, allowMissingColumns=True)\n\n        coloc_results.df.write.mode(self.session.write_mode).parquet(self.coloc_path)\n
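Steps are dataclasses whose __post_init__ runs the workload, so instantiating one triggers the run. A sketch with placeholder paths; the Session import path is an assumption.

    from otg.colocalisation import ColocalisationStep
    from otg.common.session import Session  # import path assumed

    ColocalisationStep(
        session=Session(),
        study_locus_path="gs://bucket/study_locus",  # placeholder paths
        study_index_path="gs://bucket/study_index",
        coloc_path="gs://bucket/colocalisation",
    )  # __post_init__ runs the colocalisation analyses and writes the output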
"},{"location":"python_api/step/finngen/","title":"FinnGen","text":""},{"location":"python_api/step/finngen/#otg.finngen.FinnGenStep","title":"otg.finngen.FinnGenStep dataclass","text":"

FinnGen ingestion step.

Attributes:

- session (Session): Session object.
- finngen_phenotype_table_url (str): FinnGen API URL for fetching the list of studies.
- finngen_release_prefix (str): Release prefix pattern.
- finngen_sumstat_url_prefix (str): URL prefix for the summary statistics location.
- finngen_sumstat_url_suffix (str): URL suffix for the summary statistics location.
- finngen_study_index_out (str): Output path for the FinnGen study index dataset.
- finngen_summary_stats_out (str): Output path for the FinnGen summary statistics.

Source code in src/otg/finngen.py
@dataclass\nclass FinnGenStep:\n    \"\"\"FinnGen ingestion step.\n\n    Attributes:\n        session (Session): Session object.\n        finngen_phenotype_table_url (str): FinnGen API for fetching the list of studies.\n        finngen_release_prefix (str): Release prefix pattern.\n        finngen_sumstat_url_prefix (str): URL prefix for summary statistics location.\n        finngen_sumstat_url_suffix (str): URL prefix suffix for summary statistics location.\n        finngen_study_index_out (str): Output path for the FinnGen study index dataset.\n        finngen_summary_stats_out (str): Output path for the FinnGen summary statistics.\n    \"\"\"\n\n    session: Session = Session()\n\n    finngen_phenotype_table_url: str = MISSING\n    finngen_release_prefix: str = MISSING\n    finngen_sumstat_url_prefix: str = MISSING\n    finngen_sumstat_url_suffix: str = MISSING\n    finngen_study_index_out: str = MISSING\n    finngen_summary_stats_out: str = MISSING\n\n    def __post_init__(self: FinnGenStep) -> None:\n        \"\"\"Run step.\"\"\"\n        # Read the JSON data from the URL.\n        json_data = urlopen(self.finngen_phenotype_table_url).read().decode(\"utf-8\")\n        rdd = self.session.spark.sparkContext.parallelize([json_data])\n        df = self.session.spark.read.json(rdd)\n\n        # Parse the study index data.\n        finngen_studies = FinnGenStudyIndex.from_source(\n            df,\n            self.finngen_release_prefix,\n            self.finngen_sumstat_url_prefix,\n            self.finngen_sumstat_url_suffix,\n        )\n\n        # Write the study index output.\n        finngen_studies.df.write.mode(self.session.write_mode).parquet(\n            self.finngen_study_index_out\n        )\n\n        # Prepare list of files for ingestion.\n        input_filenames = [\n            row.summarystatsLocation for row in finngen_studies.collect()\n        ]\n        summary_stats_df = self.session.spark.read.option(\"delimiter\", \"\\t\").csv(\n            input_filenames, header=True\n        )\n\n        # Specify data processing instructions.\n        summary_stats_df = FinnGenSummaryStats.from_finngen_harmonized_summary_stats(\n            summary_stats_df\n        ).df\n\n        # Sort and partition for output.\n        summary_stats_df.sortWithinPartitions(\"position\").write.partitionBy(\n            \"studyId\", \"chromosome\"\n        ).mode(self.session.write_mode).parquet(self.finngen_summary_stats_out)\n
"},{"location":"python_api/step/gene_index/","title":"Gene Index","text":""},{"location":"python_api/step/gene_index/#otg.gene_index.GeneIndexStep","title":"otg.gene_index.GeneIndexStep dataclass","text":"

Gene index step.

This step generates a gene index dataset from an Open Targets Platform target dataset.

Attributes:

- session (Session): Session object.
- target_path (str): Open Targets Platform target dataset path.
- gene_index_path (str): Output gene index path.

Source code in src/otg/gene_index.py
@dataclass\nclass GeneIndexStep:\n    \"\"\"Gene index step.\n\n    This step generates a gene index dataset from an Open Targets Platform target dataset.\n\n    Attributes:\n        session (Session): Session object.\n        target_path (str): Open targets Platform target dataset path.\n        gene_index_path (str): Output gene index path.\n    \"\"\"\n\n    session: Session = Session()\n\n    target_path: str = MISSING\n    gene_index_path: str = MISSING\n\n    def __post_init__(self: GeneIndexStep) -> None:\n        \"\"\"Run step.\"\"\"\n        # Extract\n        platform_target = self.session.spark.read.parquet(self.target_path)\n        # Transform\n        gene_index = OpenTargetsTarget.as_gene_index(platform_target)\n        # Load\n        gene_index.df.write.mode(self.session.write_mode).parquet(self.gene_index_path)\n
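A sketch of running the step directly, with placeholder paths (the step executes in __post_init__):

    from otg.gene_index import GeneIndexStep

    GeneIndexStep(
        target_path="gs://bucket/platform/targets",  # placeholder paths
        gene_index_path="gs://bucket/gene_index",
    )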
"},{"location":"python_api/step/gwas_catalog/","title":"GWAS Catalog","text":""},{"location":"python_api/step/gwas_catalog/#otg.gwas_catalog.GWASCatalogStep","title":"otg.gwas_catalog.GWASCatalogStep dataclass","text":"

GWAS Catalog ingestion step to extract GWASCatalog Study and StudyLocus tables.

Attributes:

- session (Session): Session object.
- catalog_studies_file (str): Raw GWAS Catalog studies file.
- catalog_ancestry_file (str): Ancestry annotations file from GWAS Catalog.
- catalog_sumstats_lut (str): GWAS Catalog summary statistics lookup table.
- catalog_associations_file (str): Raw GWAS Catalog associations file.
- variant_annotation_path (str): Input variant annotation path.
- ld_index_path (str): Input LD index path.
- min_r2 (float): Minimum r2 to consider for variants within a window.
- catalog_studies_out (str): Output GWAS Catalog studies path.
- catalog_associations_out (str): Output GWAS Catalog associations path.

Source code in src/otg/gwas_catalog.py
@dataclass\nclass GWASCatalogStep:\n    \"\"\"GWAS Catalog ingestion step to extract GWASCatalog Study and StudyLocus tables.\n\n    Attributes:\n        session (Session): Session object.\n        catalog_studies_file (str): Raw GWAS catalog studies file.\n        catalog_ancestry_file (str): Ancestry annotations file from GWAS Catalog.\n        catalog_sumstats_lut (str): GWAS Catalog summary statistics lookup table.\n        catalog_associations_file (str): Raw GWAS catalog associations file.\n        variant_annotation_path (str): Input variant annotation path.\n        ld_populations (list): List of populations to include.\n        min_r2 (float): Minimum r2 to consider when considering variants within a window.\n        catalog_studies_out (str): Output GWAS catalog studies path.\n        catalog_associations_out (str): Output GWAS catalog associations path.\n    \"\"\"\n\n    session: Session = Session()\n\n    catalog_studies_file: str = MISSING\n    catalog_ancestry_file: str = MISSING\n    catalog_sumstats_lut: str = MISSING\n    catalog_associations_file: str = MISSING\n    variant_annotation_path: str = MISSING\n    ld_index_path: str = MISSING\n    min_r2: float = 0.5\n    catalog_studies_out: str = MISSING\n    catalog_associations_out: str = MISSING\n\n    def __post_init__(self: GWASCatalogStep) -> None:\n        \"\"\"Run step.\"\"\"\n        hl.init(sc=self.session.spark.sparkContext, log=\"/dev/null\")\n        # All inputs:\n        # Variant annotation dataset\n        va = VariantAnnotation.from_parquet(self.session, self.variant_annotation_path)\n        # GWAS Catalog raw study information\n        catalog_studies = self.session.spark.read.csv(\n            self.catalog_studies_file, sep=\"\\t\", header=True\n        )\n        # GWAS Catalog ancestry information\n        ancestry_lut = self.session.spark.read.csv(\n            self.catalog_ancestry_file, sep=\"\\t\", header=True\n        )\n        # GWAS Catalog summary statistics information\n        sumstats_lut = self.session.spark.read.csv(\n            self.catalog_sumstats_lut, sep=\"\\t\", header=False\n        )\n        # GWAS Catalog raw association information\n        catalog_associations = self.session.spark.read.csv(\n            self.catalog_associations_file, sep=\"\\t\", header=True\n        )\n        # LD index dataset\n        ld_index = LDIndex.from_parquet(self.session, self.ld_index_path)\n\n        # Transform:\n        # GWAS Catalog study index and study-locus splitted\n        study_index, study_locus = GWASCatalogStudySplitter.split(\n            GWASCatalogStudyIndex.from_source(\n                catalog_studies, ancestry_lut, sumstats_lut\n            ),\n            GWASCatalogAssociations.from_source(catalog_associations, va),\n        )\n\n        # Annotate LD information and clump associations dataset\n        study_locus_ld = LDAnnotator.ld_annotate(study_locus, study_index, ld_index)\n\n        # Fine-mapping LD-clumped study-locus using PICS\n        finemapped_study_locus = PICS.finemap(study_locus_ld).annotate_credible_sets()\n\n        # Write:\n        study_index.df.write.mode(self.session.write_mode).parquet(\n            self.catalog_studies_out\n        )\n        finemapped_study_locus.df.write.mode(self.session.write_mode).parquet(\n            self.catalog_associations_out\n        )\n
"},{"location":"python_api/step/gwas_catalog_sumstat_preprocess/","title":"GWAS Catalog sumstat preprocess","text":""},{"location":"python_api/step/gwas_catalog_sumstat_preprocess/#otg.gwas_catalog_sumstat_preprocess.GWASCatalogSumstatsPreprocessStep","title":"otg.gwas_catalog_sumstat_preprocess.GWASCatalogSumstatsPreprocessStep dataclass","text":"

Step to preprocess GWAS Catalog harmonised summary stats.

Attributes:

- session (Session): Session object.
- raw_sumstats_path (str): Input raw GWAS Catalog summary statistics path.
- out_sumstats_path (str): Output GWAS Catalog summary statistics path.
- study_id (str): GWAS Catalog study identifier.

Source code in src/otg/gwas_catalog_sumstat_preprocess.py
@dataclass\nclass GWASCatalogSumstatsPreprocessStep:\n    \"\"\"Step to preprocess GWAS Catalog harmonised summary stats.\n\n    Attributes:\n        session (Session): Session object.\n        raw_sumstats_path (str): Input raw GWAS Catalog summary statistics path.\n        out_sumstats_path (str): Output GWAS Catalog summary statistics path.\n        study_id (str): GWAS Catalog study identifier.\n    \"\"\"\n\n    session: Session = Session()\n\n    raw_sumstats_path: str = MISSING\n    out_sumstats_path: str = MISSING\n    study_id: str = MISSING\n\n    def __post_init__(self: GWASCatalogSumstatsPreprocessStep) -> None:\n        \"\"\"Run step.\"\"\"\n        # Extract\n        self.session.logger.info(self.raw_sumstats_path)\n        self.session.logger.info(self.out_sumstats_path)\n        self.session.logger.info(self.study_id)\n\n        # Reading dataset:\n        raw_dataset = self.session.spark.read.csv(\n            self.raw_sumstats_path, header=True, sep=\"\\t\"\n        )\n        self.session.logger.info(\n            f\"Number of single point associations: {raw_dataset.count()}\"\n        )\n\n        # Processing dataset:\n        GWASCatalogSummaryStatistics.from_gwas_harmonized_summary_stats(\n            raw_dataset, self.study_id\n        ).df.write.mode(self.session.write_mode).parquet(self.out_sumstats_path)\n        self.session.logger.info(\"Processing dataset successfully completed.\")\n
"},{"location":"python_api/step/l2g/","title":"Locus-to-gene (L2G)","text":""},{"location":"python_api/step/l2g/#otg.l2g.LocusToGeneStep","title":"otg.l2g.LocusToGeneStep dataclass","text":"

Locus to gene step.

Attributes:

- session (Session): Session object.
- extended_spark_conf (dict[str, str] | None): Extended Spark configuration.
- run_mode (str): One of "train" or "predict".
- wandb_run_name (str | None): Name of the run to be tracked on W&B.
- perform_cross_validation (bool): Whether to perform cross validation.
- model_path (str | None): Path to save the model.
- predictions_path (str | None): Path to save the predictions.
- study_locus_path (str): Path to study locus Parquet files.
- variant_gene_path (str): Path to variant-to-gene Parquet files.
- colocalisation_path (str): Path to colocalisation Parquet files.
- study_index_path (str): Path to study index Parquet files.
- study_locus_overlap_path (str | None): Path to study locus overlap Parquet files.
- gold_standard_curation_path (str | None): Path to gold standard curation JSON files.
- gene_interactions_path (str | None): Path to gene interactions Parquet files.
- features_list (list[str]): List of features to use.
- hyperparameters (dict): Hyperparameters for the model.

Source code in src/otg/l2g.py
@dataclass\nclass LocusToGeneStep:\n    \"\"\"Locus to gene step.\n\n    Attributes:\n        session (Session): Session object.\n        extended_spark_conf (dict[str, str] | None): Extended Spark configuration.\n        run_mode (str): One of \"train\" or \"predict\".\n        wandb_run_name (str | None): Name of the run to be tracked on W&B.\n        perform_cross_validation (bool): Whether to perform cross validation.\n        model_path (str | None): Path to save the model.\n        predictions_path (str | None): Path to save the predictions.\n        study_locus_path (str): Path to study locus Parquet files.\n        variant_gene_path (str): Path to variant to gene Parquet files.\n        colocalisation_path (str): Path to colocalisation Parquet files.\n        study_index_path (str): Path to study index Parquet files.\n        study_locus_overlap_path (str | None): Path to study locus overlap Parquet files.\n        gold_standard_curation_path (str | None): Path to gold standard curation JSON files.\n        gene_interactions_path (str | None): Path to gene interactions Parquet files.\n        features_list (list[str]): List of features to use.\n        hyperparameters (dict): Hyperparameters for the model.\n    \"\"\"\n\n    session: Session = Session()\n    extended_spark_conf: dict[str, str] | None = None\n\n    run_mode: str = MISSING\n    wandb_run_name: str | None = None\n    perform_cross_validation: bool = False\n    model_path: str | None = None\n    predictions_path: str | None = None\n    study_locus_path: str = MISSING\n    variant_gene_path: str = MISSING\n    colocalisation_path: str = MISSING\n    study_index_path: str = MISSING\n    study_locus_overlap_path: str | None = None\n    gold_standard_curation_path: str | None = None\n    gene_interactions_path: str | None = None\n    features_list: list[str] = field(\n        default_factory=lambda: [\n            # average distance of all tagging variants to gene TSS\n            \"distanceTssMean\",\n            # # minimum distance of all tagging variants to gene TSS\n            # \"distanceTssMinimum\",\n            # # max clpp for each (study, locus, gene) aggregating over all eQTLs\n            # \"eqtlColocClppLocalMaximum\",\n            # # max clpp for each (study, locus) aggregating over all eQTLs\n            # \"eqtlColocClppNeighborhoodMaximum\",\n            # # max log-likelihood ratio value for each (study, locus, gene) aggregating over all eQTLs\n            # \"eqtlColocLlrLocalMaximum\",\n            # # max log-likelihood ratio value for each (study, locus) aggregating over all eQTLs\n            # \"eqtlColocLlrNeighborhoodMaximum\",\n            # # max clpp for each (study, locus, gene) aggregating over all pQTLs\n            # \"pqtlColocClppLocalMaximum\",\n            # # max clpp for each (study, locus) aggregating over all pQTLs\n            # \"pqtlColocClppNeighborhoodMaximum\",\n            # # max log-likelihood ratio value for each (study, locus, gene) aggregating over all pQTLs\n            # \"pqtlColocLlrLocalMaximum\",\n            # # max log-likelihood ratio value for each (study, locus) aggregating over all pQTLs\n            # \"pqtlColocLlrNeighborhoodMaximum\",\n            # # max clpp for each (study, locus, gene) aggregating over all sQTLs\n            # \"sqtlColocClppLocalMaximum\",\n            # # max clpp for each (study, locus) aggregating over all sQTLs\n            # \"sqtlColocClppNeighborhoodMaximum\",\n            # # max log-likelihood ratio value for each 
(study, locus, gene) aggregating over all sQTLs\n            # \"sqtlColocLlrLocalMaximum\",\n            # # max log-likelihood ratio value for each (study, locus) aggregating over all sQTLs\n            # \"sqtlColocLlrNeighborhoodMaximum\",\n        ]\n    )\n    hyperparameters: dict = field(\n        default_factory=lambda: {\n            \"max_depth\": 5,\n            \"loss_function\": \"binary:logistic\",\n        }\n    )\n\n    def __post_init__(self: LocusToGeneStep) -> None:\n        \"\"\"Run step.\n\n        Raises:\n            ValueError: if run_mode is not one of \"train\" or \"predict\".\n        \"\"\"\n        if self.run_mode not in [\"train\", \"predict\"]:\n            raise ValueError(\n                f\"run_mode must be one of 'train' or 'predict', got {self.run_mode}\"\n            )\n        # Load common inputs\n        study_locus = StudyLocus.from_parquet(\n            self.session, self.study_locus_path, recursiveFileLookup=True\n        )\n        studies = StudyIndex.from_parquet(self.session, self.study_index_path)\n        v2g = V2G.from_parquet(self.session, self.variant_gene_path)\n        # coloc = Colocalisation.from_parquet(self.session, self.colocalisation_path) # TODO: run step\n\n        if self.run_mode == \"train\":\n            # Process gold standard and L2G features\n            study_locus_overlap = StudyLocusOverlap.from_parquet(\n                self.session, self.study_locus_overlap_path\n            )\n            gs_curation = self.session.spark.read.json(self.gold_standard_curation_path)\n            interactions = self.session.spark.read.parquet(self.gene_interactions_path)\n\n            gold_standards = L2GGoldStandard.from_otg_curation(\n                gold_standard_curation=gs_curation,\n                v2g=v2g,\n                study_locus_overlap=study_locus_overlap,\n                interactions=interactions,\n            )\n\n            fm = L2GFeatureMatrix.generate_features(\n                study_locus=study_locus,\n                study_index=studies,\n                variant_gene=v2g,\n                # colocalisation=coloc,\n            )\n\n            # Join and fill null values with 0\n            data = L2GFeatureMatrix(\n                _df=gold_standards.df.drop(\"sources\").join(\n                    fm.df, on=[\"studyLocusId\", \"geneId\"], how=\"inner\"\n                ),\n                _schema=L2GFeatureMatrix.get_schema(),\n            ).fill_na()\n\n            # Instantiate classifier\n            estimator = SparkXGBClassifier(\n                eval_metric=\"logloss\",\n                features_col=\"features\",\n                label_col=\"label\",\n                max_depth=5,\n            )\n            l2g_model = LocusToGeneModel(\n                features_list=list(self.features_list), estimator=estimator\n            )\n            if self.perform_cross_validation:\n                # Perform cross validation to extract what are the best hyperparameters\n                cv_folds = self.hyperparameters.get(\"cross_validation_folds\", 5)\n                LocusToGeneTrainer.cross_validate(\n                    l2g_model=l2g_model,\n                    data=data,\n                    num_folds=cv_folds,\n                )\n            else:\n                # Train model\n                model = LocusToGeneTrainer.train(\n                    data=data,\n                    l2g_model=l2g_model,\n                    features_list=list(self.features_list),\n                    
model_path=self.model_path,\n                    evaluate=True,\n                    wandb_run_name=self.wandb_run_name,\n                    **self.hyperparameters,\n                )\n                model.save(self.model_path)\n                self.session.logger.info(\n                    f\"Finished L2G step. L2G model saved to {self.model_path}\"\n                )\n\n        if self.run_mode == \"predict\":\n            if not self.model_path or not self.predictions_path:\n                raise ValueError(\n                    \"model_path and predictions_path must be set for predict mode.\"\n                )\n            predictions = L2GPrediction.from_study_locus(\n                self.model_path,\n                study_locus,\n                studies,\n                v2g,\n                # coloc\n            )\n            predictions.df.write.mode(self.session.write_mode).parquet(\n                self.predictions_path\n            )\n            self.session.logger.info(\n                f\"Finished L2G step. L2G predictions saved to {self.predictions_path}\"\n            )\n
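A sketch of running the step in predict mode with placeholder paths; model_path and predictions_path are required for this mode, as enforced in the code above.

    from otg.l2g import LocusToGeneStep

    LocusToGeneStep(
        run_mode="predict",
        model_path="gs://bucket/l2g/model",               # placeholder paths
        predictions_path="gs://bucket/l2g/predictions",
        study_locus_path="gs://bucket/study_locus",
        variant_gene_path="gs://bucket/v2g",
        colocalisation_path="gs://bucket/colocalisation",
        study_index_path="gs://bucket/study_index",
    )  # __post_init__ validates run_mode and writes the predictions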
"},{"location":"python_api/step/ld_index/","title":"LD Index","text":""},{"location":"python_api/step/ld_index/#otg.ld_index.LDIndexStep","title":"otg.ld_index.LDIndexStep dataclass","text":"

LD index step.

This step is resource intensive

Suggested params: high memory machine, 5TB of boot disk, no SSDs.

Attributes:

- session (Session): Session object.
- start_hail (bool): Whether to start Hail. Defaults to True.
- ld_matrix_template (str): Template path for the LD matrix from gnomAD.
- ld_index_raw_template (str): Template path for the variant index correspondence in the LD matrix from gnomAD.
- min_r2 (float): Minimum r2 to consider for variants within a window.
- grch37_to_grch38_chain_path (str): Path to the GRCh37 to GRCh38 chain file.
- ld_populations (List[str]): List of population-specific LD matrices to process.
- ld_index_out (str): Output LD index path.

Source code in src/otg/ld_index.py
@dataclass\nclass LDIndexStep:\n    \"\"\"LD index step.\n\n    !!! warning \"This step is resource intensive\"\n        Suggested params: high memory machine, 5TB of boot disk, no SSDs.\n\n    Attributes:\n        session (Session): Session object.\n        start_hail (bool): Whether to start Hail. Defaults to True.\n        ld_matrix_template (str): Template path for LD matrix from gnomAD.\n        ld_index_raw_template (str): Template path for the variant indices correspondance in the LD Matrix from gnomAD.\n        min_r2 (float): Minimum r2 to consider when considering variants within a window.\n        grch37_to_grch38_chain_path (str): Path to GRCh37 to GRCh38 chain file.\n        ld_populations (List[str]): List of population-specific LD matrices to process.\n        ld_index_out (str): Output LD index path.\n    \"\"\"\n\n    session: Session = Session()\n    start_hail: bool = True\n\n    ld_matrix_template: str = \"gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.adj.ld.bm\"\n    ld_index_raw_template: str = \"gs://gcp-public-data--gnomad/release/2.1.1/ld/gnomad.genomes.r2.1.1.{POP}.common.ld.variant_indices.ht\"\n    min_r2: float = 0.5\n    grch37_to_grch38_chain_path: str = (\n        \"gs://hail-common/references/grch37_to_grch38.over.chain.gz\"\n    )\n    ld_populations: List[str] = field(\n        default_factory=lambda: [\n            \"afr\",  # African-American\n            \"amr\",  # American Admixed/Latino\n            \"asj\",  # Ashkenazi Jewish\n            \"eas\",  # East Asian\n            \"fin\",  # Finnish\n            \"nfe\",  # Non-Finnish European\n            \"nwe\",  # Northwestern European\n            \"seu\",  # Southeastern European\n        ]\n    )\n    ld_index_out: str = MISSING\n\n    def __post_init__(self: LDIndexStep) -> None:\n        \"\"\"Run step.\"\"\"\n        hl.init(sc=self.session.spark.sparkContext, log=\"/dev/null\")\n        ld_index = GnomADLDMatrix.as_ld_index(\n            self.ld_populations,\n            self.ld_matrix_template,\n            self.ld_index_raw_template,\n            self.grch37_to_grch38_chain_path,\n            self.min_r2,\n        )\n        self.session.logger.info(f\"Writing LD index to: {self.ld_index_out}\")\n        (\n            ld_index.df.write.partitionBy(\"chromosome\")\n            .mode(self.session.write_mode)\n            .parquet(f\"{self.ld_index_out}\")\n        )\n
"},{"location":"python_api/step/ukbiobank/","title":"UK Biobank","text":""},{"location":"python_api/step/ukbiobank/#otg.ukbiobank.UKBiobankStep","title":"otg.ukbiobank.UKBiobankStep dataclass","text":"

UKBiobank study table ingestion step.

Attributes:

- session (Session): Session object.
- ukbiobank_manifest (str): UKBiobank manifest of studies.
- ukbiobank_study_index_out (str): Output path for the UKBiobank study index dataset.

Source code in src/otg/ukbiobank.py
@dataclass\nclass UKBiobankStep:\n    \"\"\"UKBiobank study table ingestion step.\n\n    Attributes:\n        session (Session): Session object.\n        ukbiobank_manifest (str): UKBiobank manifest of studies.\n        ukbiobank_study_index_out (str): Output path for the UKBiobank study index dataset.\n    \"\"\"\n\n    session: Session = Session()\n\n    ukbiobank_manifest: str = MISSING\n    ukbiobank_study_index_out: str = MISSING\n\n    def __post_init__(self: UKBiobankStep) -> None:\n        \"\"\"Run step.\"\"\"\n        # Read in the UKBiobank manifest tsv file.\n        df = self.session.spark.read.csv(\n            self.ukbiobank_manifest, sep=\"\\t\", header=True, inferSchema=True\n        )\n\n        # Parse the study index data.\n        ukbiobank_study_index = UKBiobankStudyIndex.from_source(df)\n\n        # Write the output.\n        ukbiobank_study_index.df.write.mode(self.session.write_mode).parquet(\n            self.ukbiobank_study_index_out\n        )\n
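A sketch with placeholder paths (the step executes in __post_init__):

    from otg.ukbiobank import UKBiobankStep

    UKBiobankStep(
        ukbiobank_manifest="gs://bucket/ukbiobank/manifest.tsv",        # placeholder paths
        ukbiobank_study_index_out="gs://bucket/ukbiobank/study_index",
    )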
"},{"location":"python_api/step/variant_annotation_step/","title":"Variant Annotation","text":""},{"location":"python_api/step/variant_annotation_step/#otg.variant_annotation.VariantAnnotationStep","title":"otg.variant_annotation.VariantAnnotationStep dataclass","text":"

Variant annotation step.

The variant annotation step produces a dataset of type VariantAnnotation derived from gnomAD's gnomad.genomes.vX.X.X.sites.ht Hail table. This dataset is used to validate variants and as a source of annotation.

Attributes:

- session (Session): Session object.
- start_hail (bool): Whether to start a Hail session. Defaults to True.
- gnomad_genomes (str): Path to the gnomAD genomes Hail table.
- chain_38_to_37 (str): Path to the GRCh38 to GRCh37 chain file.
- variant_annotation_path (str): Output variant annotation path.
- populations (List[str]): List of populations to include.

Source code in src/otg/variant_annotation.py
@dataclass\nclass VariantAnnotationStep:\n    \"\"\"Variant annotation step.\n\n    Variant annotation step produces a dataset of the type `VariantAnnotation` derived from gnomADs `gnomad.genomes.vX.X.X.sites.ht` Hail's table. This dataset is used to validate variants and as a source of annotation.\n\n    Attributes:\n        session (Session): Session object.\n        start_hail (bool): Whether to start a Hail session. Defaults to True.\n        gnomad_genomes (str): Path to gnomAD genomes hail table.\n        chain_38_to_37 (str): Path to GRCh38 to GRCh37 chain file.\n        variant_annotation_path (str): Output variant annotation path.\n        populations (List[str]): List of populations to include.\n    \"\"\"\n\n    session: Session = Session()\n    start_hail: bool = True\n\n    gnomad_genomes: str = MISSING\n    chain_38_to_37: str = MISSING\n    variant_annotation_path: str = MISSING\n    populations: List[str] = field(\n        default_factory=lambda: [\n            \"afr\",  # African-American\n            \"amr\",  # American Admixed/Latino\n            \"ami\",  # Amish ancestry\n            \"asj\",  # Ashkenazi Jewish\n            \"eas\",  # East Asian\n            \"fin\",  # Finnish\n            \"nfe\",  # Non-Finnish European\n            \"mid\",  # Middle Eastern\n            \"sas\",  # South Asian\n            \"oth\",  # Other\n        ]\n    )\n\n    def __post_init__(self: VariantAnnotationStep) -> None:\n        \"\"\"Run step.\"\"\"\n        # Initialise hail session.\n        hl.init(sc=self.session.spark.sparkContext, log=\"/dev/null\")\n        # Run variant annotation.\n        variant_annotation = GnomADVariants.as_variant_annotation(\n            self.gnomad_genomes,\n            self.chain_38_to_37,\n            self.populations,\n        )\n        # Write data partitioned by chromosome and position.\n        (\n            variant_annotation.df.repartition(400, \"chromosome\")\n            .sortWithinPartitions(\"chromosome\", \"position\")\n            .write.partitionBy(\"chromosome\")\n            .mode(self.session.write_mode)\n            .parquet(self.variant_annotation_path)\n        )\n
"},{"location":"python_api/step/variant_index_step/","title":"Variant Index","text":""},{"location":"python_api/step/variant_index_step/#otg.variant_index.VariantIndexStep","title":"otg.variant_index.VariantIndexStep dataclass","text":"

Run the variant index step to restrict annotation to only the variants present in study-locus sets.

Using a VariantAnnotation dataset as a reference, this step creates and writes a dataset of the type VariantIndex that includes only variants with disease-association data, carrying a reduced set of annotations.

Attributes:

- session (Session): Session object.
- variant_annotation_path (str): Input variant annotation path.
- study_locus_path (str): Input study-locus path.
- variant_index_path (str): Output variant index path.

Source code in src/otg/variant_index.py
@dataclass\nclass VariantIndexStep:\n    \"\"\"Run variant index step to only variants in study-locus sets.\n\n    Using a `VariantAnnotation` dataset as a reference, this step creates and writes a dataset of the type `VariantIndex` that includes only variants that have disease-association data with a reduced set of annotations.\n\n    Attributes:\n        session (Session): Session object.\n        variant_annotation_path (str): Input variant annotation path.\n        study_locus_path (str): Input study-locus path.\n        variant_index_path (str): Output variant index path.\n    \"\"\"\n\n    session: Session = Session()\n\n    variant_annotation_path: str = MISSING\n    study_locus_path: str = MISSING\n    variant_index_path: str = MISSING\n\n    def __post_init__(self: VariantIndexStep) -> None:\n        \"\"\"Run step.\"\"\"\n        # Extract\n        va = VariantAnnotation.from_parquet(self.session, self.variant_annotation_path)\n        study_locus = StudyLocus.from_parquet(\n            self.session, self.study_locus_path, recursiveFileLookup=True\n        )\n\n        # Transform\n        vi = VariantIndex.from_variant_annotation(va, study_locus)\n\n        # Load\n        self.session.logger.info(f\"Writing variant index to: {self.variant_index_path}\")\n        (\n            vi.df.write.partitionBy(\"chromosome\")\n            .mode(self.session.write_mode)\n            .parquet(self.variant_index_path)\n        )\n
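A sketch with placeholder paths (the step executes in __post_init__):

    from otg.variant_index import VariantIndexStep

    VariantIndexStep(
        variant_annotation_path="gs://bucket/variant_annotation",  # placeholder paths
        study_locus_path="gs://bucket/study_locus",
        variant_index_path="gs://bucket/variant_index",
    )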
"},{"location":"python_api/step/variant_to_gene_step/","title":"Variant-to-gene","text":""},{"location":"python_api/step/variant_to_gene_step/#otg.v2g.V2GStep","title":"otg.v2g.V2GStep dataclass","text":"

Variant-to-gene (V2G) step.

This step aims to generate a dataset that contains multiple pieces of evidence supporting the functional association of specific variants with genes. Some of the evidence types include:

  1. Chromatin interaction experiments, e.g. Promoter Capture Hi-C (PCHi-C).
  2. In silico functional predictions, e.g. Variant Effect Predictor (VEP) from Ensembl.
  3. Distance between the variant and each gene's canonical transcription start site (TSS).

Attributes:

- session (Session): Session object.
- variant_index_path (str): Input variant index path.
- variant_annotation_path (str): Input variant annotation path.
- gene_index_path (str): Input gene index path.
- vep_consequences_path (str): Input VEP consequences path.
- liftover_chain_file_path (str): Path to the GRCh37 to GRCh38 chain file.
- liftover_max_length_difference (int): Maximum length difference for liftover.
- max_distance (int): Maximum distance to consider.
- approved_biotypes (list[str]): List of approved biotypes.
- intervals (dict): Dictionary of interval sources.
- v2g_path (str): Output V2G path.

Source code in src/otg/v2g.py
@dataclass\nclass V2GStep:\n    \"\"\"Variant-to-gene (V2G) step.\n\n    This step aims to generate a dataset that contains multiple pieces of evidence supporting the functional association of specific variants with genes. Some of the evidence types include:\n\n    1. Chromatin interaction experiments, e.g. Promoter Capture Hi-C (PCHi-C).\n    2. In silico functional predictions, e.g. Variant Effect Predictor (VEP) from Ensembl.\n    3. Distance between the variant and each gene's canonical transcription start site (TSS).\n\n    Attributes:\n        session (Session): Session object.\n        variant_index_path (str): Input variant index path.\n        variant_annotation_path (str): Input variant annotation path.\n        gene_index_path (str): Input gene index path.\n        vep_consequences_path (str): Input VEP consequences path.\n        liftover_chain_file_path (str): Path to GRCh37 to GRCh38 chain file.\n        liftover_max_length_difference: Maximum length difference for liftover.\n        max_distance (int): Maximum distance to consider.\n        approved_biotypes (list[str]): List of approved biotypes.\n        intervals (dict): Dictionary of interval sources.\n        v2g_path (str): Output V2G path.\n    \"\"\"\n\n    session: Session = Session()\n\n    variant_index_path: str = MISSING\n    variant_annotation_path: str = MISSING\n    gene_index_path: str = MISSING\n    vep_consequences_path: str = MISSING\n    liftover_chain_file_path: str = MISSING\n    liftover_max_length_difference: int = 100\n    max_distance: int = 500_000\n    approved_biotypes: List[str] = field(\n        default_factory=lambda: [\n            \"protein_coding\",\n            \"3prime_overlapping_ncRNA\",\n            \"antisense\",\n            \"bidirectional_promoter_lncRNA\",\n            \"IG_C_gene\",\n            \"IG_D_gene\",\n            \"IG_J_gene\",\n            \"IG_V_gene\",\n            \"lincRNA\",\n            \"macro_lncRNA\",\n            \"non_coding\",\n            \"sense_intronic\",\n            \"sense_overlapping\",\n        ]\n    )\n    intervals: Dict[str, str] = field(default_factory=dict)\n    v2g_path: str = MISSING\n\n    def __post_init__(self: V2GStep) -> None:\n        \"\"\"Run step.\"\"\"\n        # Read\n        gene_index = GeneIndex.from_parquet(self.session, self.gene_index_path)\n        vi = VariantIndex.from_parquet(self.session, self.variant_index_path).persist()\n        va = VariantAnnotation.from_parquet(self.session, self.variant_annotation_path)\n        vep_consequences = self.session.spark.read.csv(\n            self.vep_consequences_path, sep=\"\\t\", header=True\n        ).select(\n            f.element_at(f.split(\"Accession\", r\"/\"), -1).alias(\n                \"variantFunctionalConsequenceId\"\n            ),\n            f.col(\"Term\").alias(\"label\"),\n            f.col(\"v2g_score\").cast(\"double\").alias(\"score\"),\n        )\n\n        # Transform\n        lift = LiftOverSpark(\n            # lift over variants to hg38\n            self.liftover_chain_file_path,\n            self.liftover_max_length_difference,\n        )\n        gene_index_filtered = gene_index.filter_by_biotypes(\n            # Filter gene index by approved biotypes to define V2G gene universe\n            list(self.approved_biotypes)\n        )\n        va_slimmed = va.filter_by_variant_df(\n            # Variant annotation reduced to the variant index to define V2G variant universe\n            vi.df\n        ).persist()\n        intervals = Intervals(\n          
  _df=reduce(\n                lambda x, y: x.unionByName(y, allowMissingColumns=True),\n                # create interval instances by parsing each source\n                [\n                    Intervals.from_source(\n                        self.session.spark, source_name, source_path, gene_index, lift\n                    ).df\n                    for source_name, source_path in self.intervals.items()\n                ],\n            ),\n            _schema=Intervals.get_schema(),\n        )\n        v2g_datasets = [\n            va_slimmed.get_distance_to_tss(gene_index_filtered, self.max_distance),\n            va_slimmed.get_most_severe_vep_v2g(vep_consequences, gene_index_filtered),\n            va_slimmed.get_polyphen_v2g(gene_index_filtered),\n            va_slimmed.get_sift_v2g(gene_index_filtered),\n            va_slimmed.get_plof_v2g(gene_index_filtered),\n            intervals.v2g(vi),\n        ]\n        v2g = V2G(\n            _df=reduce(\n                lambda x, y: x.unionByName(y, allowMissingColumns=True),\n                [dataset.df for dataset in v2g_datasets],\n            ).repartition(\"chromosome\"),\n            _schema=V2G.get_schema(),\n        )\n\n        # Load\n        (\n            v2g.df.write.partitionBy(\"chromosome\")\n            .mode(self.session.write_mode)\n            .parquet(self.v2g_path)\n        )\n
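A sketch with placeholder paths; the interval source name in intervals is purely illustrative.

    from otg.v2g import V2GStep

    V2GStep(
        variant_index_path="gs://bucket/variant_index",            # placeholder paths
        variant_annotation_path="gs://bucket/variant_annotation",
        gene_index_path="gs://bucket/gene_index",
        vep_consequences_path="gs://bucket/vep_consequences.tsv",
        liftover_chain_file_path="gs://bucket/grch37_to_grch38.over.chain.gz",
        intervals={"example_source": "gs://bucket/example_intervals"},  # illustrative source name
        v2g_path="gs://bucket/v2g",
    )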
"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 95626289b..0e8c4ed18 100644 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ diff --git a/usage/index.html b/usage/index.html index 36117ad21..a55fb3d8a 100644 --- a/usage/index.html +++ b/usage/index.html @@ -1 +1 @@ - How-to - Open Targets Genetics
\ No newline at end of file + How-to - Open Targets Genetics
\ No newline at end of file