\ No newline at end of file
diff --git a/development/_development/index.html b/development/_development/index.html
index 4cd8d49fc..5f02cb3a0 100644
--- a/development/_development/index.html
+++ b/development/_development/index.html
@@ -1 +1 @@
- Development - Open Targets Genetics
This section contains technical information on how to develop and run the code.
\ No newline at end of file
diff --git a/development/airflow/index.html b/development/airflow/index.html
index dee819bf1..559f0fba1 100644
--- a/development/airflow/index.html
+++ b/development/airflow/index.html
@@ -1,4 +1,4 @@
- Running Airflow workflows - Open Targets Genetics
The steps in this section only ever need to be done once on any particular system.
Google Cloud configuration:
1. Install Google Cloud SDK: https://cloud.google.com/sdk/docs/install.
2. Log in to your work Google Account: run gcloud auth login and follow instructions.
3. Obtain Google application credentials: run gcloud auth application-default login and follow instructions.
Check that you have the make utility installed, and if not (which is unlikely), install it using your system package manager.
Run make setup-dev to install/update the necessary packages and activate the development environment. You need to do this every time you open a new shell.
It is recommended to use VS Code as an IDE for development.
All pipelines in this repository are intended to be run in Google Dataproc. Running them locally is not currently supported.
In order to run the code:
Manually edit your local workflow/dag.yaml file and comment out the steps you do not want to run.
Manually edit your local pyproject.toml file and modify the version of the code.
This must be different from the version used by anyone else working on the repository to avoid deployment conflicts, so it's a good idea to use your name, for example: 1.2.3+jdoe.
You can also add a brief branch description, for example: 1.2.3+jdoe.myfeature.
Note that the version must comply with PEP 440 conventions; otherwise, Poetry will not allow it to be deployed.
Do not use underscores or hyphens in your version name. When building the WHL file, they will be automatically converted to dots, which means the file name will no longer match the version and the build will fail. Use dots instead.
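If you want to check in advance that your chosen version string is valid, a quick way is to parse it with the packaging library (usually already available alongside Poetry; otherwise pip install packaging). This is just a sketch, and the version strings below are hypothetical examples:

from packaging.version import InvalidVersion, Version

# Hypothetical version strings; replace with your own name / branch description.
for candidate in ("1.2.3+jdoe", "1.2.3+jdoe.myfeature", "1.2.3+jdoe_myfeature"):
    try:
        # str(Version(...)) shows the normalised form; note how underscores in the
        # local segment are normalised to dots, which is why the built WHL file name
        # would no longer match the raw version string.
        print(candidate, "->", Version(candidate))
    except InvalidVersion:
        print(candidate, "is not a valid PEP 440 version")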
Run make build.
This will create a bundle containing the necessary code, configuration and dependencies to run the ETL pipeline, and then upload this bundle to Google Cloud.
A version-specific subpath is used, so uploading the code will not affect any branches but your own.
If there was already a code bundle uploaded with the same version number, it will be replaced.
Submit the Dataproc job with poetry run python workflow/workflow_template.py.
You will need to specify additional parameters; some are mandatory and some are optional. Run with --help to see usage.
The script will provision the cluster and submit the job.
The cluster will take a few minutes to get provisioned and running, during which the script will not output anything; this is normal.
Once submitted, you can monitor the progress of your job on this page: https://console.cloud.google.com/dataproc/jobs?project=open-targets-genetics-dev.
On completion (whether successful or a failure), the cluster will be automatically removed, so you don't have to worry about shutting it down to avoid incurring charges.
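Besides the web console, you can poll job status programmatically. Below is a minimal sketch that assumes the google-cloud-dataproc client library is installed and that the cluster runs in europe-west1 (replace the region with the one your cluster actually uses):

from google.cloud import dataproc_v1

project_id = "open-targets-genetics-dev"
region = "europe-west1"  # assumption: use the region your cluster is provisioned in

# The job controller client has to point at the regional Dataproc endpoint.
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Print the identifier and current state of each job in the project and region.
for job in client.list_jobs(project_id=project_id, region=region):
    print(job.reference.job_id, job.status.state.name)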
When making changes, and especially when implementing a new module or feature, it's essential to ensure that all relevant sections of the code base are modified.
- [ ] Run make check. This will run the linter and formatter to ensure that the code is compliant with the project conventions.
- [ ] Develop unit tests for your code and run make test. This will run all unit tests in the repository, including the examples appended in the docstrings of some methods (see the sketch below).
- [ ] Update the configuration if necessary.
- [ ] Update the documentation and check it with make build-documentation. This will start a local server to browse it (URL will be printed, usually http://127.0.0.1:8000/).
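To illustrate what "examples appended in the docstrings" means in practice, here is a hypothetical helper (not part of the repository) with a doctest-style example; make test would typically execute the >>> lines via pytest and compare the printed output:

def normalise_chromosome(chromosome: str) -> str:
    """Strip an optional "chr" prefix from a chromosome name.

    Examples:
        >>> normalise_chromosome("chr1")
        '1'
        >>> normalise_chromosome("X")
        'X'
    """
    return chromosome.removeprefix("chr")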
For more details on each of these steps, see the sections below.
If during development you had a question that wasn't covered in the documentation and someone explained it to you, add it to the documentation. The same applies if you encountered any instructions in the documentation that turned out to be obsolete or incorrect.
Documentation autogeneration expressions start with :::. They will automatically generate sections of the documentation based on class and method docstrings; a brief sketch follows the list below. Be sure to update them for:
Dataset definitions in docs/reference/dataset (example: docs/reference/dataset/study_index/study_index_finngen.md)
Step definition in docs/reference/step (example: docs/reference/step/finngen.md)
Test study fixture in tests/conftest.py (example: mock_study_index_finngen in that module)
Test sample data in tests/data_samples (example: tests/data_samples/finngen_studies_sample.json)
Test definition in tests/ (example: tests/dataset/test_study_index.py → test_study_index_finngen_creation)
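For the dataset and step pages, the documentation file itself is usually little more than an autogeneration expression. As a rough sketch, a hypothetical docs/reference/dataset/study_index/study_index_finngen.md might contain only a line like the one below; the module path is an assumption and should be matched to where the class actually lives:

::: otg.dataset.study_index.StudyIndexFinnGen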
\ No newline at end of file
diff --git a/development/troubleshooting/index.html b/development/troubleshooting/index.html
index dc96298ea..d6c3aec1f 100644
--- a/development/troubleshooting/index.html
+++ b/development/troubleshooting/index.html
@@ -1 +1 @@
- Troubleshooting - Open Targets Genetics
+ Troubleshooting - Open Targets Genetics
If you see various errors thrown by Pyenv or Poetry, they can be hard to diagnose and resolve. In this case, it often helps to remove those tools from the system completely. Follow these steps:
Exit your currently activated environment, if any, by running: exit
Officially, PySpark requires Java version 8 (a.k.a. 1.8) or above to work. However, if you have a very recent version of Java, you may experience issues, as it may introduce breaking changes that PySpark hasn't had time to integrate. For example, as of May 2023, PySpark did not work with Java 20.
If you are encountering problems with initialising a Spark session, try using Java 11.
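If you have several Java versions installed, one way to make PySpark pick up Java 11 is to point JAVA_HOME at it before the first Spark session is created. A minimal sketch (the installation path is a placeholder for wherever Java 11 lives on your system):

import os

from pyspark.sql import SparkSession

# Placeholder path: point this at your actual Java 11 installation.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

# JAVA_HOME must be set before the JVM is started, i.e. before the first
# SparkSession is created in this Python process.
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)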
If you see an error message thrown by pre-commit that looks like this (SyntaxError: Unexpected token '?'), followed by a JavaScript traceback, the issue is likely with your system NodeJS version.
One solution which can help in this case is to upgrade your system NodeJS version. However, this may not always be possible. For example, the Ubuntu repository is several major versions behind the latest release as of July 2023.
Another solution which helps is to remove Node, NodeJS, and npm from your system entirely. In this case, pre-commit will not try to rely on a system version of NodeJS and will install its own suitable version.
On Ubuntu, this can be done using sudo apt remove node nodejs npm, followed by sudo apt autoremove. But in some cases, depending on your existing installation, you may need to also manually remove some files. See this StackOverflow answer for guidance.
After running these commands, you are advised to open a fresh shell, and then also reinstall Pyenv and Poetry to make sure they pick up the changes (see the relevant section above).
\ No newline at end of file
diff --git a/index.html b/index.html
index 2110f6f99..95d3d0a03 100644
--- a/index.html
+++ b/index.html
@@ -1,4 +1,4 @@
- Open Targets Genetics - Open Targets Genetics
\ No newline at end of file
diff --git a/objects.inv b/objects.inv
index 8278155af..61ba946f1 100644
Binary files a/objects.inv and b/objects.inv differ
diff --git a/python_api/_python_api/index.html b/python_api/_python_api/index.html
index 5375ccc98..7d95169fe 100644
--- a/python_api/_python_api/index.html
+++ b/python_api/_python_api/index.html
@@ -1 +1 @@
- Python API - Open Targets Genetics
\ No newline at end of file
diff --git a/python_api/dataset/_dataset/index.html b/python_api/dataset/_dataset/index.html
index e0a2997da..e000a9ab5 100644
--- a/python_api/dataset/_dataset/index.html
+++ b/python_api/dataset/_dataset/index.html
@@ -1,4 +1,4 @@
- Dataset - Open Targets Genetics
Dataset is a wrapper around a Spark DataFrame with a predefined schema. Schemas for each child dataset are described in the schemas module.
Source code in src/otg/dataset/dataset.py
@@ -145,7 +145,7 @@
class Dataset(ABC):
    """Open Targets Genetics Dataset.

-    `Dataset` is a wrapper around a Spark DataFrame with a predefined schema. Schemas for each child dataset are described in the `json.schemas` module.
+    `Dataset` is a wrapper around a Spark DataFrame with a predefined schema. Schemas for each child dataset are described in the `schemas` module.
    """

    _df: DataFrame
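As an illustration of the wrapper pattern described above (this is not the actual otg.dataset.dataset.Dataset implementation, just a toy sketch), a schema-validated DataFrame wrapper can be as small as this:

from dataclasses import dataclass

from pyspark.sql import DataFrame
from pyspark.sql.types import StringType, StructField, StructType


@dataclass
class TinyDataset:
    """Toy wrapper: a DataFrame plus the schema it is expected to conform to."""

    _df: DataFrame
    _schema: StructType

    def __post_init__(self) -> None:
        # Fail fast if the wrapped DataFrame does not match the expected schema.
        if self._df.schema != self._schema:
            raise ValueError("DataFrame does not conform to the expected schema")


# Hypothetical usage, assuming an existing SparkSession called `spark`:
# schema = StructType([StructField("studyId", StringType(), True)])
# ds = TinyDataset(_df=spark.createDataFrame([("GCST001",)], schema), _schema=schema)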
diff --git a/python_api/dataset/colocalisation/index.html b/python_api/dataset/colocalisation/index.html
index b88d04c05..4043b57c5 100644
--- a/python_api/dataset/colocalisation/index.html
+++ b/python_api/dataset/colocalisation/index.html
@@ -1,4 +1,4 @@
- Colocalisation - Open Targets Genetics
def fill_na(
+    self: L2GFeatureMatrix, value: float = 0.0, subset: list[str] | None = None
+) -> L2GFeatureMatrix:
+    """Fill missing values in a column with a given value.
+
+    Args:
+        value (float): Value to replace missing values with. Defaults to 0.0.
+        subset (list[str] | None): Subset of columns to consider. Defaults to None.
+
+    Returns:
+        L2GFeatureMatrix: L2G feature matrix dataset
+    """
+    self.df = self._df.fillna(value, subset=subset)
+    return self
+
Provides the schema for the L2gFeatureMatrix dataset.
Returns:
    StructType: Schema for the L2gFeatureMatrix dataset
Source code in src/otg/dataset/l2g_feature_matrix.py
@classmethod
+def get_schema(cls: type[L2GFeatureMatrix]) -> StructType:
+    """Provides the schema for the L2gFeatureMatrix dataset.
+
+    Returns:
+        StructType: Schema for the L2gFeatureMatrix dataset
+    """
+    return parse_spark_schema("l2g_feature_matrix.json")
+
def select_features(
+    self: L2GFeatureMatrix, features_list: list[str]
+) -> L2GFeatureMatrix:
+    """Select a subset of features from the feature matrix.
+
+    Args:
+        features_list (list[str]): List of features to select
+
+    Returns:
+        L2GFeatureMatrix: L2G feature matrix dataset
+    """
+    fixed_rows = ["studyLocusId", "geneId", "goldStandardSet"]
+    self.df = self._df.select(fixed_rows + features_list)
+    return self
+
def train_test_split(
+    self: L2GFeatureMatrix, fraction: float
+) -> tuple[L2GFeatureMatrix, L2GFeatureMatrix]:
+    """Split the dataset into training and test sets.
+
+    Args:
+        fraction (float): Fraction of the dataset to use for training
+
+    Returns:
+        tuple[L2GFeatureMatrix, L2GFeatureMatrix]: Training and test datasets
+    """
+    train, test = self._df.randomSplit([fraction, 1 - fraction], seed=42)
+    return (
+        L2GFeatureMatrix(
+            _df=train, _schema=L2GFeatureMatrix.get_schema()
+        ).persist(),
+        L2GFeatureMatrix(_df=test, _schema=L2GFeatureMatrix.get_schema()).persist(),
+    )
+
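Putting the three methods above together, a hypothetical usage of a feature matrix could look like this; fm is assumed to be an existing L2GFeatureMatrix, and the feature names are placeholders rather than the real column names:

# Fill missing values, keep a subset of features, and split for training.
fm = fm.fill_na(0.0, subset=["distanceToTss"])            # placeholder feature name
fm = fm.select_features(["distanceToTss", "eqtlColoc"])   # placeholder feature names
train, test = fm.train_test_split(fraction=0.8)
print(train.df.count(), test.df.count())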
\ No newline at end of file
diff --git a/python_api/dataset/l2g_gold_standard/index.html b/python_api/dataset/l2g_gold_standard/index.html
new file mode 100644
index 000000000..48f9bde4e
--- /dev/null
+++ b/python_api/dataset/l2g_gold_standard/index.html
@@ -0,0 +1,150 @@
+ L2G Gold Standard - Open Targets Genetics
Provides the schema for the L2GGoldStandard dataset.
Returns:
    StructType: Spark schema for the L2GGoldStandard dataset
Source code in src/otg/dataset/l2g_gold_standard.py
@classmethod
+def get_schema(cls: type[L2GGoldStandard]) -> StructType:
+    """Provides the schema for the L2GGoldStandard dataset.
+
+    Returns:
+        StructType: Spark schema for the L2GGoldStandard dataset
+    """
+    return parse_spark_schema("l2g_gold_standard.json")
+
\ No newline at end of file
diff --git a/python_api/dataset/l2g_prediction/index.html b/python_api/dataset/l2g_prediction/index.html
new file mode 100644
index 000000000..10959bb7c
--- /dev/null
+++ b/python_api/dataset/l2g_prediction/index.html
@@ -0,0 +1,222 @@
+ L2G Prediction - Open Targets Genetics
Dataset that contains the Locus to Gene predictions.
It is the result of applying the L2G model on a feature matrix, which contains all the study/locus pairs and their functional annotations. The score column informs the confidence of the prediction that a gene is causal to an association.
@dataclass
+class L2GPrediction(Dataset):
+    """Dataset that contains the Locus to Gene predictions.
+
+    It is the result of applying the L2G model on a feature matrix, which contains all
+    the study/locus pairs and their functional annotations. The score column informs the
+    confidence of the prediction that a gene is causal to an association.
+    """
+
+    @classmethod
+    def get_schema(cls: type[L2GPrediction]) -> StructType:
+        """Provides the schema for the L2GPrediction dataset.
+
+        Returns:
+            StructType: Schema for the L2GPrediction dataset
+        """
+        return parse_spark_schema("l2g_predictions.json")
+
+    @classmethod
+    def from_study_locus(
+        cls: Type[L2GPrediction],
+        model_path: str,
+        study_locus: StudyLocus,
+        study_index: StudyIndex,
+        v2g: V2G,
+        # coloc: Colocalisation,
+    ) -> L2GPrediction:
+        """Initialise L2G from feature matrix.
+
+        Args:
+            model_path (str): Path to the fitted model
+            study_locus (StudyLocus): Study locus dataset
+            study_index (StudyIndex): Study index dataset
+            v2g (V2G): Variant to gene dataset
+
+        Returns:
+            L2GPrediction: L2G dataset
+        """
+        fm = L2GFeatureMatrix.generate_features(
+            study_locus=study_locus,
+            study_index=study_index,
+            variant_gene=v2g,
+            # colocalisation=coloc,
+        ).fill_na()
+        return L2GPrediction(
+            # Load and apply fitted model
+            _df=(
+                LocusToGeneModel.load_from_disk(
+                    model_path,
+                    features_list=fm.df.drop("studyLocusId", "geneId").columns,
+                ).predict(fm)
+                # the probability of the positive class is the second element inside the probability array
+                # - this is selected as the L2G probability
+                .select(
+                    "studyLocusId",
+                    "geneId",
+                    vector_to_array("probability")[1].alias("score"),
+                )
+            ),
+            _schema=cls.get_schema(),
+        )
+
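A hypothetical invocation, assuming study_locus, study_index and v2g datasets are already loaded and a fitted model has been stored at the path shown (the path is a placeholder):

predictions = L2GPrediction.from_study_locus(
    model_path="gs://my-bucket/l2g/model",  # placeholder path
    study_locus=study_locus,
    study_index=study_index,
    v2g=v2g,
)
predictions.df.show(5)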
Provides the schema for the L2GPrediction dataset.
Returns:
    StructType: Schema for the L2GPrediction dataset
Source code in src/otg/dataset/l2g_prediction.py
@classmethod
+def get_schema(cls: type[L2GPrediction]) -> StructType:
+    """Provides the schema for the L2GPrediction dataset.
+
+    Returns:
+        StructType: Schema for the L2GPrediction dataset
+    """
+    return parse_spark_schema("l2g_predictions.json")
+
\ No newline at end of file
diff --git a/python_api/dataset/ld_index/index.html b/python_api/dataset/ld_index/index.html
index b1f4b4d17..832339573 100644
--- a/python_api/dataset/ld_index/index.html
+++ b/python_api/dataset/ld_index/index.html
@@ -1,4 +1,4 @@
- LD Index - Open Targets Genetics
Annotate study-locus dataset with credible set flags.
Sorts the array in the locus column elements by their posteriorProbability values in descending order and adds is95CredibleSet and is99CredibleSet fields to the elements, indicating which are the tagging variants whose cumulative sum of their posteriorProbability values is below 0.95 and 0.99, respectively.
Annotate study-locus dataset with credible set flags.
Sorts the array in the locus column elements by their posteriorProbability values in descending order and adds is95CredibleSet and is99CredibleSet fields to the elements, indicating which are the tagging variants whose cumulative sum of their posteriorProbability values is below 0.95 and 0.99, respectively.
def annotate_credible_sets(self: StudyLocus) -> StudyLocus:
    """Annotate study-locus dataset with credible set flags.

    Sorts the array in the `locus` column elements by their `posteriorProbability` values in descending order and adds
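As a simplified illustration of the cumulative-sum logic described above, the sketch below works on a flat DataFrame (exploded_locus_df, one row per tagging variant, assumed to exist) rather than on the nested locus array column that the real method operates on, and it may differ in details such as how the variant that crosses the threshold is treated:

from pyspark.sql import Window
from pyspark.sql import functions as f

# Order tags from most to least probable within each study-locus and accumulate.
w = (
    Window.partitionBy("studyLocusId")
    .orderBy(f.desc("posteriorProbability"))
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

annotated = (
    exploded_locus_df.withColumn(
        "cumulativeProbability", f.sum("posteriorProbability").over(w)
    )
    # Flag tags while the running cumulative sum is still below each threshold.
    .withColumn("is95CredibleSet", f.col("cumulativeProbability") < 0.95)
    .withColumn("is99CredibleSet", f.col("cumulativeProbability") < 0.99)
)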
@@ -901,10 +907,7 @@
    <BLANKLINE>
    """
    return f.xxhash64(*[study_id_col, variant_id_col]).alias("studyLocusId")
-
def clump(self: StudyLocus) -> StudyLocus:
    """Perform LD clumping of the studyLocus.

    Evaluates whether a lead variant is linked to a tag (with lowest p-value) in the same studyLocus dataset.
@@ -984,7 +990,10 @@
Find overlapping study-locus that share at least one tagging variant. All GWAS-GWAS and all GWAS-Molecular traits are computed with the Molecular traits always appearing on the right side.
Find overlapping study-locus that share at least one tagging variant. All GWAS-GWAS and all GWAS-Molecular traits are computed with the Molecular traits always appearing on the right side.
def find_overlaps(self: StudyLocus, study_index: StudyIndex) -> StudyLocusOverlap:
    """Calculate overlapping study-locus.

    Find overlapping study-locus that share at least one tagging variant. All GWAS-GWAS and all GWAS-Molecular traits are computed with the Molecular traits always
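Usage is a single call, assuming study_locus and study_index datasets are already in memory:

overlaps = study_locus.find_overlaps(study_index)
overlaps.df.show(5)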
@@ -1083,16 +1095,16 @@
        StructType: schema for the StudyLocus dataset.
    """
    return parse_spark_schema("study_locus.json")
-
This dataset captures pairs of overlapping StudyLocus: that is, associations whose credible sets share at least one tagging variant.
Note
This is a helpful dataset for other downstream analyses, such as colocalisation. This dataset will contain the overlapping signals between studyLocus associations once they have been clumped and fine-mapped.
Source code in src/otg/dataset/study_locus_overlap.py