Enable gentropy to enrich variant in-silico predictors with amino-acid variation consequences (OTAR2081) #3676

Closed
5 of 6 tasks
DSuveges opened this issue Dec 13, 2024 · 2 comments · Fixed by opentargets/orchestration#87 · May be fixed by opentargets/gentropy#947

Comments


DSuveges commented Dec 13, 2024

Context

A number of methods predict variant consequences at amino-acid resolution. Such a method claims that changing position n of the protein sequence represented by UniProt ID u from amino acid ref to alt has some predicted consequence, e.g. an effect on protein stability.

One example of such a dataset is the FoldX project, where the team used the AlphaFold structures of all proteins to mutate every residue to every other residue and measured the resulting ddG, which flags highly destabilising mutations.

All variants in the variant index that cause an amino acid change tested in the FoldX project need to be annotated with the corresponding ddG values, stored in the inSilicoPredictors object.
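
A minimal sketch of the matching described above, in plain Python rather than gentropy's actual data model: a score is keyed by (UniProt ID, position, ref, alt) and appended to a variant's predictor list when the key matches. The class names, field names and the `P_TOY` accession are hypothetical; only the variant ID and its score come from the tables in this issue.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AminoAcidChange:
    """One substitution: protein u, 1-based position n, residues ref -> alt."""

    uniprot_id: str
    position: int
    ref: str
    alt: str


@dataclass
class Variant:
    variant_id: str
    # None when the variant does not change an amino acid.
    change: Optional[AminoAcidChange] = None
    in_silico_predictors: List[dict] = field(default_factory=list)


def annotate(
    variants: List[Variant],
    ddg: Dict[AminoAcidChange, float],
    method: str = "FoldX",
) -> List[Variant]:
    """Append a predictor entry to every variant whose change has a score."""
    for variant in variants:
        if variant.change is not None and variant.change in ddg:
            variant.in_silico_predictors.append(
                {"method": method, "score": ddg[variant.change]}
            )
    return variants


# "P_TOY" is a placeholder accession, not a real UniProt ID.
d596g = AminoAcidChange("P_TOY", 596, "D", "G")
variants = [
    Variant("X_100406811_T_C", d596g),
    Variant("1_1000_A_G"),  # made-up variant with no amino-acid change
]
annotated = annotate(variants, {d596g: -5.31817})
```

Keying on the full (protein, position, ref, alt) tuple means a variant only picks up a score when all four agree, which is the condition the paragraph above states.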

Tasks

  • Given there's a good chance such annotations will increase in the future, a new data model needs to be included in gentropy.
  • Add new datasource for FoldX.
  • Add logic to normalise FoldX ddG values.
  • Add logic to integrate dataset into the variant index.
  • Add test.
  • Add new step to ingest data.

Integration of FoldX dataset

  • Some large proteins are split into chunks to build AlphaFold models, and the positions in the dataset refer to those models, not to the full-length UniProt sequence. The chunked UniProt IDs need to be dropped.
  • Develop logic to normalise ddG values.
  • Add step to ingest data and save as static asset.
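
The first two bullets above could be sketched as follows. This is an illustration under stated assumptions, not the gentropy implementation: it assumes each record carries a model name in the AlphaFold DB style `AF-<accession>-F<n>`, treats a protein as chunked when any fragment other than F1 is seen, and uses clip-and-rescale as a stand-in normalisation, since the issue does not specify one. The record layout, the accessions and the `limit` value are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Assumed naming: AF-<UniProt accession>-F<fragment number>.
MODEL_RE = re.compile(r"^AF-(?P<uniprot>[A-Z0-9]+)-F(?P<fragment>\d+)$")


def drop_chunked(records: List[dict]) -> List[dict]:
    """Keep only records of proteins modelled as a single fragment (F1)."""
    fragments: Dict[str, Set[int]] = defaultdict(set)
    parsed = []
    for record in records:
        match = MODEL_RE.match(record["model"])
        if match is None:
            raise ValueError(f"unexpected model name: {record['model']!r}")
        uniprot = match.group("uniprot")
        fragments[uniprot].add(int(match.group("fragment")))
        parsed.append((uniprot, record))
    return [rec for uniprot, rec in parsed if fragments[uniprot] == {1}]


def normalise_ddg(ddg: float, limit: float) -> float:
    """Clip ddG to [-limit, limit] and rescale to [-1, 1] (assumed scheme)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return max(-limit, min(limit, ddg)) / limit


# Toy records; the accessions are placeholders, not real UniProt IDs.
records = [
    {"model": "AF-PAAAA1-F1", "position": 10, "ddg": 106.329},
    {"model": "AF-PBBBB2-F1", "position": 10, "ddg": 2.0},
    {"model": "AF-PBBBB2-F2", "position": 10, "ddg": -1.0},
]
kept = drop_chunked(records)
scores = [normalise_ddg(r["ddg"], limit=10.0) for r in kept]
```

Dropping the whole protein, rather than only fragments F2 and later, follows the bullet's wording that chunked UniProt IDs are dropped. The limit of 10.0 is arbitrary and only there to make the example concrete.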

DSuveges added the Data (Relates to Open Targets data team) and Genetics (Relates to Open Targets genetics team) labels Dec 13, 2024
DSuveges self-assigned this Dec 13, 2024

DSuveges commented Dec 16, 2024

Stats

Out of the 6.2 million variants in the variant index, 7% (442k) have FoldX ddG values.

|       |          score |
|:------|---------------:|
| count | 442748         |
| mean  |      1.53664   |
| std   |      3.19015   |
| min   |     -8.43588   |
| 25%   |      0.0257792 |
| 50%   |      0.734665  |
| 75%   |      2.04552   |
| max   |    106.329     |
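
As a quick check of the coverage figure, the share can be recomputed from the `count` row of the table and the quoted index size. The total is the rounded "6.2 million" from the text, so the result is approximate.

```python
annotated = 442_748   # "count" row of the summary table
total = 6_200_000     # "6.2 million", a rounded figure from the text
coverage = annotated / total
print(f"{coverage:.1%}")  # prints 7.1%
```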

Truncated (score < 20) distribution of the scores:

Image

Most extreme stabilising variants:

+----------------+------+--------+
|       variantId|method|   score|
+----------------+------+--------+
| 19_18788237_T_C| FoldX|-8.43588|
| 19_18788237_T_A| FoldX|-7.88709|
| 19_18786036_T_C| FoldX|-7.50872|
| 5_132591266_A_G| FoldX|-7.48104|
| 19_18786633_C_T| FoldX|  -7.439|
| 19_18786037_C_G| FoldX|-7.23347|
| 6_131581284_A_G| FoldX| -7.1111|
| 19_18787500_C_T| FoldX|-7.10864|
| 19_18786145_C_T| FoldX|-7.03227|
| 21_39232259_T_C| FoldX|-6.67871|
| 19_18787650_C_T| FoldX|-6.41392|
| 19_18786139_C_T| FoldX|-5.88879|
| 6_131581296_A_G| FoldX|-5.86358|
| 17_78995210_C_T| FoldX|-5.75068|
| 9_137108478_G_A| FoldX|-5.72537|
| X_100406812_C_T| FoldX|-5.57866|
|  17_7173778_C_T| FoldX|-5.55336|
|  2_26448802_C_T| FoldX|-5.53677|
|11_134282081_G_C| FoldX|-5.43581|
| X_100406811_T_C| FoldX|-5.31817|
+----------------+------+--------+
only showing top 20 rows

Most extreme destabilizing variants:

+---------------+------+-------+
|      variantId|method|  score|
+---------------+------+-------+
| 2_29228944_C_A| FoldX|106.329|
| 2_29227060_C_A| FoldX|86.3004|
|8_144052236_C_A| FoldX|85.9526|
|1_151366445_C_A| FoldX|85.6923|
| 11_2165721_C_A| FoldX|84.8677|
|17_44383931_C_A| FoldX|78.1641|
| 2_29228923_C_A| FoldX|76.4907|
|8_144413330_C_A| FoldX|75.7238|
| 1_75733626_G_T| FoldX| 74.136|
|14_87992972_C_A| FoldX|69.9048|
|1_151024711_G_T| FoldX|69.2282|
|14_87965571_C_A| FoldX|68.1366|
|17_29254327_C_G| FoldX| 68.027|
| 8_38414279_C_A| FoldX|67.5331|
|2_232481836_C_A| FoldX|66.8255|
| 2_29228944_C_T| FoldX| 64.039|
| 5_40955428_G_T| FoldX|63.8174|
|11_71437868_C_A| FoldX|  63.34|
|11_68780705_C_A| FoldX|61.5534|
|1_161306849_C_A| FoldX|60.2048|
+---------------+------+-------+
only showing top 20 rows

These variants, with either highly positive or highly negative ddG, occur in ClinVar and UniProt. It seems that even where the predicted effect is stabilising, the impact is severe. For example, in X_100406811_T_C the D596G amino acid change has a ddG of about -5.3 kJ/mol (see the table above), indicating a stabilising effect, yet based on the available phenotyping information the ClinVar consequence is still likely pathogenic, in concordance with the other in-silico predictors.
Image

Looking at the structure around the residue, it is hard to imagine how an Asp to Gly change could introduce an extra stabilising effect, considering how many interactions occur between the Asp's carboxyl group and its surroundings.

Image

DSuveges commented

The mean ddG predicted by FoldX for ClinVar variants shows a nice correlation with the severity of the assigned clinical significance:

Image

More severe assessment classes have a higher ddG on average; however, it has to be noted that the ddG distribution is very wide for all classes:

Image
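
The per-class comparison above amounts to grouping ddG values by clinical significance and reporting the mean alongside a spread measure. A rough sketch in plain Python follows; the function name is hypothetical and the rows are toy values chosen for easy arithmetic, not data from the variant index.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, Tuple


def summarise_by_class(rows: Iterable[Tuple[str, float]]) -> Dict[str, dict]:
    """Group (clinical significance, ddG) pairs and summarise each class."""
    grouped = defaultdict(list)
    for significance, ddg in rows:
        grouped[significance].append(ddg)
    return {
        name: {"n": len(values), "mean": mean(values), "std": pstdev(values)}
        for name, values in grouped.items()
    }


# Toy rows, made up for illustration only.
rows = [
    ("benign", 0.0),
    ("benign", 1.0),
    ("pathogenic", 2.0),
    ("pathogenic", 6.0),
]
summary = summarise_by_class(rows)
```

Reporting the standard deviation next to the mean keeps the caveat about wide distributions visible: two classes can differ in mean while still overlapping heavily.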
