Skip to content

Latest commit

 

History

History
144 lines (104 loc) · 7.87 KB

README.md

File metadata and controls

144 lines (104 loc) · 7.87 KB

Polygenic Risk Score (PRS) Toolkit for HLA

Modified By Tim Majarian

Changes:

  • support for VCF input
  • Plink path can be specified from command line
  • Output directory can be specified from command line
  • Proxy variant for rs17843689 (failed TOPMed QC, not present in TOPMed imputed datasets)
  • Addition of score file for non-hla haplotype scoring, with proxies for datasets imputed with TOPMed

In development:

  • support for Plink 2.0
  • BGEN input
  • Scripts for full GRS67 (non-hla haplotype scoring)

Please note the terms Polygenic Risk Score (PRS) and Genetic Risk Score (GRS) are used interchangeably.

This toolkit contains a collection of scripts for generating Single Nucleotide Polymorphism (SNP) based Polygenic Risk Scores (PRS) for common polygenic disease / traits where Human Leukocyte Antigen (HLA) region epistasis (gene vs gene interaction) is modelled using SNPs in strong Linkage Disequilbrium with alleles.

The major advantage of this is the ability to call HLA haplogenotypes (e.g DR3/DQ2.5) using SNPs alone which is signifiantly more cost and time effective in practice than alternative methods such SNP-chip imputation and HLA typing. Demo data and guide files are provided for our own published PRS but these could be easily generated in order to apply code to your own HLA calling and or scoring schema.

This work is a collective effort from researchers at the University of Exeter and elsewhere, please see references to corresponding publications below. For any questions about code on this page please contact Seth Sharp by email ([email protected]). For any questions about the individual PRS included please contact the corresponding author of the relevant publication cited.

Requirements

All scripts are designed to be run from a bash or compatible shell environment.

PLINK 1.9 is required.

Anaconda or Python (Version 3.8).

PLINK genotyping data (.bed/.bim/.fam)

Download

hla-prs-toolkit.zip - Latest version of all scripts and demo data.

Alternatively "git clone https://github.com/sethsh7/hla-prs-toolkit.git" to clone onto your working local directory or server.

Scripts

1. PLINK2CALL - Hard calling of HLA haplogenotypes from SNP proxies

Usage:

python 1_plink2call.py --bfile <prefix> --mapping <mapping> 

This script takes your PLINK data containing a set of SNPs for which you wish to generate HLA haplogenotypes and a mapping file which should be text (space or tab delimited) in the format:

SNP ALLELE
snp1  allele1
snp2 allele2
snp3 allele3
...

The SNPs are matched on RSID and should match exactly with your PLINK .bim file. The script will output the following results if succesful:

a) <prefix>_clean.txt - A table of allele counts by sample and mapped allele.
b) <prefix>_count.txt - Row wise sum of alleles per person for quality control.
c) <prefix>_cat.txt - List of categorical haplogenotypes by sample.

Filtering:

Since only 2 calls are possible per person (corresponding to 2 chromosomes) any samples with >2 calls will be filtered out in the same way missing genotypes are filtered. Work in progress - Implementing a custom allele prioritisation number system to eliminate excess SNP calls on a probabilistic basis.

2. CAT2SCORES - Assign scores to samples by HLA haplogenotype

Usage:

python 2_cat2scores.py --cat <your_cat.txt> --score <scorefile> 

The second script takes the by-sample categorical haplogenotype list file (c above) generated from PLINK2CALL and applies a scoring scheme. This can be EITHER on a full haplogenotype basis (one weight per combination) or on an additive basis (sum of each allele weight) or a mixture of both. The scorefile should be text (space or tab delimited) in the format:

ALLELE1 ALLELE2 BETA
allele1  allele2 value - Use this format for complete haplogenotypes
allele1 allele3 value
allele3 allele2 value
NULL allele1 value - For addition of an additive model use the NULL representation for the first column.
...

By default the script will NOT generate an additive term if a complete haplogenotype term is available (this differs from specific "interaction terms"). The script will output the following results file:

a) <prefix>_Scored.txt - List of samples and their corresponding scores as above

To generate a complete PRS (e.g. variants outside the HLA region) you can then sum this component with regular allele scoring generated using PLINK score function.

SNP Lists

A number of scoring and mapping files are provided to utilise in generating scores from their referenced publications. There are versions available generated from SNPs used in the publications (recommended with TOPMED imputed data) or safer proxy SNPs that are more likely to be presented in all data (recommended with 1000Genomes or similar imputation).

Type 1 Diabetes (T1D-PRS)

10-SNP / 30-SNP score ("GRS1") [1] - Original 10 and 30 SNP Exeter scores, they use a simple interaction model for DR3 and DR4 haplotypes.

67-SNP score ("GRS2") [2] - Updated 67 SNP Exeter score, complex interaction model, combines interaction and additive models (as described above), best prediction.

Coeliac Disease (CD-PRS)

42-SNP score [3] - 42 SNP Exeter score, interaction model only consisting of DQ2.5, DQ2.2, DQ8.1 and DQ2.2 haplotypes.

Example with Demo Data

In the scripts folder a subfolder "Demo" contains randomly generated data on 50 samples. You can run this code with either Run.sh or the following commands:

python 1_plink2call.py --bfile demo/demo_cohort --mapping demo/mapping.txt 
python 2_cat2scores.py --cat demo/demo_cohort_cat.txt --score demo/scorefile.txt

If all is succesful these scripts will terminate without error and you will have the output files as described above.

Additional files

In tutorials a PDF guide to HLA nomenclature is available as background.

FAQ

How do I get from a PLINK direct genotyping file to all of these additional SNPs?

You will need to impute your data, a useful tool for this is the NIH TOPMED Imputation Server. You will then need to convert your VCF of imputed dosages to PLINK format and extract the variants you need.

Should I impute to TOPMED or 1000Genomes?

If you are able to impute to TOPMED this will eliminate many issues with missing SNPs. In addition 1000Genomes designs for PRS listed here are less accurate overall.

I am still missing SNPs listed how do I get alternatives?

The best way is to use the NIH LDProxy tool to look up the next best 1000Genomes proxy SNPs.

My SNPs are badly imputed (INFO<0.8) what should I do?

Use the same tool as above, if you cannot find a good proxy SNP then you may have to exclude this locus.

How do I apply this to my own 23andMe data?

You will need to download your raw genome data and imputed it via a third party service.

How are missing genotype values handled?

They are set to 0 by default as the algorithm is conservative with calls.

Will this be updated?

Yes, as time allows (infrequently at present).

Supporting Publications

[1] "A Type 1 Diabetes Genetic Risk Score Can Aid Discrimination Between Type 1 and Type 2 Diabetes in Young Adults", DiabetesCare 2016, RA Oram et al

[2] "Development and Standardization of an Improved Type 1 Diabetes Genetic Risk Score for Use in Newborn Screening and Incident Diagnosis", DiabetesCare 2019, SA Sharp et al

[3] "A single nucleotide polymorphism genetic risk score to aid diagnosis of coeliac disease: a pilot study in clinical care", AP&T 2020, SA Sharp et al

License

Resources are provided under the GNU General Public License v3.0. For more information please LICENSE.md.