Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Converting Beagle imputed VCF for TRTools #209

Open
maegsul opened this issue Jan 24, 2024 · 5 comments · Fixed by #226
Open

Converting Beagle imputed VCF for TRTools #209

maegsul opened this issue Jan 24, 2024 · 5 comments · Fixed by #226

Comments

@maegsul
Copy link

maegsul commented Jan 24, 2024

Hi

First of all, thanks for developing TRTools! I have a question regarding imputed STR input files.

I have a Beagle (v5.4) imputed VCF file that looks like below:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##filedate=20230503
##source="beagle.22Jul22.46e.jar"
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated ALT Allele Frequencies">
##INFO=<ID=DR2,Number=A,Type=Float,Description="Dosage R-Squared: estimated squared correlation between estimated REF dose [P(RA) + 2P(RR)] and true REF dose">
##INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=A,Type=Float,Description="estimated ALT dose [P(RA) + 2
P(AA)]">
##contig=<ID=1>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
18 136703 STR_614109 TCCGGCAAAAAAAAAAAAAAAA TCCAGCAAAAAAAAAAAAAAAA,TCCGGAAAAAAAAAAAAAAAAA,TGCGGCAAAAAAAAAAAAAAAA,TCCGGCAAAAAAAAAAAAAAAAA,TCCGGAAAAAAAAAAAAAAAAAAGAA,TCCGGCAAAAAAAAAAAAAAAAAGAA,TCCGGAAAAAAAAAAAAAAAAAAAGAA,TCCGGCAAAAAAAAAAAAAAAAAAGAA,TCCGGCAAAAAAAAAAAAAAAAAAAGAA . PASS DR2=0,0.35,0,0.28,0,0.97,0.33,0.93,0.45;AF=0,0.008,0,0.0082,0,0.0352,0.0044,0.2689,0.0045;IMP GT:DS 0|8:0,0,0,0,0,0,0,1,0

The imputation reference panel is from: http://gymreklab.com/2018/03/05/snpstr_imputation.html that was described in the Saini et al. paper: https://www.nature.com/articles/s41467-018-06694-0

I typically split these STR alleles and perform biallelic STR GWAS on binary phenotypes of interest, however I am also interested in a length-based STR GWAS. That is how I came across with associaTR available in TRTools, that seems to be capable of running length-based GWAS. I made a test as below with chr18 STRs:

associaTR results.tsv input_chr18_fortesting.vcf.gz cases phenotypes_forTest_caseControl.npy --same-samples --beagle-dosages --vcftype hipstr

...however this gave an error as:

Traceback (most recent call last):
File "/home/fahri/miniconda3/bin/associaTR", line 10, in
sys.exit(run())
File "/home/fahri/miniconda3/lib/python3.9/site-packages/trtools/associaTR/associaTR.py", line 582, in run
main(args)
File "/home/fahri/miniconda3/lib/python3.9/site-packages/trtools/associaTR/associaTR.py", line 599, in main
perform_gwas(
File "/home/fahri/miniconda3/lib/python3.9/site-packages/trtools/associaTR/associaTR.py", line 449, in perform_gwas
perform_gwas_helper(
File "/home/fahri/miniconda3/lib/python3.9/site-packages/trtools/associaTR/associaTR.py", line 207, in perform_gwas_helper
extra_detail_fields = next(genotype_iter)
File "/home/fahri/miniconda3/lib/python3.9/site-packages/trtools/associaTR/load_and_filter_genotypes.py", line 117, in load_trs
inferred_vcftype = trh.InferVCFType(vcf, vcftype if vcftype else 'auto')
File "/home/fahri/miniconda3/lib/python3.9/site-packages/trtools/utils/tr_harmonizer.py", line 209, in InferVCFType
raise TypeError('Could not identify the type of this vcf')
TypeError: Could not identify the type of this vcf.

Of note, above I set --vcftype as hipstr, as Saini et al. paper seems to use HipSTR.

Then I realized this section of the documentation regarding Beagle: https://trtools.readthedocs.io/en/stable/CALLERS.html#beagle and used trtools_prep_beagle_vcf.sh convert my imputed VCF into a VCF that can be used in associaTR:

trtools_prep_beagle_vcf.sh hipstr 1kg.snp.str.chr18.vcf.gz input_chr18_fortesting.vcf.gz converted_input_chr18_fortesting.vcf.gz

...where 1kg.snp.str.chr18.vcf.gz was the reference panel downloaded from http://gymreklab.com/2018/03/05/snpstr_imputation.html, as mentioned above. I got another error here:

Creating temporary file ... ~/tmp/tmp.ULy8Ry1uQ1.vcf
Copying over meta header lines from the reference panel, then copying the imputed file contents
bgzipping and tabix indexing
Adding INFO fields and values from the ref_panel, then removing loci which are missing required INFO fields
[W::hts_idx_load3] The index file is older than the data file: ~/1kg.snp.str.chr18.vcf.gz.tbi
The INFO tag "START" is not defined in ~/1kg.snp.str.chr18.vcf.gz, was the -h option provided?
Failed to read from standard input: unknown file type

Is this error related to the fact that there is no INFO tag of "START" (not sure what it is?) in the chr18 STR imputation reference file*, and/or something else related to the "unknown file type" (maybe my file is not in "HipSTR" format?)?

Can you please help with this error? Thank you very much in advance!

Best,
Fahri

*PS: This is how the imputation reference file header and a randomly chosen STR look like:

##fileformat=VCFv4.2
#filedate=20180225
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated ALT Allele Frequencies">
##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated squared correlation between most probable REF dose and true REF dose">
##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated squared correlation between estimated REF dose [P(RA) + 2*P(RR)] and true REF dose">
##INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=A,Type=Float,Description="estimated ALT dose [P(RA) + P(AA)]">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Estimated Genotype Probability">
##contig=<ID=18>

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096
18 147475 STR_614123 ATACAAAAAAAAAAAAAAA ATACAAAAAAAAAAAAA,AAACAAAAAAAAAAAAAA,ATACAAAAAAAAAAAAAA,AAACAAAAAAAAAAAAAAA,ATACAAAAAAAAAAAAAAAA,ATACAAAAAAAAAAAAAAAAA . PASS AR2=0.89;DR2=0.91;AF=0.016,0.0019,0.33,0.0025,0.0057,0.004;IMP GT:DS 3|0:0,0,0.99,0,0,0
@maegsul
Copy link
Author

maegsul commented Aug 29, 2024

Thank you for developing annotaTR for addressing this issue @gymreklab! I have been making some tests this week, here is my experience (also for the future reference for others who might be having similar problems):

I reinstalled TRTools, and tried annotaTR on the same test file as before (chr18 Beagle-imputed VCF [this time also with GP and AP format fields] - imputed using Saini et al. 2018 STR reference panel http://gymreklab.com/2018/03/05/snpstr_imputation.html) as below:

annotaTR --vcf chr18_Beagle_imputed.vcf.gz --out beagleap_norm_annotaTR_chr18_Beagle_imputed --vcftype hipstr --outtype vcf pgen --dosages beagleap_norm --ref-panel 1kg.snp.str.chr18.vcf.gz --match-refpanel-on trimmedalleles

Here I also tried with --match-refpanel-on rawalleles and --dosages beagleap ; but it failed at "Loading reference panel" step:

Error: No TRs detected in reference panel. Was the right vcftype specified? Quitting

I then thought that perhaps annotaTR was not designed for Saini et al. 2018 imputation reference panel, as the documentation (https://trtools.readthedocs.io/en/latest/source/annotaTR.html) mentions the latest TR reference panel from EnsemblTR (by the way I was not aware of this new publication and reference panel, looks great, congratulations!), so I switched to this reference panel - using a chr18 VCF (filtered for dosage R2 > 0.3 TRs) again for testing and using Beagle v5.4 (22Jul22.46e) for imputation with the same settings as before (with AP and GP).

I first tried getting only VCF output to see how the output looks like:

annotaTR --vcf DR2_03.chr18_Beagle_imputedv2.vcf.gz --out annotaTR_DR2_03.chr18_Beagle_imputedv2 --vcftype hipstr --outtype vcf --dosages beagleap --ref-panel chr18_final_SNP_merged_additional_TRs.vcf.gz --match-refpanel-on trimmedalleles

This worked very well with either --match-refpanel-on rawalleles or --match-refpanel-on trimmedalleles options, just the latter outputting 32 more variants that are looking like biallelic indels with genomic position-ref/alt based variant IDs (not looking like TRs without any ID that have just "." as IDs). That is why I opted in for --match-refpanel-on rawalleles for later trials (please let me know if you suggest otherwise).

The outputted VCF is having TRDS format field that is exactly what I wanted to obtain (especially with --dosages beagleap that accounts for uncertainty), so thanks a lot! However, because I am also interested in logistic regression and this is not available currently in associaTR (please see also #178 (comment)), I also wanted to try pgen output (for plink2) with different dosage settings, to be able to perform logistic regression using TRDS.

However, adding "pgen" to --outtype has been problematic for me. I think I tried all possible different combinations of match-refpanel-on and --dosages, but none of them really worked, for instance for below example:

annotaTR --vcf annotaTR_DR2_03.chr18_Beagle_imputedv2.vcf.gz --out pgen_annotaTR_DR2_03.chr18_Beagle_imputedv2 --vcftype hipstr --outtype vcf pgen --dosages beagleap_norm --ref-panel chr18_final_SNP_merged_additional_TRs.vcf.gz --match-refpanel-on rawalleles

I kept getting this kind of error:

Error: PgfiInitPhase1() was called with raw_variant_ct == 21490, but pgen_annotaTR_DR2_03.chr18_Beagle_imputedv2.pgen contains 0 variants.

Here, 21490 should be the number of TRs I had in the input file, and the error is indicating that pgen file contains 0 variants (even though pgen file itself does not seems to be an empty file). plink2 --pfile pgen_annotaTR_DR2_03.chr18_Beagle_imputedv2 --validate also throws the same error (my plink2 version is PLINK v2.00a6LM AVX2 Intel [9 Jun 2024])

By the way, if I use --dosages bestguess_norm in the above example instead, I get this warning in addition to the problematic pgen:

Error writing PGEN! The output file is likely invalid. Did you run on files merged with bcftools merge? If so try rerunning with option --match-refpanel-on trimmedalleles or --match-refpanel-on locid.

I tried with your example file:

annotaTR --vcf 1kg_snpstr_21_first_100k_second_50_STRs_imputed.vcf.gz --vcftype hipstr --ref-panel 1kg_snpstr_21_first_100k_first_50_annotated.vcf.gz --outtype vcf pgen --dosages bestguess_norm --out test

...which worked actually for bestguess_norm, but not for beagleap_norm as the input file did not have AP format information I think. The main differences I can see in this your example input file are:

  1. the ID column is not containing any "." (all annotated - I tried the same but did not work)
  2. It only contains GT:DS format for samples; not so relevant I think?
  3. It is not chr-prefixed (meanwhile the reference panel is chr-prefixed), but I do not think the problem can be related to this?
  4. It contains all SNPs/STRs/TRs together, and I think they full match to reference panel (in both files, 99990 lines/variants)

Regarding the last difference: I also tried to provide an input file of raw Beagle imputed SNPs/TRs (without DR2 > 0.3 filter first), matching fully with the imputation reference panel (testing chr21) -> and this did not work either, it throws the following error:

File "~/TRTools/trtools/utils/tr_harmonizer.py", line 1149, in GetDosages
max_alt_len = max(self.alt_allele_lengths)
ValueError: max() arg is an empty sequence

...I got a similar error like this earlier when I was analyzing chr18 Beagle-imputed VCF, which was due to a monomorphic variant as far as I remember and another variant with no alternative allele the other time, for example chr18:61752586 in the reference panel with only the reference allele of GGAAGGAAGGAA and no alternative allele (that is why I chose to do a dosage R2 > 0.3 first earlier, this way I can remove such variants also, and anyway in GWAS I want to analyze only TRs whose at least 1 allele have dosage R2 > 0.3, for instance).

I did not try anything else further here onward. Can you please help with this? Thanks again for developing TRTools and thank you very much in advance!

Best,
Fahri

@gymreklab
Copy link
Contributor

Thanks @maegsul for testing it out and this detailed report of issues you ran into. I am going to reopen this issue and we will work on tracking these down, will update shortly.

@gymreklab
Copy link
Contributor

Hi again,

Some comments and questions:

  1. Could you check which version of annotaTR you are using? I am so far not able to reproduce the issue of no TRs detected in the reference panel using --ref-panel 1kg.snp.str.chr18.vcf.gz --vcftype hipstr (I get Loaded 12258 TR loci from ref panel). annotaTR --version gives me 6.0.2.

  2. For the new ensembleTR reference panel: we have been using a new version of that with unique ID's for the STRs that I think should solve issues with using that one. In fact, that new version is motivated by exactly these issues with annotaTR which are caused by duplicate loci in the reference panel. We are hoping to release that next week.

  3. Thanks for catching the issue with monomorphic variants. working on a fix in fix: bug in beagle dosages with no alt alleles #231.

@maegsul
Copy link
Author

maegsul commented Aug 30, 2024

Thanks a lot @gymreklab! To go through the comments and the questions:

  1. annotaTR --version gives me also 6.0.2. I cloned the master repository of TRTools last Monday (August 26), and I installed it with mamba using dev-env.yml:

mamba env create -n trtools_v610 -f dev-env.yml
mamba run -n trtools_v610 poetry install
mamba activate trtools_v610
poetry shell

This actually runs TRTools on Python 3.8.19, this should be fine I think?

The reference panel I use is https://s3.amazonaws.com/snp-str-imputation/1000genomes/1kg.snp.str.chr18.vcf.gz and the first SNP and STR variants appearing in the file are as below:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096
18 10644 rs115926245 C G . PASS AR2=1.00;DR2=1.00;AF=0.024 GT:DS 0|0:0
18 136703 STR_614109 TCCGGCAAAAAAAAAAAAAAAA TCCAGCAAAAAAAAAAAAAAAA,TCCGGAAAAAAAAAAAAAAAAA,TGCGGCAAAAAAAAAAAAAAAA,TCCGGCAAAAAAAAAAAAAAAAA,TCCGGAAAAAAAAAAAAAAAAAAGAA,TCCGGCAAAAAAAAAAAAAAAAAGAA,TCCGGAAAAAAAAAAAAAAAAAAAGAA,TCCGGCAAAAAAAAAAAAAAAAAAGAA,TCCGGCAAAAAAAAAAAAAAAAAAAGAA . PASS AR2=0.83;DR2=0.85;AF=0.0021,0.027,0.0020,0.040,0.0019,0.042,0.017,0.23,0.0023;IMP GT:DS 0|0:0,0.01,0,0,0,0,0,0.01,0

And my input imputed VCF looks like below (first SNP and first STR):

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
18 10644 rs115926245 C G . PASS DR2=0.22;AF=0.0039;IMP GT:DS:AP1:AP2:GP 0|0:0:0:0:1,0,0
18 136703 STR_614109 TCCGGCAAAAAAAAAAAAAAAA TCCAGCAAAAAAAAAAAAAAAA,TCCGGAAAAAAAAAAAAAAAAA,TGCGGCAAAAAAAAAAAAAAAA,TCCGGCAAAAAAAAAAAAAAAAA,TCCGGAAAAAAAAAAAAAAAAAAGAA,TCCGGCAAAAAAAAAAAAAAAAAGAA,TCCGGAAAAAAAAAAAAAAAAAAAGAA,TCCGGCAAAAAAAAAAAAAAAAAAGAA,TCCGGCAAAAAAAAAAAAAAAAAAAGAA . PASS DR2=0,0.35,0,0.28,0,0.97,0.33,0.93,0.45;AF=0,0.008,0,0.0082,0,0.0352,0.0044,0.2689,0.0045;IMP GT:DS:AP1:AP2:GP 0|8:0,0,0,0,0,0,0,1,0:0,0,0,0,0,0,0,0,0:0,0,0,0,0,0,0,1,0:0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Do these files look OK and similar? (if you need more information, I can provide as well). Also, if you have a test file available for the use of annotaTR on 2018 Saini et al. imputation reference panel, I can also test it.

  1. Sounds great, I will be waiting for this new release then! But do you think that adding unique STR IDs would solve the plink pgen output related errors I have described? I tried adding the unique IDs to variants in the imputed STR file (using EnsembleTR reference panel) and then trying annotaTR to get pgen output, but I still got the same error of "empty pgen" (PgfiInitPhase1)

  2. Thanks, I saw that there is already a commit for this fix! Whenever you think it is best, I can clone the latest GitHub repo of TRTools and try again or I can also wait until you release the new 6.1.0 version.

Thank you once again!

@gymreklab
Copy link
Contributor

Hi @maegsul below is a test using the Saini panel to impute in a couple 1000 Genomes samples (which is a bit awkward since those are also the samples in the ref panel, but this is just to test the pipeline steps) followed by annotaTR + checking the pgen files with plink2.

The first SNP/STR in the ref panel and output match what I see. However I am still not sure why you got an error loading the reference panel with annotaTR. that step is to supposed to happen before the imputed VCF is even read so should only depend on the reference panel itself.

# Download Saini panel
wget https://s3.amazonaws.com/snp-str-imputation/1000genomes/1kg.snp.str.chr18.vcf.gz
wget https://s3.amazonaws.com/snp-str-imputation/1000genomes/1kg.snp.str.chr18.vcf.gz.tbi

# Download 1000G VCF
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr18.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr18.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz.tbi

# Subset to a couple samples for testing
bcftools view -s NA12878,HG00138 ALL.chr18.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz -Oz -o test_genotypes.vcf.gz
tabix -p vcf test_genotypes.vcf.gz

# Run beagle
# (using a version we used in the past stored here: https://github.com/gymreklab/1000Genomes-TR-Analysis/blob/main/phasing/validation/beagle.19Apr22.7c0.jar, although we've also used newer versions of beagle)
java -jar beagle.19Apr22.7c0.jar \
	gt=test_genotypes.vcf.gz \
	ref=1kg.snp.str.chr18.vcf.gz \
	out=test_chr18 \
	impute=true ap=true
tabix -p vcf test_chr18.vcf.gz

# Download and install TRTools
git clone https://github.com/gymrek-lab/trtools
cd trtools
pip install -e .
cd ../

# Run annotaTR
annotaTR \
	--vcf test_chr18.vcf.gz \
	--out test_chr18_annotated \
	--ref-panel 1kg.snp.str.chr18.vcf.gz --vcftype hipstr \
	--outtype vcf pgen --dosages beagleap_norm \
	--match-refpanel-on trimmedalleles # should be ok to remove this option but using based on your command

# Validate the pgen output
plink2 --pfile test_chr18_annotated --validate

AnnotaTR outputs to the terminal:

Loading reference panel
Loaded 12258 TR loci from ref panel
Processed 1000 variants
Processed 2000 variants
...
Processed 12258 variants

and outputs these files:

test_chr18_annotated.psam
test_chr18_annotated.vcf
test_chr18_annotated.pvar
test_chr18_annotated.pgen

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants