docs: Add TR GWAS tutorial #242

gymreklab · 2024-11-06T01:28:31Z

This PR introduces a new vignette showing how to run a TR GWAS, including TR imputation, computing dosages, and running GWAS with plink2. Not all of these steps use TRTools but I think TRTools docs is a good place to put this tutorial.

For the review: this needs to be proofread for typos, correctness, and clarity!

Checklist

[x ] I've checked to ensure there aren't already other open pull requests for the same update/change
[ x] I've prefixed the title of my PR according to the conventional commits specification. If your PR fixes a bug, please prefix the PR with fix: . Otherwise, if it introduces a new feature, please prefix it with feat: . If it introduces a breaking change, please add an exclamation before the colon, like feat!: . If the scope of the PR changes because of a revision to it, please update the PR title, since the title will be used in our CHANGELOG.
[x ] At the top of the PR, I've listed any open issues that this PR will resolve. For example, "resolves #0" if this PR resolves issue #0

[x ] I've explained my changes in a manner that will make it possible for both users and maintainers of TRTools to understand them

[x ] I've added tests for any new functionality. Or, if this PR fixes a bug, I've added test(s) that replicate it
[x ] All directories with large test files are listed in the "exclude" section of our pyproject.toml so that they do not appear in our PyPI distribution. All new files are also smaller than 0.5 MB.
[x ] I've updated the relevant REAMDEs with any new usage information and checked that the newly built documentation is formatted properly
[ x] All functions, modules, classes etc. still conform to numpy docstring standards
[ x] (if applicable) I've updated the pyproject.toml file with any changes I've made to TRTools's dependencies, and I've run poetry lock --no-update to ensure the lock file stays up to date and that our dependencies are locked to their minimum versions
[x ] In the body of this PR, I've included a short address to the reviewer highlighting one or two items that might deserve their focus

nicholema · 2024-11-06T16:57:46Z

They look good to me! should we mention target TR genotyping and gwas as well? People might be interested in that

aryarm

I know my review wasn't solicited!
but I was reading through this anyway because I was curious, and I noticed some typos, so I thought I'd quickly mention em

aryarm · 2024-11-06T17:32:28Z

doc/VIGNETTE-GWAS-TUTORIAL.rst

+* **SNP-TR reference panel (VCF or BREF3 format)**. You will need a precomputed referene panel that contains phased SNP and TR genotypes from an orthogonal cohort. The panel needs to be in either VCF or BREF3 format to be compatible with Beagle. We have generated two such panels:
+
+  * `Saini et al. <https://gymreklab.com/2018/03/05/snpstr_imputation.html>`_ This panel is in the GRCh37 reference build and contains 445,725 STRs and 27M SNPs/indels for 2,504 samples from the 1000 Genomes Project. Genotypes had been imputed into 1000 Genomes samples based on calls in the Simons Simplex Collection, which is predominantly European. It is also restricted to STRs called by HipSTR with repeat unit lengths <= 6bp. This panel is available in VCF format with one file/chromosome.
+  * `Ziaei-Jam et al. <https://github.com/gymrek-lab/EnsembleTR/blob/fix-ref/README.md#version-iii-of-reference-snptr-haplotype-panel-for-imputation-of-tr-variants>`_ This panel is in the GRCh38 reference build and contains 1,070,698 TRs and 70M Snps/indels from 3,202 samples from the 1000 Genomes Project. TR genotypes are based on `EnsembleTR <https://github.com/gymrek-lab/ensembleTR>_` and contain both STRs (repeat unit 1-6bp) and VNTRs (repeat unit 7+bp). The steps below were specifically tested with this panel but should also be mostly relevant to imputation with the Saini reference panel. This panel is available in VCF and BREF3 formats with one file/chromosome.


Suggested change

* `Ziaei-Jam et al. <https://github.com/gymrek-lab/EnsembleTR/blob/fix-ref/README.md#version-iii-of-reference-snptr-haplotype-panel-for-imputation-of-tr-variants>`_ This panel is in the GRCh38 reference build and contains 1,070,698 TRs and 70M Snps/indels from 3,202 samples from the 1000 Genomes Project. TR genotypes are based on `EnsembleTR <https://github.com/gymrek-lab/ensembleTR>_` and contain both STRs (repeat unit 1-6bp) and VNTRs (repeat unit 7+bp). The steps below were specifically tested with this panel but should also be mostly relevant to imputation with the Saini reference panel. This panel is available in VCF and BREF3 formats with one file/chromosome.

* `Ziaei-Jam et al. <https://github.com/gymrek-lab/EnsembleTR/blob/fix-ref/README.md#version-iii-of-reference-snptr-haplotype-panel-for-imputation-of-tr-variants>`_ This panel is in the GRCh38 reference build and contains 1,070,698 TRs and 70M SNPs/indels from 3,202 samples from the 1000 Genomes Project. TR genotypes are based on `EnsembleTR <https://github.com/gymrek-lab/ensembleTR>`_ and contain both STRs (repeat unit 1-6bp) and VNTRs (repeat unit 7+bp). The steps below were specifically tested with this panel but should also be mostly relevant to imputation with the Saini reference panel. This panel is available in VCF and BREF3 formats with one file/chromosome.

aryarm · 2024-11-06T17:35:45Z

doc/VIGNETTE-GWAS-TUTORIAL.rst

+Background
+----------
+
+To run a GWAS, you will need to have genotypes and phenotype data for your phenotype of interest available for a large cohort. Whereas SNP-based GWAS typically tests each variant for linear association between the number of alternate alleles an individual has vs. the phenotype value, the TR-based GWAS below tests for linear association between "TR dosages" and the phenotype at each TR. There are multiple ways to define TR dosage defined below, but all are related to the mean number of copies of the repeat each person has across their two alleles at a particular TR. This overall framework is similar to that described in `Margoliash et al. 2023 <https://pubmed.ncbi.nlm.nih.gov/38116119/>`_.


Suggested change

To run a GWAS, you will need to have genotypes and phenotype data for your phenotype of interest available for a large cohort. Whereas SNP-based GWAS typically tests each variant for linear association between the number of alternate alleles an individual has vs. the phenotype value, the TR-based GWAS below tests for linear association between "TR dosages" and the phenotype at each TR. There are multiple ways to define TR dosage defined below, but all are related to the mean number of copies of the repeat each person has across their two alleles at a particular TR. This overall framework is similar to that described in `Margoliash et al. 2023 <https://pubmed.ncbi.nlm.nih.gov/38116119/>`_.

To run a GWAS, you will need to have genotypes and phenotype data for a large cohort. Whereas SNP-based GWAS typically tests each variant for linear association between the number of alternate alleles of each individual vs. the phenotype value, the TR-based GWAS below tests for linear association between "TR dosages" and the phenotype at each TR. There are multiple ways to define a TR dosage below, but all are related to the mean number of copies of the repeat each person has across their two alleles at a particular TR. This overall framework is similar to that described in `Margoliash et al. 2023 <https://pubmed.ncbi.nlm.nih.gov/38116119/>`_.

rewording slightly to improve readability
feel free to ignore!

aryarm · 2024-11-06T17:46:42Z

doc/VIGNETTE-GWAS-TUTORIAL.rst

+The steps below assume you will be imputing, rather than directly genotyping, TRs for your cohort. There are multiple scenarios when imputing TRs might be preferred over direct genotyping:
+
+* If you only have genotype data (typically SNPs and indels) available for your GWAS cohort and therefore cannot directly genotype TRs
+* Even when WGS data is available, it can be expensive and time-consuming to perform genome-wide TR genotyping. For example, the pipeline below is based on what our group has applied to the `All of Us dataset <https://workbench.researchallofus.org/>`_ which at the time of writing has WGS for around 246,000 individuals. Running HipSTR on a single sample takes around 16 hours/sample, and would cost tens to hundreds of thousands of dollars at this scale. Instead, we can rely on imputation, which can give accurate genotypes at the majority of TRs and is much faster.


Could even mention that Margoliash et al have compared imputed STRs for association testing to WGS and found the results to be concordant

so they can go look at those results if they want

I would leave that out - we showed that results are reasonably concordant for TRs which passed strong fine-mapping requirements. But that cannot say how concordant association results will be for the genome at large.

aryarm · 2024-11-06T17:49:04Z

doc/VIGNETTE-GWAS-TUTORIAL.rst

+
+The :code:`split` command above will result in files :code:`batch1_chr21.vcf.gz` and :code:`batch2_chr21.vcf.gz`.
+
+An full workflow for generating these VCF subsets in All of Us is available from our `CAST subset VCF workflow <https://github.com/CAST-genomics/cast-workflows/tree/main/subset_vcf>`_ page. (Note in that workflow given the cohort size, we further broke up the split step by genomic region and then concatenate the results for all the regions from each chromosome in a final step).


Suggested change

An full workflow for generating these VCF subsets in All of Us is available from our `CAST subset VCF workflow <https://github.com/CAST-genomics/cast-workflows/tree/main/subset_vcf>`_ page. (Note in that workflow given the cohort size, we further broke up the split step by genomic region and then concatenate the results for all the regions from each chromosome in a final step).

A full workflow for generating these VCF subsets in All of Us is available from our `CAST subset VCF workflow <https://github.com/CAST-genomics/cast-workflows/tree/main/subset_vcf>`_ page. (Note: Given the cohort size, we had to further break up the split step by genomic region, and then concatenate the results for all the regions from each chromosome in a final step).

aryarm · 2024-11-06T18:08:34Z

doc/VIGNETTE-GWAS-TUTORIAL.rst

+_________________________________________
+
+
+Before downstream steps we would like to merge the VCFs across all batches to have a single VCF file per chromosome. While TRTools does include the :code:`mergeSTR` tool which can be used for this, we recommend using :code:`bcftools merge` for large cohorts when using imputed TRs. This is because: (1) after imputation with Beagle the set of REF/ALT alleles should be identical across each batch and match the reference panel, so there is not a need to use mergeSTR which specifically deals with managing differences in alleles represented across files and (2) bcftools is much faster. To merge batches::


Suggested change

Before downstream steps we would like to merge the VCFs across all batches to have a single VCF file per chromosome. While TRTools does include the :code:`mergeSTR` tool which can be used for this, we recommend using :code:`bcftools merge` for large cohorts when using imputed TRs. This is because: (1) after imputation with Beagle the set of REF/ALT alleles should be identical across each batch and match the reference panel, so there is not a need to use mergeSTR which specifically deals with managing differences in alleles represented across files and (2) bcftools is much faster. To merge batches::

Before downstream steps we would like to merge the VCFs across all batches to have a single VCF file per chromosome. While TRTools does include the :code:`mergeSTR` tool which can be used for this, we recommend using :code:`bcftools merge` for large cohorts when using imputed TRs. This is because: (1) bcftools is much faster, and (2) the set of REF/ALT alleles from Beagle should be identical across each batch and match the reference panel. We prefer to use mergeSTR when alleles can be represented differently across files. To merge batches::

rewording to make it a bit easier to read. Otherwise, the sentence felt a bit long

aryarm · 2024-11-06T18:10:36Z

doc/VIGNETTE-GWAS-TUTORIAL.rst

+
+Before downstream steps we would like to merge the VCFs across all batches to have a single VCF file per chromosome. While TRTools does include the :code:`mergeSTR` tool which can be used for this, we recommend using :code:`bcftools merge` for large cohorts when using imputed TRs. This is because: (1) after imputation with Beagle the set of REF/ALT alleles should be identical across each batch and match the reference panel, so there is not a need to use mergeSTR which specifically deals with managing differences in alleles represented across files and (2) bcftools is much faster. To merge batches::
+
+	bcftools merge batch1_chr${chrom}_imputed_TRs.vcf.gz batch2_chr${chrom}_imputed_TRs.vcf.gz ... -Oz -o merged_${chrom}_TRs.vcf.gz


Suggested change

bcftools merge batch1_chr${chrom}_imputed_TRs.vcf.gz batch2_chr${chrom}_imputed_TRs.vcf.gz ... -Oz -o merged_${chrom}_TRs.vcf.gz

bcftools merge -Oz -o merged_${chrom}_TRs.vcf.gz \

batch1_chr${chrom}_imputed_TRs.vcf.gz \

batch2_chr${chrom}_imputed_TRs.vcf.gz ...

aryarm · 2024-11-06T18:12:45Z

doc/VIGNETTE-GWAS-TUTORIAL.rst

+	merged_${chrom}_TRs_annotated.pgen
+	merged_${chrom}_TRs_annotated.pvar
+
+Note: if your TR genotypes are obtained directly from WGS and wihtout imputation, you can leave out the options :code:`--update-ref-alt` and :code:`--ref-panel` and should use :code:`--dosages bestguess_norm` since you will not have the Beagle AP field.


Suggested change

Note: if your TR genotypes are obtained directly from WGS and wihtout imputation, you can leave out the options :code:`--update-ref-alt` and :code:`--ref-panel` and should use :code:`--dosages bestguess_norm` since you will not have the Beagle AP field.

Note: if your TR genotypes are obtained directly from WGS and without imputation, you can leave out the options :code:`--update-ref-alt` and :code:`--ref-panel`. You should use :code:`--dosages bestguess_norm` since you will not have the Beagle AP field.

aryarm · 2024-11-06T18:13:38Z

doc/VIGNETTE-GWAS-TUTORIAL.rst

+Step 3: Running GWAS with TR genotypes
+--------------------------------------
+
+Now that we have imputed TRs, we are finally ready to perform GWAS! There are two ways to do this. You can use the VCF file as input to `associaTR <https://trtools.readthedocs.io/en/stable/source/associaTR.html>`_ which is part of TRTools. Alternatively, you can use the PGEN files directly as input to `plink2 <https://www.cog-genomics.org/plink/2.0/>`_. For linear association testing, these tools should give identical results. The advantages of using plink2 over associated are that: (1) plink2 is faster, (2) you have access to a rich set of plink options, including logistic regression, filtering samples, calculating LD, etc. that are not all available in associaTR currently, and (3) you can process SNPs and TRs all in one place. Therefore, our example below performs GWAS using plink2 on the PGEN files::


Suggested change

Now that we have imputed TRs, we are finally ready to perform GWAS! There are two ways to do this. You can use the VCF file as input to `associaTR <https://trtools.readthedocs.io/en/stable/source/associaTR.html>`_ which is part of TRTools. Alternatively, you can use the PGEN files directly as input to `plink2 <https://www.cog-genomics.org/plink/2.0/>`_. For linear association testing, these tools should give identical results. The advantages of using plink2 over associated are that: (1) plink2 is faster, (2) you have access to a rich set of plink options, including logistic regression, filtering samples, calculating LD, etc. that are not all available in associaTR currently, and (3) you can process SNPs and TRs all in one place. Therefore, our example below performs GWAS using plink2 on the PGEN files::

Now that we have imputed TRs, we are finally ready to perform GWAS! There are two ways to do this. You can use the VCF file as input to `associaTR <https://trtools.readthedocs.io/en/stable/source/associaTR.html>`_ which is part of TRTools. Alternatively, you can use the PGEN files directly as input to `plink2 <https://www.cog-genomics.org/plink/2.0/>`_. For linear association testing, these tools should give identical results. The advantages of using plink2 over associaTR are that: (1) plink2 is faster, (2) you have access to a rich set of plink options, including logistic regression, filtering samples, calculating LD, etc. that are not all available in associaTR currently, and (3) you can process SNPs and TRs all in one place. Therefore, our example below performs GWAS using plink2 on the PGEN files::

LiterallyUniqueLogin · 2024-11-25T19:57:42Z

doc/VIGNETTE-GWAS-TUTORIAL.rst

+
+The steps below assume you will be imputing, rather than directly genotyping, TRs for your cohort. There are multiple scenarios when imputing TRs might be preferred over direct genotyping:
+
+* If you only have genotype data (typically SNPs and indels) available for your GWAS cohort and therefore cannot directly genotype TRs


Suggested change

* If you only have genotype data (typically SNPs and indels) available for your GWAS cohort and therefore cannot directly genotype TRs

* If you do not have raw sequencing data (e.g. CRAM files) and only have genotype data (e.g. SNP and indel calls) in your GWAS cohort and therefore cannot directly genotype TRs

LiterallyUniqueLogin · 2024-11-25T20:00:50Z

doc/VIGNETTE-GWAS-TUTORIAL.rst

+The steps below assume you will be imputing, rather than directly genotyping, TRs for your cohort. There are multiple scenarios when imputing TRs might be preferred over direct genotyping:
+
+* If you only have genotype data (typically SNPs and indels) available for your GWAS cohort and therefore cannot directly genotype TRs
+* Even when WGS data is available, it can be expensive and time-consuming to perform genome-wide TR genotyping. For example, the pipeline below is based on what our group has applied to the `All of Us dataset <https://workbench.researchallofus.org/>`_ which at the time of writing has WGS for around 246,000 individuals. Running HipSTR on a single sample takes around 16 hours/sample, and would cost tens to hundreds of thousands of dollars at this scale. Instead, we can rely on imputation, which can give accurate genotypes at the majority of TRs and is much faster.


Suggested change

* Even when WGS data is available, it can be expensive and time-consuming to perform genome-wide TR genotyping. For example, the pipeline below is based on what our group has applied to the `All of Us dataset <https://workbench.researchallofus.org/>`_ which at the time of writing has WGS for around 246,000 individuals. Running HipSTR on a single sample takes around 16 hours/sample, and would cost tens to hundreds of thousands of dollars at this scale. Instead, we can rely on imputation, which can give accurate genotypes at the majority of TRs and is much faster.

* Even when WGS data is available, it can be expensive and time-consuming to perform genome-wide TR genotyping. For example, the `All of Us dataset <https://workbench.researchallofus.org/>`_ has WGS for around 246,000 individuals at the time of writing. Running HipSTR on a single sample takes around 16 hours/sample, and we have projected that running HipSTR on the whole cohort would cost tens to hundreds of thousands of dollars. Instead, we developed this GWAS pipeline using imputation, which can give accurate genotypes at the majority of TRs and is much faster and cheaper.

LiterallyUniqueLogin · 2024-11-25T20:13:52Z

doc/VIGNETTE-GWAS-TUTORIAL.rst

+  * `Saini et al. <https://gymreklab.com/2018/03/05/snpstr_imputation.html>`_ This panel is in the GRCh37 reference build and contains 445,725 STRs and 27M SNPs/indels for 2,504 samples from the 1000 Genomes Project. Genotypes had been imputed into 1000 Genomes samples based on calls in the Simons Simplex Collection, which is predominantly European. It is also restricted to STRs called by HipSTR with repeat unit lengths <= 6bp. This panel is available in VCF format with one file/chromosome.
+  * `Ziaei-Jam et al. <https://github.com/gymrek-lab/EnsembleTR/blob/fix-ref/README.md#version-iii-of-reference-snptr-haplotype-panel-for-imputation-of-tr-variants>`_ This panel is in the GRCh38 reference build and contains 1,070,698 TRs and 70M Snps/indels from 3,202 samples from the 1000 Genomes Project. TR genotypes are based on `EnsembleTR <https://github.com/gymrek-lab/ensembleTR>_` and contain both STRs (repeat unit 1-6bp) and VNTRs (repeat unit 7+bp). The steps below were specifically tested with this panel but should also be mostly relevant to imputation with the Saini reference panel. This panel is available in VCF and BREF3 formats with one file/chromosome.


Is there any reason to use the Saini panel anymore? If not, I'd wouldn't mention it. If we do want to mention it, I'd at least swap the order. (I also made a few wording changes in the paragraph, so don't throw those out even if you don't need the swap).

Suggested change

* `Saini et al. <https://gymreklab.com/2018/03/05/snpstr_imputation.html>`_ This panel is in the GRCh37 reference build and contains 445,725 STRs and 27M SNPs/indels for 2,504 samples from the 1000 Genomes Project. Genotypes had been imputed into 1000 Genomes samples based on calls in the Simons Simplex Collection, which is predominantly European. It is also restricted to STRs called by HipSTR with repeat unit lengths <= 6bp. This panel is available in VCF format with one file/chromosome.

* `Ziaei-Jam et al. <https://github.com/gymrek-lab/EnsembleTR/blob/fix-ref/README.md#version-iii-of-reference-snptr-haplotype-panel-for-imputation-of-tr-variants>`_ This panel is in the GRCh38 reference build and contains 1,070,698 TRs and 70M Snps/indels from 3,202 samples from the 1000 Genomes Project. TR genotypes are based on `EnsembleTR <https://github.com/gymrek-lab/ensembleTR>_` and contain both STRs (repeat unit 1-6bp) and VNTRs (repeat unit 7+bp). The steps below were specifically tested with this panel but should also be mostly relevant to imputation with the Saini reference panel. This panel is available in VCF and BREF3 formats with one file/chromosome.

* `Ziaei-Jam et al. <https://github.com/gymrek-lab/EnsembleTR/blob/fix-ref/README.md#version-iv-of-reference-snptr-haplotype-panel-for-imputation-of-tr-variants>`_ This panel is in the GRCh38 reference build and contains 1,070,698 TRs and 70M Snps/indels from 3,202 samples from the 1000 Genomes Project. TR genotypes are based on `EnsembleTR <https://github.com/gymrek-lab/ensembleTR>`_ and contain both STRs (repeat unit 1-6bp) and VNTRs (repeat unit 7+bp). The steps below were specifically tested with this panel. This panel is available in VCF and BREF3 formats with one file/chromosome.

* `Saini et al. <https://gymreklab.com/2018/03/05/snpstr_imputation.html>`_ This panel is in the GRCh37 reference build and contains 445,725 STRs and 27M SNPs/indels for 2,504 samples from the 1000 Genomes Project. Genotypes had been imputed into 1000 Genomes samples based on calls in the Simons Simplex Collection, which is predominantly European. It is also restricted to STRs called by HipSTR with repeat unit lengths <= 6bp and total repeat length less than 100bp. This panel is available in VCF format with one file/chromosome.

LiterallyUniqueLogin · 2024-11-25T20:21:58Z

doc/VIGNETTE-GWAS-TUTORIAL.rst

+_________________________________________
+
+
+Before downstream steps we would like to merge the VCFs across all batches to have a single VCF file per chromosome. While TRTools does include the :code:`mergeSTR` tool which can be used for this, we recommend using :code:`bcftools merge` for large cohorts when using imputed TRs. This is because: (1) after imputation with Beagle the set of REF/ALT alleles should be identical across each batch and match the reference panel, so there is not a need to use mergeSTR which specifically deals with managing differences in alleles represented across files and (2) bcftools is much faster. To merge batches::


Here is a suggested change on top of Arya's:

Suggested change

Before downstream steps we would like to merge the VCFs across all batches to have a single VCF file per chromosome. While TRTools does include the :code:`mergeSTR` tool which can be used for this, we recommend using :code:`bcftools merge` for large cohorts when using imputed TRs. This is because: (1) after imputation with Beagle the set of REF/ALT alleles should be identical across each batch and match the reference panel, so there is not a need to use mergeSTR which specifically deals with managing differences in alleles represented across files and (2) bcftools is much faster. To merge batches::

Before downstream steps we would like to merge the VCFs across all batches to have a single VCF file per chromosome. While TRTools does include the :code:`mergeSTR` tool which can be used for this, we recommend using :code:`bcftools merge` for large cohorts when using imputed TRs. This is because: (1) bcftools is much faster, and (2) the set of REF/ALT alleles from Beagle should be identical across each batch and match the reference panel. (mergeSTR can handle when alleles have been represented differently across files while bcftools cannot, but that is not the case here). To merge batches::

LiterallyUniqueLogin · 2024-11-25T20:25:04Z

doc/VIGNETTE-GWAS-TUTORIAL.rst

+
+Note: if your TR genotypes are obtained directly from WGS and wihtout imputation, you can leave out the options :code:`--update-ref-alt` and :code:`--ref-panel` and should use :code:`--dosages bestguess_norm` since you will not have the Beagle AP field.
+
+See the `annotaTR documentation page <https://github.com/gymrek-lab/TRTools/tree/master/trtools/annotaTR>`_ for full details for each option.


Suggested change

See the `annotaTR documentation page <https://github.com/gymrek-lab/TRTools/tree/master/trtools/annotaTR>`_ for full details for each option.

See the `annotaTR documentation page <https://trtools.readthedocs.io/en/stable/source/annotaTR.html>`_ for full details for each option.

LiterallyUniqueLogin · 2024-11-25T20:33:08Z

Tangential to the topic of this PR, and I don't expect it to be addressed here, but associaTR had hidden code to not only compute P values, but also mean phenotype values per average TR length (or even paired TR length genotype), which could then be used to create TR vs phenotype plots for individual TRs. Plink cannot do that. It would be good to split that functionality out on its own. If we did so, we could then documenting that plotting workflow here as well.

The plugin split function may not available for some version of bcftools. I add some notes on how to split samples without the plugin split function.

I was mean to make a suggestion but accidentally committed the changes. I removed the changes and make it in suggestions format. Sorry for the confusion.

yli091230 · 2024-12-02T22:53:35Z

doc/VIGNETTE-GWAS-TUTORIAL.rst

+	bcftools plugin split full_vcf_chr1.vcf.gz -G sample_groups.txt -Oz -o .
+
+    	for f in *.vcf.gz; do tabix -p vcf $f; done
+


Suggested change

Note, the :code:`bcftools plugin split` function may not be available for some version of bcftools. If that is the case, can use following example command to split the files to batch::

# split samples by ${batch_size}

bcftools query -l full_vcf_chr${chrom}.vcf.gz | awk -v chrom=chr${chrom} -v group_size=${batch_size} 'BEGIN {FS=OFS="\t"} {print $1,"batch"int(NR / group_size)+1"_"chrom}' > ${output_dir}/chr${chrom}/batch_files/

mkdir -p ${output_dir}/chr${chrom}/batch_files/

# creat batched sample_list files

awk -v outdir="${output_dir}/chr${chrom}/batch_files/" 'BEGIN {FS=OFS="\t"} {print $1 > outdir$2".txt"}' ${output_dir}/chr${chrom}/chr${chrom}_sample_list.txt

mkdir -p ${output_dir}/chr${chrom}/split_by_samples

# split vcf by sample batches

for batch in ${output_dir}/chr${chrom}/batch_files/*.txt; do

batch_name=$(basename ${batch} | cut -d "." -f 1)

bcftools view -S ${batch} ${output_dir}/chr${chrom}/chr${chrom}_with_af_filtered_hg38.vcf.gz -Oz -o ${output_dir}/chr${chrom}/split_by_samples/${batch_name}.vcf.gz

done

gymreklab · 2024-12-11T17:38:46Z

Adding a note to add when I go through these changes: mention bcftools older merge versions potentially swapping allele order: #244

gymreklab · 2025-01-01T00:18:07Z

Another note to add on revision: https://github.com/gymrek-lab/TRTools/blob/tr-gwas-tutorial/doc/VIGNETTE-GWAS-TUTORIAL.rst?plain=1#L150 this is misleading, since annotaTR will only output the TRs

Melissa Gymrek added 5 commits November 3, 2024 17:51

add outline for tr gwas tutorial

1789149

imputation step 1.1

669d519

imputation steps 1.2 and 1.3

1237869

adding steps 2 and 3 of gwas tutorial

a81d9e5

add instructions for WGS genotypes

1ec2cc0

gymreklab marked this pull request as ready for review November 6, 2024 01:32

gymreklab assigned LiterallyUniqueLogin and nicholema Nov 6, 2024

aryarm reviewed Nov 6, 2024

View reviewed changes

LiterallyUniqueLogin reviewed Nov 25, 2024

View reviewed changes

yli091230 added 2 commits December 2, 2024 14:35

Update VIGNETTE-GWAS-TUTORIAL.rst

65f4311

The plugin split function may not available for some version of bcftools. I add some notes on how to split samples without the plugin split function.

Update VIGNETTE-GWAS-TUTORIAL.rst

5e735d2

I was mean to make a suggestion but accidentally committed the changes. I removed the changes and make it in suggestions format. Sorry for the confusion.

yli091230 reviewed Dec 2, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: Add TR GWAS tutorial #242

docs: Add TR GWAS tutorial #242

gymreklab commented Nov 6, 2024

nicholema commented Nov 6, 2024

aryarm left a comment •

edited

Loading

aryarm Nov 6, 2024

aryarm Nov 6, 2024 •

edited

Loading

aryarm Nov 6, 2024 •

edited

Loading

LiterallyUniqueLogin Nov 25, 2024

aryarm Nov 6, 2024

aryarm Nov 6, 2024 •

edited

Loading

aryarm Nov 6, 2024

aryarm Nov 6, 2024

aryarm Nov 6, 2024 •

edited

Loading

LiterallyUniqueLogin Nov 25, 2024

LiterallyUniqueLogin Nov 25, 2024

LiterallyUniqueLogin Nov 25, 2024

LiterallyUniqueLogin Nov 25, 2024

LiterallyUniqueLogin Nov 25, 2024

LiterallyUniqueLogin commented Nov 25, 2024

yli091230 Dec 2, 2024

gymreklab commented Dec 11, 2024

gymreklab commented Jan 1, 2025


		The :code:`split` command above will result in files :code:`batch1_chr21.vcf.gz` and :code:`batch2_chr21.vcf.gz`.

		An full workflow for generating these VCF subsets in All of Us is available from our `CAST subset VCF workflow <https://github.com/CAST-genomics/cast-workflows/tree/main/subset_vcf>`_ page. (Note in that workflow given the cohort size, we further broke up the split step by genomic region and then concatenate the results for all the regions from each chromosome in a final step).

	An full workflow for generating these VCF subsets in All of Us is available from our `CAST subset VCF workflow <https://github.com/CAST-genomics/cast-workflows/tree/main/subset_vcf>`_ page. (Note in that workflow given the cohort size, we further broke up the split step by genomic region and then concatenate the results for all the regions from each chromosome in a final step).
	A full workflow for generating these VCF subsets in All of Us is available from our `CAST subset VCF workflow <https://github.com/CAST-genomics/cast-workflows/tree/main/subset_vcf>`_ page. (Note: Given the cohort size, we had to further break up the split step by genomic region, and then concatenate the results for all the regions from each chromosome in a final step).

		_________________________________________


		Before downstream steps we would like to merge the VCFs across all batches to have a single VCF file per chromosome. While TRTools does include the :code:`mergeSTR` tool which can be used for this, we recommend using :code:`bcftools merge` for large cohorts when using imputed TRs. This is because: (1) after imputation with Beagle the set of REF/ALT alleles should be identical across each batch and match the reference panel, so there is not a need to use mergeSTR which specifically deals with managing differences in alleles represented across files and (2) bcftools is much faster. To merge batches::


		Before downstream steps we would like to merge the VCFs across all batches to have a single VCF file per chromosome. While TRTools does include the :code:`mergeSTR` tool which can be used for this, we recommend using :code:`bcftools merge` for large cohorts when using imputed TRs. This is because: (1) after imputation with Beagle the set of REF/ALT alleles should be identical across each batch and match the reference panel, so there is not a need to use mergeSTR which specifically deals with managing differences in alleles represented across files and (2) bcftools is much faster. To merge batches::

		bcftools merge batch1_chr${chrom}_imputed_TRs.vcf.gz batch2_chr${chrom}_imputed_TRs.vcf.gz ... -Oz -o merged_${chrom}_TRs.vcf.gz

	Note: if your TR genotypes are obtained directly from WGS and wihtout imputation, you can leave out the options :code:`--update-ref-alt` and :code:`--ref-panel` and should use :code:`--dosages bestguess_norm` since you will not have the Beagle AP field.
	Note: if your TR genotypes are obtained directly from WGS and without imputation, you can leave out the options :code:`--update-ref-alt` and :code:`--ref-panel`. You should use :code:`--dosages bestguess_norm` since you will not have the Beagle AP field.


		The steps below assume you will be imputing, rather than directly genotyping, TRs for your cohort. There are multiple scenarios when imputing TRs might be preferred over direct genotyping:

		* If you only have genotype data (typically SNPs and indels) available for your GWAS cohort and therefore cannot directly genotype TRs

	* If you only have genotype data (typically SNPs and indels) available for your GWAS cohort and therefore cannot directly genotype TRs
	* If you do not have raw sequencing data (e.g. CRAM files) and only have genotype data (e.g. SNP and indel calls) in your GWAS cohort and therefore cannot directly genotype TRs

		* `Saini et al. <https://gymreklab.com/2018/03/05/snpstr_imputation.html>`_ This panel is in the GRCh37 reference build and contains 445,725 STRs and 27M SNPs/indels for 2,504 samples from the 1000 Genomes Project. Genotypes had been imputed into 1000 Genomes samples based on calls in the Simons Simplex Collection, which is predominantly European. It is also restricted to STRs called by HipSTR with repeat unit lengths <= 6bp. This panel is available in VCF format with one file/chromosome.
		* `Ziaei-Jam et al. <https://github.com/gymrek-lab/EnsembleTR/blob/fix-ref/README.md#version-iii-of-reference-snptr-haplotype-panel-for-imputation-of-tr-variants>`_ This panel is in the GRCh38 reference build and contains 1,070,698 TRs and 70M Snps/indels from 3,202 samples from the 1000 Genomes Project. TR genotypes are based on `EnsembleTR <https://github.com/gymrek-lab/ensembleTR>_` and contain both STRs (repeat unit 1-6bp) and VNTRs (repeat unit 7+bp). The steps below were specifically tested with this panel but should also be mostly relevant to imputation with the Saini reference panel. This panel is available in VCF and BREF3 formats with one file/chromosome.


		Note: if your TR genotypes are obtained directly from WGS and wihtout imputation, you can leave out the options :code:`--update-ref-alt` and :code:`--ref-panel` and should use :code:`--dosages bestguess_norm` since you will not have the Beagle AP field.

		See the `annotaTR documentation page <https://github.com/gymrek-lab/TRTools/tree/master/trtools/annotaTR>`_ for full details for each option.

		bcftools plugin split full_vcf_chr1.vcf.gz -G sample_groups.txt -Oz -o .

		for f in *.vcf.gz; do tabix -p vcf $f; done

+Note, the :code:`bcftools plugin split` function may not be available for some version of bcftools. If that is the case, can use following example command to split the files to batch::
+	# split samples by ${batch_size}
+	bcftools query -l full_vcf_chr${chrom}.vcf.gz |  awk -v chrom=chr${chrom} -v group_size=${batch_size} 'BEGIN {FS=OFS="\t"} {print $1,"batch"int(NR / group_size)+1"_"chrom}' > ${output_dir}/chr${chrom}/batch_files/
+	mkdir -p ${output_dir}/chr${chrom}/batch_files/
+	# creat batched sample_list files
+  	awk -v outdir="${output_dir}/chr${chrom}/batch_files/" 'BEGIN {FS=OFS="\t"} {print $1 > outdir$2".txt"}' ${output_dir}/chr${chrom}/chr${chrom}_sample_list.txt
+ 	mkdir -p ${output_dir}/chr${chrom}/split_by_samples
+	# split vcf by sample batches
+  	for batch in ${output_dir}/chr${chrom}/batch_files/*.txt; do
+    		batch_name=$(basename ${batch} | cut -d "." -f 1)
+    		bcftools view -S ${batch} ${output_dir}/chr${chrom}/chr${chrom}_with_af_filtered_hg38.vcf.gz -Oz -o ${output_dir}/chr${chrom}/split_by_samples/${batch_name}.vcf.gz
+  	done

docs: Add TR GWAS tutorial #242

Are you sure you want to change the base?

docs: Add TR GWAS tutorial #242

Conversation

gymreklab commented Nov 6, 2024

Checklist

nicholema commented Nov 6, 2024

aryarm left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aryarm Nov 6, 2024 • edited Loading

Choose a reason for hiding this comment

aryarm Nov 6, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aryarm Nov 6, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aryarm Nov 6, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

LiterallyUniqueLogin commented Nov 25, 2024

Choose a reason for hiding this comment

gymreklab commented Dec 11, 2024

gymreklab commented Jan 1, 2025

aryarm left a comment •

edited

Loading

aryarm Nov 6, 2024 •

edited

Loading

aryarm Nov 6, 2024 •

edited

Loading

aryarm Nov 6, 2024 •

edited

Loading

aryarm Nov 6, 2024 •

edited

Loading