docs: Add TR GWAS tutorial #242
base: master
Conversation
They look good to me! Should we mention target TR genotyping and GWAS as well? People might be interested in that.
I know my review wasn't solicited!
but I was reading through this anyway because I was curious, and I noticed some typos, so I thought I'd quickly mention em
* **SNP-TR reference panel (VCF or BREF3 format)**. You will need a precomputed reference panel that contains phased SNP and TR genotypes from an orthogonal cohort. The panel needs to be in either VCF or BREF3 format to be compatible with Beagle. We have generated two such panels:

* `Saini et al. <https://gymreklab.com/2018/03/05/snpstr_imputation.html>`_ This panel is in the GRCh37 reference build and contains 445,725 STRs and 27M SNPs/indels for 2,504 samples from the 1000 Genomes Project. Genotypes had been imputed into 1000 Genomes samples based on calls in the Simons Simplex Collection, which is predominantly European. It is also restricted to STRs called by HipSTR with repeat unit lengths <= 6bp. This panel is available in VCF format with one file/chromosome.
* `Ziaei-Jam et al. <https://github.com/gymrek-lab/EnsembleTR/blob/fix-ref/README.md#version-iii-of-reference-snptr-haplotype-panel-for-imputation-of-tr-variants>`_ This panel is in the GRCh38 reference build and contains 1,070,698 TRs and 70M Snps/indels from 3,202 samples from the 1000 Genomes Project. TR genotypes are based on `EnsembleTR <https://github.com/gymrek-lab/ensembleTR>_` and contain both STRs (repeat unit 1-6bp) and VNTRs (repeat unit 7+bp). The steps below were specifically tested with this panel but should also be mostly relevant to imputation with the Saini reference panel. This panel is available in VCF and BREF3 formats with one file/chromosome.
Suggested change:

* `Ziaei-Jam et al. <https://github.com/gymrek-lab/EnsembleTR/blob/fix-ref/README.md#version-iii-of-reference-snptr-haplotype-panel-for-imputation-of-tr-variants>`_ This panel is in the GRCh38 reference build and contains 1,070,698 TRs and 70M SNPs/indels from 3,202 samples from the 1000 Genomes Project. TR genotypes are based on `EnsembleTR <https://github.com/gymrek-lab/ensembleTR>`_ and contain both STRs (repeat unit 1-6bp) and VNTRs (repeat unit 7+bp). The steps below were specifically tested with this panel but should also be mostly relevant to imputation with the Saini reference panel. This panel is available in VCF and BREF3 formats with one file/chromosome.
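
A side note, not part of the suggested text above: whichever panel you use, it is convenient to index the downloaded panel VCF so that bcftools and similar tools can query it by region. A minimal sketch, with a placeholder file name::

tabix -p vcf refpanel_chr21.vcf.gz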
Background
----------

To run a GWAS, you will need to have genotypes and phenotype data for your phenotype of interest available for a large cohort. Whereas SNP-based GWAS typically tests each variant for linear association between the number of alternate alleles an individual has vs. the phenotype value, the TR-based GWAS below tests for linear association between "TR dosages" and the phenotype at each TR. There are multiple ways to define TR dosage defined below, but all are related to the mean number of copies of the repeat each person has across their two alleles at a particular TR. This overall framework is similar to that described in `Margoliash et al. 2023 <https://pubmed.ncbi.nlm.nih.gov/38116119/>`_.
Suggested change:

To run a GWAS, you will need to have genotypes and phenotype data for a large cohort. Whereas SNP-based GWAS typically tests each variant for linear association between the number of alternate alleles of each individual vs. the phenotype value, the TR-based GWAS below tests for linear association between "TR dosages" and the phenotype at each TR. There are multiple ways to define a TR dosage below, but all are related to the mean number of copies of the repeat each person has across their two alleles at a particular TR. This overall framework is similar to that described in `Margoliash et al. 2023 <https://pubmed.ncbi.nlm.nih.gov/38116119/>`_.
rewording slightly to improve readability
feel free to ignore!
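
To make the "TR dosage" idea above concrete, here is a purely illustrative calculation (not taken from the tutorial): one simple dosage definition is the mean repeat copy number across a sample's two alleles at a TR::

# e.g. a 3bp repeat unit with allele lengths 30bp and 36bp -> (10 + 12) / 2 = 11 copies on average
echo "30 36 3" | awk '{ print ($1/$3 + $2/$3) / 2 }'   # prints 11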
The steps below assume you will be imputing, rather than directly genotyping, TRs for your cohort. There are multiple scenarios when imputing TRs might be preferred over direct genotyping:

* If you only have genotype data (typically SNPs and indels) available for your GWAS cohort and therefore cannot directly genotype TRs
* Even when WGS data is available, it can be expensive and time-consuming to perform genome-wide TR genotyping. For example, the pipeline below is based on what our group has applied to the `All of Us dataset <https://workbench.researchallofus.org/>`_ which at the time of writing has WGS for around 246,000 individuals. Running HipSTR on a single sample takes around 16 hours/sample, and would cost tens to hundreds of thousands of dollars at this scale. Instead, we can rely on imputation, which can give accurate genotypes at the majority of TRs and is much faster.
Could even mention that Margoliash et al. compared imputed STRs to WGS-based STRs for association testing and found the results to be concordant, so readers can go look at those results if they want.
I would leave that out - we showed that results are reasonably concordant for TRs which passed strong fine-mapping requirements, but that does not tell us how concordant association results will be for the genome at large.
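
For readers curious what the imputation step itself looks like, here is a hedged sketch of running Beagle against a SNP-TR reference panel (the jar, map, and file names are placeholders and the tutorial's own commands should take precedence; see the Beagle documentation for the full list of options)::

java -Xmx25g -jar beagle.jar \
    gt=batch1_chr21.vcf.gz \
    ref=refpanel_chr21.bref3 \
    map=plink.chr21.GRCh38.map \
    out=batch1_chr21_imputed \
    nthreads=4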
The :code:`split` command above will result in files :code:`batch1_chr21.vcf.gz` and :code:`batch2_chr21.vcf.gz`.

An full workflow for generating these VCF subsets in All of Us is available from our `CAST subset VCF workflow <https://github.com/CAST-genomics/cast-workflows/tree/main/subset_vcf>`_ page. (Note in that workflow given the cohort size, we further broke up the split step by genomic region and then concatenate the results for all the regions from each chromosome in a final step).
Suggested change:

A full workflow for generating these VCF subsets in All of Us is available from our `CAST subset VCF workflow <https://github.com/CAST-genomics/cast-workflows/tree/main/subset_vcf>`_ page. (Note: Given the cohort size, we had to further break up the split step by genomic region, and then concatenate the results for all the regions from each chromosome in a final step).
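
For reference, re-assembling per-region outputs into a single per-chromosome batch VCF can be done with :code:`bcftools concat`; a minimal sketch with placeholder file names (not the exact commands from that workflow)::

bcftools concat -Oz -o batch1_chr21.vcf.gz \
    batch1_chr21_region1.vcf.gz \
    batch1_chr21_region2.vcf.gz ...
tabix -p vcf batch1_chr21.vcf.gz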
Before downstream steps we would like to merge the VCFs across all batches to have a single VCF file per chromosome. While TRTools does include the :code:`mergeSTR` tool which can be used for this, we recommend using :code:`bcftools merge` for large cohorts when using imputed TRs. This is because: (1) after imputation with Beagle the set of REF/ALT alleles should be identical across each batch and match the reference panel, so there is not a need to use mergeSTR which specifically deals with managing differences in alleles represented across files and (2) bcftools is much faster. To merge batches::
Suggested change:

Before downstream steps we would like to merge the VCFs across all batches to have a single VCF file per chromosome. While TRTools does include the :code:`mergeSTR` tool which can be used for this, we recommend using :code:`bcftools merge` for large cohorts when using imputed TRs. This is because: (1) bcftools is much faster, and (2) the set of REF/ALT alleles from Beagle should be identical across each batch and match the reference panel. We prefer to use mergeSTR when alleles can be represented differently across files. To merge batches::
rewording to make it a bit easier to read. Otherwise, the sentence felt a bit long
bcftools merge batch1_chr${chrom}_imputed_TRs.vcf.gz batch2_chr${chrom}_imputed_TRs.vcf.gz ... -Oz -o merged_${chrom}_TRs.vcf.gz
Suggested change:

bcftools merge -Oz -o merged_${chrom}_TRs.vcf.gz \
    batch1_chr${chrom}_imputed_TRs.vcf.gz \
    batch2_chr${chrom}_imputed_TRs.vcf.gz ...
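
Two practical notes that could be added here (not from the tutorial text): :code:`bcftools merge` expects each input VCF to be bgzipped and indexed, and the merged per-chromosome file will usually need an index for downstream region queries::

# index each batch before merging (if not already indexed), then index the merged file
for f in batch*_chr${chrom}_imputed_TRs.vcf.gz; do tabix -p vcf $f; done
tabix -p vcf merged_${chrom}_TRs.vcf.gz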
merged_${chrom}_TRs_annotated.pgen
merged_${chrom}_TRs_annotated.pvar

Note: if your TR genotypes are obtained directly from WGS and wihtout imputation, you can leave out the options :code:`--update-ref-alt` and :code:`--ref-panel` and should use :code:`--dosages bestguess_norm` since you will not have the Beagle AP field.
Suggested change:

Note: if your TR genotypes are obtained directly from WGS and without imputation, you can leave out the options :code:`--update-ref-alt` and :code:`--ref-panel`. You should use :code:`--dosages bestguess_norm` since you will not have the Beagle AP field.
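
For orientation, here is a hedged sketch of what the annotaTR invocation for the imputed case might look like. Only :code:`--dosages`, :code:`--ref-panel`, and :code:`--update-ref-alt` are named in the text above; the remaining option names, the dosage type for Beagle AP-based dosages, and the file names are assumptions, so check the annotaTR documentation for the exact spelling before running::

# hypothetical sketch - verify option names against the annotaTR docs
annotaTR --vcf merged_${chrom}_TRs.vcf.gz \
    --ref-panel refpanel_chr${chrom}.vcf.gz \
    --update-ref-alt \
    --dosages beagleap_norm \
    --outtype pgen \
    --out merged_${chrom}_TRs_annotated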
Step 3: Running GWAS with TR genotypes
--------------------------------------

Now that we have imputed TRs, we are finally ready to perform GWAS! There are two ways to do this. You can use the VCF file as input to `associaTR <https://trtools.readthedocs.io/en/stable/source/associaTR.html>`_ which is part of TRTools. Alternatively, you can use the PGEN files directly as input to `plink2 <https://www.cog-genomics.org/plink/2.0/>`_. For linear association testing, these tools should give identical results. The advantages of using plink2 over associated are that: (1) plink2 is faster, (2) you have access to a rich set of plink options, including logistic regression, filtering samples, calculating LD, etc. that are not all available in associaTR currently, and (3) you can process SNPs and TRs all in one place. Therefore, our example below performs GWAS using plink2 on the PGEN files::
Suggested change:

Now that we have imputed TRs, we are finally ready to perform GWAS! There are two ways to do this. You can use the VCF file as input to `associaTR <https://trtools.readthedocs.io/en/stable/source/associaTR.html>`_ which is part of TRTools. Alternatively, you can use the PGEN files directly as input to `plink2 <https://www.cog-genomics.org/plink/2.0/>`_. For linear association testing, these tools should give identical results. The advantages of using plink2 over associaTR are that: (1) plink2 is faster, (2) you have access to a rich set of plink options, including logistic regression, filtering samples, calculating LD, etc. that are not all available in associaTR currently, and (3) you can process SNPs and TRs all in one place. Therefore, our example below performs GWAS using plink2 on the PGEN files::
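
To give a flavor of that step, here is a hedged sketch of a plink2 linear-regression run on the PGEN fileset (the phenotype/covariate file names and the phenotype column name are placeholders; the tutorial's own command may differ)::

plink2 --pfile merged_${chrom}_TRs_annotated \
    --pheno phenotypes.tab --pheno-name my_phenotype \
    --covar covariates.tab \
    --glm hide-covar \
    --out tr_gwas_chr${chrom}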
The steps below assume you will be imputing, rather than directly genotyping, TRs for your cohort. There are multiple scenarios when imputing TRs might be preferred over direct genotyping:

* If you only have genotype data (typically SNPs and indels) available for your GWAS cohort and therefore cannot directly genotype TRs
Suggested change:

* If you do not have raw sequencing data (e.g. CRAM files) and only have genotype data (e.g. SNP and indel calls) in your GWAS cohort and therefore cannot directly genotype TRs
* Even when WGS data is available, it can be expensive and time-consuming to perform genome-wide TR genotyping. For example, the pipeline below is based on what our group has applied to the `All of Us dataset <https://workbench.researchallofus.org/>`_ which at the time of writing has WGS for around 246,000 individuals. Running HipSTR on a single sample takes around 16 hours/sample, and would cost tens to hundreds of thousands of dollars at this scale. Instead, we can rely on imputation, which can give accurate genotypes at the majority of TRs and is much faster.
Suggested change:

* Even when WGS data is available, it can be expensive and time-consuming to perform genome-wide TR genotyping. For example, the `All of Us dataset <https://workbench.researchallofus.org/>`_ has WGS for around 246,000 individuals at the time of writing. Running HipSTR on a single sample takes around 16 hours/sample, and we have projected that running HipSTR on the whole cohort would cost tens to hundreds of thousands of dollars. Instead, we developed this GWAS pipeline using imputation, which can give accurate genotypes at the majority of TRs and is much faster and cheaper.
Is there any reason to use the Saini panel anymore? If not, I wouldn't mention it. If we do want to mention it, I'd at least swap the order. (I also made a few wording changes in the paragraph, so don't throw those out even if you don't need the swap).
Suggested change:

* `Ziaei-Jam et al. <https://github.com/gymrek-lab/EnsembleTR/blob/fix-ref/README.md#version-iv-of-reference-snptr-haplotype-panel-for-imputation-of-tr-variants>`_ This panel is in the GRCh38 reference build and contains 1,070,698 TRs and 70M SNPs/indels from 3,202 samples from the 1000 Genomes Project. TR genotypes are based on `EnsembleTR <https://github.com/gymrek-lab/ensembleTR>`_ and contain both STRs (repeat unit 1-6bp) and VNTRs (repeat unit 7+bp). The steps below were specifically tested with this panel. This panel is available in VCF and BREF3 formats with one file/chromosome.
* `Saini et al. <https://gymreklab.com/2018/03/05/snpstr_imputation.html>`_ This panel is in the GRCh37 reference build and contains 445,725 STRs and 27M SNPs/indels for 2,504 samples from the 1000 Genomes Project. Genotypes had been imputed into 1000 Genomes samples based on calls in the Simons Simplex Collection, which is predominantly European. It is also restricted to STRs called by HipSTR with repeat unit lengths <= 6bp and total repeat length less than 100bp. This panel is available in VCF format with one file/chromosome.
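
One related assumption worth noting (not from the tutorial): if a panel is only distributed in VCF format, Beagle ships a bref3 utility that can convert it to BREF3, roughly along these lines (jar and file names are placeholders)::

java -jar bref3.jar refpanel_chr21.vcf.gz > refpanel_chr21.bref3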
Here is a suggested change on top of Arya's:
Before downstream steps we would like to merge the VCFs across all batches to have a single VCF file per chromosome. While TRTools does include the :code:`mergeSTR` tool which can be used for this, we recommend using :code:`bcftools merge` for large cohorts when using imputed TRs. This is because: (1) bcftools is much faster, and (2) the set of REF/ALT alleles from Beagle should be identical across each batch and match the reference panel. (mergeSTR can handle when alleles have been represented differently across files while bcftools cannot, but that is not the case here). To merge batches::
See the `annotaTR documentation page <https://github.com/gymrek-lab/TRTools/tree/master/trtools/annotaTR>`_ for full details for each option.
Suggested change:

See the `annotaTR documentation page <https://trtools.readthedocs.io/en/stable/source/annotaTR.html>`_ for full details for each option.
Tangential to the topic of this PR, and I don't expect it to be addressed here, but associaTR had hidden code to not only compute p-values, but also mean phenotype values per average TR length (or even per paired TR length genotype), which could then be used to create TR vs. phenotype plots for individual TRs. Plink cannot do that. It would be good to split that functionality out on its own. If we did so, we could then document that plotting workflow here as well.
The plugin split function may not be available in some versions of bcftools. I added some notes on how to split samples without the plugin split function.
I meant to make a suggestion but accidentally committed the changes. I removed the changes and made it in suggestion format. Sorry for the confusion.
bcftools plugin split full_vcf_chr1.vcf.gz -G sample_groups.txt -Oz -o .

for f in *.vcf.gz; do tabix -p vcf $f; done

Note, the :code:`bcftools plugin split` function may not be available in some versions of bcftools. If that is the case, you can use the following example commands to split the file into batches::

# split samples into batches of ${batch_size}
mkdir -p ${output_dir}/chr${chrom}/batch_files/
bcftools query -l full_vcf_chr${chrom}.vcf.gz | awk -v chrom=chr${chrom} -v group_size=${batch_size} 'BEGIN {FS=OFS="\t"} {print $1,"batch"int((NR-1) / group_size)+1"_"chrom}' > ${output_dir}/chr${chrom}/chr${chrom}_sample_list.txt
# create batched sample_list files
awk -v outdir="${output_dir}/chr${chrom}/batch_files/" 'BEGIN {FS=OFS="\t"} {print $1 > outdir$2".txt"}' ${output_dir}/chr${chrom}/chr${chrom}_sample_list.txt
mkdir -p ${output_dir}/chr${chrom}/split_by_samples
# split the VCF by sample batches
for batch in ${output_dir}/chr${chrom}/batch_files/*.txt; do
    batch_name=$(basename ${batch} | cut -d "." -f 1)
    bcftools view -S ${batch} full_vcf_chr${chrom}.vcf.gz -Oz -o ${output_dir}/chr${chrom}/split_by_samples/${batch_name}.vcf.gz
done
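
If you go this route, you will likely also want to index each batch VCF afterwards, analogous to the :code:`tabix` loop above::

for f in ${output_dir}/chr${chrom}/split_by_samples/*.vcf.gz; do tabix -p vcf $f; done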
A note to address when I go through these changes: mention that older bcftools merge versions can potentially swap allele order: #244
Another note to add on revision: https://github.com/gymrek-lab/TRTools/blob/tr-gwas-tutorial/doc/VIGNETTE-GWAS-TUTORIAL.rst?plain=1#L150 is misleading, since annotaTR will only output the TRs.
This PR introduces a new vignette showing how to run a TR GWAS, including TR imputation, computing dosages, and running GWAS with plink2. Not all of these steps use TRTools but I think TRTools docs is a good place to put this tutorial.
For the review: this needs to be proofread for typos, correctness, and clarity!
Checklist

* If the PR fixes a bug, please prefix the title with :code:`fix:`. Otherwise, if it introduces a new feature, please prefix it with :code:`feat:`. If it introduces a breaking change, please add an exclamation before the colon, like :code:`feat!:`. If the scope of the PR changes because of a revision to it, please update the PR title, since the title will be used in our CHANGELOG.
* Run :code:`poetry lock --no-update` to ensure the lock file stays up to date and that our dependencies are locked to their minimum versions.