diff --git a/paper/paper.bib b/paper/paper.bib
index 3fbe6f9..da90e37 100644
--- a/paper/paper.bib
+++ b/paper/paper.bib
@@ -1,6 +1,6 @@
-@online{ahrens21_genomic,
+@online{ahrens21_genomicconstraints,
  title = {Genomic Constraints to Drought Adaptation},
- author = {Ahrens, Collin W. and Murray, Kevin and Mazanec, Richard A. and Ferguson, Scott and Bragg, Jason and Jones, Ashley and Tissue, David T. and Byrne, Margaret and Borevitz, Justin O. and Rymer, Paul D.},
+ author = {Ahrens, Collin W. and Murray, Kevin D. and Mazanec, Richard A. and Ferguson, Scott and Bragg, Jason and Jones, Ashley and Tissue, David T. and Byrne, Margaret and Borevitz, Justin O. and Rymer, Paul D.},
  date = {2021-08-08},
  eprinttype = {bioRxiv},
  eprintclass = {New Results},
@@ -350,9 +350,9 @@ @article{murray17_kwipkmer
  file = {/home/kevin/work/bibliography/pdfs/murray17_kwip__plos_computational_biology__kwip.pdf}
 }

-@article{murray19_landscape,
+@article{murray19_landscapedrivers,
  title = {Landscape Drivers of Genomic Diversity and Divergence in Woodland {{Eucalyptus}}},
- author = {Murray, Kevin D and Janes, Jasmine K and Jones, Ashley and Bothwell, Helen M and Andrew, Rose L and Borevitz, Justin O},
+ author = {Murray, Kevin D. and Janes, Jasmine K and Jones, Ashley and Bothwell, Helen M and Andrew, Rose L and Borevitz, Justin O},
  date = {2019-12},
  journaltitle = {Molecular Ecology},
  shortjournal = {Mol Ecol},
diff --git a/paper/paper.md b/paper/paper.md
index 71030e3..a4e1cf4 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -39,7 +39,7 @@ bibliography: paper.bib
 Acanthophis is a comprehensive pipeline for the joint discovery and analysis of both plant genetic variation and variation in the composition and abundance of plant-associated microbiomes.
-Implemented in Snakemake[@koster12_snakemakescalable], Acanthophis handles data from raw FASTQ read files through quality control, alignment of the reads to a plant reference, variant calling, taxonomic classification and quantification of microbes, and metagenome analysis.
+Implemented in Snakemake [@koster12_snakemakescalable], Acanthophis handles data from raw FASTQ read files through quality control, alignment of the reads to a plant reference, variant calling, taxonomic classification and quantification of microbes, and metagenome analysis.
 The workflow contains numerous practical optimisations, both to reduce disk space usage and maximise utilisation of computational resources.
 Acanthophis is available under the Mozilla Public Licence v2 at as a python package installable from conda or PyPI (`pip install acanthophis`).
@@ -49,11 +49,11 @@ Understanding plant biology benefits from ecosystem-scale analysis of genetic va
 Such analyses are often data intensive, particularly at the scale required for quantitative analyses, i.e. thousands of host individuals [@regalado20_combining; @karasov22_drought].
 They demand computationally-efficient pipelines that perform both host genotyping and host-associated microbiome characterisation in a consistent, flexible, and reproducible fashion.

-Currently, no such unified pipelines exist. Previous pipelines perform only a subset of these tasks (e.g., Snakemake's variant calling pipeline @koster21_snakemakeworkflows). In addition, most host-aware microbiome analysis pipelines do not allow for host genotyping and/or assume an animal host (e.g. Taxprofiler @yates23_nfcore). Acanthophis has attracted many users, and has been referred to in peer-reviewed journal articles and preprints (e.g., @murray19_landscape; @ahrens21_genomic).
+Currently, no such unified pipelines exist. Previous pipelines perform only a subset of these tasks (e.g. Snakemake's variant calling pipeline; @koster21_snakemakeworkflows). In addition, most host-aware microbiome analysis pipelines do not allow for host genotyping and/or assume an animal host (e.g. Taxprofiler; @yates23_nfcore). Acanthophis has attracted many users, and has been referred to in peer-reviewed journal articles and preprints (e.g. @murray19_landscapedrivers; @ahrens21_genomicconstraints).

 # Components and Features

-Acanthophis can be configured to do any of the following analyses: mapping reads to a reference, calling variants, annotating variant effects, estimating genetic distances *de novo*, and profiling and/or assembling metagenomes.
+Acanthophis is a pipeline for the analysis of plant population resequencing pipeline. It expect short-read shotgun whole (meta-)genome sequencing data, typically of plants collected in the field. A typical dataset might be 10s-1000s of samples from one or multiple closely related species, sequenced with 2x150bp paired-end short read sequencing. In a plant-microbe interaction genomics study, these plants and therefore sequencing libraries can contain microbes (a "hologenome"), however datasets focusing only on host genome variation are also catered for. Acanthophis can be configured to do any of the following analyses: mapping reads to a reference, calling variants, annotating variant effects, estimating genetic distances *de novo*, and profiling and/or assembling metagenomes. While we developed Acanthophis to handle plant data, there is no reason why it cannot be applied to other taxa, however some parameters may need adjustment.

 Across the entire pipeline, we operate on 'sample sets', named groups of one or more samples; each sample can be in any number of sample sets. For each sample set, we can configure the analyses to run (most can be disabled if not needed). We can also configure tool-specific settings or thresholds. The pipeline is configured via a global `config.yaml` file. We provide a documented template.
@@ -63,14 +63,15 @@ Input data consists FASTQ files per **run** of each **library** corresponding to

 ## Stage 2: Alignment to reference(s)

-For read alignment to reference genomes we provide several configurable aligners, currently `BWA MEM`[@li13_aligningsequence], `NGM`[@sedlazeck13_nextgenmapfast], and `minimap2`[@li18_minimap2pairwise;@li21_newstrategies]. We then merge per-runlib BAMs to per-sample BAMs, and use `samtools markdup`[@li09_sequencealignment;@danecek21_twelveyears] to mark duplicate reads. Input reference genomes should be uncompressed, `samtools faidx`ed FASTA files.
+For read alignment to reference genomes we provide several configurable aligners, currently `BWA MEM` [@li13_aligningsequence], `NGM` [@sedlazeck13_nextgenmapfast], and `minimap2` [@li18_minimap2pairwise;@li21_newstrategies]. We then merge per-runlib BAMs to per-sample BAMs, and use `samtools markdup` [@li09_sequencealignment;@danecek21_twelveyears] to mark duplicate reads. Input reference genomes should be uncompressed, `samtools faidx`ed FASTA files.

 ## Stage 3: Variant Calling

 We provide `bcftools mpileup` or `freebayes` to call raw variants, using priors and thresholds configurable for each sample set. We then normalise variants with `bcftools norm`, split multiallelic variants, filter each allele with per-sample set filters, and combine filter-passing alleles back into unique sites. Resulting variants are indexed and statistics calculated (bcftools stats). To parallelize variant calling: either a static list of non-overlapping genome windows is used (as supplied in a BED file), or mosdepth is used to break the genome into buckets with approximately equal amounts of data.

 ## Stage 4: Taxon profiling
-We use any of Kraken 2 (with or without Bracken [@lu17_brackenestimating]), Kaiju[@menzel16_fastsensitive], Centrifuge[@kim16_centrifugerapid], and Diamond[@buchfink15_fastsensitive] to create taxonomic profiles for each sample against any number of supplied databases. We then use taxpasta[@beber23_taxpastataxonomic] to combine multiple profiles into tables for easy downstream use.
+
+We use any of Kraken 2 [@wood19_improved], Bracken [@lu17_brackenestimating], Kaiju [@menzel16_fastsensitive], Centrifuge [@kim16_centrifugerapid], and Diamond [@buchfink15_fastsensitive] to create taxonomic profiles for each sample against any number of supplied databases. We then use taxpasta [@beber23_taxpastataxonomic] to combine multiple profiles into tables for easy downstream use.

 ## Stage 5: *De novo* Estimates of Genetic Dissimilarity

@@ -78,7 +79,7 @@ Acanthophis can use either `kWIP` [@murray17_kwipkmer] or Mash [@ondov16_mashfas

 ## Stage 5: Reporting and Statistics

-Throughout all pipeline stages, various tools output summaries of their actions and/or outputs. We optionally combine these into unified reports by pipeline stage and sample set using MultiQC[@ewels16_multiqcsummarize].
+Throughout all pipeline stages, various tools output summaries of their actions and/or outputs. We optionally combine these into unified reports by pipeline stage and sample set using MultiQC [@ewels16_multiqcsummarize].

 # Acknowledgements