nextflow-kraken2

A relatively simple metagenomics analysis pipeline written in Nextflow [1]. The pipeline is based on kraken2/bracken and kaiju, and is supplemented with Krona visualizations and interactive HTML tables. It is designed to obtain taxonomic and abundance information for many samples rather than to compare different taxonomy assignment tools, although it can be used for that as well.

Description

The pipeline runs in a docker container by default. Both Illumina and Nanopore data can be processed (separately). For a set of fastq files it executes:

  • fastp - filter and trim reads with default parameters
  • kraken2 [2] - taxonomic assignment of the reads
  • bracken [3] - abundance estimation at a single level in the taxonomic tree, e.g. species, using the kraken2 output
  • kaiju [4] - taxonomic classification of the reads based on maximum exact matches on protein level
  • krona [5] - plots are generated from the output of kraken2
  • DataTables - generates an interactive HTML table with the results from bracken for each sample, as well as a summary table for all the samples
  • MultiQC [6] - aggregates the results into a single html report

The pipeline runs kraken2/bracken or kaiju depending on the parameters supplied: use --kraken_db to run kraken2/bracken or --kaiju_db to run kaiju (or both parameters to run both).

The --kraken_db parameter is a path to a previously downloaded kraken2 database. A collection of ready-to-use kraken2/bracken RefSeq indexes can be downloaded from here.

The --kaiju_db can be one of refseq, progenomes, viruses, plasmids, fungi, nr, nr_euk, mar or rvdb. See the links above for available databases for each tool.

If none of these parameters is used, the pipeline will just run fastp.
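The parameter combinations above can be sketched as a small dry-run helper that only prints the command it would run. `build_cmd` and `is_kaiju_db` are hypothetical names introduced for illustration and are not part of the pipeline. Options for pointing the pipeline at your reads are not described in this section and are left out, so check `--help` before running anything.

```shell
# Sketch only: print (do not run) the command for a chosen combination of databases.
# build_cmd and is_kaiju_db are hypothetical helpers, not part of the pipeline.

is_kaiju_db() {
  # Accept only the kaiju database names listed in this README.
  case "$1" in
    refseq|progenomes|viruses|plasmids|fungi|nr|nr_euk|mar|rvdb) return 0 ;;
    *) return 1 ;;
  esac
}

build_cmd() {
  kraken_db="$1"   # path to a kraken2 database, or empty to skip kraken2/bracken
  kaiju_db="$2"    # kaiju database name, or empty to skip kaiju
  cmd="nextflow run angelovangel/nextflow-kraken2"
  if [ -n "$kraken_db" ]; then
    cmd="$cmd --kraken_db $kraken_db"
  fi
  if [ -n "$kaiju_db" ]; then
    if ! is_kaiju_db "$kaiju_db"; then
      echo "unknown kaiju database: $kaiju_db" >&2
      return 1
    fi
    cmd="$cmd --kaiju_db $kaiju_db"
  fi
  echo "$cmd"
}

build_cmd /path/to/kraken2_db ""        # kraken2/bracken only
build_cmd "" viruses                    # kaiju only
build_cmd /path/to/kraken2_db viruses   # both
build_cmd "" ""                         # neither: only fastp would run
```

Printing the command first makes it easy to review the chosen databases before starting a long run.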

Installation and running the pipeline

There is nothing to install as long as you have Docker and Nextflow. Choose a kraken2 and/or a kaiju database (see below), and run the pipeline:

# run with a test dataset (included)
nextflow run angelovangel/nextflow-kraken2 -profile test

# see options and how to run
nextflow run angelovangel/nextflow-kraken2 --help
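
Before running the commands above, it can help to confirm that both prerequisites are on your `PATH`. The following is a minimal sketch; `check_prereqs` is a hypothetical helper name, not something the pipeline provides.

```shell
# Sketch: report which of the given tools cannot be found on PATH.
check_prereqs() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"   # 0 if everything was found, 1 otherwise
}

check_prereqs docker nextflow || echo "install the missing tools first" >&2
```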

Output

All output files are in the folder results-kraken2, which is located in the folder containing the reads used for running the pipeline. An example of the outputs, generated with a small Illumina dataset, can be downloaded here.

The outputs are:

  • trimmed_fastq/ - directory with the fastq files after trimming; these are also used for taxonomic profiling
  • bracken_summary_heatmap/table.html - standalone html files with summary information from bracken. Note that these files are generated only if there are fewer than 34 samples
  • bracken_summary_long/wide.csv - summary bracken information (all taxa found in all samples), in long and wide formats
  • kraken2taxonomy_krona.html - an interactive Krona plot of the kraken2 output for all samples
  • samples/ - directory with individual (per sample) kraken2 and bracken-corrected report files and with the abundance table from bracken (as html and tsv). Tip: the report files can be directly imported in Pavian for nice interactive visualizations.
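
The per-sample report files can also be inspected from the command line. The sketch below assumes the standard six-column, tab-separated kraken2 report layout (percentage, reads in clade, reads assigned directly, rank code, NCBI taxid, indented name). `top_species` is a hypothetical helper, and the report path in the usage comment is a placeholder because the exact file names inside samples/ are not listed here.

```shell
# Sketch: list the most abundant species in one kraken2-style report.
top_species() {
  report="$1"    # path to a report file
  n="${2:-10}"   # number of species to show (default 10)
  awk -F '\t' '$4 == "S" {
    name = $6
    sub(/^ +/, "", name)          # strip the indentation that encodes tree depth
    printf "%s\t%s\n", $2, name   # clade read count, species name
  }' "$report" | sort -k1,1nr | head -n "$n"
}

# Usage (placeholder path):
#   top_species results-kraken2/samples/<your_report_file> 5
```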

Choosing a kraken2 and/or kaiju database

--kraken_db

An absolute path to a folder containing a kraken2 database. See the kraken2 homepage or Ben Langmead's collection for a list of available pre-built databases. These databases include the required Bracken files (for read lengths 50, 100, 150, 200 and 250). Take care to set the --readlen parameter to match the read length of your data.
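
One way to pick a --readlen value is to measure the mean read length of a sample and take the nearest length for which Bracken files exist. This is a rough sketch under some assumptions: the input is an uncompressed FASTQ file with four lines per record, and the Bracken file follows the usual `database<readlen>mers.kmer_distrib` naming inside the database folder. All three function names are hypothetical.

```shell
# Sketch: derive a --readlen value from the data.

mean_read_length() {
  # Sequence lines are every 4th line, starting at line 2.
  awk 'NR % 4 == 2 { total += length($0); n++ }
       END { if (n > 0) printf "%d\n", total / n }' "$1"
}

closest_readlen() {
  # Pick the nearest of the read lengths shipped with the pre-built databases.
  awk -v mean="$1" 'BEGIN {
    n = split("50 100 150 200 250", opts, " ")
    best = opts[1]
    for (i = 2; i <= n; i++) {
      d = opts[i] - mean; if (d < 0) d = -d
      b = best - mean;    if (b < 0) b = -b
      if (d < b) best = opts[i]
    }
    print best
  }'
}

has_bracken_files() {
  # $1 = kraken2 database folder, $2 = read length
  [ -f "$1/database${2}mers.kmer_distrib" ]
}

# Usage (placeholder paths):
#   len=$(closest_readlen "$(mean_read_length sample.fastq)")
#   has_bracken_files /path/to/kraken2_db "$len" && echo "use --readlen $len"
```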

Note: although still controversial, recent work suggests that kraken2 may perform better than QIIME in the analysis of 16S amplicons.

--kaiju_db

This argument can be one of refseq, progenomes, viruses, plasmids, fungi, nr, nr_euk, mar or rvdb. When this parameter is used, a source database and the taxonomy files are downloaded from the NCBI FTP server, converted into a protein database and indexed (kaiju-makedb). Check the memory and disk space requirements here before using this option.

References

This pipeline just uses some really nice work from others:

[1] P. Di Tommaso, et al. Nextflow enables reproducible computational workflows. Nature Biotechnology 35, 316–319 (2017) https://doi.org/10.1038/nbt.3820

[2] Wood, D.E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol 20, 257 (2019) https://doi.org/10.1186/s13059-019-1891-0

[3] Lu J, Breitwieser FP, Thielen P, Salzberg SL. 2017. Bracken: estimating species abundance in metagenomics data. PeerJ Computer Science 3:e104 https://doi.org/10.7717/peerj-cs.104

[4] Menzel, P., Ng, K. & Krogh, A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun 7, 11257 (2016). https://doi.org/10.1038/ncomms11257

[5] Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics. 2011;12:385. Published 2011 Sep 30. https://doi.org/10.1186/1471-2105-12-385

[6] Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics (2016). https://doi.org/10.1093/bioinformatics/btw354