███ ███ ██ ██ ██████ ██████ ██████ ███████ ██████
████ ████ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██
██ ████ ██ ██ ██ ██ ██ ██ ██ ██ ███ █████ ██████
██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██
██ ██ ██████ ██████ ██████ ██████ ███████ ██ ██
Multi-Domain Genome Recovery
Version 1.0.0
MuDoGeR version 1.0 was designed to be an easy-to-use genome recovery tool. Therefore, we created a setup procedure that requires little user input. Consequently, one important aspect of MuDoGeR usage is the output folder architecture. The file names and folder architecture are important for the pipeline to work smoothly. If you want to prepare your data using other tools, and later use MuDoGeR, please keep the folder structure and file naming according to the MuDoGeR requirements. Please, check the outputs and folders created during the successful execution of each module in the following sections.
Please note that you can click the tool's name on the text to be redirected to the tool's own documentation. You can find MuDoGeR default parameters here
Currently, MuDoGeR v1.0 only works with paired-end sequences. Future updates will add tools to work with long-read sequencing samples. The pipeline requires, as input, a metadata table in tsv (tab-separated values) format containing the samples to be processed and the path in your computer to its raw sequence reads. The metadata file should have the sample name and the path to the forward reads file from the sample in one line, followed by the same sample name and the path to the reverse reads from the sample. An example metadata file is as follows:
#Show the content of the metadata.tsv file
$ cat metadata.tsv
EA_ERX4593008 /path/to/EA_ERX4593008/raw_reads_1.fastq
EA_ERX4593008 /path/to/EA_ERX4593008/raw_reads_2.fastq
EA_ERX4593009 /path/to/EA_ERX4593009/raw_reads_1.fastq
EA_ERX4593009 /path/to/EA_ERX4593009/raw_reads_2.fastq
EA_ERX4593010 /path/to/EA_ERX4593010/raw_reads_1.fastq
EA_ERX4593010 /path/to/EA_ERX4593010/raw_reads_2.fastq
EA_ERX4593011 /path/to/EA_ERX4593011/raw_reads_1.fastq
EA_ERX4593011 /path/to/EA_ERX4593011/raw_reads_2.fastq
Please make sure your forward and reverse raw sequencing files end in "_1.fastq" and in "_2.fastq", respectively.
Please pay attention to your samples' names. Avoid using spaces and special characters in your samples' names as good practices. If possible, use only letters and numbers. Special and space characters could cause malfunctioning in some of the used tools.
Following, you have a usage tutorial for each module. Each Module's final consideration section has the description of the expected folder and file names output.
Keep in mind that you only need one single command to run each module
MuDoGeR was designed to assemble each sample separately. However, depending on your goals, you may want to co-assemble selected samples. For instance, perhaps you have replicate samples that you would like to assemble together.
To perform co-assembly, simply concatenate the reads from the samples to be assembled together before feeding them to MuDoGeR:
#Concanate all forward reads together
cat RAW_READS/*_1.fastq > RAW_READS/CONCAT_READS_1.fastq
#Concanate all reverse reads together
cat RAW_READS/*_2.fastq > RAW_READS/CONCAT_READS_2.fastq
For running all module 1, use
$ mudoger --module preprocess --meta /path/to/metadata.tsv -o /path/to/output/folder -t 20 --metaspades -m 100
The available parameters for module 1 are:
- --meta metadata table as described here
- -o output directory (mandatory)
- -m Given RAM to assembly (optional)
- -t number of threads/cores (default = 1)
- Assembler. Can be --megahit or --metaspades (default)
The -m parameter is optional, once it is calculated during step 1.b. If you specify the amount of RAM, the result from 1.b will be ignored. Be aware that RAM is a relevant bottleneck for MAG recovery.
For quality control, MuDoGeR uses the implementation present in metaWRAP. Trim Galore is used for adapter and quality trimming with default PHRED quality and stringency. Future updates will allow the option to remove host contamination using different databases.
The output directory of the read quality control module (sample_name/qc
) contains:
final_pure_reads_1.fastq pre-QC_report
final_pure_reads_2.fastq post-QC_report
The final_pure_reads_
files contain the sequences of the trimmed and decontaminated reads. The pre-QC_report
and post-QC_report
folders include the html reports for the reads before and after the read quality control.
After quality control, MuDoGeR estimates the required RAM for assembly.
Initially, it counts the 33 and 55 unique k-mers from the quality-controlled reads (final_pure_reads_1.fastq) in the pre-processed reads (forward or reversed).
The k-mer count is used by a linear regression model to predict the amount of memory necessary for assembling the reads with metaSPAdes.
Inside the calculation of resources output folder (sample_name/khmer
) the user can find the following files:
final_prediction.tsv input_for_predictR.tsv kmer-33 kmer-55 metaspades_prediction.tsv
The final_prediction.tsv
file contains the final RAM estimation in MB for assembling the sample with metaspades
There are two possible tools for assembling data: MegaHiT and metaSPAdes. Both tools are considered reliable. MegaHiT uses lower memory and is faster compared to metaSPAdes, but metaSPAdes tends to produced longer contigs. In case of very large data sets, the user may use MegaHiT.
The assembly output folder (sample_name/assembly
) should contain the following:
assembly_report.html final_assembly.fasta megahit metaspades QUAST_out
The final_assembly.fasta
is the assembled sequences that are going to be used in the other modules.
Please, be aware of the folder structure expected as a result of module 1. After a successful run of module 1 you should have the following folder structure:
output/path/
└── sample_name
├── assembly
│ ├── assembly_report.html
│ ├── final_assembly.fasta
│ ├── megahit
│ ├── metaspades
│ └── QUAST_out
├── khmer
│ ├── final_prediction.tsv
│ ├── input_for_predictR.tsv
│ ├── kmer-33
│ ├── kmer-55
│ └── metaspades_prediction.tsv
└── qc
├── final_pure_reads_1.fastq
├── final_pure_reads_2.fastq
├── post-QC_report
└── pre-QC_report
If you want to use the other MuDoGeR modules, you can copy the showed folder structure with your own data and use the same -o output/path
when running the other modules.
The relevant files generated in Module 1 used for the other modules are assembly/final_assembly.fasta
, qc/final_pure_reads_1.fastq
, and qc/final_pure_reads_2.fastq
. Please, make sure your resulted files are specificaly named as final_assembly.fasta
and final_pure_reads_1/2.fastq
.
This module integrates the prokariotic binning procedure implemented in metaWRAP using Metabat2, Maxbin2, and CONCOCT binners, a MuDoGeR bin dereplacation method, the GTDB-tk taxonomic annotation, CheckM quality estimation, Prokka metagenomic gene annotation, MuDoGeR sequence metrics and BBtools sequence metrics calculation tools.
The tools that require specific databases in this module are GTDB-tk and CheckM. Both should be ready to use after running the database-setup.sh script. See instructions here
For running all steps in module 2 use
$ mudoger --module prokaryotes --meta /path/to/metadata.tsv -o /path/to/output/folder -t 20
The available parameters for module 2 are:
- --meta metadata table as described here
- -o output directory (mandatory)
- -t number of threads/cores (default = 1)
Additional modularity for this module is scheduled to happen.
Please refer to Module 2 final consideration to understand the folder structure from Module 2.
The binnig process starts by using Metabat2, Maxbin2, and CONCOCT to bin the sequences from the final_assembly.fasta
file.
Following, the results from all binners are used to refine bacterial bins. For bacterial bins, the refinement process uses 50% minimum completeness and 10% maximum contamination as default. For archeal bins, the refinement process uses 40% minimum completeness and 30% maximum contamination as default. The refinement process used is implemented in metaWRAP. Finally, MuDoGeR removes redundant bins.
The most relevant files from this step are the prokaryotic bins inside the unique_bins/
folder.
After the binning step is completed, the resulted bins are taxonomic annotated using the GTDB-tk software and its most updated database. Following, the unique_bins/
are checked for quality using the CheckM tool. Finally, the recovered prokaryotic bins are annotated using Prokka.
Inside the metrics/
folder, you should have one folder for each tool. Inside prokka/
you will find one folder for each bin containing the outputs from the tool.
The resulted files will be used by MuDoGeR to generate a more comprehensive report of the bins, as well as further processing. If you would like to know more about the outputs of each tool, please check their respective documentation. You can find links to the tools here or by clicking on the tool's name.
Finally, MuDoGeR calculates some relevant metrics from the recovered bins, such as genome_size, number_of_scaffolds, largest_scaffold_size, N50, and N90. In addition, it also counts the number of annotated and unknown genes by prokka. BBtools is also used to extract sequence metrics. Later, MuDoGeR merges the results from the other tools and calculates the sequence quality (completeness – 5×contamination (Parks, 2018)). Bins with quality greater than or equal to 50 are considered MAGs and have their information summarised in the mags_results_summary.tsv
file.
The mags_results_summary.tsv
contains relevant annotations from the recovered MAGs. You can also check the annotated genes for each MAG by looking at the .tsv
output by Prokka for each MAG.
By the end, MuDoGeR parses the outputs from all used tools and outputs the following comprehensive results to the sample_name/prokaryotes/final_outputs/
folder. Please go to the understand outputs to better understand MuDoGeR results.
After a successful run of Module 2 you should have the following folder structure:
sample_name
└── prokaryotes
├── binning
│ ├── initial-binning
│ ├── refinement-arc
│ ├── refinement-bac
│ └── unique_bins
│ ├── bin.0.fa
│ ├── bin.10.fa
│ └── bin.11.fa
├── final_outputs
│ ├── all_bins_seq
│ ├── allbins_metrics_summary.tsv
│ ├── bins_genes_prokka_summary
│ ├── bins_metrics_summary
│ │ ├── qual_bins_checkm_summary.tsv
│ │ └── taxa_bins_gtdbtk_summary.tsv
│ ├── mags_results_summary.tsv
│ └── only_mags_seq
└── metrics
├── checkm_qc
│ ├── bins
│ ├── lineage.ms
│ ├── outputcheckm.tsv
│ └── storage
├── genome_statistics
│ ├── bbtools.tsv
│ ├── genome_metrics.tsv
│ └── prok_genomes_stats.tsv
├── GTDBtk_taxonomy
│ └── GTDB-tk_outputs
└── prokka
├── bin.0.fa/
├── bin.10.fa/
└── bin.11.fa/
Please refer to Module 3 final consideration to understand the folder structure from Module 3.
The recovery of the UViGs module integrates the viral sequence recovery tools VirSorter2, VirFinder, and VIBRANT. Later, the sequences are dereplicated using Stampede-clustergenomes. Following the putative viral contigs are analyzed and taxonomy is estimated using Vcontact2, and quality is estimated with CheckV. MuDoGeR also uses WiSH to estimate the viral-host pairs from the putative viral contigs. Finally, MuDoGeR compiles the outputs from the used tools and selects high-quality UViGs as defined by Roux, S., et al.(2019) and Nayfach, S., et al. (2021).
The tools that require specific databases in the viral module are VIBRANT, WiSH, and CheckV. All the databases should be ready to use after running the database-setup.sh script. See instructions here
For running all stpe in module 3 use
$ mudoger --module viruses --meta /path/to/metadata.tsv -o /path/to/output/folder -t 20
The available parameter for module 2 are:
- --meta metadata table as described here (mandatory)
- -o output directory (mandatory)
- -t number of threads/cores (default = 1)
In 3.a, the viral recovery tools VirSorter2, VirFinder, and VIBRANT are applied to the assembly fasta file final_assembly.fasta
created during the preprocess module. Sequences recovered with VirFinder with p-value > 0.01 and length < 1000 bp are removed. Later, the independent results of each tool are combined and dereplicated with Stampede-clustergenomes using 70% minimum coverage and 95% minimum identity.
The result files vibrant_filtered_data.txt
, virfinder_filtered_data.txt
, and virsorter2_filtered_data.txt
contain the names from the sequences identified as putative viral contigs. Those are the files used during the next steps.
The final outputs from this step are inside the dereplication/
folder. The file dereplication/uvigs_95-70.fna
contains the viral sequences dereplicated, and the file dereplication/uvigs_95-70.clstr
contains the clusters formed from all identified putative viral contigs.
Initially, MuDoGeR uses the dereplication/uvigs_95-70.fna
file in prodigal to produce the amino acid sequence from the putative viral contigs. These amino acid sequences are input for the Vcontact2 analysis and taxonomical estimation. Following, viral sequence quality is calculated using CheckV.
The main output from Vcontact2 used by MuDoGeR is the vcontact-output/genome_by_genome_overview.csv
. There you can find the estimated viral taxonomy and confidence score. For a more detailed explanation, please check the Vcontact2 documentation. Finally, you can find the outputs from CheckV inside the vcheck_quality
folder. The resulted file from CheckV is organized in vcheck_quality/quality_summary.tsv
.
The Viral-Host pair estimation process is done based on the WIsH software. For MuDoGeR to run it automatically, please keep the defined folder structure from the Prokaryotes recovery and Viral recovery. MuDoGeR will use the results from Module 2 and Module 3 to calculate the Viral-Host potential.
Inside the host_prediction/
folder you will find the Models created by WIsH for the prokaryotic MAGs. Inside the potential_host_genomes/
you will find the MAGs recovered in Module 2, and inside the uvigs/
folder, you have the fasta files for the viral contigs recovered in Module 3. Finally, you will have your main output in host_prediction/output_results/prediction.list
. There you will have the viral contigs name, the best provided MAGs match, the LogLikelihood, and the p-value for the match.
Finally, MuDoGeR retrieves all the previously created files and creates the sample_name/viruses/final_outputs/
folder. A quality filter is also applied.
The putative_vir_contigs_summary.tsv
contains the compiled information from each putative viral contig (original_contig, uvig_length, provirus, proviral_length, gene_count, viral_genes, host_genes, checkv_quality, miuvig_quality, completeness, completeness_method, contamination, kmer_freq, warnings, putative_host, and likelihood). The uvigs_high_quality_summary.tsv
is a subset from putative_vir_contigs_summary.tsv
with only defined UViGs.
By the end, MuDoGeR parses the outputs from all used tools and outputs the following comprehensive results to the sample_name/viruses/final_outputs/
folder. Please go to the understand outputs to better understand MuDoGeR results.
After a successful run of Module 3 you should have the following folder structure:
After step 3.a is finished, you should have the following folder structure:
sample_name
└── viruses
└── investigation
├── dereplication
│ ├── uvigs_95-70.clstr
│ ├── uvigs_95-70.fna
│ ├── uvigs-cover.csv
│ ├── uvigs.fa
│ ├── uvigs_mapping.txt
│ ├── uvigs-nucmer.out.coords
│ ├── uvigs-nucmer.out.delta
│ └── viral_unique_contigs
├── vibrant
│ └── VIBRANT_final_assembly
├── vibrant_filtered_data.txt
├── virfinder
│ └── virfinder_output.tsv
├── virfinder_filtered_data.txt
├── virsorter
│ ├── config.yaml
│ ├── final-viral-boundary.tsv
│ ├── final-viral-combined.fa
│ ├── final-viral-score.tsv
│ ├── iter-0
│ └── log
└── virsorter2_filtered_data.txt
After step 3.b is finished, you should have the following folder structure:
sample_name
└── viruses
├── investigation
│ └── CONTENT_FROM_THE_Uvigs_RECOVERY_SHOWN_BEFORE
├── taxonomy
│ ├── AUX-1
│ ├── AUX-2
│ ├── vcontact-output
│ │ ├── c1.clusters
│ │ ├── c1.ntw
│ │ ├── genome_by_genome_overview.csv
│ │ ├── merged_df.csv
│ │ ├── merged.dmnd
│ │ ├── merged.faa
│ │ ├── merged.self-diamond.tab
│ │ ├── merged.self-diamond.tab.abc
│ │ ├── merged.self-diamond.tab.mci
│ │ ├── merged.self-diamond.tab_mcl20.clusters
│ │ ├── merged.self-diamond.tab_mcxload.tab
│ │ ├── modules_mcl_5.0.clusters
│ │ ├── modules_mcl_5.0_modules.pandas
│ │ ├── modules_mcl_5.0_pcs.pandas
│ │ ├── modules.ntwk
│ │ ├── sig1.0_mcl2.0_clusters.csv
│ │ ├── sig1.0_mcl2.0_contigs.csv
│ │ ├── sig1.0_mcl2.0_modsig1.0_modmcl5.0_minshared3_link_mod_cluster.csv
│ │ ├── sig1.0_mcl5.0_minshared3_modules.csv
│ │ ├── vConTACT_contigs.csv
│ │ ├── vConTACT_pcs.csv
│ │ ├── vConTACT_profiles.csv
│ │ ├── vConTACT_proteins.csv
│ │ └── viral_cluster_overview.csv
│ ├── viral_genomes.faa
│ ├── viral_genomes_g2g.csv
│ └── viral_genomes.genes
└── vcheck_quality
├── complete_genomes.tsv
├── completeness.tsv
├── contamination.tsv
├── proviruses.fna
├── quality_summary.tsv
├── tmp
│ └──checkv_only_files
└── viruses.fna
After step 3.c is finished, you should have the following folder structure:
sample_name
└── viruses
├── host_prediction
│ ├── modelDir
│ │ ├── sample_name-bin.0.mm
│ │ └── sample_name-bin.1.mm
│ ├── nullmodels
│ │ ├── computeNullParameters.R
│ │ ├── llikelihood.matrix
│ │ └── nullParameters.tsv
│ ├── output_results
│ │ ├── llikelihood.matrix
│ │ └── prediction.list
│ ├── potential_host_genomes
│ │ ├── sample_name-bin.0.fa
│ │ └── sample_name-bin.1.fa
│ └── uvigs
│ ├── sample_name_uvig-0.fa
│ └── sample_name_uvig-1000.fa
├── investigation
│ └── CONTENT_FROM_THE_Uvigs_RECOVERY_SHOWN_BEFORE
├── taxonomy
│ └── CONTENT_FROM_THE_Uvigs_TAXONOMY_SHOWN_BEFORE
└── vcheck_quality
└── CONTENT_FROM_THE_Uvigs_CheckV_SHOWN_BEFORE
After successfully running step 3.d, you should have the following folder structure:
sample_name
└── viruses
├──final_outputs
│ ├── only_uvigs_seq
│ │ ├── sample_name_putative_viral_contig-15.fa
│ │ ├── sample_name_putative_viral_contig-16.fa
│ │ └── ...
│ ├── putative_vir_contigs_summary.tsv
│ ├── putative_vir_seq_metrics_summary
│ │ ├── host_vir_pair_wish_summary.tsv
│ │ ├── quality_checkv_summary.tsv
│ │ ├── taxa_estimation_vir_vcontact2_summary.csv
│ │ ├── vir_contigs_annotation_vibrant_summary.tsv
│ │ └── vir_contigs_genbank_annotation_vibrant.tsv
│ ├── putative_viral_contigs
│ │ ├── sample_name_putative_viral_contig-0.fa
│ │ ├── sample_name_putative_viral_contig-1.fa
│ │ └── …
│ ├── uvigs_high_quality_summary.tsv
│ └── viral_contigs_seq_names.csv
├── host_prediction
├── investigation
├── taxonomy
└── vcheck_quality
Please refer to Module 4 final consideration to understand the folder structure from Module 4.
The recovery of eukaryotic genomes can be trickier than the prokaryotic genome recovery. This means that we do not assume the recovered eukaryotic bins to be genomes at any moment. That is why we keep the nomenclature as eMABs instead of eMAGs. The recovery of the eMABs module integrates EukRep for selecting the eukaryotic contigs from the initial assembly, the CONCOCT binner, GeneMark for prediction of eukaryotic genes, EukCC for for quality estimation from eukaryiotic sequences, MAKER2 for gene annotation, and BUSCO for detection of single-copy orthologous genes.
The tools that require specific databases in the eukaryotic module are EukCC, BUSCO, and MAKER2. BUSCO and EukCC databases should be ready to use after running the database-setup.sh script. See instructions here.
Module 4 requires specific configuration that can't be done automatically. Please, make sure you follow the instructions here to complete GeneMark and MAKER2 installation.
For running all module 4 use:
$ mudoger --module eukaryotes --meta /path/to/metadata.tsv -o /path/to/output/folder -t 20
The available parameters for module 4 are:
- --meta metadata table as described here (mandatory)
- -o output directory (mandatory)
- -t number of threads/cores (default = 1)
The eMABs recovery starts with final_assembly.fasta
as input to the EukRep tool. This splits the contigs within the assembled sequences into prokaryotic_contigs.fa
and eukaryotic_contigs.fa
. Following, the eukaryotic contigs serve as input to the CONCOCT binner. Finally, the resulted eukaryotic bins are filtered by minimum size (by default we use 1.5 MB) and considered eMABs.
The main result of step 4.a is the eMABs fasta files located within the filtered_euk_bins/
folder.
The second part of eMABs recovery starts by predicting eukaryiotic genes using GeneMark. The input for this step are the eMABs recovered in step 4.a located within the filtered_euk_bins/
folder. Following EukCC is used to calculate quality and completeness from the eMABs recovered in 4.a.
For the annotation of eukaryotic genes, MuDoGeR feeds the MAKER2 tool with the eukaryotic bins and the necessary files generated by GeneMark for each eMAB. Finally, BUSCO receives the MAKER2 genes as input for detecting single-copy orthologous genes (SCGs)
The results from GeneMark are calculated for each eMAB and you may find them in eukaryotes/genemarker_annotation/eMAB.fa_genemark
. The annotated genes are located in the genemark.gtf
file. The model created for the eMAB, the gmhmm.mod
is then used by MAKER2.
A summary of the quality results calculated buy EukCC can be found in eukaryotes/eukcc_quality/eMAB.fa_eukcc/eukcc.csv
.
The main results from the MAKER2 annotation can be found for each eMAB within the folder eukaryotes/maker2_gene_annotation/eMAB.fa_maker2/eMAB.maker.output/
Finally, you can find BUSCO results in the files eukaryotes/euk_completeness/eMAB_busco/short_summary.specific.eukaryota_odb10.eMAB_busco.txt
and eukaryotes/euk_completeness/eMAB_busco/run_eukaryota_odb10/full_table.tsv
.
By the end, MuDoGeR parses the outputs from all used tools. Please go to the understand outputs to better understand MuDoGeR results.
After a successful run of Module 4 you should have the following folder structure:
After step 4.a is finished, you should have the following folder structure:
sample_name
└── eukaryotes
├── eukaryotes_bins
│ ├── bin.0.fa
│ ├── bin.10.fa
│ ├── bin.11.fa
│ └── bin.8.fa
├── eukaryotic_contigs.fa
├── filtered_euk_bins
│ └── bin.8.fa
└── prokaryotic_contigs.fa
After step 4.b is finished, you should have the following folder structure:
sample_name
└── eukaryotes
├── eukaryotes_bins
│ ├── bin.0.fa
│ ├── bin.10.fa
│ ├── bin.11.fa
│ └── bin.8.fa
├── eukaryotic_contigs.fa
├── eukcc_quality
│ └── bin.8.fa_eukcc
│ ├── eukcc.csv
│ ├── eukcc.log
│ └── savestate.json.gz
├── euk_completeness
│ ├── bin.8_busco
│ │ ├── logs
│ │ ├── run_eukaryota_odb10
│ │ ├── short_summary.specific.eukaryota_odb10.bin.8_busco.json
│ │ └── short_summary.specific.eukaryota_odb10.bin.8_busco.txt
│ └── busco_downloads
│ └── file_versions.tsv
├── filtered_euk_bins
│ └── bin.8.fa
├── genemarker_annotation
│ └── bin.8.fa_genemark
│ ├── bin.8.fa
│ ├── data
│ ├── genemark.gtf
│ ├── gmes.log
│ ├── gmhmm.mod
│ ├── info
│ ├── output
│ ├── run
│ └── run.cfg
├── maker2_gene_annotation
│ └── bin.8.fa_maker2
│ ├── bin.8.fa
│ ├── bin.8.maker.output
│ ├── maker_bopts.ctl
│ ├── maker_evm.ctl
│ ├── maker_exe.ctl
│ └── maker_opts.ctl
└── prokaryotic_contigs.fa
Module 5 of MuDoGeR maps the quality-controlled reads to the recovered pMAGs/UViGs/eMAB to calculate their abundance within the WGS samples.
Consequently, MuDoGeR uses the path to the quality-controlled reads and the path to the recovered pMAGs/UViGs/eMAB sequences.
Initially, module 5 selects the representatives' pMAGs/UViGs/eMAB based on the Average nucleotide identity (ANI) 95 distance measure using the gOTUpick software integrated into the MuDoGeR custom scripts. Following, MuDoGeR provides two mapping pipelines, called --reduced
and --complete
.
When using the --reduced
flag, MuDoGeR maps the representative pMAGs/UViGs/eMAB only to the WGS samples where they were recovered. When using the --complete
flag, MuDoGeR maps all WGS samples to all representative pMAGs/UViGs/eMAB selected by the gOTUpick. In both cases, you can use the --coverage
and --relative-abundance
flags to also calculate the pMAGs/UViGs/eMABs coverage and relative abundance, respectively.
In addition, you can use the --gene
flag to map the WGS quality-controlled reads to all the Prokka annotated genes found in the samples' final_assembly.fasta
files. You can also use the --coverage
and --relative-abundance
flags to output coverage and relative abundance values, respectively, for the annotated genes.
You can run all module 5 as follows:
$ mudoger --module abundance_tables --meta /path/to/metadata.tsv -o /path/to/output/folder -t 20 --reduced --absolute-values --coverage --relative-abundance
The available parameters for module 5 are:
- --meta metadata table as described here (mandatory)
- -o output directory (mandatory)
- -t number of threads/cores (default = 1)
- --reduced (default), --complete, or --genes mapping type
- --absolute-values --coverage, and/or --relative-abundance output values
Please refer to Module 5 final consideration to understand the folder structure from Module 5.
The first step from module 5 is to group all recovered pMAGs within OTU groups. To do so, MuDoGeR uses the ANI 95 distance strategy implemented in the gOTUpick tool. Essentially, it starts by calculating the ANI distance from the pMAGs and separates them into a group according to 95% identity within the groups. Following, it uses information from the GTDB-tk taxonomical annotation, and CheckM quality estimation to select the highest quality pMAG from the same taxonomical class within the same ANI95 group as the group's representative MAG.
After step 5.a is finished you should find the mapping_results/gOTUpick_results/final_output/
folder. Inside this folder, you will have the bestbins.txt
and final_groups_output.csv
files. Those files indicate the representatives' pMAGs, marked with an *, and their groups. Those are also the files used in the next steps of the pipeline. Please go to the understand outputs to better understand MuDoGeR results.
Following the MuDoGeR copies the OTUs selected in step 5.a, the selected UViGs and eMABs recovered in module 3 and module 4, to the pmags_otu_mapping/otus_fasta
, uvigs_mapping/uvigs_fasta
, and euk_mabs_mapping/emabs_fasta
, respectively.
Later, MuDoGeR uses the Bowtie2 software to index the copied fasta files.
If --coverage
is selected, MuDoGeR calculates the pMAGs/UViGs/eMABs sizes, and the average read length of all samples to be mapped. Consequently, you should find the file merged_reads/avg_reads_len.tsv
containing the average read length of all used samples. In addition, you should find the files genome_sizes
, uvigs_sizes.txt
, and emabs_sizes.txt
inside the folder relative to the pMAGs, UViGS, and eMABs mapping, respectively. If --relative-abundance
is selected, it also calculates the total reads per sample, and you should also find the file merged_reads/total_reads_per_lib.tsv
.
Following this, the MuDoGeR does the actual mapping according to the --reduced
or --complete
strategy. You should find the folders map_results_ reduced
or map_results_complete
inside the folder for each domain. These folders will contain one file for each performed mapping named “mapped_bin_name”-LIB-“sample_used”.txt.
Finally, MuDoGeR calculates the absolute number of hit, coverage, and relative abundance tables and outputs the results inside the map_final_tables_reduced
and map_final_tables_complete
. You should have the tables named as map_reduced(complete)_absolute_n_hits(coverage)(relative_abundance)_table.tsv
.
Please go to the understand outputs to better understand MuDoGeR results.
If you chose the --genes
mapping type, MuDoGeR will copy and index the sample’s final_assembly.fasta
file. After that, you should have the assembly_gene_map/raw_mapping/
folder containing the assembly fasta file and the index outputs. Later, MuDoGeR maps the assembled sequence and you should have the sample_name.map.sorted.bam
file inside the raw_mapping
folder.
Following, MuDoGeR annotates the sample’s final_assembly.fasta
file using Prokka. It then converts the resulted .gff file from Prokka into .gtf file. You should find the results from this functional annotation step inside the assembly_gene_map/functional_annotation/sample_name/
folder.
MuDoGeR then counts the mapped reads on each gene using the information present on the .gtf file and the mapped .map.sorted.bam
file. The result from the counting is in the map_absolute_count/sample_name.count
file.
To calculate the gene coverage and relative abundance, MuDoGeR also calculates the average read length from the used sample and the gene length. The results from this step is found on the avg_reads_len.tsv
and the genelength/sample_name.genelength
respectively.
Finally, MuDoGeR calculates the gene coverage and gene TPM normalization. The results from this step are at the map_coverage_norm/sample_name.cov
and map_tpm_norm/sample_name.tpm
files respectively.
Please go to the understand outputs to better understand MuDoGeR results.
By the end, MuDoGeR parses the outputs from all used tools. Please go to the understand outputs to better understand MuDoGeR results. After a successful run of Module 5 you should have the following folder structure: After step 5.a is finished, you should have the following folder structure:
mapping_results
└── gOTUpick_results
├── ANI_OTU95
│ └── group-1-1
├── ANI_distances
│ └── group-1
├── all_fastani-out-1500-0.txt
├── allgtdb.tsv
├── final_output
│ ├── bestbins.txt
│ └── final_groups_output.csv
├── results
│ ├── bestbins_metadata.tsv
│ ├── final_groups.tsv
│ ├── final_groups_qual.tsv
│ ├── further_95.txt
│ ├── groups_ANI95.tsv
│ ├── metadata_bbtools_sorted.tsv
│ ├── metadata_checkm_sorted.tsv
│ ├── metadata_gtdb_taxonomy_sorted.tsv
│ └── not_further_95.txt
└── tax_groups
├── group-1
└── unique_tax
After step 5.b is finished, you should have the following folder structure:
mapping_results
├── euk_mabs_mapping
│ ├── emabs_fasta
│ │ ├── sample_name-bin.6.1.bt2
│ │ └── …
│ ├── emabs_sizes
│ │ └── sample_name-bin.6.emab-size
│ ├── emabs_sizes.txt
│ ├── map_final_tables_complete
│ │ ├── map_complete_absolute_n_hits_list.tsv
│ │ ├── map_complete_absolute_n_hits_table.tsv
│ │ ├── map_complete_coverage_list.tsv
│ │ ├── map_complete_coverage_table.tsv
│ │ ├── map_complete_relative_abundance_list.tsv
│ │ └── map_complete_relative_abundance_table.tsv
│ ├── map_final_tables_reduced
│ │ ├── map_reduced_absolute_n_hits_list.tsv
│ │ ├── map_reduced_absolute_n_hits_table.tsv
│ │ ├── map_reduced_coverage_list.tsv
│ │ ├── map_reduced_coverage_table.tsv
│ │ ├── map_reduced_relative_abundance_list.tsv
│ │ └── map_reduced_relative_abundance_table.tsv
│ ├── map_results_complete
│ │ └── sample_name-bin.6-LIB-A.txt
│ └── map_results_reduced
│ └── sample_name-bin.6-LIB-A.txt
├── merged_reads
│ ├── sample_name.fasta
│ ├── avg_reads_len.tsv
│ └── total_reads_per_lib.tsv
├── pmags_otu_mapping
│ ├── genome_size
│ │ ├── sample_name-bin.0.genome-size
│ │ └── sample_name-bin.7.genome-size
│ ├── genomes_sizes
│ ├── map_final_tables_complete
│ │ ├── map_complete_absolute_n_hits_list.tsv
│ │ ├── map_complete_absolute_n_hits_table.tsv
│ │ ├── map_complete_coverage_list.tsv
│ │ ├── map_complete_coverage_table.tsv
│ │ ├── map_complete_relative_abundance_list.tsv
│ │ └── map_complete_relative_abundance_table.tsv
│ ├── map_final_tables_reduced
│ │ ├── map_reduced_absolute_n_hits_list.tsv
│ │ ├── map_reduced_absolute_n_hits_table.tsv
│ │ ├── map_reduced_coverage_list.tsv
│ │ ├── map_reduced_coverage_table.tsv
│ │ ├── map_reduced_relative_abundance_list.tsv
│ │ └── map_reduced_relative_abundance_table.tsv
│ ├── map_results_complete
│ │ ├── sample_name-bin.0-LIB-sample_name.txt
│ │ └── sample_name-bin.7-LIB-sample_name.txt
│ ├── map_results_reduced
│ │ ├── sample_name-bin.0-LIB-sample_name.txt
│ │ └── sample_name-bin.7-LIB-sample_name.txt
│ └── otus_fasta
│ ├── A-bin.0.1.bt2
│ ├── A-bin.7.rev.1.bt2
│ └── …
└── uvigs_mapping
├── map_final_tables_complete
│ ├── map_complete_absolute_n_hits_list.tsv
│ ├── map_complete_absolute_n_hits_table.tsv
│ ├── map_complete_coverage_list.tsv
│ ├── map_complete_coverage_table.tsv
│ ├── map_complete_relative_abundance_list.tsv
│ └── map_complete_relative_abundance_table.tsv
├── map_final_tables_reduced
│ ├── map_reduced_absolute_n_hits_list.tsv
│ ├── map_reduced_absolute_n_hits_table.tsv
│ ├── map_reduced_coverage_list.tsv
│ ├── map_reduced_coverage_table.tsv
│ ├── map_reduced_relative_abundance_list.tsv
│ └── map_reduced_relative_abundance_table.tsv
├── map_list_reduced
├── map_results_complete
│ ├── sample_name-sample_name_putative_viral_contig-15-LIB-sample_name.txt
│ └── …
├── map_results_reduced
│ ├── sample_name-sample_name_putative_viral_contig-15-LIB-sample_name.txt
│ └── …
├── uvigs_fasta
│ ├── sample_name-sample_name_putative_viral_contig-15.1.bt2
│ └── …
├── uvigs_sizes
│ ├── sample_name-sample_name_putative_viral_contig-15.uvig-size
│ └──…
└── uvigs_sizes.txt
After step 5.c is finished, you should have the following folder structure:
mapping_results
└── assembly_gene_map
├── avg_reads_len.tsv
├── functional_annotation
│ └── sample_name
├── genelength
│ └── sample_name.genelength
├── map_absolute_count
│ └── sample_name.count
├── map_coverage_norm
│ └── sample_name.cov
├── map_tpm_norm
│ └── sample_name.tpm
└── raw_mapping
├── sample_name.map.sam
└── …