Skip to content

Commit

Permalink
Merge pull request #62 from snayfach/dev
Browse files Browse the repository at this point in the history
Major pull from Dev
  • Loading branch information
snayfach authored Jul 25, 2017
2 parents 023411a + 7e9abbb commit c8290f3
Show file tree
Hide file tree
Showing 69 changed files with 2,648 additions and 2,398 deletions.
7 changes: 1 addition & 6 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,6 +1 @@
phylo_cnv/example/ex1/
lib_bak/
**/*.pyc
scripts2/
phylo_db/
phylo_db.tar.gz
\*.pyc
64 changes: 22 additions & 42 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,47 +1,27 @@
# Metagenomic Intra-Species Diversity Analysis System (MIDAS)


MIDAS is an integrated pipeline that leverages >30,000 reference genomes to estimate bacterial species abundance and strain-level genomic variation, including gene content and SNPs, from shotgun metagnomes.

## Applications
1. **Profile bacterial species abundance**: rapidly estimate the abundance of 5,952 bacterial species
2. **Strain-level pan-genome profiling**: estimate the gene content of populations based on mapping to genes from reference genomes
3. **Single nucleotide polymorphism prediction**: identify single-nucleotide polymorphisms (SNPs) of populations based on mapping to reference genomes
4. **Phylogenetic inference**: reconstruct the phylogeny of strains from metagenomes and reference genomes
5. **Population genetic inference**: quantify strain-level diversity, differentiation, and selection within and between metagenomes


## Table of Contents
1. [Step-by-step tutorial] (docs/tutorial.md)
2. [Install or update MIDAS] (docs/install.md)
3. [Download default MIDAS database] (docs/ref_db.md)
4. [Build your own custom database] (docs/build_db.md)
5. Scripts to run MIDAS on a single sample:
* [Estimate species abundance] (docs/species.md)
* [Predict pan-genome gene content] (docs/cnvs.md)
* [Call single nucleotide polymorphisms] (docs/snvs.md)
6. Scripts to merge MIDAS results across samples:
* [Merge species abundance] (docs/merge_species.md)
* [Merge gene content] (docs/merge_cnvs.md)
* [Merge SNPs] (docs/merge_snvs.md)
7. Example scripts for analyzing gene content and SNPs:
* [Strain tracking] (docs/strain_tracking.md)
* [Population diversity] (docs/snp_diversity.md)
* [Core-genome phylogenetic trees] (docs/snp_trees.md)
* [Gene content dynamics] (docs/compare_genes.md)


## Citation
If you use this tool, please cite:
S Nayfach, B Rodriguez-Mueller, N Garud, and KS Pollard. "An integrated metagenomics pipeline for strain profiling reveals novel patterns of transmission and global biogeography of bacteria". Genome Research 2016. doi:
10.1101/gr.201863.115

## Pipeline
<img src="images/pipeline.jpg" width="600" align="middle"/>
**An integrated pipeline to estimate bacterial species abundance and strain-level genomic variation from shotgun metagnomes**
<sub>**A) Metagenome species profiling.** Reads from a metagenomic sample are aligned against a database of phylogenetic marker genes and are assigned to species groups. Mapped reads are used to estimate the genome-coverage and relative abundance of 5,952 genome-clusters. **B) Metagenome pan-genome profiling.** A pan-genome database is dynamically constructed based on the subset of species that are present at high coverage (e.g. >1x) in the metagenome. Reads are mapped to the gene database using Bowtie2. Mapped reads are used to infer gene copy number and gene presence/absence. **C) Single-nucleotide variant prediction.** A representative genome database is constructed, as described in (B). Reads are globally aligned to the genome database using Bowtie2. Mapped reads are used to identify variants, predict consensus alleles, and estimate allele frequencies. **D) Merge results.** For each species, results are merged across one or more samples to generate several outputs, including: a gene presence/absence matrix, an allele frequency matrix, an approximate maximum-likelihood phylogenetic tree.</sub>

## Examples
<img src="images/enrichment.jpg" width="600" align="middle"/>
**Comparative genomics of *Bacteroides ovatus* strains across host microbiomes**
<sub> **A)** Presence or absence of genes in the *Bacteroides ovatus* pangenome across human faecal metagenomes. Column colors indicate whether a gene is core (blue; occurs in >95% of samples), auxiliary (red; occurs in 1-95% of samples ), or absent (green; occurs in < 1% of samples). **B)** Gene set enrichment analysis identifies functions overrepresented in the core genome, auxiliary genome, and genes that only occur in reference genomes.</sub>
1. Getting started
* [How it works](docs/overview.md)
* [Install or update software](docs/install.md)
* [Step-by-step tutorial](docs/tutorial.md)
* [Mailing list](https://groups.google.com/forum/#!forum/midas-user-group)
2. Reference databse
* [Download default database](docs/ref_db.md)
* [Build your own custom database](docs/build_db.md)
4. Run MIDAS on a single sample:
* [Estimate species abundance](docs/species.md)
* [Predict pan-genome gene content](docs/cnvs.md)
* [Call single nucleotide polymorphisms](docs/snvs.md)
5. Merge MIDAS results across samples:
* [Merge species abundance](docs/merge_species.md)
* [Merge gene content](docs/merge_cnvs.md)
* [Merge SNPs](docs/merge_snvs.md)
6. Example scripts for analyzing gene content and SNPs:
* [Core-genome phylogenetic trees](docs/snp_trees.md)
* [Population diversity](docs/snp_diversity.md)
* [Strain tracking](docs/strain_tracking.md)
* [Gene content dynamics](docs/compare_genes.md)
7. [Citing](docs/citing.md)
40 changes: 32 additions & 8 deletions bin/Darwin/bowtie2
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@ while (-f $prog && -l $prog){

($vol,$script_path,$prog)
= File::Spec->splitpath($prog);
my $os_is_nix = ($^O eq "linux") || ($^O eq "darwin");
my $os_is_nix = $^O ne "MSWin32";
my $align_bin_s = $os_is_nix ? 'bowtie2-align-s' : 'bowtie2-align-s.exe';
my $build_bin = $os_is_nix ? 'bowtie2-build' : 'bowtie2-build.exe';
my $align_bin_l = $os_is_nix ? 'bowtie2-align-l' : 'bowtie2-align-l.exe';
Expand All @@ -58,6 +58,25 @@ my $idx_ext = $idx_ext_s;
my %signo = ();
my @signame = ();

sub quote_params {
my %params_2_quote = ('--rg' => 1, '--rg-id' => 1,
'-S' => 1, '-U' => 1,
'-1' => 1, '-2' => 1
);
my $param_list = shift;
my $quoting = 0;

for (my $i=0; $i<scalar(@{$param_list}); $i++){
if($quoting){
$quoting = 0;
$param_list->[$i] = "\"".$param_list->[$i]."\"";
next;
}
$quoting = 1 if(exists($params_2_quote{$param_list->[$i]}));
}
}


{
# Get signal info
use Config;
Expand All @@ -76,7 +95,7 @@ my @signame = ();
# 2 args from wrapper args
sub getBt2Desc($) {
my $d = shift;
my $cmd = "$align_prog --wrapper basic-0 --arg-desc";
my $cmd = "\"$align_prog\" --wrapper basic-0 --arg-desc";
open(my $fh, "$cmd |") || Fail("Failed to run command '$cmd'\n");
while(readline $fh) {
chomp;
Expand Down Expand Up @@ -198,6 +217,10 @@ for(my $i = 0; $i < scalar(@bt2_args); $i++) {
last;
}
}
if ($arg =~ /(bz2|lz4)$/) {
push @bt2w_args, ("-U", $arg);
$bt2_args[$i] = undef;
}
}
# If the user asked us to redirect some reads to files, or to suppress
# unaligned reads, then we need to capture the output from Bowtie 2 and pass it
Expand Down Expand Up @@ -273,13 +296,13 @@ sub cat_file($$) {
my ($ifn, $ofh) = @_;
my $ifh = undef;
if($ifn =~ /\.gz$/) {
open($ifh, "gzip -dc $ifn |") ||
open($ifh, "gzip -dc \"$ifn\" |") ||
Fail("Could not open gzipped read file: $ifn \n");
} elsif($ifn =~ /\.bz2/) {
open($ifh, "bzip2 -dc $ifn |") ||
open($ifh, "bzip2 -dc \"$ifn\" |") ||
Fail("Could not open bzip2ed read file: $ifn \n");
} elsif($ifn =~ /\.lz4/) {
open($ifh, "lz4 -dc $ifn |") ||
open($ifh, "lz4 -dc \"$ifn\" |") ||
Fail("Could not open lz4ed read file: $ifn \n");
} else {
open($ifh, $ifn) || Fail("Could not open read file: $ifn \n");
Expand All @@ -293,7 +316,7 @@ sub cat_file($$) {
sub wrapInput($$$) {
my ($unps, $mate1s, $mate2s) = @_;
for my $fn (@$unps, @$mate1s, @$mate2s) {
return 1 if $fn =~ /\.gz$/ || $fn =~ /\.bz2$/ || $fn =~ /\.lz4$/;
return 1 if $fn =~ /\.bz2$/ || $fn =~ /\.lz4$/;
}
return 0;
}
Expand Down Expand Up @@ -446,7 +469,8 @@ else {
my $debug_str = ($debug ? "-debug" : "");

# Construct command invoking bowtie2-align
my $cmd = "$align_prog$debug_str --wrapper basic-0 ".join(" ", @bt2_args);
quote_params(\@bt2_args);
my $cmd = "\"$align_prog$debug_str\" --wrapper basic-0 ".join(" ", @bt2_args);

# Possibly add read input on an anonymous pipe
$cmd = "$readpipe $cmd" if defined($readpipe);
Expand Down Expand Up @@ -527,7 +551,7 @@ if(defined($cap_out)) {
my $unal = ($fl & 4) != 0;
$filt = 1 if $no_unal && $unal;
if($passthru) {
if(scalar(keys %read_fhs) == 0) {
if(scalar(keys %read_fhs) == 0 || ($fl & 256) != 0) {
# Next line is read with some whitespace escaped
my $l = <BT>;
} else {
Expand Down
Binary file modified bin/Darwin/bowtie2-align-l
Binary file not shown.
Binary file added bin/Darwin/bowtie2-align-l-debug
Binary file not shown.
Binary file modified bin/Darwin/bowtie2-align-s
Binary file not shown.
Binary file added bin/Darwin/bowtie2-align-s-debug
Binary file not shown.
Binary file modified bin/Darwin/bowtie2-build-l
Binary file not shown.
Binary file modified bin/Darwin/bowtie2-build-l-debug
Binary file not shown.
Binary file modified bin/Darwin/bowtie2-build-s
Binary file not shown.
Binary file modified bin/Darwin/bowtie2-build-s-debug
Binary file not shown.
Binary file modified bin/Darwin/samtools
Binary file not shown.
40 changes: 32 additions & 8 deletions bin/Linux/bowtie2
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@ while (-f $prog && -l $prog){

($vol,$script_path,$prog)
= File::Spec->splitpath($prog);
my $os_is_nix = ($^O eq "linux") || ($^O eq "darwin");
my $os_is_nix = $^O ne "MSWin32";
my $align_bin_s = $os_is_nix ? 'bowtie2-align-s' : 'bowtie2-align-s.exe';
my $build_bin = $os_is_nix ? 'bowtie2-build' : 'bowtie2-build.exe';
my $align_bin_l = $os_is_nix ? 'bowtie2-align-l' : 'bowtie2-align-l.exe';
Expand All @@ -58,6 +58,25 @@ my $idx_ext = $idx_ext_s;
my %signo = ();
my @signame = ();

sub quote_params {
my %params_2_quote = ('--rg' => 1, '--rg-id' => 1,
'-S' => 1, '-U' => 1,
'-1' => 1, '-2' => 1
);
my $param_list = shift;
my $quoting = 0;

for (my $i=0; $i<scalar(@{$param_list}); $i++){
if($quoting){
$quoting = 0;
$param_list->[$i] = "\"".$param_list->[$i]."\"";
next;
}
$quoting = 1 if(exists($params_2_quote{$param_list->[$i]}));
}
}


{
# Get signal info
use Config;
Expand All @@ -76,7 +95,7 @@ my @signame = ();
# 2 args from wrapper args
sub getBt2Desc($) {
my $d = shift;
my $cmd = "$align_prog --wrapper basic-0 --arg-desc";
my $cmd = "\"$align_prog\" --wrapper basic-0 --arg-desc";
open(my $fh, "$cmd |") || Fail("Failed to run command '$cmd'\n");
while(readline $fh) {
chomp;
Expand Down Expand Up @@ -198,6 +217,10 @@ for(my $i = 0; $i < scalar(@bt2_args); $i++) {
last;
}
}
if ($arg =~ /(bz2|lz4)$/) {
push @bt2w_args, ("-U", $arg);
$bt2_args[$i] = undef;
}
}
# If the user asked us to redirect some reads to files, or to suppress
# unaligned reads, then we need to capture the output from Bowtie 2 and pass it
Expand Down Expand Up @@ -273,13 +296,13 @@ sub cat_file($$) {
my ($ifn, $ofh) = @_;
my $ifh = undef;
if($ifn =~ /\.gz$/) {
open($ifh, "gzip -dc $ifn |") ||
open($ifh, "gzip -dc \"$ifn\" |") ||
Fail("Could not open gzipped read file: $ifn \n");
} elsif($ifn =~ /\.bz2/) {
open($ifh, "bzip2 -dc $ifn |") ||
open($ifh, "bzip2 -dc \"$ifn\" |") ||
Fail("Could not open bzip2ed read file: $ifn \n");
} elsif($ifn =~ /\.lz4/) {
open($ifh, "lz4 -dc $ifn |") ||
open($ifh, "lz4 -dc \"$ifn\" |") ||
Fail("Could not open lz4ed read file: $ifn \n");
} else {
open($ifh, $ifn) || Fail("Could not open read file: $ifn \n");
Expand All @@ -293,7 +316,7 @@ sub cat_file($$) {
sub wrapInput($$$) {
my ($unps, $mate1s, $mate2s) = @_;
for my $fn (@$unps, @$mate1s, @$mate2s) {
return 1 if $fn =~ /\.gz$/ || $fn =~ /\.bz2$/ || $fn =~ /\.lz4$/;
return 1 if $fn =~ /\.bz2$/ || $fn =~ /\.lz4$/;
}
return 0;
}
Expand Down Expand Up @@ -446,7 +469,8 @@ else {
my $debug_str = ($debug ? "-debug" : "");

# Construct command invoking bowtie2-align
my $cmd = "$align_prog$debug_str --wrapper basic-0 ".join(" ", @bt2_args);
quote_params(\@bt2_args);
my $cmd = "\"$align_prog$debug_str\" --wrapper basic-0 ".join(" ", @bt2_args);

# Possibly add read input on an anonymous pipe
$cmd = "$readpipe $cmd" if defined($readpipe);
Expand Down Expand Up @@ -527,7 +551,7 @@ if(defined($cap_out)) {
my $unal = ($fl & 4) != 0;
$filt = 1 if $no_unal && $unal;
if($passthru) {
if(scalar(keys %read_fhs) == 0) {
if(scalar(keys %read_fhs) == 0 || ($fl & 256) != 0) {
# Next line is read with some whitespace escaped
my $l = <BT>;
} else {
Expand Down
Binary file modified bin/Linux/bowtie2-align-l
Binary file not shown.
Binary file modified bin/Linux/bowtie2-align-s
Binary file not shown.
Binary file modified bin/Linux/bowtie2-build-l
Binary file not shown.
Binary file removed bin/Linux/bowtie2-build-l-debug
Binary file not shown.
Binary file modified bin/Linux/bowtie2-build-s
Binary file not shown.
Binary file removed bin/Linux/bowtie2-build-s-debug
Binary file not shown.
Binary file modified bin/Linux/samtools
Binary file not shown.
42 changes: 18 additions & 24 deletions docs/build_db.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ This script will allow you to build a MIDAS database using your own genomes or m

This script uses user-defined species and user-supplied genomes to build database files for running MIDAS. This will allow you to estimate the abundance of these species in a shotgun metagenome and quantify gene content and SNPs for species with sufficient data.

First, a pan-genome is built for each species. A pan-genome is defined as the set of non-redundant genes across all genomes for each species. The program USEARCH (http://www.drive5.com/usearch) is first used to cluster genes at 99% percent identity and identify a centroid gene sequence from each cluster. These "centroids" are used for recruiting metagenomic reads. This is done to reduce the number of genes searched while maintaining maximum mapping sensitivity. Next, USEARCH is used to cluster the centroids from 99% identity gene clusters further at 95, 90, 85, 80, and 75 percent identity. After mapping reads to 99% identity centroids with 'run_midas.py genes', reads can be optionally aggregated into gene clusters at any of the lower clustering thresholds with 'merge_midas.py genes'.
First, a pan-genome is built for each species. A pan-genome is defined as the set of non-redundant genes across all genomes for each species. The program [VSEARCH](https://github.com/torognes/vsearch) is first used to cluster genes at 99% percent identity and identify a centroid gene sequence from each cluster. These "centroids" are used for recruiting metagenomic reads. This is done to reduce the number of genes searched while maintaining maximum mapping sensitivity. Next, VSEARCH is used to cluster the centroids from 99% identity gene clusters further at 95, 90, 85, 80, and 75 percent identity. After mapping reads to 99% identity centroids with `run_midas.py genes`, reads can be optionally aggregated into gene clusters at any of the lower clustering thresholds with `merge_midas.py genes`.

Next, a representative-genome database is built. Representative genomes are used for mapping reads and calling SNPs. The user simply needs to define which genome they want to use as the representative for each species.

Expand All @@ -16,15 +16,15 @@ Finally, a marker genes database is built. Marker genes are defined as universal
As with all scripts, MIDAS and its dependencies will need to be installed.
Additionally, you need to have the following command-line tools installed:

* USEARCH: http://www.drive5.com/usearch/download.html
* VSEARCH: https://github.com/torognes/vsearch
* HMMER3: http://hmmer.org

After installing them, you need to add their installation directories to your PATH:
`export PATH=$PATH:/usearch/installation/directory`
`export PATH=$PATH:/vsearch/installation/directory`
`export PATH=$PATH:/hmmer/installation/directory`

You should be able to call these programs from your command line:
`usearch`
`vsearch`
`hmmsearch`

## Usage
Expand All @@ -33,39 +33,33 @@ You should be able to call these programs from your command line:

### Input/output files:

<b>indir PATH</b>
Path to directory of genomes. Each subdirectory should be named with a genome identifier. Each subdirectory should contain the following files:
<b>indir:</b> Path to directory of genomes. Each subdirectory should be named with a genome identifier. Each subdirectory should contain the following files:

\<genome_id>.fna
Genomic DNA sequence in FASTA format
\<genome_id>.fna: Genomic DNA sequence in FASTA format

\<genome_id>.faa
Genomic DNA sequence in FASTA format
\<genome_id>.faa: Protein sequences in FASTA format

\<genome_id>.ffn
Protein sequences in FASTA format

\<genome_id>.features
File specifies genomic coordinates of genes.
This file should be tab-delimited with a header and 5 fields:
\<genome_id>.ffn: Gene sequences in FASTA format

\<genome_id>.genes: Tab delimited file with genomic coordinates of genes. The file should be tab-delimited file with a header and the following fields:

* scaffold_id (CHAR)
* gene_id (CHAR)
* scaffold_id (CHAR)
* start (INT)
* end (INT)
* strand ('+' or '-')
* gene_type ('CDS' or 'RNA')

<b>mapfile PATH</b>
Path to mapping file that specifies which genomes belonging to the same species.
* strand (+ or -)
* gene_type (CDS or RNA)

<b>mapfile</b>: Path to mapping file that specifies which genomes belonging to the same species.
The file should be tab-delimited file with a header and 3 fields:

* genome_id (CHAR): corresponds to subdirectory within INDIR
* species_id (CHAR): : species identifier for genome_id
* rep_genome (0 or 1): indicator if genome_id should be used for SNP calling

<b>outdir PATH</b>
Directory to store MIDAS database
<b>outdir:</b> Directory to store MIDAS database

For example input files, see:`/path/to/MIDAS/test/genomes.tar.gz` and `/path/to/MIDAS/test/genomes.mapfile`

### Options:

Expand Down
5 changes: 5 additions & 0 deletions docs/citing.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
## Citation


If you use this tool, please cite:
S Nayfach, B Rodriguez-Mueller, N Garud, and KS Pollard. "An integrated metagenomics pipeline for strain profiling reveals novel patterns of transmission and global biogeography of bacteria". Genome Research 2016. doi:10.1101/gr.201863.115
Loading

0 comments on commit c8290f3

Please sign in to comment.