Merge pull request #62 from snayfach/dev

Major pull from Dev
snayfach · Jul 25, 2017 · c8290f3 · c8290f3
2 parents 023411a + 7e9abbb
commit c8290f3
Show file tree

Hide file tree

Showing 69 changed files with 2,648 additions and 2,398 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,6 +1 @@
-phylo_cnv/example/ex1/
-lib_bak/
-**/*.pyc
-scripts2/
-phylo_db/
-phylo_db.tar.gz
+\*.pyc
diff --git a/README.md b/README.md
@@ -1,47 +1,27 @@
 # Metagenomic Intra-Species Diversity Analysis System (MIDAS)
 
-
 MIDAS is an integrated pipeline that leverages >30,000 reference genomes to estimate bacterial species abundance and strain-level genomic variation, including gene content and SNPs, from shotgun metagnomes. 
 
-## Applications
-1. **Profile bacterial species abundance**: rapidly estimate the abundance of 5,952 bacterial species
-2. **Strain-level pan-genome profiling**: estimate the gene content of populations based on mapping to genes from reference genomes
-3. **Single nucleotide polymorphism prediction**: identify single-nucleotide polymorphisms (SNPs) of populations based on mapping to reference genomes
-4. **Phylogenetic inference**: reconstruct the phylogeny of strains from metagenomes and reference genomes
-5. **Population genetic inference**: quantify strain-level diversity, differentiation, and selection within and between metagenomes
-
-
 ## Table of Contents
-1. [Step-by-step tutorial] (docs/tutorial.md)  
-2. [Install or update MIDAS] (docs/install.md)  
-3. [Download default MIDAS database] (docs/ref_db.md)  
-4. [Build your own custom database] (docs/build_db.md)
-5. Scripts to run MIDAS on a single sample:
- * [Estimate species abundance] (docs/species.md)
- * [Predict pan-genome gene content] (docs/cnvs.md)
- * [Call single nucleotide polymorphisms] (docs/snvs.md)
-6. Scripts to merge MIDAS results across samples:
- * [Merge species abundance] (docs/merge_species.md)  
- * [Merge gene content] (docs/merge_cnvs.md)
- * [Merge SNPs] (docs/merge_snvs.md)
-7. Example scripts for analyzing gene content and SNPs:
- * [Strain tracking] (docs/strain_tracking.md)  
- * [Population diversity] (docs/snp_diversity.md)  
- * [Core-genome phylogenetic trees] (docs/snp_trees.md)   
- * [Gene content dynamics] (docs/compare_genes.md)
-
-
-## Citation
-If you use this tool, please cite:
-S Nayfach, B Rodriguez-Mueller, N Garud, and KS Pollard. "An integrated metagenomics pipeline for strain profiling reveals novel patterns of transmission and global biogeography of bacteria". Genome Research 2016. doi:
-10.1101/gr.201863.115
-
-## Pipeline
-<img src="images/pipeline.jpg" width="600" align="middle"/>   
-**An integrated pipeline to estimate bacterial species abundance and strain-level genomic variation from shotgun metagnomes** 
-<sub>**A) Metagenome species profiling.** Reads from a metagenomic sample are aligned against a database of phylogenetic marker genes and are assigned to species groups. Mapped reads are used to estimate the genome-coverage and relative abundance of 5,952 genome-clusters. **B) Metagenome pan-genome profiling.** A pan-genome database is dynamically constructed based on the subset of species that are present at high coverage (e.g. >1x) in the metagenome. Reads are mapped to the gene database using Bowtie2. Mapped reads are used to infer gene copy number and gene presence/absence. **C) Single-nucleotide variant prediction.** A representative genome database is constructed, as described in (B). Reads are globally aligned to the genome database using Bowtie2. Mapped reads are used to identify variants, predict consensus alleles, and estimate allele frequencies. **D) Merge results.** For each species, results are merged across one or more samples to generate several outputs, including: a gene presence/absence matrix, an allele frequency matrix, an approximate maximum-likelihood phylogenetic tree.</sub>
-
-## Examples
-<img src="images/enrichment.jpg" width="600" align="middle"/>  
-**Comparative genomics of *Bacteroides ovatus* strains across host microbiomes**  
-<sub> **A)** Presence or absence of genes in the *Bacteroides ovatus* pangenome across human faecal metagenomes. Column colors indicate whether a gene is core (blue; occurs in >95% of samples), auxiliary (red; occurs in 1-95% of samples ), or absent (green; occurs in < 1% of samples). **B)** Gene set enrichment analysis identifies functions overrepresented in the core genome, auxiliary genome, and genes that only occur in reference genomes.</sub>
+1. Getting started
+ * [How it works](docs/overview.md)  
+ * [Install or update software](docs/install.md)  
+ * [Step-by-step tutorial](docs/tutorial.md)  
+ * [Mailing list](https://groups.google.com/forum/#!forum/midas-user-group)  
+2. Reference databse 
+ * [Download default database](docs/ref_db.md)  
+ * [Build your own custom database](docs/build_db.md)
+4. Run MIDAS on a single sample:
+ * [Estimate species abundance](docs/species.md)
+ * [Predict pan-genome gene content](docs/cnvs.md)
+ * [Call single nucleotide polymorphisms](docs/snvs.md)
+5. Merge MIDAS results across samples:
+ * [Merge species abundance](docs/merge_species.md)  
+ * [Merge gene content](docs/merge_cnvs.md)
+ * [Merge SNPs](docs/merge_snvs.md)
+6. Example scripts for analyzing gene content and SNPs:
+ * [Core-genome phylogenetic trees](docs/snp_trees.md)
+ * [Population diversity](docs/snp_diversity.md)
+ * [Strain tracking](docs/strain_tracking.md)      
+ * [Gene content dynamics](docs/compare_genes.md)
+7. [Citing](docs/citing.md)
diff --git a/bin/Darwin/bowtie2 b/bin/Darwin/bowtie2
@@ -45,7 +45,7 @@ while (-f $prog && -l $prog){
 
 ($vol,$script_path,$prog) 
                 = File::Spec->splitpath($prog);
-my $os_is_nix   = ($^O eq "linux") || ($^O eq "darwin");
+my $os_is_nix   = $^O ne "MSWin32";
 my $align_bin_s = $os_is_nix ? 'bowtie2-align-s' : 'bowtie2-align-s.exe'; 
 my $build_bin   = $os_is_nix ? 'bowtie2-build' : 'bowtie2-build.exe';               
 my $align_bin_l = $os_is_nix ? 'bowtie2-align-l' : 'bowtie2-align-l.exe'; 
@@ -58,6 +58,25 @@ my $idx_ext       = $idx_ext_s;
 my %signo       = ();
 my @signame     = ();
 
+sub quote_params {
+    my %params_2_quote = ('--rg' => 1, '--rg-id' => 1,
+                          '-S' => 1, '-U' => 1,
+                          '-1' => 1, '-2' => 1
+    );
+    my $param_list = shift;
+    my $quoting = 0;
+
+    for (my $i=0; $i<scalar(@{$param_list}); $i++){
+        if($quoting){
+            $quoting = 0;
+            $param_list->[$i] = "\"".$param_list->[$i]."\"";
+            next;
+        }
+    	$quoting = 1 if(exists($params_2_quote{$param_list->[$i]}));
+    }
+}
+
+
 {
 	# Get signal info
 	use Config;
@@ -76,7 +95,7 @@ my @signame     = ();
 # 2 args from wrapper args
 sub getBt2Desc($) {
 	my $d = shift;
-	my $cmd = "$align_prog --wrapper basic-0 --arg-desc";
+	my $cmd = "\"$align_prog\" --wrapper basic-0 --arg-desc";
 	open(my $fh, "$cmd |") || Fail("Failed to run command '$cmd'\n");
 	while(readline $fh) {
 		chomp;
@@ -198,6 +217,10 @@ for(my $i = 0; $i < scalar(@bt2_args); $i++) {
 			last;
 		}
 	}
+    if ($arg =~ /(bz2|lz4)$/) {
+        push @bt2w_args, ("-U", $arg);
+        $bt2_args[$i] = undef;
+    }
 }
 # If the user asked us to redirect some reads to files, or to suppress
 # unaligned reads, then we need to capture the output from Bowtie 2 and pass it
@@ -273,13 +296,13 @@ sub cat_file($$) {
 	my ($ifn, $ofh) = @_;
 	my $ifh = undef;
 	if($ifn =~ /\.gz$/) {
-		open($ifh, "gzip -dc $ifn |") ||
+		open($ifh, "gzip -dc \"$ifn\" |") ||
 			 Fail("Could not open gzipped read file: $ifn \n");
 	} elsif($ifn =~ /\.bz2/) {
-		open($ifh, "bzip2 -dc $ifn |") ||
+		open($ifh, "bzip2 -dc \"$ifn\" |") ||
 			Fail("Could not open bzip2ed read file: $ifn \n");
 	} elsif($ifn =~ /\.lz4/) {
-		open($ifh, "lz4 -dc $ifn |") ||
+		open($ifh, "lz4 -dc \"$ifn\" |") ||
 			Fail("Could not open lz4ed read file: $ifn \n");
 	} else {
 		open($ifh, $ifn) || Fail("Could not open read file: $ifn \n");
@@ -293,7 +316,7 @@ sub cat_file($$) {
 sub wrapInput($$$) {
 	my ($unps, $mate1s, $mate2s) = @_;
 	for my $fn (@$unps, @$mate1s, @$mate2s) {
-		return 1 if $fn =~ /\.gz$/ || $fn =~ /\.bz2$/ || $fn =~ /\.lz4$/;
+		return 1 if $fn =~ /\.bz2$/ || $fn =~ /\.lz4$/;
 	}
 	return 0;
 }
@@ -446,7 +469,8 @@ else {
 my $debug_str = ($debug ? "-debug" : "");
 
 # Construct command invoking bowtie2-align
-my $cmd = "$align_prog$debug_str --wrapper basic-0 ".join(" ", @bt2_args);
+quote_params(\@bt2_args);
+my $cmd = "\"$align_prog$debug_str\" --wrapper basic-0 ".join(" ", @bt2_args);
 
 # Possibly add read input on an anonymous pipe
 $cmd = "$readpipe $cmd" if defined($readpipe);
@@ -527,7 +551,7 @@ if(defined($cap_out)) {
 			my $unal = ($fl & 4) != 0;
 			$filt = 1 if $no_unal && $unal;
 			if($passthru) {
-				if(scalar(keys %read_fhs) == 0) {
+				if(scalar(keys %read_fhs) == 0 || ($fl & 256) != 0) {
 					# Next line is read with some whitespace escaped
 					my $l = <BT>;
 				} else {

diff --git a/bin/Darwin/bowtie2-align-l b/bin/Darwin/bowtie2-align-l
diff --git a/bin/Darwin/bowtie2-align-l-debug b/bin/Darwin/bowtie2-align-l-debug
diff --git a/bin/Darwin/bowtie2-align-s b/bin/Darwin/bowtie2-align-s
diff --git a/bin/Darwin/bowtie2-align-s-debug b/bin/Darwin/bowtie2-align-s-debug
diff --git a/bin/Darwin/bowtie2-build-l b/bin/Darwin/bowtie2-build-l
diff --git a/bin/Darwin/bowtie2-build-l-debug b/bin/Darwin/bowtie2-build-l-debug
diff --git a/bin/Darwin/bowtie2-build-s b/bin/Darwin/bowtie2-build-s
diff --git a/bin/Darwin/bowtie2-build-s-debug b/bin/Darwin/bowtie2-build-s-debug
diff --git a/bin/Darwin/samtools b/bin/Darwin/samtools
diff --git a/bin/Linux/bowtie2 b/bin/Linux/bowtie2
@@ -45,7 +45,7 @@ while (-f $prog && -l $prog){
 
 ($vol,$script_path,$prog) 
                 = File::Spec->splitpath($prog);
-my $os_is_nix   = ($^O eq "linux") || ($^O eq "darwin");
+my $os_is_nix   = $^O ne "MSWin32";
 my $align_bin_s = $os_is_nix ? 'bowtie2-align-s' : 'bowtie2-align-s.exe'; 
 my $build_bin   = $os_is_nix ? 'bowtie2-build' : 'bowtie2-build.exe';               
 my $align_bin_l = $os_is_nix ? 'bowtie2-align-l' : 'bowtie2-align-l.exe'; 
@@ -58,6 +58,25 @@ my $idx_ext       = $idx_ext_s;
 my %signo       = ();
 my @signame     = ();
 
+sub quote_params {
+    my %params_2_quote = ('--rg' => 1, '--rg-id' => 1,
+                          '-S' => 1, '-U' => 1,
+                          '-1' => 1, '-2' => 1
+    );
+    my $param_list = shift;
+    my $quoting = 0;
+
+    for (my $i=0; $i<scalar(@{$param_list}); $i++){
+        if($quoting){
+            $quoting = 0;
+            $param_list->[$i] = "\"".$param_list->[$i]."\"";
+            next;
+        }
+    	$quoting = 1 if(exists($params_2_quote{$param_list->[$i]}));
+    }
+}
+
+
 {
 	# Get signal info
 	use Config;
@@ -76,7 +95,7 @@ my @signame     = ();
 # 2 args from wrapper args
 sub getBt2Desc($) {
 	my $d = shift;
-	my $cmd = "$align_prog --wrapper basic-0 --arg-desc";
+	my $cmd = "\"$align_prog\" --wrapper basic-0 --arg-desc";
 	open(my $fh, "$cmd |") || Fail("Failed to run command '$cmd'\n");
 	while(readline $fh) {
 		chomp;
@@ -198,6 +217,10 @@ for(my $i = 0; $i < scalar(@bt2_args); $i++) {
 			last;
 		}
 	}
+    if ($arg =~ /(bz2|lz4)$/) {
+        push @bt2w_args, ("-U", $arg);
+        $bt2_args[$i] = undef;
+    }
 }
 # If the user asked us to redirect some reads to files, or to suppress
 # unaligned reads, then we need to capture the output from Bowtie 2 and pass it
@@ -273,13 +296,13 @@ sub cat_file($$) {
 	my ($ifn, $ofh) = @_;
 	my $ifh = undef;
 	if($ifn =~ /\.gz$/) {
-		open($ifh, "gzip -dc $ifn |") ||
+		open($ifh, "gzip -dc \"$ifn\" |") ||
 			 Fail("Could not open gzipped read file: $ifn \n");
 	} elsif($ifn =~ /\.bz2/) {
-		open($ifh, "bzip2 -dc $ifn |") ||
+		open($ifh, "bzip2 -dc \"$ifn\" |") ||
 			Fail("Could not open bzip2ed read file: $ifn \n");
 	} elsif($ifn =~ /\.lz4/) {
-		open($ifh, "lz4 -dc $ifn |") ||
+		open($ifh, "lz4 -dc \"$ifn\" |") ||
 			Fail("Could not open lz4ed read file: $ifn \n");
 	} else {
 		open($ifh, $ifn) || Fail("Could not open read file: $ifn \n");
@@ -293,7 +316,7 @@ sub cat_file($$) {
 sub wrapInput($$$) {
 	my ($unps, $mate1s, $mate2s) = @_;
 	for my $fn (@$unps, @$mate1s, @$mate2s) {
-		return 1 if $fn =~ /\.gz$/ || $fn =~ /\.bz2$/ || $fn =~ /\.lz4$/;
+		return 1 if $fn =~ /\.bz2$/ || $fn =~ /\.lz4$/;
 	}
 	return 0;
 }
@@ -446,7 +469,8 @@ else {
 my $debug_str = ($debug ? "-debug" : "");
 
 # Construct command invoking bowtie2-align
-my $cmd = "$align_prog$debug_str --wrapper basic-0 ".join(" ", @bt2_args);
+quote_params(\@bt2_args);
+my $cmd = "\"$align_prog$debug_str\" --wrapper basic-0 ".join(" ", @bt2_args);
 
 # Possibly add read input on an anonymous pipe
 $cmd = "$readpipe $cmd" if defined($readpipe);
@@ -527,7 +551,7 @@ if(defined($cap_out)) {
 			my $unal = ($fl & 4) != 0;
 			$filt = 1 if $no_unal && $unal;
 			if($passthru) {
-				if(scalar(keys %read_fhs) == 0) {
+				if(scalar(keys %read_fhs) == 0 || ($fl & 256) != 0) {
 					# Next line is read with some whitespace escaped
 					my $l = <BT>;
 				} else {

diff --git a/bin/Linux/bowtie2-align-l b/bin/Linux/bowtie2-align-l
diff --git a/bin/Linux/bowtie2-align-s b/bin/Linux/bowtie2-align-s
diff --git a/bin/Linux/bowtie2-build-l b/bin/Linux/bowtie2-build-l
diff --git a/bin/Linux/bowtie2-build-l-debug b/bin/Linux/bowtie2-build-l-debug
diff --git a/bin/Linux/bowtie2-build-s b/bin/Linux/bowtie2-build-s
diff --git a/bin/Linux/bowtie2-build-s-debug b/bin/Linux/bowtie2-build-s-debug
diff --git a/bin/Linux/samtools b/bin/Linux/samtools
diff --git a/docs/build_db.md b/docs/build_db.md
@@ -6,7 +6,7 @@ This script will allow you to build a MIDAS database using your own genomes or m
 
 This script uses user-defined species and user-supplied genomes to build database files for running MIDAS. This will allow you to estimate the abundance of these species in a shotgun metagenome and quantify gene content and SNPs for species with sufficient data.
 
-First, a pan-genome is built for each species. A pan-genome is defined as the set of non-redundant genes across all genomes for each species. The program USEARCH (http://www.drive5.com/usearch) is first used to cluster genes at 99% percent identity and identify a centroid gene sequence from each cluster. These "centroids" are used for recruiting metagenomic reads. This is done to reduce the number of genes searched while maintaining maximum mapping sensitivity. Next, USEARCH is used to cluster the centroids from 99% identity gene clusters further at 95, 90, 85, 80, and 75 percent identity. After mapping reads to 99% identity centroids with 'run_midas.py genes', reads can be optionally aggregated into gene clusters at any of the lower clustering thresholds with 'merge_midas.py genes'.
+First, a pan-genome is built for each species. A pan-genome is defined as the set of non-redundant genes across all genomes for each species. The program [VSEARCH](https://github.com/torognes/vsearch) is first used to cluster genes at 99% percent identity and identify a centroid gene sequence from each cluster. These "centroids" are used for recruiting metagenomic reads. This is done to reduce the number of genes searched while maintaining maximum mapping sensitivity. Next, VSEARCH is used to cluster the centroids from 99% identity gene clusters further at 95, 90, 85, 80, and 75 percent identity. After mapping reads to 99% identity centroids with `run_midas.py genes`, reads can be optionally aggregated into gene clusters at any of the lower clustering thresholds with `merge_midas.py genes`.
 
 Next, a representative-genome database is built. Representative genomes are used for mapping reads and calling SNPs. The user simply needs to define which genome they want to use as the representative for each species.
 
@@ -16,15 +16,15 @@ Finally, a marker genes database is built. Marker genes are defined as universal
 As with all scripts, MIDAS and its dependencies will need to be installed.  
 Additionally, you need to have the following command-line tools installed:
 
-* USEARCH: http://www.drive5.com/usearch/download.html
+* VSEARCH: https://github.com/torognes/vsearch
 * HMMER3: http://hmmer.org
 
 After installing them, you need to add their installation directories to your PATH:  
-`export PATH=$PATH:/usearch/installation/directory`  
+`export PATH=$PATH:/vsearch/installation/directory`  
 `export PATH=$PATH:/hmmer/installation/directory`
 
 You should be able to call these programs from your command line:  
-`usearch`  
+`vsearch`  
 `hmmsearch`
 
 ## Usage
@@ -33,39 +33,33 @@ You should be able to call these programs from your command line:
 
 ### Input/output files:
 
-<b>indir PATH</b>
-Path to directory of genomes. Each subdirectory should be named with a genome identifier. Each subdirectory should contain the following files:
+<b>indir:</b> Path to directory of genomes. Each subdirectory should be named with a genome identifier. Each subdirectory should contain the following files:
 
-\<genome_id>.fna
-Genomic DNA sequence in FASTA format    
+\<genome_id>.fna: Genomic DNA sequence in FASTA format    
 
-\<genome_id>.faa
-Genomic DNA sequence in FASTA format
+\<genome_id>.faa: Protein sequences in FASTA format
 
-\<genome_id>.ffn
-Protein sequences in FASTA format
-
-\<genome_id>.features
-File specifies genomic coordinates of genes.
-This file should be tab-delimited with a header and 5 fields:  
+\<genome_id>.ffn: Gene sequences in FASTA format
+
+\<genome_id>.genes: Tab delimited file with genomic coordinates of genes. The file should be tab-delimited file with a header and the following fields:  
 
-* scaffold_id (CHAR)
 * gene_id (CHAR)
+* scaffold_id (CHAR)
 * start (INT)
 * end (INT)
-* strand ('+' or '-')	 
-* gene_type ('CDS' or 'RNA')
-
-<b>mapfile PATH</b>  
-Path to mapping file that specifies which genomes belonging to the same species.  
+* strand (+ or -)
+* gene_type (CDS or RNA)
+
+<b>mapfile</b>: Path to mapping file that specifies which genomes belonging to the same species.  
 The file should be tab-delimited file with a header and 3 fields:  
 
 * genome_id (CHAR): corresponds to subdirectory within INDIR 
 * species_id (CHAR): : species identifier for genome_id
 * rep_genome (0 or 1): indicator if genome_id should be used for SNP calling
 
-<b>outdir PATH</b>  
-Directory to store MIDAS database
+<b>outdir:</b> Directory to store MIDAS database
+
+For example input files, see:`/path/to/MIDAS/test/genomes.tar.gz` and `/path/to/MIDAS/test/genomes.mapfile` 
 
 ### Options:
 

diff --git a/docs/citing.md b/docs/citing.md
@@ -0,0 +1,5 @@
+## Citation
+
+
+If you use this tool, please cite:
+S Nayfach, B Rodriguez-Mueller, N Garud, and KS Pollard. "An integrated metagenomics pipeline for strain profiling reveals novel patterns of transmission and global biogeography of bacteria". Genome Research 2016. doi:10.1101/gr.201863.115