Analysing pathogen genetic diversity and relationships between and within hosts at once, in windows along the genome.
phyloscanner's input is bam files: reads (fragments of nucleotide sequence) that have been mapped (aligned) to the correct part of some reference genome. We wrote phyloscanner to analyse bam files that each represent a pathogen population in one host, exhibiting within-host and between-host diversity; in general use each bam file should be a sample representing some subpopulation, and we analyse within- and between-sample diversity.
phyloscanner is freely available under the GNU General Public License version 3, described here.
phyloscanner runs natively on Linux and Mac OS, but not Windows.
However on any operating system (including Windows), if you have VirtualBox installed, you can run this image of Linux Ubuntu 16.04 which contains phyloscanner v1.1.2.
To make phylogenies from mapped reads, phyloscanner requires samtools, pysam (0.8.1 or later), biopython, mafft and RAxML; notes on installing these are here.
To analyse these phylogenies, or your own provided as input, phyloscanner needs some R packages; notes on installing these are here.
Info and help:
- The phyloscanner preprint, discussing the method and its scientific context, is here.
- The code's manual is here.
- Instructions and example data for a practical on using phyloscanner are here.
- Problem with the code? Create a New Issue.
- Query? Ask it publicly at the google group.
If you use phyloscanner for published work, please cite it and the tools it uses, details here.
The simulated bam files in ExampleInputData illustrate interesting within- and between-host diversity.
Let's analyse them with phyloscanner.
For this example I'll assume you've downloaded the phyloscanner code to your home directory, i.e. that its found in ~/phyloscanner/
.
First we need to make a file listing the files we want analysed; that file should be comma-separated variable format, containing lines like this: BamFile1,ReferenceFile1,ID1
where the third field is optional (if present it is used as an ID for that bam file).
You can make that csv file manually if you like, but here is an example of making it automatically from the command line for these bam files:
for i in ~/phyloscanner/ExampleInputData/*.bam; do
RefFile=${i%.bam}_ref.fasta;
echo $i,$RefFile
done > InputFileList.csv
To make some within- & between-host phylogenies, run this command from the directory where your local copy of this code repostory lives (or adjusting paths appropriately if run from another directory):
~/phyloscanner/phyloscanner_make_trees.py InputFileList.csv --auto-window-params 300,-700,1000,8300 --alignment-of-other-refs ~/phyloscanner/InfoAndInputs/2refs_HXB2_C.BW.fasta --pairwise-align-to B.FR.83.HXB2_LAI_IIIB_BRU.K03455
(Those window parameters make best use of this simulated data, the -A
option includes an alignment of extra reference sequences along with the reads, see the manual for the --pairwise-align-to
option.)
Now let's analyse those phylogenies:
~/phyloscanner/phyloscanner_analyse_trees.R RAxMLfiles/RAxML_bestTree. MyOutput s,20 --outgroupName C.BW.00.00BW07621.AF443088 --multifurcationThreshold g
In the output you'll see trees and summary information indicating that these samples constitute:
- a straightforward, singly infected individual,
- a dually infected individual i.e. infected by two distinct strains of virus,
- a contamination pair (one contaminating the other), and
- a pair exhibiting ancestry, i.e. one having evolved from the other. For populations of pathogens, this implies transmission, either indirectly (via unsampled intermediate patients) or directly.
Here is one of the trees from the output. Each patient has many sequences (reads), with one colour per patient. Extra reference sequences are coloured black. Can you see why each of the patients merits their label?
(These bam files were generated by taking HIV sequences from the LANL HIV database, using them as starting points for evolution simulated with SeqGen and calibrated to our own real data on within-host diversity. Reads were then generated in eight windows of the genome, instead of all along the genome, purely to keep file sizes nice and small while remaining an interesting example.)
I've only got raw reads fresh off the sequencing machine, not mapped reads. Any recommendations on how to map?
Why yes! For HIV at least, try shiver. We discussed here problems and solutions for mapping well in general.