These are the scripts written and used by the authors to implement all bioinformatic and statistical analyses in the manuscript:
"The evolutionary history of human colitis-associated colorectal cancer"
Authors: Ann-Marie Baker^*, William Cross*, Kit Curtius*, Ibrahim Al-Bakir*, Chang-Ho Ryan Choi*, Hayley Davis, Daniel Temko, Sujata Biswas, Pierre Martinez, Marc Williams, James O Lindsay, Roger Feakins, Roser Vega, Stephen J Hayes, Ian PM Tomlinson, Stuart AC McDonald, Morgan Moorghen, Andrew Silver, James E East, Nicholas A Wright, Lai Mun Wang, Manuel Rodriguez-Justo, Marnix Jansen, Ailsa L Hart^, Simon J Leedham^ and Trevor A Graham^
* joint first authors
^For correspondence:
- Ann-Marie Baker
Email: [email protected]
- Simon Leedham
Email: [email protected]
- Trevor Graham
Email: [email protected]
Data used in the analyses (except for publicly available TCGA datasets, which are indicated by placeholders in the scripts) can be accessed at the European Genome-phenome Archive (EGA) under study number EGAS00001003028.
- Bash script that contains the bioinformatics pipeline to create BAM files from 81 low-pass whole-genome-sequenced tissue samples
- Takes FASTQ files and a .txt file with sample information as input (see workflow)
- R script that takes the processed BAM files as input
- Saves segmentation and CNA call data in a .Rdata file for downstream analyses (see workflow below)
- R script that performs the analyses to create:
  - Figures 3B, 3C, 4
  - Supplementary Tables 9, 10
  - All corresponding Results in the Main text regarding the above
- Makes a shell script for each sample in a list. Shell script is designed to run on a cluster.
- Runs BCFtools to call a preliminary list of candidate SNVs
- Same as above but for indels
- Makes a shell script to run Platypus jointly for a sample set (defined in the sample list)
- Each chromosome is run separately
- The source command is used to assess the BCFtools variant proposals
- The mergeFinalVCF script concatenates the resulting per-chromosome VCFs into a single file
- Same as above but for indels
- Script runs locally
- Filters the Platypus-derived variants by coverage (minimum 10X for each variant)
- Annotates variants using ANNOVAR
- Outputs a .txt version of the VCF for germline and somatic variants separately
- Script produces Sequenza .seqz files from BAM files
- Script runs locally to analyse .seqz files as per the Sequenza manual
- Takes a .txt VCF file and outputs a .nexus file for phylogenetic analyses of variants
- Nexus file can be run in PAUP, Phylip or other compatible software
- Runs locally
- Reads in a .tre file containing multiple phylogenies and outputs the most parsimonious tree along with statistics
- Runs locally
- Produces a table of tree shape statistics and homoplasy indexes
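
The conversion of a variant table to a Nexus file described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the repository's own script; the function name `write_nexus`, the sample names, and the binary presence/absence encoding of variants are all assumptions made for this example.

```python
def write_nexus(samples, variant_ids, presence):
    """Return NEXUS-format text for a binary sample-by-variant matrix.

    samples      -- list of sample (taxon) names, without whitespace
    variant_ids  -- list of variant identifiers (only their count is used)
    presence     -- dict mapping sample name to a list of 0/1 flags,
                    one flag per entry in variant_ids
    """
    ntax = len(samples)
    nchar = len(variant_ids)
    for name in samples:
        if len(presence[name]) != nchar:
            raise ValueError(f"{name}: expected {nchar} characters")

    # Pad names so that the character columns line up.
    width = max(len(name) for name in samples)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={ntax} NCHAR={nchar};",
        '  FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=?;',
        "  MATRIX",
    ]
    for name in samples:
        row = "".join(str(flag) for flag in presence[name])
        lines.append(f"    {name.ljust(width)}  {row}")
    lines.append("  ;")
    lines.append("END;")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical toy input: one normal sample and two tumour regions.
    demo = write_nexus(
        ["Normal", "Region1", "Region2"],
        ["var1", "var2", "var3"],
        {"Normal": [0, 0, 0], "Region1": [1, 1, 0], "Region2": [1, 0, 1]},
    )
    print(demo)
```

Including an all-zero normal sample, as in the toy input, gives the tree-building software a natural outgroup for rooting; whether the original script does this is not stated here.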
There are two scripts in the 'scripts' directory.
The first script is in three parts:
A) Code is provided to read in mutation data for CA-CRC and (TCGA) sporadic CRC tumours, and classify mutations in each tumour among 96 mutation channels. For the CA-CRC data, since there are multiple sequenced regions for the same tumour, mutations are also assigned to the individual samples/regions of each tumour.
B) Code is provided to assign mutational signature activities to each CA-CRC and sporadic CRC sample using non-negative least squares regression, implemented in the R package 'nnls'. This section also contains code for visualising the mutational signature assignments in CA-CRC samples.
C) Code is provided to assess the differences in inferred mutational signature composition between CA-CRC and sporadic CRC samples.
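
Part A above classifies each mutation among 96 channels. A minimal sketch of that classification follows, assuming the conventional scheme of six pyrimidine-referenced substitution classes combined with the 16 possible flanking-base pairs. The repository's implementation is in R; this Python version and its function names (`mutation_channel`, `count_channels`) are hypothetical and for illustration only.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def mutation_channel(context, alt):
    """Map a trinucleotide context and alternate base to a channel label.

    context -- three reference bases with the mutated base in the middle
    alt     -- the alternate base
    Returns a label such as 'A[C>T]G'.
    """
    context = context.upper()
    alt = alt.upper()
    if len(context) != 3 or len(alt) != 1:
        raise ValueError("expected a 3-base context and a 1-base alt")
    if any(base not in COMPLEMENT for base in context + alt):
        raise ValueError("only A, C, G and T are supported")
    ref = context[1]
    if ref == alt:
        raise ValueError("alt must differ from the reference base")
    # Purine references are reported on the opposite strand so that
    # every channel has a pyrimidine (C or T) reference base.
    if ref in "AG":
        context = reverse_complement(context)
        alt = COMPLEMENT[alt]
        ref = context[1]
    return f"{context[0]}[{ref}>{alt}]{context[2]}"


def all_channels():
    """Return the 96 channel labels in a fixed order."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five in "ACGT"
        for three in "ACGT"
    ]


def count_channels(mutations):
    """Count (context, alt) pairs per channel; absent channels stay at 0."""
    counts = dict.fromkeys(all_channels(), 0)
    for context, alt in mutations:
        counts[mutation_channel(context, alt)] += 1
    return counts
```

For example, `mutation_channel("CGT", "A")` and `mutation_channel("ACG", "T")` describe the same event seen from opposite strands, so both fall into the same channel. The resulting 96-element count vector per sample or region is the kind of input that the signature assignment in part B operates on.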
The second script is in three parts:
A) Code is provided to read in mutation data for CA-CRC tumours and classify mutations in each tumour among 96 mutation channels. Since there are multiple sequenced regions for the same tumour, mutations are also assigned to the individual samples/regions of each tumour. Based on these regional assignments, mutations are also classified by timing as 'pre', 'early', or 'late'.
B) Code is provided to assign mutational signatures to the mutations in each timing class for each tumour
C) Code is provided to test and visualise differences in each mutational signature between timing classes across samples.
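
To illustrate what assigning signature activities to each timing class involves, here is a minimal sketch that fits non-negative activities to a mutation count vector. It assumes the second script uses the same non-negative least squares idea described for the first script, which is not stated explicitly. Note also that this sketch uses a simple projected gradient descent written in plain Python, not the R package 'nnls'; the function names, toy signatures, and counts are hypothetical.

```python
def fit_activities(signatures, counts, iterations=5000):
    """Approximate the non-negative activities x that minimise
    the squared error between sum_j x[j] * signatures[j] and counts.

    signatures -- list of signature vectors, one value per channel
    counts     -- observed mutation counts, one value per channel
    """
    k = len(signatures)
    n = len(counts)
    if any(len(sig) != n for sig in signatures):
        raise ValueError("every signature needs one value per channel")

    # The sum of squared entries is an upper bound on the largest
    # eigenvalue of S^T S, so 1 / bound is a conservative step size.
    bound = sum(v * v for sig in signatures for v in sig)
    if bound == 0:
        raise ValueError("signatures must not all be zero")

    x = [0.0] * k
    for _ in range(iterations):
        fitted = [sum(x[j] * signatures[j][i] for j in range(k)) for i in range(n)]
        residual = [fitted[i] - counts[i] for i in range(n)]
        for j in range(k):
            # Half the gradient of the squared error with respect to x[j].
            half_grad = sum(signatures[j][i] * residual[i] for i in range(n))
            # Gradient step followed by projection onto x[j] >= 0.
            x[j] = max(0.0, x[j] - half_grad / bound)
    return x


def fit_by_timing_class(signatures, catalogues):
    """Fit activities separately for each timing class.

    catalogues -- dict mapping a class label ('pre', 'early', 'late')
                  to that class's mutation count vector
    """
    return {
        label: fit_activities(signatures, counts)
        for label, counts in catalogues.items()
    }


if __name__ == "__main__":
    # Toy example with 3 channels and 2 signatures instead of 96 channels.
    toy_signatures = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
    toy_catalogues = {"pre": [22, 9, 9], "late": [2, 6, 12]}
    for label, activities in fit_by_timing_class(toy_signatures, toy_catalogues).items():
        print(label, [round(a, 3) for a in activities])
```

Projected gradient descent is chosen here only because it fits in a few dependency-free lines; it can converge slowly when signatures are highly similar, so a dedicated solver is the better choice for real 96-channel data.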