-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpipeline.Rmd
378 lines (289 loc) · 17.3 KB
/
pipeline.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
---
title: "RNAseq data simulation and benchmark"
output:
html_document:
number_sections: yes
theme: cosmo
toc: yes
toc_float: yes
html_notebook:
number_sections: yes
theme: cosmo
toc: yes
toc_float: yes
bibliography: pipeline.bib
---
```{r}
library(knitr)
library(ggplot2)
library(tools)
source("util_align.R")
source("util_de.R")
```
#Introduction
To run all the simulations in this notebook, set runSimulation=TRUE
To run the entire pipeline, set runPipeline=TRUE
```{r}
runSimulation=FALSE
runPipeline=FALSE
```
#Run the pipeline
This section automatically generates lists of differentially expressed genes after specifying [user inputs](#userInput_pipeline). For the interpretation and validation of DE, see [this section](#afterDE)
Before running the code, [check the installation of required packages](#runPipeline_package) and add the path of each package to environment.
If using Windows, Cygwin or Linux Subsystem for Windows 10 is required for the shell scripts (called by system function).
Salmon is not supported on Windows, as of version 1.2.0
##Required packages {#runPipeline_package}
The versions used while constructing this pipeline is listed in parentheses
FastQC(v0.11.8): http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Salmon(0.8.2):https://salmon.readthedocs.io/en/latest/index.html [@patro_salmon_2017]
R packages:
- limma 3.32.10 [@ritchie_limma_2015]
- biomaRt 2.32.1
##Pipeline
###User Input {#userInput_pipeline}
Specify the following parameters before running the pipeline
```{r, eval=runPipeline}
outDir="Path to output directory"
## fastq files ##
dataDir_fq=c("vector of paths to directories containing all fastq files") #if there are multiple directories, each will be considered as a batch; the directory names will be used as batch name
fastq_extension=c('fastq','fq','fastq.gz','fq.gz') #add fastq file extension if not already listed
## pseudo-alignment with salmon ##
runSalmon=TRUE #option to reuse previous salmon results
salmon_index=NA #if using pre-computed index, replace NA with path
reference_fa=NA #if not using pre-computed index, replace NA with path to the fasta of reference transcriptome
reftxpt='ensembl' #name of reference
paired_end=TRUE #if library is paired-end
if(paired_end){
#option 1: provide the min number of starting characters to identify a library pair, e.g. 9 for library01_left.fq, library01_right.fq, library02_left.fq, library01_right.fq
libID_len=NA #int or vector of length len(dataDir_fq)
#option 2: enter full path to all library pairs as shown
libPairs=list(c('path to library 1 left reads','path to library 1 right reads'),c('library 2 left reads','library 2 right reads'))
}
## DE ##
condition<-list(covariate1=c("condition1_1","condition1_2"),covariate2=c("condition2_1","condition2_2")) #list the experimental conditions; e.g. list(sex=c('female',"male"),treatment=c("high","low"))
refcondition<-list(c("covariateName","conditionName"),c("covariateName","conditionName")) #condition with respect to which DE of other samples is calculated
### example of loading transcript to gene matrx from ensembl using biomaRt; alternatively you can provide a matrix with the first column (target_id) containing transcript id and second column (ens_gene) containing gene id; this step aggregates transcription-level alignment to gene-level counts ###
ensembl <- useMart("ENSEMBL_MART_ENSEMBL",dataset="mmusculus_gene_ensembl")
ttg<-biomaRt::getBM(attributes = c("ensembl_transcript_id","ensembl_gene_id"),mart = ensembl)
ttg <- dplyr::rename(ttg, target_id = ensembl_transcript_id,ens_gene = ensembl_gene_id)
```
Optional adjustments to default parameters
```{r,eval=runPipeline}
alignmethod<-"salmon" #name of the aligner
## salmon ##
kmer_length=31
libType='A'
## DE ##
demethod=c('voom') #list of DE methods; choose from 'voom', 'edgeR', 'DEseq2'
lowcountfilter=TRUE #if filter out low-count genes by CPM threshold
threshCPM_byAvgColSum=1.5/10e6 #genes passing filter have averages of replicates > avgColsum*threshCPM_byAvgColSum; avgColSum = average of sums of counts of all samples; genes passing filter for each batch are calculated separately
maxNDEG<-'all' #maximum number of genes in the output matrix
FDR<-1 #maximum false discovery rate above which genes will not be included in output
multipletesting_method<-'BH' #method to adjust for multiple testing; choose from "none", "BH", "BY" and "holm"
### optionally provide/construct id2symbol matrix to convert gene ID to symbol, e.g. ensembl ID to MGI symbol
id2symbol=NA #option 1: 2-column matrix - "id" column contains gene ID; "symbol" column contains gene symbols to be converted to; each row specifies conversion for one gene; duplication of id is allowed
convID=NA #option 2: build id2symbol from biomaRt, input format c(id='name of gene ID for biomaRt attribute',symbol='name of gene symbol for biomaRt attribute'), e.g. c(id="ensembl_gene_id",symbol="mgi_symbol")
```
Initializing analysis...
```{r, eval=runPipeline}
## checking output directory ##
if(!dir.exists(outDir)){dir.create(outDir)}
batches<-c()
for(fqi in dataDir_fq){
fqi_name=basename(fqi)
batches<-c(batches,fqi_name)
}
contrasts<-condition[-which(condition %in% c(refcondition))] #TODO: modify this to generate a list with length=length(refcondition)
```
###QC with fastQC
```{r, eval=(FALSE | runPipeline)}
for(fqi in dataDir_fq){
fqi_name=basename(fqi)
fastqc_out=file.path(outDir,"fastQC_results",fqi_name)
dir.create(fastqc_out)
for(fq in list_files_with_exts(fqi,fastq_extension,recursive=TRUE,full.names = TRUE)){
system(paste0("fastqc -f fastq ",fq," --extract ","--outdir=",fastqc_out))
}
}
```
###Pseudo-alignment with salmon
```{r, eval=(FALSE | (runPipeline & runSalmon))}
setwd(outDir)
if(is.na(salmon_index)){
system(paste0("salmon index -t ",reference_fa," -i transcripts_index_salmon --type quasi -k 31"))
salmon_index="transcripts_index_salmon"
}
if(!paired_end){
for(fqi in dataDir_fq){
fqi_name=basename(fqi)
salmon_out=file.path(outDir,paste0("salmon",reftxpt,fqi_name))
dir.create(salmon_out)
for(fq in list_files_with_exts(fqi,fastq_extension,recursive=TRUE,full.names = TRUE)){
salmon_out_i=file.path(salmon_out,file_path_sans_ext(basename(fq)))
dir.create(salmon_out_i)
system(paste0("salmon quant -i ",salmon_index," -l ",libType," -1 ",fq," -o ",salmon_out_i))
}
}
}else{
if(!is.na(libID_len)){
getLibPairs(libID_len,dataDir_fq) #TODO: implement getLibPairs in util_align
}
for(fqpair in libPairs){
fqpair_name=basename(dirname(fqpair[[1]]))
salmon_out=file.path(outDir,paste0("salmon",reftxpt,fqpair_name))
if(!dir.exists(salmon_out)){dir.create(salmon_out)}
salmon_out_i=file.path(salmon_out,file_path_sans_ext(basename(fqpair[[1]])))
dir.create(salmon_out_i)
system(paste0("salmon quant -i ",salmon_index," -l ",libType," -1 ",fqpair[[1]]," -2 ",fqpair[[2]]," -o ",salmon_out_i))
}
}
```
###Differential expression analysis
```{r,eval=(FALSE | runPipeline)}
#TODO: modify util_de.R, include multiple covariates & refcondition
sourcedir<-outDir
setwd(sourcedir)
##get counts
countsdf<-countdf(alignmethod,reftxpt,batches,condition)
##low count filtering##
if(lowcountfilter==TRUE){
countsdfCPM<-filterCPM(countsdf,threshCPM_byAvgColSum)
}
savepath<-file.path(outDir,'de')
if(!dir.exists(savepath)){dir.create(savepath)}
for(dem in demethod){
for(c in contrasts){
deresults<-de(countsdfCPM_photo2_no20mW,c,dem,multipletesting_method,nDEG = maxNDEG,FDR = FDR)
# drawvenn(deresults,"/Users/Xinyi/rna-seq/data/n2a_prelim/de/venn_de",refcondition,c)
genelist(deresults,savepath,paste0(c,dem),id2symb=id2symbol,convID=convID)
}
}
```
#Benchmark of pipelne & alternative choicees
##(Pseudo-)alignment and Quantification
The first step in the analysis is usually to transform the fastq files into gene expression matrices. This involves aligning reads to reference genome or reference transcriptome and quantification of aligned reads. Here's a list of choices to consider:
- choice of reference
* The choice of reference transcriptomes has pronounced influence on gene quantifications due to reasons such as different gene definitions [@zhao_comprehensive_nodate]. See [our demonstration of the difference in DE results due to choice of references ](#ref_align_deagreement).
* alignment to genome vs transcriptome
- choice of alignment/pseudo-alignment method
###simulation and benchmark
We used Polyester [@frazee_polyester_2015] to simulate RNAseq libraries where a user generated fold-change matrix is used as the ground truth. This simulation is based on mouse mRNA reference transcriptome from UCSC (simAlign/reftxmrna_ucsc_refgene.fasta.gz, assembly GRCm38/mm10, UCSC RefSeq (refGene) table [@karolchik_ucsc_2004]).
To use our simulated data, please refer to simAlign/simfasta1, simAlign/simfasta2, simAlign/simfasta3
To use our framework directly, please refer to the code below and replace parameter values as needed (set eval or runSimulation to TRUE before running).
####Run simulation
Required packages:
- Polyester [@frazee_polyester_2015]
- seqinr
- Biostrings
```{r, eval=(FALSE | runSimulation)}
refPath<-"reftx/reftxmrna_ucsc_refgene.fasta.gz" #path to reference transcriptome
ngroup<-2 #number of conditions in the library
ndata<-3 #number of replicates in each condition
deprob<-0.15 #probability of differential expression
demean<-2 #mean of fold change for DE genes
savepath_fcmtx='simfcmtx' #directory to save fold change matrix
savepath_lib="simfasta" #directory to save simulated libraries
simulateLib_polyester(refPath,ngroup,ndata,deprob,savepath_fcmtx,savepath_lib)
```
####Comparisons of mapping rates
Mapping rates between Salmon and kallisto (0.43.1) are comparible when using ensemble and UCSC as the references.Mapping rates can be found as outputs of Salmon and Kallisto
![](simAlign/results/MappingRates.jpeg){ width=70% }
####Comparisons of quantification errors
percent error of transcript X = abs(estimated transcript count of X - actual count of X)/actual count of X * 100
![](simAlign/results/quantError.png){ width=70% }
####Impacts of reference and (pseudo-)alignment methods on downstream DE analysis {#ref_align_deagreement}
We compared the impacts on two DE algorithms: DEseq2 and limma-voom. The comparison was performed for two datasets listed in below.
#####Differentiating 3T3L1 pre-adipocytes
To reproduce our results, run deAgreement.R in branch master (results will be stored in simAlign/results/3T3L1/deagreement and simAlign/results/3T3L1/deagreement_commonGene). This detects DE between Day 1 and Day 5 after initiation of differentiation.
Examples using DEseq2 and voom are shown below for FDR < 0.05. After filtering out DE genes with low logFC, proportions of overlapping DE genes decrease. See the results not shown here (including other FDR, logFC thresholds) in simAlign/results/3T3L1/deagreement and simAlign/results/3T3L1/deagreement_commonGene.
![DEseq2; FDR<0.05; no logFC cutoff](simAlign/results/3T3L1/deagreement/pvalue05lfc0deseq2.jpeg){ width=70% }
![DEseq2; FDR<0.05; absolute value of logFC>1](simAlign/results/3T3L1/deagreement/pvalue05lfc1deseq2.jpeg){ width=70% }
![voom; FDR<0.05; no logFC cutoff](simAlign/results/3T3L1/deagreement/pvalue05lfc0voom.jpeg){ width=70% }
![voom; FDR<0.05; absolute value of logFC > 1](simAlign/results/3T3L1/deagreement/pvalue05lfc1voom.jpeg){ width=70% }
There are genes not shared by all combinations of reference and alignment method. So we also plotted the Venn diagrams with the same DE analysis results but only included genes shared by all combinations of {salmon,kallisto}x{ensembl, UCSC refSeq}.
![DEseq2; FDR<0.05](simAlign/results/3T3L1/deagreement_commonGene/pvalue05deseq2.jpeg){ width=70% }
![voom; FDR<0.05](simAlign/results/3T3L1/deagreement_commonGene/pvalue05voom.jpeg){ width=70% }
#####Neuro2A treated with alpha-synuclein fibril
To reproduce our results, run deAgreement.R in branch n2a (results will be stored in simAlign/results/n2a_asyn/deagreement and simAlign/results/n2a_asyn/deagreement_commonGene). This detects DE between two time points in our alpha-syn experiment (T2 and T7).
Two examples using DEseq2 and voom are shown below. Only genes shared by all combinations of {salmon,kallisto}x{ensembl, UCSC refSeq} are included. See the results not shown here (including other FDR, logFC thresholds) in simAlign/results/n2a_asyn/deagreement and simAlign/results/n2a_asyn/deagreement_commonGene.
![DEseq2; FDR<0.05; no logFC cutoff](simAlign/results/n2a_asyn/deagreement_commonGene/pvalue05lfc0deseq2Batch1.jpeg){ width=70% }
![voom; FDR<0.05; no logFC cutoff](simAlign/results/n2a_asyn/deagreement_commonGene/pvalue05lfc0voomBatch1.jpeg){ width=70% }
##Normalization and DE analysis
Statistical inferences of differential gene expression (DE), including the normalization procedures, usually make assumptions about typical datasets' characteristics, such as gene-count distribution and degree of DE [@seyednasrollah_comparison_2015, @dillies_comprehensive_2013, @dillies_comprehensive_2013]. It is important to understand how robust each normalization or DGE analysis method is when its assumptions are violated.
To test the performance given datasets with distinct features, we simulated
- 23 datasets with 11 varying parameters
- 11 empirically simulated datasets with 5 varying parameters
![](testdata/parameters.png){ width=70% }
All simulations used Splatter [@zappia_splatter:_2017] with dropout rate equals to zero. Fixed parameters of the empirical simulations were derived from a dataset by Bottomly et al [@bottomly_evaluating_2011].
###Simulation results
We tested the following combinations of normalization and DE analysis methods:
- edgeR with
- TPM
- FPKM
- *Trimmed Mean of M-values (TMM), default normalization of edgeR*
- Relative Log Expression (RLE)
- Upper Quartile (UQ)
- RUVg + TMM
- RUVg (remove unwanted variation) + RLE
- RUVg + UQ
- DEGES (DEG elimination strategy) + TMM
- DEGES + RLE, DEGES + TbT
- DEGES with
- DESeq2
- baySeq
- edgeR
- limma-voom
- baySeq, DESeq2, and limma-voom with their default normalization methods
The benchmark results can be found in testdata/normtestresults and testdata/DEtestresults
The simulated datasets can be found in testdata/simulatedData
To reproduce our results, see the [next section](#de_simtest)
Choice of normalization methods had very small impact on the benchmarking results. The ROC, accuracy vs FDR, and FDR control curves are comparable for all normalizations. An example is shown for ROC
![](testdata/normtestresults/visualizedresults/roc/empsim_de.downProb_def.jpeg){ width=70% }
Choice of DE analysis methods has trade-off between true positive rates and false discovery rates
![](testdata/DEtestresults/visualizedderesults/tprvalue/empsim_de.downProb_def.jpeg){ width=70% }
![](testdata/DEtestresults/visualizedderesults/trueFDR/empsim_de.downProb_def.jpeg){ width=70% }
###Run simulation & benchmark {#de_simtest}
Requires:
- splatter 1.0.3 [@zappia_splatter:_2017]
```{r, eval=(FALSE | runSimulation)}
## enter the parameter values you want to test ##
groupCells<-c(3,9,6)
mean.shape<-c(0.4,0.8,0.6)
mean.rate<-c(0.1,0.5,0.3)
lib.loc<- c(6,16,11)
lib.scale<-c(0.15,0.5,0.35)
out.prob<-c(0.01,0.1,0.05)
de.prob<-c(0.05,0.3,0.1)
de.downProb<-c(0.25,0.75,0.5)
de.facLoc<-c(0.05,0.3,0.1)
de.facScale<-c(0.2,0.6,0.4)
bcv.common<-c(0.05,0.2,0.1)
generateSimulatedCounts(normalizationMethod,deMethod,groupCells,mean.shape,mean.rate,lib.loc,lib.scale,out.prob,de.prob,de.downProb,de.facLoc,bcv.common)
```
#Interpretation and validation of differential expression analysis {#afterDE}
##False positives
We performed RNA-seq twice on a total of 6 samples of mouse Neuro2A cell cultures (3 samples for each RNA-seq). The same protocols were used and all libraries have comparable quality. We found 2665 differentially expressed genes with FDR < 0.05 using the default settings of this pipeline. With the limited number of replicates available in a typical RNA-seq experiment, it is important to consider confounding differences caused by experimental design and handling. In the following sections, we provide some methods for validation and interpretation of DE genes.
##Functional enrichment analysis
###Statistical over-representation test
This tests for any over or under represented annotation terms in the query list with respect to the background. Several options are listed below:
* DAVID [@huang_systematic_2009]: https://david.ncifcrf.gov/
- Only over-representations are calculated
- Last knowledge base update was in May 2016
- DAVID has access to multiple databases
* PANTHER [@mi_protocol_2019]: http://pantherdb.org/
- Both over and under representations are calculated
- Maintained up to date with Gene Ontology annotations
####Choosing background gene list
Some common choices of background include genes known to be expressed in a particular cell type of interest. A proxy for cell-type specific background can be found by using all non-zero genes:
```{r,eval=FALSE}
countsmtx<-as.matrix(as.data.frame(countsdf))
background<-rownames(countsmtx)[which(rowSums(countsmtx) > 0)]
```
However, it is possible that this list does not cover all genes that can be expressed due to the limited experimental conditions tested.
###GSEA
###WGCNA
###ADAGE
##Dimensionality reduction
##Enrichment of transcription factors
##Batch consensus
# References