This repository has been archived by the owner on Jun 23, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path2_dge_descriptive_viz.Rmd
532 lines (444 loc) · 19 KB
/
2_dge_descriptive_viz.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
---
title: "2. Visualisation of DESeq2 results"
description: "PCA and clustered heatmap"
principal investigator: "Joaquín de Navascués"
researchers: "Aleix Puig-Barbé, Joaquín de Navascués"
output:
html_document:
toc: true
toc_float: true
code_folding: show
theme: readable
df_print: paged
css: doc.css
---
```{r setup, echo=FALSE, cache=FALSE}
ggplot2::theme_set(ggpubr::theme_pubr(base_size=10))
knitr::opts_chunk$set(dev = 'png',
fig.align = 'center', fig.height = 7, fig.width = 8.5,
pdf.options(encoding = "ISOLatin9.enc"),
fig.path='notebook_figs/', warning=FALSE, message=FALSE)
```
# 1 Preparation
**Libraries/utils:**
```{r libraries, warning=FALSE, message=FALSE}
if (!require("librarian")) install.packages("librarian")
librarian::shelf(
# tidyverse
tibble, stringr, tidyr, purrr, dplyr,
# data other
reshape2, santoku, calibrate, readxl, writexl,
DESeq2, DescTools, matrixStats,
# graphics
ggplot2, ggthemes, ggtext, ggrepel, ggpubr,
eulerr, RColorBrewer, cetcolor,
# convenience
here)
if(!exists("ggmaplot2", mode="function")) source("utils.R")
```
**Set working directory:**
```{r setwd}
if (Sys.getenv("RSTUDIO")==1) {
# setwd to where the editor is, if the IDE is RStudio
setwd( dirname(rstudioapi::getSourceEditorContext(id = NULL)$path) )
} else {
# setwd to where the editor is in a general way - maybe less failsafe than the previous
setwd(here::here())
# the following checks that the latter went well, but assuming
# that the user has not changed the name of the repo
d <- str_split(getwd(),'/')[[1]][length(str_split(getwd(),'/')[[1]])]
if (d != 'RNAseq-EmcDaSc-adult_midgut') { stop(
paste0("Could not set working directory automatically to where this",
" script resides.\nPlease do `setwd()` manually"))
}
}
```
**To save images outside the repo (to reduce size):**
```{r define_dir2figs}
figdir <- paste0(c(head(str_split(getwd(),'/')[[1]],-1),
paste0(tail(str_split(getwd(),'/')[[1]],1), '_figures')),
collapse='/')
dir.create(figdir, showWarnings = FALSE)
```
# 2 RNAseq results visualisation: Principal Components Analysis
## 2.1 Load and check expression data
We have seen how batch correction is necessary to show a meaningful PCA. Of the different expression measures, vst-normalised data is the most reasonable to make comparisons across samples.
```{r load-vst}
targets <- readRDS('output/targets.RDS')
vsd <- readRDS('output/vst_pseudocounts_batchCorrected.RDS')
```
When performing PCA on RNAseq data, it is often useful to filter genes with very low variance across experimental conditions. To test whether this makes sense in our case:
```{r visualise-variance, warning=FALSE}
# get per-gene variance and mean expression
var_vsd <- rowVars(assay(vsd), rm.na=TRUE)
avg_vsd <- rowMeans(assay(vsd))
df <- data.frame(Variance=var_vsd, Mean=avg_vsd)
# plot close to variance==0 with inset:
p1 <- ggplot(df, aes(x=Variance, y=Mean)) +
geom_point(size=0.1, alpha=0.5, colour='#fe4b03') +
geom_vline(xintercept=0, colour='black', size=0.1)
p2 <- ggplot(df, aes(x=Variance, y=Mean)) +
geom_point(alpha=0.1) + xlim(0,0.1) + ylim(5,20) +
theme(panel.background = element_rect(fill='grey90')) +
geom_vline(xintercept=0, colour='#02ccfe')
p1 + annotation_custom(ggplotGrob(p2),
xmin=5, xmax=12,
ymin=5, ymax=25)
```
Some genes have very low variance (which is the whole point of the `vst` transformation), but none of them are zero.
There seems to be little point in drawing an arbitrary threshold, so I move on with the whole dataset.
## 2.2 PCA plot
```{r}
pca <- prcomp(t(assay(vsd)), center=TRUE)
scores <- data.frame(targets$sampleIDs, pca$x[,1:2])
xtitle <- paste0('**PC1** (', round(summary(pca)$importance['Proportion of Variance','PC1']*100,1), ' %)')
ytitle <- paste0('**PC2** (', round(summary(pca)$importance['Proportion of Variance','PC2']*100,1), ' %)')
ptitle <- 'Principal component scores'
stitle <- '(*vst* pseudocounts, batch-corrected)'
ggplot(scores,
aes(x = PC1, y = PC2,
label=factor(targets$sampleIDs),
colour=factor(targets$condition_md) )
) +
# data
geom_vline(xintercept=0, linetype='dashed', colour='grey60', linewidth=0.5) +
geom_hline(yintercept=0, linetype='dashed', colour='grey60', linewidth=0.5) +
geom_point(size=4) +
geom_point(size=2, colour='white', alpha=0.5) +
geom_text_repel(size=4, alpha=0.75,
max.overlaps=Inf,
seed=42,
force=0.3, force_pull=1,
box.padding=1, point.padding=0.5) +
lims(x= c(-100, 150), y = c(-80, 80)) +
# decorations
theme_minimal(base_size=16) +
labs(x=xtitle,
y=ytitle,
title=ptitle,
subtitle=stitle) +
scale_colour_discrete(name="condition") +
theme(legend.text = element_markdown(size=10),
axis.title.x = element_markdown(),
axis.title.y = element_markdown(),
plot.title = element_text(hjust=0.5),
plot.subtitle = element_markdown(hjust=0.5, color="grey40"),
axis.text.x = element_text(color="grey60"),
axis.text.y = element_text(color="grey60"),
axis.ticks.x = element_line(linewidth=0.5, linetype = "solid", color="grey60"),
axis.ticks.y = element_line(linewidth=0.5, linetype = "solid", color="grey60"),
axis.ticks.length=unit(-0.25, "cm"),
panel.border = element_rect(linewidth=1, linetype = "solid", fill = NA),
panel.grid = element_blank())
ggsave('PCA_vsd_batchcorr.pdf', plot = last_plot(), device = 'pdf',
path = figdir, dpi = 300)
```
# 3 RNAseq results visualisation: Euler diagrams
To visualise the degree of overlap and independence of the differentially expressed gene sets, I tried both Venn diagrams and UpSet plots in different colourmaps and spatial organisations. Having four genetic conditions made them all too complicated to grasp any pattern at a glance, so I turned to Euler diagrams. These have the advantage of capturing the size of sets and their intersections in the area of the circles and their overlaps, but the disadvantage of doing so only _approximately_. However, this is a very minor problem that does not affect the message.
# 3.1 Prepare data
```{r load-dge}
# experimental design and labels
targets <- readRDS('output/targets.RDS')
# DEG data
DaDaOE_deg <- readRDS('output/Control_vs_DaDaOE.RDS')
DaKD_deg <- readRDS('output/Control_vs_DaKD.RDS')
DaOE_deg <- readRDS('output/Control_vs_DaOE.RDS')
ScOE_deg <- readRDS('output/Control_vs_ScOE.RDS')
# gene symbols
dlist <- read.table(file="resources/gene_symbols.txt", header=TRUE)
names(dlist)[[1]] <- 'ensembl_gene_id'
rownames(dlist) <- dlist$ensembl_gene_id
```
Now, for each condition, we will need to get the lists of misregulated genes for a given threshold of fold change:
```{r extract_regulated_sets, warning = FALSE}
list_of_degs <- list(DaKD_deg, DaOE_deg, DaDaOE_deg, ScOE_deg)
names_degs <- list('daRNAi', 'da', 'da:da', 'scute')
fc_thresh <- 1.5
regs <- extract_regulated_sets(list_of_degs, names_degs, fc_thresh=fc_thresh)
# attach the results to the DEG data frame for each condition, to be used later
DaKD_deg$reg <- regs$daRNAi
DaOE_deg$reg <- regs$da
DaDaOE_deg$reg <- regs$`da:da`
ScOE_deg$reg <- regs$scute
regs[500:504,]
```
Now we need to turn this into logicals:
```{r extract_up|down}
upreg <- get_deg_logical(regs,'up')
downreg <- get_deg_logical(regs,'down')
```
# 3.2 Plot diagrams
Using instructions from the [vignette](https://cran.r-project.org/web/packages/eulerr/vignettes/introduction.html):
```{r euler_plots, results='hide'}
up_colours <- c(
brewer.pal(9,'BrBG')[3], # daRNAi brown
brewer.pal(9,'YlOrBr')[4], # daOVEX yellow/orange
brewer.pal(9,'Reds')[5], # dadaOVEX red
brewer.pal(9,'PuRd')[3] # scOVEX violet
)
down_colours <- c(
brewer.pal(11,'PRGn')[8], # daRNAi green
brewer.pal(11,'RdYlBu')[8], # daOVEX pale blue
brewer.pal(11,'RdBu')[9], # dadaOVEX blue
brewer.pal(9,'Purples')[5] # scOVEX purple
)
upfit <- euler(upreg, loss = 'region', loss_aggregator = 'max')
downfit <- euler(downreg, loss = 'region', loss_aggregator = 'max')
upplot <- plot(upfit,
fill = up_colours,
quantities = TRUE)
downplot <- plot(downfit,
fill = down_colours,
quantities = TRUE)
# save it
pdf(file=paste0(figdir,'/euler_up.pdf'),
width=10, height=12)
upplot
dev.off()
pdf(file=paste0(figdir,'/euler_down.pdf'),
width=10, height=12)
downplot
dev.off()
print(upplot)
print(downplot)
```
# 4 RNAseq results visualisation: MA plots
We use M (log ratio) and A (mean average) plots to have an overview of the distribution of differential gene expression data. Alongside this, it would be informative to visualise where some individual marker genes lie within the distribution, so we describe next how the list of marker genes was selected.
## 4.1 List of marker genes
The aim is to have a list of genes that could serve as cell type _exclusive_ markers. We will only consider a simplified cell type classification: ISC, EB, EE, EC of the adult posterior midgut. They must be either expressed exclusively in a cell type (e.g. _Delta_), or functionally promote one cell type only (e.g. _phyllopod_). We start with a curated list and the markers defined by the Fly Cell Atlas consortium. Then we will remove redundancy.
- Curated markers:
```{r load_GOI, message=FALSE}
genes_of_interest <- read_excel('resources/genes_of_interest.xlsx', sheet='gene_table') %>%
dplyr::select(gene_symbol, bonafide_celltype) %>%
filter(bonafide_celltype != 'NA') %>%
dplyr::rename(celltype = bonafide_celltype) %>%
group_by(celltype) %>%
dplyr::summarise(gene_symbol=paste(gene_symbol, collapse=', '))
genes_of_interest
```
- FCA markers:
```{r loadFCA, message=FALSE}
markers <- read_xlsx(file.path(getwd(), 'resources', 'science.abk2432_table s2.xlsx')) %>%
# get only the gut epithelium/muscle cell types
filter(Tissue=='gut' | `Annotation...1`=='enteroendocrine cell') %>%
# remove a hybrid/transitional cell type
filter(!str_detect(`Annotation...1`, 'differentiating')) %>%
# remove muscle/artefact cells
filter(!str_detect(`Broad annotation`, 'muscle|artefact')) %>%
# get just the columns with marker names
dplyr::select(contains(c('Annotation...1', 'Marker'))) %>%
# join all markers in a new column
unite(gene_symbol, contains('Markers'), sep = ', ')
names(markers)[[1]] <- 'celltype'
markers
```
- Merge markers:
```{r merge_markers, message=FALSE}
# some markers are spelled differently in the Fly Cell Atlas and our genome annotation:
missd.spelld <- c(`wntD, `='', `LManIV, `='', `CG31269, `='', `CG43208, `='', `orcokinin B, `='',
AstB='Mip', poxn='Poxn')
markers <- rbind(genes_of_interest, markers) %>%
# merge same cell type markers
group_by(celltype) %>%
dplyr::summarise(gene_symbol=paste(gene_symbol, collapse=', ')) %>%
# correct difference in synonym/capitalisation with DEG data...
# ... while all gene symbols are in one string
dplyr::mutate(gene_symbol = str_replace_all(gene_symbol, missd.spelld)) %>%
# make each gene name an independent string
rowwise() %>% dplyr::mutate(markers = list(unique(str_split(gene_symbol, ', ')[[1]])))
markers
```
- Filter non-cell type-exclusive markers
```{r filter_markers, message=FALSE}
# filter by exclusivity, but keeping what is common to all enterocytes
## first identify all EC markers
ECmarkers <- markers %>%
filter(str_detect(celltype, 'enterocyte')) %>%
dplyr::select(markers) %>%
unlist(use.names = FALSE) %>%
unique() %>% as.list()
## then non-EC cell markers, with all their repetitions!!
nonECmarkers <- markers %>%
filter(!str_detect(celltype, 'enterocyte')) %>%
dplyr::select(markers) %>%
unlist(use.names = FALSE) %>% as.list()
## get all markers with one occurrence (≠unique!)
repmarkers <- unlist(c(ECmarkers, nonECmarkers))
reps <- rle(sort(repmarkers))
xmarkers <- reps$values[reps$lengths==1]
## apply criterion of exclusivity
exclude <- function(x) x[x %in% xmarkers]
markers <- markers %>%
dplyr::mutate(exclusive = list(exclude(markers)))
pmgcelltypes <- c('intestinal stem cell', 'enteroblast', 'enteroendocrine cell', 'enterocyte of posterior adult midgut epithelium')
pmg.markers <- markers %>%
filter(celltype %in% pmgcelltypes) %>%
dplyr::select(c(1,4)) %>%
dplyr::rename(xmarkers = exclusive)
pmg.markers
```
- Add cell type colours HEX codes
```{r colour_markers}
write_xlsx(unnest_longer(pmg.markers, xmarkers) %>%
dplyr::rename(gene_symbol = xmarkers),
path = 'output/preliminary_Table S4.xlsx')
cellcolours <- c('EB'='#56B2E9', 'EC'='#009E73', 'EE'='#CC79A7', 'ISC'='#D55E00')
pmg.markers$cellcolour <- cellcolours
```
## 4.2 MA plots using `ggpubr::ggmaplot` wrappers
#### MA plot for *daughterless* knockdown
```{r ggmaplot2_da_knockdown}
repulsion <- list(box.padding = 0.1,
point.padding = 0.8,
nudge_x = 3,
nudge_y = 2,
force = 1,
force_pull = 0,
seed.up = 42,
seed.dn = 50,
xlims.up = c(14, NA),
ylims.up = c(5, NA),
xlims.dn = c(18, NA),
ylims.dn = c(NA, -1))
ggmaplot2(deg = DaKD_deg,
fc_thresh = fc_thresh,
markers = pmg.markers,
md_label = '*esg > da^RNAi^*',
repulsion = repulsion)
```
Now with cell type-specific colours:
```{r ggmaplot3_da_kd}
repulsion <- list(box.padding = 0.08,
point.padding = 0.1,
nudge_x = 0,
nudge_y = 0,
force = 1,
force_pull = -0.004,
seed.up = 42,
seed.dn = 5,
xlims.up = c(12, NA),
ylims.up = c(5.5, NA),
xlims.dn = c(16, NA),
ylims.dn = c(NA, -1))
ggmaplot3(deg = DaKD_deg,
fc_thresh = fc_thresh,
markers = pmg.markers,
md_label = '*esg > da^RNAi^*',
repulsion = repulsion)
suppressMessages(ggsave('MA_daKD.pdf', plot = last_plot(), device = 'pdf',
path = figdir, dpi = 300))
```
#### MA plot for *daughterless* overexpression
```{r ggmaplot3_da_overexpression}
repulsion <- list(box.padding = 0.1,
point.padding = 0.1,
force = 1.5,
force_pull = -0.003,
nudge_x.up = 0,
nudge_y.up = 0,
nudge_x.dn = 0,
nudge_y.dn = 0,
seed.up = 42,
seed.dn = 5,
xlims.up = c(12, NA),
ylims.up = c(5, NA),
xlims.dn = c(16, NA),
ylims.dn = c(NA, -1))
ggmaplot3(deg = DaOE_deg,
fc_thresh = fc_thresh,
markers = pmg.markers,
md_label = '*esg > da*',
repulsion = repulsion)
suppressMessages(ggsave('MA_daOE.pdf', plot = last_plot(), device = 'pdf',
path = figdir, dpi = 300))
```
#### MA plot for the overexpression of the Daughterless homodimer
```{r ggmaplot2_da:da_overexpression}
repulsion <- list(box.padding = 0.01,
point.padding = 0.1,
nudge_x = 0,
nudge_y = 0,
force = 1,
force_pull = -0.004,
seed.up = 42,
seed.dn = 5,
xlims.up = c(16, NA),
ylims.up = c(3, NA),
xlims.dn = c(16, NA),
ylims.dn = c(NA, -2))
ggmaplot3(deg = DaDaOE_deg,
fc_thresh = fc_thresh,
markers = pmg.markers,
md_label = '*esg > da:da*',
repulsion = repulsion)
suppressMessages(ggsave('MA_dadaOE.pdf', plot = last_plot(), device = 'pdf',
path = figdir, dpi = 300))
```
#### MA plot for the _scute_ overexpression
```{r ggmaplot2_sc_overexpression}
repulsion <- list(box.padding = 0.01,
point.padding = 0.1,
nudge_x = 0,
nudge_y = 0,
force = 0.7,
force_pull = -0.003,
seed.up = 42,
seed.dn = 5,
xlims.up = c(8, NA),
ylims.up = c(7, NA),
xlims.dn = c(8, NA),
ylims.dn = c(NA, -6))
ggmaplot3(deg = ScOE_deg,
fc_thresh = 2,
markers = pmg.markers,
md_label = '*esg > scute*',
repulsion = repulsion)
suppressMessages(ggsave('MA_ScOE.pdf', plot = last_plot(), device = 'pdf',
path = figdir, dpi = 300))
```
# 5 Heatmaps of individual genes
It is useful to have a representation of the magnitude of the expression changes across samples of a group of genes. Heatmaps seem the best option, provided there is also a representation of whether the changes are statistically significant. To do this, I use a wrapper of _pretty heatmap_ package main function, `pheatmap`, to draw significance-coloured asterisks as a 2nd layer of information.
## 5.1 Heatmap of cell type-exclusive markers
```{r xmarkers-heatmap, fig.width=20}
# get genes of interest (cell type exclusive markers, in this case) in tidy form
tidy.markers <- pmg.markers %>% unnest_longer(xmarkers) %>%
dplyr::rename(gene_symbol = xmarkers)
# add markdown description of genotype
DaKD_deg$condition <- '*da^RNAi^*'
DaOE_deg$condition <- '*da*'
DaDaOE_deg$condition <- '*da:da*'
ScOE_deg$condition <- '*scute*'
# get all the information needed in a df
genehm.df <- bind_rows(ScOE_deg, DaDaOE_deg, DaOE_deg, DaKD_deg) %>%
dplyr::select(ensemblGeneID, gene_symbol, log2FoldChange, padj, condition) %>%
dplyr::rename(p.adjust = padj) %>%
filter(gene_symbol %in% tidy.markers$gene_symbol) %>%
merge(., tidy.markers, by = 'gene_symbol') %>%
dplyr::mutate(condition = factor(condition, levels=c("*scute*", "*da:da*", "*da*", "*da^RNAi^*")))
# pass df to pheatmap wrapper
p <- layer.heatmaph(genehm.df, arr='celltype')
p + geom_hline(aes(yintercept=3.5), linewidth = 0.5) +
theme(plot.margin = margin(r = 25))
```
This arrangement is more informative:
```{r xmarkers-heatmap-vertical, fig.height = 20}
p <- layer.heatmapv(genehm.df, cluster=TRUE)
p + geom_vline(aes(xintercept=3.5), linewidth = 0.5)
suppressMessages(ggsave('marker_heatmap.pdf', plot = last_plot(), device = 'pdf',
path = figdir, dpi = 300))
```
## 5.1 Heatmap of the _Enhancer of split_-Complex genes
```{r E(spl)C-heatmap}
# prepare data
Espl.df <- bind_rows(ScOE_deg, DaDaOE_deg, DaOE_deg, DaKD_deg) %>%
dplyr::select(ensemblGeneID, gene_symbol, log2FoldChange, padj, condition) %>%
dplyr::rename(p.adjust = padj) %>%
filter(grepl("^E\\(spl\\)", gene_symbol)) %>%
dplyr::mutate(condition = factor(condition, levels=c("*scute*", "*da:da*", "*da*", "*da^RNAi^*"))) %>%
dplyr::mutate(cellcolour = "#000000")
# plot
p <- layer.heatmapv(Espl.df, cluster=TRUE)
p + geom_vline(aes(xintercept=3.5), linewidth = 0.5)
suppressMessages(ggsave('Esplit_heatmap.pdf', plot = last_plot(), device = 'pdf',
path = figdir, dpi = 300))
```