-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path02--quick_example.Rmd
358 lines (279 loc) · 16.2 KB
/
02--quick_example.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
---
title: "Example analysis"
output: html_document
bibliography: "bibtexlib.bib"
---
```{r home_setup, echo=FALSE, warning=FALSE, message=FALSE}
library(grid)
source("style.R")
```
This is a short example analysis to give you a feel for how `metacoder` and `taxa` are used in microbiome analysis.
If something does not make sense now, don't worry!
We will cover everything shown here in greater detail later.
## Reading data
The first step in any analysis is getting your data into R.
This can be difficult for taxonomic data since it has a hierarchical component (i.e., the taxonomic tree).
`Metacoder` has functions for `r gloss$add('parsing')` specific file formats used in metagenomics research.
However, for this demonstration, we will be using a more all-purpose parser from the `taxa` package meant for tabular data.
Included in `metacoder` is an example dataset that is a subset of the Human Microbiome Project data.
This dataset has two parts:
* An abundance matrix called `hmp_otus`, with **samples in columns** and `r gloss$add('Operational Taxonomic Units (OTUs)')` in rows
* A sample data table called `hmp_samples`, with **samples as rows** and columns of information describing the samples (e.g., gender)
```{r echo=FALSE, include=FALSE}
knitr::include_graphics('./figure_sources/preferred_data_format_counts.png')
```
```{r echo=FALSE, include=FALSE}
knitr::include_graphics('./figure_sources/preferred_data_format_samples.png')
```
```{r echo=FALSE, include=FALSE}
knitr::include_graphics('./figure_sources/typical_input_format.png')
```
This is a typical way for this kind of data to be formatted and is the preferred way for packages like `metacoder` and `taxa`.
Lets take a look at the dataset we will use in this example:
```{r message=FALSE}
library(metacoder)
print(hmp_otus)
print(hmp_samples)
```
One challenge this data presents is the format of the taxonomic information.
```{r}
hmp_otus$lineage[1:4]
```
We can process the abundance matrix, and parse the taxonomic information at the same time, using a parser from `taxa`.
```{r}
obj <- parse_tax_data(hmp_otus,
class_cols = "lineage", # the column that contains taxonomic information
class_sep = ";", # The character used to separate taxa in the classification
class_regex = "^(.+)__(.+)$", # Regex identifying where the data for each taxon is
class_key = c(tax_rank = "info", # A key describing each regex capture group
tax_name = "taxon_name"))
```
This returns a `taxmap` object.
The `taxmap` class is designed to store any number of tables, `r gloss$add('list', shown = 'lists')`, or `r gloss$add('vector', shown = 'vectors')` associated with taxonomic information and facilitate manipulating the data.
Here is what that object looks like:
```{r}
print(obj)
```
Note how the original abundance matrix is contained in the `taxmap` `r gloss$add('object')` with an additional column called "taxon_id".
This table is stored in the list `obj$data`, which can contain any number of user-defined datasets.
Also note that we have a different number of taxa and OTUs.
This is different from a traditional ecological dataset.
An OTU may contain sequence variants, so a single OTU represents multiple similar sequences.
These OTUs are assigned to taxa so a taxon may include multiple OTUs.
Taxa may be organized in different `r gloss$add('Taxonomic ranks', shown = 'ranks')`.
For example, one taxon might be a species while another might be a genus.
These abstractions are necessary because we may have sequences that we can not confidently assign to a traditional taxon.
## Abundance matrix manipulations
### Removing low-abundance counts
Recall that the abundance matrix contains samples in columns and OTUs in rows.
Each cell is the number of times an OTU was observed in a sample.
Some of these cells may contain a low number of observations.
These low-abundance sequences might be the result of sequencing error, so typically we remove any counts/OTUs with less than some number of reads.
Lets set all counts with less than 5 reads to zero, overwriting the original table:
```{r}
obj$data$tax_data <- zero_low_counts(obj, dataset = "tax_data", min_count = 5)
```
By setting low abundance counts to zero we might have created OTUs that no longer contain any observations.
We can check as follows.
```{r}
no_reads <- rowSums(obj$data$tax_data[, hmp_samples$sample_id]) == 0
sum(no_reads)
```
It appears that `r sum(no_reads)` of `r nrow(obj$data$tax_data)` OTUs now have no reads.
We can remove those OTUs and their associated taxa with `filter_obs` from the `taxa` package:
```{r}
obj <- filter_obs(obj, target = "tax_data", ! no_reads, drop_taxa = TRUE)
print(obj)
```
Note how there are fewer taxa now (`r length(taxon_names(obj))` from 174), as well as fewer OTUs (`r nrow(obj$data$tax_data)` from 1000).
This is because the `drop_taxa = TRUE` option caused any taxa without OTUs assigned to them after the filtering to be removed.
This coordinated manipulation of taxonomic and abundance data is one of the main benefits of using the `taxmap` class.
### Accounting for un-even sampling
Ideally, we would sequence each sample the same amount (i.e., the same number of reads).
However, sequencing technologies are imperfect, so some samples may have more reads than others.
This creates a situation where we may have observed more diversity in some samples because they were sequenced more thoroughly than others.
So far we've used raw counts, but people typically work with `r gloss$add('Rarefaction', shown = 'rarefied')` counts or proportions to try to avoid the possibility of sampling depth biasing the results.
Here we use the function `calc_obs_props` to divide each sample's counts by the total number of counts observed for each sample, resulting in a proportion.
```{r}
obj$data$tax_data <- calc_obs_props(obj, "tax_data")
print(obj)
```
### Getting per-taxon information
Currently, we have values for the abundance of each OTU, not each taxon.
To get information on the taxa, we can sum the abundance per-taxon and add the results to the `taxmap` object in a new table:
```{r}
obj$data$tax_abund <- calc_taxon_abund(obj, "tax_data",
cols = hmp_samples$sample_id)
```
Note that there is now an additional table called `tax_abund` with one row per taxon.
The name of the table is arbitrary; we could have called it anything.
```{r}
print(obj)
```
We can also easily calculate the number of samples that have reads for each taxon:
```{r}
obj$data$tax_occ <- calc_n_samples(obj, "tax_abund", groups = hmp_samples$body_site)
print(obj)
```
### Plotting taxonomic data
Now that we have per-taxon information (The `tax_abund` and `tax_occ` tables), we can plot the information using **heat trees**.
Heat trees are what we call taxonomic trees in which the size and color of tree parts correspond to some statistic of interest.
The code below plots the number of "Nose" samples that have reads for each taxon as the size of each taxon.
It also plots the number of OTUs assigned to each taxon in the overall dataset as color.
```{r}
set.seed(1) # This makes the plot appear the same each time it is run
heat_tree(obj,
node_label = taxon_names,
node_size = n_obs,
node_color = Nose,
node_size_axis_label = "OTU count",
node_color_axis_label = "Samples with reads",
layout = "davidson-harel", # The primary layout algorithm
initial_layout = "reingold-tilford") # The layout algorithm that initializes node locations
```
Note how we did not have to specify the full path to the variable "Nose", but just its name.
This is a shorthand for convenience.
We could have made the exact same plot using this command:
```{r, eval = FALSE}
set.seed(1)
heat_tree(obj,
node_label = obj$taxon_names(),
node_size = obj$n_obs(),
node_color = obj$data$tax_occ$Nose,
node_size_axis_label = "OTU count",
node_color_axis_label = "Samples with reads",
layout = "davidson-harel", # The primary layout algorithm
initial_layout = "reingold-tilford") # The layout algorithm that initializes node locations
```
This is known as `r gloss$add('Non-standard evaluation (NSE)')` in programmer jargon and will be used in many functions throughout this workshop.
### Alpha diversity
Alpha diversity is a measure of the diversity **within** each sample or group of samples.
It can be calculated at any rank of the taxonomy, but it is usually calculated at the species or OTU "rank".
There are multiple methods used to calculate a value to represent alpha diversity.
The simplest is just the number of species, but the ones used most often factor in how common each species is as well.
Below, we calculate the alpha diversity of OTUs using the `r gloss$add('Inverse Simpson Index')` using the package `vegan`.
```{r}
library(vegan)
hmp_samples$inv_simp <- diversity(obj$data$tax_data[, hmp_samples$sample_id],
index = "invsimpson",
MARGIN = 2) # What orietation the matrix is in
```
Adding this to the sample data table makes it easy to use the sample information in graphing.
Lets compare the alpha diversity of samples from males and females using `ggplot2`, a popular R package for plotting.
```{r fig.height=5, fig.width=5}
library(ggplot2)
ggplot(hmp_samples, aes(x = sex, y = inv_simp)) +
geom_boxplot()
```
Not much difference there, as you might expect.
We can also compare body sites:
```{r fig.height=5, fig.width=5}
ggplot(hmp_samples, aes(x = body_site, y = inv_simp)) +
geom_boxplot()
```
That's more interesting; skin has much lower diversity than any of the wetter areas, which makes sense.
Lets see if that's a significant difference using `r gloss$add('analysis of variance (ANOVA)')`.
```{r}
anova_result <- aov(inv_simp ~ body_site, hmp_samples)
summary(anova_result)
```
That tells that at least one of the body site means is different from the other, but not which one (although we can make a good guess).
A `r gloss$add("Tukey's Honest Significant Difference (HSD)")` test can compare each site to every other and tell us which are significantly different.
Although `r gloss$add('base R')` has a Tukey's HSD function called `TukeyHSD`, we will use one from the package `agricolae` since it supplies grouping codes that are useful for graphing.
```{r}
library(agricolae)
tukey_result <- HSD.test(anova_result, "body_site", group = TRUE)
print(tukey_result)
```
We are interested in the `$groups` table that says which sites are different.
With a little tweaking, we can add this data to the graph we made.
Lets add some nicer text as well.
```{r fig.height=5, fig.width=5}
group_data <- tukey_result$groups[order(rownames(tukey_result$groups)),]
ggplot(hmp_samples, aes(x = body_site, y = inv_simp)) +
geom_text(data = data.frame(),
aes(x = rownames(group_data), y = max(hmp_samples$inv_simp) + 1, label = group_data$groups),
col = 'black',
size = 10) +
geom_boxplot() +
ggtitle("Alpha diversity of human body sites") +
xlab("Body site") +
ylab("Inverse Simpson Index")
```
This tells us that samples from the saliva and skin are significantly different from each other, but not significantly different from anything else.
### Comparing two treatments/groups
Usually we are interested in how groups of samples compare.
For example, we might want to know which taxa differ between the nose and throat, or between men and women.
The function `compare_groups` facilitates these comparisons:
```{r, warning = FALSE}
obj$data$diff_table <- compare_groups(obj,
dataset = "tax_abund",
cols = hmp_samples$sample_id, # What columns of sample data to use
groups = hmp_samples$sex) # What category each sample is assigned to
print(obj$data$diff_table)
```
For each taxon, a `r gloss$add('Wilcoxon Rank Sum test')` was used to test for differences between the median abundances of samples in each treatment.
We can use this information to create what we call a **differential heat tree**, which indicates which taxa are more abundant in each treatment:
```{r}
set.seed(999)
heat_tree(obj,
node_label = taxon_names,
node_size = n_obs, # n_obs is a function that calculates, in this case, the number of OTUs per taxon
node_color = log2_median_ratio, # A column from `obj$data$diff_table`
node_color_interval = c(-2, 2), # The range of `log2_median_ratio` to display
node_color_range = c("cyan", "gray", "tan"), # The color palette used
node_size_axis_label = "OTU count",
node_color_axis_label = "Log 2 ratio of median proportions",
layout = "davidson-harel", # The primary layout algorithm
initial_layout = "reingold-tilford") # The layout algorithm that initializes node locations
```
In this case, taxa colored tan are more abundant in women and those colored blue are more abundant in men.
Note that we have not taken into account statistical significance when showing this, so lets do that.
First, we need to `r gloss$add('Multiple comparison corrections', shown = 'correct for multiple comparisons')`:
```{r}
obj$data$diff_table$wilcox_p_value <- p.adjust(obj$data$diff_table$wilcox_p_value,
method = "fdr")
```
If we then look at the distribution of p-values, we can see that none are even close to significant:
```{r}
range(obj$data$diff_table$wilcox_p_value, finite = TRUE)
```
There is no need to graph this, but if there still were some significant differences, we could set any difference that is not significant to zero and repeat the last `heat_tree` command doing something like:
```{r eval = FALSE}
obj$data$diff_table$log2_median_ratio[obj$data$diff_table$wilcox_p_value > 0.05] <- 0
```
### Comparing any number of treatments/groups
A single differential heat tree can compare two treatments (e.g. male and female), but what if you have more?
Then we can make a matrix of heat trees, one for each pairwise comparison of treatments like so:
```{r, warning = FALSE}
obj$data$diff_table <- compare_groups(obj, dataset = "tax_abund",
cols = hmp_samples$sample_id, # What columns of sample data to use
groups = hmp_samples$body_site) # What category each sample is assigned to
print(obj$data$diff_table)
```
There is a special function to plot this type of data called `heat_tree_matrix`:
```{r}
set.seed(1)
heat_tree_matrix(obj,
dataset = "diff_table",
node_size = n_obs, # n_obs is a function that calculates, in this case, the number of OTUs per taxon
node_label = taxon_names,
node_color = log2_median_ratio, # A column from `obj$data$diff_table`
node_color_range = diverging_palette(), # The built-in palette for diverging data
node_color_trans = "linear", # The default is scaled by circle area
node_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
edge_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
node_size_axis_label = "Number of OTUs",
node_color_axis_label = "Log2 ratio median proportions",
layout = "davidson-harel", # The primary layout algorithm
initial_layout = "reingold-tilford", # The layout algorithm that initializes node locations
output_file = "differential_heat_tree.pdf") # Saves the plot as a pdf file
```
The grey tree on the lower left functions as a key for the unlabeled trees.
Each of the smaller trees represent a comparison between body sites in the columns and rows.
A taxon colored brown is more abundant in the body site in the column and a taxon colored green is more abundant in body site of the row.
For example, Bacteroidetes is more abundant (i.e. has more reads) in the throat samples than the nose samples and this is due to the greater abundance of two genera *Prevotella* and *Porphyromonas*.
Look at the PDF file saved as "differential_heat_tree.pdf" to explore the details of the plot.
```{r cleanup, message=FALSE, warning=FALSE, include=FALSE}
file.remove("differential_heat_tree.pdf")
```