-
Notifications
You must be signed in to change notification settings - Fork 47
/
Copy pathREADME.Rmd
166 lines (107 loc) · 7.54 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
---
output:
md_document:
variant: markdown_github
pdf_document:
highlight: default
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-",
results = "hide"
)
```
# deconstructSigs
deconstructSigs aims to determine the contribution of known mutational processes to a tumor sample. By using deconstructSigs, one can:
* Determine the weights of each mutational signature contributing to an individual tumor sample
* Plot the reconstructed mutational profile (using the calculated weights) and compare to the original input sample
Once installed, deconstructSigs can be loaded:
```{r}
library(deconstructSigs)
```
## Using deconstructSigs
The most basic initial input to the deconstructSigs package consists of a data frame containing the mutational data for a tumor sample set.
This structure must contain the following columns:
* sample identifier -- `sample.id`
* chromosome -- `chr`
* base position -- `pos`
* reference base -- `ref`
* alternate base -- `alt`
### `mut.to.sigs.input()`
Using the function `mut.to.sigs.input`, the mutational data for a set of tumors is converted to an n-row and 96-columns data frame where n is the number of samples present. Each column respresents how frequently a mutation is found within each trinucleotide context.
For instance, `sample.mut.ref` contains mutation data for two samples. Thus the output data frame will be 2x96 and contain the number of times a mutation is observed in each trinucleotide context for each sample ID.
```{r}
head(sample.mut.ref)
# Convert to deconstructSigs input
sigs.input <- mut.to.sigs.input(mut.ref = sample.mut.ref,
sample.id = "Sample",
chr = "chr",
pos = "pos",
ref = "ref",
alt = "alt")
```
### `whichSignatures()`
The output from `mut.to.sigs.input` can then be used as input to `whichSignatures`. Alternatively, a user can generate their own input data frame using calculated mutation information for each trinucleotide context per sample.
Two signatures matrices of k rows and 96 columns are also supplied as data in the package -- `signatures.nature2013` and `signatures.cosmic` -- where k is the number of defined signatures. The signatures matrix consists of the fraction of times a mutation is seen in each of the 96-trinucleotide contexts for each signature k. As novel signatures are identified, a user may choose to supply their own signatures matrix of the same format.
The function `whichSignatures` takes these two inputs (`tumor.ref`, `signatures.ref`) and uses an iterative approach to determine weights to assign to each signature in order to best recontruct the mutational profile of the input tumor sample.
```{r}
# Determine the signatures contributing an already normalized sample
test = whichSignatures(tumor.ref = randomly.generated.tumors,
signatures.ref = signatures.nature2013,
sample.id = 2)
```
#### `A bit about normalization`
If the input data frame only contains the counts of the mutations observed in each context, as is the case with the output from `mut.to.sigs.input`, then the data frame must be normalized. In these cases, the value of `contexts.needed` should be TRUE. An error will be raised if each row of the input data frame does not sum to 1.
An additional parameter to `whichSignatures` will dictate how any further normalization is done. This parameter, `tri.counts.method`, is originally set to 'default', which does not result in further normalization.
```{r}
# Determine the signatures contributing to the two example samples
sample_1 = whichSignatures(tumor.ref = sigs.input,
signatures.ref = signatures.nature2013,
sample.id = 1,
contexts.needed = TRUE,
tri.counts.method = 'default')
sample_2 = whichSignatures(tumor.ref = sigs.input,
signatures.ref = signatures.nature2013,
sample.id = 2,
contexts.needed = TRUE,
tri.counts.method = 'default')
```
If `tri.counts.method` is set to 'exome', the input data frame is normalized by number of times each trinucleotide context is observed in the exome.
If `tri.counts.method` is set to 'genome', the input data frame is normalized by number of times each trinucleotide context is observed in the genome.
If `tri.counts.method` is set to 'exome2genome', normalization is performed to reflect the absolute frequency of each trinucleotide context as it would occur across the whole genome. Thus the count data for each trinucleotide context is multiplied by a ratio of that trinucleotide's occurence in the genome to the trinucleotide's occurence in the exome.
Finally, if `tri.counts.method` is set to a data frame present in the workspace, normalization is done by using that data frame as a scaling factor. The count data for each trinucleotide context is multiplied by the corresponding value given in the data frame.
Included with the package are `tri.counts.exome`, which contains the trinucleotide counts for an exome and `tri.counts.genome`, which contains the trinucleotide counts for the hg19 genome. These are the counts used for normalization when `tri.counts.method` is 'exome', 'genome' or 'exome2genome'. Any user provided data frames should match the format of `tri.counts.exome` and `tri.counts.genome`.
The method of normalization chosen should match how the input signatures were normalized. For exome data, the exome2genome method is appropriate for the signatures included in this package.
#### `Optional parameters to whichSignatures()`
Additional optional parameters to `whichSignatures` are:
* `associated` -- Vector of associated signatures. If given, will narrow the signatures tested to only the ones listed.
* `signatures.limit` -- Number of signatures to limit the search to.
* `signature.cutoff` -- Discard any signature contributions with a weight less than this amount.
#### `Output from whichSignatures()`
The output of `whichSignatures` is a list of 5 elements:
* weights -- data frame containing the weights assigned to each of the k signatures of the input signatures matrix
* tumor -- matrix of the trinucleotide contexts for the tumor sample used as input
* product -- matrix obtained when the tumor matrix is multiplied by the assigned weights
* diff -- matrix representing the difference between the tumor matrix and product matrix
* unknown -- numeric weight not assigned to any of the input signatures
### `plotSignatures()`
The output from `whichSignatures` can be visualized using the function `plotSignatures`. This function takes the `whichSignatures` output (`sigs.output`) and an optional identifying parameter (`sub`).
```{r, eval=FALSE}
# Plot example
plot_example <- whichSignatures(tumor.ref = randomly.generated.tumors,
signatures.ref = signatures.nature2013,
sample.id = 13)
# Plot output
plotSignatures(plot_example, sub = 'example')
```
![alt text](inst/extdata/plotSignatures.png)
### `makePie()`
The output from `whichSignatures` can be visualized using the function `makePie`. This function takes the `whichSignatures` output (`sigs.output`) and an optional identifying parameter (`sub`).
```{r, eval=FALSE}
# Plot output
makePie(plot_example, sub = 'example')
```
![alt text](inst/extdata/makePie.png)