-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path03-kvantifikacija.Rmd
406 lines (274 loc) · 12.6 KB
/
03-kvantifikacija.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
# Kvantifikacija {#kvantifikacija}
```{r, include = FALSE}
library(reticulate)
use_condaenv("compbio", required = TRUE)
```
```{python}
import time
import pandas as pd
import pysam
```
## Motivation for RNA-Seq quantification
Typical studies:
- DNA-Seq -> alignment -> variant calling -> genome wide association study (GWAS);
<center><img src="images/GWAS2-700x438.png" width="400"></center>
- RNA-Seq -> alignment -> quantification -> differential expression analysis
## RNA-Seq quantification
Pod pojmom kvantifikacije, u kontekstu RNA-Seq analize, podrazumevamo procenu relativne zastupljenosti mRNA molekula u uzorku.
Counting versus probabilistic estimation dilemma is (mostly) about doing analysis on gene versus transcript level.
- When quantifying on **gene** level - we simply count the number of reads aligned to each gene.
- On **transcript** level, we need an algorithm such as the Expectation Maximization (EM, pictured below) to deal with the uncertainty.
<br>
<center><img src="images/em1.png" width="550"></center>
<center><img src="images/em2.png" width="550"></center>
When performing statistical analysis of RNA expression, doing it on gene level compared to transcript level is more robust and experimentally actionable. The biologists will also usually draw their conclusions on gene level since that's the level that the biological pathways are annotated on.
However, the use of gene counts for statistical analysis can mask transcript-level dynamics. A popular alternative nowadays is to estimate the transcript abundances and then aggregate to gene level, or perform the entire analysis on trancsript level (testing for differential expression) and then aggregate the results.
## Gene level quantification
We shall, for simplicity's sake, only perform gene level quantification. Even though it has some drawbacks, it's a strategy still used by most genomic scientists. Here's what we have at the beginning of the quantification step and what we want to estimate:
- We **have**: aligned reads and annotations. A file format called SAM/BAM is now the standard formats for next-generation sequence alignments.
- We **estimate**: relative abundance.
We'll start with some simple methods to find if two intervals overlap.
```{python}
def overlap(x, y):
"""
This function takes two intervals and determines whether
they have any overlapping segments.
"""
new_start = max(x[1], y[1])
new_end = min(x[2], y[2])
if new_start < new_end and x[0] == y[0]:
return True
return False
```
We will define four intervals as tuples.
```{python}
region1 = ("chr3", 27, 82)
region2 = ("chr4", 27, 82)
region3 = ("chr3", 57, 75)
```
```{python}
overlap(region1, region2)
overlap(region1, region3)
```
We shall now pick a gene of interest and try to find reads mapping to that gene.
```{python}
# the location of DEFB125 gene in human genome, gencode.v27 annotation
gene = ('chr20', 87249, 97094)
```
To start exploring reads from a SAM/BAM file, we first need to load the file.
```{python}
# load BAM file
bamfile = pysam.AlignmentFile("/opt/aligned/sample_01_accepted_hits.bam", "rb")
```
Note the 'rb' parameter indicates to read the file as binary, which is the BAM format (BAM stands for Binary Alignment Map). If loading a SAM file, this parameter does not need to be specified.
An important and very common application is to count the number of reads (aligned fragments) that overlap a given feature (i.e. region of the genome or gene). One simple approach to doing this is to make a list of all reads generated and simply iterate over the reads to identify whether a read overlaps a region.
In this case we need an overlap method that can compare a simple interval (defined by a tuple of sequence, start, end) with a pysam AlignedSegment object. This overlap method also needs the AlignmentFile object to decode the chromosome name.
```{python}
def overlap(x, gene, bamfile):
"""
A modified version of overlap that takes an interval and a pysam
AlignedSegment and tests for overlap
"""
new_start = max(x.reference_start, gene[1])
new_end = min(x.reference_end, gene[2])
if (new_start < new_end and bamfile.getrname(x.tid) == gene[0]):
return True
return False
```
```{python}
start_time = time.time()
# code to iterate over reads and count for a single gene
naive_gene_count = 0
for x in bamfile.fetch():
# Note x is of type pysam.AlignedSegment
if(overlap(x, gene, bamfile)):
naive_gene_count += 1
print("--- %s seconds ---" % (time.time() - start_time))
```
```{python}
print(naive_gene_count)
```
The problem with this approach is that to do this, you will need to store the entire file in memory. Worse than that, in order to find the reads within a single gene, you would need to iterate over the entire file, which can contain hundreds of millions of reads. This is quite slow even for a single gene, but will only increase if you want to look at many genes.
Instead, more efficient datastructures (and indexing schemes) can be used to retrieve reads based on positions in more efficient ways.
To get the reads overlapping a given region, Pysam has a method called fetch(), which takes the genomic coordinates and returns an iterator of all reads overlapping that region.
```{python}
bam_iter = bamfile.fetch(gene[0], gene[1], gene[2])
```
```{python}
start_time = time.time()
pysam_gene_count = 0
for x in bam_iter:
pysam_gene_count += 1
print("--- %s seconds ---" % (time.time() - start_time))
```
```{python}
print(pysam_gene_count)
```
Not surprisingly, they are the same - the big difference is that the second method is significantly faster. The reason for this is that Pysam (like Samtools, Picard, and other similar toolkits) make use of clever tricks to index the file by genomic positions to more efficiently search for reads within a given genomic interval.
Now that we know how to count reads overlapping a region, we can write this as a function and try and compute this for all genes.
```{python}
def read_count(gene, bamfile):
"""
Compute the number of reads contained in a bamfile that overlap
a given interval
"""
bam_iter = bamfile.fetch(gene[0], gene[1], gene[2])
pysam_gene_count = 0
for x in bam_iter:
pysam_gene_count += 1
return pysam_gene_count
```
You can run this with any gene tuple now:
```{python}
# the location of ZNFX1 gene in human genome, gencode.v27 annotation
gene2 = ("chr20", 49237945, 49278426)
read_count(gene2, bamfile)
```
Now, let's compute the gene counts for all genes. We'll start by reading in all genes:
```{python}
start_time = time.time()
gene_counts={}
with open('/opt/gencode.v27.chr20.bed', 'r') as f:
for line in f:
tokens = line.split('\t')
gene_local = (tokens[0], int(tokens[1]), int(tokens[2]))
count = read_count(gene_local, bamfile)
gene_counts.update({tokens[3].rstrip() : count})
print("--- %s seconds ---" % (time.time() - start_time))
```
Now it's easy to query the gene counts for different genes.
```{python}
print(gene_counts.get("ARFGEF2"))
print(gene_counts.get("SOX12"))
print(gene_counts.get("WFDC8"))
```
## (Within-sample) Normalization
In previous slides, we've seen how we can compute raw *counts*, i.e. number of reads that align to a particular feature (gene or transcript).
$$X_{i}$$
These numbers are heavily dependent on two things:
- The amount of fragments sequenced
- The length of the feature, or more appropriately, the effective length. Effective length refers to the number of possible start sites a feature could have generated a fragment of that particular length. In practice, the effective length is usually computed as:
$$\widetilde{l_{i}}=l_{i}-\mu_{FLD}+1$$
Since counts are NOT scaled by the length of the feature, all units in this category are not comparable within a sample without adjusting for the feature length!
### Different units
- **R**eads **P**er **K**ilobase per **M**illion reads mapped (**RPKM**), or for the paired-end reads: **FPKM** (**F**ragments instead of **R**eads).
$$FPKM_{i}=\frac{X_i}{\left ( \frac{\widetilde{l_{i}}}{10^{3}}\left ( \frac{N}{10^{6}} \right ) \right )} = \frac{X_i}{\widetilde{l_{i}}N}\cdot 10^{9}$$
- **T**ranscripts **P**er **M**illion (**TPM**)
$$TPM_{i}=\frac{X_i}{\widetilde{l_{i}}}\cdot \left ( \frac{1}{\Sigma_{j}\frac{X_j}{\widetilde{l_{j}}}} \right ) \cdot 10^{6}$$
```{python}
counts = {'Gene Name': ['A(5kb)', 'B(10kb)', 'C(1kb)', 'D(20kb)'],
'Rep1 Counts': [100000, 200000, 100000, 0],
'Rep2 Counts': [120000, 250000, 80000, 0],
'Rep3 Counts': [300000, 600000, 150000, 10000]
}
df = pd.DataFrame(counts, columns= ['Gene Name', 'Rep1 Counts', 'Rep2 Counts', 'Rep3 Counts']).set_index('Gene Name')
df
```
Let's first calculate the abundances in FPKM units. We first divide by the total number of reads:
```{python}
df_rpkm = df.div(df.sum()/1000000).round(2)
df_rpkm
```
Then we normalize for gene length:
```{python}
# save gene sizes in new column
df_rpkm['kb'] = [5,10,1,20]
# gene size normalization
df_rpkm = df_rpkm.div(df_rpkm.kb, axis=0).round(2)
df_rpkm.drop(['kb'], axis=1)
```
The sums of total normalized counts in each column are not equal as we prove below:
```{python}
# FPKM normalization sums per samples are not equal (samples are not comparable).
df_rpkm.drop(['kb'], axis=1).sum()
```
Now we will calculate the abundances in TPM units. Here, the first thing we do is normalize for gene length!
```{python}
df_tpm = df
# save gene sizes in new column
df_tpm['kb'] = [2,4,1,10]
# gene size normalization is performed first
df_tpm = df_tpm.div(df_tpm.kb, axis=0).round(2)
df_tpm.drop(['kb'], axis=1)
```
And now we perform the library size normalization, using abundances already normalized for gene length:
```{python}
# library size normalization (division by 10 instead of 10^6)
df_tpm = df_tpm.div(df_tpm.sum()/1000000).round(3)
df_tpm.drop(['kb'], axis=1)
```
The sums of total normalized counts in each column are now equal (when using TPM)!
```{python}
# TPM normalization sums per samples are equal (samples are comparable).
df_tpm.drop(['kb'], axis=1).sum()
```
```{python}
def tpm(genefile, bamfile):
"""
Compute the TPM (transcripts per million) metric for all genes within the genefile using
an RNA-Seq bamfile
"""
total_mapped_reads = 0
# here you want all reads
bam_iter = bamfile.fetch()
for x in bam_iter:
total_mapped_reads += 1
gene_counts = {}
with open(genefile, 'r') as f:
for line in f:
tokens = line.split('\t')
gene_local = (tokens[0], int(tokens[1]), int(tokens[2]))
gene_length = gene_local[2] - gene_local[1]
raw_count = read_count(gene_local, bamfile)
if tokens[3] in gene_counts:
gene_counts[tokens[3]] = (gene_counts[tokens[3]][0] + raw_count, gene_length)
else:
gene_counts.update({tokens[3].rstrip() : (raw_count, gene_length)})
countsDF = pd.DataFrame(gene_counts, index=['raw_count', 'gene_length'])
countsDF = countsDF.T
X=countsDF.iloc[:,0].values
l=countsDF.iloc[:,1].values
countsDF = countsDF.assign(tpm = 1e6 * (X/l) / (X/l).sum())
return(countsDF)
```
```{python}
start_time = time.time()
gene_counts = tpm('/opt/gencode.v27.chr20.bed', bamfile)
print("--- %s seconds ---" % (time.time() - start_time))
```
```{python}
gene_counts.head(10)
```
```{python}
def rpkm(genefile, bamfile):
"""
Compute the RPKM (reads per kilobase per million mapped reads) metric for all genes within the genefile using
an RNA-Seq bamfile
"""
total_mapped_reads = 0
# here you want all reads
bam_iter = bamfile.fetch()
for x in bam_iter:
total_mapped_reads += 1
gene_counts = {}
with open(genefile, 'r') as f:
for line in f:
tokens = line.split('\t')
gene_local = (tokens[0], int(tokens[1]), int(tokens[2]))
gene_length=gene_local[2]-gene_local[1]
raw_count = read_count(gene_local, bamfile)
rpk_metric = 1e9 * raw_count \
/ (total_mapped_reads * gene_length)
gene_counts.update({tokens[3].rstrip() : (raw_count, gene_length, rpk_metric)})
countsDF = pd.DataFrame(gene_counts, index=['raw_count', 'gene_length', 'rpkm'])
countsDF = countsDF.T
return(countsDF)
```
```{python}
start_time = time.time()
gene_counts_rpkm = rpkm('/opt/gencode.v27.chr20.bed', bamfile)
print("--- %s seconds ---" % (time.time() - start_time))
```
```{python}
gene_counts_rpkm.head(10)
```