-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathbackground.tex
executable file
·2811 lines (2511 loc) · 144 KB
/
background.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\chapter{Biological and technological context of this thesis}%
\label{ch:background}
\setlength{\epigraphwidth}{0.9\textwidth}
\setlength{\epigraphrule}{0pt}
\epigraphhead[5]%
{%
\epigraph{\emph{It would probably be oversimplifying the
matter, but I am strongly tempted to say,\\
\textquote{All life is nucleic acid; the rest is
commentary}}.}{\citet{asimov:WrongRel}}%
}
\vspace{-1mm}
The following pages (pp.~\pageref{sec:bio}--\pageref{sec:bgConcl})
present a summary of facts and techniques
that form the biological and technological context
of the work presented in this thesis.
The different sections may be read on their own
or skipped by informed readers without understandability issues.\mybr\
\vspace{-2mm}
\section{Universality and diversity of Life}\label{sec:bio}
\vspace{-2mm}
Every known form of life depends on a common set of molecule types,
within which the \glspl{DNA}, \glspl{RNA} and proteins
are arguably the most specific ones and have the widest variety.
The other molecules are either inorganic (water and salts),
small, or simple organic ones (sugars, organic or amino acids,
nitrogenous bases, lipids or their precursors).~\mycite{Callen2005-ik}\mybr\
The entirety of the \gls{DNA} (protein-coding and non-coding) of a living organism
constitutes its genetic material (or genome).
The coding sections of the \gls{DNA}, \ie\ the protein-coding genes,
contain the instructions for making (via \glspl{mRNA}) the proteins,
which are the main effectors supporting life functions.
When a gene is switched on,
it triggers a process,
illustrated in \Cref{fig:transcriptionTranslation},
in which the first step is called transcription
and the second one translation.~\mycite{Callen2005-ik,Pierce2005-ib}\mybr\
\begin{figure}[!htpb]
\vspace{-3mm}
\includegraphics[scale=0.60]{background/transcriptionTranslation.png}\centering
\vspace{-2.3mm}
\caption[Transcription and Translation: an overview]%
{\label{fig:transcriptionTranslation}\textbf{Transcription and Translation: an overview.}
{\small This work, \enquote{Transcription and Translation: an overview},
is a derivative work of
\href{https://commons.wikimedia.org/wiki/File:MRNA.svg}{\enquote{Simplified diagram of mRNA synthesis and processing. Enzymes not shown.}}
and
\href{https://commons.wikimedia.org/wiki/File:Protein\_synthesis.svg}{\enquote{Protein synthesis}}
both by \href{https://commons.wikimedia.org/wiki/User:Kelvinsong}{Kelvinsong},
used under \href{https://creativecommons.org/licenses/by/3.0/}{[CC BY]}
(see \Cref{sec:kelvinsong}).
\enquote{Transcription and Translation: an overview} is licensed under
\href{https://creativecommons.org/licenses/by/4.0/}{[CC BY]} by Mitra P. Barzine.
}}
\end{figure}
The transcription initiation happens when
an \gls{RNA} polymerase attaches to the start of the gene
and uses the \gls{DNA} strand as a template
to create a corresponding \gls{RNA}
from free \gls{RNA} nucleotides~\mycite{molBiolCell}.
The transcription is a directional process and
always happens from the gene 5' end towards its 3' end~\mycite{Callen2005-ik}.
(For 5' and 3', see \Cref{fig:DNARNAstruct} and related section.)
Messenger \glspl{RNA} (\glspl{mRNA}) are \glspl{RNA} that are produced from a gene
and used as a template for translation to synthesise proteins.
There are other kinds of \glspl{RNA},
\eg\ \glspl{rRNA} (which are the most numerous \glspl{RNA} in the cell),
\glspl{tRNA} and many others~\mycite{Callen2005-ik}.
The entire repertoire of transcripts (\ie\ \gls{RNA} molecules) expressed
in a cell or group of cells (such as in a tissue)
is called transcriptome~\mycite{Velculescu1997-rs,Pietu1999-uf,rnaseq-2009}.
Unlike the genome which is roughly identical
regardless of which cell of a particular individual is considered,
the transcriptome may vary, sometimes dramatically,
according to the biological context (different organs or tissues,
or different conditions, \eg\ healthy or in a reaction to a disease)
and through time and life stages.~\mycite{molBiolCell}\mybr\
Before the \mRNA\ is in turn used as a template to make the protein,
a few modifications may occur.
Among the typical post-transcriptional regulations,
there are
the capping and the addition of a polyadenylated (polyA) tail
(both increasing the half-life of the \mRNA),
splicing
(where internal parts of the \mRNA\ ---
either non coding (\ie\ \emph{introns}) or coding (\ie\ \emph{exons}) --- are removed),
or \gls{RNA} editing~\mycite{Darnell2013-xf}.
For eukaryotic species (\eg\ \species{Human}),
the \mRNAs\ have to be exported
from the nucleus to the cytoplasm for the next step to happen~\mycite{Callen2005-ik}.\mybr\
The \glspl{mRNA} initiate the translation,
\ie\ creation of new proteins,
by binding to protein factories called ribosomes
(complexes formed from \glspl{rRNA} and ribosomal proteins).
The ribosomes read the \mRNA\ codons (\ie\ groups of three consecutive bases)
and produce the corresponding protein by synthesising a polypeptide chain
from the free amino acids (\glspl{aa}) carried by \glspl{tRNA}
with the corresponding anticodons (\ie\ complementary sequence to the \mRNA\ codons).
The \glspl{aa} are linked by peptide bonds formed
through the reaction of functional groups on their primary chain:
the carboxyl group (--\chemForm{COOH}) of the first amino acid \cpv{(\gls{aa})}
with the amine group (\chemForm{NH_2}--) of the next one.
This succession of primary chains and peptide bonds constitutes
the protein backbone.
This sequence is by convention described
from the free amino group of the first \gls{aa} (\ie\ N-terminal)
to the free carboxyl group of the last \gls{aa} (\ie\ C-terminal).\mybr\
While the polypeptide chain is completed,
it also folds into a three-dimensional structure,
which is essential for fulfilling its role~\mycite{Morris2016-qv}.
Many proteins need to undergo post-translational modifications (\glspl{PTM})
before being functional.
\glspl{PTM} allow the regulation of the proteins' activity
(activation and deactivation).
They can involve proteolytic cleavages or the creation of new covalent bonds,
as many proteins comprise more than one polypeptide chain to be functional.
Other frequent \glspl{PTM} include the phosphorylation, acetylation,
or glycosylation of the \glspl{aa} (also called residues)~\mycite{molBiolCell,Morris2016-qv}.
Besides the variety of possible \glspl{PTM} and their combination,
proteins can comprise twenty different \glspl{aa} (\Cref{sec:aa}),
which have a vast range of physicochemical properties.
Hence, in turn, the proteins have a wide physicochemical range too.
\mycite{Morris2016-qv,Callen2005-ik}.\mybr\
Although the diversity of the proteome
(\ie\ entire repertoire of expressed proteins)
is the primary contributor to
the final \gls{phenotype}\footnote{Phenotype: \glsentrydesc{phenotype}}
and functions of cells and tissues,
its exhaustive study remains particularly challenging.
Most proteins are quite stable,
but their physicochemical diversity prevents
the use of uniform and straightforward protocols
that would encompass all the proteins
in a given cell or tissue type.~\mycite{Bruce2013}\mybr\
\begin{figure}[!htbp]
\includegraphics[scale=0.49]{background/DNA-RNAstruct1.png}\centering
\vspace{-2mm}
\caption[DNA and RNA structures]{\label{fig:DNARNAstruct}%
\textbf{DNA and RNA structures} © 2014 Nature Education ---
Adapted from \citet{Pierce2005-ib}.}
\end{figure}
On the other hand,
all \gls{DNA} and \glspl{RNA} have very similar chemical properties,
as their structures are very close as shown in \Cref{fig:DNARNAstruct}.
They are polymeric molecules that are made by a chain of nucleotides.
Nucleotides have three distinct chemical subunits: a phosphate group,
a pentose (either ribose for \gls{RNA} or deoxyribose for \gls{DNA})
and a nitrogenous base, which can be a purine (adenine (A) or guanine (G))
or a pyrimidine (cytosine (C) and
either thymine (T) for \gls{DNA} or uracil (U) for \gls{RNA}).
The alternation of phosphate groups and the pentoses create the biomolecules backbone,
while the information is encoded into the nitrogenous bases sequence.
\gls{DNA} and \glspl{RNA} all share the same directional reading frame,
\ie\ 5' end to 3' end, or in other words,
from the phosphate group that is linked to the pentose carbon annotated 5'
to the phosphate group
that is linked to the pentose carbon annotated 3'\footnote{%
Pentoses are monosaccharides containing a chain of five carbons.
When the pentose is part of a nucleotide,
these carbons are annotated from 1' to 5' to avoid confusion
with the carbons of nitrogenous cycles.}.~\mycite{Morris2016-qv,molBiolCell,Callen2005-ik}\mybr\
The difference in physical properties between \gls{DNA} and
\gls{RNA} molecules is primarily due to
the predominant particular arrangement of \gls{DNA} (double-stranded) and
\gls{RNA} (single-stranded).
The double-stranded configuration of the \gls{DNA} dramatically improves
the stability of the \gls{DNA} by involving many mechanisms,
which includes many hydrogen bonds
(three for each couple of G/C and two for each A/T).
Besides,
in eukaryotic cells,
while the genome (\gls{DNA}) is protected in organelles such as the nucleus,
most of the \mRNAs\ life is spent in the cytoplasm,
which contains many enzymes (\eg\ endonucleases and exonucleases)
that cleave the nucleotide sequences and can ultimately degrade the \mRNAs.
Thus, even though \mRNA\ half-lives are quite variable,
\mRNAs\ have generally the lowest stability
when compared to the \glspl{DNA} and proteins.~\mycite{molBiolCell,Pierce2005-ib}\mybr\
\begin{figure}[!htbp]
\includegraphics[scale=0.60]{background/dogma.pdf}\centering
\vspace{-4mm}
\caption[Central dogma of molecular biology proposed by F. Crick]%
{\label{fig:dogma}\textbf{The central dogma of molecular biology proposed by
F. Crick}. The proven modes of information transfers are for the general ones
in solid lines and in dashed lines for the special transfers
(\gls{RNA} to \gls{RNA} or \gls{DNA}) and the transfers that have yet to
be established (\gls{DNA} to protein).
}
\end{figure}
This overall process, which allows the creation of the proteins and
based on a flow of information initiated from the \gls{DNA},
had been predicted by Francis Crick~\mycite{Crick:1958,Crick:1970}.
He stated what is now known as the core of the central dogma of molecular biology
(shown in \Cref{fig:dogma}):
\textquote{Once information has got into a protein it can’t get out again}.
In other words, the genome contains all the information needed to produce
functional proteins,
and, in theory, if we reach a total understanding of the information encoded
into the \gls{DNA},
we will be able to predict the \gls{phenotype} due to the proteome.
As \gls{DNA} is static
while the coding portion (about 2\%) of the human genome~\mycite{Venter2001-sn}
varies in expression (both in concentration and composition)
depending on the tissue or cell type,
\cpv{genome studies are more established} than transcriptomic or proteomic ones,
but the latter ones are more \cpv{phenotypically} insightful.\mybr\
\section[Transcriptome exploration with RNA sequencing]{Transcriptome~exploration~with~RNA~sequencing}\label{sec:transExplo}
With the recent era of short-sequencing technology and the completion of the
Human genome~\mycite{Venter2001-sn,Lander2001-wj,%
International_Human_Genome_Sequencing_Consortium2004-hr},
understanding the genome expression is increasingly a more
reachable aim.
From the early 1980s,
technologies involved in transcriptome studies
have substantially improved through many successive innovations~\mycite{Lowe2017-kj,Parkinson2009-pj}
that include Sanger sequencing~\mycite{Sanger1975-io} or
\gls{PCR} (reviewed by \citet{VanGuilder2008-xs}).
Among the various transcriptomic study approaches,
there are three key methods~\mycite{Lowe2017-kj}:
\gls{EST} sequencing (for gene discovery --- see \Cref{sec:EST}),
microarrays (for gene quantification --- see \Cref{sec:microarray})
and the one with which all the transcriptomic data used in this thesis have
been generated:
\Rnaseq\ (which is both used for gene discovery and gene quantification).\mybr\
In the following section,
I introduce the typical steps of the required workflow
to study the transcriptome through sequencing on an Illumina platform.
While not by conscious design, all the transcriptomes analysed in this thesis are
the product of Illumina sequencing (see
\Crefrange{subsec:castlepresentation}{subsec:gtexPresentation}).
This is unsurprising as Illumina is by far
the most popular platform for the last decade \mycite{popularIllumina} and
has been used to generate most of the data in \gls{ENA} and \gls{ArrayExpress}.
Indeed, Illumina sequencing offers a very
good balance between accuracy and achieving the highest
throughput for the lowest per-base cost \mycite{IlluminaCheap}.
I will emphasise the approaches and
the tools I used to estimate the gene expression levels from raw nucleotide
sequences.
\Cref{fig:OverviewRnaseqPrepSeq} presents an overview of the typical steps of an
\Rnaseq\ workflow from the libraries preparation to the sequencing.\mybr\
\begin{figure}
\includegraphics[scale=0.60]{background/IntroWorkflow.png}\centering
\caption[Overview of a \Rnaseq\ workflow: library preparation
and sequencing]{\label{fig:OverviewRnaseqPrepSeq}\textbf{Overview of
a typical \Rnaseq\ workflow:
library preparation and sequencing}}
\end{figure}
Experimental protocols for other platforms may need various and specific
modifications that are outside of the realm of this thesis and thus will not be
covered here\footnote{More details on the other main sequencing platforms and their
relevant protocols may be found in~\citet{rnaseqProtocols} review paper or at the
online resource \enquote{\href{http://rnaseq.uoregon.edu/}{RNA-seqlopedia}}
(\href{http://rnaseq.uoregon.edu/}{http://rnaseq.uoregon.edu/})
\mycite{rnaseqlopedia}.}.\mybr\
Although the collection and the conservation\footnote{E.g.\ between fresh-frozen
samples and \gls{FFPE} samples see~\citet{sampleConservationMatters}.}
of the samples before the
\gls{RNA} extraction most definitely affects the final estimations,
I will set aside these steps from my review.\mybr\
\subsection{Library preparation}\label{subsec:libPrep}
While there are sequencing technologies that can directly sequence \glspl{RNA}
(see~\cite{rnaDirectSeq}), most of the technologies handle only \gls{DNA}.
Hence, the first step of a typical \Rnaseq\ workflow is the preparation of
\gls{cDNA} libraries from the starting material. This step and the sequencing
itself are the most platform dependent parts of the overall protocol.
Indeed, contingent upon which sequencing principle they rely,
the sequencers need the libraries to be fixed and loaded differently.\mybr\
\subsubsection{\gls{RNA} extraction}
There are many methods to extract \glspl{RNA} from the primary samples, and they
are commonly standardised. Indeed, depending on the type of biological samples,
the \glspl{RNA} of interest, the aim of the study and the sequencing platform
used, there is one (or more) available commercial kits. These are designed in
a way to not interfere
with any of the later steps of the library preparation or with the sequencing
itself.\mybr\
Unsurprisingly, the choice of one kit (and hence its method of extraction)
over another can impact the final \Rnaseq\ data. The main difference between
the most widespread methods being the quantity of non-mature \glspl{RNA}
(\ie\ with longer intronic regions) detected according to which kit has been used.
However, the relative gene expression levels are similar from one extraction
protocol to another. \mycite{RNAextraction}\mybr\
\subsubsection{\gls{RNA} enrichment}
After extracting the \gls{RNA} from the cells or tissues,
the next step is to enrich the content of the samples with the \glspl{RNA}
of interest (\ie\ the concentration of the \glspl{RNA} of interest is increased
either by specifically selecting it or by removing other \glspl{RNA}). Indeed,
the \glspl{rRNA} are the most abundant type of \gls{RNA} in any cell. Even
though they account for a very small part of the genome\footnote{For example,
\species{Homo sapiens}, there are 568 genes (<1\%) that are described as
\gls{rRNA} out of the 63,898 annotated genes of the \gls{Ensembl} database
(\hg{38.p10}, \ens{89}).}, they represent by their number 70\% or more of
the total population of \gls{RNA} \mycite{biochbook}.
Although there are interests to study \glspl{rRNA} (\eg{}~\cite{rrnaStudy}),
\mRNAs\ studies are more popular,
and they only constitute about 3 to 5\% of the whole \gls{RNA} population
\mycite{molBiolCell}. Other studies research even scarcer kinds of
\gls{RNA}.\footnote{Out of the 10,081 experiments tagged as \comp{\enquote{rna
assay}} and \comp{\enquote{sequencing assay}} within \gls{ArrayExpress},
7,981 were also tagged as \comp{\enquote{RNA-seq of coding RNA}},
1,829 as \comp{\enquote{RNA-seq of non coding RNA}} and 366 have both tags.
4 of them are only described as \comp{\enquote{microRNA profiling by
high-throughput sequencing}} --- Query date: 22 June 2017.}\mybr\
There are typically three strategies to achieve \gls{RNA} enrichment:
either by polyA-selection, by ribodepletion or (more complex)
by targeted amplification. While these
approaches are insufficiently specific to select one particular kind of \gls{RNA}
or remove all \glspl{rRNA}, it eases and improves the downstream analyses.\mybr\
\minisec{PolyA-selection}\label{msec:polyA}
This strategy essentially targets the \mRNAs. It exploits the polyadenylated
tail at the 3' end of the \mRNAs\footnote{And a few other kinds of \gls{RNA},
\eg\ \glspl{lncRNA} \mycite{polyAlncRNA}} that is added
post-transcriptionally.
Magnetic beads, supporting short strings of thymine (oligo-dT),
capture these \mRNAs\ efficiently while the others
are washed away \mycite{Mortazavi2008}.\mybr\
This protocol is probably the most widespread one as it is the easiest and
cheapest to set up. A dataset produced following this protocol is known as
a \emph{polyA-selected} dataset.\mybr\
\minisec{Ribodepletion}
This strategy is preferred for the study of any \gls{ncRNA} or when researching
the interaction of \mRNAs\ with other \glspl{RNA} \mycite{ribodepletion}. This
strategy is in a way the reverse of the previous one as its also
uses magnetic beads, but this time to efficiently\footnote{ThermoFisher claims
that its RiboMinus protocol can remove up to 99.99\% of the \glspl{rRNA}.}
target the unwanted \glspl{rRNA} as to remove them from the sample.\mybr\
The ribodepletion can also be achieved through ribonucleases. These enzymes
specifically digest \glspl{rRNA}, and then \glspl{RNA} of interest can be retrieved
through size selection.\mybr\
Datasets produced following a ribodepletion protocol are usually called
\emph{whole \gls{RNA}} or \emph{total \gls{RNA}} in contrast to the
\emph{polyA-selected} ones.\mybr\
\citet{castleData} created a total \gls{RNA} dataset, but they use another
approach where they amplify \emph{every} other \gls{RNA} with the help of
specially designed probes (see \Cref{subsec:castlepresentation}). Hence,
the protocol they have used is closer to the following one.\mybr\
\minisec{Targeted amplification}
Targeted amplifications rely on primers that would be designed to target (or
avoid as for~\citet{castleData}) specific sequence motifs of the genome.
Most studies based on this kind of approach are referred with
a name based on the studied \gls{RNA} type (\eg\ \gls{miRNA-Seq}) or
emphasising the variation of the method (\eg\ Capture-Seq
\mycite{captureSeq}).
Often, additional steps are required to prepare the libraries
in comparison with a polyA-selected or ribodepleted dataset.\mybr\
\subsubsection{RNA fragmentation}
Most sequencing platforms\footnote{As for the Illumina platforms that have
produced the transcriptomic datasets studied in this thesis.} require relatively
short (\ie\ 200 to 500\ nt) length to sequence. Concomitantly, it also ensures
a more uniform sampling along the \gls{RNA}.
This fragmentation can be carried out via divalent cations hydrolysis or
nebulisation.\mybr\
This step is performed on occasions after the \gls{cDNA} synthesis
(see next section). In those cases, the \glspl{cDNA} are fragmented mostly
by digestion with DNase I or by sonication.\mybr\
\subsubsection{Double-stranded cDNA synthesis}
The \gls{RNA} molecules are used as a template for a retro-transcription
involving oligo-dTs or \emph{random} hexamer primers, respectively only
for polyA-selected datasets or any dataset (polyA-selected included).
The set of random hexamer has been designed to cover the whole
transcriptome. Unfortunately, these random hexamer primers have been
proven to lack full randomness \mycite{notSoRandom}.\mybr\
At the end of the most common protocol, the order of synthesis of each \gls{cDNA}
strands is lost, \ie\ it is impossible to distinguish which of the \gls{cDNA}
strands has the same sequence as the original \gls{RNA}. Several techniques,
called \emph{strand-specific}, have been developed to compensate for this
\mycite{strandSpecific,strandSpe}.\label{strand-specific}\mybr\
\subsubsection{Adapter~ligation,~PCR~amplification~and~size~selection}
After generating blunt edges by restriction digest of the \glspl{cDNA}, adapters
(small known sequences of oligonucleotides) are ligated to both their ends.
These adapters are constituted from several parts.
A subset of them are later ensuring
the hybridisation of the \glspl{cDNA} with the flow
cells\footnote{\Gls{flow}: see \Cref{sub:HybridClustAmp}.} (based on sequence
complementary), and another set of them
are sequence binding sites that are used as primers for the following cluster
amplification step occurring \latin{in situ}. These adapters are also used to
introduce additional motifs such as indexes.\mybr\
The next two steps can be interchanged depending on the amount of starting material
at disposition. \gls{PCR} amplifies all the molecules before (or after)
a size-selection is performed (per gel electrophoresis) to extract
length-complying fragments (about 200 to 500\ bp) to the sequencer machine
requirements\footnote{Indeed, the previous fragmentation step creates a great
length range of fragments.}.\mybr\
Unfortunately, the size-selection means that any
transcript with an original length below the threshold used for the
selection will be missed\footnote{There is no problem for the greater length
as they will statistically present fragments in the correct range.}.
For example, \glspl{miRNA} are shorter than the general requirement of Illumina
sequencers. Alternative protocols are addressing this issue
\mycite{smallRNAprotocol}.\mybr\
\subsubsection{An example of alternative preparation strategy}
Along with the targeted, the strand-specific and small \glspl{RNA} protocols,
there are a few other variations to this typical protocol to handle other concerns.
For example, it is occasionally necessary to sequence simultaneously (in a single
run) multiple samples. This can either be motivated by practical reasons
(to lower the experimental costs or hasten the overall processing time)
\mycite{multiplexCheaper} or be critical to the experimental design as a way to
experimentally handle the \emph{batch effects}\footnote{Batch effect: see
\Cref{sub:BatchEffect}} \mycite{multiplexBatchEffect}. However, it is
crucial to later extricate the several pooled samples from each other as a
requirement to many downstream analyses.\mybr\
\emph{Multiplexed} protocols easily achieve
the distinction between the multiple samples as they incorporate \emph{barcodes}
before ligating the adapters. These barcodes are also small sequences of
nucleotides, and each sample has its unique associated
barcode. In practice, each sample is prepared separately with the added extra
step (before the adapters ligation) where the barcode is incorporated; then all
the samples are pooled together before the next step, which consists of hybridising
the \glspl{cDNA} to the flow cell.\mybr\
Other extra steps occur just after the sequencing and before any other data
analysis: all the reads\footnote{Reads: see \Cref{subsub:sequencing}} are
separated in files based on their barcodes and the barcode, along with the
adapters, is trimmed from all the reads.\mybr\
The main inconvenient of the multiplexing protocol is that the original sequenced
length of the \glspl{cDNA} are then shorter as the barcodes are also (and have
to be) sequenced as well.\mybr\
\subsection[Clustering: Hybridisation and Bridge amplification]{Clustering:
Hybridisation and Bridge amplification~\small{\mycite{IlluminaYoutube}}\quad}%
\label{sub:HybridClustAmp}
\vspace{-4mm}
Once the libraries are ready, they are loaded onto a \emph{flow cell}\footnote{%
\emph{Flow cells} are the support of Illumina sequencing. They parallelise
through supported Chemistry
the sequencing of millions of \gls{DNA} fragments together
which are kept spatially separated in clusters.
Each flow cell is a glass slide with lanes.
Each lane is coated with two short nucleotide sequences. One of these
oligonucleotides is complementary to a region contained in the ligated adapters.}.\mybr\
The clustering step comprises two phases: hybridisation and bridge amplification
of the \gls{cDNA} fragments.\mybr\
\subsubsection{Hybridisation}
The double-strand \glspl{cDNA} are denatured, and then each fragment randomly
hybridises across the flow cell surface with one of its small oligonucleotides.
These are used as primers for polymerases which create a first complementary
strand to the hybridised \gls{DNA} fragments. The new double-strand molecule is
denatured, and the original first template is washed away.\mybr\
\subsubsection{Bridge amplification}
The strands then fold over and their (second) adapter hybridises with a
complementary oligonucleotide sequence of the flow cell and thus creating a
bridge. The flow cell complementary fragment is then used as the primer for a new
strand. The new double-stranded \gls{DNA} is then denatured (which
dismantles the bridge). Each of the two tethered molecules creates a new
bridge by hybridisation which are the templates for a new strand each.
This process happens many times and simultaneously for millions of fragments.
It creates clusters of clonal amplification of the original fragments of
the library. After the bridge amplification, the reverse strands are cleaved
and washed away. The 3' end primers are also blocked to avoid any unwanted
priming.\mybr\
\subsection{Sequencing-by-synthesis}%
\label{subsub:sequencing}
Illumina sequencers propose two approaches: single-end and paired-end.
In \emph{single-end sequencing}\footnote{Chronologically
the oldest method}, the sequencing begins at one (and only one) of the fragment
ends and progresses towards the second.
In \emph{paired-end sequencing},
once the first end has been sequenced,
a bridge replication occurs,
then the other end of the original fragment is also sequenced.
Thus, in paired-end sequencing,
the sequencing occurs at \emph{both} ends of each original fragment.\mybr\
Though more expensive and more programmatically challenging,
the paired-end approach facilitate the detection of genomic rearrangements%
\footnote{As indels or inversions} and
repetitive sequence elements.
It also helps to distinguish between a gene isoforms
and provides greater support to identify novel transcripts (new isoform or gene)
and fusion genes\footnote{A \gls{fusionG} is a gene
that is the product of the fusion of
parts of two (or more) different genes.}.\mybr\
In both cases, Illumina's sequencing process,
\emph{sequencing-by-synthesis}~\mycite{seqBySynth},
is the same.
It uses the \gls{DNA} replication mechanism with modified \dNTPs.
Reversible fluorescent tagged \dNTPs,
which are protected at their 3'end to block any further elongation,
allow a step-by-step incorporation.
The product of this synthesis is called a \emph{read}, and
it supports the \emph{base calling}\footnote{Base calling:
identification of the nucleosides in a sequence
by assigning chromatogram peaks
or another kind of signal variations
to (nucleo){}bases.}.\mybr\
The sequencing co-occurs on every identical fragment
of every cluster of the flow cell.
It begins with the hybridisation of a complementary 5' primer onto the 3' binding
site of the tethered \gls{DNA} template. This primer is then extended by
replication through several sequencing cycles to create a new read.\mybr\
A \emph{sequencing cycle} starts with the addition of one complementary
fluorescent \dNTP\ to the new growing read,
which stops the replication process as the \dNTPs\ 3' end is blocked.
A wash discards all the unlinked \dNTPs\ away.
Then, the clusters are excited by a light source,
and the signal intensity and (characteristic) wavelength
of each \dNTPs\ are recorded
since they allow the identification of the new nucleotide incorporated
by each cluster and measure the accuracy of the base calling.
Finally, the fluorescent tags and the 3' caps are cleaved
and washed away before a new cycle begins.
The number of cycles determines the final read length.\mybr\
Unfortunately, as the sequencing proceeds, the error rate of the sequencers
increases. This is due to the incomplete removal of the fluorescent signal which
increases the background noise and thus reduces the signal-to-noise ratio.\mybr\
Once the programmed read length is achieved (typically between 25 to 200 \gls{nt}),
the reads are washed away (after denaturation).\mybr\
\subsubsection{Sequencing specificities for the \emph{paired-end} protocol}
The paired-end protocol uses an additional primer.
The first run is initiated by a single primer and
follows the same steps of the single-end sequencing cycle.
Once completed, the complimentary read is washed off,
and the 3' end primer is deprotected.
Then, the \gls{DNA} fragment bends over and hybridises to a complementary
oligonucleotide at the surface of the flow cell.
Next, the second primer initiates a new sequencing cycle
at the end of which the newly synthesised read is washed away.
A single new bridge replication follows.
The new double-stranded \gls{DNA} fragment is denatured, and the 3' primers are
protected before the original strand (that has been already read)
is cleaved and washed away.
Finally, the remaining strand is sequenced following the previously described
sequencing cycle. Once the same number of sequencing cycles as the first
strand is reached, the read product of the remaining strand is washed away.
By convention, the first primer allows to read the forward strand
and the second primer the reverse strand.
Note that these forward and reverse concepts used in paired-end protocol have
no relevance to the biological concepts of forward and reverse for genes.\mybr\
\subsection{From analogous input to digital output}
At the end of the sequencing process, a set of images across the flow cell
(one per sequencing cycle) is produced from the detected wavelengths.
While it is possible to work with the images themselves, in most
cases, the sequencing facilities will perform the base calling
and other intermediate steps before providing the end-user with text files.\mybr\
These files are usually distributed in the \fastq\ format~\mycite{fastqFormat}
which record for each cluster (read) a unique identifier,
a nucleotide sequence and a \gls{Phred} quality score for each base of the
sequence. A few optional information can also be provided, \eg\
the position of the read on the \gls{flow} (See \Cref{sec:fastq-format} for
a random read example).
The \gls{Phred} quality score ($Q$) measures the accuracy of the identification
of the nucleobase to which it refers. These scores are set by the base calling
program and are defined as $Q = -10\log_{10}(P)$ with $P$ the probability of
the base being called wrongly. There are several possible encoding formats
(see \Cref{sec:PhredScore}).\mybr\
In single-end sequencing, there is one file per sample.
In paired-end sequencing, the reads are usually separated based on their associated
indexes into two ordered files: all the reads from the forward strands
are grouped in one file, and the ones from the reverse strands
in a second one.\mybr\
\subsection{A typical bioinformatic workflow for \Rnaseq\ study\quad}\label{sub:RNAseqworflow}
From the reconstruction of the
transcriptome to the normalisation of expression in each sample, the various
steps may be addressed through many different algorithmic approaches.
Often, the choice of a method at one stage implies a more limited number of
alternatives from which to pick at later points in the pipeline.
The choice is frequently driven by the kind of downstream analyses planned for
the study. More than the practical format of the data for these, it is the
assumptions and the methods used upstream that are critical for a
rigorous investigation and, later, for an accurate interpretation of the results.\mybr\
\Cref{fig:pipelineTrans} presents an example of the overall \latin{in silico}
process of raw \Rnaseq\ data. It summarises the steps and highlights the tools
I used to process the data within this thesis.\mybr\
Before any downstream analysis, for each read, the genomic region (or
\emph{locus}), from which it has been expressed initially, needs to be identified.
Indeed, \Rnaseq\ main objective is to quantify the expression of genomic
\emph{features}\footnote{These genomic features could be genes, isoforms,
exons, novel genes, \dots\
In short, any genomic region with an annotated function.}. In other words,
the transcriptome needs to be reconstructed from the short reads and annotated
(\ie\ identify which features have been expressed in each library).\mybr\
Two different main strategies (see further `Reconstruction strategies' segment)
manage to accomplish this identification step. Independently of the
approach, this step is the most challenging and time-consuming
of the workflow. Tools, which tackle the reconstruction, usually provide
many tunable heuristic parameters (\eg\ maximum number of allowed mismatches
or indels per read before discarding a possible identification\footnote{Indeed,
many reads will have many identifications; these reads are defined as
\emph{ambiguous reads}.})
to speed up the task.
Unfortunately, as on Illumina platforms, the base calling accuracy decreases
along the read length, this may lead to an information loss~\mycite{TrimRNAseq}.
To prevent informative reads to be discarded, it is opportune to perform a quality
check of the raw data before the identification step. Thus, reads with a drop of
accuracy in their 3'end may be shortened (\ie\ trimmed) and rescued for the next
reconstruction step. Similarly, low-quality reads may be discarded hence
lowering the complexity
of the reconstruction task and hasten its accomplishment.\mybr\
\subsubsection{Quality check, trimming and filtering}\label{subsub:trim}
The quality assessment allows removing any read (or part of it) that would
increase the complexity of the reconstruction step or skew the downstream analyses.\mybr\
It is wise to discard uninformative reads, \ie\ reads with a low sequence
complexity (\eg\ poly-T or poly-A tails) or with ambiguous sequences (in other
words with uncalled bases --- reported as \emph{N}).
Indeed, these reads will hamper the processing time as they
usually map to several parts of the genome while also decreasing the accuracy of
the global gene expression estimations.
For similar reasons, it is judicious to remove reads with a low overall quality
score\footnote{It may vary based on the complete set of reads to analyse.}.\mybr\
It is also prudent to check and remove any read that may map to possible
contamination sources\footnote{For example, for eukaryotes, by aligning (see next
segment) every read to the \species{Escherichia coli} genome.}.
Indeed, as these reads are ambiguous, it is safer
to discard them than skew the expression estimations.\mybr\
Finally, as many tools (mappers in particular) require all the input reads
to have the same length, the purity-length balance requires optimisation.
Indeed, the trimming has to compromise between an approach too lenient
(where the mappers discard many unfit reads at a later step),
and a too stringent one (where either the reads are shorter,
which increases the overall complexity and therefore hinder
the mapping both on time and accuracy~\mycite{Trimwisely},
or too few are left for pertinent analyses).
When the tools allow it, avoiding quality-based trimmings is probably
a better practice.\mybr\
Generally, after the sequencer calls the reads, a first trimming removes
all the adaptors and barcodes needed by the sequencing protocols. Thus,
in principle, they are not to be found in the \enquote{raw data}.
However, to avert any latter contingency, a search against
a list of the most common adaptors and an over-representation assessment of small
sequences (\emph{k-mers}\footnote{\emph{k-mer}: In the present context, all
possible subsequences of length
\emph{k} of a \emph{read}.}) at each end of the reads is good practice.\mybr\
\subsubsection{Reconstruction strategies}\label{subsec:reconstruction}
Two main approaches can be used for the very computationally expensive step of
identification. I will present them in decreasing order of complexity:
the \latin{de novo} assembly of the reads and then the reads alignment approach
(to a genome reference or a transcriptome one).\mybr\
Regardless of the approaches, the reconstructed transcriptome is usually reported
as a \emph{\gls{SAM}} file~\mycite{SAMformat} (or one of its derivative formats:
either \emph{\gls{BAM}} or more recently \emph{CRAM}).\mybr\
\minisec{\latin{de novo} Assembly}
This approach is favoured when the reference genome of the species of interest
is unavailable or of poor quality
(\eg\ many non-model organisms) or inadequate (\eg\ cancer samples)
for the samples of interest. However, if a reference already exists this strategy
is avoided to the utmost.\mybr\
It allows the unbiased discovery of novel
exon-exon junctions~\mycite{deBruijn}. As none of the datasets I use in this
thesis has been reconstructed through this approach, I briefly summarise
the main points below as more in-depth reviews cover this strategy
(see~\cite{denovoReview}).\mybr\
In \latin{de novo} assembly, the reconstruction of the transcriptome happens
with the construction of the longest possible \emph{contigs} (\ie\ contiguously
expressed regions) based on sets of overlapping reads (see also
\Cref{fig:denovo}). Shorter reads add
to the overall complexity of this approach. While paired-end reads may help to
solve many genomic regions, lowly expressed or repetitive regions remain
challenging to determine. There are several algorithmic approaches for \latin{de
novo} transcriptome assembly~\mycite{algorithmsDenovo},
though the most prevalent one is the de Bruijn representation~\mycite{deBruijn}.\mybr\
\begin{figure}[!htb]
\includegraphics[scale=0.60]{background/denovo.png}\centering
\caption[\textit{de novo} Assembly]{\label{fig:denovo}\textbf{\latin{de novo}
Assembly.} From overlapping regions of raw reads, \emph{contigs} are
created by integrating the reads sequences together.}
\end{figure}
\minisec{Read alignment}
This approach exploits prior knowledge. The reads are aligned to a reference to
hasten the reconstruction process. The reference may be a genome or a
transcriptome (provided that a good annotation is available).\mybr\
\begin{figure}[!htb]
\centering
\begin{subfigure}{1.1\textwidth}\label{fig:genoAlignment}
\includegraphics[scale=0.60]{background/genomeAlignment.png}\centering
\caption{Alignment to the genome}
%\label{fig:genoAlignment}
\end{subfigure}
\begin{subfigure}{1.1\textwidth}\label{fig:transAlignment}
\includegraphics[scale=0.6]{background/transcriptomeAlignment.png}\centering
\caption{Alignment to the transcriptome}
%\label{fig:transAlignment}
\end{subfigure}
\caption[Overview of main alignment strategies for \Rnaseq\
transcriptome]{\label{fig:OverviewRnaseqMapping}\textbf{Overview of main
reconstruction strategies for an \Rnaseq\ transcriptome by alignment to a
reference}}
\end{figure}
\subminisec{Genome reference}
Aligning to the genome allows discovering new genes or isoforms. However, it
requires splice-aware algorithms, \ie\ they need to align the
reads across the splice-junctions (which is possible but non-trivial).
As illustrated in \Cref{fig:genoAlignment},
the reads might span many discontinued regions of the reference.
While on the one hand, aligning to the genome avoids \emph{multiple mapping
issues}\footnote{Due to sequence similarity, a same read or subpart of a read
may be attributed to many different loci in the genome. As it is
impossible to attribute the read to its original locus of expression directly,
distribution models have to be pondered to avoid unnecessary skewness during
the quantification step.} for the same exon, this also implies that
the genome needs to provide the coordinates for the different isoforms
which will then require further analyses for accurate quantifications at that
isomeric level.
Indeed, irrespectively of the number of isoforms including a specific exon,
the sequence of this exon is transcribed only once in the reference.\mybr\
\subminisec{Transcriptome reference}
Using a transcriptome for reference instead of a genome reduces the complexity
of the aligning step due to the lack of intronic sequences.
However, it also limits the potential downstream analyses,
\eg\ any new (or unannotated) gene or isoform will be missed.
This approach is the easiest, but a pre-existing
accurate and well-annotated gene model is required.
\Cref{fig:transAlignment} shows in fact that this approach is simpler
to the previous one as a direct read alignment is done against the
transcriptome of reference.
This enables the accurate gene isoforms expression quantification,
provided that the gene model is correct and the reads may be
attributed unambiguously to a single isoform for each gene.
However, in practice, this approach produces many multimapped reads,
particularly for shorter reads as
many isoforms are very similar in sequence and vary only in the exons
they retain.
If the difference in the exon compositions is towards the end of the gene,
reads from two isoforms may be indistinguishable.
Paired-end sequencing helps to resolve part of the ambiguity encountered
with single-end protocols.\mybr\
To mitigate very computational greedy approaches and more constraining
ones, several tools complement the previous strategies.\mybr\
\minisec{Hybrid approach between \latin{de novo} and alignment}
There are tools like \toph\footnote{\toph\ ---
\href{https://ccb.jhu.edu/software/tophat/index.shtml}
{https://ccb.jhu.edu/software/tophat/index.shtml}} (2.0.12)~\mycite{tophat2},
that use a hybrid approach
between a reference alignment and a \latin{de novo} assembly.\mybr\
\minisec{Reads and fragments}
In the case of paired-end data,
each read of the pair is first processed separately.
Then, in the final evaluation phase and
with the help of additional information sources,
they are used as a pair to infer among the many possibilities which
are the most credible ones.
Both parts of a paired-end data once aligned to a
concordant region of the genome is then called a \emph{fragment} instead of a
\emph{read}.
Today, there is conflation between the \enquote{\emph{read}}
and the \enquote{\emph{fragment}} terms. Even though the term \emph{fragment}
is more accurate and may be used in any situation (as it equals to one read for
single-end data and a pair of related reads for paired-end data), it is
frequent to see the term \emph{read} instead (even for paired-end data).\mybr\
\toph\ ---
along with \soft{STAR}\footnote{\soft{STAR} ---
\Href{https://github.com/alexdobin/STAR}}~\mycite{STARpaper} ---
is the most popular splice-aware mapper for genomes with a near-complete annotation
(\eg\ for \species{Homo sapiens})~\mycite{tamaraRNA}
despite being slower than the latter~\mycite{hisat}.\mybr\
As many concepts or terms in Science, \enquote{\emph{read mapping}} can
have different (however very closely related) meanings.
Hence, while for many people, \emph{read mapping} will encompass any
transcriptome reconstruction strategy (including \latin{de novo} assembly) since
the main point is to map the features to functional annotations,
for others the term will only refer to
\enquote{read alignment} strategies specifically~\mycite{readMappingDef}.
Tools such as \soft{TopHat2} contributes to this confusion.\mybr\
\subsubsection{Quantification of \emph{features}}\label{subsubsec:rnaseqQuant}
When working with \Rnaseq\ data, the typical next step after mapping the reads or
fragments to the reference is to quantify the expression of the
feature\footnote{E.g.\ genes, isoforms, exons, splicing events.}
of interest.\footnote{In fact, while genotyping, heredity
studies and other genetically focused studies are possible in principle, the common
main focus is centred on expression estimation. For example, instead of
reporting a specific \gls{SNP}, in an \Rnaseq\ study, the core interest is
currently more about the specific allelic expression. Moreover, most of the \Rnaseq\
studies fail to provide the necessary sequencing depth and coverage for other kinds of study.}
In the context of this thesis, I only consider gene expression (either as \gls{RNA}
or as protein). Hence, many subtleties required for isoforms or exons studies are
here irrelevant and are left out of my overall review.\mybr\
Several tools and algorithmic approaches are available.
Indeed, for larger genomes, many regions may present high sequence similarity
which results in many \emph{ambiguous} (and challenging) reads as they mapped to
many potential genomic sites. These reads are also called \emph{multireads}.
One early strategy to solve multireads is to discard them from later analyses;
another one is to attribute them to the most credible locus based on
the overall distribution of the reads for a given sample. \mycite{Mortazavi2008}
Paired-end data help in many cases to discriminate between possible genomic
original sites, thus decreasing the overall number of multimapped fragments.\mybr\
Two popular tools were used in this thesis,
\cuffl\footnote{\cuffl\ ---
\href{http://cole-trapnell-lab.github.io/ufflinks/manual/}%
{http://cole-trapnell-lab.github.io/cufflinks/manual/}} (2.2.1)~\mycite{cufflinks} and
\htseq\footnote{\htseq\ ---
\href{http://www-huber.embl.de/HTSeq/doc/index.html\#}%
{http://www-huber.embl.de/HTSeq/doc/index.html\#}} (0.6.1p1)~\mycite{htseq},
which are both compatible with \toph\ but rely on different concepts.
I briefly present them below in their chronological release.\mybr\
\minisec{Cufflinks}\label{minisec:Cufflinks}
\cuffl\ is part of a collection of tools called
\soft{Tuxedo suite}\footnote{\soft{Tuxedo suite} user group:
\href{https://groups.google.com/forum/\#!forum/tuxedo-tools-users}%
{https://groups.google.com/forum/\#!forum/tuxedo-tools-users}} which also includes
\toph\ and \soft{Bowtie}.
\cuffl\ can assemble
\latin{de novo} novel transcripts and isoforms following the same principles
than \toph. Likewise, using good references is faster and more useful.
In both cases, \cuffl\ infers the most parsimonious\footnote{As a requirement to
Occam's razor, which may be a debatable strategy~\mycite{OccamRazorAndBiology}}
and credible set of transcripts (and their isoforms) that can explain the
complete set of observed fragments.\mybr\
This task is challenging as many isoforms are sharing a common set of exons.
While most genes present a dominant isoform for a specific condition,
there are often a few other isoforms expressed along, even though their amount
may be very limited~\mycite{oneDominant}.\mybr\
Furthermore, \cuffl\ tries assigning the multimapped reads to one isoform only.
First, sets of fragments are separated into sets of isoforms
(all fragments that are likely produced from the same set of isoforms
are regrouped together).
Then to estimate the abundance of each isoform of one set,
\cuffl\ integrates many information sources together.
For example, the overall distribution of fragments
(or reads), if these are spanning over known (or novel) splice-junctions.
Particular attention is drawn to the fragments that map unambiguously to one
unique isoform. When available, paired-end fragments are critical:
as they cover longer regions, the probability
that they span multiple adjacent exons is increased which helps to resolve
the possible structure of the original isoforms. \mycite{CufflinksPaper2}\\
The abundance is finally estimated through
an \gls{EM} algorithm~\mycite{EM-what,EM-algo-latest},
with the following main steps (see also \Cref{fig:cuffEstimation}):\mybr\
\begin{enumerate}
\item \emph{Initialisation}: For each fragment, a rough estimation of the
probability to be expressed from each isoform is computed based on the
different piece of information cited previously.
\item \emph{Iteration till convergence with the observed distribution of fragments}:
\begin{enumerate}
\item Isoforms abundance are recomputed based on the updated
fragment-to-isoform assignment \vspace{-2mm}
\item Fragment-to-isoform assignment re-updated based on the isoforms
abundance.
\end{enumerate}
\end{enumerate}
\begin{figure}
\includegraphics[scale=0.60]{background/CufflEstimationNew.pdf}\centering
\caption[Abundance estimation of isoforms by
Cufflinks]{\label{fig:cuffEstimation}\textbf{Abundance estimation of isoforms
by \cuffl} following an \gls{EM} algorithmic approach. [Adapted
from~\citet{Turner2015}]}
\end{figure}
Finally, to compute the gene expression levels, \cuffl\ aggregates \latin{per}
gene all the isoforms expression abundances together.\mybr\
\cuffl\ provides by default \FPKM\footnote{See \Cref{subsub:norm}.}
\emph{normalised} data.\mybr\
\minisec{HTSeq-count}
The \gls{Python} library \soft{HTSeq} provides a stand-alone script
(\htseq) performing the feature quantification with a more conservative strategy.
It discards all ambiguous reads from a \gls{SAM}/\gls{BAM} file and then
only counts the \emph{un}ambiguous reads that overlap with the features of
interest for a given gene model\footnote{Gene models are distributed as
annotation file (usually either as \gls{GTF} or \gls{GFF} format) and refer
to a specific reference (genome or transcriptome).}.\mybr\
\begin{figure}
\includegraphics[scale=0.65]{background/htseqQuantNew.pdf}\centering
\caption[Abundance estimation of genes by
HTSeq-count]{\label{fig:htseqEstimation}\textbf{Abundance estimation of genes
by \htseq}. The unambiguous reads (or fragments for paired-end data)
overlapping locus annotated as gene are directly counted. [Adapted
from~\citet{MarPhD}]}
\end{figure}
\htseq\ deems as ambiguous any multiread or read that overlaps more than
one annotation for the considered feature.\footnote{In fact, reads might be
discarded for one feature but kept for another one. For example, to quantify the
expression of a given gene, \htseq\ considers every read that unambiguously
overlaps with any of its annotated exons --- indeed, \htseq\ defines a gene
as the union of all its exons. However, while quantifying exon expression, many
of these same reads may be discarded as they overlap several exon annotations
with overlaying definitions.}
\htseq\ provides three modes for fine-tuning the overlap definition.
For this thesis, I used the \enquote{\texttt{intersection non-empty}} mode
(see \Cref{fig:htseqMode}).
This mode avoids discarding too many reads due to a too tolerant annotation
(\ie\ the annotation itself presents many overlapping definitions for a given
pair of feature and chromosome region).\mybr\
Initially, \htseq~\mycite{htseq} was designed for \gls{DGEA}.
This type of analysis compares expression profiles
to highlight the genes (or transcripts)
for which the expression is significantly different
depending on the considered condition.
As multireads are irrelevant for those studies,
including or excluding them from the downstream analysis is insignificant.\mybr\
Many papers
(\eg~\citet{Fonseca2014,errorsRNAquant,tophatStarwhatever})
have since shown that
the gene expression estimation by \htseq, while underestimated, is overall
well-correlated with other \gls{RNA} quantification methods
(\eg\ microarrays or \gls{RT-qPCR}).
Moreover, quantifications with \htseq\ are highly correlated with \cuffl\