\chapter{Available high-throughput human datasets}\label{ch:datasets}
\setlength{\epigraphwidth}{0.57\textwidth}%0.57
\setlength{\epigraphrule}{0pt}%0.1
\epigraphhead[5]{%
\epigraph{\emph{Data! Data! Data! I can’t make bricks without clay!}}{Sherlock Holmes~\mycite{Doyle1892}}
}
In the past few years, many laboratories have studied the expression
of human genes at the transcriptome and at
the proteome levels by taking advantage of high-throughput techniques
(\eg\ \citet{Krupp2012,VTpaper,ramskoldan:2009,Uhlen2014,%
Uhlen2015,UhlenGastro,GTExTranscript,PeptideAtlas,PandeyData,KusterData}).
In this chapter, I review the openly available (for research) data
I use within my thesis to explore gene expression in (undiseased) tissues,
and I explain how these data have been reprocessed.
Besides the results I report in the subsequent chapters,
the present chapter also provides the basis for work
I have co-authored\footfullcite{Wright-2016}.\mybr\
Unless otherwise stated, all the computational processing of the \Rnaseq\ part
described here has been performed by myself under the supervision of
\alvis. I also received general feedback from \mar,
\johan\ and \nuno. The proteome data has been processed by \james.\mybr\
\section{Introduction}
All the datasets were selected to fit three main criteria.
Firstly, they comprise normal (\ie\ reported as disease-free) human samples
from at least three different tissue types.
Secondly, gene expression
quantifications are based on \Rnaseq\ for the transcriptome and on label-free \ms\
for the proteome\footnote{These
technologies are non-targeted and high-throughput;
they allow one, in theory, to study the whole
repertoire of \glspl{RNA} or proteins in a sample.}.
Finally, the \emph{raw} data is openly available and reusable\footnote{%
In this context, \emph{reusable} means that the data can be processed
as accurately by third-party researchers as by the original authors,
without the need to access additional information
that has not been openly released.}.\mybr\
In the next section, I first describe the \Rnaseq\ and the \ms\ data I use
in my thesis; then I detail how these data have been processed to be
employed in the next chapters for various analyses that explore the transcriptome,
the proteome and finally, the comparison and integration of these two
biological layers.\mybr\
\section{Transcriptome RNA-Seq studies}\label{sec:rnaseq-data}
I describe hereinafter the five transcriptomic datasets I used
in the chronological order of their first public release.
\Cref{tab:Trans5DF} summarises the main characteristics of these datasets.\mybr\
\begin{sidewaystable}
\centering
\caption[General description of the 5 transcriptomic datasets
(\Rnaseq) used for this study]{\label{tab:Trans5DF}\textbf{General
description of the five transcriptomic datasets (\Rnaseq) used for this
study}\\\footnotesize{Illumina Body Map (IBM) has no \enquote{regular}
technical replicates as the \enquote{replicates} are the product of
different protocols, thus are unfit to estimate the specific noise of
either protocol (single-end or paired-end).
\NB\ The \enquote{total RNA} protocol used for the Castle dataset is not
the most common ribodepletion protocol: it is based on a
targeted amplification protocol.\\
}}
\begin{tabular}{@{}cccccccccc@{}}
\toprule
\multicolumn{1}{c|}
{\multirow{2}{*}{ArrayExpress ID}} &
\multicolumn{1}{c|}{\multirow{2}{*}{Data ID}} &
\multicolumn{2}{c|}{\begin{tabular}[c]{@{}c@{}}Library\\Preparation\end{tabular}} &
\multicolumn{2}{c|}{Sequencing} &
\multicolumn{2}{c|}{Replicates} &
\multicolumn{1}{c|}{\multirow{2}{*}{\begin{tabular}[c]{@{}c@{}}Number of\\Tissue\\
Types\end{tabular}}} &
\multirow{2}{*}{\begin{tabular}[c]{@{}c@{}}Multi-sampling\\ from the \\ same
individual\end{tabular}} \\
\cmidrule(lr){3-8}
\multicolumn{1}{c|}{} & \multicolumn{1}{c|}{} &
\multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}Total\\ RNA\end{tabular}} &
\multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}PolyA\\ selected\end{tabular}} &
\multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}Single\\ end\end{tabular}} &
\multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}Paired\\ end\end{tabular}} &
\multicolumn{1}{c|}{Biological} & \multicolumn{1}{c|}{Technical} &
\multicolumn{1}{c|}{} & \\
\midrule
E-MTAB-305 & Castle & \mycheckmark\ & & \mycheckmark\ & & & & 11 & \\
E-GEOD-30352 & Brawand & & \mycheckmark\ & \mycheckmark\ & &
\mycheckmark\ & & 8 & \\
E-MTAB-513 & IBM & & \mycheckmark\ & \mycheckmark\ & \mycheckmark\ & &
(\mycheckmark) & 16 & \\
E-MTAB-2836 \footnotesize{(and E-MTAB-1733)}& Uhlén & & \mycheckmark\ &
& \mycheckmark\ & \mycheckmark\ & \mycheckmark\ & 32 & \\
E-MTAB-2919 & \Gtex\ (v4) & & \mycheckmark\ & & \mycheckmark\ &
\mycheckmark\ & & 54 & \mycheckmark\\
\bottomrule
\multicolumn{10}{p{.8\textwidth}}{\raggedright\
\mynormalcheckmark\ indicates that the dataset presents the
characteristic, and \\(\mynormalcheckmark) that one (or more) of the
required criteria of the characteristic is lacking.}
\end{tabular}
\end{sidewaystable}
\subsection{Castle et al.\ dataset}\label{subsec:castlepresentation}
\vspace*{-0.2in}
\citet{castleData} released this dataset along with their study:
\paper{\citetitle{castleData}}.
The authors were interested in exploring the whole RNA repertoire
with sequencing-based technology and they primarily focused their study
on the non-coding part.\mybr\
Purchased \gls{RNA} extracts were used to create multiple-donor pooled
samples for 11 tissues, from which total \gls{RNA} libraries were prepared
following a total transcriptomic protocol
where nonribosomal \gls{RNA} transcripts are
amplified specifically by \gls{PCR}~\mycite{Armour:2009}.\mybr\
For each library (tissue), an average of 50 million single-end reads were
generated with an Illumina Genome Analyser-II sequencer.
The original reads were trimmed to 28 \gls{nt}
before being released through EMBL archives (\ENA{ERP000257}
and \ArrayExpress{E-MTAB-305}).\mybr\
Despite several limitations, such as the lack of replicates, the old technology and
the short reads,
I have included this dataset for two main reasons.
Firstly, it is the oldest available
\Rnaseq\ data I found that was performed on normal human tissues.
Thus, the congruence of results for this dataset with the following ones
gives a rough idea of the extent to which \Rnaseq\ datasets
may be integrated together.
Secondly, as \Rnaseq\ studies are prepared mainly with polyA-selected protocols
today, I was interested in gauging how the library preparation
protocols --- and the presence of \glspl{ncRNA} --- can affect the
quantifications and thus any final observation.\mybr\
\subsection{Brawand et al.\ dataset}\label{subsec:brawandpresentation}
\vspace*{-0.2in}
In the article entitled~\paper{\citetitle{VTpaper}},
\citet{VTpaper} focused their interest on the
evolution of the mammalian transcriptomes\footnote{While there were existing
studies on the matter, the sequencing approach was then creating new perspectives.}.\mybr\
They collected 6 organs from 10 different vertebrate species:
9 mammals (including human) and a bird. There are no technical replicates,
but two biological replicates per tissue:
one male and one female for every tissue except the testis (two males).
The 131 libraries (including 23 for \species{Homo sapiens}) were prepared with
a polyA-selected protocol.
Hence, they are largely enriched in \pcgs.\mybr\
In total, about 3.2 billion 76 \gls{bp}-long single-end reads were generated
using an Illumina Genome Analyser IIx, and they were released
through \gls{GEO} (accession number: GSE30352).
I personally retrieved the human data from
\ArrayExpress{E-GEOD-30352}\footnote{ArrayExpress was routinely importing
datasets from \gls{GEO} on a weekly basis until very recently.
While no longer automatic, \gls{GEO} data are still included in \egxa.}.\mybr\
\subsection{Illumina Body Map 2.0 (IBM)}\label{subsec:ibmpresentation}
\vspace*{-0.2in}
This dataset, created in 2010, was released in
2011\footnote{See~\citetitle{ibmEnsembl} -~\cite{ibmEnsembl}} by Illumina
and used Illumina's most recent technology at the time:
paired-end sequencing. Until then, all sequencing was done
from only one end of the \gls{DNA} or \gls{cDNA} fragments. Since then,
most \Rnaseq-based transcriptome studies have used
paired-end sequencing.\mybr\
The dataset covers 16 tissues (one donor per tissue) and
the libraries were prepared following a polyA-selected protocol;
they are thus enriched in protein-coding genes.\mybr\
Although each sample has been sequenced twice, thus yielding in principle
\emph{technical} replicates, these are \enquote{non-regular} technical replicates.
\emph{Technical} replicates, by contrast to \emph{biological} replicates,
usually imply that their processing uses the same sample source and protocols,
so that the error and noise due to a specific technique can be estimated.
Here, however, each tissue has been sequenced once with a single-end protocol
and once with a paired-end one to compare their ability to discriminate between
\gls{mRNA} isoforms.
Indeed, Illumina's main incentive to develop its paired-end
technology was to improve the accurate identification of spliced \glspl{mRNA}.\mybr\
The sequencing was performed with an Illumina HiSeq 2000, and the reads were
released through \ArrayExpress{E-MTAB-513} (\ENA{ERP000546}),
from where I have retrieved both the single-end and paired-end mono-tissue samples
(the original dataset includes raw data files for mixtures
that have been created with the tissue samples).\mybr\
Despite the lack of biological replicates,
it was for a long time
the most extensive freely available \Rnaseq\ dataset of human tissues.
Hence, it has been referenced many times
(\eg\ \citet{Asmann2012-in,ibmrelatedpaper,Smith2012-jw,Derrien2012-ej,%
Florea2013-in,tophat2,Kechavarzi2014-oi,Zhao2014-xj,Pasquali2014-zj,Corpas2014-ej,%
Petryszak2014-kj,Brown2015-co,Janes2015-wn,%
De_Simone2016-fb,Kern2016-dv,Iwakiri2016-sr,Yao2017-gj,Akers2018-cf})
in the literature since its release.
In fact, this dataset is the most viewed one in \gls{ArrayExpress}
(with 68,020 views on 31 May 2018 ---
the second most viewed dataset (46,247 views)
being \ArrayExpress{E-MTAB-62}\mycite{Lukk2010-op}).\mybr\
\subsection{Uhlén et al.\ dataset}\label{subsec:uhlenPresentation}
\vspace*{-0.1in}
Uhlén \etal\ have created the
\hFo{Human Protein Atlas}{https://www.proteinatlas.org/}
(often referred to as \glsxtrshort{HPA} in the literature).
This atlas revolves mostly around the spatial
distribution of proteins throughout the human body.
Using diverse approaches and techniques, including \Rnaseq,
they first released \Rnaseq\ data for 27 normal tissues
as part of their study:~\paper{\citetitle{Uhlen2014}}%
~\mycite{Uhlen2014}.
Later, they extended the dataset with new samples and 5 new
tissues. The latest version was published within~\paper{\citetitle{Uhlen2015}}%
~\mycite{Uhlen2015} in \textit{Science}.\mybr\
For each of the 32 tissues, there are at least two biological replicates.
With a few exceptions, the tissues have both male and female donors.
Many of the tissues also have technical replicates.
The total set comprises 200 samples,
which have been picked by pathologists
based on the screening of frozen biopsy samples.\mybr\
The polyA-selected libraries were paired-end sequenced
with an Illumina HiSeq 2000 or 2500.
I first worked with the early version of this dataset
(\ArrayExpress{E-MTAB-1733} --- 171 samples for 27 tissues),
and later upgraded to the extended, more recent version
(\ArrayExpress{E-MTAB-2836}).\mybr\
At the time of writing, this normal human dataset is the
most comprehensive freely and publicly available one,
both regarding the number of tissues (see \Cref{tab:Trans5DF})
and the number of samples (see \Cref{tab:Lib5DF}).
Therefore, its growing number of references is unsurprising.\mybr\
\subsection{GTEx dataset}\label{subsec:gtexPresentation}
\vspace*{-0.1in}
The Genotype-Tissue Expression (\gls{GTEx}) project is funded by the \gls{NIH}
Common Fund and aims to establish, in its authors' own words,
\textquote{a resource database and associated tissue bank for the study of the
relationship between genetic variation and gene expression
and other molecular phenotypes in multiple reference tissues}.
The project was first introduced in~\citet{GTEx2013}. It aims to quickly collect
various tissues from postmortem donors for genotype-tissue expression analyses
(notably \gls{eQTL} studies, which investigate the role of \glspl{SNP}
in the modulation of \gls{RNA} expression).
The results of the
analyses are released through the \hFo{\Gtex\ portal}{https://gtexportal.org}.\mybr\
As the project is quite ambitious and the collection
and sequencing of the samples spreads over a long period of time,
several intermediate data \enquote{\emph{freezes}} have been
released\footnote{Many groups are involved in collecting, producing or processing
the data. To ease communication and work coordination,
several time points are used, each referencing a specific state (version)
of the data. Each version of the data is called a \emph{freeze}.}.
My analyses include samples up to the fourth release of the pilot phase (v4).
This release covers 54 tissues/cell types (53 normal and 1 tumoral)
collected from 551 individuals for a total of 3,276 samples.\mybr\
The \Rnaseq\ libraries were prepared following a polyA-selected protocol
and have been paired-end sequenced on an Illumina HiSeq 2000/2500.
There is an average of 80 million reads per sample.\mybr\
For privacy reasons, the raw data is available only through controlled access via
\dbGaP{phs000424.v4.p1} (access number specific to the version of the data I used
in my study).
Unfortunately, this translates to a slow access time to the raw data.\mybr\
\vspace{5mm}
During the data selection process,
I had to disregard a few studies whose raw data did not fit
the reusability criterion~\mycite{Burge,Pan2008-di}.
Several times I came across studies with an ambiguous encoding
format for the raw data;
\citet{Burge} (\ArrayExpress{E-GEOD-41637}) is one example of a study
that I unfortunately had to dismiss,
as I was unable to resolve this issue
despite my best efforts in contacting the respective authors.\mybr\
\section[Proteome Mass Spectrometry bottom-up studies]%
{Proteome Mass Spectrometry bottom-up studies \quad}\label{sec:ProteoData}
As mentioned earlier,
the proteomic data have been selected and handled by \jyoti\ and \james.\mybr\
Until recently, compared to the transcriptome, the proteome field
was lacking expression quantification experiments
in normal human tissues. In fact, while there were human protein maps available
(\eg\ the \hFo{Human Protein Atlas}{https://www.proteinatlas.org}), these
mostly report the spatial expression of proteins (as they are based
on immunohistochemistry or other means of identification) rather than quantifying
their (non-targeted) abundance in each tissue.\mybr\
In 2014, two independent groups of authors~\mycite{PandeyData,KusterData}
published (in \textit{Nature},
issue 7502) their own \textquote{\emph{draft of the human proteome}}
based on the study of tissues with \ms. These two datasets complement a previous
smaller one that was publicly released but was never the object of a publication.\mybr\
Hereinafter, I present these three datasets that I use in my thesis.
See \Crefp{fig:pipelineProt}{A} for a short summary.\mybr\
\subsection{Pandey Lab dataset}
The Pandey Lab~\mycite{PandeyData} created the
\hFo{Human Proteome Map}{http://www.humanproteomemap.org} which
they released alongside~\paper{\citetitle{PandeyData}} in \emph{Nature}.\mybr\
For their study, they processed 30 kinds of histologically normal human
tissue and cell-line samples (17 adult tissues, 7 foetal tissues and 6
haematopoietic cell types).
Each sample was created by pooling samples from three
individuals (generally two males and one female).\mybr\
Their proteomic libraries were prepared with a label-free method to quantify
as many proteins as possible.
The samples were fractionated at the protein level through \gls{SDS-PAGE},
and then at the peptide level, after trypsin digestion, by \gls{RPLC}
to create 85 experimental samples.
Finally, state-of-the-art \gls{MS/MS} protocols
(with high-resolution and high-accuracy
\glspl{FTMS} Thermo Scientific \orbi\ instruments)
were used to generate about 25 million high-resolution (\gls{HCD})
mass spectra, which account for 2,212 \gls{LC-MS/MS} profiles.
The raw spectra were retrieved from ProteomeXchange via the repository
\Pride{PXD000561}.\mybr\
While the authors' effort to generate technically high-quality raw data was highly
praised by the scientific community, their processing
(identification and quantification) methods were criticised
(see~\citet{Ezkurdia2014-qx,Deutsch2015}).
Thus, for this thesis I have relied only on
quantifications provided by \james\ who reprocessed the raw spectra.\mybr\
\subsection{Kuster Lab dataset}
In their approach to the human proteome map, the Kuster Lab~\mycite{KusterData}
combined newly generated \gls{LC-MS/MS} spectrum data (about 40\% of their
complete working set) with already publicly available data
(either from their colleagues or accessible through repositories
--- for the remaining 60\%).
They reprocessed
the whole collection of spectra to maximise proteome coverage
and make it available through their own repository:
\hFo{ProteomicsDB}{https://www.proteomicsdb.org/}.\mybr\
The subset of data considered in my thesis is also
known as the [protein] Human BodyMap, which is the part that the Kuster Lab
primarily generated for their own study. They collected 48 experiments covering 36
tissues (adult and foetal) and cell lines. After \gls{LDS-PAGE} fractionation and
digestion into peptides with trypsin, they processed the samples with
\gls{LC-MS/MS} to create 1,087 profiles. Overall, that represents about
14 million \gls{HCD}/\gls{CID} spectra from Thermo Scientific instruments
(including an \orbi).
This specific raw data subpart was downloaded from \Proteomicsdb{PRDB000042}.\mybr\
\subsection{Cutler Lab dataset}\label{subsec:cutler}
This dataset was generated prior to the \pandey\ Lab and the \kuster\ Lab
data as it was released in 2011 through
\hFo{PeptideAtlas}{http://www.peptideatlas.org/}~\mycite{PeptideAtlas},
IDs: [PAe001768 --- PAe001778].\mybr\
It was created by Paul Cutler at Roche Pharmaceuticals.
It comprises 10 different tissues (one sample per tissue) that, after trypsin
digestion, were analysed through a Thermo Scientific \gls{LTQ}-\orbi\ instrument.
In total, there are 1,618 \gls{CID} profiles, which account for 13 million raw
\gls{CID} spectra.
While this dataset was never published on its own, it has been used in different
studies (\eg~\cite{KusterData}).
The raw files were accessed and downloaded from \Proteomicsdb{PRDB000012}.\mybr\
\section{Consistent processing pipelines}
The authors of these five transcriptomic and three proteomic studies have,
in most cases,
released the quantification of the expression values either directly
(\eg~\cite{Krupp2012}) or upon request (\eg~\cite{PandeyData}).
Third parties also distribute
quantifications for these studies, either retrieved from the original studies,
such as \hFoCi{BioGPS}{http://biogps.org/}{BioGPS1} or
\hFoCi{Harmonizome}{https://amp.pharm.mssm.edu/Harmonizome}{Harmonizome},
or after reprocessing the raw data as
the \WebFoCi{\egxa}{https://www.ebi.ac.uk/gxa/home}{EBIgxa} does.\mybr\
Primarily to reduce avoidable technical variability,
and despite readily available quantifications for most of the
datasets, I only used data reprocessed from raw files by myself or by \nuno\
(\gtex\ dataset) for the transcriptomic data, and by \james\
for the proteomic data, as already mentioned.\mybr\
In fact, each study was originally processed with a different protocol,
\eg\ \gtex{}~\mycite{GTExTranscript} and \castle{}~\mycite{Krupp2012}.
While the \egxa\ reprocesses raw data through the same
methods and has quantifications for most of the aforementioned
datasets (it still lacks the \castle\ one),
these were still the products of different protocols when I started my work.
Indeed, the datasets were processed with different versions of the
reference (human genome build and annotation) and of the tools.\mybr\
Intuitively, we expect different processing protocols to produce
different results.
As I started to work with \Rnaseq\ data,
I noticed many potential analysis variables that impact,
at various levels, the resulting gene expression values.
Many of these have since been
reported in the literature:
annotation versions~\mycite{annotationDiff},
contamination (from viral or bacterial \gls{DNA})~\mycite{contaminationRNAseq},
quality controls (and subsequent read-filtering choices)~\mycite{qualityRNAseq},
and mapping and quantification pipelines~\mycite{Fonseca2014}
all have considerable effects on the final quantification. Lastly, normalisation
methods also greatly impact the final expression figures~\mycite{Dillies2013,%
normSigCancerHelp}. For all these reasons,
I decided to reprocess all transcriptomic datasets with the same exact protocol
as the first step of my study.
Recently, \citet{Danielsson2015-cn} compared results based on prepublished data with
reprocessed ones and concluded that using a single processing pipeline ensures more consistent results.\mybr\
Likewise, there are various tools and many parameters for each processing step
needed to quantify the proteomic data
that may impact the final expression values~\mycite{Aebersold2011-av}.
For example, different search engines detect
different sets of peptides~\mycite{Griss2016}.
\citet{Mackay2015-zl} reviewed and analysed the impact
of many of these variables specifically for label-free proteomics,
such as the effects of \gls{FDR} thresholds, protein inference tools and normalisation methods.
Therefore, the three datasets were reprocessed uniformly,
from the raw spectra up to the normalisation of the protein expression values.\mybr\
\subsection{RNA-Seq raw data processing}
As presented in \Cref{sub:RNAseqworflow}, there are many steps from the raw data
files to the quantification matrices on which this thesis' analyses are based.
\Cref{fig:pipelineTrans} presents a general overview of the \Rnaseq\ processing
protocol I used.\mybr\
\begin{figure}
\includegraphics[scale=0.75]{datasets/pipelineTrans1.pdf}\centering
\caption[General steps for processing the transcriptomic
data]{\label{fig:pipelineTrans}\textbf{General steps for processing the
transcriptome.} The pipeline \irap\ integrates
all the tools needed for the state-of-the-art
processing of \Rnaseq\ data. The quality of the reads is checked and they
are trimmed if needed. After removal of possible contaminant reads (such as
\species{E. coli}), the reads
are aligned with \toph. The gene expression is then quantified with two
different approaches: based on the aggregation of isoforms for each gene, or
simply based on the number of aligned fragments on the gene locus defined in
the reference. \cuffl\ directly provides \FPKM\ values. \htseq\ provides
raw counts, which were normalised into \FPKM\ by an \irap\ function.}
\end{figure}
I downloaded and entirely processed four of the transcriptomic datasets
myself (\castle, \vt, \ibm\ and \uhlen\ data) and
\nuno\ retrieved and processed the \gtex\ dataset\footnote{As
the \Gtex\ data is involved in many projects within the \EBI\
and due to its huge amount of files (number and size --- see \Cref{tab:Lib5DF}),
it was agreed that this would be processed centrally by one person and then
redistributed to all the other interested parties. \nuno\ had this
tremendous task.}.
In this thesis,
I present results computed on the quantification of these
five datasets, which have all been processed through the same pipeline.\mybr\
\begin{table}
\centering
\caption[Technical description of the 5 transcriptomic datasets]{%
\label{tab:Lib5DF}\textbf{Technical description of the five transcriptomic
datasets}\\\footnotesize{I processed all the datasets except the one in
\textit{\color{darkgray}italic}.\\
For the Brawand dataset, I only included and processed the \species{Homo sapiens} part.}}
\begin{tabular}{@{}cccccc@{}}
\toprule
Dataset & \begin{tabular}[c]{@{}c@{}}Participant\\number\end{tabular} &
\begin{tabular}[c]{@{}c@{}}Library\\number\end{tabular} &
\begin{tabular}[c]{@{}c@{}}File\\number\end{tabular} &
\begin{tabular}[c]{@{}c@{}}Total size of\\the fastq\\raw files (GB) \end{tabular}&
\begin{tabular}[c]{@{}c@{}}Mean number of\\biological samples per\\tissue
[min;max]\end{tabular} \\
\midrule
Castle & 10 & 11 & 11 & 58 & 10 (mixture) \\
Brawand & 18 & 21 & 23 & 111 & 2.8 [2;3] \\
{\small Illumina Body Map} & 16 & 36 & 48 & 1,004 & 1 \\
Uhlén & 122 & 200 & 400 & 1,851 & 3.81 [2;11] \\
\textit{\color{darkgray}\Gtex\ (v4)} &
\textit{\color{darkgray}551} & \textit{\color{darkgray}3,276} &
\textit{\color{darkgray}6,552} &
\textit{\color{darkgray}$\sim$ 50,000} & \textit{\color{darkgray}60.67 [4;214]}\\
\bottomrule
\end{tabular}
\end{table}
\subsubsection{Data retrieval and preparation}
I retrieved the human raw data of each dataset from \gls{ArrayExpress} and
\gls{ENA} through their identifier (see \crefp{sec:rnaseq-data}). After we
received our access approval, \nuno\ retrieved \Gtex\ data from
\gls{dbGaP}.\mybr\
While most of the raw files can be used as they are, an additional step is
needed for the \castle\ files. Indeed, these files use an older
\fastq\ format that is not compliant with the most accurate and recent tools used
for this thesis.
As it is a simple matter of changing the quality score scale
(see \cref{sec:PhredScore}),
I converted these files to \gls{Phred}$+33$ \fastq\ files with a
\emph{\gls{Perl}} script (provided digitally as supplementary data).\mybr\
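The conversion itself amounts to shifting each quality character by the
offset difference between the two scales. As an illustration only, here is
a minimal Python sketch of the idea (the actual conversion used the
aforementioned \gls{Perl} script); it assumes the old files encode qualities
as \gls{Phred}$+64$, and the file names are placeholders:
\begin{verbatim}
# Minimal sketch, assuming Phred+64 input qualities; a
# Solexa-scaled file would need an extra score remapping.
def phred64_to_phred33(qual64):
    # Shift every quality character down by 31 (64 - 33).
    return "".join(chr(ord(c) - 31) for c in qual64)

with open("old.fastq") as fin, open("new.fastq", "w") as fout:
    for i, line in enumerate(fin):
        if i % 4 == 3:  # every 4th line is the quality string
            fout.write(phred64_to_phred33(line.rstrip("\n")) + "\n")
        else:
            fout.write(line)
\end{verbatim}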
\subsubsection{Genome and annotation reference}
I collected and processed the datasets over an extended period of time.
Hence, for a subset of them,
I produced many intermediate sets of results based on the \hg{37.p12}
(and later \hg{37.p13}) human reference genome and the latest available
\gls{Ensembl} gene set annotation (73, 74 or 75) at that time.
In fact, the quality of each new annotation update is
generally greater than that of its predecessor\footnote{Although
it is not unusual to have gene or transcript additions based on new studies
that are then removed (or merged into another) in a later version.
}.\mybr\
As the \gtex\ data was processed with \hg{38.p1} and \ens{76}, that led me
to reprocess the four other \Rnaseq\ datasets for the sake of consistency and
to avoid more biases~\mycite{h38vsh37}. Thus, unless indicated
otherwise, the results presented in the current work are
based on the \hg{38.p1} human genome reference and the \ens{76} gene set annotation.\mybr\
\subsubsection{Data processing}\label{subsubsec:RnaseqDataProc}
In the early stages of my research, I was processing each of the different steps
sequentially and semi-manually with the help of custom-made scripts. While the
\EBI\ \gls{cluster} greatly facilitated the handling of the numerous files,
the task remained quite tedious.
Additionally,
the scripts I wrote would need a fair amount of work to achieve general
reproducibility on other platforms.\mybr\
Fortunately, \nuno\ developed
an \enquote{integrated RNA-seq analysis Pipeline}:
\softFoCi{\irap}{https://nunofonseca.github.io/irap/}{Fonseca2014-si}.
This tool automates a typical
state-of-the-art and optimised workflow for
\Rnaseq\ studies, and it takes full advantage of the capabilities provided by computer clusters.
Thus, I switched from my original set of scripts to \irap\
to improve my workflow without changing any step or parameter.\mybr\
Besides the usual input files (raw \Rnaseq\ files and genome/annotation
references), \irap\ needs a configuration file that precisely describes the
dataset (its design and technical features) and, if needed, specific parameters
to use.
To provide full reproducibility,
each version of \irap\ is shipped with
its own set of third-party version-defined tools and default parameters.
Thus, apart from remarkably speeding up the data processing,
\irap\ also ensures the protocol integrity
across the five transcriptomic datasets I use
in my thesis regardless of who runs the pipeline.\mybr\
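For illustration, a hypothetical minimal configuration sketch is given below.
The key names follow the \irap\ documentation but may differ between
versions, and every name, path and value is a placeholder rather than an
excerpt from my actual configuration files:
\begin{verbatim}
# Hypothetical minimal irap configuration sketch.
name=my_dataset
species=homo_sapiens
reference=Homo_sapiens.GRCh38.dna.fa
gtf_file=Homo_sapiens.GRCh38.76.gtf
data_dir=/path/to/data
# One paired-end library: mate files, read size,
# insert size and its standard deviation.
pe=libA
libA=sampleA_1.fastq.gz sampleA_2.fastq.gz
libA_rs=100
libA_ins=200
libA_sd=50
\end{verbatim}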
Each of the transcriptomic datasets is the product of the same version of
the \irap\ pipeline (development version 0.6.3b) and the same set of parameters. As the
default parameters of \irap\ are tuned for human Illumina paired-end data,
I only had to define a few of them. Hence, the quality and contamination checks,
and the filtering and trimming of the reads, were done following the default options
of \irap.\mybr\
\minisec{Quality assessment, trimming and filtering}
\irap\ internally uses the
\softFo{FastX toolkit}{http://hannonlab.cshl.edu/fastx\_toolkit/} (0.0.13)
to perform the
assessment and the trimming. The usual uninformative and ambiguous reads
(see \Crefu{subsub:trim})
have been discarded, as were any with an overall quality score below
a threshold of 10.\mybr\
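As a simple illustration of this kind of filter (and not of the FastX
implementation itself), a read could be discarded when its mean \gls{Phred}
score falls below the threshold:
\begin{verbatim}
# Minimal illustrative sketch: discard a read whose overall
# (here: mean) Phred quality is below 10. This is not the
# FastX toolkit implementation.
def mean_quality(qual33):
    scores = [ord(c) - 33 for c in qual33]  # Phred+33 decoding
    return sum(scores) / len(scores)

def keep_read(qual33, threshold=10.0):
    return mean_quality(qual33) >= threshold
\end{verbatim}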
The quality of the calls decreases as the base calling progresses ---
see \Crefu{subsub:sequencing}.
On another note, some tools (mappers in particular) need
all the reads to be trimmed to the same length. \irap\ optimises the compromise
between the purity and the length of the reads, to avoid additional errors or
biases due to shorter reads~\mycite{Trimwisely}: it trims at most 15\% of the
original length, discarding more reads if necessary to maximise the length.\mybr\
Reads that could be assigned to a likely contamination source, here
\species{Escherichia coli} (as I work with \species{Homo sapiens}),
are also discarded. A non-splice-aware mapper,
\softFo{Bowtie}{http://bowtie-bio.sourceforge.net/}~(1.1.1)~\mycite{Bowtie},
maps all the reads to the
contaminant genome, and all the reads mapping perfectly and unambiguously are
discarded.\mybr\
\minisec{Mapping}
I mapped the reads to the genome (\hg{38.p1})
and the transcriptome (\ens{76} gene set annotation)
with \irap's (0.6.3b) proposed default splice-aware mapper
\hFo{\toph}{https://ccb.jhu.edu/software/tophat/index.shtml}
(2.0.12)~\mycite{tophat2}
with its set of default predefined arguments.
Indeed, \toph\ can handle reads from many organisms
by fine-tuning the parameters (\eg\ the number of mismatches or indels to tolerate),
but the default parameters are tuned for normal human data.\mybr\
\minisec{Quantification and Normalisation}\label{minisec:quantNorm}
While \Rnaseq\ can be used to identify (and discover) \gls{RNA}
isoforms, I have focused my thesis on gene-level expression. Indeed, current
annotations and knowledge are still lacking regarding the reasons and
external conditions that favour the expression of a specific isoform over the
others. In addition, criticisms have been raised about the accuracy with which
isoforms can be distinguished~\mycite{tamaraRNA,Janes2015-wn,Dapas2017-ui}.\mybr\
However, normalising gene expression
presents more challenges than specific transcript expression.
For instance, the definition of the gene length may be different
from one laboratory to another.
In this thesis' framework,
when I have to use a gene length for a computation,
I use the same gene length definition as
\irap\ and \egxa.
Thus, as shown in \Cref{fig:Genelength-collapsedExons}, the gene
length is defined as the sum of the lengths of all its \emph{collapsed} exons
(a sketch of this computation follows the figure).\mybr\
\begin{figure}
\includegraphics[scale=0.8]{datasets/collapsedExonsLinLib.pdf}\centering
\caption[Gene length is equal to the sum of the lengths of all its collapsed
exons]{\label{fig:Genelength-collapsedExons}\textbf{Gene length is equal
to the sum of the lengths of all its collapsed exons.} Though
this method is not perfectly accurate, it provides a sufficient estimation
of the gene length for an efficient normalisation of the length bias.
The coordinates of the 5' and 3' ends of each exon are extracted from
the annotation and the exons are collapsed together. This \emph{gene length}
is unaffected by the incorrect attribution of a fragment to a
specific transcript when there are many possible options.}
\end{figure}
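To make the definition concrete, here is a minimal Python sketch of the
collapsed-exon gene length, assuming exons are given as 1-based, inclusive
(start, end) coordinate pairs as in a GTF annotation:
\begin{verbatim}
def collapsed_gene_length(exons):
    # Merge overlapping or adjacent exon intervals,
    # then sum the merged lengths.
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return sum(e - s + 1 for s, e in merged)

# Two isoforms sharing an exon: the overlap counts only once.
print(collapsed_gene_length([(1, 100), (50, 120), (200, 260)]))
# -> 181 (collapsed intervals [1,120] and [200,260])
\end{verbatim}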
As mentioned in \Crefu{subsub:norm}, I used two different popular tools based on
different strategies to estimate
gene expression levels:
\hFo{\cuffl}{https://cole-trapnell-lab.github.io/cufflinks/manual/}
(2.2.1)~\mycite{cufflinks}
and \hFo{\htseq}{https://htseq.readthedocs.io/}
(0.6.1p1)~\mycite{htseq} (with the \texttt{intersection non-empty} mode).
These tools are also integrated in \irap.\mybr\
For \cuffl, I used the mode where the multi-mapped reads are probabilistically
assigned depending on the coverage of each mapped locus. In addition,
\cuffl\ provides normalised gene expression levels by aggregating their
corresponding normalised isoform expression levels. \cuffl\ uses
\cref{eq:rpkm-fx} to normalise isoform expression levels. The lengths of the
isoforms are extracted from the reference.\mybr\
On the other hand, \htseq\ provides only \emph{raw} counts for the feature of
interest. \irap\ provides an internal \FPKM\ normalisation function that
is an implementation of \cref{eq:rpkm-fx}. As I requested \htseq\ to work at
the gene level, this formula requires gene lengths, which are computed with the
aforementioned method.\mybr\
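As an illustration of that normalisation step, here is a sketch assuming
\cref{eq:rpkm-fx} is the usual \FPKM\ definition (this is not \irap's actual
code, and the variable names are hypothetical):
\begin{verbatim}
# Minimal sketch: FPKM_g = counts_g * 1e9 / (total * length_g),
# with `counts` the htseq-count output per gene and `lengths`
# the collapsed-exon gene lengths described above.
def fpkm(counts, lengths):
    total = sum(counts.values())  # total assigned fragments
    return {g: c * 1e9 / (total * lengths[g])
            for g, c in counts.items()}
\end{verbatim}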
All the configuration files I created for this thesis may be found at
\addressToirapConfFiles.\mybr\
As the paired-end set of the \ibm\ data presented an overall better
quality than its single-end counterpart,
I only include \ibm{}'s paired-end data in the remainder of the thesis.\mybr\
\subsection{MS data processing}\label{subsec:msDataProcess}
After retrieval of the data from \gls{Pride} and \gls{Proteomicsdb},
\james\ reprocessed the three proteome \ms-based datasets
in a consistent and optimal manner. I summarise this protocol
in \Crefp{fig:pipelineProt}{}
and in the following \cref{subsub:spectralProcessing,subsub:seqDBseeking,%
subsub:spectralIDDBsearch,subsub:resultsFiltering}.
See~\citet{Wright-2016,Weisser2016-pipeline} for more details.\mybr\
\begin{sidewaysfigure}
\includegraphics[scale=0.75]{datasets/pipelineMS-2.pdf}\centering
\caption[General steps for processing the proteome
data]{\label{fig:pipelineProt}\textbf{General steps for processing the
proteome.} [Adaptation of courtesy materials from \james].\\
(\textbf{A}) The three datasets have been processed through the
same pipeline. In this thesis, I only use the samples from adult tissues.
(\textbf{B}) Extensive sources of protein sequences were
used for the search database, including prediction of novel proteins.
Contamination and decoy sequences were also included to allow for \gls{FDR}
estimation. (\textbf{C}) A state-of-the-art workflow was used to process the
\ms\ data from the raw files.
This workflow combines multiple \ms\ search engines and
post-search evaluation tools. Results were filtered by peptide length,
\gls{FDR}, \gls{PEP} and agreement between the multiple search algorithms.
Note that there is no relation between the real size of
the database parts and their representation;
the decoy sequences are as numerous as the sum of the known and possible candidates.}
\end{sidewaysfigure}
\subsubsection{Spectral processing}\label{subsub:spectralProcessing}
The \soft{msconvert} module of \softFull{ProteoWizard}%
{http://proteowizard.sourceforge.net/}{ProteomeWizard}{v3.0.6485}
converted all the files to the standard mzML format.\@
\softCi{TOPP}{Topp} from
\softFull{OpenMS}{https://www.openms.de/}{openMS2016}{pre-v2.0 development build}
processed the raw spectra: notably, \soft{PeakPickerHiRes}, which centroids them,
and \soft{FileMerger}, which merges the ones from the same fractionated experiments.\mybr\
\subsubsection{Sequence database creation and searching preparation}%
\label{subsub:seqDBseeking}
The target sequence database is a critical element of the \ms\ pipeline and thus,
\james\ has carefully designed it.
It combines six different parts: three based on
known protein sequences and three others covering possible new protein candidates.\mybr\
The known sources include the complete human \hg{38} (v.20) \gls{CDS} translated
sequences from \gls{GENCODE}; the human reference proteome from
\gls{Uniprot}\footnote{%
\gls{Uniprot} --- \Href{http://www.uniprot.org/}}~\mycite{UniProt2017}
(in its May 2014 version); common contamination protein
sequences\footnote{Contamination sequences ---
\Href{http://maxquant.org/contaminants.zip}}
and \gls{HLA} sequences\footnote{\gls{HLA} sequences ---
\Href{https://www.ebi.ac.uk/ipd/imgt/hla/download.html}}.
\cpv{This \emph{known} portion of the target sequence database
represents 787,587 tryptic peptide sequences.}\mybr\
The sources for potential
novel proteins included a selection of non-coding gene sequences (including
pseudogenes, \gls{lncRNA} and untranslated regions) from \gls{GENCODE}
\hg{38} (v.20); prediction of novel sequences with
\softFoCi{\mbox{AUGUSTUS}}{http://bioinf.uni-greifswald.de/augustus/}{Stanke2004};
a set of two-consensus predictions (December 2013)
from \softFoCi{Pseudogene.org}{http://pseudogene.org/}{pseudogenesOrg}
and three-frame translated \Rnaseq\ transcript sequences.
These translated sequences include models
built on \ibm\ by \gls{Ensembl} and by the Kellis lab, in addition to
models built on different \gls{ENCODE} cell lines by Caltech and CSHL\@.
\cpv{This \emph{novel} portion of the target sequence database
provides an addition of 4,211,835 tryptic peptide sequences.}\mybr\
\mybr\
%Precautions were taken with the handling of the overlapping regions
%and with the sequence three-frame translations.
\cpv{\softFo{Mimic}{https://github.com/percolator/mimic} generated
4,999,422 ($787,587 + 4,211,835$) randomised decoy sequences,
\ie\ the decoy database and the target database contain an equal number of peptide sequences.
The different databases were then merged together.}
This is represented in \Cref{fig:pipelineProt}{\footnotesize B}.\mybr\
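To illustrate the decoy idea only (this is not Mimic's actual algorithm,
which is more elaborate and preserves the amino acid statistics of the
target set), a naive per-peptide decoy generator could look like this:
\begin{verbatim}
import random

# Naive sketch: one shuffled decoy per target peptide, so the
# decoy and target databases have the same number of entries.
def naive_decoy(peptide, seed=0):
    random.seed(seed)
    # Keep the C-terminal residue (tryptic K/R) in place and
    # shuffle the rest (a common simple decoy strategy).
    core = list(peptide[:-1])
    random.shuffle(core)
    return "".join(core) + peptide[-1]
\end{verbatim}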
To account for isobaric peptides, all isoleucine (I) residues were
converted to leucine (L) before the search; after the search, all leucine
(L) residues were converted to J\footnote{As J is one of the letters
of the Latin alphabet that does not map to any amino acid.}
to avoid later misinterpretations.\cpvD\mybr\
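This residue handling reduces to two string substitutions, as in the
following minimal sketch:
\begin{verbatim}
# I -> L before the search (isoleucine and leucine are
# isobaric), then L -> J after the search to flag the
# ambiguity explicitly.
def pre_search(peptide):
    return peptide.replace("I", "L")

def post_search(peptide):
    return peptide.replace("L", "J")

assert post_search(pre_search("PEPTIDE")) == "PEPTJDE"
\end{verbatim}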
%\NB\ For each peptide sequence, there is a unique identifier,
%an associated source database
%and, when available, the corresponding gene locus.
\subsubsection{Spectral identification and database search pipeline}%
\label{subsub:spectralIDDBsearch}
\Cref{fig:pipelineProt}{\footnotesize C} describes the overall workflow used by
\james\ to quantify the protein abundance in each tissue. As mentioned in
\Cref{seg:moreAlgoisbetter}, workflows involving several algorithms produce
better results. A \soft{Mascot} Server~(v~2.4 --- Matrix Science) cluster produced
a first search on the mzML files, submitted through \soft{MascotAdapterOnline}
(part of \soft{TOPP}). In parallel, \james\ also used \soft{MS-GF~$+$~Search},
which involves running \soft{MS-GF~$+$}~(v. 10089)~\mycite{Kim2014-nj}.
\soft{MascotPercolator}~(v~2.08)~\mycite{Brosch2009-xq,Wright2012-od}
optimised and rescored the results from \soft{Mascot}, and
\soft{msgf2pin/Percolator} (v~2.08\textminus1)~\mycite{Granholm2014-cw} optimised
the results from \soft{MS-GF~$+$}.
Finally, \soft{SEQUEST}~\mycite{Eng1994-eq} and
\soft{Percolator}~\mycite{Spivak2009-zd} performed a search in a
\soft{Proteome Discoverer}~(v 1.4 --- Thermo Scientific) workflow.\mybr\
The different workflows used common stringent parameters for all the database
searches:
the precursor tolerance was set to 10 \gls{ppm};
fragment tolerance for \gls{HCD} spectra to $0.02$ \gls{Da} and
to $0.5$ \gls{Da} for \gls{CID} spectra;
the allowed missed cleavages were limited to $3$.
\cpv{As described in~\citet{Wright-2016},
the searches also accounted for several amino acid modifications
by including (known) mass tolerances.
The fixed modification carbamidomethyl ($+57.0214$ \gls{Da}) was specified
for all cysteine residues.
The searches also comprised the following variable modifications:
N-terminal acetylation ($+42.01056$ \gls{Da}),
N-terminal carbamidomethyl ($+57.0214$ \gls{Da}),
deamidation of asparagine and glutamine residues ($+0.984$ \gls{Da}),
oxidation of methionine residues ($+15.9949$ \gls{Da}),
and the possible N-terminal conversion to pyro-glutamine of
glutamine ($-17.0265$ \gls{Da}) and glutamic acid ($-18.0106$ \gls{Da})
residues.}\mybr\
The search results were converted into mzTab formatted files and uploaded along
with the mzML spectra and FASTA search database to \Pride{PXD002967}.\mybr\
\subsubsection{Results processing and filtering}\label{subsub:resultsFiltering}
Custom Perl scripts parsed, merged and filtered the results of each search engine
so that every \gls{PSM} had the same identification in at least two of the three
search engines.
In each case, the least confident \gls{PEP} (\ie\ the highest, see \Cref{PEP}) was retained.\mybr\
The \glspl{PSM} were then filtered to keep only the matches satisfying the
three following criteria:
a \qval\ less than or equal to $0.01$ (\ie\ 1\% \gls{FDR});
a \gls{PEP} less than or equal to $0.05$; and
a peptide length of at least seven amino acids.
\glspl{PSM} matching contaminant or decoy sequences were also removed.\mybr\
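These criteria translate directly into a filter of the following form
(a sketch assuming each \gls{PSM} is represented with the fields shown,
whose names are hypothetical):
\begin{verbatim}
# Minimal sketch of the PSM filtering criteria.
def keep_psm(psm):
    return (psm["qvalue"] <= 0.01          # i.e. 1% FDR
            and psm["pep"] <= 0.05         # posterior error prob.
            and len(psm["peptide"]) >= 7   # peptide length
            and not psm["is_decoy"]
            and not psm["is_contaminant"])
\end{verbatim}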
The resulting list of peptides was then used to infer the proteins with
a simple approach.
Protein clusters were created based on a common matching non-null set of peptides,
\ie\ each protein cluster has at least one unique peptide.
Then, the \gls{GENCODE} \gls{CDS} and \gls{Uniprot} accessions were mapped back to
\gls{Ensembl} identifiers.
Proteins with a gene (or gene cluster) definition
matching at least three unique peptides
were kept for the remainder of the analysis,
while the others were discarded.\mybr\
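A simplified sketch of this inference rule, assuming a hypothetical mapping
from protein accession to its set of matched peptides, could be:
\begin{verbatim}
from collections import defaultdict

# Proteins sharing the same non-empty peptide set form one
# cluster; clusters with fewer than three unique peptides
# are discarded. (A simplification of the actual procedure.)
def infer_clusters(prot_to_peps):
    clusters = defaultdict(list)
    for prot, peps in prot_to_peps.items():
        if peps:  # non-null peptide set
            clusters[frozenset(peps)].append(prot)
    return [(prots, peps) for peps, prots in clusters.items()
            if len(peps) >= 3]
\end{verbatim}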
The quantification of the retained proteins was computed for each experiment
with an approach close to the Top3 method~\mycite{Silva-Top3}.
The precursor intensities of the three most intense \emph{unique} peptides
per gene identifier (or per gene cluster) were summed,
before being divided by the total summed quantification of all proteins
in each sample to provide the \enquote{\emph{within sample abundance}}.
Then, these abundance values were normalised by the ten genes displaying
the lowest coefficient of variation across all tissues.
When there was more than one experiment per tissue,
the final quantification values are
the median values across all the replicates of each tissue.\mybr\
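The core of this Top3-like computation is sketched below (the subsequent
normalisation by the ten most stable genes and the median over replicates
are omitted), assuming a hypothetical mapping from each gene identifier to
the precursor intensities of its unique peptides in one experiment:
\begin{verbatim}
# Minimal sketch of the within-sample abundance.
def within_sample_abundance(intensities):
    # Sum the three most intense unique peptides per gene...
    top3 = {g: sum(sorted(v, reverse=True)[:3])
            for g, v in intensities.items()}
    total = sum(top3.values())
    # ...then divide by the summed quantification of all
    # proteins in the sample.
    return {g: s / total for g, s in top3.items()}
\end{verbatim}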
Protein clusters matching several \gls{Ensembl} gene identifiers or
failing the \emph{unique peptide} rule are discarded
from the further analyses presented.
The list of discarded clusters is different for each of the
proteome datasets.\mybr\
Compared to the original \pandey\ Lab study~\mycite{PandeyData},
fewer proteins were quantified, but
the results are congruent with other previous studies on the range of detection
and quantification of \gls{LC-MS/MS}.
The quantifications were released along with our paper~\mycite{Wright-2016},
and the reanalysis of the \pandey\ data was also released
through \egxa\ under the accession E\textminus{}PROT\textminus{}1
and described in %\citet{NAR2015,NatComm2016}.
\citet{EBIgxa,Wright-2016}.\mybr\
\section{Discussion and Conclusion}
In this chapter, I introduced the five transcriptomic and three proteomic normal
human tissue datasets on which I based my thesis.
I described how both the transcriptomic and the proteomic datasets have been
reprocessed from the raw files with state-of-the-art unified pipelines,
which also use the same genome build and annotation references in the
final processed version.\mybr\
As mentioned before,
I have produced intermediate results for a subset of the transcriptomic datasets
with the previous human reference genome~(\hg{37}) and
three different \gls{Ensembl} gene set annotations (73, 74 and 75).
I have run many of the analyses of \Cref{ch:expression,ch:Transcriptomics} on
these data.
While the results may vary for individual genes,
the overall outcomes are congruent, hence
supporting the robustness of the findings presented in this thesis.
\cpv{In addition, all the products of the \Rnaseq\ pipeline
are in agreement with the original studies' findings.
The \ms\ pipeline also produces results similar to the original studies
--- except for~\citet{PandeyData},
whose original processing and results raised many criticisms~\mycite{Ezkurdia2014-qx}}.
\mybr\
While we are in the era of the \emph{data deluge} and \emph{big data},
the number of tissue overlaps between independent normal human studies is surprisingly small
--- see \Crefp{fig:VennStudiesT}.
Most of these datasets have been (and will be) referenced in
many papers for comparison (or control) purposes;
hence, it is essential to evaluate the soundness of these practices by assessing
the consistency between these datasets.\mybr\