-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathproteomics.tex
executable file
·619 lines (550 loc) · 28.7 KB
/
proteomics.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
\chapter{Human MS-based Protein expression landscape}\label{ch:proteomics}
\setlength{\epigraphwidth}{0.7\textwidth}
\setlength{\epigraphrule}{0pt}
\epigraphhead[5]{%
%\epigraph{\emph{On m'a enseigné que la voie du progrès
%n'était ni rapide ni facile.}}{Marie Curie}
\epigraph{\emph{I was taught that the way of progress is
neither swift nor easy.}}{Marie Curie}
}
After exploring the high-throughput human transcriptomic studies
in \Cref{ch:Transcriptomics}
and before integrating them with proteomic data in \Cref{ch:Integration},
I present in this chapter the comparison of
the three proteomic datasets introduced in \Cref{ch:datasets}.
\citet{Ezkurdia2014-qx,Deutsch2015} have partially reviewed these data.
However, we have reprocessed them starting from the raw data
for this thesis.
In this context, reassessing the quantified processed proteomic data
before any integration is pertinent.\mybr\
The work presented in this chapter was done in collaboration with \james\
who has implemented the new protein quantification method
(presented in \Cref{sec:NewQuantProt}).
I have received general feedback from \alvis, \mar,
\sarah{}.\mybr\
\section{An~overall~fragmented~and~disparate~universe~to~explore}
As I have described in \Cref{ch:background},
proteins present a wide range of physicochemical properties
(see \Cref{sec:bio} and \Cref{sec:aa})
and are challenging to identify and quantify
in high-throughput studies (see \Cref{sec:exploreProtMS}).
Thus, it comes as no real surprise that
while the use of \ms\ for proteomics has been developing since the 1980s~\mycite{BiomolBio},
the first notable attempts to draft the human proteome occurred only recently
in 2014~\mycite{PandeyData,KusterData},
or that the oldest (unpublished) available multi-tissue \cutler\ dataset is from 2010
(see \Cref{subsec:cutler}).
Till early 2019,
\cutler\ Lab, \kuster\ Lab and \pandey\ Lab datasets are the only ones
that explore concurrently several non-diseased human tissues.
See \Cref{sec:ProteoData} for more details and
the processing pipeline designed and implemented by \james\ to handle them.\mybr\
As presented in \Cref{fig:VennDiagProt3},
they share four tissues: \heart, \lung, \ovary\ and \pancreas.
The protein overlap of this four-tissues set between these three datasets
is rather narrow as shown in \Cref{fig:VennProtComm}.
The \cutler\ Lab dataset shares
the smallest number of tissues with the two other studies;
\pandey\ Lab and \kuster\ Lab datasets share over twice as many proteins
that they share with \cutler\ (3,338 instead of 1,384).
The number of shared proteins between \pandey\ Lab and \kuster\ Lab data rise to 4,172
when all their fourteen common tissues are considered.\mybr\
\begin{figure}[!htb]
\includegraphics[scale=0.71]{proteomics/VennDiagProtCond.pdf}\centering
\vspace{-6mm}
\caption[Distribution of unique shared tissues between
the 3 MS-based proteomic studies]{\label{fig:VennDiagProt3}\textbf{Distribution
of unique and shared tissues between the three MS-based proteomic studies.}
The three datasets share together: \heart, \lung, \ovary\ and \pancreas.
The two most recent studies share fourteen tissues in total;
the additional ten tissues are: \adrenal, \hcolon, \gallbladder,
\oesophagus, \kidney, \liver, \placenta, \prostate, \rectum\ and \testis.}
\end{figure}
\vspace{1.4cm}
\begin{figure}[!htb]
\includegraphics[scale=0.71]{proteomics/VennDiagProtFor4T.pdf}\centering
\vspace{-2mm}
\caption[Proteins overlap between the common tissues of
the 3 proteomic studies]{\label{fig:VennProtComm}\textbf{Proteins
overlap between the common four tissues of the three proteomic studies.}
Unique and shared proteins detected and quantified
across the three \ms\ studies for their four shared tissues:
\heart, \lung, \ovary\ and \pancreas.}
\end{figure}
%\clearpage
\subsection{MS proteomic data has high detection variability}
\Cref{fig:barplot3Dvennprot}
illustrates the number of proteins identified in each of these four tissues.
The colours indicate in which dataset (or group of datasets)
the proteins have been identified.
See \Cref{fig:protbkdownT} for the precise numbers of each set.\mybr\
\begin{figure}[!htbp]
\includegraphics[scale=0.65]{proteomics/barplotVenn3Dprot.pdf}\centering
\vspace{-3mm}
\caption[Identified proteins across the 4 shared tissues for the 3 datasets]%
{\label{fig:barplot3Dvennprot}\textbf{Number of identified
proteins in each of the four common tissues for the three proteomic data.}
Proteins found in more than one dataset are most likely true (in
red, light and darker green or purple --- the most validated ones in red
as they are found in all three datasets).
See \Cref{fig:protbkdownT} for the precise numbers in each set.}
\end{figure}
The tissue with the highest number of identified proteins,
regardless of which dataset,
is the \ovary.\mybr\
The highest number of proteins identified in all three datasets at once (600)
is in the \heart.
The other tissues have
the \emph{Kuster and Pandey} set as
their largest protein group formed by more than one dataset.\mybr\
The largest set of identified proteins in \pancreas\
and the second one in \ovary\ are proteins only found in the \pandey\ Lab dataset
(\emph{Pandey only}).
As shown in \Cref{fig:distribProtUniq3D},
our state-of-the-art pipeline (see \Cref{subsec:msDataProcess}) has identified
the highest number of proteins in the \pandey\ Lab dataset.
Thus, it is coherent that \pandey\ Lab proteins represent a large part of
the identified proteins in each tissue (either as \emph{Pandey only} set or
in agreement with the other datasets:
\emph{All 3 datasets}, \emph{Kuster \& Pandey} or \emph{Cutler \& Pandey}).
More surprising is that the \cutler\ dataset comprises a notable amount of
proteins in \lung\ that are missing in the other two.
A few of these proteins (82) are missing altogether,
but a subset of them (410) is still found
in (at least) another tissue of the other datasets.\mybr\
\begin{figure}[!htpb]
\includegraphics[scale=0.8]{proteomics/ProteinUniqueDistribPerdatasets.pdf}\centering
\caption[Distribution of the proteins per tissue]{\label{fig:distribProtUniq3D}%
\textbf{Distribution of the proteins per tissue across the three datasets.}
\cutler\ Lab dataset has the smallest and \pandey\ Lab the highest number of proteins
per tissue.
Coloured in red are the proteins that are specific to one tissue (or cell type);
these proteins have been identified in one tissue solely within each dataset.
Proteins in turquoise have been identified in several tissues of the dataset.}
\end{figure}
While proteins found in more than one dataset are more likely true positives,
it is impossible to exclude without risks the ones
that are identified in one dataset only.
Whether an identified protein in one dataset is
an artefact (\ie\ false positive) or a miss (false negative) in the other datasets
is a challenging question;
the diversified nature of proteins involves
many sample preparation and simplification methods
(see \Cref{subsec:ProtSampPrep,subsec:simpleProt}).\mybr\
\subsection{Overall about half of the proteins identified
in each study for any given tissue
are validated in a different study.}\label{subsec:halfProtConfirmed}
As shown in \Cref{fig:barplot3Dvennprot,fig:protbkdownT,fig:protbkdownT10more},
besides a few exceptions (\oesophagus, \gall\ and \testis),
more than half of the proteins are identified in the same tissue
in more than one dataset.
Three proteins are found in every tissue of every dataset:
\protein{ALB}, \protein{KRT9}, \protein{KRT10}.
This number rises to forty when only the \pandey\ Lab and the \kuster\ Lab data
are considered (see \Cref{tab:ubiProt2D}).
I have also investigated tissue-specific (\gls{TS}) proteins
(in red in \Cref{fig:distribProtUniq3D})
that are also identified in more than one dataset.
While the three datasets lacked to identify any \gls{TS} protein at once,
\pandey\ Lab and \kuster\ Lab datasets share a few (44 across eight tissues)
--- see \Cref{tab:comTSprot} for the complete list.\mybr\
\begin{comment}
twelve for \kidney, nine for \placenta, seven for \pancreas,
five for \adrenal\ and \liver, four for \testis,
and one for \prostate\ and \rectum.
\end{comment}
\gls{TS} proteins are more difficult to confirm
through different datasets,
but one needs to be careful with the ubiquitous proteins as well.
The latter may be present in the samples due to contamination:
none of the three ubiquitous proteins
(\protein{ALB}, \protein{KRT9} and \protein{KRT10})
is detected in \heart\ by the
\hFo{Human Protein Atlas}{https://www.proteinatlas.org/}~\mycite{Uhlen2015}
while they are found in epithelial cells.
One hypothesis is that contamination occurred when sampling from the donor
\cpv{or during the preparation or \ms\ analysis.
\protein{ALB} found in the tissues is more likely coming from the blood supply
(where \protein{ALB} is abundant);
\protein{KRT9} and \protein{KRT10} are environmental contaminants.}\mybr\
Lists for ubiquitous and \gls{TS} proteins of each dataset separately
and across the three (when consistent) are given as digital supporting data.\mybr\
\vspace{-3mm}
\subsection{Technical variability prevails over biological signal:
intrastudy correlations of different tissues are globally stronger than
same-tissue interstudy correlations.}\label{subsec:protTechVarHigh}
After defining the (1,384) protein set that is consistently detected
in the four common tissues of the three datasets,
I have assessed how consistent is
their expression quantification across tissues and studies.\mybr\
\begin{figure}[!htbp]
\includegraphics[scale=0.9]{proteomics/heatmap3DproteinSpearman.pdf}\centering
\caption[Heatmap of the four common tissues between the three proteome
datasets]{\label{fig:prot3Dheatmap}\textbf{Heatmap of the four common tissues
between the three proteome datasets} based on
the pairwise Spearman correlations clustering of the expression levels
of 1,384 shared proteins.
The samples mostly cluster by laboratory.
Only \heart\ from \cutler\ Lab and \pandey\ Lab have a stronger intratissue correlation
than an intrastudy one.
Both for \pandey\ Lab and \cutler\ Lab,
\pancreas\ and \lung\ are more correlated to each other and their pair to \ovary.
The heatmap (\Cref{fig:prot3DheatmapPears})
based on Pearson correlation instead prompts globally identical observations.
See also \Cref{fig:scat3DheartCutlerPandey} that shows (as a scatterplot)
the relationship for \Heart\ between \pandey\ Lab and \cutler\ Lab, and then,
\Cref{fig:scat2DPlacentaKusterPandey} between
\pandey\ Lab and \kuster\ Lab.
}
\end{figure}
Following a similar approach to \Crefp{fig:noMitoNoRep4T}{},
I cluster the twelve proteomic \treps\footnote{\Glsxtrlong{TREP}.
See~\Vref{def:trep} for more details on \treps{}.}.
I have used Ward's method to link the \treps\ based on their similarity that
I have computed by subtracting from 1 the pairwise Spearman correlation
of the expression levels of the 1,384 common proteins.
As shown in \Cref{fig:prot3Dheatmap},
the technical variability overcomes the biological signal
as the proteomic \treps\ cluster according to their original laboratory/study
rather than their biological source,
except for \cutler\ Lab and \pandey\ Lab \heart.
However,
\cutler\ Lab and \pandey\ Lab share the same organisation of their remaining tissues:
\pancreas\ and \lung\ are the most correlated,
and their pair is in turn most correlated to \ovary.
\kuster\ Lab \treps\ display the greatest amount of study bias
(probably due to stronger batch effects --- see \Cref{sub:BatchEffect}).\mybr\
Removing the proteins translated from the mitochondrial genes slightly
improves the results
but excluding the three ubiquitous (likely contaminants) proteins, presented
in \Cref{subsec:halfProtConfirmed},
is impactless.\mybr\
Neither applying quantile normalisation or widespread scaling methods
on top of this quantification allowed correcting for the technical variability.
Because of the limited number of proteins included in this analysis,
it is unwise to draw definite conclusions
except that there is an extensive need for new quantification normalisation methods
that can help with protein expression meta-analyses
and, as for now, one has to be very cautious when comparing proteome samples.\mybr\
I attempted to expand this analysis by comparing the fourteen common tissues
of \pandey\ Lab and \kuster\ Lab datasets,
but the results are as inconclusive
(see \Cref{fig:prot2DheatmapT14,fig:prot2DheatmapPearson}).\mybr\
\section{New quantification method}\label{sec:NewQuantProt}
\vspace{-3mm}
In this thesis context where I aim to integrate
the \mRNA\ expression levels to the protein ones
(presented in \Cref{ch:Integration}),
I have developed with the help of \james\
a new method to infer and quantify the proteins.\mybr\
Our original processing workflow (detailed in \Cref{subsec:msDataProcess})
intends to provide a reliable, state-of-the-art, protein quantification.
Our method seems more rigorous than \citet{PandeyData} original paper;
\citet{Ezkurdia2014-qx} challenge the correctness of their quantification
by highlighting the disputable presence of \glspl{OR} in many tissues.
On the other hand, our workflow lacks to detect or quantify any possible \gls{OR}
in the \pandey\ Lab data.
The left part of \Cref{fig:newQuantProtMeth} summarises
our first method of quantification.\mybr\
\begin{figure}[!htbp]
\includegraphics[scale=0.7]{proteomics/quantificationQ1Q2.pdf}\centering
\caption[Two quantification methods for Pandey Lab data]{\label{fig:newQuantProtMeth}\textbf{Two
quantification methods applied to Pandey Lab data.}
For both approaches, the search pipeline assigning \psms\ remains identical
(see \Cref{subsec:msDataProcess}).
The first quantification method,
which follows a robust inference method
involving at least three unique peptides,
relies on the intensity of the three most intense unique peptides.
The new quantification method that I have devised
allows more relaxed inference parameters
as it also uses the non-unique peptides for the quantification.
See \Cref{eq:PPKM},
which is similar to \Vref{eq:rpkm-fx}.
After averaging per tissue and removing the clusters,
the number of quantified proteins with the new method
is close to twice the number provided by the first described method.
The new quantification method was designed
for the analyses in \Cref{ch:Integration} in particular;
this is why the filtering is also less strict than for the first quantification method
since my main focus is the integration of the proteomic data
with the \Rnaseq\ data.
Clusters are protein groups that can be mapped to
more than one \gls{Ensembl} gene identifier.
Note that the final proteins numbers include only the fifteen tissues
used for the integration in the following \Cref{ch:Integration}.
}
\end{figure}
While probably more accurate,
our state-of-the-art quantification method is also
more stringent than the original authors' one.
Thus, the total number of quantified proteins is more limited.\mybr\
As presented in \Cref{ch:background},
protein quantification methods rely on
\psms\ identification (see \Cref{subsub:peptideID})
and a chosen approach for protein inference (see \Cref{subsec:proteinInference}).
I realised that our main limiting factor is that
we get quantification only for proteins
that have at least three unique peptides
since this first method is based on the Top3 approach~\mycite{Silva-Top3}
and uses the three \emph{unique}\footnote{I.e.\ exclusive to a single protein}
most expressed peptides of a protein to estimate its overall expression
(see \Cref{eq:Top3} on p.~\pageref{eq:Top3} and \Cref{subsubsec:protQuantLB}).\mybr\
\begin{minipage}{\textwidth}
\begin{equation}
\tag{Top3 \gls{IBAQ}}\label{eq:Top3}
\hat{\mu}_{ij}=\frac{\sum \text{Intensity of Top 3 unique peptides} _{ij}}{\text{Total Intensity in experiment } j}
\raisetag{3\normalbaselineskip}
\end{equation}
where:{\small
\begin{itemize}[topsep=0pt,nosep]
\item $\hat{\mu}_{ij}$ is the normalised expression for the protein $i$ in experiment $j$ (normalised \gls{PSM}),
\item $\sum \text{Intensity of Top 3 unique peptides} _{ij}$ is
the sum of the intensity of the three most intense unique peptides
of the protein $i$,
\item $\text{Total Intensity in experiment } j$ is the total sum of the intensity of all
the peptides identified in the experiment $j$.
\end{itemize}
}
\end{minipage}
For the following chapter analyses,
I requested \james\ to provide me a new quantification for the Pandey Lab data
where both unique and
the \emph{degenerate} (\ie\ non-unique) peptides
are involved in the estimation of the protein expression.\mybr\
The new method I have devised allocates the degenerate peptides
in proportion to the distribution of unique peptides per protein
following a similar approach to \cuffl\ for \Rnaseq\ data (see \Cref{minisec:Cufflinks}).
As three unique peptides are no longer required for the quantification,
it allows relaxing the inference parameters to two unique peptides
(in order to still avoid \emph{one-hit wonders}, see \Cref{subsec:proteinInference})
for the identification of the proteins.\mybr\
\cpv{Once the identification is done,
all the unique and degenerate peptides are mapped to the identified proteins.
For the unique peptides,
their quantification is directly linked to one and only protein.
However, for the degenerate peptides,
it is necessary to gauge the likely amount provided by each of their matching proteins.}\mybr\
\cpv{For this purpose, the distribution coefficient of the degenerate peptide $d$
to the protein $p$, $C_{d,p}$, is defined as below:}
\cbstart{}
\begin{minipage}{\textwidth}
\begin{equation}
\tag{Distrib.\ coeff.\ of the degenerate peptide}\label{eq:distribDegen}
C_{d,p}=\frac{%
\sum\limits_{i=1}^{Nu_p}\frac{\text{Number of \psm}(u_{p,i})}{Nu_p}
}{%
\sum\limits_{q\in P_d}(\frac{\sum\limits_{j=1}^{Nu_q}\text{Number of \psm}(u_{q,j})}{Nu_q})
}
\end{equation}
where:{\small
\begin{itemize}[topsep=0pt,nosep]
\item $P_d$ is the set of identified proteins that include the degenerate peptide $d$
\item $Nu_p$ is the number of unique peptides of the protein $p$, $\forall p \in P_d$
\item $u_{p,i}$ is the $i$th unique peptide of the protein $p$, $\forall p \in P_d$
\item $\text{Number of PSM}(u)$ is the number of \psms\ of the peptide $u$
\end{itemize}
}
\end{minipage}
\begin{minipage}{\textwidth}
Then, the contribution of the degenerate peptide $d$
to the quantification of a protein $p$ is computed as:
\begin{equation}
\tag{Distribution of the degenerate peptide quantification to a protein}
\text{Q}_{d,p}=C_{d,p} \cdot\ \text{Q}_d
\end{equation}
where:{\small
\begin{itemize}[topsep=0pt,nosep]
\item $C_{d,p}$ is the distribution coefficient defined above
\item $\text{Q}_d$ is the total quantification of
the degenerate peptide $d$
\end{itemize}
}
\end{minipage}
\cbend\
The new quantification follows a similar approach to the F/\RPKM\ normalisation ---
see \Cref{eq:rpkm-fx}.
Protein expression levels are expressed in \glspl{PPKM},
which stands for \glspl{PSM} Per Kilobase of gene per Million.
As shown in \Cref{eq:PPKM},
this method counts the number of \psms\
that can be mapped to the corresponding \gls{Ensembl} gene identifier of a protein.
Then, this raw count is normalised by dividing it
by the product of the longest transcript length and the total number of \psms\
assigned in that experiment;
the result is finally multiplied by a factor ($10^6$) to facilitate reading.
Using the longest \emph{transcript} length
instead of the longest protein isomer allows avoiding issues due to
annotation differences between gene and protein levels
in the analyses of the following \Cref{ch:Integration}.\mybr\
\begin{minipage}{\textwidth}
\begin{equation}\label{eq:PPKM}
\tag{PPKM definition}
\hat{\mu}_{ij} = \frac{\text{Number of \psms\ matching } G_i \cdot10^{-6}}{\ell_{G_i}\cdot \text{Number of \psm}_j}
\end{equation}
where:{\small
\begin{itemize}[topsep=0pt,nosep]
\item $\hat{\mu}_{ij}$ is the normalised expression of the protein $i$
in the experiment $j$,
\item $G_i$ is the gene that corresponds to the protein $i$,
\item $\text{Number of \psms\ matching } G_i$ is the number of \psms\ mapped
to $G_i$,
\item $\ell_{G_i}$ is the length of the longest transcript of $G_i$,
%\text{Longest transcript length for } G_i$
\item $\text{Number of \psm}_j$ is the total number of \psms\ identified in
the experiment $j$.
\end{itemize}
}
\end{minipage}
\james\ has implemented this new method and provided me
\PPKM\ quantifications for the \pandey\ Lab and \kuster\ Lab data.\mybr\
Although smaller,
another limiting factor is the filtering threshold of the selected \psms\
to be inferred in proteins.
We have chosen a conservative (and state-of-the-art) threshold prior to
the first quantification filtering
and a less strict one for the new method.
\cpv{As my aim is to compare and integrate proteomic and transcriptomic data together,
the primary purpose of the new quantification is to provide a number of proteins
that is roughly similar to the \mRNAs{}' one.
Thus, we also have to relax parameters and
allow an increased number of false positives among the identified proteins
(see \Cref{subsub:peptideID}).
I have included proteins quantified with both methods
in the analyses of the next chapter (\Cref{ch:Integration})}.\mybr\
\begin{figure}[!htbp]
\includegraphics[scale=0.7]{proteomics/distribPandeyQ1Q2.pdf}\centering
\vspace{-2mm}
\caption[Pandey Lab data protein expression distribution
with two quantification methods]{\label{fig:pandeyDistribQ1Q2}\textbf{Distribution
of the protein expression levels with two different methods.}
On the left, the protein expression levels distribution for the adult
tissues of the \pandey\ Lab data with our first quantification method
(described in \Cref{subsec:msDataProcess}).
This figure is similar to \Cref{fig:distribProt}(c) but without the fetal tissues.
On the right are the protein expression levels of the same tissues that
have been computed with the new quantification method.
The overall shape of the density plots is very similar between the two approaches.
}
\end{figure}
Before moving on to the integration of the proteomic and transcriptomic data
in the next chapter,
I have carried out a few comparisons
between the first and the new quantification methods.
As shown in \Cref{fig:pandeyDistribQ1Q2},
the densities of distribution of protein expression levels per tissues
have similar shapes between the two methods.\mybr\
\Cref{fig:pandeyQ1Q2comp} presents the number of quantified proteins per tissue.
The new method allows quantifying for some tissues
up to more than twice the number quantified by the first method.
This proportion is consistent with the total number of proteins identified
across the fifteen adult tissues
by the first method (6,436)
and the new one (12,290).\mybr\
\begin{figure}[!ht]
\includegraphics[scale=0.82]{proteomics/compDistribQ1Q2pandey.pdf}\centering
\vspace{-3.5mm}
\caption[Comparison of the impact of the quantification method on the
protein distribution per tissue for Pandey Lab data]{\label{fig:pandeyQ1Q2comp}\textbf{Comparison of
the impact of the quantification method on the protein distribution per tissue
for the Pandey Lab data.}
This figure partly reproduces the \emph{Pandey Lab} part
of \Cref{fig:distribProtUniq3D}.
Indeed, the turquoise/red $Q1$ bars are the adult tissues of the \pandey\ Lab data
quantified with the first quantification method
(described in \Cref{fig:distribProtUniq3D}).
The purple/green $Q2$ bars are their equivalent
to the new quantification method.
The new method allows a considerable increase of
the identified and quantified proteins number,
sometimes more than twice as many as the first quantification.
On the other hand,
the rank orders of the total and tissue-specific protein numbers per tissue
are quite similar between the two methods.
}
\end{figure}
I have also checked the presence of \gls{OR} in the \pandey\ Lab data
with the new quantification method;
only two \glspl{OR} are present:
\protein{OR1M1} in \Kidney\ and \protein{OR13C4} in \Liver.
Their presence may be artefactual, or it may be an issue with the annotation.
Indeed, while these two proteins are missing from all tissues
--- either at \RNA\ or protein levels
--- in \WebFoCi{The Human Protein Atlas}{https://www.proteinatlas.org/}{Uhlen2015},
their \mRNAs\ are present
in the \emph{Baseline expression} of \WebFoCi{\egxa}{https://www.ebi.ac.uk/gxa/}{EBIgxa}
in the human \textit{Chloroid plexus} at 10 post-conception weeks
(HDBR developing brain --- \ArrayExpress{E-MTAB-4840})
and in one sheep \testis\ sample (\ArrayExpress{E-MTAB-3838}).
In addition, \protein{OR1M1} seems to be expressed in the \textit{Blood} of
the green monkey (\species{Chlorocebus sabaeus} --- \ArrayExpress{E-MTAB-4404}).
Both proteins are also found to be up or down regulated at transcript level
in different tumoral samples (\emph{Differential expression} tab of \egxa).
See also the digital supporting data.\mybr\
I have also assessed the consistency of expression measurements
across \pandey\ Lab and \kuster\ Lab data and their common tissues
as I have done for the three datasets with the first quantification
(presented in \Cref{subsec:protTechVarHigh}).
The new quantification gives similar results
as shown in \Cref{fig:heatmap3DProtT14PPKM}.
Besides, as shown in \Cref{fig:barplot2DvennprotPPKM},
more than half of the proteins quantified in each dataset
within a given tissue is also quantified in the other dataset.\mybr\
\begin{comment}
\citet{Nesvizhskii2003-ls} propose a method that may seem alike,
but their method apportions the degenerate peptides
among all corresponding proteins to estimate their presence likelihood in an experiment.
They focus on the identification while the quantification remains overlooked.\mybr\
\end{comment}
\section{Ubiquitous and TS proteins}
Previously, \citet{Liu2014-xr} had compiled a list of 627 \gls{TS} and
1,093 housekeeping proteins
from the expression data released originally by \pandey\ Lab~\mycite{PandeyData}.
The data processing has a significant impact on protein identification;
in our first version of \pandey\ Lab data,
I have found 534 housekeeping (ubiquitous) proteins and 1,491 \gls{TS} proteins,
and for the \PPKM\ quantification: 2,057 ubiquitous and 2,640 \gls{TS} proteins.
I provide as digital supporting data the list of the \gls{TS} and
ubiquitous proteins.\mybr\
%\FloatBarrier
\section{Discussion and Conclusion}
In this chapter, I have reviewed human proteome data
from three projects presented in \Cref{ch:datasets},
that have been reprocessed by \james\ with two pipelines
for the two largest studies \pandey\ Lab data~\mycite{PandeyData}
and \kuster\ Lab data~\mycite{KusterData}.
Currently, state-of-art bottom-up label-free \ms\ proteomics captures
human tissues expression as a fragmented and disparate universe.
Our first processing pipeline,
based on the Top3 quantification method presented in \Cref{subsec:msDataProcess},
appears more reliable than some of the original authors'.
The original \pandey\ Lab data was disputably quantifying \glspl{OR} in many tissues
\mycite{Ezkurdia2014-qx}.
No trace of \glspl{OR} was found in any of our reprocessed data.
I have also described our new quantification method (\gls{PPKM}),
which allows us to estimate the expression of nearly twice as many proteins
than the first one we used.\mybr\
For both quantification methods,
the technical variability is generally stronger
than the biological interstudy signal even for similar tissues.
Besides, across the different tissues,
about half of the proteins are consistently observed
in the same tissues at least in two datasets.\mybr\
Even when limited to the protein identification only,
the general lack of repeatability and reproducibility
has been well reported and described
for technical and biological replicates in \ms\ proteomics
(\eg\ \citet{Tu2014-yj,Tabb2010-ro}).
\citet{Canterbury2014-oi} report that
the intra-assay variation between two technical replicates for complex mixtures
can be at least 50\%;
different runs of the same sample or experiment are often produced to raise
the interstudy results repeatability and confidence.\mybr\
Thus, beyond the quantification method
I have developed with the help of \james\ by drawing on \Rnaseq\ ones,
there is a definite need for new \ms\ protocols and quantification methods
for baseline expression\footnote{Normalisation methods for differential expression
analysis are usually unsuited for baseline expression studies.
See \citet{Valikangas2018-yj} for possible differential expression quantification
methods.} to correct the extreme variability
and ease the integration of proteomics data across studies.