\chapter{Integration of Transcriptomic with Proteomic data}\label{ch:Integration}
\setlength{\epigraphwidth}{0.8\textwidth}
\setlength{\epigraphrule}{0pt}
\epigraphhead[5]{%
\epigraph{\emph{Scientists like ripping problems apart, collecting as much data as possible\\
and then assembling the parts back together to make a decision.}}{Shirley M. Tilghman}
}
\vspace{2cm}
After assessing the similarity of the human gene expression profiles
across various tissues
at the transcriptomic level (with \Rnaseq\ studies in \Cref{ch:Transcriptomics})
and the proteomic level (with \emph{bottom-up} \ms\ studies in \Cref{ch:proteomics}),
my next step is to examine how these gene expression profiles
compare between the two biological layers.\mybr\
One major aim of this study is to assess
how the correlations between the transcriptome and proteome
described in the literature, mostly measured in cells,
hold at the tissue level.
Moreover, good correlations may lead to
the development of new strategies
that use the expression levels of \mRNA\ as proxies
to estimate protein expression,
which is generally difficult to measure directly (see \Cref{sec:exploreProtMS}).\mybr\
I have performed the integration and all the analyses presented in this chapter
under the supervision of \alvis\ and \jyoti{}.\mybr\
A few closely related studies~\mycite{SciRep2016,Franks2017-bp,Wang2019-ut}
were published while I was working on
the integration of the non-diseased human transcriptome and proteome.
As their analyses rely on the same datasets (\ie\ \uhlen, \gtex, \pandey\ Lab data)
that I include in my work,
I describe and discuss my results together with theirs
whenever relevant.\mybr\
\clearpage
\derivativeWork{}
\begin{itemize}[topsep=0pt,nosep]
\item (paper) \cpv{\href{https://dx.doi.org/10.1002/pmic.202000009}{%
Mitra P. Barzine, K\={a}rlis Freivalds, James Wright et al. (2020). %
\enquote{Using Deep Learning to Extrapolate Protein Expression Measurements}. %
\textit{Proteomics} 20 (21--22), e2000009}}
\item (submitted paper) Andrew F. Jarnuczak; Hanna Najgebauer; Mitra Barzine;
Deepti J. Kundu; Fatemeh Ghavidel; Yasset Perez-Riverol; Irene Papatheodorou; Alvis Brazma;
Juan Antonio Vizcaíno. \enquote{An integrated landscape of protein expression in human cancer}
\item (poster) CSHL Biology of Genomes 2015 --- A feasibility study:
Integration of independent human \Rnaseq\ and proteomic datasets
\item (talk) \gtex\ meeting 2017 --- A. Brazma. Correlating transcriptome
and proteome in human tissues
\item (poster) HUPO 2018 --- Jarnuczak \etal\ An integrated atlas of
protein expression in human cancer derived from publicly available
\item (poster) ECCB 2018 --- Viksna \etal\ An integrated approach
to missing data imputation in quantitative proteomics experiments
\item (poster) RECOMB 2018 --- Viksna \etal\ Deep learning
for protein abundance prediction using Gene Ontology and RNA abundance information
\end{itemize}
\clearpage\
%\vspace{-1cm}
An ongoing debate in the literature is
whether good correlations of expression levels prevail
between \mRNAs\ and proteins~\mycite{Uhlen:2016}.
The implicit assumption of a proportional relationship persists
because the many remaining technological limitations prevent
rigorous testing~\mycite{Vogel2012-sq}.
To date, the existence or concentration of a given \mRNA\ transcript
is usually insufficient to ensure detection of the protein in a sample.\mybr\
On the one hand,
\citet{Ramakrishnan2009-lv} report that
\mRNA\ abundances are roughly sufficient to predict
the presence or absence of a protein in a sample, and
\citet{Vogel2010-ux} report that
\mRNA\ level estimates and sequence features are enough to predict
two-thirds of the variation in human protein abundance.\mybr\
On the other hand,
the literature fails to report any high correlation
between the transcriptome and the proteome for any organism.
Previous investigations found low or no correlation
between the measured expression profiles of the \mRNAs\ and
proteins in human~\mycite{Anderson1997-le,Chen2002-ob,Tian2004-hh,Pascal2008-gh,%
Gry2009-zv,Lundberg2010-gk},
other mammals~\mycite{Ghazalpour2011-nb},
and across many other species~\mycite{Gygi1999-fl,Maier2009-pb,Maier2011-tz,%
Yeung2011-sl,Palmblad2013-ji,Freiberg2016-fu}.\mybr\
In their encompassing reference experiment,
Schwanhäusser \etal{}~\mycite{schwanhausserglobal:2011,Schwanhausser2013-et}
present rather moderate correlations ($r^2\leq0.41$, \ie\ $r\leq0.64$)
and highlight that \mRNA\ levels explain only about 40\% of the protein variation
they observed.\mybr\
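The two figures express the same result:
for a simple linear fit,
the squared Pearson coefficient is the fraction of the variance explained, so
\[
  r^2 \leq 0.41 \;\Longrightarrow\; |r| \leq \sqrt{0.41} \approx 0.64,
\]
\ie\ at most about $41\%$ of the variance in protein levels
is accounted for by the \mRNA\ levels.\mybr\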
Other studies have explored the relationship between \mRNAs\ and proteins in response
to stimuli~\mycite{Marguerat2012-sn}
or with an increased focus on post-transcriptional regulation
(including degradation rates)~\mycite{Jovanovic2015-wv}.
While many other regulatory processes may occur
(\eg\ translation rates),
post-transcriptional modifications and technical noise
are (still) perceived as the probable primary sources
of \mRNA/protein concentration discrepancies~\mycite{Vogel2012-sq,Plotkin2010-ug}.\mybr\
Joint studies of transcriptome and proteome have already helped to highlight
links between genotype and phenotype~\mycite{Vogel2012-sq}.
However, the mixed results reported above may explain
the shift in focus of many subsequent studies.
While previous efforts were about linking the actual expression levels,
more recent studies have mostly compared qualitative attributes
of given proteins and related \mRNAs{}.
Examples include the comparison of
the presence or absence of \mRNAs\ and their proteins
in specific conditions or tissues~\mycite{Santos2015-rj,Freiberg2016-fu,Uhlen2015}
or the comparison of their differential expression profiles
across identical sets of conditions~\mycite{Varemo2015-uk}.\mybr\
All (or almost all) aforementioned studies have turned to cells
for their joint analyses of transcriptome and proteome.
In contrast,
the analyses and integration I present in this chapter are
based on tissue studies.\mybr\
%\vspace{-2mm}
\section{Data~and~principal~analytical~approaches}\label{sec:IntegrationData}
%\vspace{-4mm}
Since the publication of the human proteome drafts~\mycite{PandeyData,KusterData} in 2014,
an unparalleled number of large-scale tissue studies is available
at both the transcriptomic and proteomic layers to explore and integrate
(see \Cref{ch:datasets}).
While these data are independent
(collected from various individuals, and prepared
and characterised by different laboratories),
their combined study may help
to shed light on the relationship
between the transcriptome and proteome at the tissue level.
Using different sources for the transcriptome and proteome
increases the overall technical noise,
but it may also help to highlight relevant biological signals (as
they need to be stronger than the noise and batch effects to be captured).\mybr\
In \Cref{ch:Transcriptomics}, I show that
the transcriptome \Rnaseq\ datasets present high interstudy tissue correlations
(median value for Pearson: $r_{\setOneMath}=0.75$; $r_{\setTwoMath}=0.85$ ---
Spearman: $\rho_{\setOneMath}=0.88$; $\rho_{\setTwoMath}=0.93$).
For the analyses in this chapter,
I only consider the datasets with the highest similarity
(highest correlations),
which incidentally comprise the greatest number of tissues
and are the two most recent studies,
\ie\ \dataset{Uhlén \etal}~\mycite{Uhlen2015}
and \gtex{}~\mycite{GTExTranscript} data.\mybr\
To compensate for the shortfalls in the study design implied
by the reuse of published data\footnote{%
Independent data also means
different collection and sample-processing methods and
a lack of information on the population background of the samples.},
I use both \uhlen\ \etal\ and \gtex\ data
to filter out \mRNAs\ with high interstudy variability for identical tissues.
Whether this variability is technical or biological is irrelevant;
in both cases,
the relationship
between a highly variable \mRNA\ and its protein from another dataset
remains hard to interpret.
For these \mRNAs,
it is impossible to explain the observed variability
between the two transcriptomic datasets.
Indeed, any result depends on the transcriptomic dataset chosen
for the comparison with the proteomic one.
Furthermore, the comparison of the two transcriptomic datasets may give a reference,
\ie\ a best-case scenario, for the proteomic/transcriptomic one.\mybr\
On the other hand,
as shown in \Cref{subsec:protTechVarHigh},
the technical variability prevails over
the biological signal of same-tissue samples
for the available high-throughput proteomics.
With the current state of the technology,
different tissues from the same proteomic study are more likely
to present a higher correlation
than the same tissues from two different studies.\mybr\
To avoid an overly restricted protein set for the following analyses,
I only include one proteomic study: \pandey\ Lab~\mycite{PandeyData}.
All its samples have been run through the same \ms\ platform and
with the same protocol.
Moreover, it presents more homogeneous protein distributions
(see \Cref{fig:distribProt} and \Cref{fig:pandeyDistribQ1Q2}) and
quantifies more proteins per tissue (\Cref{fig:distribProtUniq3D})
than the two other datasets.
Since a current major limitation of bottom-up \ms\ proteomic studies
is the possible lack of detection of proteins for various reasons
(see \Cref{subsec:simpleProt}),
the higher number of detected proteins in \pandey\ Lab data suggests that
this dataset has a higher quality than the two others.\mybr\
%Many strategies are recommended to increase the depth of the coverage
%(\eg\ \mycite{Zhang2014,Eriksson2007-si,Koziol2013-si}).
%Put together, these facts suggest that
%the \pandey\ Lab data has a higher quality than the two other datasets.
%\vspace{-0.5mm}
Although I include only one proteomic dataset,
the literature reports that
the proteome is more conserved than the transcriptome
(across individuals and species)~\mycite{Laurent2010-rg,Liu2016-re}.
This data collection should therefore provide
a crude estimate of the extent to which observations
made in cells hold at the tissue level.\mybr\
This chapter integrates and analyses the matching pairs of \mRNA/proteins
of the common set of tissues between \pandey\ Lab
and the two transcriptomic datasets.\mybr\
\subsection{Overlapping set of tissues for the three datasets}
\begin{figure}[!htbp]
\includegraphics[scale=0.69]{integration/PandeyGtexUhlen_tissuesVennm.pdf}
\centering
\caption[Number of shared and unique tissues between the proteomic
dataset from Pandey Lab and the transcriptomic datasets (Uhlén \etal\ and
GTEx)]{\label{fig:VennTissuePandeyGtexUhlen}\textbf{Number of shared and unique
tissues between the proteomic (Pandey Lab) and the
transcriptomic (Uhlén \etal\ and GTEx) data.} %The twelve common tissues of
%the three datasets are
%\tissue{Adrenal gland}, \tissue{Bladder}, \tissue{Colon}, \tissue{Oesophagus},
%\tissue{Heart}, \tissue{Kidney}, \tissue{Liver}, \tissue{Lung}, \tissue{Ovary},
%\tissue{Pancreas}, \tissue{Prostate} and \tissue{Testis}. The three added
%tissue between \dataset{Uhlén \etal} and \dataset{Pandey Lab} are
%\tissue{Gall bladder}, \tissue{Placenta} and \tissue{Rectum}. The added tissue
%between \dataset{GTEx} and \dataset{Pandey Lab} is the \tissue{Frontal
%cortex}.
}
\end{figure}
All analyses include the twelve tissues shared between the three
datasets (\adrenal, \Bladder{}\footnote{May also
be referred to as \tissue{Urinary Bladder}},
\hColon, \Oesophagus, \Heart,
\Kidney, \Liver, \Lung, \Ovary, \Pancreas,
\Prostate\ and \Testis).\mybr\
In a few cases, I have also extended the analyses
to three additional tissues (\ie\ \Gall, \Placenta\ and \Rectum)
by including the \uhlen\ \etal\ data on the transcriptomic side only.\mybr\
\subsection{Matching pairs of mRNAs and proteins}
To avoid unnecessary biases (described in \Cref{sec:bias_sources}),
I only consider the \mRNAs\
(\ie\ \glspl{RNA} with a \emph{protein-coding} biotype --- \ens{76})
for the following analyses.
Moreover, since missing data is common for proteomics~\mycite{Lazar2016-oe},
only proteins that are detected in each dataset
in at least one of the included tissues
are considered for further analyses.\mybr\
Besides,
while in the transcriptomics studies
biological replicates of each tissue have been processed
as individual \Rnaseq\ libraries,
in the proteomic one,
the biological replicates have been pooled per tissue before any \ms\ profiling.
Thus, to prevent an unbalanced number of samples biasing
the integration analyses (see \Cref{ch:expression}),
I use \enquote{virtual references},
\ie\ \treps\footnote{\trep{}: \glsdesc{TREP}}
that I computed for each tissue
by taking the median value of each gene
across the biological replicates
(see \Cref{subsec:averagedTissue}).\mybr\
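As a minimal sketch of this summary
(the notation is illustrative only and is not taken from \Cref{subsec:averagedTissue}),
let $x_{g,i}$ denote the expression of gene $g$ in replicate $i$
and $R_t$ the set of biological replicates of tissue $t$; then
\[
  \mathrm{TREP}_{g,t} = \mathrm{median}\bigl\{\, x_{g,i} : i \in R_t \,\bigr\}.
\]
For instance, three hypothetical replicates measured at $2$, $5$ and $40$
would give a value of $5$,
so a single outlying replicate has little influence on the virtual reference.\mybr\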
As explained in \Cref{ch:datasets,ch:proteomics},
all the proteomic quantifications have been provided by \james.\mybr\
The first quantification follows state-of-the-art practices
with stringent parameters (described in \Cref{subsec:msDataProcess})
since accurate protein identification is paramount
for reliable proteome exploration.
The protein levels are
the intensity of their top three unique peptides normalised within-sample.
\Cref{fig:PGU_vennQ1} presents
the gene overlap across the twelve shared tissues
between the \pandey\ Lab proteins quantified through this first method
and \uhlen\ \etal{}'s and \gtex{}'s \mRNAs\ quantified
with \htseq\ (see \Cref{subsubsec:RnaseqDataProc}).
\Cref{fig:PU_vennQ1} presents the same analysis across the fifteen shared tissues
between \pandey\ Lab and \uhlen\ \etal\ data.\mybr\
\vspace{5mm}
\begin{figure}[!htb]
\includegraphics[scale=0.65]{integration/PandeyGtexUhlen_mRNAprotQ1Vennm.pdf}\centering
\caption[Distribution of the unique and shared proteins/mRNAs for the three datasets
across twelve tissues]{%
\label{fig:PGU_vennQ1}\textbf{Distribution of the unique and shared proteins
of the Pandey Lab data and mRNAs of the Uhlén \textit{et al.} and GTEx data across
their twelve shared tissues.}
There are 6,357 matching gene products between the three datasets.
Only 5 proteins apparently have no matching partner
in the \uhlen\ \etal\ or \gtex\ data.}
\end{figure}
\begin{figure}[!htb]
\includegraphics[scale=0.65]{integration/PandeyUhlen_mRNAprotQ1Vennm.pdf}\centering
\caption[Distribution of the unique and shared proteins/mRNAs for Pandey Lab
and Uhlén \textit{et al.} across fifteen tissues.]{%
\label{fig:PU_vennQ1}\textbf{Distribution of the unique and shared proteins/mRNAs
for Pandey Lab and Uhlén \textit{et al.} across their fifteen shared tissues.}
The number of matching pairs (6,428) and proteins that lack a counterpart in
the transcriptomic data (8) are similar regardless of how many
transcriptomic datasets are included (see \Cref{fig:PGU_vennQ1}).}
\end{figure}
This first proteomic quantification follows robust guidelines,
and both figures show that
almost all the genes with an observed protein
also have an observed \mRNA{}.
However, only about 32\% of the quantified \mRNAs\
in the \uhlen\ \etal\ and \gtex\ data
have a corresponding protein detected in the \pandey\ Lab data.\mybr\
Once I learned more about the bioinformatic challenges of bottom-up proteomics
(described in \Cref{sec:bioinfProt}),
I chose to be more flexible with the identification and quantification methods
to increase the number of proteins included in my analyses.
As I aim to integrate independent proteomics with transcriptomics,
I mostly focus on robust expression between the two biological layers
since discrepancies in this study context are hard to interpret.
While artefacts may persist,
further analyses with targeted proteomics (see \Cref{sec:exploreProtMS})
can help prune or validate the results.\mybr\
I have drawn on \Rnaseq\ transcriptomic approaches to devise
a new quantification method, which is described in \Cref{sec:NewQuantProt}
and implemented by \james.
The method takes advantage of the \emph{degenerate} peptides\footnote{%
See \Cref{subsec:proteinInference}.}
that are distributed across possible protein parents
in proportion to their \emph{unique} peptides.
The method produces normalised values of the protein expression levels
(whose unit is the \gls{PPKM}, \ie\ \glspl{PSM} Per Kilobase of gene per Million).\mybr\
As shown in \Cref{fig:PGU_venQ3,fig:PU_vennQ3},
the proteins quantified
with our new method
cover about 62\% of \uhlen\ \etal{}'s and \gtex{}'s quantified \mRNAs,
while the number of proteins for which no \mRNA\ was detected
in the transcriptomic data remains marginal.\mybr\
\begin{figure}[!htpb]
\includegraphics[scale=0.65]{integration/PandeyGtexUhlen_mRNAprotQ3Vennm.pdf}\centering
\caption[Distribution of the unique and shared proteins/mRNAs
across the three datasets and twelve tissues
(new protein quantification method)]{\label{fig:PGU_venQ3}%
\textbf{Distribution of the unique and shared proteins/mRNAs
across twelve shared tissues} between Pandey Lab
(\textbf{new quantification method}),
Uhlén \etal\ and GTEx data.}
\end{figure}
\begin{figure}[!htpb]
\includegraphics[scale=0.65]{integration/PandeyUhlen_mRNAprotQ3Vennm.pdf}\centering
\caption[Distribution of the unique and shared proteins/mRNAs
across fifteen tissues between Pandey Lab (new quantification method)
and Uhlén \textit{et al.} data]{\label{fig:PU_vennQ3}\textbf{Distribution of the
unique and shared proteins/mRNAs across fifteen tissues between
the Pandey Lab (new quantification method)
and Uhlén \etal\ data.}}
\end{figure}
Whether it reflects the biological reality or
is solely due to \Rnaseq\ technology being more sensitive than
bottom-up \ms\ alone,
current techniques detect more individual \mRNAs\ than proteins
as confirmed by \Cref{fig:UniqExprPC1,fig:distribProtUniq3D}.
Thus, it may be surprising that
a few proteins lack a match in the transcriptome data.
Several possible explanations exist.\mybr\
Artefacts or technical issues are the most likely.
For example, the annotation might lack
the matching \glspl{RNA} definitions
or define them with a biotype other than \emph{protein-coding}\footnote{%
E.g.\ \gene{XXyac-YRM2039.2} annotated as \textit{unprocessed pseudogene}
and now known as \gene{WASH1} since \ens{77}~(October 2014) or
\gene{TRAJ61} which is annotated as \gene{TR J} \textit{gene}.%
}.
Peptides and \mRNA\ reads may also be assigned to different gene IDs.
Alternatively, the \mRNAs\ may be present in the sample
but missed during library preparation
(see \Cref{subsec:libPrep}).
Finally, the detection of a protein in the sample may be a false positive
or the result of contamination.\mybr\
However, biological processes might also explain the mismatches.
One example is the case of \mRNAs\ with short half-lives
while their proteins are very stable.
Another possible explanation is that
the original location of the proteins is different
from the tissue in which they were detected
(like hormones or cytokines).\mybr\
Lastly, as the transcriptomic and proteomic samples are independently sourced,
a protein may be specific to an individual or a population.
This last hypothesis is the least likely
as there are several biological replicates on the transcriptomic side.
A mixture of the previous causes is also plausible.\mybr\
\begin{figure}[!htbp]
\includegraphics[scale=0.85]{integration/overviewDatasets1.pdf}\centering
\vspace{-3mm}
\caption[Overview of the different dataset combinations]{%
\label{fig:setsOverview}\textbf{Overview of the different dataset
combinations studied.}}
\end{figure}
I exclude the unmatched proteins and \mRNAs\ from further analyses.
\Cref{tab:protNoTrans} provides the unmatched protein lists
for the \ens{76} annotation.\mybr\
Unless otherwise stated, to avoid the issues described in \Cref{subsec:mito},
I also remove all the proteins and \mRNAs\ of the mitochondrial genome
from the subsequent analyses.\mybr\
Note that \Cref{fig:setsOverview} gives
an overview of the various dataset combinations presented
in \Cref{fig:PGU_vennQ1,fig:PU_vennQ1,%fig:VennTissuePandeyGtexUhlen,
fig:PGU_venQ3,fig:PU_vennQ3}.\mybr\
\subsection{Tissue-centric and gene-centric approaches}
\begin{figure}[!htpb]
\includegraphics[scale=0.85]{integration/VisualExplaination-Lin.pdf}\centering
%\vspace{-3mm}
\caption[Summary of the expression comparison approaches between
the transcriptome and proteome]{\label{fig:visualexp}\textbf{Summary of the approaches
for comparing expression between the transcriptome and proteome.}
\emph{Tissue-centric} analyses focus on
how the transcriptome and proteome relate to each other within the same tissue.
\emph{Gene-centric} analyses study for each gene how its \mRNA\ expression
levels across all (or a subset of) the tissues may relate to
the quantified expression levels of its corresponding protein.
}
\end{figure}
\Cref{fig:visualexp} summarises the two analytical approaches I use
to compare transcriptomic and proteomic data.
The \emph{tissue-centric} approach compares for each tissue
the global expression of its transcriptomic landscape to its proteomic one.
In contrast,
the \emph{gene-centric} approach compares for each gene
its expression levels in \mRNA\ and protein across all the tissues.
\afterpage{\clearpage\sectionmark{Fair correlations between independent proteomics and transcriptomics}}
Confusion can arise
when integrating proteomics and transcriptomics.
Hence, it is essential
to state clearly which approach is taken~\mycite{Liu2016-re}.\mybr\
\section{Fair correlations between independently sourced proteomics~%
and~transcriptomics~of~human~tissues~}\label{subsec:IntegrationGoodCorrProtTrans}
\sectionmark{Fair correlations between independent proteomics and transcriptomics}
\vspace{-2mm}
For the first tissue-centric analysis,
I assess for each tissue the relationship between
the expression of its proteome and transcriptome
through the correlation of the protein expression values
with their corresponding \mRNA\ ones.\mybr\
After scaling with $\log_2(x+1)$,
the proteomic and transcriptomic \treps\ have similar distributions
that are roughly Gaussian
(as illustrated by \Cref{fig:distribTrans,fig:pandeyDistribQ1Q2});
I compare them for identical and random tissue pairs.\mybr\
\Cref{fig:TestSig} presents the correlation distribution range
of transcriptomic and proteomic \treps\ from identical and random pairs of tissues,
both with Spearman and Pearson correlation methods
(see \Cref{sec:CorrMore}).\mybr\
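As a reminder, and with notation that is only illustrative
(\Cref{sec:CorrMore} remains the reference),
let $P_g$ and $M_g$ be the protein and \mRNA\ values of gene $g$
for a given pair of tissues,
and $p_g=\log_2(P_g+1)$ and $m_g=\log_2(M_g+1)$ their scaled counterparts
over $n$ matched genes.
The Pearson coefficient is
\[
  r = \frac{\sum_{g=1}^{n}(m_g-\bar{m})(p_g-\bar{p})}
           {\sqrt{\sum_{g=1}^{n}(m_g-\bar{m})^2}\,\sqrt{\sum_{g=1}^{n}(p_g-\bar{p})^2}},
\]
and the Spearman coefficient $\rho$ is the same quantity computed on
the ranks of $m_g$ and $p_g$ instead of their values.
Because $\log_2(x+1)$ is strictly increasing, it preserves the ranks:
$\rho$ is unaffected by the scaling, whereas $r$ generally changes.\mybr\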
Although the transcriptomic and proteomic data come from independent sources,
the Spearman correlations of the same-tissue \treps\ are equivalent to
the correlations in cell studies~\mycite{Lundberg2010-gk,schwanhausserglobal:2011}
where the same sample provides \mRNAs\ and proteins.
Regardless of the protein quantification method
(Top3~\mycite{Silva-Top3} or \PPKM{} ---~\vref{eq:PPKM}),
the median Spearman correlation coefficients are above $0.5$
for matched proteomic and transcriptomic \treps\
(also referred to as \emph{same-tissue pairs}).
The unscaled data give identical outcomes
(see \Cref{tab:pvalueCorrSP} and \Cref{fig:TestSigUnlog}).\mybr\
The Pearson correlation is closer to the literature
for our new \PPKM\ quantification
than for the Top3 quantification.
The \PPKM\ Pearson correlation averages
above $0.5$ $[$min:~$0.38$~(\Oesophagus)\;; max:~$0.61$~(\Liver)$]$
(and is within $[$min:~$0.45$~(\Oesophagus)\;; max:~$0.67$~(\Liver)$]$
for the untransformed data).\mybr\
\begin{figure}[!htbp]
\includegraphics[scale=0.8]{integration/DFtestlog2.png}\centering
\vspace{-4mm}
\caption[Distribution of Pearson and Spearman correlation coefficients
for same-tissue proteomic and transcriptomic pairs
versus random tissue pairs]{\label{fig:TestSig}\textbf{Distribution of
Pearson and Spearman correlation coefficients
for same-tissue proteomic and transcriptomic pairs versus random tissue
pairs ($\log_2$-scaled data).} Depending on the protein quantification method,
there are two types of distribution ranges for the Pearson correlations.
The Top3 quantification method provides lower correlations ($\text{mean} \approx 0.11$).
The \PPKM\ method (\Cref{sec:NewQuantProt}) produces higher correlations
($\text{mean} \approx 0.5$).
All the Spearman correlation ranges between same-tissue proteomic and
transcriptomic \treps\ are quite similar,
regardless of the method quantifying the proteins.
The median of Spearman correlation is $0.52$.
With the Top3 quantification (\ie\ pink-outlined boxes --- Top3 x HTSeq),
two outliers are noticeable, and they are common to the three comparisons,
Pandey x Uhlén (12 tissues and 15 tissues) and Pandey x GTEx (12 tissues):
the lowest Spearman correlation is \Oesophagus\ ($\rho=0.39$)
and the highest \liver\ ($\rho=0.62$).
Both for the Pearson and Spearman correlations,
even when the correlations are very low,
same-tissue pairs always have higher correlations than
different (random) tissue pairs
(all p-values computed with the Welch t-test are $<0.05$; see \Cref{tab:pvalueCorrSP}).
Thus, even the lowest same-tissue correlations are significant.
The green boxplots, comparing the two transcriptomic datasets,
are only represented for reference purposes.}
\end{figure}
\begin{figure}[!htbp]
\includegraphics[scale=0.65]{integration/Kidney_scattplot_Q3_T15pv.png}\centering
\caption[Scatterplot of protein (Pandey Lab data --- PPKM quantification)
and mRNA (Uhlén \etal) expression for Kidney]
{\label{fig:ScatKid}\textbf{Scatterplot of
protein (Pandey Lab --- PPKM quantification) and mRNAs (Uhlén \textit{et al.})
expression for Kidney.}
Each point of this scatterplot represents a gene;
it has the $\log_2$-transformed expression value
of the corresponding \uhlen\ \etal\ \mRNA\ (\FPKM) on the x-axis and
the $\log_2$-transformed expression value of
the \pandey\ Lab protein (\PPKM) on the y-axis.
Most of the \mRNA/protein pairs are distributed in an area
that can be fitted by a linear function with a positive slope,
which indicates a high correlation between \mRNA\ and protein expression
levels.
However, genes with lowly expressed \mRNAs\ show
a weaker association between the expression of their protein and \mRNA,
in particular, \mRNAs\ that are expressed
below $1$ \FPKM\ (\ie\ below $0$ on the x-axis).
Conversely, genes with the most highly expressed \mRNAs\ may present
a saturation effect (\Cref{subsec:simpleProt})
in the quantification of the protein expression.
The most highly expressed protein is \protein{\gls{HBB}}
(\ie\ Hemoglobin Subunit Beta), which is also among
the five most highly expressed proteins in all the other tissues.
Its presence is possibly due to erythrocytes remaining in the samples.
On the outer parts of the scatterplot,
there are the respective distribution densities of the proteins and the \mRNAs.
Whilst the correlation calculation includes every pair of \mRNA\ and protein,
the plot excludes any pair with an unexpressed \mRNA\ or protein to optimise the visualisation.
\Cref{fig:scatplotAll} presents an overview of the other tissue scatterplots.
}
\end{figure}
As tissue proteomic samples can present high correlations
without being related in any manner
(see \Cref{ch:proteomics,fig:scat2DAdrenalPancreasKuster}),
a Welch t-test~\mycite{Welch1951-sj} is used to
assess the significance of the correlation for the same-tissue pairs
by comparison with random tissue pairs.
The one-sided \Welchttest\footnote{See \Cref{mini:ttest}}
allows rejecting the null hypothesis $H_0$
(the mean of the correlation coefficients for same-tissue pairs
is equal to or lower than that for random tissue pairs).
Irrespective of the protein quantification or computational methods,
all the same-tissue pair correlations are significant
(p-value $<5\times10^{-5}$, except for the Pearson correlation with Top3 quantification
where p-value $<0.05$; see \Cref{tab:pvalueCorrSP}).\mybr\
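For reference, the statistic can be sketched in generic notation
(the symbols are illustrative; \Cref{mini:ttest} gives the description used in this thesis).
With $\bar{c}_s$, $s_s^2$ and $n_s$ the mean, sample variance and number of
same-tissue correlation coefficients,
and $\bar{c}_r$, $s_r^2$ and $n_r$ their counterparts for random tissue pairs,
\[
  t = \frac{\bar{c}_s-\bar{c}_r}{\sqrt{\frac{s_s^2}{n_s}+\frac{s_r^2}{n_r}}},
  \qquad
  \nu \approx \frac{\left(\frac{s_s^2}{n_s}+\frac{s_r^2}{n_r}\right)^{2}}
                   {\frac{(s_s^2/n_s)^2}{n_s-1}+\frac{(s_r^2/n_r)^2}{n_r-1}},
\]
where $\nu$ is the Welch--Satterthwaite approximation of the degrees of freedom.
The one-sided test rejects $H_0\colon \mu_s\leq\mu_r$ in favour of
$H_1\colon \mu_s>\mu_r$ when $t$ is large enough;
unlike Student's t-test, it does not assume equal variances in the two groups.\mybr\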
The previous correlation distribution
may imply a modest relationship between
these independent proteomics and transcriptomics,
but the same-tissue scatterplots (\eg\ \Cref{fig:ScatKid})
show tighter links than first suggested.
Besides, these scatterplots share a similar coarse profile
despite the wide correlation ranges.\mybr\
\Cref{fig:ScatKid} illustrates the comparison of expression for \kidney\
between transcriptomics (\uhlen\ \etal) on the x-axis
and proteomics (\pandey\ Lab --- \PPKM) on the y-axis.
\Kidney's correlation coefficients stand in the middle of the range
regardless of the considered studies,
protein quantification or correlation methods involved in the comparison.\mybr\
%To optimise the visualisation,
%I removed the pairs with a null member
%(either for the \mRNA\ or protein)
%while I keep them for the correlation calculation.
A linear function with a positive slope (not drawn) can fit the bulk of the points.
Indeed, the expression levels of most \mRNAs\ and proteins in a tissue
are highly associated,
with the exception of the lowest ($<1$ \FPKM) and
a number of the highest measured \mRNAs{}.\mybr\
Besides the mismatching sampling sources,
other possible explanations for the observed divergences are
technical limitations (such as the protein saturation effect, see \Cref{subsec:simpleProt}),
translational noise (see \Cref{subsubsec:exprTrans})
or a substantial half-life difference between the \mRNA\ and its protein.\mybr\
Although the number of genes
presenting weakly associated levels of \mRNA/protein expression is rather limited,
it is enough to impair the Pearson and Spearman correlation coefficients.\mybr\
Systematic exclusion of weakly associated pairs of \mRNAs\ and proteins
is impractical and debatable
as they are inconsistent from one tissue to another.\label{memo:dispersedGenes}
Case-by-case treatment will be necessary.\mybr\
Removing the lowly expressed \mRNAs\ ($<1$ \FPKM) only marginally changes
the correlation coefficients.
For instance, for \kidney\ with
the \PPKM\ quantification of the proteins,
the Pearson correlation
increases from $0.56$ to $0.58$,
while the Spearman correlation is relatively unchanged
($0.51$ instead of $0.52$).
Similar changes are observed
with the more conservative Top3 protein quantification:
the Pearson correlation increases from $r=0.18$ to $0.21$,
while the Spearman correlation remains unchanged ($\rho=0.52$).\mybr\
As both transcriptomic studies (\uhlen\ \etal\ and \gtex)
provide similar results,
I describe for most of the following analyses the data combination
that provides the greatest number of tissues and genes to study,
\ie\ the fifteen-tissue set between \uhlen\ \etal\ and \pandey\ Lab data
quantified with the \PPKM\ method.\mybr\
\vspace{-1mm}
The other combinations
(provided in \Cref{ch:SupplIntegration} or electronic format)
may diverge for individual genes,
but the general trends are identical.\mybr\
\vspace{-1mm}
I focus on Pearson correlation over Spearman correlation
in the following parts
since the results for the \PPKM\ quantification are globally similar for both.\mybr\
\subsection{Mixed biological signal between the proteome and transcriptome
across the tissues}
%\vspace{-8mm}
\begin{figure}[!hb]
\includegraphics[scale=0.85]{integration/orderedHeatmapQ3Pearson1.pdf}\centering
\vspace{-2mm}
\caption[Heatmap based on the Pearson correlation between protein and mRNA
expression (alphabetically ordered tissues)]{\label{fig:orderedHeatmapPearson}%
\textbf{Heatmap based on the Pearson correlation between protein and mRNA
expression (alphabetically ordered tissues).}
Correlations for same-tissue pairs (diagonal) are highlighted in
yellow when the highest observed correlations are between the matching proteomics
and transcriptomics pairs, and in dark pink when only the proteomics correlates
best with the matching transcriptomics.
When other, higher correlations are observed for a tissue's proteomics or transcriptomics,
they are given in grey.}
\end{figure}
As shown in \Cref{fig:orderedHeatmapPearson},
for nine tissues (in yellow), transcriptomic and proteomic expressions correlate
best in matching tissues.
For four other tissues (\hColon, \Lung, \Oesophagus\ and \Urinarybladder,
in dark pink),
only the proteomics correlates best with the matching transcriptomics,
while the transcriptomics correlates better with the proteomics of other tissues.
For the remaining two tissues, the proteomics correlates
as much (\Gall) or more (\Rectum) with the transcriptomics
of other tissues.\mybr\
While the different correlation methods lead to similar result trends,
individual differences persist.
In a few cases, \eg\ \heart,
these slight differences may considerably change
the relative correlation ranking order of the \treps{} (see \Cref{fig:compCorJind}).\mybr\
In the following sections,
I explore several avenues to identify possible factors
that influence the association strength
between the proteome and transcriptome.\mybr\
I first study the effect of tissue composition (in proteins and \mRNAs)
on the correlations.
I begin with the assessment of the impact of the proteins and \mRNAs\
that are found in one tissue only,
before looking into the tissue-specific (\gls{TS}) proteins and \mRNAs{}.\mybr\
Then, in a more quantitative approach,
I examine more closely how the \mRNA\ expression profiles relate
to their respective protein ones.\mybr\
\subsection{Influence of the expression breadth on the tissue %
\texorpdfstring{\MakeLowercase{m}RNAs/proteins}{mRNAs/proteins} correlation}
In \Cref{ch:proteomics},
I have shown that
the protein expression of different tissues and of same-tissue pairs
share a similar correlation range
(see \Cref{fig:scat2DAdrenalPandeyPancreasKuster}).
In this context,
genes expressed in a small number of tissues
(both as a protein and \mRNA)
can have a significant impact on the correlation
and may explain the mixed results.\mybr\
The expression breadth of a gene is
the number of tissues and cell lines within which the gene is expressed
\cpv{at a given threshold\footnote{If a gene is expressed
below the considered threshold in all the tissues, its expression breadth is null.}}.
\Cref{fig:expressionBreadth} allows visualising
the distribution of the expression breadth of the \mRNAs\ (\uhlen\ \etal)
and the proteins (\pandey\ Lab data) across their fifteen common tissues.
In the following sections,
I may refer to a (\gls{TS}) gene as a \emph{unique gene}
when it is only expressed in a single tissue.\mybr\
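As a compact restatement of this definition
(the symbols $b_{g}$, $e_{g,t}$, $\tau$ and $T$ are notation
introduced here for illustration only),
the expression breadth of a gene $g$ at a threshold $\tau$ can be written as
\begin{equation*}
b_{g}(\tau) = \left | \left \{ t \in T : e_{g,t} \geq \tau \right \} \right |,
\end{equation*}
where $T$ is the set of considered tissues
and $e_{g,t}$ is the expression of $g$ in tissue $t$.
When only detection matters (expression above zero), the inequality is strict.
With the fifteen common tissues,
a unique gene has $b_{g}(\tau)=1$ and a ubiquitous gene has $b_{g}(\tau)=15$.\mybr\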
\Cref{fig:protBreadth} shows that
the distribution of the protein expression breadth is bimodal.
Either due to technical limitations or biological reasons,
proteins detected in a single tissue form
the most numerous class and represent 20 \% of the overall number.
Proteins expressed in all tissues are the second most numerous class (about 16 \%);
the third largest class (12 \%) comprises the proteins expressed in two tissues.\mybr\
On the other hand,
almost all \mRNAs\ are expressed in every tissue (\Cref{fig:mRNAbreadth0}).
One hypothesis is that \mRNAs\ levels have to exceed a sufficient threshold
for their proteins to be detected.
Thus, I also studied the effect of
two additional minimum expression thresholds for the \mRNAs\
on the expression breadth.\mybr\
The two new expression breadth profiles are more similar
to the proteomic one.
As shown in \Cref{fig:mRNAbreadth1},
the number of transcripts only found in one tissue increases
at the widespread $1$ \FPKM\ threshold,
which roughly equates to one \gls{RNA} in the cell~\mycite{Mortazavi2008,Hebenstreit:2011}.\mybr\
The expression breadth profile of the \mRNAs\ expressed at or above $5$ \FPKM\
presents a bimodal distribution (\Cref{fig:mRNAbreadth5}) similar to the protein one.
While arbitrary, $5$ \FPKM\ is a threshold commonly found
in the literature~\mycite{Uhlen2015,oneDominant,Chen2018-ln}.\mybr\
\begin{figure}[!htb]
\begin{subfigure}[h]{0.53\textwidth}
\captionsetup{margin=0.6cm,justification=centering}
\centering \includegraphics[width=\textwidth]{integration/breadthProtQ3--15.pdf}
\caption{Protein~expression~breadth (\PPKM~quantification)}\label{fig:protBreadth}
\end{subfigure}
\begin{subfigure}[h]{0.53\textwidth}
\captionsetup{margin=0.6cm,justification=centering}
\centering \includegraphics[width=\textwidth]{integration/breadthmRNAQ3--15.pdf}
   \caption{mRNA~expression~breadth\\($>0$ \FPKM)}\label{fig:mRNAbreadth0}
\end{subfigure}
\vspace{2.5mm}
\begin{subfigure}[b]{0.53\textwidth}
\captionsetup{margin=0.6cm,justification=centering}
\centering \includegraphics[width=\textwidth]{integration/breadthmRNAQ3--1501.pdf}
   \caption{mRNA~expression~breadth\\($\geq1$ \FPKM)}\label{fig:mRNAbreadth1}
\end{subfigure}
\begin{subfigure}[b]{0.53\textwidth}
\captionsetup{margin=0.6cm,justification=centering}
\centering \includegraphics[width=\textwidth]{integration/breadthmRNAQ3--1505.pdf}
   \caption{mRNA~expression~breadth\\($\geq5$ \FPKM)}\label{fig:mRNAbreadth5}
\end{subfigure}
\vspace{-6mm}
\caption[Expression breadth of the proteins and mRNAs]{\label{fig:expressionBreadth}%
\textbf{Expression breadth of the proteins and mRNAs.}
The expression breadth of the proteins has a bimodal distribution.
Many proteins are detected either in a single tissue or in all of them.
Almost every \mRNA\ is detected in every tissue.
Their breadth becomes bimodal when their expression threshold
is increased to $5$ \FPKM{}.
To ease the general visualisation,
I have omitted from the plot the \mRNAs\
whose expression remains below the threshold in all tissues
(\ie\ expression breadth of $0$ at the considered threshold).
}
\end{figure}
\Cref{fig:UniqueFeatureQ3T15} displays, for each tissue, the fraction of unique genes
(\ie\ only expressed in a single tissue)
detected as a protein or as an \mRNA\ at a considered threshold.
As the analysis seeks a possible link between
the number of uniquely detected genes
and the correlation strength between the proteomic and transcriptomic \treps{},
these fractions are computed by dividing
the number of unique genes (proteins or \mRNAs) of each tissue
by the total number of uniquely detected genes across all tissues.
The tissues are ordered by increasing fraction.\mybr\
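Written out, and assuming the notation $u_{t}$ for the number of unique genes
(proteins or \mRNAs) of a tissue $t$,
which is introduced here for illustration only,
the plotted fraction is
\begin{equation*}
f_{t} = \frac{u_{t}}{\sum_{s \in T} u_{s}},
\end{equation*}
where $T$ is the set of the fifteen tissues.
By construction, the fractions of a given plot sum to one,
so they describe how the unique genes are distributed between the tissues
rather than their share of each tissue's proteome or transcriptome.\mybr\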
\begin{figure}[!htb]
\includegraphics[scale=0.78]{integration/uniqueFeatureQ3T15a.pdf}\centering
\vspace{-5mm}
\caption[Unique proteins or mRNAs fractions across tissues]{\label{fig:UniqueFeatureQ3T15}
\textbf{Unique proteins or mRNAs fractions across tissues.}
}
\end{figure}
Although their proportion varies from one tissue to another,
each of the fifteen tissues has proteins
that are detected in that tissue only,
as shown in the top plot of \Cref{fig:UniqueFeatureQ3T15}.
In contrast, unique \mRNAs\ are detected in a more limited number of tissues
(see the three bottom plots of \Cref{fig:UniqueFeatureQ3T15}).
Besides, the unique proteins are more evenly distributed
between the fifteen tissues than the unique \mRNAs.\mybr\
Except for \Testis\ and \Liver,
which consistently express the highest numbers of uniquely detected genes,
the other tissues fail to present any similarity
between the available proteomic and transcriptomic data.\mybr\
\Liver\ is the most correlated tissue (\Cref{fig:orderedHeatmapPearson})
and comprises the second-highest number of unique genes.
\Testis\ is the third-best correlated tissue
despite having the highest fractions of unique proteins and \mRNAs\
regardless of any threshold.
It may be tempting to hypothesise that the number of unique genes
relates to the correlation levels,
but the other tissues fail to show any relationship
between the number of unique \mRNAs\ and proteins they express
and the strength of the correlations.\mybr\
Put together, these results suggest that
the number of proteins and \mRNAs{} uniquely expressed in these tissues
plays a minor role at best in the \mRNA/protein correlation computed for each tissue.
The lack of relation between the proteomic and transcriptomic observations
is confirmed by a more refined analysis of the expression breadth.\mybr\
\begin{figure}[!htb]
\includegraphics[scale=0.75]{integration/coloredSharedbreadthProtQ3--15.pdf}\centering
\vspace{-4mm}
\caption[Comparison of protein expression breadth to
corresponding mRNA breadth]{\label{fig:SharedBreadthProtQ3}%
\textbf{Comparison of protein expression breadth to the corresponding mRNA breadth.}
The proteins' expression breadth (\Cref{fig:protBreadth})
is coloured according to
their corresponding \mRNA\ expression breadth at $5$ \FPKM\
(\Cref{fig:mRNAbreadth5}).
About one-fifth of the uniquely detected proteins have
their corresponding \mRNA\ also expressed in a single tissue at or above $5$ \FPKM{}.
The number of proteins classified as \emph{Identical} decreases markedly
through the other breadths until it rises again from thirteen tissues
to reach about one-third of the ubiquitous proteins.
Proteins and \mRNAs\ with mismatching expression breadth are split into
several categories.
Proteins and \mRNAs\ that are both detected within four to twelve tissues
are described as \emph{Mixed}.
If the expression breadths of the remaining pairs are close ($\pm2$),
they are identified as \emph{Similar}, otherwise as \emph{Different}.
Finally, many genes detected at least once as a protein have
an \mRNA\ expression that never reaches $5$ \FPKM{} (\emph{Expression $<5$ \FPKM}).
}
\end{figure}
\Cref{fig:SharedBreadthProtQ3} shows that the expression breadth
of \mRNAs\ (expressed at $\geq5$ \FPKM\ or at an even lower threshold) matches
the breadth of their corresponding protein in very few cases.
Thus, the \mRNA\ expression breadth is insufficient
to predict the expression breadth of the corresponding protein.
Even for the two extreme cases
where the protein is unique to a tissue or ubiquitous (found in all fifteen tissues),
there are differences between the expression breadths of the \mRNA\ and the protein
of the same gene.\mybr\
All the expression breadth analyses of the transcriptome rely on expression levels.
However, \Cref{ch:Transcriptomics} underlines that
high \mRNA\ expression levels are unrelated
to high interstudy correlation of same-tissue pairs
while \gls{TS} \mRNAs\ present a rather strong relation with it.
For this reason, the following analysis examines
the relationship between \gls{TS} \mRNAs\ and \gls{TS} proteins.\mybr\
\subsection{Tissue-specific \texorpdfstring{\MakeLowercase{m}RNAs}{mRNAs} %
have significant overlap with tissue-specific proteins}\label{sec:TSprotMrna}
Unlike \mRNAs,
many proteins are expressed in a single tissue only.
These are the ones I refer to as \gls{TS} proteins in the remainder of this thesis.\mybr\
To enable the comparison of these \gls{TS} proteins with possible transcript partners,
I first need to define a set of \gls{TS} \mRNAs{}.
To find the latter, I choose the $n$ \mRNAs\ most specific to a tissue
based on the \nameref{subsub:TisSpeGeneMethodPerso} (\Cref{subsub:TisSpeGeneMethodPerso})
where $n$ is the number of \gls{TS} proteins of that tissue.
Then, as detailed in \Cref{fig:RankSpe},
I examine for each tissue the overlap between its $n$ \gls{TS} proteins
and its $n$ \mRNAs\ with the highest tissue-specific ranks.
\Cref{fig:ExJacquard} illustrates the \heart\ example.\mybr\
\begin{figure}[!htb]
\includegraphics[scale=0.59]{integration/TissueSpeDeter1b.pdf}\centering
\vspace{-3mm}
\caption[Determination process of the specific mRNAs]{%
\label{fig:RankSpe}\textbf{Overview of the comparison of the TS proteins
and TS mRNAs.}
\gls{TS} proteins are the $n$ proteins only expressed in one tissue.
Once the \mRNAs\ have been sorted
by decreasing order of their relative specificity to a given tissue,
the first $n$ \mRNAs\ identities are compared
to the ones of the $n$ \gls{TS} proteins present in the same tissue.
%Jaccard's similarity coefficients and their significance (p-values)
%are computed to allow
%a global assessment of the proteome and transcriptome relationship
%across all the tissues simultaneously.
}
\end{figure}
\begin{figure}[!htbp]
\includegraphics[scale=0.63]{integration/overlapRatioPUQ15Q3Heart.pdf}\centering
\vspace{-3mm}
\caption[Example of overlap of TS proteins and TS mRNAs for Heart]{%
\label{fig:ExJacquard}\textbf{Example of overlap of \gls{TS} proteins
and \gls{TS} \mRNAs.}}
\end{figure}
Each tissue has a different number of \gls{TS} proteins.
I thus refine this analysis
by computing Jaccard similarity coefficients
(or Jaccard indices)~\mycite{Jaccard1901-ei,Lin2008-fc},
see \Cref{eq:Jaccard}.
The Jaccard indices allow assessing
the relationship between \gls{TS} proteins and \mRNAs\
across all the tissues at the same time
and are easier to interpret than the raw overlap numbers.\mybr\
\begin{minipage}{\textwidth}
The Jaccard index is computed as follows:
\begin{equation}
\tag{Jaccard similarity coefficient}\label{eq:Jaccard}
\begin{split}
J(x_{1},x_{2}) & = \frac{\left | x_{1}\cap x_{2}\right |}%
{\left | x_{1}\cup x_{2}\right |}\\
& = \frac{\left | x_{1}\cap x_{2}\right |}%
{\left | x_{1} \right | + \left | x_{2} \right |%
- \left | x_{1}\cap x_{2}\right |}\\
\end{split}
\raisetag{6cm}
\end{equation}
When applied specifically to \Cref{fig:RankSpe}, we get:
$J(\tikzcircle[Tan,fill={rgb,255:red,195; green,160;blue,153}]{4pt},%
\tikzcircle[DarkOrchid,fill={rgb,255:red,215; green,145; blue,254}]{4pt}) = \frac{k}{2n-k}$,\\
with $n$ the number of proteins (\tikzcircle[Tan,fill={rgb,255:red,195; green,160;blue,153}]{4pt})
that are only found in a given tissue and
$k$ the number of common genes between
these $n$ unique proteins
%(\tikzcircle[Tan,fill={rgb,255:red,195; green,160;blue,153}]{4pt})
and the $n$ most specific \mRNAs\ of the tissue
(\tikzcircle[DarkOrchid,fill={rgb,255:red,215; green,145; blue,254}]{4pt}).
\end{minipage}
\vspace{3mm}
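The second line of \Cref{eq:Jaccard} explains the denominator $2n-k$:
both sets contain $n$ genes, so
$\left | x_{1} \right | + \left | x_{2} \right | = 2n$.
As a purely illustrative example (hypothetical numbers, not taken from the data),
a tissue with $n=100$ unique proteins of which $k=25$ are also among
its $100$ most specific \mRNAs\ gives
\begin{equation*}
J = \frac{k}{2n-k} = \frac{25}{200-25} = \frac{1}{7} \approx 0.14 .
\end{equation*}
The index ranges from $0$ (no shared gene, $k=0$)
to $1$ (identical sets, $k=n$).\mybr\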
To assess the significance of the Jaccard indices,
I use the hypergeometric test~\mycite{Field2012-au}
(see \Cref{sec:hypergeometricTest}).
In the current analysis,
I count as a \enquote{success} each \gls{TS} \mRNA\ whose gene is among the $n$ \gls{TS} proteins
and test whether the number of observed successes is greater than
the number expected under random sampling.\mybr\
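As a sketch of the corresponding computation,
assume that $N$ denotes the number of genes considered for the tissue
($N$ and $X$ are notation introduced here for illustration;
\Cref{sec:hypergeometricTest} gives the definition actually used).
If the $n$ most specific \mRNAs\ were drawn at random among these $N$ genes,
the number $X$ of successes would follow a hypergeometric distribution,
and the one-sided p-value for $k$ observed successes would be
\begin{equation*}
P(X \geq k) = \sum_{i=k}^{n}
\frac{\binom{n}{i}\binom{N-n}{n-i}}{\binom{N}{n}},
\end{equation*}
with an expected number of successes of $n^{2}/N$ under random sampling.\mybr\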
\vspace{3mm}
The Jaccard indices for all pairs of the fifteen shared tissues
between the \pandey\ Lab (\PPKM\ quantification) and \uhlen\ \etal\
are summarised in \Cref{fig:JaccardIndexes},
while~\Cref{fig:JaccardPvalues} displays
their respective p-values (hypergeometric test).\mybr\
\begin{figure}[!ht]
\includegraphics[scale=1]{integration/overlapRatioPUQ15Q3.pdf}\centering
%\vspace{-3mm}
\caption[Heatmap of Jaccard indices across 15 tissues]{%
\label{fig:JaccardIndexes}\label{fig:RatioJac}\textbf{Heatmap of Jaccard indices
across the common fifteen tissues between Uhlén \textit{et al.} and Pandey Lab data.}
For each tissue, the \gls{TS} proteins are the proteins
(quantified with \PPKM\ method) that are expressed only in that tissue.
The \gls{TS} \mRNAs\ are the \mRNAs\ with the highest specific coefficients