\chapter{Available high-throughput human datasets}\label{ch:datasets}
\setlength{\epigraphwidth}{0.57\textwidth}%0.57
\setlength{\epigraphrule}{0pt}%0.1
\epigraphhead[5]{%
\epigraph{\emph{Data! Data! Data! I can’t make bricks without clay!}}{Sherlock Holmes~\mycite{Doyle1892}}
}
In the past few years, many laboratories have studied the expression
of human genes at the transcriptome and at
the proteome levels by taking advantage of high-throughput techniques
(\eg\ \citet{Krupp2012,VTpaper,ramskoldan:2009,Uhlen2014,%
Uhlen2015,UhlenGastro,GTExTranscript,PeptideAtlas,PandeyData,KusterData}).
In this chapter, I review the openly available (for research) data
I use within my thesis to explore gene expression in (undiseased) tissues,
and I explain how these data have been reprocessed.
Besides the results I report in the subsequent chapters,
the present chapter also provides the basis for work
I have co-authored\footfullcite{Wright-2016}.\mybr\
Unless otherwise stated, all the computational processing of the \Rnaseq\ part
described here has been performed by myself under the supervision of
\alvis. I also received general feedback from \mar,
\johan\ and \nuno. The proteome data has been processed by \james.\mybr\
\section{Introduction}
All the datasets were selected to fit three main criteria.
Firstly, they comprise normal (\ie\ reported as disease-free) human samples
from at least three different tissue types.
Secondly, gene expression
quantifications are based on \Rnaseq\ for the transcriptome and on label-free \ms\
for the proteome\footnote{These
technologies are non-targeted and high-throughput;
they allow one, in theory, to study the whole
repertoire of \glspl{RNA} or proteins in a sample.}.
Finally, the \emph{raw} data is openly available and reusable\footnote{%
In this context, \emph{reusable} means that the data can be processed
as accurately by third-party researchers as by the original authors,
without the need to access additional information
that has not been openly released.}.\mybr\
In the next section, I first describe the \Rnaseq\ and the \ms\ data I use
in my thesis; then I detail how these data have been processed to be
employed in the next chapters for various analyses that explore the transcriptome,
the proteome and finally, the comparison and integration of these two
biological layers.\mybr\
\section{Transcriptome RNA-Seq studies}\label{sec:rnaseq-data}
I describe hereinafter the five transcriptomic datasets I used
in the chronological order of their first public release.
\Cref{tab:Trans5DF} summarises the main characteristics of these datasets.\mybr\
\begin{sidewaystable}
\centering
\caption[General description of the 5 transcriptomic datasets
(\Rnaseq) used for this study]{\label{tab:Trans5DF}\textbf{General
description of the five transcriptomic datasets (\Rnaseq) used for this
study}\\\footnotesize{Illumina Body Map (IBM) has no \enquote{regular}
technical replicates as the \enquote{replicates} are the product of
different protocols, thus are unfit to estimate the specific noise of
either protocol (single-end or paired-end).
\NB\ The \enquote{total RNA} protocol used for the Castle dataset is not
the most common ribodepletion protocol: it is based on a
targeted amplification protocol.\\
}}
\begin{tabular}{@{}cccccccccc@{}}
\toprule
\multicolumn{1}{c|}
{\multirow{2}{*}{ArrayExpress ID}} &
\multicolumn{1}{c|}{\multirow{2}{*}{Data ID}} &
\multicolumn{2}{c|}{\begin{tabular}[c]{@{}c@{}}Library\\Preparation\end{tabular}} &
\multicolumn{2}{c|}{Sequencing} &
\multicolumn{2}{c|}{Replicates} &
\multicolumn{1}{c|}{\multirow{2}{*}{\begin{tabular}[c]{@{}c@{}}Number of\\Tissue\\
Types\end{tabular}}} &
\multirow{2}{*}{\begin{tabular}[c]{@{}c@{}}Multi-sampling\\ from the \\ same
individual\end{tabular}} \\
\cmidrule(lr){3-8}
\multicolumn{1}{c|}{} & \multicolumn{1}{c|}{} &
\multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}Total\\ RNA\end{tabular}} &
\multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}PolyA\\ selected\end{tabular}} &
\multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}Single\\ end\end{tabular}} &
\multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}Paired\\ end\end{tabular}} &
\multicolumn{1}{c|}{Biological} & \multicolumn{1}{c|}{Technical} &
\multicolumn{1}{c|}{} & \\
\midrule
E-MTAB-305 & Castle & \mycheckmark\ & & \mycheckmark\ & & & & 11 & \\
E-GEOD-30352 & Brawand & & \mycheckmark\ & \mycheckmark\ & &
\mycheckmark\ & & 8 & \\
E-MTAB-513 & IBM & & \mycheckmark\ & \mycheckmark\ & \mycheckmark\ & &
(\mycheckmark) & 16 & \\
E-MTAB-2836 \footnotesize{(and E-MTAB-1733)}& Uhlén & & \mycheckmark\ &
& \mycheckmark\ & \mycheckmark\ & \mycheckmark\ & 32 & \\
E-MTAB-2919 & \Gtex\ (v4) & & \mycheckmark\ & & \mycheckmark\ &
\mycheckmark\ & & 54 & \mycheckmark\\
\bottomrule
\multicolumn{10}{p{.8\textwidth}}{\raggedright\
\mynormalcheckmark\ indicates that the dataset presents the
characteristic, and \\(\mynormalcheckmark) that one (or more) of the
required criteria of the characteristic is lacking.}
\end{tabular}
\end{sidewaystable}
\subsection{Castle et al.\ dataset}\label{subsec:castlepresentation}
\vspace*{-0.2in}
\citet{castleData} released this dataset along with their study:
\paper{\citetitle{castleData}}.
The authors were interested in exploring the whole RNA repertoire
with sequencing-based technology and they primarily focused their study
on the non-coding part.\mybr\
Purchased \gls{RNA} extracts were used to create multiple-donor pooled
samples for 11 tissues, from which total \gls{RNA} libraries were prepared
following a total transcriptomic protocol
where nonribosomal \gls{RNA} transcripts are
amplified specifically by \gls{PCR}~\mycite{Armour:2009}.\mybr\
For each library (tissue), an average of 50 million single-end reads were
generated with an Illumina Genome Analyser-II sequencer.
The original reads were trimmed to 28 \gls{nt}
before being released through EMBL archives (\ENA{ERP000257}
and \ArrayExpress{E-MTAB-305}).\mybr\
Despite several limitations, such as the lack of replicates, the old technology and
the short reads,
I have included this dataset for two main reasons.
Firstly, it is the oldest available
\Rnaseq\ data I found that was performed on normal human tissues.
Thus, the congruence of results for this dataset with the following ones
gives a rough idea of the extent to which \Rnaseq\ datasets
may be integrated together.
Secondly, as \Rnaseq\ studies are prepared mainly with polyA-selected protocols
today, I was interested in gauging how the library preparation
protocols --- and the presence of \glspl{ncRNA} --- can affect the
quantifications and thus any final observation.\mybr\
\subsection{Brawand et al.\ dataset}\label{subsec:brawandpresentation}
\vspace*{-0.2in}
In the article entitled~\paper{\citetitle{VTpaper}},
\citet{VTpaper} focused their interest on the
evolution of the mammalian transcriptomes\footnote{While there were existing
studies on the matter, the sequencing approach was then creating new perspectives.}.\mybr\
They collected 6 organs from 10 different vertebrate species:
9 mammals (including human) and a bird. There are no technical replicates,
but two biological replicates per tissue:
one male and one female for every tissue except the testis (two males).
The 131 libraries (including 23 for \species{Homo sapiens}) were prepared with
a polyA-selected protocol.
Hence, they are largely enriched in \pcgs.\mybr\
In total, about 3.2 billion 76 \gls{bp}-long single-end reads were generated
using an Illumina Genome Analyser IIx, and they were released
through \gls{GEO} (accession number: GSE30352).
I personally retrieved the human data from
\ArrayExpress{E-GEOD-30352}\footnote{ArrayExpress was routinely importing
datasets from \gls{GEO} on a weekly basis until very recently.
While no longer automatic, \gls{GEO} data are still included in \egxa.}.\mybr\
\subsection{Illumina Body Map 2.0 (IBM)}\label{subsec:ibmpresentation}
\vspace*{-0.2in}
This dataset, created in 2010, was released in
2011\footnote{See~\citetitle{ibmEnsembl} -~\cite{ibmEnsembl}} by Illumina
and used Illumina's most recent technology at the time:
paired-end sequencing. Until then, all sequencing was done
from only one end of the \gls{DNA} or \gls{cDNA} fragments. Since then,
most \Rnaseq-based transcriptome studies have used
paired-end sequencing.\mybr\
The dataset covers 16 tissues (one donor per tissue) and
the libraries were prepared following a polyA-selected protocol;
they are thus enriched in protein-coding genes.\mybr\
Although each sample has been sequenced twice, thus yielding in principle
\emph{technical} replicates, these are \enquote{non-regular} technical replicates.
\emph{Technical} replicates, by contrast to \emph{biological} replicates,
usually imply that their processing uses the same sample source and protocols,
so that the error and noise due to a specific technique can be estimated.
Here, however, each tissue has been sequenced once with a single-end protocol
and once with a paired-end one to compare their ability to discriminate between
\gls{mRNA} isoforms.
Indeed, Illumina's main incentive to develop its paired-end
technology was to improve the accurate identification of spliced \glspl{mRNA}.\mybr\
The sequencing was performed with an Illumina HiSeq 2000, and the reads were
released through \ArrayExpress{E-MTAB-513} (\ENA{ERP000546}),
from where I have retrieved both the single-end and paired-end mono-tissue samples
(the original dataset includes raw data files for mixtures
that have been created with the tissue samples).\mybr\
Despite the lack of biological replicates,
it was for a long time
the most extensive freely available \Rnaseq\ dataset of human tissues.
Hence, it has been referenced many times
(\eg\ \citet{Asmann2012-in,ibmrelatedpaper,Smith2012-jw,Derrien2012-ej,%
Florea2013-in,tophat2,Kechavarzi2014-oi,Zhao2014-xj,Pasquali2014-zj,Corpas2014-ej,%
Petryszak2014-kj,Brown2015-co,Janes2015-wn,%
De_Simone2016-fb,Kern2016-dv,Iwakiri2016-sr,Yao2017-gj,Akers2018-cf})
in the literature since its release.
In fact, this dataset is the most viewed one in \gls{ArrayExpress}
(with 68,020 views on 31 May 2018 ---
the second most viewed dataset (46,247 views)
being \ArrayExpress{E-MTAB-62}\mycite{Lukk2010-op}).\mybr\
\subsection{Uhlén et al.\ dataset}\label{subsec:uhlenPresentation}
\vspace*{-0.1in}
Uhlén \etal\ have created the
\hFo{Human Protein Atlas}{https://www.proteinatlas.org/}
(often referred to as \glsxtrshort{HPA} in the literature).
This atlas revolves mostly around the spatial
distribution of proteins throughout the human body.
Using diverse approaches and techniques, including \Rnaseq,
they first released \Rnaseq\ data for 27 normal tissues
as part of their study:~\paper{\citetitle{Uhlen2014}}%
~\mycite{Uhlen2014}.
Later, they extended the dataset with new samples and 5 new
tissues. The latest version was published within~\paper{\citetitle{Uhlen2015}}%
~\mycite{Uhlen2015} in \textit{Science}.\mybr\
For each of the 32 tissues, there are at least two biological replicates.
With a few exceptions, the tissues have both male and female donors.
Many of the tissues also have technical replicates.
The total set comprises 200 samples,
which have been picked by pathologists
based on the screening of frozen biopsy samples.\mybr\
The polyA-selected libraries were paired-end sequenced
with an Illumina HiSeq 2000 or 2500.
I first worked with the early version of this dataset
(\ArrayExpress{E-MTAB-1733} --- 171 samples for 27 tissues),
and later upgraded to the extended, more recent version
(\ArrayExpress{E-MTAB-2836}).\mybr\
At the time of writing, this normal human dataset is the
most comprehensive freely and publicly available one,
both regarding the number of tissues (see \Cref{tab:Trans5DF})
and the number of samples (see \Cref{tab:Lib5DF}).
Therefore, its growing number of references is unsurprising.\mybr\
\subsection{GTEx dataset}\label{subsec:gtexPresentation}
\vspace*{-0.1in}
The Genotype-Tissue Expression (\gls{GTEx}) project is funded by the \gls{NIH}
Common Fund and aims to establish, in its authors' own words,
\textquote{a resource database and associated tissue bank for the study of the
relationship between genetic variation and gene expression
and other molecular phenotypes in multiple reference tissues}.
The project was first introduced in~\citet{GTEx2013}. It aims to quickly collect
various tissues from postmortem donors for genotype-tissue expression analyses
(notably \gls{eQTL} studies, which investigate the role of \glspl{SNP}
in the modulation of \gls{RNA} expression).
The results of the
analyses are released through the \hFo{\Gtex\ portal}{https://gtexportal.org}.\mybr\
As the project is quite ambitious and the collection
and sequencing of the samples spreads over a long period of time,
several intermediate data \enquote{\emph{freezes}} have been
released\footnote{Many groups are involved in collecting, producing or processing
the data. To ease communication and work coordination,
several time points are used, each referencing a specific state (version)
of the data. Each version of the data is called a \emph{freeze}.}.
My analyses include samples up to the fourth release of the pilot phase (v4).
This release covers 54 tissues/cell types (53 normal and 1 tumoral)
collected from 551 individuals for a total of 3,276 samples.\mybr\
The \Rnaseq\ libraries were prepared following a polyA-selected protocol
and have been paired-end sequenced on an Illumina HiSeq 2000/2500.
There is an average of 80 million reads per sample.\mybr\
For privacy reasons, the raw data is available only through controlled access via
\dbGaP{phs000424.v4.p1} (access number specific to the version of the data I used
in my study).
Unfortunately, this translates to a slow access time to the raw data.\mybr\
\vspace{5mm}
During the data selection process,
I had to disregard a few studies whose raw data did not fit
the reusability criterion~\mycite{Burge,Pan2008-di}.
Several times I came across studies with an ambiguous encoding
format for the raw data;
\citet{Burge} (\ArrayExpress{E-GEOD-41637}) is one example of a study
that I unfortunately had to dismiss,
as I was unable to resolve this issue
despite my best efforts in contacting the respective authors.\mybr\
\section[Proteome Mass Spectrometry bottom-up studies]%
{Proteome Mass Spectrometry bottom-up studies \quad}\label{sec:ProteoData}
As mentioned earlier,
the proteomic data have been selected and handled by \jyoti\ and \james.\mybr\
Until recently, compared to the transcriptome, the proteome field
was lacking expression quantification experiments
in normal human tissues. In fact, while there were human protein maps available
(\eg\ the \hFo{Human Protein Atlas}{https://www.proteinatlas.org}), these
mostly report the spatial expression of proteins (as they are based
on immunohistochemistry or other means of identification) rather than quantifying
their (non-targeted) abundance in each tissue.\mybr\
In 2014, two independent groups of authors~\mycite{PandeyData,KusterData}
published (in \textit{Nature},
issue 7502) their own \textquote{\emph{draft of the human proteome}}
based on the study of tissues with \ms. These two datasets complement a previous
smaller one that was publicly released but was never the object of a publication.\mybr\
Hereinafter, I present these three datasets that I use in my thesis.
See \Crefp{fig:pipelineProt}{A} for a short summary.\mybr\
\subsection{Pandey Lab dataset}
The Pandey Lab~\mycite{PandeyData} created the
\hFo{Human Proteome Map}{http://www.humanproteomemap.org} which
they released alongside~\paper{\citetitle{PandeyData}} in \emph{Nature}.\mybr\
For their study, they processed 30 kinds of histologically normal human
tissue and cell-line samples (17 adult tissues, 7 foetal tissues and 6
haematopoietic cell types).
Each sample was created by pooling samples from three
individuals (generally two males and one female).\mybr\
Their proteomic libraries were prepared with a label-free method to quantify
as many proteins as possible.
The samples were fractionated at the protein level through \gls{SDS-PAGE},
and then at the peptide level, after trypsin digestion, by \gls{RPLC}
to create 85 experimental samples.
Finally, state-of-the-art \gls{MS/MS} protocols
(with high-resolution and high-accuracy
\glspl{FTMS} Thermo Scientific \orbi\ instruments)
were used to generate about 25 million high-resolution (\gls{HCD})
mass spectra, which account for 2,212 \gls{LC-MS/MS} profiles.
The raw spectra were retrieved from ProteomeXchange via the repository
\Pride{PXD000561}.\mybr\
While the authors' effort to generate technically high-quality raw data was highly
praised by the scientific community, their processing
(identification and quantification) methods were criticised
(see~\citet{Ezkurdia2014-qx,Deutsch2015}).
Thus, for this thesis I have relied only on
quantifications provided by \james\ who reprocessed the raw spectra.\mybr\
\subsection{Kuster Lab dataset}
In their approach to the human proteome map, the Kuster Lab~\mycite{KusterData}
combined newly generated \gls{LC-MS/MS} spectrum data (about 40\% of their
complete working set) with already publicly available data
(either from their colleagues or accessible through repositories
--- for the remaining 60\%).
They reprocessed
the whole collection of spectra to maximise proteome coverage
and make it available through their own repository:
\hFo{ProteomicsDB}{https://www.proteomicsdb.org/}.\mybr\
The subset of data considered in my thesis is also
known as the [protein] Human BodyMap, which is the part that the Kuster Lab
primarily generated for their own study. They collected 48 experiments covering 36
tissues (adult and foetal) and cell lines. After \gls{LDS-PAGE} fractionation and
digestion into peptides with trypsin, they processed the samples with
\gls{LC-MS/MS} to create 1,087 profiles. Overall, that represents about
14 million \gls{HCD}/\gls{CID} spectra from Thermo Scientific instruments
(including an \orbi).
This specific raw data subpart was downloaded from \Proteomicsdb{PRDB000042}.\mybr\
\subsection{Cutler Lab dataset}\label{subsec:cutler}
This dataset was generated prior to the \pandey\ Lab and the \kuster\ Lab
data as it was released in 2011 through
\hFo{PeptideAtlas}{http://www.peptideatlas.org/}~\mycite{PeptideAtlas},
IDs: [PAe001768 --- PAe001778].\mybr\
It was created by Paul Cutler at Roche Pharmaceuticals.
It comprises 10 different tissues (one sample per tissue) that, after trypsin
digestion, were analysed through a Thermo Scientific \gls{LTQ}-\orbi\ instrument.
In total, there are 1,618 \gls{CID} profiles, which account for 13 million raw
\gls{CID} spectra.
While this dataset was never published on its own, it has been used in different
studies (\eg~\cite{KusterData}).
The raw files were accessed and downloaded from \Proteomicsdb{PRDB000012}.\mybr\
\section{Consistent processing pipelines}
The authors of these five transcriptomic and three proteomic studies have,
in most cases,
released the quantification of the expression values either directly
(\eg~\cite{Krupp2012}) or upon request (\eg~\cite{PandeyData}).
Third parties also distribute
quantifications for these studies, either retrieved from the original studies,
such as \hFoCi{BioGPS}{http://biogps.org/}{BioGPS1} or
\hFoCi{Harmonizome}{https://amp.pharm.mssm.edu/Harmonizome}{Harmonizome},
or after reprocessing the raw data as
the \WebFoCi{\egxa}{https://www.ebi.ac.uk/gxa/home}{EBIgxa} does.\mybr\
Primarily to reduce avoidable technical variability,
and despite readily available quantifications for most of the
datasets, I only used data reprocessed from raw files by myself or by \nuno\
(\gtex\ dataset) for the transcriptomic data, and by \james\
for the proteomic data, as already mentioned.\mybr\
In fact, each study was originally processed with a different protocol,
\eg\ \gtex{}~\mycite{GTExTranscript} and \castle{}~\mycite{Krupp2012}.
While the \egxa\ reprocesses raw data through the same
methods and has quantifications for most of the aforementioned
datasets (it still lacks the \castle\ one),
these were still the products of different protocols when I started my work.
Indeed, the datasets were processed with different versions of the
reference (human genome build and annotation) and of the tools.\mybr\
Intuitively, we expect different processing protocols to produce
different results.
As I started to work with \Rnaseq\ data,
I noticed many potential analysis variables that impact,
at various levels, the resulting gene expression values.
Many of these have since been
reported in the literature:
annotation versions~\mycite{annotationDiff},
contamination (from viral or bacterial \gls{DNA})~\mycite{contaminationRNAseq},
quality controls (and subsequent read-filtering choices)~\mycite{qualityRNAseq},
and mapping and quantification pipelines~\mycite{Fonseca2014}
all have considerable effects on the final quantification. Lastly, normalisation
methods also greatly impact the final expression figures~\mycite{Dillies2013,%
normSigCancerHelp}. For all these reasons,
I decided to reprocess all transcriptomic datasets with the same exact protocol
as the first step of my study.
Recently, \citet{Danielsson2015-cn} compared results based on prepublished data with
reprocessed ones and concluded that using a single processing pipeline ensures more consistent results.\mybr\
Likewise, there are various tools and many parameters for each processing step
needed to quantify the proteomic data
that may impact the final expression values~\mycite{Aebersold2011-av}.
For example, different search engines detect
different sets of peptides~\mycite{Griss2016}.
\citet{Mackay2015-zl} reviewed and analysed the impact
of many of these variables specifically for label-free proteomics,
such as the effects of \gls{FDR} thresholds, protein inference tools and normalisation methods.
Therefore, the three datasets were reprocessed uniformly,
from the raw spectra up to the normalisation of the protein expression values.\mybr\
\subsection{RNA-Seq raw data processing}
As presented in \Cref{sub:RNAseqworflow}, there are many steps from the raw data
files to the quantification matrices on which this thesis' analyses are based.
\Cref{fig:pipelineTrans} presents a general overview of the \Rnaseq\ processing
protocol I used.\mybr\
\begin{figure}
\includegraphics[scale=0.75]{datasets/pipelineTrans1.pdf}\centering
\caption[General steps for processing the transcriptomic
data]{\label{fig:pipelineTrans}\textbf{General steps for processing the
transcriptome.} The pipeline \irap\ integrates
all the tools needed for the state-of-the-art
processing of \Rnaseq\ data. The quality of the reads is checked and they
are trimmed if needed. After removal of possible contaminant reads (such as
\species{E. coli}), the reads
are aligned with \toph. The gene expression is then quantified with two
different approaches: based on the aggregation of isoforms for each gene, or
simply based on the number of aligned fragments on the gene locus defined in
the reference. \cuffl\ directly provides \FPKM\ values. \htseq\ provides
raw counts, which were normalised into \FPKM\ by an \irap\ function.}
\end{figure}
I downloaded and entirely processed four of the transcriptomic datasets
myself (\castle, \vt, \ibm\ and \uhlen\ data) and
\nuno\ retrieved and processed the \gtex\ dataset\footnote{As
the \Gtex\ data is involved in many projects within the \EBI\
and due to its huge amount of files (number and size --- see \Cref{tab:Lib5DF}),
it was agreed that this would be processed centrally by one person and then
redistributed to all the other interested parties. \nuno\ had this
tremendous task.}.
In this thesis,
I present results computed on the quantification of these
five datasets, which have all been processed through the same pipeline.\mybr\
\begin{table}
\centering
\caption[Technical description of the 5 transcriptomic datasets]{%
\label{tab:Lib5DF}\textbf{Technical description of the five transcriptomic
datasets}\\\footnotesize{I processed all the datasets except the one in
\textit{\color{darkgray}italic}.\\
For the Brawand dataset, I only included and processed the \species{Homo sapiens} part.}}
\begin{tabular}{@{}cccccc@{}}
\toprule
Dataset & \begin{tabular}[c]{@{}c@{}}Participant\\number\end{tabular} &
\begin{tabular}[c]{@{}c@{}}Library\\number\end{tabular} &
\begin{tabular}[c]{@{}c@{}}File\\number\end{tabular} &
\begin{tabular}[c]{@{}c@{}}Total size of\\the fastq\\raw files (GB) \end{tabular}&
\begin{tabular}[c]{@{}c@{}}Mean number of\\biological samples per\\tissue
[min;max]\end{tabular} \\
\midrule
Castle & 10 & 11 & 11 & 58 & 10 (mixture) \\
Brawand & 18 & 21 & 23 & 111 & 2.8 [2;3] \\
{\small Illumina Body Map} & 16 & 36 & 48 & 1,004 & 1 \\
Uhlén & 122 & 200 & 400 & 1,851 & 3.81 [2;11] \\
\textit{\color{darkgray}\Gtex\ (v4)} &
\textit{\color{darkgray}551} & \textit{\color{darkgray}3,276} &
\textit{\color{darkgray}6,552} &
\textit{\color{darkgray}$\sim$ 50,000} & \textit{\color{darkgray}60.67 [4;214]}\\
\bottomrule
\end{tabular}
\end{table}
\subsubsection{Data retrieval and preparation}
I retrieved the human raw data of each dataset from \gls{ArrayExpress} and
\gls{ENA} through their identifier (see \crefp{sec:rnaseq-data}). After we
received our access approval, \nuno\ retrieved \Gtex\ data from
\gls{dbGaP}.\mybr\
While most of the raw files can be used as they are, an additional step is
needed for the \castle\ files. Indeed, these files use an older
\fastq\ format that is not compliant with the most accurate and recent tools used
for this thesis.
As it is a simple matter of changing the quality score scale
(see \cref{sec:PhredScore}),
I converted these files to \gls{Phred}$+33$ \fastq\ files with a
\emph{\gls{Perl}} script (provided digitally as supplementary data).\mybr\
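The conversion itself amounts to shifting each quality character by the
offset difference between the two scales. As an illustration only, here is
a minimal Python sketch of the idea (the actual conversion used the
aforementioned \gls{Perl} script); it assumes the old files encode qualities
as \gls{Phred}$+64$, and the file names are placeholders:
\begin{verbatim}
# Minimal sketch, assuming Phred+64 input qualities; a
# Solexa-scaled file would need an extra score remapping.
def phred64_to_phred33(qual64):
    # Shift every quality character down by 31 (64 - 33).
    return "".join(chr(ord(c) - 31) for c in qual64)

with open("old.fastq") as fin, open("new.fastq", "w") as fout:
    for i, line in enumerate(fin):
        if i % 4 == 3:  # every 4th line is the quality string
            fout.write(phred64_to_phred33(line.rstrip("\n")) + "\n")
        else:
            fout.write(line)
\end{verbatim}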
\subsubsection{Genome and annotation reference}
I collected and processed the datasets over an extended period of time.
Hence, for a subset of them,
I produced many intermediate sets of results based on the \hg{37.p12}
(and later \hg{37.p13}) human reference genome and the latest available
\gls{Ensembl} gene set annotation (73, 74 or 75) at that time.
In fact, the quality of each new annotation update is
generally greater than that of its predecessor\footnote{Although
it is not unusual to have gene or transcript additions based on new studies
that are then removed (or merged into another) in a later version.
}.\mybr\
As the \gtex\ data was processed with \hg{38.p1} and \ens{76}, that led me
to reprocess the four other \Rnaseq\ datasets for the sake of consistency and
to avoid more biases~\mycite{h38vsh37}. Thus, unless indicated
otherwise, the results presented in the current work are
based on the \hg{38.p1} human genome reference and the \ens{76} gene set annotation.\mybr\
\subsubsection{Data processing}\label{subsubsec:RnaseqDataProc}
In the early stages of my research, I was processing each of the different steps
sequentially and semi-manually with the help of custom-made scripts. While the
\EBI\ \gls{cluster} greatly facilitated the handling of the numerous files,
the task remained quite tedious.
Additionally,
the scripts I wrote would need a fair amount of work to achieve general
reproducibility on other platforms.\mybr\
Fortunately, \nuno\ developed
an \enquote{integrated RNA-seq analysis Pipeline}:
\softFoCi{\irap}{https://nunofonseca.github.io/irap/}{Fonseca2014-si}.
This tool automates a typical
state-of-the-art and optimised workflow for
\Rnaseq\ studies, and it takes full advantage of the capabilities provided by computer clusters.
Thus, I switched from my original set of scripts to \irap\
to improve my workflow without changing any step or parameter.\mybr\
Besides the usual input files (raw \Rnaseq\ files and genome/annotation
references), \irap\ needs a configuration file that precisely describes the
dataset (its design and technical features) and, if needed, specific parameters
to use.
To provide full reproducibility,
each version of \irap\ is shipped with
its own set of third-party version-defined tools and default parameters.
Thus, apart from remarkably speeding up the data processing,
\irap\ also ensures the protocol integrity
across the five transcriptomic datasets I use
in my thesis regardless of who runs the pipeline.\mybr\
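For illustration, a hypothetical minimal configuration sketch is given below.
The key names follow the \irap\ documentation but may differ between
versions, and every name, path and value is a placeholder rather than an
excerpt from my actual configuration files:
\begin{verbatim}
# Hypothetical minimal irap configuration sketch.
name=my_dataset
species=homo_sapiens
reference=Homo_sapiens.GRCh38.dna.fa
gtf_file=Homo_sapiens.GRCh38.76.gtf
data_dir=/path/to/data
# One paired-end library: mate files, read size,
# insert size and its standard deviation.
pe=libA
libA=sampleA_1.fastq.gz sampleA_2.fastq.gz
libA_rs=100
libA_ins=200
libA_sd=50
\end{verbatim}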
Each of the transcriptomic datasets is the product of the same version of
the \irap\ pipeline (development version 0.6.3b) and the same set of parameters. As the
default parameters of \irap\ are tuned for human Illumina paired-end data,
I only had to define a few of them. Hence, the quality and contamination checks,
and the filtering and trimming of the reads, were done following the default options
of \irap.\mybr\
\minisec{Quality assessment, trimming and filtering}
\irap\ internally uses the
\softFo{FastX toolkit}{http://hannonlab.cshl.edu/fastx\_toolkit/} (0.0.13)
to perform the
assessment and the trimming. The usual uninformative and ambiguous reads
(see \Crefu{subsub:trim})
have been discarded, as were any with an overall quality score below
a threshold of 10.\mybr\
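As a simple illustration of this kind of filter (and not of the FastX
implementation itself), a read could be discarded when its mean \gls{Phred}
score falls below the threshold:
\begin{verbatim}
# Minimal illustrative sketch: discard a read whose overall
# (here: mean) Phred quality is below 10. This is not the
# FastX toolkit implementation.
def mean_quality(qual33):
    scores = [ord(c) - 33 for c in qual33]  # Phred+33 decoding
    return sum(scores) / len(scores)

def keep_read(qual33, threshold=10.0):
    return mean_quality(qual33) >= threshold
\end{verbatim}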
The quality of the calls decreases as the base calling progresses ---
see \Crefu{subsub:sequencing}.
On another note, some tools (mappers in particular) need
all the reads to be trimmed to the same length. \irap\ optimises the compromise
between the purity and the length of the reads, to avoid additional errors or
biases due to shorter reads~\mycite{Trimwisely}: it trims at most 15\% of the
original length, discarding more reads if necessary to maximise the length.\mybr\
Reads that could be assigned to a likely contamination source, here
\species{Escherichia coli} (as I work with \species{Homo sapiens}),
are also discarded. A non-splice-aware mapper,
\softFo{Bowtie}{http://bowtie-bio.sourceforge.net/}~(1.1.1)~\mycite{Bowtie},
maps all the reads to the
contaminant genome, and all the reads mapping perfectly and unambiguously are
discarded.\mybr\
\minisec{Mapping}
I mapped the reads to the genome (\hg{38.p1})
and the transcriptome (\ens{76} gene set annotation)
with \irap's (0.6.3b) proposed default splice-aware mapper
\hFo{\toph}{https://ccb.jhu.edu/software/tophat/index.shtml}
(2.0.12)~\mycite{tophat2}
with its set of default predefined arguments.
Indeed, \toph\ can handle reads from many organisms
by fine-tuning the parameters (\eg\ the number of mismatches or indels to tolerate),
but the default parameters are tuned for normal human data.\mybr\
\minisec{Quantification and Normalisation}\label{minisec:quantNorm}
While \Rnaseq\ can be used to identify (and discover) \gls{RNA}
isoforms, I have focused my thesis on gene-level expression. Indeed, current
annotations and knowledge are still lacking regarding the reasons and
external conditions that favour the expression of a specific isoform over the
others. In addition, criticisms have been raised about the accuracy with which
isoforms can be distinguished~\mycite{tamaraRNA,Janes2015-wn,Dapas2017-ui}.\mybr\
However, normalising gene expression
presents more challenges than specific transcript expression.
For instance, the definition of the gene length may be different
from one laboratory to another.
In this thesis' framework,
when I have to use a gene length for a computation,
I use the same gene length definition as
\irap\ and \egxa.
Thus, as shown in \Cref{fig:Genelength-collapsedExons}, the gene
length is defined as the sum of the lengths of all its \emph{collapsed} exons
(a sketch of this computation follows the figure).\mybr\
\begin{figure}
\includegraphics[scale=0.8]{datasets/collapsedExonsLinLib.pdf}\centering
\caption[Gene length is equal to the sum of the lengths of all its collapsed
exons]{\label{fig:Genelength-collapsedExons}\textbf{Gene length is equal
to the sum of the lengths of all its collapsed exons.} Though
this method is not perfectly accurate, it provides a sufficient estimation
of the gene length for an efficient normalisation of the length bias.
The coordinates of the 5' and 3' ends of each exon are extracted from
the annotation and the exons are collapsed together. This \emph{gene length}
is unaffected by the incorrect attribution of a fragment to a
specific transcript when there are many possible options.}
\end{figure}
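To make the definition concrete, here is a minimal Python sketch of the
collapsed-exon gene length, assuming exons are given as 1-based, inclusive
(start, end) coordinate pairs as in a GTF annotation:
\begin{verbatim}
def collapsed_gene_length(exons):
    # Merge overlapping or adjacent exon intervals,
    # then sum the merged lengths.
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return sum(e - s + 1 for s, e in merged)

# Two isoforms sharing an exon: the overlap counts only once.
print(collapsed_gene_length([(1, 100), (50, 120), (200, 260)]))
# -> 181 (collapsed intervals [1,120] and [200,260])
\end{verbatim}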
As mentioned in \Crefu{subsub:norm}, I used two different popular tools based on
different strategies to estimate
gene expression levels:
\hFo{\cuffl}{https://cole-trapnell-lab.github.io/cufflinks/manual/}
(2.2.1)~\mycite{cufflinks}
and \hFo{\htseq}{https://htseq.readthedocs.io/}
(0.6.1p1)~\mycite{htseq} (with the \texttt{intersection non-empty} mode).
These tools are also integrated in \irap.\mybr\
For \cuffl, I used the mode where the multi-mapped reads are probabilistically
assigned depending on the coverage of each mapped locus. In addition,
\cuffl\ provides normalised gene expression levels by aggregating their
corresponding normalised isoform expression levels. \cuffl\ uses
\cref{eq:rpkm-fx} to normalise isoform expression levels. The lengths of the
isoforms are extracted from the reference.\mybr\
On the other hand, \htseq\ provides only \emph{raw} counts for the feature of
interest. \irap\ provides an internal \FPKM\ normalisation function that
is an implementation of \cref{eq:rpkm-fx}. As I requested \htseq\ to work at
the gene level, this formula requires gene lengths, which are computed with the
aforementioned method.\mybr\
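As an illustration of that normalisation step, here is a sketch assuming
\cref{eq:rpkm-fx} is the usual \FPKM\ definition (this is not \irap's actual
code, and the variable names are hypothetical):
\begin{verbatim}
# Minimal sketch: FPKM_g = counts_g * 1e9 / (total * length_g),
# with `counts` the htseq-count output per gene and `lengths`
# the collapsed-exon gene lengths described above.
def fpkm(counts, lengths):
    total = sum(counts.values())  # total assigned fragments
    return {g: c * 1e9 / (total * lengths[g])
            for g, c in counts.items()}
\end{verbatim}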
All the configuration files I created for this thesis may be found at
\addressToirapConfFiles.\mybr\
As the paired-end set of the \ibm\ data presented an overall better
quality than its single-end counterpart,
I only include \ibm{}'s paired-end data in the remainder of the thesis.\mybr\
\subsection{MS data processing}\label{subsec:msDataProcess}
After retrieval of the data from \gls{Pride} and \gls{Proteomicsdb},
\james\ reprocessed the three proteome \ms-based datasets
in a consistent and optimal manner. I summarise this protocol
in \Crefp{fig:pipelineProt}{}
and in the following \cref{subsub:spectralProcessing,subsub:seqDBseeking,%
subsub:spectralIDDBsearch,subsub:resultsFiltering}.
See~\citet{Wright-2016,Weisser2016-pipeline} for more details.\mybr\
\begin{sidewaysfigure}
\includegraphics[scale=0.75]{datasets/pipelineMS-2.pdf}\centering
\caption[General steps for processing the proteome
data]{\label{fig:pipelineProt}\textbf{General steps for processing the
proteome.} [Adaptation of courtesy materials from \james].\\
(\textbf{A}) The three datasets have been processed through the
same pipeline. In this thesis, I only use the samples from adult tissues.
(\textbf{B}) Extensive sources of protein sequences were
used for the search database, including prediction of novel proteins.
Contamination and decoy sequences were also included to allow for \gls{FDR}
estimation. (\textbf{C}) A state-of-the-art workflow was used to process the
\ms\ data from the raw files.
This workflow combines multiple \ms\ search engines and
post-search evaluation tools. Results were filtered by peptide length,
\gls{FDR}, \gls{PEP} and agreement between the multiple search algorithms.
Note that there is no relation between the real size of
the database parts and their representation;
the decoy sequences are as numerous as the sum of the known and possible candidates.}
\end{sidewaysfigure}
\subsubsection{Spectral processing}\label{subsub:spectralProcessing}
The \soft{msconvert} module of \softFull{ProteoWizard}%
{http://proteowizard.sourceforge.net/}{ProteomeWizard}{v3.0.6485}
converted all the files to the standard mzML format.\@
\softCi{TOPP}{Topp} from
\softFull{OpenMS}{https://www.openms.de/}{openMS2016}{pre-v2.0 development build}
processed the raw spectra: notably, \soft{PeakPickerHiRes}, which centroids them,
and \soft{FileMerger}, which merges the ones from the same fractionated experiments.\mybr\
\subsubsection{Sequence database creation and searching preparation}%
\label{subsub:seqDBseeking}
The target sequence database is a critical element of the \ms\ pipeline and thus,
\james\ has carefully designed it.
It combines six different parts: three based on
known protein sequences and three others covering possible new protein candidates.\mybr\
The known sources include the complete human \hg{38} (v.20) \gls{CDS} translated
sequences from \gls{GENCODE}; the human reference proteome from
\gls{Uniprot}\footnote{%
\gls{Uniprot} --- \Href{http://www.uniprot.org/}}~\mycite{UniProt2017}
(in its May 2014 version); common contamination protein
sequences\footnote{Contamination sequences ---
\Href{http://maxquant.org/contaminants.zip}}
and \gls{HLA} sequences\footnote{\gls{HLA} sequences ---
\Href{https://www.ebi.ac.uk/ipd/imgt/hla/download.html}}.
\cpv{This \emph{known} portion of the target sequence database
represents 787,587 tryptic peptide sequences.}\mybr\
The sources for potential
novel proteins included a selection of non-coding gene sequences (including
pseudogenes, \gls{lncRNA} and untranslated regions) from \gls{GENCODE}
\hg{38} (v.20); prediction of novel sequences with
\softFoCi{\mbox{AUGUSTUS}}{http://bioinf.uni-greifswald.de/augustus/}{Stanke2004};
a set of two-consensus predictions (December 2013)
from \softFoCi{Pseudogene.org}{http://pseudogene.org/}{pseudogenesOrg}
and three-frame translated \Rnaseq\ transcript sequences.
These translated sequences include models
built on \ibm\ by \gls{Ensembl} and by the Kellis lab, in addition to
models built on different \gls{ENCODE} cell lines by Caltech and CSHL\@.
\cpv{This \emph{novel} portion of the target sequence database
provides an addition of 4,211,835 tryptic peptide sequences.}\mybr\
\mybr\
%Precautions were taken with the handling of the overlapping regions
%and with the sequence three-frame translations.
\cpv{\softFo{Mimic}{https://github.com/percolator/mimic} generated
4,999,422 ($787,587 + 4,211,835$) randomised decoy sequences,
\ie\ the decoy database and the target database contain an equal number of peptide sequences.
The different databases were then merged together.}
This is represented in \Cref{fig:pipelineProt}{\footnotesize B}.\mybr\
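To illustrate the decoy idea only (this is not Mimic's actual algorithm,
which is more elaborate and preserves the amino acid statistics of the
target set), a naive per-peptide decoy generator could look like this:
\begin{verbatim}
import random

# Naive sketch: one shuffled decoy per target peptide, so the
# decoy and target databases have the same number of entries.
def naive_decoy(peptide, seed=0):
    random.seed(seed)
    # Keep the C-terminal residue (tryptic K/R) in place and
    # shuffle the rest (a common simple decoy strategy).
    core = list(peptide[:-1])
    random.shuffle(core)
    return "".join(core) + peptide[-1]
\end{verbatim}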
To account for isobaric peptides, all isoleucine (I) residues were
converted to leucine (L) before the search; after the search, all leucine
(L) residues were converted to J\footnote{As J is one of the letters
of the Latin alphabet that does not map to any amino acid.}
to avoid later misinterpretations.\cpvD\mybr\
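This residue handling reduces to two string substitutions, as in the
following minimal sketch:
\begin{verbatim}
# I -> L before the search (isoleucine and leucine are
# isobaric), then L -> J after the search to flag the
# ambiguity explicitly.
def pre_search(peptide):
    return peptide.replace("I", "L")

def post_search(peptide):
    return peptide.replace("L", "J")

assert post_search(pre_search("PEPTIDE")) == "PEPTJDE"
\end{verbatim}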
%\NB\ For each peptide sequence, there is a unique identifier,
%an associated source database
%and, when available, the corresponding gene locus.
\subsubsection{Spectral identification and database search pipeline}%
\label{subsub:spectralIDDBsearch}
\Cref{fig:pipelineProt}{\footnotesize C} describes the overall workflow used by
\james\ to quantify the protein abundance in each tissue. As mentioned in
\Cref{seg:moreAlgoisbetter}, workflows involving several algorithms produce
better results. A \soft{Mascot} Server~(v~2.4 --- Matrix Science) cluster produced
a first search on the mzML files, submitted through \soft{MascotAdapterOnline}
(part of \soft{TOPP}). In parallel, \james\ also used \soft{MS-GF~$+$~Search},
which involves running \soft{MS-GF~$+$}~(v. 10089)~\mycite{Kim2014-nj}.
\soft{MascotPercolator}~(v~2.08)~\mycite{Brosch2009-xq,Wright2012-od}
optimised and rescored the results from \soft{Mascot}, and
\soft{msgf2pin/Percolator} (v~2.08\textminus1)~\mycite{Granholm2014-cw} optimised
the results from \soft{MS-GF~$+$}.
Finally, \soft{SEQUEST}~\mycite{Eng1994-eq} and
\soft{Percolator}~\mycite{Spivak2009-zd} performed a search in a
\soft{Proteome Discoverer}~(v 1.4 --- Thermo Scientific) workflow.\mybr\
The different workflows used common stringent parameters for all the database
searches:
the precursor tolerance was set to 10 \gls{ppm};
fragment tolerance for \gls{HCD} spectra to $0.02$ \gls{Da} and
to $0.5$ \gls{Da} for \gls{CID} spectra;
the allowed missed cleavages were limited to $3$.
\cpv{As described in~\citet{Wright-2016},
the searches also accounted for several amino acid modifications
by including (known) mass tolerances.
The fixed modification carbamidomethyl ($+57.0214$ \gls{Da}) was specified
for all cysteine residues.
The searches also comprised the following variable modifications:
N-terminal acetylation ($+42.01056$ \gls{Da}),
N-terminal carbamidomethyl ($+57.0214$ \gls{Da}),
deamidation of asparagine and glutamine residues ($+0.984$ \gls{Da}),
oxidation of methionine residues ($+15.9949$ \gls{Da}),
and the possible N-terminal conversion to pyro-glutamine of
glutamine ($-17.0265$ \gls{Da}) and glutamic acid ($-18.0106$ \gls{Da})
residues.}\mybr\
The search results were converted into mzTab formatted files and uploaded along
with the mzML spectra and FASTA search database to \Pride{PXD002967}.\mybr\
\subsubsection{Results processing and filtering}\label{subsub:resultsFiltering}
Custom Perl scripts parsed, merged and filtered the results of each search engine
so that every \gls{PSM} had the same identification in at least two of the three
search engines.
In each case, the least confident \gls{PEP} (\ie\ the highest, see \Cref{PEP}) was retained.\mybr\
The \glspl{PSM} were then filtered to keep only the matches satisfying the
three following criteria:
a \qval\ less than or equal to $0.01$ (\ie\ 1\% \gls{FDR});
a \gls{PEP} less than or equal to $0.05$; and
a peptide length of at least seven amino acids.
\glspl{PSM} matching contaminant or decoy sequences were also removed.\mybr\
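These criteria translate directly into a filter of the following form
(a sketch assuming each \gls{PSM} is represented with the fields shown,
whose names are hypothetical):
\begin{verbatim}
# Minimal sketch of the PSM filtering criteria.
def keep_psm(psm):
    return (psm["qvalue"] <= 0.01          # i.e. 1% FDR
            and psm["pep"] <= 0.05         # posterior error prob.
            and len(psm["peptide"]) >= 7   # peptide length
            and not psm["is_decoy"]
            and not psm["is_contaminant"])
\end{verbatim}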
The resulting list of peptides was then used to infer the proteins with
a simple approach.
Protein clusters were created based on a common matching non-null set of peptides,
\ie\ each protein cluster has at least one unique peptide.
Then, the \gls{GENCODE} \gls{CDS} and \gls{Uniprot} accessions were mapped back to
\gls{Ensembl} identifiers.
Proteins with a gene (or gene cluster) definition
matching at least three unique peptides
were kept for the remainder of the analysis,
while the others were discarded.\mybr\
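A simplified sketch of this inference rule, assuming a hypothetical mapping
from protein accession to its set of matched peptides, could be:
\begin{verbatim}
from collections import defaultdict

# Proteins sharing the same non-empty peptide set form one
# cluster; clusters with fewer than three unique peptides
# are discarded. (A simplification of the actual procedure.)
def infer_clusters(prot_to_peps):
    clusters = defaultdict(list)
    for prot, peps in prot_to_peps.items():
        if peps:  # non-null peptide set
            clusters[frozenset(peps)].append(prot)
    return [(prots, peps) for peps, prots in clusters.items()
            if len(peps) >= 3]
\end{verbatim}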
The quantification of the retained proteins was computed for each experiment
with an approach close to the Top3 method~\mycite{Silva-Top3}.
The precursor intensities of the three most intense \emph{unique} peptides
per gene identifier (or per gene cluster) were summed,
before being divided by the total summed quantification of all proteins
in each sample to provide the \enquote{\emph{within sample abundance}}.
Then, these abundance values were normalised by the ten genes displaying
the lowest coefficient of variation across all tissues.
When there was more than one experiment per tissue,
the final quantification values are
the median values across all the replicates of each tissue.\mybr\
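The core of this Top3-like computation is sketched below (the subsequent
normalisation by the ten most stable genes and the median over replicates
are omitted), assuming a hypothetical mapping from each gene identifier to
the precursor intensities of its unique peptides in one experiment:
\begin{verbatim}
# Minimal sketch of the within-sample abundance.
def within_sample_abundance(intensities):
    # Sum the three most intense unique peptides per gene...
    top3 = {g: sum(sorted(v, reverse=True)[:3])
            for g, v in intensities.items()}
    total = sum(top3.values())
    # ...then divide by the summed quantification of all
    # proteins in the sample.
    return {g: s / total for g, s in top3.items()}
\end{verbatim}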
Protein clusters matching several \gls{Ensembl} gene identifiers or
failing the \emph{unique peptide} rule are discarded
from the further analyses presented.
The list of discarded clusters is different for each of the
proteome datasets.\mybr\
Compared to the original \pandey\ Lab study~\mycite{PandeyData},
fewer proteins were quantified, but
the results are congruent with other previous studies on the range of detection
and quantification of \gls{LC-MS/MS}.
The quantifications were released along with our paper~\mycite{Wright-2016},
and the reanalysis of the \pandey\ data was also released
through \egxa\ under the accession E\textminus{}PROT\textminus{}1
and described in %\citet{NAR2015,NatComm2016}.
\citet{EBIgxa,Wright-2016}.\mybr\
\section{Discussion and Conclusion}
In this chapter, I introduced the five transcriptomic and three proteomic normal
human tissue datasets on which I based my thesis.
I described how both the transcriptomic and the proteomic datasets have been
reprocessed from the raw files with state-of-the-art unified pipelines,
which also use the same genome build and annotation references in the
final processed version.\mybr\
As mentioned before,
I have produced intermediate results for a subset of the transcriptomic datasets
with the previous human reference genome~(\hg{37}) and
three different \gls{Ensembl} gene set annotations (73, 74 and 75).
I have run many of the analyses of \Cref{ch:expression,ch:Transcriptomics} on
these data.
While the results may vary for individual genes,
the overall outcomes are congruent, hence
supporting the robustness of the findings presented in this thesis.
\cpv{In addition, all the products of the \Rnaseq\ pipeline
are in agreement with the original studies' findings.
The \ms\ pipeline also produces results similar to the original studies
--- except for~\citet{PandeyData},
whose original processing and results raised many criticisms~\mycite{Ezkurdia2014-qx}}.
\mybr\
While we are in the era of the \emph{data deluge} and \emph{big data},
the number of tissue overlaps between independent normal human studies is surprisingly small
--- see \Crefp{fig:VennStudiesT}.
Most of these datasets have been (and will be) referenced in
many papers for comparison (or control) purposes;
hence, it is essential to evaluate the soundness of these practices by assessing
the consistency between these datasets.\mybr\