\linespread{1.0}
\chapter{General Introduction}
%--------------------------------------------------------- Sec
\section{The Problem}
Biodiversity loss currently threatens the diversity of life on Earth. The United Nations Convention on Biological Diversity (CBD) estimates in its Global Biodiversity Outlook report that, of an estimated eight million species, over one million animal and plant species face a risk of extinction within the next few decades due solely to increased anthropogenic activities \cite{cbd2020global}. This is all the more troubling because the majority of species still await discovery and formal description.
Through traditional means of morphological identification, taxonomists have painstakingly managed to categorize just over one million species in the last 250 years. DNA barcoding \cite{hebert2003biological, hebert2003barcoding}, proposed in 2003 as a viable solution to the taxonomic impediment, has since revolutionized the way Linnean taxonomy is done. The premise of DNA barcoding is straightforward: the technique aims to make accurate and rapid species diagnoses by leveraging easily obtained genetic variation in short DNA gene regions collected from unknown specimens of interest. In animals, DNA barcoding specifically employs a \textit{c.}~650~bp fragment from the 5' end of the cytochrome \textit{c} oxidase subunit I (COI) gene, which is highly abundant and found in the mitochondria of all animal species. Within the discipline of biodiversity science, DNA barcodes have been employed to tease out potential cryptic species complexes. Cryptic species are taxa that are morphologically indistinguishable from one another and, as a result, are erroneously lumped under a single binomial name by taxonomic experts. Further, while DNA barcoding's primary goal is to accelerate specimen identification and species discovery, a number of applications outside of biodiversity science have been brought forth. In particular, governmental regulatory bodies worldwide, such as the Canadian Food Inspection Agency and Agriculture and Agri-Food Canada, have harnessed DNA barcoding to combat systemic seafood fraud (\textit{e.g.}, Shehata \textit{et al.}~\cite{shehata2019survey}, Shehata \textit{et al.}~\cite{shehata2018dna}), as well as to monitor the impacts and spread of invasive species on natural ecosystems (\textit{e.g.}, Madden \textit{et al.}~\cite{madden2019using}).
Robust estimation of adequate specimen sample sizes for DNA-based species identification of animal taxa through DNA barcoding is central to timely biodiversity conservation and management. However, this problem is fraught with challenges, including species rarity and project costs \cite{cameron2006will, stein2014is}. Further, species show remarkable variation in genomic markers and rates of molecular evolution within and among taxa, along with differing evolutionary and life histories. It is therefore difficult to know how many specimens of a given species need to be collected to observe the majority of the genetic diversity present within animal species of interest to biodiversity researchers and regulatory scientists. While practical sample sizes for DNA barcoding typically range from 5--10 specimens per species \cite{zhang2010estimating}, anywhere from a single individual to hundreds of specimens may be targeted depending on the study \cite{hajibabaei2007dna, matz2005likelihood, zhang2010estimating}. Unfortunately, little work has been done to determine optimal sampling depths in a statistically rigorous manner.
The majority of studies conducted to date on estimating sample sizes for DNA barcoding have employed sophisticated parametric statistical models with strong underlying assumptions. Unfortunately, the success of this approach is highly dependent on the taxa and molecular genetic loci being considered. This warrants the introduction of more general, user-friendly approaches applicable to wide-ranging taxonomic groups and molecular marker genes.
\section{Thesis Overview} \label{sec:intro1}
This thesis outlines a novel statistical framework for assessment of COI DNA barcode haplotype sampling completeness.
In Chapter 2, existing literature on sample size determination for DNA barcoding is first reviewed. Here, evidence points to a large knowledge gap in statistical and computational methods currently available for this task. Specifically, too much focus has been placed on inflexible parametric models rather than generalized flexible ones. Further, this work finds that efforts have been improperly directed toward sampling as many species as possible, rather than maximizing the number of specimens collected per species. A case study on ray-finned fishes retrieved from the Barcode of Life Data Systems (BOLD) \cite{ratnasingham2007bold} clearly highlights this shortcoming, along with the need to develop approaches which incorporate more species-level information.
Chapter 3 builds on fundamental concepts of evolutionary biology and statistics introduced and outlined in Chapter 2 by detailing a novel nonparametric stochastic (Monte Carlo) local search optimization algorithm, implemented in the R statistical programming language, to better address the need for improved sampling strategies for DNA barcoding initiatives. The method, called {\tt HACSim} (short for \textbf{H}aplotype \textbf{A}ccumulation \textbf{C}urve \textbf{Sim}ulator), available as an R package for global use, employs easily obtainable genomic information from a sample of previously-assembled species-specific DNA sequence alignments. The method is tested on a variety of hypothetical and real species mined from BOLD. Specifically, the method employs iteration and randomness to extrapolate species' haplotype accumulation curves toward an asymptote to assess where such curves may level off. The approach is found to work well for a number of relevant species, consistently suggesting that hundreds to thousands of specimens need to be randomly sampled across their geographic and ecological ranges to be confident that species-level genomic variation has been sufficiently captured.
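
As a rough illustration of what it means for a haplotype accumulation curve to level off, the expected curve can be written in closed form under simplifying assumptions that are not taken from {\tt HACSim} itself: specimens are drawn independently and at random, and the $H^{*}$ haplotypes of a species occur with fixed frequencies $p_{1}, \ldots, p_{H^{*}}$ that sum to one. The symbols $H^{*}$, $p_{k}$, $N$ and $q$ are introduced here purely for illustration. Under these assumptions, the expected number of distinct haplotypes observed among $N$ specimens is
\[
E[H(N)] = \sum_{k=1}^{H^{*}} \left[ 1 - (1 - p_{k})^{N} \right],
\]
which increases with $N$ and approaches $H^{*}$ asymptotically. For equally frequent haplotypes ($p_{k} = 1/H^{*}$), the smallest sample size expected to recover a proportion $q$ of all haplotypes is
\[
N_{q} = \left\lceil \frac{\ln(1 - q)}{\ln(1 - 1/H^{*})} \right\rceil .
\]
For example, with $H^{*} = 10$ and $q = 0.95$, $N_{q} = \lceil \ln(0.05)/\ln(0.9) \rceil = \lceil 28.4 \rceil = 29$ specimens, whereas a sample of $N = 10$ specimens is expected to recover only $10\,(1 - 0.9^{10}) \approx 6.5$ haplotypes. Skewed haplotype frequencies, in which rare haplotypes have small $p_{k}$, push the required sample size considerably higher, which is one reason a simulation-based approach can be attractive when no convenient closed form exists.
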
Chapter 4 extends elements discussed in Chapter 3 by delving further into the usability and applicability of {\tt HACSim} via a detailed statistical simulation study that assesses both the validity and overall performance of {\tt HACSim} and its utility for assessing intraspecific sampling completeness within DNA barcoding studies for a variety of species taken from BOLD. {\tt HACSim} is demonstrated to possess good statistical properties, including high consistency between successive algorithm runs for desired capture of intraspecific haplotype variation.
Finally, in Chapter 5, it is argued that DNA barcoding is currently lacking in statistical rigor and that better statistical methods are necessary to more accurately assess standing genetic variation at the species level when it comes to estimating the DNA barcode gap. The use of {\tt HACSim} is suggested to address the problem of improper allocation of specimen sampling effort. Kernel density estimation plots, along with quadrant plots, are advocated in place of traditional histograms to more easily detect outlier and problematic taxa that reflect potential failures of DNA barcoding. Hypothesis testing and nonparametric bootstrapping are recommended to place DNA barcoding and barcode gap analyses on firmer statistical ground through estimation of confidence intervals for intraspecific and interspecific genetic distances. All proposed approaches are illustrated through a case study focusing on Pacific fishes of Canada \cite{steinke2009dna}.
\section{Thesis Statement}
Through the development of a novel stochastic simulation algorithm for the generation of haplotype accumulation curves, the current research will provide a framework that can be employed to determine plausible specimen sample sizes sufficient to quantify levels of haplotypic sampling completeness within species. The proposed method detailed herein is assessed under both uniform and non-uniform haplotype frequency distributions for a wide variety of animal taxa, particularly those of conservation and regulatory concern. Such a framework will be valuable in promoting a greater degree of statistical thoroughness in future DNA barcoding studies.
\section{Statement of Contributions}
All chapters presented in this thesis are original and were the sole effort of JDP, including review of the primary literature, conceptualization of ideas, implementation of code, design of experiments and writing of individual manuscripts. All other coauthors assisted directly with writing of code and/or running of experiments, or participated in the editing of final manuscript versions.
The following articles are published, under review or in preparation:
\begin{itemize}
\item Phillips, J.D., Gillis, D.J. and Hanner, R.H. (2019). Incomplete estimates of genetic diversity within species: Implications for DNA barcoding. \textit{Ecology and Evolution}, \textbf{9}(5): 2996-3010.
Here, existing literature on methods to estimate sample sizes for DNA barcoding is reviewed. It is found that a significant knowledge gap exists in available computational and statistical methods to accurately determine adequate levels of sampling depth for genetic diversity assessment at the species level. Determining the amount of collection effort needed to be confident that the majority of species’ haplotype diversity has been captured is not an easy task. Practical sample sizes range from 5--10 individuals per species, but recent work has criticized these arbitrary values. Due to species rarity, often only 1--2 specimens can reasonably be collected. Knowledge from a variety of fields in biodiversity science, ecology and evolutionary biology needs to be integrated to address this question sufficiently. Findings highlight that efforts to date have been too focused on sampling as many species as possible, given factors such as project budget. Instead, specimen collection should be based on targeting an optimal number of specimens per species. Reliable estimation of specimen sample size is key, for example, to the development of robust species-specific primers and probes necessary for accurate specimen identification. A case study on DNA barcoding of ray-finned fishes \cite{phillips2015exploration} is then used to illustrate the need for new methods that incorporate more genomic information.
\vspace{1mm}
\item Phillips, J.D., French, S.H., Hanner, R.H. and Gillis, D.J. (2020). HACSim: An R package to estimate intraspecific sample sizes for genetic diversity assessment using haplotype accumulation curves. \textit{PeerJ Computer Science}, \textbf{6}(192): 1-37.
Here, a novel statistical method called {\tt HACSim} (Haplotype Accumulation Curve Simulator) to estimate specimen sample sizes for DNA barcoding based on saturation levels in species’ haplotype accumulation curves is presented. The method is a nonparametric stochastic local search optimization algorithm that uses Monte Carlo sampling. {\tt HACSim} can be employed to estimate likely required sample sizes to capture, for example, 95\% of all haplotype variation that might exist for a species of interest. Unlike previously proposed approaches, which take into account little biological information on the species under study and impose strong parametric statistical assumptions, the new method employs species-level information that is easily retrievable from DNA sequence alignments, in particular, the distribution of species’ haplotypes. In addition to hypothetical taxa, {\tt HACSim}'s use is illustrated on DNA barcode sequence data mined from BOLD for a variety of animal taxa of medical, forensic/regulatory, conservation and socioeconomic importance (fishes, insects, arachnids). Findings of this work were not surprising. {\tt HACSim} revealed little evidence of asymptotic behaviour in generated accumulation curves based on sampling between 171--349 individuals per species. According to the model, only 73.8--82.6\% of total genetic diversity has likely been uncovered for the species examined thus far. {\tt HACSim} predicts that much larger sample sizes (often hundreds to thousands of collected specimens) will be needed to reliably probe genetic diversity at the species level. This is evidenced by sample sizes ranging from 414--803 specimens per species being found by {\tt HACSim} for the species examined. {\tt HACSim} is available as an R package for global use by the molecular biodiversity community at large.
\vspace{1mm}
\item Phillips, J.D., Bootsma, S.E., Hanner, R.H. and Gillis, D.J. (\textit{In preparation}). Solving the genetic specimen sample size problem for DNA barcoding with a local search optimization algorithm.
Herein, an in-depth statistical simulation study is undertaken to assess the overall performance of {\tt HACSim} in reliably estimating sample sizes necessary for genetic diversity assessment within species. At present, {\tt HACSim} produces a single realized estimate of the ``true'' specimen sampling sufficiency (referred to as $\theta$) for a species of interest; however, given the stochastic nature of the algorithm, carrying out multiple independent runs is necessary. Algorithm performance is tested on a wide range of species, both real and hypothetical, of broad interest to biodiversity researchers and regulatory scientists. Based on running {\tt HACSim} 100 times using both default and altered levels of desired haplotype recovery ({\tt p} = 80\%, 90\% and 95\%) at population sizes of 1000, 10000, 100000 and 10 million, it is shown that {\tt HACSim} produces reasonable estimates of likely required sample sizes sufficient to capture set levels of haplotype diversity. This work opens up a number of avenues for future work, including further improving computational performance of {\tt HACSim}, as well as incorporating more realistic biological scenarios, such as population structure, into simulations.
\vspace{1mm}
\item Phillips, J.D., Gillis, D.J. and Hanner, R.H. (\textit{Accepted}). Lack of statistical rigor in DNA barcoding likely invalidates the presence of a true species' barcode gap. \textit{Frontiers in Ecology and Evolution}.
Here, a case is made for a lack of statistical rigor in DNA barcoding. Simple statistical approaches to the analysis of DNA barcode data as it pertains to estimation of the barcode gap are presented, with a particular focus on animal taxa of regulatory, forensic as well as broad socioeconomic and conservation importance. Arguments revolve around three broad areas: (1) the improper allocation of specimen sampling efforts required to assess standing levels of taxon genetic diversity, (2) the failure to properly visualize both intraspecific and interspecific genetic distances, and (3) the inconsistent or inappropriate use, or outright absence, of statistical inferential procedures in DNA barcode gap analyses. Recommended remedies presented herein are based strongly on established statistical theory and are easily applied in practice by the nonstatistician. A case study on the DNA barcoding of Canadian Pacific fishes \cite{steinke2009dna} is employed to highlight these three key shortcomings.
\end{itemize}
\section{Territorial Acknowledgement}
The Dish With One Spoon Covenant speaks to our collective responsibility to steward and sustain the land and environment in which we live and work, so that all peoples, present and future, may benefit from the sustenance it provides. As we continue to strive to strengthen our relationships with and continue to learn from our Indigenous neighbours, we recognize the partnerships and knowledge that have guided the research conducted in our labs. We acknowledge that the University of Guelph resides in the ancestral and treaty lands of several Indigenous peoples, including the Attawandaron people and the Mississaugas of the Credit, and we recognize and honour our Anishinaabe, Haudenosaunee, and M{\'e}tis neighbours. We acknowledge that the work presented here has occurred on their traditional lands so that we might work to build lasting partnerships that respect, honour, and value the culture, traditions, and wisdom of those who have lived here since time immemorial.