Unambiguously identifying species is a far more challenging task than it may appear, for many reasons. Different databases keep different taxonomic "backbones", i.e. different data structures in which names are mapped to species, and organised in a hierarchy. Not all names are unique identifiers to groups. For example, Io can either refer to a genus of plants from the aster family, or to a genus of molluscs; the genus Mus (of which the house mouse Mus musculus is a species) contains a sub-genus also named Mus (within which Mus musculus is located). Conversely, the same species can have several names, which are valid synonyms: for example, the domestic cow Bos taurus admits Bos primigenius taurus as a valid synonym. In addition to binomial names, the same species can be known by many vernacular (common) names, which are language- or even region-specific: Ovis aries, for example, has valid English vernaculars including lamb, sheep, wild sheep, and domestic sheep.
In addition, taxonomic nomenclature changes regularly, with groups being split, merged, or moved to a new position in the tree of life; often, taxonomic revisions lead to these events occurring simultaneously. This is, notably, a common occurrence with viral taxonomy, each subsequent version of which can differ markedly from the last; compare, e.g., @Lefkowitz2018VirTax to @Walker2020ChaVir, where entire viral sub-trees were split, re-organized, and created within just two years. As a consequence, any mapping of names to other biological entities can become outdated, and therefore invalid. These taxonomic changes have profound implications for the way we perceive biodiversity at global scales [@Dikow2009BioRes], to the point where taxonomic revisions should sometimes be actively conducted to improve e.g. conservation outcomes [@Melville2021RetApp].
None of these issues, were they to happen in isolation, would be very difficult to deal with. Indeed, performing the lookup for any text string in any database is a trivial operation. But to add to the complexity, one must also consider that most taxon names are at some point manually typed, which has the potential to introduce additional sources of variation in raw data; it is reasonable to expect that such mistakes may arise when attempting to write down the (perfectly valid) names of the bacterial isolate known as Myxococcus llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogochensis, or of the crowned slaty flycatcher Griseotyrannus aurantioatrocristatus. These mistakes are more likely when dealing with hyper-diverse samples (which demand memorizing more names), like plant censuses [@Dauncey2016ComMis; @Wagner2016RevSof; @Conti2021MatAlg]; when dealing with multiple investigators with different knowledge of the taxonomy; and as a result of the error inherent to any data entry exercise, which other fields estimate at up to about 5% [@Barchard2011PreHum]. As a result, the first question one needs to ask when confronted with a string of characters that purportedly points to a node in the tree of life is not "to which entry in the taxonomy database is it associated?", but "is there a mistake in this name that is likely to render a simple lookup invalid?".
All these considerations become important when matching species names both
within and across datasets. Let us consider a hypothetical species survey of
riverine fishes: European chub, Cyprinus cephalus, Leuciscus cephalus,
Squalius cephalus. All are the same species (S. cephalus), referred to by
its accepted name, one of its vernaculars (European chub), and two formerly
accepted names now classified as synonyms (but still present in the
literature). A simple estimate of diversity based on the user-supplied names
would give four species, when there is in fact only one.
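The effect of unreconciled names on a richness estimate can be sketched in a few lines of Julia. This is a minimal illustration only: the `accepted` dictionary is a hypothetical, hand-written lookup standing in for a real taxonomic backbone, and its contents simply restate the survey described above.

```julia
# Names as recorded in the hypothetical survey
survey = [
    "European chub",
    "Cyprinus cephalus",
    "Leuciscus cephalus",
    "Squalius cephalus",
]

# Hypothetical hand-written lookup (assumption for this example):
# every recorded name maps to the currently accepted name.
accepted = Dict(
    "European chub" => "Squalius cephalus",
    "Cyprinus cephalus" => "Squalius cephalus",
    "Leuciscus cephalus" => "Squalius cephalus",
    "Squalius cephalus" => "Squalius cephalus",
)

# Richness computed on raw strings counts every distinct spelling
naive_richness = length(unique(survey))

# Richness computed after mapping each string to its accepted name
reconciled = [accepted[name] for name in survey]
reconciled_richness = length(unique(reconciled))

println("Naive richness: ", naive_richness)
println("Reconciled richness: ", reconciled_richness)
```

The gap between the two counts is entirely an artifact of naming, which is why reconciliation has to happen before any downstream analysis.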
In this manuscript, we describe `NCBITaxonomy.jl`, a Julia package that provides
advanced name matching and error handling capacities for the reconciliation of
taxonomic names to the NCBI database. This package was used to facilitate the
development of the CLOVER [@Gibb2021DatPro] database of host-virus
associations, by reconciling the names of viruses and mammals from four
different sources, where all of the issues described above were present. More
recently, it has become part of the automated curation of data for the VIRION
[@Carlson2022GloVir] database, which automatically curates an up-to-date,
authoritative virome network from dozens of heterogeneous sources. We describe
the core capacities of this package, and highlight how it enables safe,
high-performance name reconciliation.
Based on the author's experience reconciling lists of thousands of biological
names, `NCBITaxonomy.jl` is built around a series of features that allow (i)
maximum flexibility when handling names without a direct match, (ii) a bespoke
exception system to handle failures to match automatically, and (iii) limits to
the pool of potential names in order to achieve orders-of-magnitude speedups
when the broad classification of the name to match is known. Adhering to these
design principles led to a number of choices. A comparison of the features of
different packages, as inferred from their public documentation, is presented in
@tbl:comparison.
First, we specifically target programmatic (as opposed to command-line) approaches, so that the functionalities of the package can be accessed as part of a larger pipeline. Second, to speed up the queries, we work from a local version of the database, the installation of which is handled at build time by the package itself; each project using the package can use its own version of the taxonomy by specifying the folder where it is stored through an environment variable. Third, because we cannot trust that the names as presented in the original data are correct, we offer case-insensitive search (at no time cost) and fuzzy matching (at a significant time cost). Either of these strategies can be called only after a case-sensitive, non-fuzzy search yields an exception about the lack of a direct match. Finally, in order to achieve good performance even when relying on fuzzy matching, we offer the ability to limit the search to specific parts of the taxonomy database. An example of the impact of this feature on the performance of the package is presented below.
Tool | Lang. | Library | CLI | Local DB | Fuzzy | Case | Subsets | Ranks | Reference |
---|---|---|---|---|---|---|---|---|---|
NCBITaxonomy.jl | Julia | + | + | + | + | + | + | | |
taxadb | R | + | + | + | + | | | | |
taxopy | Python | + | + | + | | | | | |
rentrez | R | + | + | | | | | | |
Taxonkit | Python | + | + | | | | | | |
NCBI-taxonomist | Python | + | + | | | | | | |
Table: Comparison of core features of packages offering access to the NCBI taxonomic backbone. "Library": ability to be called from code. "CLI": ability to work as a command-line tool. "Local DB": ability to store a copy of the database locally. "Fuzzy": ability to perform fuzzy matching on inputs. "Case": ability to perform case-insensitive search. "Subsets": ability to limit the search to a subset of the raw database. "Ranks": ability to limit the search to specific taxonomic ranks. The features of the various packages have been determined from reading their documentation. {@tbl:id}
An up-to-date version of the documentation for `NCBITaxonomy.jl` can be found in
the package's GitHub repository (PoisotLab/NCBITaxonomy.jl), including
examples and in-line documentation of every method. The package is released
under the MIT license. Contributions can be made in the form of issues (bug
reports, questions, feature suggestions) and pull requests, all of which can be
consulted publicly. Alternatively, the package can be downloaded from its Zenodo
page (ID 5825828), along with a versioned DOI.
In order to achieve good performance, the package will first retrieve the latest (as validated by its checksum) NCBI taxonomy backbone, store it locally, and pre-process it as a set of Julia data tables. By default, the taxonomy will be downloaded to the user's home directory, which is not an ideal solution; we therefore recommend that users set an environment variable to specify where the data will be loaded from (this path will be created if it doesn't exist):
ENV["NCBITAXONOMY_PATH"] = joinpath(homedir(), "data", "NCBITaxonomy.jl")
Note that this location can be different for different projects, as the package is able to update the taxonomic backbone (and will indeed prompt the user to do so if the taxonomy is more than 90 days old, as inferred from the creation timestamp of the raw files). The package can then be checked out and installed anonymously from the central Julia repository:
using Pkg
Pkg.add("NCBITaxonomy")
As long as the package is not re-built, the local set of tables downloaded from
NCBI will not change; this way, users can re-run an analysis with a guarantee
that the underlying taxonomic backbone has not changed, which is not the case
when relying on API queries. In order to update the taxonomic backbone, users
can call the `build` function of Julia's package manager
(`]build NCBITaxonomy`), which will download the most recent version of all
files.
This software note describes version `v0.3.0` of the package (we follow semantic
versioning), which works on Julia 1.5 upwards. The dependencies are all resolved
by the package manager at installation, and (on the user-facing side) include
the `StringDistances.jl` package, allowing users to experiment with different
string matching methods. As is best practice for Julia packages, a
`Project.toml` file specifying compatible dependency versions is distributed
with the package. The code is covered by unit tests (with about 98% coverage),
as well as integration tests as part of the documentation (specifically, a
use-case detailing how to clean data from a biodiversity survey, and a use-case
aiming to reconstruct a taxonomic tree for the Lemuriformes).
Name finding, i.e. the matching of an arbitrary string to a taxonomic
identifier, is primarily done through the `taxon` function, which admits either
a unique NCBI identifier (e.g. `taxon(36219)` for the bogue Boops boops), a
string (`taxon("Boops boops")`), or a data frame with a restricted list of names
in order to create a name finder function (see the next section). The `taxon`
method has additional arguments to perform fuzzy matching in order to catch
possible typos (`taxon("Boops bops"; strict=false)`), to perform a lowercase
search (useful when alphanumeric codes are part of the taxon name, as for some
viruses), and to restrict the search to a specific taxonomic rank. The
`taxon` function also accepts a `preferscientificname` keyword, to prevent
matching vernacular names; the use of this keyword ought to be informed by
knowledge about how the data were entered.
The lowercase search can be a preferable alternative to fuzzy string matching.
Consider the string `Adeno-associated virus 3b`: three names are at an equal
distance from it (under the Levenshtein string distance function):
julia> similarnames("Adeno-associated virus 3b"; threshold=0.95)
3-element Vector{Pair{NCBITaxon, Float64}}:
Adeno-associated virus - 3 (ncbi:46350) => 0.96
Adeno-associated virus 3B (ncbi:68742) => 0.96
Adeno-associated virus 3A (ncbi:1406223) => 0.96
Depending on the operating system (and specifically whether it is case-sensitive), any of these three names can be returned; compare to the output of a case-insensitive name search:
julia> taxon("Adeno-associated virus 3b"; casesensitive=false)
Adeno-associated virus 3B (ncbi:68742)
This returns the correct name.
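Why an edit distance cannot break this tie, while a case-insensitive comparison can, is sketched below in plain Julia. This is an illustration under stated assumptions: `levenshtein` and `similarity` are hypothetical helpers written for this example (they are neither the package's code nor the `StringDistances.jl` implementation), the similarity is assumed to be one minus the distance divided by the length of the longer string, and only two of the three candidate names are used.

```julia
# Textbook dynamic-programming Levenshtein distance (hypothetical helper)
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))          # distances from the empty prefix of `a`
    for i in 1:length(s)
        curr = similar(prev)
        curr[1] = i                      # deleting the first i characters of `a`
        for j in 1:length(t)
            cost = s[i] == t[j] ? 0 : 1  # substitution is free if characters match
            curr[j + 1] = min(prev[j + 1] + 1, curr[j] + 1, prev[j] + cost)
        end
        prev = curr
    end
    return prev[end]
end

# Assumed normalisation: 1 means identical strings
similarity(a, b) = 1 - levenshtein(a, b) / max(length(a), length(b))

query = "Adeno-associated virus 3b"
candidates = ["Adeno-associated virus 3B", "Adeno-associated virus 3A"]

# Both candidates are one substitution away: the scores are tied
scores = [similarity(query, c) for c in candidates]

# A case-insensitive exact comparison keeps a single candidate
ci_matches = filter(c -> lowercase(c) == lowercase(query), candidates)

println(scores)
println(ci_matches)
```

Because changing `b` to `B` and changing `b` to `A` are both a single substitution, no threshold on the distance can separate the two candidates; trying the cheaper case-insensitive comparison first sidesteps the problem.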
When it succeeds, `taxon` will return an `NCBITaxon` object (made of a `name`
string field, and an `id` numerical field). That being said, the package is
designed under the assumption that ambiguities should yield an error for the
user to handle. There are two such errors: `NameHasNoDirectMatch` (with
instructions about how to possibly solve it, using the `similarnames` function),
and `NameHasMultipleMatches` (listing the possible valid matches, and
suggesting to use `alternativetaxa` to find the correct one). Therefore, the
common way to work with the `taxon` function would be to wrap it in a
`try`/`catch` statement:
try
    taxon(name)
    # Additional operations with the matched name
catch err
    if isa(err, NameHasNoDirectMatch)
        # What to do if no match is found
    elseif isa(err, NameHasMultipleMatches)
        # What to do if there are multiple matches
    else
        # What to do in case of another error that is not NCBITaxonomy specific
    end
end
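The same pattern can be tried without installing the package or downloading the taxonomy. The sketch below is self-contained and makes several assumptions: `NoDirectMatch`, `MultipleMatches`, `toy_taxon`, and `reconcile` are hypothetical stand-ins written for this example, and the identifiers given for Io are placeholders rather than real NCBI identifiers.

```julia
# Hypothetical stand-ins for the two package-specific exceptions
struct NoDirectMatch <: Exception
    name::String
end

struct MultipleMatches <: Exception
    name::String
    candidates::Vector{Int}
end

# Toy names table: each name maps to one or more identifiers.
# The identifiers for "Io" are placeholders, not real NCBI identifiers.
toy_names = Dict(
    "Boops boops" => [36219],
    "Io" => [-1, -2],
)

# Strict lookup that throws instead of guessing
function toy_taxon(name::AbstractString)
    ids = get(toy_names, name, Int[])
    isempty(ids) && throw(NoDirectMatch(name))
    length(ids) > 1 && throw(MultipleMatches(name, ids))
    return only(ids)
end

# Unattended reconciliation: matched names are kept, failures are
# recorded for later review, and unrelated errors are re-thrown.
function reconcile(names)
    matched = Dict{String,Int}()
    review = Dict{String,Symbol}()
    for name in names
        try
            matched[name] = toy_taxon(name)
        catch err
            if err isa NoDirectMatch
                review[name] = :no_match
            elseif err isa MultipleMatches
                review[name] = :ambiguous
            else
                rethrow()
            end
        end
    end
    return matched, review
end

matched, review = reconcile(["Boops boops", "Boops bops", "Io"])
println(matched)
println(review)
```

Collecting failures in a separate structure, rather than stopping at the first one, is one way to keep a long batch running while still leaving every ambiguous name for the user to resolve.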
These functions will not demand any user input in the form of key presses
(though they can be wrapped in additional code to allow it), as they are
intended to run on clusters or virtual machines without supervision. The `taxon`
function scales well across multiple threads. For convenience in rapidly
getting a taxon for demonstration purposes, we also provide a string macro,
whereby e.g. `ncbi"Procyon lotor"` will return the taxon object for the
raccoon.
As the full NCBI names table has over 3 million entries at the time of writing,
we have provided a number of functions to restrict the scope of names that are
searched. These are driven by the NCBI divisions. For example,
`nf = mammalfilter(true)` will return a data frame containing the names of
mammals, inclusive of rodents and primates, and can be used with e.g.
`taxon(nf, "Pan")`. This has the dual advantage of making the search faster and
of avoiding matches on names that are shared by another taxonomic group (which
is not an issue with Pan, but is an issue with e.g. Io as mentioned in the
introduction, or with the common name Lizard, which fuzzy-matches on the
hemipteran genus Lisarda rather than the class Lepidosauria).
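How restricting the pool removes an ambiguity can be illustrated on a miniature table. The sketch below is an assumption-laden toy: `names_table`, `restrict`, and `find_exact` are hypothetical names introduced for this example, the division labels are simplified, and the identifiers are placeholders.

```julia
# Hypothetical miniature names table; identifiers are placeholders
names_table = [
    (name = "Io", division = :plants, id = 1),
    (name = "Io", division = :invertebrates, id = 2),
    (name = "Pan", division = :primates, id = 3),
    (name = "Mus", division = :rodents, id = 4),
]

# Keep only the rows belonging to one division
restrict(table, division) = filter(row -> row.division == division, table)

# Identifiers of all rows whose name is an exact match
find_exact(table, name) = [row.id for row in table if row.name == name]

# Searching the whole table is ambiguous for "Io"
all_hits = find_exact(names_table, "Io")

# Searching only within plants returns a single identifier,
# and scans a quarter of the rows
plants = restrict(names_table, :plants)
plant_hits = find_exact(plants, "Io")

println(all_hits)
println(plant_hits)
```

The smaller pool helps twice: fewer rows are compared, and names reused in another group never enter the comparison.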
Note that the use of a restricted list of names can have significant performance consequences: compare, for example, the time taken to return the taxon Pan in the entire database, in all mammals, and in all primates:
Names list | Fuzzy matching | Time (ms) | Allocations | Memory allocated |
---|---|---|---|---|
all | no | 23 | 34 | 2 KiB |
all | yes | 105 | 2580 | 25 MiB |
`mammalfilter(true)` | no | 0.55 | 32 | 2 KiB |
`mammalfilter(true)` | yes | 1.9 | 551 | 286 KiB |
`primatefilter()` | no | 0.15 | 33 | 2 KiB |
`primatefilter()` | yes | 0.3 | 92 | 27 KiB |
Clearly, the optimal search strategy is to (i) rely on name filters to ensure
that searches are conducted within the appropriate NCBI division, and (ii) only
rely on fuzzy matching when the strict or lowercase match fails to return a
name, as fuzzy matching can increase run time and memory footprint by orders of
magnitude. These numbers were obtained on a single Intel i7-8665U CPU
(@ 1.90GHz). Using `"chimpanzees"` as the search string (one of the
NCBI-recognized vernaculars for Pan) gave qualitatively similar results,
suggesting that there is no performance cost associated with working with
synonyms or vernacular input data.
In order to facilitate working with names, we provide the `authority` function
(gives the full taxonomic authority for a name), `synonyms` (to get alternative
valid names), `vernacular` (for English common names), and `rank` (for the
taxonomic rank). These functions are not used in name matching, but are often
useful in the post-processing of results.
The `children` function will return all nodes that are directly descended from a
taxon; the `descendants` function will recursively apply this function to all
descendants of these nodes, until only terminal leaves are reached. The `parent`
function is an "upwards" equivalent, giving the taxon from which a taxon
descends; the `lineage` function chains calls to `parent` until either
`taxon(1)` (the taxonomy root) or an arbitrary ancestor is reached.
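The logic of these navigation functions can be sketched on a toy tree. The code below is not the package's implementation: `parent_of`, `toy_children`, `toy_descendants`, and `toy_lineage` are hypothetical names, and the tree is a deliberately simplified fragment (intermediate ranks are skipped) stored as a child-to-parent dictionary.

```julia
# Hypothetical, simplified taxonomy stored as child => parent
parent_of = Dict(
    "Mus musculus" => "Mus",
    "Mus spretus" => "Mus",
    "Mus" => "Murinae",
    "Rattus" => "Murinae",
    "Murinae" => "root",
)

# Direct descendants of a node (sorted for reproducible output)
toy_children(taxon) = sort([c for (c, p) in parent_of if p == taxon])

# Recursive application of toy_children down to the terminal leaves
function toy_descendants(taxon)
    found = String[]
    for child in toy_children(taxon)
        push!(found, child)
        append!(found, toy_descendants(child))
    end
    return found
end

# Chain parent lookups until the root or a chosen ancestor is reached
function toy_lineage(taxon; stop_at = "root")
    path = [taxon]
    while path[end] != stop_at && haskey(parent_of, path[end])
        push!(path, parent_of[path[end]])
    end
    return path
end

println(toy_children("Murinae"))
println(toy_descendants("Murinae"))
println(toy_lineage("Mus musculus"))
println(toy_lineage("Mus musculus"; stop_at = "Murinae"))
```

Stopping the lineage at an arbitrary ancestor is convenient when only part of the tree is relevant, for example when reconstructing the tree of a single order.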
The `taxonomicdistance` function (and its in-place equivalent,
`taxonomicdistance!`, which uses memory-efficient re-allocation if the user
needs to change the distance between taxonomic ranks) uses the
@Shimatani2001MeaSpe approach to reconstruct a matrix of distances based on
taxonomy, which can serve as a rough proxy when no phylogenies are available.
This allows coarse estimations of taxonomic diversity based on species lists.
The default distance between taxonomic levels is as in @Shimatani2001MeaSpe
(i.e. species have a distance of 0, genus of 1, family of 2, sub-classes of 3,
and everything else 4), but specific scores can be passed for any taxonomic
level known to the NCBI name table.
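One simplified reading of this scoring scheme is that the distance between two taxa is the score of the most specific rank at which they share a name. The sketch below implements that reading only; it is an assumption about the general idea, not the package's algorithm. The taxa, `lineages`, `rank_scores`, and `toy_distance` are all hypothetical and invented for this example.

```julia
# Ranks from most to least specific, with the default scores given above
rank_scores = [(:species, 0), (:genus, 1), (:family, 2), (:subclass, 3)]
default_score = 4  # taxa sharing none of the ranks above

# Invented taxa and lineages, for illustration only
lineages = Dict(
    "Aus bus" => Dict(:species => "Aus bus", :genus => "Aus", :family => "Aidae", :subclass => "Xia"),
    "Aus cus" => Dict(:species => "Aus cus", :genus => "Aus", :family => "Aidae", :subclass => "Xia"),
    "Dus eus" => Dict(:species => "Dus eus", :genus => "Dus", :family => "Aidae", :subclass => "Xia"),
    "Fus gus" => Dict(:species => "Fus gus", :genus => "Fus", :family => "Fidae", :subclass => "Xia"),
    "Hus ius" => Dict(:species => "Hus ius", :genus => "Hus", :family => "Hidae", :subclass => "Yia"),
)

# Score of the most specific rank shared by the two taxa
function toy_distance(a, b)
    for (rank, score) in rank_scores
        if lineages[a][rank] == lineages[b][rank]
            return score
        end
    end
    return default_score
end

# Pairwise distance matrix, with taxa in alphabetical order
taxa = sort(collect(keys(lineages)))
D = [toy_distance(a, b) for a in taxa, b in taxa]

println(taxa)
println(D)
```

Changing the values in `rank_scores` changes the weight given to each taxonomic level, in the same spirit as passing custom scores to the distance function.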
`NCBITaxonomy.jl` enables rapid, taxonomically-restricted, adaptive matching for
taxonomic names. By implementing various combinations of search strategies, it
allows users to (i) optimize the speed of their queries and (ii) avoid the usual
caveats of simple string matching. Through explicit exceptions, it allows users
to write code that handles the edge cases that cannot be solved automatically,
in a way that neither interrupts execution nor requires manual input.
Given the breadth of the NCBI taxonomy database, `NCBITaxonomy.jl` is
particularly suited to the cleaning of large datasets of names.
Acknowledgements: This work was supported by funding to the Viral Emergence
Research Initiative (VERENA) consortium including NSF BII 2021909 and 2213854
and a grant from Institut de Valorisation des Données (IVADO), by the NSERC
Discovery Grants and Discovery Acceleration Supplement programs, and by a
donation from the Courtois Foundation. Benchmarking of this package on
distributed systems was enabled by support provided by Calcul Québec
(www.calculquebec.ca) and Compute Canada (www.computecanada.ca). TP wrote
the initial code, TP and CJC contributed to API design, and all authors
contributed to functionalities and usability testing.