Contributions are welcome!

NCBI Taxonomy SQL/Java Tools

This set of tools creates a local SQL copy of the NCBI Taxonomy database and provides Java programs to query it.

It supports many common operations in systematics and species identification, such as:

  • extract a lineage for a species or taxonomic id
  • extract all species below a certain taxonomy node
  • extract the lineage of every species below a certain taxonomy node
  • build a Blast "Delimitation File", intended to be used with the command blastdb_aliastool -gilist to build Blast database indexes focused on particular NCBI Taxonomy clades.
  • ... etc ...

This package was initially developed for the following academic projects:

  • The contribution of mitochondrial metagenomics to large-scale data mining and phylogenetic analysis of Coleoptera. Linard B et al. Mol Phylogenet Evol. 2018 Nov;128:1-11.
  • Lessons from genome skimming of arthropod-preserving ethanol. Linard B. et al. Mol Ecol Resour. 2016 Nov;16(6):1365-1377.
  • Metagenome skimming of insect specimen pools: potential for comparative genomics. Linard et al. Genome Biol Evol. 2015 May 14;7(6):1474-89.

If this code is of any use in your own project, the authors would greatly appreciate a citation of one of these papers.

Requirements

  • PostgreSQL server (the Java code should be compatible with other DBMSs after adapting the COPY statements in update_taxonomy.sh and the SQL schema in taxonomy_schema.sql)
  • ADMIN or COPY rights for your DBMS user/role, needed to copy the NCBI dumps to your local database.
  • Java JDK 1.8

HOW TO USE

The main purpose of this package is to process very large lists of species names or sequence identifiers and to export large chunks of taxonomic data. The idea is to 1) load the full NCBI taxonomy tree in memory and 2) rapidly query this tree using a list of input queries.

Good approach: Build a list of thousands of species names or NCBI sequence identifiers as a text file. Then call one of the functions of this package ONCE.

Bad approach: Call the functions of this package at each iteration of a bash loop. This will be very slow, because the full NCBI tree is loaded in memory again at each iteration!
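The load-once, query-many idea can be sketched as follows. This is only an illustration under stated assumptions: the class LoadOnceQueryMany, its methods and the tiny hard-coded tree (with intermediate ranks omitted) are hypothetical and are not the package's actual code, which loads the tree from the SQL database.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; not a class of this package. */
final class LoadOnceQueryMany {

    /** Toy stand-in for the expensive step: builds a child -> parent map. */
    static Map<Integer, Integer> loadTree() {
        Map<Integer, Integer> parentOf = new HashMap<>();
        // Simplified toy tree: intermediate ranks are deliberately omitted.
        parentOf.put(9606, 9605);   // Homo sapiens -> Homo
        parentOf.put(9605, 9604);   // Homo -> Hominidae
        parentOf.put(9598, 9596);   // Pan troglodytes -> Pan
        parentOf.put(9596, 9604);   // Pan -> Hominidae
        return parentOf;
    }

    /** Walks parent links from a taxid up to the top of the toy tree. */
    static List<Integer> lineage(Map<Integer, Integer> parentOf, int taxid) {
        List<Integer> path = new ArrayList<>();
        Integer current = taxid;
        while (current != null) {
            path.add(0, current);              // prepend so the top node comes first
            current = parentOf.get(current);   // null once the top is reached
        }
        return path;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> tree = loadTree();   // cost paid once
        int[] queries = {9606, 9598};              // every query reuses the same tree
        for (int taxid : queries) {
            System.out.println(taxid + " -> " + lineage(tree, taxid));
        }
    }
}
```

Calling the program once per query from a shell loop would repeat the loadTree() step every time, which is the slow pattern described above.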

NCBI Taxonomy operations

Available operations

  • ScientificNamesToLineages : From a list of scientific names (written in a file, one name per line), extract the corresponding NCBI lineages.

  • TaxidToLineage : Extract the lineage of a single taxonomic id.

  • TaxidToSubTreeLeavesLineages : Using the taxonomic id of an internal node of NCBI Taxonomy (for instance 7041, the order Coleoptera), extract the lineages of every species belonging to the subtree of this node (with 7041, the lineages of every Coleopteran species).

  • IdentifiersToLineages : From a list of NCBI GI or ACCESSION identifiers (written in a file, one identifier per line), extract the corresponding NCBI lineages. WARNING: queries will be fast only IF index_accession2taxid was set to 1 (default is 0) during installation. If not, you can index the corresponding column of this table later (the SQL lines are in database_schema.sql).

  • GenerateTaxonomyDelimitationForBlast : You may need a copy of an NCBI Blast database restricted to a particular clade. For instance, you may have downloaded the Nematodes Blast database but only be interested in C. elegans sequences. The command blastdb_aliastool -db nematode_mrna -gilist c_elegans_mrna.gi can build the corresponding Blast index (see the NCBI documentation), BUT the tedious part is building the gilist that targets every single C. elegans sequence. This operation does exactly that: from a taxonomic id, it extracts every gi number associated with the subtree, so that you can later build a Blast database focused on that clade and speed up your Blast searches. WARNING: queries will be fast only IF index_index_gi_taxid_nucl or index_index_gi_taxid_prot were set to 1 during installation (default is 1). If not, you can index the corresponding column of this table later (the SQL lines are in database_schema.sql).

  • More to come ...
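To illustrate what a subtree operation such as TaxidToSubTreeLeavesLineages conceptually involves, here is a minimal sketch of collecting every leaf below an internal node. It assumes the tree is available as a child -> parent map. The class SubtreeLeaves, its methods and all ids other than 7041 are invented for the example and do not reflect the package's internals.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; not a class of this package. */
final class SubtreeLeaves {

    /** Toy tree: 7041 is the example node; all other ids are invented. */
    static Map<Integer, Integer> toyTree() {
        Map<Integer, Integer> parentOf = new HashMap<>();
        parentOf.put(10, 7041);    // invented internal node
        parentOf.put(20, 7041);    // invented internal node
        parentOf.put(101, 10);     // invented leaf
        parentOf.put(102, 10);     // invented leaf
        parentOf.put(201, 20);     // invented leaf
        return parentOf;
    }

    /** Inverts a child -> parent map into parent -> children. */
    static Map<Integer, List<Integer>> childrenOf(Map<Integer, Integer> parentOf) {
        Map<Integer, List<Integer>> children = new HashMap<>();
        for (Map.Entry<Integer, Integer> e : parentOf.entrySet()) {
            children.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return children;
    }

    /** Iterative depth-first search collecting the nodes that have no children. */
    static List<Integer> leavesBelow(Map<Integer, List<Integer>> children, int root) {
        List<Integer> leaves = new ArrayList<>();
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            int node = stack.pop();
            List<Integer> kids = children.get(node);
            if (kids == null) {
                leaves.add(node);           // no children: this is a leaf
            } else {
                kids.forEach(stack::push);  // visit the children later
            }
        }
        Collections.sort(leaves);           // stable order for display
        return leaves;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> children = childrenOf(toyTree());
        System.out.println(leavesBelow(children, 7041));
    }
}
```

An explicit stack is used instead of recursion so that very deep or very large subtrees do not risk a stack overflow.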

Calling an operation

java -cp NCBITaxonomy.jar op.[operation_name] 

For instance:

java -cp NCBITaxonomy.jar op.TaxidToLineage --help

Will show the usage of this operation:

Usage: TaxidToLineage [-hrV] [-f=[1|2]] [-o=<out>] -t=int
Extract NCBI lineage from a NCBI taxid.
  -f, --format=[1|2]   Format used to output ranks:
                       1 = 'Homo[Genus];sapiens[Species]'
                       2 = 'Homo;sapiens' (line 1)
                            'genus;species' (line 2)
  -h, --help           Show this help message and exit.
  -o, --out=<out>      Output results in file instead of stdout.
  -r, --ranks          Add rank names to scientific names.
  -t, --taxid=int      The taxonomic id.
  -V, --version        Print version information and exit.
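The two output formats described by the -f option can be mimicked with the sketch below. The Taxon type and the LineageFormats class are hypothetical, and the exact capitalisation of rank names in the tool's real output may differ from this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical illustration of the two rank formats; not the package's code. */
final class LineageFormats {

    /** One lineage entry: a scientific name and its rank (assumed structure). */
    static final class Taxon {
        final String name;
        final String rank;

        Taxon(String name, String rank) {
            this.name = name;
            this.rank = rank;
        }
    }

    /** Format 1: each name is followed by its rank in brackets, on a single line. */
    static String formatOne(List<Taxon> lineage) {
        return lineage.stream()
                .map(t -> t.name + "[" + t.rank + "]")
                .collect(Collectors.joining(";"));
    }

    /** Format 2: names on the first line, ranks on the second line. */
    static String formatTwo(List<Taxon> lineage) {
        String names = lineage.stream().map(t -> t.name).collect(Collectors.joining(";"));
        String ranks = lineage.stream().map(t -> t.rank).collect(Collectors.joining(";"));
        return names + "\n" + ranks;
    }

    public static void main(String[] args) {
        List<Taxon> lineage = Arrays.asList(
                new Taxon("Homo", "genus"),
                new Taxon("sapiens", "species"));
        System.out.println(formatOne(lineage));
        System.out.println(formatTwo(lineage));
    }
}
```

Format 2 keeps names and ranks in separate, aligned columns, which is convenient when the output is later split on ';' by another program.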

Available operations can be listed by typing the following line in a terminal and pressing the TAB key twice with the cursor immediately after the last dot:

java -cp NCBITaxonomy.jar op.

If your system is correctly configured for Java autocompletion, you should see a list of all available operations (op).

op.DBConnectionTest
op.GenerateTaxonomyDelimitationForBlast
op.IdentifiersToLineages
op.ScientificNamesToLineages
op.TaxidToLineage
op.TaxidToSubTreeLeavesLineages

Installation

The installation process is done in 4 steps:

  1. Configure the header of 'update_taxonomy.sh', then execute it to download the NCBI taxonomy dumps and copy them to a new database in your SQL server (requires ADMIN or COPY rights for your DBMS user/role).
  2. Optional: Create a user granted SELECT permissions on the created database.
  3. Write the database credentials of this user in a database.properties file (see below).
  4. Use the Java package to query your new NCBI Taxonomy database via different operations (see below).

Step 1: update_taxonomy.sh

Just edit the ### SCRIPT CONFIG section. Some important points:

  • The database user set in this file MUST have COPY rights to load data into the database. In recent versions of Postgres, this can be done by granting this user the predefined roles for server-side file access (pg_read_server_files for COPY FROM, pg_write_server_files for COPY TO).
  • index_* options are intended to avoid the creation of giant indexes that are not necessarily useful to your applications. In particular, setting index_accession2taxid=0 avoids indexing the corresponding table, which contains about 2 billion rows (March 2020); the index would exceed 100 GB, while the database itself is smaller than that!
  • step_* options are only useful if one of the steps fails and you need to relaunch the script. For instance, they avoid downloading the NCBI taxonomy dumps again when the script failed in a later step.
  • By default, the name of the created database follows the pattern ncbi_taxonomy_YYYY_MM_DD, where YYYY_MM_DD is the year, month and day in numeric characters. This can be changed by replacing dbname="" with a non-empty string.
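If you need to compute the default database name from Java (for example to fill in the JDBC URL for a database created on a known date), the naming pattern can be reproduced as in this sketch. The DatabaseName class is hypothetical; the actual name is generated by update_taxonomy.sh, not by this code.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper reproducing the ncbi_taxonomy_YYYY_MM_DD pattern. */
final class DatabaseName {

    // Underscores are non-letter characters, so they are emitted literally.
    private static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern("yyyy_MM_dd");

    /** Builds the default database name for the given creation day. */
    static String defaultName(LocalDate day) {
        return "ncbi_taxonomy_" + day.format(STAMP);
    }

    public static void main(String[] args) {
        System.out.println(defaultName(LocalDate.of(2020, 3, 12)));
    }
}
```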

Step 2: (Optional) NCBI taxonomy user

It can be useful to create a user dedicated to this new database, in particular when someone who is not supposed to modify your SQL databases just wants to query the NCBI taxonomy. In the database prompt, with a role holding 'CREATE USER' rights, run the following:

CREATE USER taxonomypublic WITH PASSWORD 'taxonomypublic' ;
GRANT CONNECT ON DATABASE ncbi_taxonomy_YYYY_MM_DD TO taxonomypublic ;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO taxonomypublic ;

Step 3: database.properties file

Edit the file database.properties to set valid database credentials. The Java code requires this file to connect to the database. You can either use your own database user or a dedicated user as shown in step 2. By default, the Java code will look for database credentials in a database.properties file in the current directory.

# driver
jdbc.drivers=org.postgresql.Driver
# taxonomy DB address
jdbc.url=jdbc:postgresql://ip:port/ncbi_taxonomy_xxxx_xx_x
# login
jdbc.username=taxonomypublic
# password
jdbc.password=taxonomypublic

Moreover, if you plan to use something other than PostgreSQL (MySQL, Oracle ...), do not forget to change the driver accordingly.

By default, this file is looked for in the directory that contains the jar file. For all operations, this behaviour can be changed by pointing to a particular property file with option -d .
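For readers unfamiliar with .properties files, the sketch below shows how such a file is typically read with the standard java.util.Properties class. The DatabaseConfig class is hypothetical and the sample values are placeholders; how the package itself parses the file is not stated here. Note that java.util.Properties only treats '#' as a comment marker at the start of a line, so text placed after a value on the same line becomes part of that value.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical illustration of reading JDBC settings; not the package's code. */
final class DatabaseConfig {

    /** Parses key=value pairs in the standard .properties syntax. */
    static Properties load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Placeholder content; in practice a FileReader on database.properties would be used.
        String sample =
                "# full-line comments are ignored\n"
              + "jdbc.drivers=org.postgresql.Driver\n"
              + "jdbc.url=jdbc:postgresql://127.0.0.1:5432/ncbi_taxonomy_2020_03_12\n"
              + "jdbc.username=taxonomypublic\n"
              + "jdbc.password=taxonomypublic\n";
        Properties props = load(new StringReader(sample));
        System.out.println("driver: " + props.getProperty("jdbc.drivers"));
        System.out.println("url:    " + props.getProperty("jdbc.url"));
        System.out.println("user:   " + props.getProperty("jdbc.username"));
        // With a JDBC driver on the classpath, a connection could then be opened with
        // java.sql.DriverManager.getConnection(url, user, password).
    }
}
```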

Step 4: Java compilation

Install the Java JDK and Gradle if not already done.

# install the Java JDK and Gradle for compilation
sudo apt-get update
sudo apt-get install openjdk-8-jdk
sudo apt-get install gradle

Compile the sources. To use another JDBC driver (MySQL, Oracle ...), edit build.gradle to add the corresponding driver to the dependencies.

git clone https://github.com/blinard-BIOINFO/NCBITaxonomy.git
cd ./NCBITaxonomy
gradle build && gradle clean

Quick test. The command help should appear.

java -cp NCBITaxonomy-0.1.0.jar op.TaxidToLineage --help

Quick connection test. If your setup is correct, you should see output like the following:

java -cp NCBITaxonomy-0.1.0.jar op.DBConnectionTest

Testing postgres database connection...
connected on: jdbc:postgresql://127.0.0.1:5432/ncbi_taxonomy_2020_03_12
with user: taxonomypublic

License

This code is distributed under the MIT License.