Taxonomy miscongruence

Vyacheslav Brover edited this page Nov 3, 2022 · 22 revisions

Installation

To compute the taxonomy miscongruence of a tree, the NCBI taxonomy must be installed into an SQL Server database:

$TT/ncbitax/load.sh <Server> <Database> NCBI

Here $TT is the installation directory defined in the section Installation.
This command will install the NCBI taxonomy on the server <Server>, database <Database>, schema NCBI.
The user must have DDL permissions in the database, including CREATE SCHEMA.

Checking inconsistent taxonomy ranks

$TT/ncbitax/integrity.sh <Server> <Database> NCBI

The tax_ids printed by this script have inconsistent taxonomic lineages.

Test

This SQL command will print the encoded lineage for Aquabacterium pictum:

EXEC NCBI.tax2phen 2315236, 0;

Directory phen/

Let the file genome.tab contain two columns: NCBI assembly id and NCBI taxid:

$ head -3 genome.tab
3378    441960
6778    5762
7048    423536

Create a directory phen/ with encoded taxonomic lineages for each assembly:

mkdir phen
$TT/trav genome.tab -threads 10 "sqsh-ms -S <Server> -D <Database>  -w 1024 \
   -C %QNCBI.tax2phen %2, 0%Q | grep -v '(return status = 0)' > phen/%1"

This directory is used in the sections below.
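Before using phen/, it may be worth checking that every assembly in genome.tab received a non-empty lineage file. The following is a minimal sketch, not part of the package: the function name missing_phen is hypothetical, and it assumes that genome.tab is whitespace-separated with the assembly id in the first column, as shown above.

```shell
# Hypothetical helper, not part of the package.
# Prints every assembly id from genome.tab (column 1) whose file
# in the phen directory is missing or empty.
# $1: path to genome.tab, $2: path to the phen directory
missing_phen() {
  while read -r asm taxid || [ -n "$asm" ]; do
    # -s is true only if the file exists and has size > 0
    [ -s "$2/$asm" ] || printf '%s\n' "$asm"
  done < "$1"
}

# Usage: missing_phen genome.tab phen
```

An empty output means that all lineage files are present. Any id that is printed can be rerun through the command above.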

The name "phen" means "phenotypes".
These are Boolean or nominal attributes of the tree objects.
These attributes are not used in tree building.

Computation of taxonomy miscongruence

Suppose a tree file tree created by makeDistTree contains assemblies in phen/.
Solve the maximum parsimony problem for each taxonomic class present in the assemblies of the tree:

$TT/phylogeny/tree_quality_phen.sh tree "" phen 0 1 ""

The screen output contains a line which looks like:

# Non-monophyletic disagreements: 19380 (1.57) V !

The number 19380 is the taxonomy miscongruence.
The symbol "V" means that lower values are better.
The symbol "!", which is the last character on the line, makes the line easy to find with grep.
The value 1.57 in parentheses equals the taxonomy miscongruence divided by the number of assemblies in the tree.
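If the screen output is saved or piped, the two numbers can be extracted from this line. The following is a minimal sketch: the function name miscongruence is hypothetical, and it assumes that the line has exactly the layout shown above, with "!" as the last character.

```shell
# Hypothetical helper, not part of the package.
# Reads the screen output of tree_quality_phen.sh on stdin and prints
# "<miscongruence> <miscongruence per assembly>".
miscongruence() {
  grep '!$' |
    sed -n 's/^# Non-monophyletic disagreements: \([0-9]*\) (\([0-9.]*\)).*/\1 \2/p'
}

# Example with the line shown above:
printf '%s\n' '# Non-monophyletic disagreements: 19380 (1.57) V !' | miscongruence
```

For the example line this prints "19380 1.57". Dividing 19380 by 1.57 suggests that the example tree has on the order of 12,300 assemblies.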

Comparing two trees by taxonomy miscongruence

Suppose two trees of assemblies are stored in the files tree1 and tree2, created by makeDistTree or converted to the makeDistTree format, and these assemblies are in phen/.
Then this command will create two intersection trees and compute the taxonomy miscongruence for each of them:

$TT/phylogeny/tree2_quality_phen.sh tree1 tree2 phen 0 ""

Assigning taxonomic names to interior nodes

Suppose a tree file tree created by makeDistTree contains assemblies in phen/.
Solve the maximum parsimony problem for each taxonomic class present in the assemblies of the tree:

$TT/phylogeny/tree2names.sh tree phen 0

As a result, a file gain_nodes is created, where each line has the format

<asm1>:<asm2> <taxonomic name>

which means that <taxonomic name> is gained at the least common ancestor of assemblies <asm1> and <asm2>.
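For downstream processing, gain_nodes can be split into separate columns. The following is a minimal sketch: the function name gain_nodes_split is hypothetical, and it assumes that assembly ids contain neither ":" nor whitespace and that the taxonomic name follows the first run of whitespace.

```shell
# Hypothetical helper, not part of the package.
# Converts each line "<asm1>:<asm2> <taxonomic name>" of the file $1
# into "<asm1><TAB><asm2><TAB><taxonomic name>".
gain_nodes_split() {
  # read puts the first field into pair and the rest of the line,
  # including internal spaces, into name
  while read -r pair name || [ -n "$pair" ]; do
    printf '%s\t%s\t%s\n' "${pair%%:*}" "${pair#*:}" "$name"
  done < "$1"
}

# Usage: gain_nodes_split gain_nodes
```

The tab-separated output keeps multi-word taxonomic names in a single column, so it can be filtered with cut or awk -F '\t'.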

If an incremental distance tree directory has the subdirectory phen/ which is a link to a directory with encoded taxonomic lineages for each reservoir object, then the taxonomy miscongruence for the initial tree can be computed by

$TT/phylogeny/distTree_inc_tree1_quality.sh

and this script will be invoked by distTree_inc.sh and distTree_inc_delete.sh.