-
Notifications
You must be signed in to change notification settings - Fork 10
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge branch 'master' of https://github.com/beast-dev/doc
- Loading branch information
Showing
28 changed files
with
1,725 additions
and
27 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,75 @@ | ||
--- | ||
title: "Some notes on pattern compression and speeding up BEAST analysis" | ||
categories: [release] | ||
permalink: 2024-07-17_pattern_compression.html | ||
sidebar: beast_sidebar | ||
tags: [news] | ||
--- | ||
|
||
### Pattern compression for fun and profit | ||
|
||
by Andrew Rambaut & Philippe Lemey | ||
|
||
To calculate the likelihood of a tree given a sequence alignment, the log likelihood of each site (column in the alignment) is evaluated independently and then summed across all sites in the alignment. The likelihood of any sites that have exactly the same nucleotides for each taxon will be the identical (assuming they have been assigned the same substitution model and rate). So, in practice, BEAST 'compresses' the alignment into a set of unique patterns of nucleotides (referred to as just 'patterns') and the count of how many times they occur. Thus, the likelihood for the alignment is the sum of the likelihood for each pattern times their counts. Obviously, this trick provides considerable speed improvements over just naively calculating the likelihood for each site independently and is used by most (all?) phylogenetic software. | ||
|
||
In the latest version (v10.5.0), BEAUti now shows the number of unique site patterns for a loaded alignment as well as the original length of the alignment: | ||
|
||
{% include image.html width="320" prefix="" file="news/Patterns_BEAUti.png" caption="Figure 1 | Pattern count in BEAUti data partition table." %} | ||
|
||
In this example using the file `WNV.fasta` from the `examples/Data` folder shows that this alignment of West Nile Virus genomes is compressed from over 11,000 nucleotides to only 727 unique site patterns - a compression of over 15 times with a commensurate improvement in likelihood calculation speed). The less variable the sequences in the alignment are, the greater the level of compression will be. This is because almost all the compression comes from the 'constant' sites that have the same nucleotide for every taxon. Other sites with nucleotide changes are unlikely to share exactly the same pattern with another such that they can be compressed. | ||
|
||
Here is the top 15 most common site patterns for the `WNV.fasta` alignment: | ||
|
||
| **count** | **pattern** | | ||
|-----------|--------------------------------------------------------------------------------------------------------| | ||
| 2821 | `GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG` | | ||
| 2635 | `AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA` | | ||
| 2025 | `TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT` | | ||
| 1963 | `CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC` | | ||
| 60 | `NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAANAAAAAAAAAAAAAAA` | | ||
| 56 | `NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGNGGGGGGGGGGGGGGG` | | ||
| 49 | `NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCNCCCCCCCCCCCCCCC` | | ||
| 35 | `NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG` | | ||
| 30 | `NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC` | | ||
| 26 | `NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA` | | ||
| 24 | `NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTNTTTTTTTTTTTTTTT` | | ||
| 21 | `NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT` | | ||
| 11 | `NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAANNNNAAAAAAAAAAAAAAA` | | ||
| 11 | `NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTNNNTTTTTTTTTTTTTTTT` | | ||
| 10 | `NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCNNNNCCCCCCCCCCCCCCC` | | ||
|
||
The [full table can be seen here](wnv_pattern_table.html). The first observation (as expected) is that the constant sites are the 4 most common patterns by a large margin (they make up 9,444 of the 11,029 sites in the alignment. The second is that most of the other patterns that are at high frequency are ones that are constant except for ambiguities (`N`s) in some of the taxa. This is a feature of next generation sequencing where areas of low read coverage are filled in with `N`s. Ambiguities are treated as unknown nucleotides in BEAST and are efficiently dealt with in the likelihood calculations done by [BEAGLE](beagle). | ||
|
||
This second observation offers an opportunity for further compression of the patterns. If we assume that patterns that are constant except for ambiguities are, in fact, constant then we can just include them in the counts for the respective constant patterns. For example, in the table above the 5th pattern comprises only `A`s and `N`s so might be considered as a constant `A` pattern. The 10th and 13th patterns are also constant `A`s but differ in their pattern of `N`s so would also be compressed similarly. Doing this for the `WNV.fasta` alignment goes from 727 unique site patterns down to 604 'ambiguous constant' site patterns – a reduction of 17%. When run in BEAST with some standard models/settings the time drops from 11.2 minutes to 9.9 minutes for a reduction in run-time of 12%. | ||
|
||
For other data sets, a much larger saving can be achieved. Here are a few examples: | ||
|
||
| **Virus** | **Dataset** | **Sequences** | **Sites** | **Unique patterns** | **Compressed ambiguous** | **Factor** | | ||
|--|--------------|--|-----------|--|--------------------------| | ||
| SARS-CoV-2 | B.1.1.7 | 976 | 29409 | 2079 | 918 | 2.3x | | ||
| SARS-CoV-2 | omicron_BA1 | 1000 | 29409 | 1528 | 485 | 3.2x | | ||
| Ebolavirus | Makona_1610 | 1610 | 18992 | 7926 | 2267 | 3.5x | | ||
| Mpox virus | cladei_reservoir | 60 | 196858 | 3701 | 386 | 9.6x | | ||
|
||
In general, we would expect an increase in likelihood evaluation speed to improve by similar factors (with some allowance for overheads). | ||
|
||
In [BEAST X v10.5.0](installing) a new command line option allows this feature to be switched on: | ||
```bash | ||
-pattern_compression ambiguous_constant | ||
``` | ||
|
||
It is important to note that this feature is an approximation to the full likelihood calculation. In testing it appears to be a good approximation with only small differences in the estimated values of the parameters. A second option, `-ambiguous_threshold`, specifies the maximum proportion of nucleotides in the pattern that can be ambiguous and still be considered to be constant. This can help reduce the effect of the approximation whilst still allowing considerable savings in run-time. For example, compressing all constant sites irrespective of the proportion of ambiguous nucleotides produces a drop in the log likelihood of the tree (Figure 2, `all ambiguous constant`). Reducing the `-ambiguous_threshold` to 0.25 (which is the default value if not specified) returns a tree likelihood similar to the normal compression approach (Figure 2, `unique patterns`). | ||
|
||
{% include image.html width="320" prefix="" file="news/Patterns_TreeLikelihood.png" caption="Figure 2 | Log likelihood of the tree for different compression thresholds. 'Uncompressed' is no compression, 'unique patterns' has an ambiguity threshold of zero, `all ambiguous constant` has a threshold of 1.0. " %} | ||
|
||
|
||
| **Compression** | **Threshold** | **Pattern count** | **Run time** | | ||
| `-pattern_compression` | `-ambiguous_threshold` | | | | ||
|--------------------------------------------|----------------------|-------------------|--------------| | ||
| `off` | n/a | 11,029 | | | ||
| `unique` | 0.0 | 727 | 11.2 minutes | | ||
| `ambiguous_constant` | 0.25 | 659 | 8.1 minutes | | ||
| `ambiguous_constant` | 0.5 | 608 | 7.8 minutes | | ||
| `ambiguous_constant` | 1.0 | 604 | 7.3 minutes | | ||
|
||
|
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Oops, something went wrong.