Skip to content

Commit

Permalink
Merge pull request #2188 from merenlab/update-anvi-script-filter-hmm-…
Browse files Browse the repository at this point in the history
…paramteters

Update anvi-script-filter-hmm-hits-table parameters
  • Loading branch information
mschecht authored Dec 12, 2023
2 parents 1cb6665 + 5234230 commit 7d366cd
Show file tree
Hide file tree
Showing 4 changed files with 48 additions and 9 deletions.
12 changes: 6 additions & 6 deletions anvio/docs/programs/anvi-script-filter-hmm-hits-table.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@ This program allows you to remove low quality HMM alignments from a %(hmm-source

## Filter with HMM alignment parameters

Similar to query coverage in BLAST, we can also use HMM alignment coverage to help determine if an hmm-hit is homologous. A small coverage value means only a small proportion of the query/target is aligning. Before anvi'o can filter out %(hmm-hits)s with alignment coverage, you must run %(anvi-run-hmms)s and report a domain hits table by including `--domain-hits-table` flag in your command:
Similar to query coverage in BLAST, we can also use HMM alignment coverage to help determine if an hmm-hit is homologous. A small alignment coverage value means only a small proportion of the query/target is aligning. Before anvi'o can filter out %(hmm-hits)s with alignment coverage, you must run %(anvi-run-hmms)s and report a domain hits table by including `--domain-hits-table` flag in your command. This will write the [domtblout](http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf) file from hmmsearch:

{{ codestart }}
anvi-run-hmms -c %(contigs-db)s \
Expand All @@ -11,15 +11,16 @@ anvi-run-hmms -c %(contigs-db)s \
--domain-hits-table
{{ codestop }}

After the command above, your HMM hits will be stored in your %(contigs-db)s as usual. However, with the domain hits table, you can filter out hits from your %(contigs-db)s using thresholds for model or gene coverage of each hit i.e. you can filter out %(hmm-hits)s where the profile HMM and gene align well to each other.
After the command above, your %(hmm-hits)s will be stored in your %(contigs-db)s as usual. However, with the domain hits table, you can filter out hits from your %(contigs-db)s using thresholds for `--min-model-coverage` or `--min-model-coverage` of each hit i.e. you can filter out %(hmm-hits)s where the profile HMM and gene align well to each other.

For example, following the command above, the command below will remove %(hmm-hits)s from your %(contigs-db)s for profile HMMs that had less than 90%% coverage of the target genes:
For example, following the command above, the command below will remove %(hmm-hits)s from your %(contigs-db)s for profile HMMs that had less than 90%% model coverage and 50%% gene coverage:

{{ codestart }}
anvi-script-filter-hmm-hits-table -c %(contigs-db)s \
--hmm-source Bacteria_71 \
--domain-hits-table path/to/dir/hmm.domtable \
--min-model-coverage 0.9
--min-model-coverage 0.9 \
--min-gene-coverage 0.5
{{ codestop }}

### HMMs with multiple hits to one gene
Expand All @@ -30,7 +31,6 @@ Some HMM profiles align multiple times to the same gene at different coordinates
anvi-script-filter-hmm-hits-table -c %(contigs-db)s \
--hmm-source Bacteria_71 \
--domain-hits-table path/to/dir/hmm.domtable \
--min-model-coverage 0.9 \
--merge-partial-hits-within-X-nts
{{ codestop }}

Expand All @@ -39,7 +39,7 @@ The input domtblout file for %(anvi-script-filter-hmm-hits-table)s will be saved

## Filter out hmm-hits from partial genes

HMMs are able to detect partial genes (i.e., genes that are not partial and that start with a start codon and end with a stop codon) with good alignment coverage and homology statistics. However, partial genes can lead to spurious phylogenetic branches and/or inflate the number of observed populations or functions in a given set of genomes/metagenomes. Using `--filter-out-partial-gene-calls`, you can remove partial gene hmm-hits.
HMMs are able to detect partial genes (i.e., genes that do not contain start and/or stop codons) with good alignment coverage and homology statistics. However, partial genes can lead to spurious phylogenetic branches and/or inflate the number of observed populations or functions in a given set of genomes/metagenomes. Using `--filter-out-partial-gene-calls`, you can remove partial gene hmm-hits.

{{ codestart }}
anvi-script-filter-hmm-hits-table -c %(contigs-db)s \
Expand Down
24 changes: 23 additions & 1 deletion anvio/docs/workflows/ecophylo.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,11 +28,33 @@ anvi-run-workflow -w ecophylo \

### HMM alignment coverage filtering

The first step to removing bad %(hmm-hits)s is to filter out hits with low quality alignment coverage. This is done with the rule `filter_hmm_hits_by_model_coverage` which leverages %(anvi-script-filter-hmm-hits-table)s. We recommend 80%% model coverage filter for most cases. However, it is always recommended to explore the distribution of model coverage with any new HMM which will help you determine a proper cutoff (citation). To adjust this parameter, go to the `filter_hmm_hits_by_model_coverage` rule and change the parameter `--min-model-coverage`.
The first step to removing bad %(hmm-hits)s is to filter out hits with low quality alignment coverage. This is done with the rule `filter_hmm_hits_by_model_coverage` which leverages %(anvi-script-filter-hmm-hits-table)s. This tool uses the output of hmmsearch to filter out hits basedon the model and/or gene coverage. We recommend 80%% model coverage filter for most cases. However, it is always recommended to explore the distribution of model coverage with any new HMM which will help you determine a proper cutoff (citation). To adjust this parameter, go to the `filter_hmm_hits_by_model_coverage` rule and change the parameter `--min-model-coverage`. You can also adjust the gene coverage by change the parameter `--min-gene-coverage`. This can help remove ORFs with outlier lengths but completely depends on the HMM you are using.

{:.notice}
Please consider exploring the distribution of alignment coverages before choosing HMM alignment coverage filtering values. [Interproscan](https://www.ebi.ac.uk/interpro/) is a great way to visualize how publicly available HMMs align to proteins. Additionally, you can parse the domtblout files from hmmsearch to explore these values in high throughput.

```bash
{
"filter_hmm_hits_by_model_coverage": {
"threads": 5,
"--min-model-coverage": 0.8,
"--min-gene-coverage": 0.5,
"additional_params": ""
},
}
```

{:.notice}
Some full gene length HMM models align to a single hmm-hit independently at different coordinates when there should only be one annotation. To merge these independent alignment into one HMM alignment coverage stat, set `--merge-partial-hits-within-X-nts` to any distance between the hits for which you would like to merge and add it to the rule `filter_hmm_hits_by_model_coverage` under `additional_params`.

```bash
{
"filter_hmm_hits_by_model_coverage": {
"additional_params": "--merge-partial-hits-within-X-nts"
},
}
```

### conservative-mode: complete open-reading frames only

Genes predicted from genomes and metagenomes can be partial or complete depending on whether a stop and stop codon is detected. Even if you filter out %(hmm-hits)s with bad alignment coverage as discussed above, HMMs can still detect low quality hits with good alignment coverage and homology statistics due to partial genes. Unfortunately, partial genes can lead to spurious phylogenetic branches and/or inflate the number of observed populations or functions in a given set of genomes/metagenomes.
Expand Down
10 changes: 9 additions & 1 deletion anvio/tests/run_component_tests_for_metagenomics.sh
Original file line number Diff line number Diff line change
Expand Up @@ -173,14 +173,22 @@ anvi-run-hmms -c $output_dir/CONTIGS.db \
--no-progress \
$thread_controller

INFO "Filtering hmm_hits using target coverage"
INFO "Filtering hmm_hits using query coverage"
anvi-script-filter-hmm-hits-table -c $output_dir/CONTIGS.db \
--domain-hits-table $output_dir/hmm.domtable \
--hmm-source Bacteria_71 \
--min-model-coverage 0.9 \
--no-progress \
--filter-out-partial-gene-calls

INFO "Filtering hmm_hits using target coverage"
anvi-script-filter-hmm-hits-table -c $output_dir/CONTIGS.db \
--domain-hits-table $output_dir/hmm.domtable \
--hmm-source Bacteria_71 \
--min-gene-coverage 0.5 \
--no-progress \
--filter-out-partial-gene-calls

INFO "Listing all available HMM sources in the contigs database"
anvi-delete-hmms -c $output_dir/CONTIGS.db \
--list \
Expand Down
11 changes: 10 additions & 1 deletion sandbox/anvi-script-filter-hmm-hits-table
Original file line number Diff line number Diff line change
Expand Up @@ -109,7 +109,7 @@ class FilterHmmHitsTable(object):
filesnpaths.is_file_exists(self.domtblout)
self.run.info("Domtblout Path", self.domtblout)

if not self.min_model_coverage or self.min_gene_coverage:
if not self.min_model_coverage and not self.min_gene_coverage:
raise ConfigError("You didn't provide anvi-script-filter-hmm-hits-table with either a "
"--min-model-coverage or --min-gene-coverage. Please provide at least one "
"so anvi'o can filter hmm_hits for you :)")
Expand All @@ -130,6 +130,15 @@ class FilterHmmHitsTable(object):
raise ConfigError(f"--min-gene-coverage must be a percentage between 0 and 1 "
f"and you put a a value larger than 100%: {self.min_gene_coverage}")

if self.min_gene_coverage and self.min_model_coverage:
self.run.info("Minimum gene coverage", self.min_gene_coverage)
self.run.info("Minimum model coverage", self.min_model_coverage)
else:
if self.min_gene_coverage:
self.run.info("Minimum gene coverage", self.min_gene_coverage)
if self.min_model_coverage:
self.run.info("Minimum model coverage", self.min_model_coverage)

if not self.hmm_source:
raise ConfigError("Please provide a hmm-source :)")

Expand Down

0 comments on commit 7d366cd

Please sign in to comment.