RepeatModeler stuck at rmblastn with a small fungal genome at "Refining 2 families" #251

ishengtsai · 2024-07-21T23:01:49Z

Hi, RepeatModeler has been great since the very first version (I have used it on genomes from 10 Mb to 5 Gb without problems). However, for some odd reason, when annotating this fungal genome (~42 Mb), rmblastn got stuck at round 2. We have tried this with:

rmblastn precompiled 2.14.1
rmblastn compiled from source 2.14.1
rmblastn compiled from source 2.14.0

sh -c /home/ijt/bin/rmblast-2.14.1/bin//rmblastn -num_alignments 9999999 -db /mnt/nas2/ijt/fungi/Ryder_fungi_repeat_test/RM_159647.SunJul211952592024/round-2/family-22-cons-2.fa -query /mnt/nas2/ijt/fungi/Ryder_fungi_repeat_test/RM_159647.SunJul211952592024/round-2/family-22.fa -gapopen 20 -gapextend 5 -mask_level 80 -complexity_adjust -word_size 7 -xdrop_ungap 300 -xdrop_gap_final 150 -xdrop_gap 75 -min_raw_gapped_score 150 -dust no -outfmt="6 score perc_sub perc_query_gap perc_db_gap qseqid qstart qend qlen sstrand sseqid sstart send slen kdiv cpg_kdiv transi transv cpg_sites qseq sseq" -num_threads 1 -mt_mode 1 -matrix comparison.matrix 2>/mnt/nas2/ijt/fungi/Ryder_fungi_repeat_test/RM_159647.SunJul211952592024/round-2/ncResults-1721599401-186349-2050.75563326076.err

I do not know what is causing the problem, so I am attaching the two FASTA files below. Any suggestions would be appreciated.

family-22-cons-2.fa
family-22.fa


rmhubley · 2024-07-22T17:58:19Z

This is a 30 kb TA-rich region. There are some internal tandem repetitions but no overall discernible pattern. Owing to the size (and the several copies present) this may be a satellite region. It's certainly not a TE sequence. But your original question is why it takes so long to align. This is a 30 kb sequence of mostly two bases being searched against a handful of 30 kb sequences with similar composition. At a word size of 7 this really blows up the search space: in essence, almost every position can align to almost any other position.
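To make the search-space point concrete, here is a minimal sketch that counts exact 7-mer matches between two simulated 30 kb sequences, once with balanced base composition and once with a TA-rich composition. It is only an illustration of why seeds become so abundant; it does not reproduce how rmblastn actually seeds or extends alignments, and the 48/2/2/48 base weights and the helper names are assumptions made up for the example, not values measured from the attached files.

```python
import random
from collections import Counter

def kmer_counts(seq, k=7):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def seed_pairs(query, subject, k=7):
    """Number of (query position, subject position) pairs sharing an exact k-mer."""
    q = kmer_counts(query, k)
    s = kmer_counts(subject, k)
    return sum(n * s[w] for w, n in q.items() if w in s)

def random_seq(length, weights, rng):
    """Random DNA string; weights are given in A, C, G, T order."""
    return "".join(rng.choices("ACGT", weights=weights, k=length))

rng = random.Random(0)   # fixed seed so the comparison is repeatable
n = 30_000               # roughly the family length mentioned above

# Balanced composition versus a hypothetical TA-rich composition.
balanced = seed_pairs(random_seq(n, [1, 1, 1, 1], rng),
                      random_seq(n, [1, 1, 1, 1], rng))
ta_rich = seed_pairs(random_seq(n, [48, 2, 2, 48], rng),
                     random_seq(n, [48, 2, 2, 48], rng))

print("balanced seed pairs:", balanced)
print("TA-rich seed pairs: ", ta_rich)
print("ratio:", ta_rich / balanced)
```

Every one of those seed pairs is a candidate starting point for extension, so a skewed composition multiplies the work long before any real homology is involved.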

This is the first time I have seen (or someone has reported) RECON producing such a long low-complexity region as a family. I suppose we could add a low-complexity filter to the families returned by RECON (as is done for RepeatScout). I'll look into this for a future release. Another possibility would be to apply a low-complexity filter to seeding words, as is done in Phil Green's cross_match, although that would be a larger undertaking. In this case, rmblast will finish its work, albeit after a much longer processing time than for the rest of the families. On my machine, running in a single thread, this search took 1 hr 22 minutes.
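As a rough idea of what a family-level low-complexity filter could look like, the sketch below flags a consensus whose single-base Shannon entropy is low. This is not how RepeatModeler, RepeatScout, or cross_match implement their filters; the function names and the 1.5-bit threshold are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def base_entropy(seq):
    """Shannon entropy (bits) of the A/C/G/T composition; 2.0 is the maximum."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def is_low_complexity(seq, max_entropy=1.5):
    """Hypothetical filter: True when composition is dominated by few bases."""
    return base_entropy(seq) <= max_entropy

# A sequence built from only two bases scores about 1 bit and is flagged;
# one using all four bases evenly scores 2 bits and is kept.
print(is_low_complexity("TATTATATAATTTATA" * 10))
print(is_low_complexity("ACGTTGCAGTCAACTG" * 10))
```

Composition alone is a crude measure: it catches a TA-rich stretch like this one, but it would not flag a tandem repeat that happens to use all four bases evenly.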

BTW...thanks for tracking this down to those files and attaching them. That really helped in figuring this one out.

ishengtsai · 2024-07-22T23:33:09Z

Thanks. Looking forward to the future release! For this case, should I remove the contig / region in question and rerun RepeatModeler?

rmhubley · 2024-07-29T18:33:35Z

That would ensure that it didn't happen again. Although, as I said, it should finish given more time.
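For anyone wanting to try the removal route, a minimal sketch of dropping whole records from a FASTA file before rebuilding the database is shown below. The contig name `contig_17` and the function name are placeholders, not names from this genome, and the sketch removes entire contigs rather than masking just the affected region.

```python
def drop_records(fasta_lines, exclude_ids):
    """Yield FASTA lines, skipping records whose ID (first word after '>') is excluded."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = not fields or fields[0] not in exclude_ids
        if keep:
            yield line

# In-memory demonstration; with real files, pass an open file handle instead
# and write the yielded lines to a new FASTA file.
example = [
    ">contig_16 some description\n", "ACGTACGT\n",
    ">contig_17\n", "TATATATTATAT\n",      # placeholder for the TA-rich contig
    ">contig_18\n", "GGCCAATT\n",
]
filtered = list(drop_records(example, {"contig_17"}))
print("".join(filtered), end="")
```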

ishengtsai added the question label Jul 21, 2024
