RepeatModeler stuck at rmblastn with a small fungal genome at "Refining 2 families" #251

Open
ishengtsai opened this issue Jul 21, 2024 · 3 comments

@ishengtsai

Hi, RepeatModeler has been great since the very first version (I have been using it on genomes from 10 Mb to 5 Gb without problems). However, for some odd reason, when annotating this fungal genome (~42 Mb), rmblastn gets stuck at round 2. We have tried this with:

rmblastn precompiled 2.14.1
rmblastn compiled from source 2.14.1
rmblastn compiled from source 2.14.0

sh -c /home/ijt/bin/rmblast-2.14.1/bin//rmblastn -num_alignments 9999999 -db /mnt/nas2/ijt/fungi/Ryder_fungi_repeat_test/RM_159647.SunJul211952592024/round-2/family-22-cons-2.fa -query /mnt/nas2/ijt/fungi/Ryder_fungi_repeat_test/RM_159647.SunJul211952592024/round-2/family-22.fa -gapopen 20 -gapextend 5 -mask_level 80 -complexity_adjust -word_size 7 -xdrop_ungap 300 -xdrop_gap_final 150 -xdrop_gap 75 -min_raw_gapped_score 150 -dust no -outfmt="6 score perc_sub perc_query_gap perc_db_gap qseqid qstart qend qlen sstrand sseqid sstart send slen kdiv cpg_kdiv transi transv cpg_sites qseq sseq" -num_threads 1 -mt_mode 1 -matrix comparison.matrix 2>/mnt/nas2/ijt/fungi/Ryder_fungi_repeat_test/RM_159647.SunJul211952592024/round-2/ncResults-1721599401-186349-2050.75563326076.err

I do not know what is causing the problem, so I am attaching the two FASTA files below. Any suggestions would be appreciated.

family-22-cons-2.fa
family-22.fa

@rmhubley
Member

rmhubley commented Jul 22, 2024

This is a 30 kb TA-rich region. There are some internal tandem repetitions but no overall discernible pattern. Owing to the size (and the several copies present) this may be a satellite region. It's certainly not a TE sequence. But your original question is why it takes so long to align. This is a 30 kb sequence of mostly two bases being searched against a handful of 30 kb sequences with similar composition. At a word size of 7 this really blows up the search space: in essence, almost every position can align to almost any other position.
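To put a rough number on the search-space point, here is a toy sketch (not how rmblastn actually seeds or scores) that counts exact 7-mer matches between two random sequences. The helper names, the 3 kb length, and the purely random two-letter model are all assumptions for illustration; the real family is only "mostly" T and A. For uniformly random sequences the expected number of seed pairs scales as N^2 / a^k for alphabet size a, so a two-letter sequence should give on the order of (4/2)^7 = 128 times more seeds than a four-letter one.

```python
import random
from collections import Counter


def count_seed_hits(query: str, subject: str, k: int = 7) -> int:
    """Count (query position, subject position) pairs that share an exact k-mer."""
    subject_kmers = Counter(subject[i:i + k] for i in range(len(subject) - k + 1))
    # Counter returns 0 for k-mers that never occur in the subject.
    return sum(subject_kmers[query[i:i + k]] for i in range(len(query) - k + 1))


def random_seq(length: int, alphabet: str, rng: random.Random) -> str:
    """Build a random sequence over the given alphabet (toy model only)."""
    return "".join(rng.choice(alphabet) for _ in range(length))


rng = random.Random(0)
n = 3000  # scaled down from ~30 kb so the demo runs quickly

ta_hits = count_seed_hits(random_seq(n, "TA", rng), random_seq(n, "TA", rng))
acgt_hits = count_seed_hits(random_seq(n, "ACGT", rng), random_seq(n, "ACGT", rng))

print(f"TA-only seed pairs:    {ta_hits}")
print(f"4-letter seed pairs:   {acgt_hits}")
```

Every one of those seed pairs is a candidate for extension, which is why the TA-rich case takes so much longer than a typical family.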

This is the first time I have seen (or someone has reported) RECON producing such a long low-complexity region as a family. I suppose we could add a low-complexity filter to the families returned by RECON (as is done for RepeatScout); I'll look into this for a future release. Another possibility would be to apply a low-complexity filter to seeding words, as is done in Phil Green's cross_match, although that would be a larger undertaking. In this case, rmblast will finish its work, albeit after a much longer processing time than for the rest of the families. On my machine, running in a single thread, this search took 1 hour 22 minutes.
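For anyone who wants to screen family files by hand in the meantime, a minimal sketch of a composition-based check follows. This is not the filter RepeatModeler or RepeatScout uses; the function names and the 1.2-bit threshold are assumptions chosen for illustration. It flags sequences whose base composition is close to two letters. Note that composition alone will not catch every kind of low complexity (a perfect ACGT tandem repeat has maximal composition entropy, for instance).

```python
import math
from collections import Counter


def base_entropy(seq: str) -> float:
    """Shannon entropy (bits) of the A/C/G/T composition; 2.0 is the maximum."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def is_low_complexity(seq: str, min_entropy: float = 1.2) -> bool:
    """Flag a sequence whose composition entropy falls below a (hypothetical) threshold."""
    return base_entropy(seq) < min_entropy


# Toy inputs: a pure TA run versus a sequence using all four bases equally.
for name, seq in [("ta_rich", "TA" * 50), ("balanced", "ACGT" * 25)]:
    print(name, round(base_entropy(seq), 3), is_low_complexity(seq))
```

A sequence made of only T and A in equal amounts scores 1.0 bit, so it falls below the threshold, while a balanced four-base sequence scores 2.0 bits.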

BTW...thanks for tracking this down to those files and attaching them. That really helped in figuring this one out.

@ishengtsai
Author

Thanks, looking forward to the future release! For this case, should I remove the contig/region in question and rerun RepeatModeler?

@rmhubley
Member

That would ensure that it didn't happen again. Although, as I said, it should finish given more time.
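If you do go the removal route, a minimal sketch of excising a region from a FASTA record before rebuilding the database might look like the following. The contig name, coordinates, and demo sequence are hypothetical placeholders, not values from this genome, and the parser is a bare-bones one written for this example rather than anything shipped with RepeatModeler.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA text into {name: sequence}, using the first word of each header."""
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def remove_region(seq: str, start: int, end: int) -> str:
    """Cut out the 0-based half-open interval [start, end) from a sequence."""
    start, end = max(0, start), min(len(seq), end)
    if start >= end:
        return seq
    return seq[:start] + seq[end:]


# Hypothetical demo input: contig_1 carries a TA-rich stretch at positions 4..16.
demo = ">contig_1 example\nACGTATATATAT\nATATACGT\n>contig_2\nGGCC\n"
genome = parse_fasta(demo)
genome["contig_1"] = remove_region(genome["contig_1"], 4, 16)

for name, seq in genome.items():
    print(f">{name}\n{seq}")
```

Dropping a whole contig instead is just `del genome[name]` before writing the records back out.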
