If you can let me know when you've downloaded it, that would be great, so I can make more space in the Google Drive.
For context, the genome being annotated is publicly available here: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_952773005.1/ . RepeatMasker was run as part of EarlGrey, with the RepeatModeler library being curated using EarlGrey/TEstrainer's BLAST, Extend, Align, Trim algorithm. The scaffold names differ between the publicly available data and the .out format due to this, with three scaffolds in the NCBI genome (OX731680.1, OX731681.1, and OX731682.1) being renamed ctg_1, ctg_2, and ctg_3 respectively. Let me know if you need any more info.
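For anyone cross-referencing the .out coordinates against the NCBI assembly, the renaming described above can be written as a small lookup. This is only a sketch: the three pairs come from the post, while the names `NCBI_TO_OUT`, `OUT_TO_NCBI`, and `to_ncbi_name` are hypothetical, and any scaffold not listed here is returned unchanged because its mapping is not given.

```python
# Renaming stated in the post: NCBI accession -> name used in the .out file.
NCBI_TO_OUT = {
    "OX731680.1": "ctg_1",
    "OX731681.1": "ctg_2",
    "OX731682.1": "ctg_3",
}

# Reverse lookup: .out name -> NCBI accession.
OUT_TO_NCBI = {out_name: acc for acc, out_name in NCBI_TO_OUT.items()}


def to_ncbi_name(out_name: str) -> str:
    """Return the NCBI accession for a .out scaffold name.

    Names outside the three listed pairs are passed through unchanged,
    since the post does not say how they were renamed.
    """
    return OUT_TO_NCBI.get(out_name, out_name)


print(to_ncbi_name("ctg_2"))
```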
Thank you for the great info to reproduce this. I have downloaded your file and will experiment with it. My suspicion is that this is an artifact of not having a curated library where both the mosaic satellite and the Penelope family are represented. Typically RepeatMasker will join significantly overlapping alignments of the same family into one annotation (accounting for minor subfamily differences, or local tandem duplications). This is a rare case, though, and there must not be a limit set on this joining process. I will see if I can add a fix for this in the next release.
I'm in the process of annotating the genome of Icerya purchasi (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_952773005.1/) and have come across a large problem.
The final RepeatMasker annotation .out and .gff files have an 80kbp stretch of the genome annotated as a contiguous repeat. The original consensus sequence is 9kbp, so I'm a little confused as to why the sequence is collapsed into one rather than kept as separate hits. Viewing the portion of the genome annotated as this repeat, it appears a Penelope element has undergone tandem duplication numerous times, resulting in it effectively resembling a satellite.
Is this the desired output? I would have expected several lines, each corresponding to one of the individual hits found using BLAST. I've pasted the example line from the .out below.
73691 0.1 0.0 0.0 ctg_2 328150582 328230797 (169823688) C rnd-4_family-3742 LINE/Penelope (0) 9199 1 1685343
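To make the mismatch concrete, here is a minimal Python sketch that parses the line above and compares the annotated genomic span with the consensus span. It assumes the standard whitespace-separated RepeatMasker .out column order, in which complement-strand ("C") hits list the repeat columns as (left), end, begin. The `RMHit` class and `parse_out_line` function are hypothetical helpers for illustration, not part of RepeatMasker.

```python
from dataclasses import dataclass


@dataclass
class RMHit:
    score: int
    seq: str
    q_start: int
    q_end: int
    strand: str
    repeat: str
    rep_class: str
    r_start: int
    r_end: int


def parse_out_line(line: str) -> RMHit:
    """Parse one data line of a RepeatMasker .out file (assumed column order)."""
    f = line.split()
    strand = f[8]
    if strand == "C":
        # Complement strand: repeat columns are (left), end, begin.
        r_end, r_start = int(f[12]), int(f[13])
    else:
        # Forward strand: repeat columns are begin, end, (left).
        r_start, r_end = int(f[11]), int(f[12])
    return RMHit(int(f[0]), f[4], int(f[5]), int(f[6]), strand,
                 f[9], f[10], r_start, r_end)


LINE = ("73691 0.1 0.0 0.0 ctg_2 328150582 328230797 (169823688) C "
        "rnd-4_family-3742 LINE/Penelope (0) 9199 1 1685343")

hit = parse_out_line(LINE)
genomic_span = hit.q_end - hit.q_start + 1      # bases annotated in the genome
consensus_span = hit.r_end - hit.r_start + 1    # bases of consensus covered
print(genomic_span, consensus_span, round(genomic_span / consensus_span, 2))
# prints: 80216 9199 8.72
```

The single annotation therefore covers roughly 8.7 consensus lengths, which is consistent with several tandem copies having been joined into one line.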