80kbp stretch of genome annotated as single repeat #248

jamesdgalbraith · 2024-02-15T13:58:42Z

I'm in the process of annotating the genome of Icerya purchasi (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_952773005.1/) and have come across a large problem.

The final RepeatMasker annotation files (.out and .gff) have an 80kbp stretch of the genome annotated as one contiguous repeat. The original consensus sequence is 9kbp, so I'm a little confused as to why the hits are collapsed into one annotation rather than kept as separate hits. Viewing the portion of the genome annotated as this repeat, it appears that a Penelope element has undergone tandem duplication numerous times, so that it effectively resembles a satellite.

Is this the desired output? I would have expected several lines, with each corresponding to the individual hits found using blast. I've pasted the example line from the .out below.

73691 0.1 0.0 0.0 ctg_2 328150582 328230797 (169823688) C rnd-4_family-3742 LINE/Penelope (0) 9199 1 1685343
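For anyone wanting to check their own output for the same pattern, here is a minimal sketch that parses the line above and compares the genomic span with the consensus length. It assumes the standard whitespace-delimited .out column order, in which the three repeat-position columns are (left), end, begin on the complement strand and begin, end, (left) otherwise. The helper name `parse_out_line` and the dictionary keys are made up for illustration and are not part of RepeatMasker.

```python
def parse_out_line(line):
    """Parse one data line of a RepeatMasker .out file into a dict (illustrative helper)."""
    f = line.split()
    strand = f[8]
    # Assumption: on the complement strand ("C") the repeat columns are
    # (left), end, begin; on the forward strand they are begin, end, (left).
    if strand == "C":
        r_left, r_end, r_begin = f[11], f[12], f[13]
    else:
        r_begin, r_end, r_left = f[11], f[12], f[13]
    return {
        "score": int(f[0]),
        "query": f[4],
        "q_begin": int(f[5]),
        "q_end": int(f[6]),
        "strand": strand,
        "repeat": f[9],
        "class": f[10],
        "r_begin": int(r_begin),
        "r_end": int(r_end),
        "r_left": int(r_left.strip("()")),
    }


line = ("73691 0.1 0.0 0.0 ctg_2 328150582 328230797 (169823688) C "
        "rnd-4_family-3742 LINE/Penelope (0) 9199 1 1685343")
rec = parse_out_line(line)

# Length of the annotated genomic interval (coordinates are 1-based, inclusive).
genomic_span = rec["q_end"] - rec["q_begin"] + 1
# Consensus length inferred from the last matched position plus the bases left over.
consensus_len = rec["r_end"] + rec["r_left"]

print(genomic_span, consensus_len, round(genomic_span / consensus_len, 1))
```

For this line the annotated interval is 80,216 bp against a 9,199 bp consensus, so the single annotation covers roughly 8.7 consensus lengths. A ratio well above 1 could be used as a rough filter for other joined annotations of this kind.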


rmhubley · 2024-09-11T20:26:22Z

Sorry for the absurdly late reply. You wouldn't still happen to have this library available?

jamesdgalbraith · 2024-09-13T10:58:00Z

I've put library and RepeatMasker output up on Google Drive here: https://drive.google.com/file/d/1HvgNx58a_FO4jkDCmh1nZ7sO0NYGf57t/view?usp=sharing

If you could let me know once you've downloaded it, that would be great, so I can free up space in the Google Drive.

For context, the genome being annotated is publicly available here: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_952773005.1/ . RepeatMasker was run as part of EarlGrey, with the RepeatModeler library being curated using EarlGrey/TEstrainer's BLAST, Extend, Align, Trim algorithm. As a result, the scaffold names differ between the publicly available data and the .out file: three scaffolds in the NCBI genome (OX731680.1, OX731681.1, and OX731682.1) were renamed ctg_1, ctg_2, and ctg_3 respectively. Let me know if you need any more info.
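If it helps with reproducing this against the NCBI assembly, the renaming can be reversed with something like the sketch below. It covers only the three scaffolds listed above; any other sequence names are left untouched. The function name `rename_out_line` is hypothetical, and rejoining fields with single spaces discards the column alignment of the original .out file.

```python
# Mapping from the EarlGrey scaffold names back to the NCBI accessions
# given above. Only these three scaffolds are known from this thread.
NAME_MAP = {
    "ctg_1": "OX731680.1",
    "ctg_2": "OX731681.1",
    "ctg_3": "OX731682.1",
}


def rename_out_line(line, name_map=NAME_MAP):
    """Swap the query-sequence column (5th field) of a .out data line (illustrative helper)."""
    fields = line.split()
    # Header and blank lines have fewer fields and are returned unchanged.
    if len(fields) < 15 or fields[4] not in name_map:
        return line
    fields[4] = name_map[fields[4]]
    return " ".join(fields)


example = ("73691 0.1 0.0 0.0 ctg_2 328150582 328230797 (169823688) C "
           "rnd-4_family-3742 LINE/Penelope (0) 9199 1 1685343")
renamed = rename_out_line(example)
print(renamed)
```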

rmhubley · 2024-09-17T18:02:15Z

Thank you for the great info to reproduce this. I have downloaded your file and will experiment with it. My suspicion is that this is an artifact of not having a curated library in which both the mosaic satellite and the Penelope family are represented. Typically, RepeatMasker will join significantly overlapping alignments of the same family into one annotation (to account for minor subfamily differences or local tandem duplications). This is a rare case, though, and there must not be a limit set on this joining process. I will see if I can add a fix for this in the next release.

jamesdgalbraith added the question label Feb 15, 2024
