80kbp stretch of genome annotated as single repeat #248

Open
jamesdgalbraith opened this issue Feb 15, 2024 · 3 comments

@jamesdgalbraith

I'm in the process of annotating the genome of Icerya purchasi (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_952773005.1/) and have come across a large problem.

The final RepeatMasker annotation .out and .gff files have an 80kbp stretch of the genome annotated as a single contiguous repeat. The original consensus sequence is 9kbp, so I'm a little confused as to why the hits are collapsed into one annotation rather than kept as separate hits. Viewing the portion of the genome annotated as this repeat, it appears a Penelope element has undergone tandem duplication numerous times, so that it effectively resembles a satellite.

Is this the desired output? I would have expected several lines, with each corresponding to the individual hits found using blast. I've pasted the example line from the .out below.

73691 0.1 0.0 0.0 ctg_2 328150582 328230797 (169823688) C rnd-4_family-3742 LINE/Penelope (0) 9199 1 1685343
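To make the mismatch in that line concrete, here is a minimal Python sketch that splits the line into fields and compares the genomic span with the consensus span. It assumes the standard whitespace-delimited RepeatMasker .out column order, where for a `C` (complement) hit the three repeat columns read (left), end, begin. The function and key names (`parse_out_line`, `query_span`, and so on) are made up for illustration and are not part of RepeatMasker.

```python
# Sketch only: field names are hypothetical, column order assumed from the
# standard RepeatMasker .out layout.
EXAMPLE_LINE = (
    "73691 0.1 0.0 0.0 ctg_2 328150582 328230797 (169823688) C "
    "rnd-4_family-3742 LINE/Penelope (0) 9199 1 1685343"
)


def _to_int(token):
    """Convert '123' or '(123)' to an int."""
    return int(token.strip("()"))


def parse_out_line(line):
    f = line.split()
    strand = f[8]
    if strand == "C":
        # Complement hits list the repeat columns as (left), end, begin.
        rep_left, rep_end, rep_begin = f[11], f[12], f[13]
    else:
        # Forward hits list them as begin, end, (left).
        rep_begin, rep_end, rep_left = f[11], f[12], f[13]
    q_begin, q_end = int(f[5]), int(f[6])
    r_begin, r_end = _to_int(rep_begin), _to_int(rep_end)
    return {
        "query": f[4],
        "strand": strand,
        "repeat": f[9],
        "class_family": f[10],
        "query_span": q_end - q_begin + 1,    # bp covered on the genome
        "repeat_span": r_end - r_begin + 1,   # bp covered on the consensus
        "repeat_left": _to_int(rep_left),
    }


rec = parse_out_line(EXAMPLE_LINE)
ratio = rec["query_span"] / rec["repeat_span"]
print(rec["query_span"], rec["repeat_span"], round(ratio, 2))
```

For the line above this gives a genomic span of 80216 bp against a consensus span of 9199 bp, so the single annotation covers roughly 8.7 consensus lengths.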

@rmhubley
Member

Sorry for the absurdly late reply. You wouldn't still happen to have this library available?

@jamesdgalbraith
Author

jamesdgalbraith commented Sep 13, 2024

I've put library and RepeatMasker output up on Google Drive here: https://drive.google.com/file/d/1HvgNx58a_FO4jkDCmh1nZ7sO0NYGf57t/view?usp=sharing

If you can let me know when you've downloaded it, that would be great, as I can then free up space in the Google Drive.

For context, the genome being annotated is publicly available here: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_952773005.1/ . RepeatMasker was run as part of EarlGrey, with the RepeatModeler library being curated using EarlGrey/TEstrainer's BLAST, Extend, Align, Trim algorithm. The scaffold names differ between the publicly available data and the .out format due to this, with three scaffolds in the NCBI genome (OX731680.1, OX731681.1, and OX731682.1) being renamed ctg_1, ctg_2, and ctg_3 respectively. Let me know if you need any more info.
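For anyone cross-referencing the .out file against the NCBI assembly, the renaming described above can be reversed with a small lookup. This is only a sketch: the mapping is the one stated in this comment, while the function name `restore_scaffold_name` is hypothetical. It also rejoins fields with single spaces, so the column alignment of the original .out file is not preserved.

```python
# Mapping taken from the comment above: EarlGrey names -> NCBI accessions.
ACCESSION_BY_CTG = {
    "ctg_1": "OX731680.1",
    "ctg_2": "OX731681.1",
    "ctg_3": "OX731682.1",
}


def restore_scaffold_name(out_line, mapping=ACCESSION_BY_CTG):
    """Replace the query-sequence column (5th field) with the NCBI accession."""
    fields = out_line.split()
    if len(fields) >= 5 and fields[4] in mapping:
        fields[4] = mapping[fields[4]]
    # Note: alignment whitespace is collapsed to single spaces.
    return " ".join(fields)


line = (
    "73691 0.1 0.0 0.0 ctg_2 328150582 328230797 (169823688) C "
    "rnd-4_family-3742 LINE/Penelope (0) 9199 1 1685343"
)
print(restore_scaffold_name(line))
```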

@rmhubley
Member

Thank you for the great info to reproduce this. I have downloaded your file and will experiment with it. My suspicion is that this is an artifact of not having a curated library in which both the mosaic satellite and the Penelope family are represented. Typically RepeatMasker will join significantly overlapping alignments of the same family into one annotation (to account for minor subfamily differences or local tandem duplications). This is a rare case, though, and there must not be a limit set on this joining process. I will see if I can add a fix for this in the next release.
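As a toy illustration of the behaviour described above, and explicitly not RepeatMasker's actual implementation, the sketch below joins overlapping same-family hits along a sequence and shows how an uncapped join can chain a tandem array into one annotation. Everything here is an assumption made for the example: the `Hit` type, the `min_overlap` fraction, and the `max_joined_len` cap are hypothetical and do not correspond to any RepeatMasker option.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    family: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def join_hits(hits: List[Hit], min_overlap: float = 0.5,
              max_joined_len: Optional[int] = None) -> List[Hit]:
    """Greedily join same-family hits that overlap the previous annotation.

    min_overlap: required overlap as a fraction of the shorter interval
                 (hypothetical threshold).
    max_joined_len: optional cap on the length of a joined annotation
                    (hypothetical; None means no limit).
    """
    joined: List[Hit] = []
    for hit in sorted(hits, key=lambda h: (h.family, h.start, h.end)):
        if joined:
            prev = joined[-1]
            overlap = prev.end - hit.start + 1
            shorter = min(prev.end - prev.start + 1, hit.end - hit.start + 1)
            new_end = max(prev.end, hit.end)
            within_cap = (max_joined_len is None
                          or new_end - prev.start + 1 <= max_joined_len)
            if (hit.family == prev.family
                    and overlap >= min_overlap * shorter
                    and within_cap):
                # Extend the previous annotation instead of starting a new one.
                joined[-1] = prev._replace(end=new_end)
                continue
        joined.append(hit)
    return joined


# Toy data: 20 hits of one consensus length (9199 bp), each offset by 4000 bp.
tandem = [Hit("rnd-4_family-3742", 1 + i * 4000, 9199 + i * 4000)
          for i in range(20)]
print(len(join_hits(tandem)))                          # no cap
print(len(join_hits(tandem, max_joined_len=2 * 9199)))  # capped
```

With no cap, each hit overlaps the growing annotation, so the whole chain collapses into a single record. With a cap of two consensus lengths, the same toy data is split into several annotations instead.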
