Incorrect HIV Sequences being represented by PanGraph #62
Comments
Dear @TheHarshShow, |
Hi @TheHarshShow, I am investigating the issue. In the meantime I observed that the issue seems to be linked to the mix of uppercase and lowercase characters in your input sequences. If I run pangraph with standard parameters and the Cheers! |
Hi Marco, Thanks a lot for looking into this issue. Thanks for letting us know about the I am also adding the E. coli dataset. We believe the sequence Thanks, |
Hi @TheHarshShow, after your last comment (thanks for that!) I started looking deeper into what was causing the issue in the small virus dataset, assuming that the lowercase nucleotides were not the problem. I found that during block merging there was a particular edge-case of an adjacent insertion and deletion that would cause small inconsistencies in the alignment. I created a branch with a fix for that problem: [#63]. If possible could you test this version of pangraph on your datasets with the Thanks again! Marco |
Hi @TheHarshShow, another small update: I tested it on the 100 E. coli sequences. I could reproduce the error with the original version of pangraph, and the error was removed with the bug-fix. I will merge the PR and consider this issue closed, but feel free to re-open it if you encounter the error again. Thanks again for all the feedback! |
Hi Marco, Thanks a lot for looking into and fixing the issue! I think that since it's working for you, I agree that this issue can be closed. Our lab will use the latest version of PanGraph and if something doesn't work, we'll let you know. Thanks, |
Hi there,
Our lab has been working with PanGraphs for a while now and we've found them to be very useful. We believe that we've recently noticed some sequences being incorrectly represented by the PanGraph. The simplest dataset that we've found the issue on is a dataset of 2000 HIV sequences. We have attached a Google Drive link with our files. The HIV_2000.fa file stores the true sequences. The hiv_2000_pangraph.fa file consists of the sequences that we believe the PanGraph represents.
We've found seven sequences of the presumed PanGraph output to not match with the raw sequences. These are: B.RU.2004.04RU128005.AY682547, B.US.2000.14302_1.DQ853450, B.US.2000.14294_1.DQ853436, B.US.2000.14303_1.DQ853451, B.US.1998.15388_1.DQ853456, B.US.1998.15385_1.DQ853464 and B.US.1998.15386_1.DQ853460.
One thing to note is that most of these mismatches occur towards the ends of the sequences. The Google Drive link also contains the PanGraph that these sequences were derived from.
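For anyone who wants to repeat this kind of check, here is a minimal sketch of how two FASTA files could be compared record by record. It is not the script our lab used and it does not call PanGraph; the function names (read_fasta, first_mismatch, compare_records) and the ignore_case option are made up for illustration. It assumes both files use the same record names and that the sequences are in the same orientation and start position.

```python
from typing import Dict, Iterable, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {record name: sequence}.

    The record name is the first whitespace-delimited token after '>'.
    """
    records: Dict[str, str] = {}
    name: Optional[str] = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:].split()[0]
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def first_mismatch(a: str, b: str) -> Optional[int]:
    """Return the 0-based index of the first difference, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One sequence is a strict prefix of the other.
        return min(len(a), len(b))
    return None


def compare_records(truth: Dict[str, str],
                    candidate: Dict[str, str],
                    ignore_case: bool = False) -> Dict[str, int]:
    """Map each differing record name to its first mismatch index.

    Records missing from `candidate` are reported with index -1.
    """
    mismatches: Dict[str, int] = {}
    for name, seq in truth.items():
        other = candidate.get(name)
        if other is None:
            mismatches[name] = -1
            continue
        if ignore_case:
            seq, other = seq.upper(), other.upper()
        pos = first_mismatch(seq, other)
        if pos is not None:
            mismatches[name] = pos
    return mismatches
```

To run it on the attached files, one could pass open file handles, e.g. read_fasta(open("HIV_2000.fa")) and read_fasta(open("hiv_2000_pangraph.fa")), and then call compare_records on the two results. Comparing the reported index with the sequence length shows whether a mismatch sits near the end of a sequence, and the ignore_case option helps separate real differences from uppercase/lowercase differences.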
We will also soon upload the data for 20000 HIV sequences, of which 88 don't match. You can use those for testing.
Drive
Thanks,
Harsh Motwani
Turakhia Lab, UC San Diego