Incorrect HIV Sequences being represented by PanGraph #62

Closed
TheHarshShow opened this issue Nov 13, 2023 · 6 comments · Fixed by #63

@TheHarshShow

Hi there,

Our lab has been working with PanGraphs for a while now and we've found them to be very useful. We believe that we've recently noticed some sequences being incorrectly represented by the PanGraph. The simplest dataset that we've found the issue on is a dataset of 2000 HIV sequences. We have attached a Google Drive link with our files. The HIV_2000.fa file stores the true sequences. The hiv_2000_pangraph.fa file consists of the sequences that we believe the PanGraph represents.

We've found seven sequences in the presumed PanGraph output that do not match the raw sequences. These are: B.RU.2004.04RU128005.AY682547, B.US.2000.14302_1.DQ853450, B.US.2000.14294_1.DQ853436, B.US.2000.14303_1.DQ853451, B.US.1998.15388_1.DQ853456, B.US.1998.15385_1.DQ853464 and B.US.1998.15386_1.DQ853460.

One thing to note is that most of these mismatches occur towards the ends of the sequences. The Google Drive link also contains the PanGraph that these sequences were derived from.
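For readers who want to run this kind of comparison themselves, here is a minimal Python sketch that compares two FASTA files record by record and reports where each mismatching pair first diverges. It is not the script used in this report: the function names are hypothetical, it assumes both files use the same record IDs (the first token of each header), and it uses no PanGraph functionality.

```python
def read_fasta(path):
    """Parse a FASTA file into a dict mapping record ID to sequence."""
    records = {}
    name = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    records[name] = "".join(chunks)
                # Assumption: the first whitespace-delimited token is the ID.
                header = line[1:].split()
                name = header[0] if header else ""
                chunks = []
            else:
                chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def first_mismatch(a, b):
    """Return the 0-based index of the first difference, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One sequence is a strict prefix of the other.
        return min(len(a), len(b))
    return None


def compare_fasta(true_path, graph_path):
    """Return (id, first_mismatch, true_len, graph_len) for each differing record."""
    truth = read_fasta(true_path)
    rebuilt = read_fasta(graph_path)
    report = []
    for name, seq in truth.items():
        other = rebuilt.get(name)
        if other is None:
            # Record is absent from the reconstructed file.
            report.append((name, None, len(seq), None))
            continue
        pos = first_mismatch(seq, other)
        if pos is not None:
            report.append((name, pos, len(seq), len(other)))
    return report
```

Calling `compare_fasta("HIV_2000.fa", "hiv_2000_pangraph.fa")` would list the differing records. Comparing each reported index with the sequence length shows whether a mismatch sits near an end of the sequence.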

Soon, the data for 20000 HIV sequences will also be uploaded where 88/20000 sequences don't match. You can use those for testing.

Drive

Thanks,
Harsh Motwani
Turakhia Lab, UC San Diego

mmolari self-assigned this Nov 14, 2023
@mmolari
Collaborator

mmolari commented Nov 14, 2023

Dear @TheHarshShow,
very happy to hear that you're finding PanGraph useful! Thank you for the feedback, this is very helpful for us. And thank you for sharing the files. I'll look into this and let you know if I can reproduce and correct the issue.
take care!
Marco

@mmolari
Collaborator

mmolari commented Nov 14, 2023

Hi @TheHarshShow,

I am investigating the issue. In the meantime I observed that the issue seems to be linked to the mix of uppercase and lowercase characters in your input sequences. If I run pangraph with standard parameters and the --test flag (to test automatically for correct sequence reconstruction) I can reproduce the error that you were mentioning. When inspecting the merging at which the algorithm fails I saw that the two merged graphs include uppercase and lowercase characters.
I tried re-running pangraph with the same parameters plus the --upper-case flag, which forces conversion of all input nucleotide characters to uppercase, and in this case I do not detect the error.
This holds for the 2,000-sequence dataset; I haven't tested the 20,000-sequence case yet. I will investigate further, but in the meantime, if this issue is blocking your work, you could try adding the --upper-case flag and see whether it solves the problem on your side.
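One way to check whether a given mismatch is explained by letter case alone is to compare the two sequences again after uppercasing both. The sketch below is only an illustration with a hypothetical function name; it does not call PanGraph and says nothing about how the --upper-case flag is implemented.

```python
def classify_mismatch(original, reconstructed):
    """Classify a sequence pair as 'identical', 'case-only', or 'different'."""
    if original == reconstructed:
        return "identical"
    # If the sequences agree once case is ignored, the mismatch is purely
    # a matter of lowercase versus uppercase nucleotide characters.
    if original.upper() == reconstructed.upper():
        return "case-only"
    return "different"
```

A record classified as "different" still mismatches after case is ignored, which would point to a cause other than mixed-case input.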

Cheers!
Marco

@TheHarshShow
Author

Hi Marco,

Thanks a lot for looking into this issue, and for letting us know about the --upper-case and --test flags. We also encountered one sequence mismatch in an E. coli dataset of 100 sequences. Since that dataset might be harder to work with, I initially provided only the HIV sequences. However, if the problem here relates to lowercase characters, I believe the E. coli dataset has a different problem, since it doesn't contain any lowercase characters.

I am also adding the E. coli dataset. We believe the sequence NZ_CP006834.2 isn't represented correctly. In fact, we have pinpointed that the sequence is missing an insertion of two nucleotides at position 873,274.
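A single missing insertion in a multi-megabase genome can be localized without a full alignment by trimming the longest common prefix and suffix of the two sequences. The following Python sketch shows that idea; the function name is hypothetical, positions are 0-based (the coordinate convention behind the position quoted above is not stated), and the result is only meaningful when the sequences differ in one contiguous region.

```python
def localize_difference(reference, candidate):
    """Return (start, ref_segment, cand_segment) for the differing region.

    Returns None when the two sequences are identical.
    """
    if reference == candidate:
        return None
    limit = min(len(reference), len(candidate))
    # Length of the longest common prefix.
    prefix = 0
    while prefix < limit and reference[prefix] == candidate[prefix]:
        prefix += 1
    # Length of the longest common suffix that does not overlap the prefix.
    suffix = 0
    while (suffix < limit - prefix
           and reference[len(reference) - 1 - suffix]
           == candidate[len(candidate) - 1 - suffix]):
        suffix += 1
    return (prefix,
            reference[prefix:len(reference) - suffix],
            candidate[prefix:len(candidate) - suffix])
```

For example, `localize_difference("AACCTAGGTT", "AACCGGTT")` returns `(4, "TA", "")`: the candidate lacks the two bases `TA` that the reference carries at index 4. In repetitive regions the reported start can shift by a few bases, because several placements of the same indel yield identical sequences.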

Thanks,
Harsh

mmolari linked a pull request Nov 20, 2023 that will close this issue
@mmolari
Collaborator

mmolari commented Nov 20, 2023

Hi @TheHarshShow,

After your last comment (thanks for that!) I looked deeper into what was causing the issue in the small virus dataset, assuming that the lowercase nucleotides were not the problem. I found that during block merging, a particular edge case involving an adjacent insertion and deletion would cause small inconsistencies in the alignment.
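To illustrate why an insertion next to a deletion is easy to get wrong, here is a toy Python model of rebuilding a sequence from a consensus plus indel edits. It is not PanGraph's data structure or its merging code: the function name, the edit encoding, and the position conventions are all assumptions made for this example.

```python
def reconstruct(consensus, insertions, deletions):
    """Rebuild a sequence from a consensus plus indel edits (toy model).

    insertions: {pos: text}, text is placed before consensus index pos
                (pos == len(consensus) appends at the end).
    deletions:  {pos: length}, removes consensus[pos:pos + length].
    """
    out = []
    skip_until = 0
    for pos in range(len(consensus) + 1):
        # Emit the insertion first, even when pos itself is deleted;
        # skipping it here would silently lose the inserted bases.
        if pos in insertions:
            out.append(insertions[pos])
        if pos in deletions:
            skip_until = max(skip_until, pos + deletions[pos])
        # Copy the consensus base unless it falls inside a deletion.
        if pos < len(consensus) and pos >= skip_until:
            out.append(consensus[pos])
    return "".join(out)
```

With `consensus = "ACGTACGT"`, an insertion `{4: "TT"}` and a deletion `{4: 2}` that share the same position, this convention yields `"ACGTTTGT"`. A round-trip check that compares the rebuilt string with the input sequence catches any implementation that drops or misplaces the inserted bases in this situation.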

I created a branch with a fix for that problem: [#63]. If possible could you test this version of pangraph on your datasets with the --test flag to see if this solves those issues as well? If so I will merge the PR and release a new version.

Thanks again!

Marco

@mmolari
Collaborator

mmolari commented Nov 21, 2023

Hi @TheHarshShow,

Another small update: I tested it on the 100 E. coli sequences. I could reproduce the error with the original version of pangraph, and the error was removed with the bug fix. I will merge the PR and consider this issue closed, but feel free to re-open it if you encounter the error again.

Thanks again for all the feedback!
Marco

@TheHarshShow
Author

Hi Marco,

Thanks a lot for looking into and fixing the issue! Since it's working for you, I agree that this issue can be closed. Our lab will use the latest version of PanGraph, and if something doesn't work, we'll let you know.

Thanks,
Harsh
