Include vaccine strains #23

Merged
merged 4 commits into from
Apr 16, 2024

trvrb
Member

@trvrb trvrb commented Apr 12, 2024

Full genomes for Edmonston-related vaccine strains were present in the ingest dataset, but weren't making it to the final genome or N450 results because they were filtered out for lacking date metadata. This PR surfaces these vaccine strains by:

  1. Including annotations for wild-type Edmonston as well as 5 derived vaccine strains.
  2. Swapping to a "country year" group-by so that samples with only year-level date metadata make it into the final build.
  3. Including strain coloring to provide proper descriptions of these 6 samples.

I've just run the entire pipeline locally and everything works as expected.

Results from running this branch are viewable at:

This commit adds strain and date annotations for 5 vaccine strains that all descend from the Edmonston isolate collected in 1954. The Parks et al. paper describes these well. I purposely chose not to include location for these, as I wanted the gray dot in the Auspice tree to make them look a bit different from wild-type isolates.

This also includes strain, date and location for the Edmonston WT strain.
There's not enough genome data to warrant including month in the subsampling grouping. Also, with month included, the subsampling was dropping a number of older samples that were annotated only by year. I noticed this when trying to include the 1954 Edmonston-related vaccine strains, which were getting filtered out by the previous "country year month" group-by.
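The effect described above can be illustrated with a small sketch. This is a simplified stand-in for grouped subsampling, not Augur's actual implementation; the function `group_records` and the sample records are hypothetical. It assumes that a record whose date lacks a part required by the group-by (written here as `XX`) cannot be assigned to a group and is skipped.

```python
from collections import defaultdict


def group_records(records, group_by):
    """Group strain names by the requested keys.

    Simplified illustration only (not Augur's code): a record is skipped
    when a date part needed for grouping is ambiguous ("XX").
    """
    groups = defaultdict(list)
    for rec in records:
        # Pad short dates such as "1954" so year, month, day always exist.
        year, month, _day = (rec["date"].split("-") + ["XX", "XX"])[:3]
        date_parts = {"year": year, "month": month}
        if any("X" in date_parts[key] for key in group_by if key in date_parts):
            continue
        values = {"country": rec["country"], **date_parts}
        groups[tuple(values[key] for key in group_by)].append(rec["strain"])
    return dict(groups)


# Hypothetical records: one annotated by year only, one with a full date.
records = [
    {"strain": "old_sample_year_only", "country": "USA", "date": "1954-XX-XX"},
    {"strain": "recent_sample", "country": "USA", "date": "2019-03-14"},
]

by_month = group_records(records, ["country", "year", "month"])
by_year = group_records(records, ["country", "year"])
```

With the month-level grouping the year-only record never reaches a group, while the year-level grouping keeps both records.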
Strain name is often not included in GenBank or is not very helpful, but it is still good to surface as metadata for the modal. I particularly wanted this for the 1954 Edmonston-related vaccine strains. People know these by their strain names, certainly not their GenBank accessions.
This swaps to using --metadata-columns in augur export to surface strain, division and location.
Comment on lines +26 to +27
export:
  metadata_columns: "strain division location"
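As a rough sketch of how a space-separated config value like this could be turned into the `--metadata-columns` arguments for `augur export v2`: the helper `export_metadata_args` and the in-memory `config` dict are assumptions for illustration, and the way the workflow actually passes the value is not shown in this PR. The printed command is a fragment and omits the other arguments that a real export call needs.

```python
import shlex


def export_metadata_args(config):
    """Build the --metadata-columns arguments from a config mapping.

    Hypothetical helper: assumes the config stores the column names as a
    single space-separated string under export.metadata_columns.
    """
    columns = config["export"]["metadata_columns"].split()
    return ["--metadata-columns", *columns]


# Assumed in-memory form of the config lines under review.
config = {"export": {"metadata_columns": "strain division location"}}

args = export_metadata_args(config)
# Fragment only: other arguments to the export command are left out here.
command = shlex.join(["augur", "export", "v2", *args])
print(command)
```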
Contributor

@kimandrews kimandrews Apr 12, 2024

Surfacing the strain names now sounds good. Eventually we should be able to pull more strain names from GenBank, once NCBI Datasets starts pulling the "strain" field, which is where most measles strain names are reported on GenBank (currently we get strain names from GenBank's "isolate" field, which NCBI Datasets does pull). NCBI says this is planned for sometime this year. This would also let us recover dates for some samples that have empty dates, since dates are part of the strain name.
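A minimal sketch of recovering date information from a strain name follows. It assumes names follow a WHO-style layout of `MVs/City.COUNTRY/week.year/...` with an epidemiological week and a two-digit year; the pattern, the helper `week_and_year`, and the example name are illustrative assumptions, not code from this repository. Resolving the two-digit year to a century is deliberately left to the caller.

```python
import re

# Assumed layout: MVs or MVi, then location, then "<epi week>.<two-digit year>".
STRAIN_DATE = re.compile(
    r"^MV[si]/[^/]+/(?P<week>\d{1,2})\.(?P<year>\d{2})(?:/|$)"
)


def week_and_year(strain):
    """Return (epi_week, two_digit_year) parsed from a strain name, or None."""
    match = STRAIN_DATE.match(strain)
    if match is None:
        return None
    week = int(match["week"])
    # Reject values that cannot be an epidemiological week.
    if not 1 <= week <= 53:
        return None
    return week, int(match["year"])


# Hypothetical strain name used only to exercise the pattern.
parsed = week_and_year("MVs/Exampletown.USA/7.19/")
unparsed = week_and_year("Edmonston")
```

Names that do not match the assumed layout return `None`, so such records would simply keep their empty dates.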

Member Author

Excellent! Thanks for the context.

@trvrb trvrb merged commit 6bed278 into main Apr 16, 2024
32 checks passed
@trvrb trvrb deleted the vaccine-strains branch April 16, 2024 23:12
kimandrews added a commit that referenced this pull request Apr 19, 2024
Explicitly add vaccine strains to genome tree and N450 tree, following up on #23

These strains currently end up in the trees due to our subsampling parameters and lack of other sequences from 1954, but this commit explicitly adds them.