Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make nextclade tree #28

Merged
merged 8 commits into from
May 9, 2024
Merged

Make nextclade tree #28

merged 8 commits into from
May 9, 2024

Conversation

kimandrews
Copy link
Contributor

Description of proposed changes

This workflow creates a tree that can be used as part of a Nextclade dataset to assign genotypes to measles samples based on criteria outlined by the WHO

The tree created by this workflow includes N450 sequences for 28 reference strains defined by the WHO, along with other representative strains for each genotype that are obtained from the ingest output, using genotype assignments reported on NCBI. Genotypes are then assigned to each node with augur clades, using clade-defining mutations in defaults/clades.tsv.

Details regarding the workflow are described in a README.md

Related issue(s)

Checklist

  • Checks pass

Creates a nextclade directory and copies relevant files from phylogenetic
directory to nextclade directory.
Subsequent commits will change these for the Nextclade Dataset tree workflow.
@kimandrews kimandrews requested a review from a team April 30, 2024 21:57
@joverlee521 joverlee521 self-requested a review May 1, 2024 21:01
@trvrb
Copy link
Member

trvrb commented May 1, 2024

This is wonderful @kimandrews! Cleanly constructed in terms of the workflow and the result looks great in terms of consistency of MeV Genotypes and amount of data used. I think this will work well as a Nextclade reference tree. I just made a few changes to the Auspice config for the Nextclade tree in the above commit.

Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working through this Kim! The overall workflow looks good to me, I only left some minor comments.

A couple high level thoughts (not blocking this PR):

  1. Will you be updating the workflow to create the full Nextclade dataset in the future? E.g. mpox's nextclade workflow packages the whole dataset.
  2. Since there are a lot of shared files between phylogenetic and nextclade workflows, we could consider using a shared directory to hold these files, so we don't need to duplicate the files across two directories.

nextclade/rules/prepare_sequences.smk Outdated Show resolved Hide resolved
nextclade/rules/prepare_sequences.smk Outdated Show resolved Hide resolved
nextclade/rules/prepare_sequences.smk Show resolved Hide resolved
nextclade/rules/annotate_phylogeny.smk Show resolved Hide resolved
Copy link
Contributor

@genehack genehack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extreme nitpick...

"""

rule export:
"""Exporting data files for for auspice"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Exporting data files for for auspice"""
"""Exporting data files for auspice"""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in df7c7f0

@trvrb
Copy link
Member

trvrb commented May 1, 2024

Will you be updating the workflow to create the full Nextclade dataset in the future? E.g. mpox's nextclade workflow packages the whole dataset.

Packaging up into a Nextclade dataset will be super useful. Would be nice to have this on the clades.nextstrain.org list.

Since there are a lot of shared files between phylogenetic and nextclade workflows, we could consider using a shared directory to hold these files, so we don't need to duplicate the files across two directories.

I personally don't mind the code duplication in this particular case. Easier for me to think about these as separate things. But it's not a strong preference.

Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @kim, only a few minor requests from me. I didn't run this locally, but tested out the tree that it produces by using the use-nextclade-dataset branch which can also be loaded into the nextclade web UI and it's looking great.

nextclade/defaults/config.yaml Show resolved Hide resolved
nextclade/defaults/colors.tsv Show resolved Hide resolved
nextclade/rules/prepare_sequences.smk Show resolved Hide resolved
nextclade/rules/annotate_phylogeny.smk Show resolved Hide resolved
kimandrews and others added 6 commits May 6, 2024 17:05
* Add rule to align sequences to N450 reference using Nextclade,
following edbefd5
* Filter to produce a sample set appropriate for a Nextclade Dataset tree
Remove construct_phylogeny.smk wildcards that were needed for the Phylogenetic workflow but are not needed for the Nextclade Dataset tree workflow.
* Use `augur ancestral` option `--root-sequence` to add mutations from the reference sequence to the root of the tree, which is required for Nextclade Dataset trees, as described here: https://github.com/nextstrain/nextclade_data/blob/577696e4ee0f0dff925d3911d27351a7395fab3d/docs/dataset-creation-guide.md
* Use `augur clades` to assign genotypes to nodes
* Remove annotate_phylogeny.smk wildcards that were needed for the Phylogenetic workflow but are not needed for the Nextclade Dataset tree workflow.
* Include `augur clades` genotype assignments in exported tree and in coloring options
* Remove export.smk wildcards that were needed for the Phylogenetic workflow but are not needed for the Nextclade Dataset tree workflow.
This makes a few various updates to the Auspice config for the Nextclade tree including:
1. Making "clade_membership" default coloring. For this tree, we really care about MeV Genotype more than other colorings.
2. Call this "MeV Genotype (Nextstrain)". I think important to specify MeV Genotype to distinguish from the more common SNP-level genotype.
3. Clarify with "MeV Genotype (GenBank metadata)". The previous "(NCBI)" seemed like this could be a MeV Genotype call from an NCBI tool.
4. Order as Region then Country to follow ncov
5. Include clade_membership as filter option
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest changes look good to me 👍

@trvrb
Copy link
Member

trvrb commented May 9, 2024

Looks like this may be ready to merge? @jameshadfield: Could you confirm?

@kimandrews
Copy link
Contributor Author

Looks like this may be ready to merge? @jameshadfield: Could you confirm?

I talked with @jameshadfield yesterday and we are good to merge.

@kimandrews kimandrews merged commit 21d5a9c into main May 9, 2024
32 checks passed
@kimandrews kimandrews deleted the make-nextclade-tree branch May 9, 2024 21:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants