Split by serotype using NCBI virus_tax_id #20

j23414 · 2024-02-06T23:19:22Z

Description of proposed changes

Split records by serotype by using the virus-tax-id field in NCBI datasets. Historically, we obtained each serotype individually in this code, leading to redundant fetching and processing of each sequence. Now that we're using ncbi datasets, these numeric IDs are recorded in a virus-tax-id field, that we can use to separate the serotypes.

Changes in this PR include:

Pull virus-tax-id and virus-name from the dengue NCBI dataset
Use the virus-tax-id to infer the ncbi_serotype field of each record
Use the ncbi_serotype field to split sequences and metadata files into serotypes
Update target files in the Snakefile to reflect the new serotype split

Related issue(s)

Split by dengue serotype (denv1-denv4) #19

Checklist

Checks pass

huddlej

It's exciting to see these serotype annotations in the dengue metadata, @j23414, so you can confidently build serotype-specific trees! The ingest workflow ran without issues and the general approach looks good.

I left a few non-blocking comments below to consider. My main requests are minor and involve whitespace (woo!) and removing bash-based conditional logic from a rule.

huddlej · 2024-02-07T23:52:39Z

ingest/Snakefile

@@ -11,10 +11,12 @@ if not config:

 send_slack_notifications = config.get("send_slack_notifications", False)

+serotypes = ['all', 'denv1', 'denv2', 'denv3', 'denv4']


Non-blocking: this kind of configuration detail might want to live in the config YAML, if you thought you might ever want to allow users to change the list of serotypes to ingest.

I'd be worried about mismatch with the serotype names defined in infer-dengue-serotype.py. We should add a wildcard constraint to ensure they match the predefined serotypes.

Connected this to GitHub Issue #27 for additional discourse and subsequent development.

huddlej · 2024-02-07T23:55:38Z

ingest/bin/infer-dengue-serotype.py

+        default="virus_tax_id",
+        help="Column name containing the NCBI taxon id of the virus serotype, default: virus_tax_id",


Non-blocking: If you instantiate the parser object like this:

parser = argparse.ArgumentParser( description="Dengue specific processing of metadata, infer serotype from virus_tax_id", formatter_class=argparse.ArgumentDefaultsHelpFormatter, )

Then the ArgumentDefaultsHelpFormatter will automatically print the default values for arguments like this in the help text, so you don't have to copy the default manually. It's not perfect, but it might be good enough for this script.

Good to know! Added in 5da76d9

huddlej · 2024-02-07T23:57:03Z

ingest/bin/infer-dengue-serotype.py

+    dengue_types = {
+        "11053": "denv1",
+        "11060": "denv2",        
+        "11069": "denv3",
+        "11070": "denv4",
+        "31634": "denv2", # Dengue virus 2 Thailand/16681/84
+    }


Non-blocking: This type of map could also live in a CSV/TSV file that gets passed to this script as an argument, allowing users (future us, etc.) to inspect and edit these mappings without digging into the script. For example, we might soon want to include all taxon ids in the taxonomy subtree for Dengue.

Linked this to an Issue #27 for further discussion and later development.

If we did move the map to a CSV/TSV file (or define a field_map in the config file directly), we could also refactor the infer_dengue_serotypes.py into a generalized infer_serotypes.py that could be housed in vendored/ingest.

It's a good suggestion, just thinking through some planning before implementation, possibly as a later PR

huddlej · 2024-02-07T23:58:20Z

ingest/bin/infer-dengue-serotype.py

+    for taxid, serotype in dengue_types.items():
+        if taxid in ids:


I was surprised by this loop, at first, since I expected each record to have a single tax id and you could do a dictionary lookup, but is it possible/common for a single record to have multiple tax ids?

Oops, this is left over from VirusLineage having multiple tax ids in a comma separated list from general to specific. On the contrary, virus-tax-id only has one and doing a dictionary lookup may be more streamlined. Thanks for flagging!

Switched to using a dictionary lookup directly in 28cfcad Thanks!

huddlej · 2024-02-08T00:05:42Z

ingest/workflow/snakemake_rules/split_serotypes.smk

+    if [[ "{params.ncbi_serotype}" == "all" ]]; then
+      cp {input.metadata} {output.metadata}
+      cp {input.sequences} {output.sequences}
+    else
+      augur filter \
+        --sequences {input.sequences} \
+        --metadata {input.metadata} \
+        --metadata-id-columns {params.id_field} \
+        --query "ncbi_serotype=='{params.ncbi_serotype}'" \
+        --output-sequences {output.sequences} \
+        --output-metadata {output.metadata}
+    fi


One common pattern that I use to avoid using shell if/else statements inside a Snakemake rule is to split out the logical blocks into their own rules and let Snakemake's own filename pattern matching handle the conditional logic more declaratively. For example, this rule can be split into two separate rules like:

rule split_by_ncbi_serotype: """ Split the data by serotype based on the NCBI metadata. """ input: metadata = "data/metadata.tsv", sequences = "data/sequences.fasta" output: metadata = "results/metadata_{serotype}.tsv", sequences = "results/sequences_{serotype}.fasta" params: ncbi_serotype = lambda wildcards: wildcards.serotype, id_field = config["transform"]["id_field"] shell: """ augur filter \ --sequences {input.sequences} \ --metadata {input.metadata} \ --metadata-id-columns {params.id_field} \ --query "ncbi_serotype=='{params.ncbi_serotype}'" \ --output-sequences {output.sequences} \ --output-metadata {output.metadata} """ rule use_all_ncbi_serotypes: """ Use data from all serotypes. """ input: metadata = "data/metadata.tsv", sequences = "data/sequences.fasta" output: metadata = "results/metadata_all.tsv", sequences = "results/sequences_all.fasta" shell: """ cp {input.metadata} {output.metadata} cp {input.sequences} {output.sequences} """

The benefit of this approach is that you keep a one-to-one mapping between rules and shell commands, you get clearer output from Snakemake on the terminal (see example below), and also Snakemake won't try to re-run the use_all_ncbi_serotypes rule when the params in the split_by_ncbi_serotype rule change which is the kind of thing Snakemake does nowadays.

An example of the printed shell commands from a Snakemake dryrun with the current PR state is:

if [[ "denv4" == "all" ]]; then cp data/metadata.tsv results/metadata_denv4.tsv cp data/sequences.fasta results/sequences_denv4.fasta else augur filter --sequences data/sequences.fasta --metadata data/metadata.tsv --metadata-id-columns genbank_accession --query "ncbi_serotype=='denv4'" --output-sequences results/sequences_denv4.fasta --output-metadata results/metadata_denv4.tsv fi

vs.

augur filter --sequences data/sequences.fasta --metadata data/metadata.tsv --metadata-id-columns genbank_accession --query "ncbi_serotype=='denv4'" --output-sequences results/sequences_denv4.fasta --output-metadata results/metadata_denv4.tsv

This deserves an entry in the Nextstrain Snakemake style guide 🌟

The if/then statement was dropped via a refactor and simplification in 0537e23 Thanks!

huddlej · 2024-02-08T00:09:56Z

ingest/workflow/snakemake_rules/split_serotypes.smk

+  """
+  Split the data by serotype based on the NCBI metadata.
+  """


Minor request: Can you use the 4-space indentation throughout this new rules file to match the indentation used elsewhere in our workflows?

Gah, sorry. I need to switch the default spacing in my code editor 😆 Fixed in 678fa5a

huddlej · 2024-02-08T00:38:05Z

ingest/workflow/snakemake_rules/transform.smk

-        metadata="results/metadata.tsv",
-        sequences="results/sequences.fasta",
+        metadata="data/metadata.tsv",
+        sequences="data/sequences.fasta",


On further reflection about my rule-splitting comment above, what if you didn't change the output directory here to data/ and instead wrote out the metadata and sequences to results/metadata_all.tsv and results/sequences_all.fasta? Then you don't have to copy any files and you can drop the if/else logic from your split_serotypes.smk rule above.

I agree! Fixed and your suggestion streamlines the pipeline 0537e23

joverlee521

Hooray for NCBI having the virus_tax_id 🎉

+1 for all of @huddlej's comments here and I've added a couple more stylistic suggestions.

ingest/bin/infer-dengue-serotype.py

ingest/Snakefile

ingest/bin/infer-dengue-serotype.py

ingest/workflow/snakemake_rules/split_serotypes.smk

ingest/bin/infer-dengue-serotype.py

1. Pull virus_tax_id and virus_name from the dengue NCBI dataset 2. Use the virus_tax_id to infer the ncbi_serotype of each record 3. Use the ncbi_serotype field to split sequences and metadata files into serotypes 4. Update target files in the Snakefile to reflect the new serotype split Co-authored-by: Jover Lee <joverlee521@gmail.com>

j23414 requested a review from a team February 6, 2024 23:36

j23414 mentioned this pull request Feb 7, 2024

Harmonize with pathogen repo guide #22

Closed

7 tasks

huddlej reviewed Feb 8, 2024

View reviewed changes

joverlee521 approved these changes Feb 8, 2024

View reviewed changes

j23414 force-pushed the split-serotype-by-ncbi-assignment branch from 4f2591d to 3ddefcf Compare February 12, 2024 14:40

j23414 mentioned this pull request Feb 12, 2024

ENH: Generalize taxon id to serotype map definitions to a configuration file #27

Open

joverlee521 reviewed Feb 12, 2024

View reviewed changes

ingest/bin/infer-dengue-serotype.py Outdated Show resolved Hide resolved

joverlee521 reviewed Feb 12, 2024

View reviewed changes

ingest/bin/infer-dengue-serotype.py Outdated Show resolved Hide resolved

j23414 force-pushed the split-serotype-by-ncbi-assignment branch 2 times, most recently from 18536b4 to 09170ac Compare February 13, 2024 22:20

j23414 force-pushed the split-serotype-by-ncbi-assignment branch from 09170ac to 6c206bc Compare February 14, 2024 18:16

j23414 merged commit 9f54ad5 into main Feb 14, 2024
32 checks passed

j23414 deleted the split-serotype-by-ncbi-assignment branch February 14, 2024 18:25

This was referenced Feb 14, 2024

Split by dengue serotype (denv1-denv4) #19

Closed

Nextclade assignment #16

Merged

j23414 mentioned this pull request Jun 7, 2024

Add Dengue virus DENVx genotypes dataset nextstrain/nextclade_data#203

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Split by serotype using NCBI virus_tax_id #20

Split by serotype using NCBI virus_tax_id #20

j23414 commented Feb 6, 2024

huddlej left a comment

huddlej Feb 7, 2024

joverlee521 Feb 8, 2024

j23414 Feb 12, 2024

huddlej Feb 7, 2024

j23414 Feb 12, 2024

huddlej Feb 7, 2024

j23414 Feb 12, 2024

huddlej Feb 7, 2024

j23414 Feb 9, 2024

j23414 Feb 12, 2024

huddlej Feb 8, 2024

joverlee521 Feb 8, 2024

j23414 Feb 12, 2024

huddlej Feb 8, 2024

j23414 Feb 12, 2024

huddlej Feb 8, 2024

j23414 Feb 12, 2024

joverlee521 left a comment

		@@ -11,10 +11,12 @@ if not config:

		send_slack_notifications = config.get("send_slack_notifications", False)

		serotypes = ['all', 'denv1', 'denv2', 'denv3', 'denv4']

		default="virus_tax_id",
		help="Column name containing the NCBI taxon id of the virus serotype, default: virus_tax_id",

		for taxid, serotype in dengue_types.items():
		if taxid in ids:

Split by serotype using NCBI virus_tax_id #20

Split by serotype using NCBI virus_tax_id #20

Conversation

j23414 commented Feb 6, 2024

Description of proposed changes

Changes in this PR include:

Related issue(s)

Checklist

huddlej left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

joverlee521 left a comment

Choose a reason for hiding this comment