Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split by serotype using NCBI virus_tax_id #20

Merged
merged 1 commit into from
Feb 14, 2024

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Feb 6, 2024

Description of proposed changes

Split records by serotype by using the virus-tax-id field in NCBI datasets. Historically, we obtained each serotype individually in this code, leading to redundant fetching and processing of each sequence. Now that we're using ncbi datasets, these numeric IDs are recorded in a virus-tax-id field, that we can use to separate the serotypes.

Screenshot 2024-02-05 at 4 33 31 PM

Changes in this PR include:

  1. Pull virus-tax-id and virus-name from the dengue NCBI dataset
  2. Use the virus-tax-id to infer the ncbi_serotype field of each record
  3. Use the ncbi_serotype field to split sequences and metadata files into serotypes
  4. Update target files in the Snakefile to reflect the new serotype split

Related issue(s)

Checklist

  • Checks pass

@j23414 j23414 requested a review from a team February 6, 2024 23:36
@j23414 j23414 mentioned this pull request Feb 7, 2024
7 tasks
Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's exciting to see these serotype annotations in the dengue metadata, @j23414, so you can confidently build serotype-specific trees! The ingest workflow ran without issues and the general approach looks good.

I left a few non-blocking comments below to consider. My main requests are minor and involve whitespace (woo!) and removing bash-based conditional logic from a rule.

@@ -11,10 +11,12 @@ if not config:

send_slack_notifications = config.get("send_slack_notifications", False)

serotypes = ['all', 'denv1', 'denv2', 'denv3', 'denv4']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: this kind of configuration detail might want to live in the config YAML, if you thought you might ever want to allow users to change the list of serotypes to ingest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be worried about mismatch with the serotype names defined in infer-dengue-serotype.py. We should add a wildcard constraint to ensure they match the predefined serotypes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Connected this to GitHub Issue #27 for additional discourse and subsequent development.

Comment on lines 14 to 15
default="virus_tax_id",
help="Column name containing the NCBI taxon id of the virus serotype, default: virus_tax_id",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: If you instantiate the parser object like this:

parser = argparse.ArgumentParser(
    description="Dengue specific processing of metadata, infer serotype from virus_tax_id",
    formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)

Then the ArgumentDefaultsHelpFormatter will automatically print the default values for arguments like this in the help text, so you don't have to copy the default manually. It's not perfect, but it might be good enough for this script.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know! Added in 5da76d9

Comment on lines 28 to 35
dengue_types = {
"11053": "denv1",
"11060": "denv2",
"11069": "denv3",
"11070": "denv4",
"31634": "denv2", # Dengue virus 2 Thailand/16681/84
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: This type of map could also live in a CSV/TSV file that gets passed to this script as an argument, allowing users (future us, etc.) to inspect and edit these mappings without digging into the script. For example, we might soon want to include all taxon ids in the taxonomy subtree for Dengue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linked this to an Issue #27 for further discussion and later development.

If we did move the map to a CSV/TSV file (or define a field_map in the config file directly), we could also refactor the infer_dengue_serotypes.py into a generalized infer_serotypes.py that could be housed in vendored/ingest.

It's a good suggestion, just thinking through some planning before implementation, possibly as a later PR

Comment on lines 38 to 39
for taxid, serotype in dengue_types.items():
if taxid in ids:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was surprised by this loop, at first, since I expected each record to have a single tax id and you could do a dictionary lookup, but is it possible/common for a single record to have multiple tax ids?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, this is left over from VirusLineage having multiple tax ids in a comma separated list from general to specific. On the contrary, virus-tax-id only has one and doing a dictionary lookup may be more streamlined. Thanks for flagging!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to using a dictionary lookup directly in 28cfcad Thanks!

Comment on lines 31 to 42
if [[ "{params.ncbi_serotype}" == "all" ]]; then
cp {input.metadata} {output.metadata}
cp {input.sequences} {output.sequences}
else
augur filter \
--sequences {input.sequences} \
--metadata {input.metadata} \
--metadata-id-columns {params.id_field} \
--query "ncbi_serotype=='{params.ncbi_serotype}'" \
--output-sequences {output.sequences} \
--output-metadata {output.metadata}
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One common pattern that I use to avoid using shell if/else statements inside a Snakemake rule is to split out the logical blocks into their own rules and let Snakemake's own filename pattern matching handle the conditional logic more declaratively. For example, this rule can be split into two separate rules like:

rule split_by_ncbi_serotype:
  """
  Split the data by serotype based on the NCBI metadata.
  """
  input:
    metadata = "data/metadata.tsv",
    sequences = "data/sequences.fasta"
  output:
    metadata = "results/metadata_{serotype}.tsv",
    sequences = "results/sequences_{serotype}.fasta"
  params:
    ncbi_serotype = lambda wildcards: wildcards.serotype,
    id_field = config["transform"]["id_field"]
  shell:
    """
    augur filter \
        --sequences {input.sequences} \
        --metadata {input.metadata} \
        --metadata-id-columns {params.id_field} \
        --query "ncbi_serotype=='{params.ncbi_serotype}'" \
        --output-sequences {output.sequences} \
        --output-metadata {output.metadata}
    """

rule use_all_ncbi_serotypes:
  """
  Use data from all serotypes.
  """
  input:
    metadata = "data/metadata.tsv",
    sequences = "data/sequences.fasta"
  output:
    metadata = "results/metadata_all.tsv",
    sequences = "results/sequences_all.fasta"
  shell:
    """
    cp {input.metadata} {output.metadata}
    cp {input.sequences} {output.sequences}
    """

The benefit of this approach is that you keep a one-to-one mapping between rules and shell commands, you get clearer output from Snakemake on the terminal (see example below), and also Snakemake won't try to re-run the use_all_ncbi_serotypes rule when the params in the split_by_ncbi_serotype rule change which is the kind of thing Snakemake does nowadays.

An example of the printed shell commands from a Snakemake dryrun with the current PR state is:

    if [[ "denv4" == "all" ]]; then
      cp data/metadata.tsv results/metadata_denv4.tsv
      cp data/sequences.fasta results/sequences_denv4.fasta
    else
      augur filter         --sequences data/sequences.fasta         --metadata data/metadata.tsv         --metadata-id-columns genbank_accession         --query "ncbi_serotype=='denv4'"         --output-sequences results/sequences_denv4.fasta         --output-metadata results/metadata_denv4.tsv
    fi

vs.

    augur filter     --sequences data/sequences.fasta     --metadata data/metadata.tsv     --metadata-id-columns genbank_accession     --query "ncbi_serotype=='denv4'"     --output-sequences results/sequences_denv4.fasta     --output-metadata results/metadata_denv4.tsv

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This deserves an entry in the Nextstrain Snakemake style guide 🌟

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if/then statement was dropped via a refactor and simplification in 0537e23 Thanks!

Comment on lines 17 to 19
"""
Split the data by serotype based on the NCBI metadata.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor request: Can you use the 4-space indentation throughout this new rules file to match the indentation used elsewhere in our workflows?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gah, sorry. I need to switch the default spacing in my code editor 😆 Fixed in 678fa5a

Comment on lines 45 to 46
metadata="results/metadata.tsv",
sequences="results/sequences.fasta",
metadata="data/metadata.tsv",
sequences="data/sequences.fasta",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further reflection about my rule-splitting comment above, what if you didn't change the output directory here to data/ and instead wrote out the metadata and sequences to results/metadata_all.tsv and results/sequences_all.fasta? Then you don't have to copy any files and you can drop the if/else logic from your split_serotypes.smk rule above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree! Fixed and your suggestion streamlines the pipeline 0537e23

Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hooray for NCBI having the virus_tax_id 🎉

+1 for all of @huddlej's comments here and I've added a couple more stylistic suggestions.

ingest/bin/infer-dengue-serotype.py Outdated Show resolved Hide resolved
ingest/Snakefile Outdated Show resolved Hide resolved
ingest/bin/infer-dengue-serotype.py Outdated Show resolved Hide resolved
ingest/bin/infer-dengue-serotype.py Outdated Show resolved Hide resolved
ingest/workflow/snakemake_rules/split_serotypes.smk Outdated Show resolved Hide resolved
@j23414 j23414 force-pushed the split-serotype-by-ncbi-assignment branch 2 times, most recently from 18536b4 to 09170ac Compare February 13, 2024 22:20
1. Pull virus_tax_id and virus_name from the dengue NCBI dataset
2. Use the virus_tax_id to infer the ncbi_serotype of each record
3. Use the ncbi_serotype field to split sequences and metadata files into serotypes
4. Update target files in the Snakefile to reflect the new serotype split

Co-authored-by: Jover Lee <joverlee521@gmail.com>
@j23414 j23414 force-pushed the split-serotype-by-ncbi-assignment branch from 09170ac to 6c206bc Compare February 14, 2024 18:16
@j23414 j23414 merged commit 9f54ad5 into main Feb 14, 2024
32 checks passed
@j23414 j23414 deleted the split-serotype-by-ncbi-assignment branch February 14, 2024 18:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants