ingest: Merge Nextclade metadata with augur merge
This construction reads a bit clearer and cleaner.  It's also a good
example of how to use `augur merge`.

`augur merge` can't read its inputs from non-seekable streams such as
pipes, so the rule now writes the Nextclade metadata to a temporary file
first.  That uses additional transient disk space, but it typically
shouldn't be an issue.  Augur's slow startup time also affects
`augur merge` and lengthens the rule's execution time, but that should
be negligible in the context of the larger workflow, and presumably
we'll fix the slow startup eventually.¹
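
For readers unfamiliar with the seekable/non-seekable distinction, here
is a minimal Python sketch.  It only illustrates that a pipe (like the
one the rule used to feed into `tsv-join`) can't be rewound while a
regular file can; it says nothing about how `augur merge` reads its
inputs internally.  The helper name `is_seekable` is made up for this
example.

```python
import os
import tempfile


def is_seekable(fileobj) -> bool:
    """Report whether a file object supports random access (seek/tell)."""
    return fileobj.seekable()


# A pipe, like the one a shell creates for `cmd1 | cmd2`, is not seekable:
# data can only be read once, front to back.
read_fd, write_fd = os.pipe()
with os.fdopen(read_fd, "rb") as pipe_reader, os.fdopen(write_fd, "wb"):
    pipe_is_seekable = is_seekable(pipe_reader)

# A regular file on disk is seekable, so a reader can make several passes.
with tempfile.TemporaryFile() as tmp:
    file_is_seekable = is_seekable(tmp)

print(pipe_is_seekable, file_is_seekable)
```

Writing the intermediate table to a `temp(...)` output, as the rule now
does, trades a little disk space for an input that can be read more than
once.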

The output is semantically identical but has some syntactic changes re:
quoting.  It's worth noting that the pre-existing TSV format was _not_
IANA TSV, despite it (still) being treated as such in a few places, but
was (and remains) a CSV-like TSV with some quoted fields (and some
mangled quotes², e.g. the "institution" column for accession KJ556895).
We really need to sort out our TSV formats³, but that's for a larger
project.
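
To make the quoting distinction concrete, here is a small Python sketch
that parses the same line two ways.  The sample row is hypothetical (it
is not the actual KJ556895 record), and the two function names are
invented for illustration.  IANA TSV has no quoting mechanism, so quote
characters are ordinary data; a CSV-like TSV reader interprets them.

```python
import csv
import io

# Hypothetical two-line table; the second field of the data row is quoted.
raw = 'accession\tinstitution\nX1\t"Institute of ""Virology"""\n'


def parse_csv_like(text):
    """Parse as CSV-like TSV: tab-delimited, with CSV-style quoting rules."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


def parse_iana(text):
    """Parse as IANA TSV: split on tabs; quotes have no special meaning."""
    return [line.split("\t") for line in text.splitlines()]


print(parse_csv_like(raw)[1])  # quotes interpreted and unescaped
print(parse_iana(raw)[1])      # quotes kept verbatim as part of the value
```

A tool that assumes the wrong one of these two formats will silently
produce different field values for any row containing quotes.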

¹ <nextstrain/augur#1628>
² <nextstrain/augur#1565>
³ <nextstrain/augur#1566>
tsibley committed Sep 10, 2024
1 parent faebd64 commit 4212c40
Showing 1 changed file with 13 additions and 10 deletions.
23 changes: 13 additions & 10 deletions ingest/rules/nextclade.smk
@@ -62,6 +62,7 @@ rule join_metadata_and_nextclade:
  metadata="data/subset_metadata.tsv",
  output:
  metadata="results/metadata.tsv",
+ nextclade_metadata=temp("results/nextclade_metadata.csv"),
  params:
  metadata_id_field=config["curate"]["output_id_field"],
  nextclade_id_field=config["nextclade"]["id_field"],
@@ -75,13 +76,15 @@ rule join_metadata_and_nextclade:
  --field-map {params.nextclade_field_map:q} \
  --output-metadata - \
  | tsv-select --header --fields {params.nextclade_fields:q} \
- | tsv-join -H \
- --filter-file - \
- --key-fields {params.nextclade_id_field} \
- --data-fields {params.metadata_id_field} \
- --append-fields '*' \
- --write-all ? \
- {input.metadata} \
- | tsv-select -H --exclude {params.nextclade_id_field} \
- > {output.metadata}
- """
+ > {output.nextclade_metadata:q}
+ augur merge \
+ --metadata \
+ metadata={input.metadata:q} \
+ nextclade={output.nextclade_metadata:q} \
+ --metadata-id-columns \
+ metadata={params.metadata_id_field:q} \
+ nextclade={params.nextclade_id_field:q} \
+ --output-metadata {output.metadata:q} \
+ --no-source-columns
+ """
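
As a rough illustration of what the rule's join accomplishes, here is a
plain-Python sketch of joining two tables whose id columns have
different names.  It is not Augur's implementation.  It assumes every
Nextclade id also appears in the metadata table and that the two tables
share no other column names, and it fills unmatched rows with empty
strings, which is a choice made for this example.  The column names
(`accession`, `seqName`, `country`, `clade`) and the function name
`merge_tables` are hypothetical.

```python
def merge_tables(metadata_rows, nextclade_rows, metadata_id, nextclade_id):
    """Append Nextclade columns to each metadata row with a matching id.

    The output keeps the metadata table's id column and drops the
    Nextclade table's id column.
    """
    # Index the Nextclade rows by their id for constant-time lookup.
    nextclade_by_id = {row[nextclade_id]: row for row in nextclade_rows}

    # Columns to append: everything in the Nextclade table except its id.
    first_row = nextclade_rows[0] if nextclade_rows else {}
    extra_columns = [col for col in first_row if col != nextclade_id]

    merged = []
    for row in metadata_rows:
        match = nextclade_by_id.get(row[metadata_id], {})
        # Unmatched rows get empty strings (an assumption of this sketch).
        appended = {col: match.get(col, "") for col in extra_columns}
        merged.append({**row, **appended})
    return merged


# Hypothetical sample data.
metadata = [
    {"accession": "A1", "country": "X"},
    {"accession": "A2", "country": "Y"},
]
nextclade = [{"seqName": "A1", "clade": "c1"}]

result = merge_tables(metadata, nextclade, "accession", "seqName")
for row in result:
    print(row)
```

In the rule itself, the `NAME=VALUE` pairs given to `--metadata` and
`--metadata-id-columns` play the role of the table and id-column
arguments here.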
