Error on duplicates in augur parse
Previously, augur parse would let duplicates pass silently with the following behavior:

- sequence output contains all duplicates
- metadata output contains only the last occurrence of duplicates

This behavior was unintentional and undocumented.
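The mismatch between the two outputs can be reproduced with a minimal sketch. It assumes that sequences are collected one by one while metadata is stored in a dictionary keyed by sequence ID, as the `meta_data[sequence_record.id] = sequence_metadata` line in the diff suggests. The record values and the `sequences_out` name are hypothetical and are not taken from augur.

```python
# Hypothetical (id, sequence) pairs standing in for parsed FASTA records.
records = [("SEQ1", "AAA"), ("SEQ1", "CCC"), ("SEQ2", "GGG")]

sequences_out = []  # stand-in for the sequence output
meta_data = {}      # stand-in for the metadata, keyed by sequence ID

for seq_id, seq in records:
    # Every record is kept, duplicates included.
    sequences_out.append((seq_id, seq))
    # Assigning to an existing key overwrites it, so only the
    # last occurrence of a duplicated ID survives.
    meta_data[seq_id] = {"strain": seq_id, "sequence": seq}
```

Here `sequences_out` ends up with three entries while `meta_data` has two, and the `SEQ1` entry holds the values from its second occurrence.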

This small change will immediately raise an error and exit when a duplicate is detected.
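A minimal sketch of the fail-fast check, outside of augur: `DuplicateError` is a hypothetical stand-in for `AugurError`, and `collect_metadata` is an illustrative helper rather than a function in `augur/parse.py`. Only the membership test and the error message mirror the diff.

```python
class DuplicateError(Exception):
    """Hypothetical stand-in for augur's AugurError."""


def collect_metadata(records):
    """Build a dict of metadata keyed by ID, raising on the first duplicate ID."""
    meta_data = {}
    for seq_id, fields in records:
        # Check before assigning, so an existing entry is never overwritten.
        if seq_id in meta_data:
            raise DuplicateError(f"Duplicate found for '{seq_id}'.")
        meta_data[seq_id] = fields
    return meta_data
```

Because the check runs inside the loop, processing stops at the first repeated ID; later duplicates are never reached or reported.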
victorlin committed May 10, 2022
1 parent 2342f07 commit 7b215a8
Showing 2 changed files with 21 additions and 1 deletion.
4 changes: 3 additions & 1 deletion augur/parse.py
@@ -5,7 +5,7 @@
import sys

from .io import open_file, read_sequences, write_sequences
-from .utils import get_numerical_date_from_value
+from .utils import AugurError, get_numerical_date_from_value

forbidden_characters = str.maketrans(
{' ': None,
@@ -176,6 +176,8 @@ def run(args):
args.prettify_fields,
args.fix_dates
)
+if sequence_record.id in meta_data:
+    raise AugurError(f"Duplicate found for '{sequence_record.id}'.")
meta_data[sequence_record.id] = sequence_metadata

sequences_written = write_sequences(
18 changes: 18 additions & 0 deletions tests/functional/parse.t
@@ -31,4 +31,22 @@ Parse compressed Zika sequences into sequences and metadata.
$ diff -u "parse/metadata.tsv" "$TMP/metadata.tsv"
$ rm -f "$TMP/sequences.fasta" "$TMP/metadata.tsv"

Error on the first duplicate.

$ echo -e '> SEQ1' >> $TMP/data.fasta
$ echo -e 'AAA' >> $TMP/data.fasta
$ echo -e '> SEQ1' >> $TMP/data.fasta
$ echo -e 'AAA' >> $TMP/data.fasta
$ echo -e '> SEQ2' >> $TMP/data.fasta
$ echo -e 'AAA' >> $TMP/data.fasta
$ echo -e '> SEQ2' >> $TMP/data.fasta
$ echo -e 'AAA' >> $TMP/data.fasta
$ ${AUGUR} parse \
> --sequences $TMP/data.fasta \
> --output-sequences "$TMP/sequences.fasta" \
> --output-metadata "$TMP/metadata.tsv" \
> --fields strain
ERROR: Duplicate found for 'SEQ1'.
[2]

$ popd > /dev/null
