
Subcommand for data de-duplication #919

Open · victorlin opened this issue May 10, 2022 · 6 comments
Labels: enhancement (New feature or request)

victorlin commented May 10, 2022

Note: this was proposed in #860, but this issue is for discussion tailored to the de-duplication task.

Context

Currently, the ncov workflow has scripts for removing duplicates from

  1. metadata: sanitize_metadata.get_database_ids_by_strain() + sanitize_metadata.filter_duplicates()
  2. sequences: sanitize_sequences.drop_duplicate_sequences()

The goal is to add this functionality into Augur for more general use, especially to help users that run into situations such as #616 and #915.

Things to address

Name/placement of command

Both options were proposed in #860, but we need to choose one here:

  1. A sub-command under a clean/transform/sanitize command (the parent command's name is also undetermined):

    augur transform deduplicate \
    --sequences sequences.fasta \
    --output-sequences sequences-unique.fasta \
    ...
  2. A direct sub-command of augur

    augur deduplicate \
    --sequences sequences.fasta \
    --output-sequences sequences-unique.fasta \
    ...

Parameters

Here are some parameters and I/O combinations that may be appropriate for this new command.

  1. Metadata I/O:

    augur deduplicate \
    --metadata metadata.tsv \
    --output-metadata metadata-unique.tsv \
    ...

    Currently, in sanitize_metadata.get_database_ids_by_strain(), duplicates are detected by an ID column (e.g., strain) and the set of duplicate IDs is written to a file.
    Then, in sanitize_metadata.filter_duplicates(), duplicates are resolved by sorting records by strain ID and by database accession ID (e.g., Accession ID from GISAID, gisaid_epi_isl, or genbank_accession) in descending order and keeping the record with the most recent accession ID (see the first sketch after this list).

  2. Sequences I/O:

    augur deduplicate \
    --sequences sequences.fasta \
    --output-sequences sequences-unique.fasta \
    ...

    Currently, in sanitize_sequences.drop_duplicate_sequences(), duplicates are detected by sequence name (usually the strain ID, matching strain in the metadata) and only the first occurrence is kept (see the second sketch after this list). This function optionally raises a DuplicateSequenceError, if the user has requested it and if the function finds different sequences for the same sequence name. Multiple identical sequences for the same name are not considered an error.

  3. Metadata + sequences I/O:

    augur deduplicate \
    --metadata metadata.tsv \
    --sequences sequences.fasta \
    --output-metadata metadata-unique.tsv \
    --output-sequences sequences-unique.fasta \
    ...

    If the metadata and sequences are a 1:1 mapping, then we may be able to apply the metadata way of removing duplicates to the sequences counterpart (see the third sketch after this list). However, when there is not a 1:1 mapping between metadata and sequences, this implementation of a deduplicate command could effectively turn into a "filter" command. Common issues include presence of metadata but not sequences, presence of sequences but not metadata, or any number of duplicates in either data type.

  4. Current parameters from ncov/sanitize_metadata.py:

    • --metadata-id-columns: A list of potential id columns for strain names in the metadata.
    • --database-id-columns: A list of potential database id columns whose values can be used to deduplicate records with the same strain name.
    • --metadata-chunk-size: Number of records to read into memory at once from the metadata.
    • --error-on-duplicate-strains: Throw an error when duplicate records are detected.
  5. Current parameters from ncov/sanitize_sequences.py:

    • --error-on-duplicate-strains: Throw an error when duplicate records are detected.
  6. De-duplication method (similar to the keep parameter of pandas.DataFrame.drop_duplicates()):

    augur deduplicate \
    --metadata metadata.tsv \
    --sequences sequences.fasta \
    --output-metadata metadata-unique.tsv \
    --output-sequences sequences-unique.fasta \
    --keep first \
    ...

    In the ncov workflow, this is effectively last for metadata and first for sequences.

  7. Write duplicates to a custom file instead of the hardcoded {metadata}.duplicates.txt:

    augur deduplicate \
    --metadata metadata.tsv \
    --output-metadata metadata-unique.tsv \
    --output-duplicate-ids duplicates.txt \
    ...
  8. Anything else?
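
To make item 1 concrete, here is a minimal pandas sketch of that resolution logic; the column names strain and gisaid_epi_isl are assumptions and would in practice come from --metadata-id-columns and --database-id-columns:

    # Minimal sketch of item 1's logic, assuming hypothetical column names
    # "strain" (ID column) and "gisaid_epi_isl" (database accession column).
    import pandas as pd

    metadata = pd.read_csv("metadata.tsv", sep="\t")

    # Sort so the most recent accession comes first within each strain, then
    # keep the first record per strain.
    deduplicated = (
        metadata
        .sort_values(["strain", "gisaid_epi_isl"], ascending=[True, False])
        .drop_duplicates(subset="strain", keep="first")
    )

    deduplicated.to_csv("metadata-unique.tsv", sep="\t", index=False)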
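
Similarly, a minimal sketch of the sequence behavior described in item 2, assuming Biopython is available; this simplified version keeps each first-seen sequence in memory for the comparison:

    # Keep the first sequence seen per name; optionally error when the same
    # name maps to a *different* sequence. Identical duplicates are dropped
    # silently, matching the behavior described above.
    from Bio import SeqIO

    class DuplicateSequenceError(ValueError):
        pass

    def drop_duplicate_sequences(records, error_on_duplicates=False):
        seen = {}  # first-seen sequence string per name
        for record in records:
            sequence = str(record.seq)
            if record.id not in seen:
                seen[record.id] = sequence
                yield record
            elif error_on_duplicates and seen[record.id] != sequence:
                raise DuplicateSequenceError(record.id)

    unique = drop_duplicate_sequences(
        SeqIO.parse("sequences.fasta", "fasta"),
        error_on_duplicates=True,
    )
    SeqIO.write(unique, "sequences-unique.fasta", "fasta")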
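
And a minimal sketch of the combined mode in item 3 under the 1:1 assumption: deduplicate the metadata first, then keep only sequences whose names survive (column names and the Biopython dependency are again assumptions):

    # Resolve metadata duplicates first, then filter sequences to the
    # surviving strain names, keeping only the first occurrence of each.
    from Bio import SeqIO
    import pandas as pd

    metadata = pd.read_csv("metadata.tsv", sep="\t")
    deduplicated = (
        metadata
        .sort_values(["strain", "gisaid_epi_isl"], ascending=[True, False])
        .drop_duplicates(subset="strain", keep="first")
    )
    deduplicated.to_csv("metadata-unique.tsv", sep="\t", index=False)

    keep = set(deduplicated["strain"])
    seen = set()

    def surviving_sequences():
        for record in SeqIO.parse("sequences.fasta", "fasta"):
            if record.id in keep and record.id not in seen:
                seen.add(record.id)
                yield record

    SeqIO.write(surviving_sequences(), "sequences-unique.fasta", "fasta")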

@joverlee521

One question I have: do we want deduplication to be more general and not specifically tied to "metadata"?
For example, it would be nice to be able to use augur to deduplicate titer data. This would require using a hash of multiple columns as the "id", similar to how fauna creates an index before uploading titer data.

huddlej commented May 10, 2022

The point about a general deduplicate command might help us scope this issue. Deduplicating sequences (FASTA files) is a different process from deduplicating data frames. Sequences often don't include the metadata we need to optimally resolve duplicates, so we have to rely on methods like "keep first". When additional metadata does exist with the sequence content, it is stored in the sequence name as delimited values (e.g., >strain_A|2020-01-01|2020-01-05) where the meaning and names of the metadata are somehow known externally (e.g., "strain", "collection_date", and "submission_date" in the preceding example).

For data frames (usually CSV/TSV files), other tools already exist that could perform the deduplication process accurately and quickly. For example, one can apply csvtk and/or tsv-utils to deduplicate example SARS-CoV-2 metadata from the ncov workflow:

cd ncov/

# csvtk to sort by name and accession and get first record by name.
csvtk sort -t -k "Virus name" -k gisaid_epi_isl:r tests/unsanitized_metadata.tsv | csvtk -t uniq -f "Virus name"

# csvtk to sort by name and accession, tsv-utils to get first record by name.
csvtk sort -t -k "Virus name" -k gisaid_epi_isl:r tests/unsanitized_metadata.tsv | tsv-uniq -H -f "Virus\ name"

# UNIX sort with tsv-utils's keep-header to sort by name and accession and get first record by name.
keep-header tests/unsanitized_metadata.tsv -- sort -k 1,1 -k 4,4r | tsv-uniq -H -f "Virus\ name"

Along this same line of using existing tools to accomplish the deduplication task, we could use seqkit to convert FASTA files to TSVs, apply the same "unique" logic from tsv-uniq to the corresponding data frames, and then convert TSVs back to FASTA files like so:

seqkit fx2tab -Q data/example_sequences.fasta | tsv-uniq -f 1 | seqkit tab2fx

If detection of duplicates is our goal (as opposed to resolution), seqkit also includes options to generate hashes of sequences. The following command generates MD5 hashes of all sequences in the input. We could then pass the names and hashes to a command like tsv-uniq or tsv-summarize to find any duplicates by name and/or sequence content.

# Get sequence names and hashes in TSV format.
seqkit fx2tab -H -s -n data/example_sequences.fasta > hashes.tsv

# Get duplicate sequences by name.
tsv-uniq -H -r -f "#name" hashes.tsv > duplicate_sequences_by_name.tsv

# Get sequence names that map to multiple distinct sequences.
tsv-summarize -H -g "#name" --unique-count "seq.hash" hashes.tsv | tsv-filter -H --str-gt seq.hash_unique_count:1 > duplicate_distinct_sequences.tsv

You can imagine skipping the hash step completely and just applying tsv-summarize on the name and sequence directly.

I just tested out a couple of the above commands on 119,053 sequences from Africa.

# Find duplicate distinct sequences for the same sequence name.
# This ran in 14 seconds.
$ xz -c -d data/sequences_africa.fasta.xz | seqkit fx2tab -H -n -s | tsv-summarize -H -g "#name" --unique-count "seq.hash" | tsv-filter -H --str-gt seq.hash_unique_count:1
#name	seq.hash_unique_count
hCoV-19/Rwanda/U66872-RD/2020|2020-07|2020-11-05	2

# Find duplicate sequences by name (N=48).
# This ran in 10 seconds.
$ xz -c -d data/sequences_africa.fasta.xz | seqkit fx2tab -n | tsv-uniq -r -f 1 | wc -l
48

# Get distinct sequences by name (N=119,004).
# This ran for 1 minute.
$ xz -c -d data/sequences_africa.fasta.xz | seqkit fx2tab -Q | tsv-uniq -f 1 | seqkit tab2fx | xz -2 -c > unique.fasta.xz

These sequences from Africa are a great example of how metadata can appear in the "sequence name", depending on how it is downloaded. For example, I extracted the GISAID sequence named hCoV-19/South Africa/Tygerberg_1233/2020|2020-07-31|2021-07-09 from the full GISAID sequence download file. This sequence name contains the "strain", "submission_date", and "collection_date". We could use this information in the commands above to further resolve duplicates.

# Resolve duplicates by submission date.
# This ran in 1 minute, 46 seconds.
# Sorting like this won't be efficient for larger files.
xz -c -d data/sequences_africa.fasta.xz \
  | seqkit fx2tab -H -Q \
  | csvtk -C '$' -t sep -f '#name' -n strain,submission_date,collection_date -s "|" \
  | csvtk -C '$' -t sort -k strain -k submission_date:r \
  | tsv-uniq -H -f strain \
  | seqkit tab2fx \
  | xz -2 -c > unique.fasta.xz

The sanitize sequences script currently ignores this kind of metadata by throwing away whatever follows the first |. It could be more intelligent about this logic.
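
For example, a hypothetical sketch of parsing those delimited fields instead of discarding them; the field names and their order are assumptions based on the example above:

    # Split delimited metadata out of a sequence name instead of discarding
    # everything after the first "|". Field names/order are assumptions.
    fields = ["strain", "collection_date", "submission_date"]
    name = "hCoV-19/South Africa/Tygerberg_1233/2020|2020-07-31|2021-07-09"
    parsed = dict(zip(fields, name.split("|")))
    # {'strain': 'hCoV-19/South Africa/Tygerberg_1233/2020',
    #  'collection_date': '2020-07-31', 'submission_date': '2021-07-09'}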

We might spend a little time playing around with these approaches that rely on existing tools before we try to implement a command in Augur. We could find that these external tools are sufficient or that a light-weight wrapper command around these would be enough for our needs.

@victorlin

Thanks for compiling that @huddlej! Really useful to know that existing tools are sufficient for these use cases.

I think this sub-command would be more beneficial once we move from TSV/FASTA files to databases down the road (#866).

In the short term, how about:

  1. Make sure Augur is explicit and clear in erroring on duplicates (fix #616 and #915 with proper error messages).
  2. (maybe) Under the Augur docs FAQ, add a page on using external tools to resolve duplicates. Though your issue comment above is searchable via GitHub, so that might be good enough.

huddlej commented May 10, 2022

We should definitely implement the first short-term solution. I'm not sure about the best place to have the discussion about deduplicating data. A tutorial (or how-to guide, depending on the audience) might be easier for users to find and also allow us to contextualize deduplication as part of "data preparation". As we start to support users running workflows for seasonal flu and other pathogens, some of the content in the ncov tutorial will need to bubble up to the top-level docs.

I didn't mean to suggest that the existing tools are completely sufficient for our use cases, but that we should try them out and see where they fail to meet our needs. It's possible that an Augur subcommand could handle some of the Nextstrain-specific details and validation that were implicit in my examples above. For example, the subcommand might check the input metadata for the presence of valid accession id columns and strain id columns before piping the input to csvtk, etc. Similarly for the sequences example, we might want a subcommand that accepts a list of expected metadata fields to find in the sequence name (like augur parse does) so those fields can be passed through to an underlying seqkit command. This kind of wrapper command would allow us to change the implementation details in the future, for whichever data types start to live in a database.
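
To make that concrete, here is a hypothetical sketch of the validation such a wrapper might do before delegating to external tools; the candidate column names and the exact csvtk pipeline are assumptions:

    # Confirm expected ID columns exist in the metadata header, then delegate
    # the actual deduplication to csvtk, as in the examples above.
    import csv
    import subprocess

    METADATA = "metadata.tsv"
    ID_COLUMNS = ["strain", "name", "Virus name"]            # assumed candidates
    DATABASE_COLUMNS = ["gisaid_epi_isl", "genbank_accession"]  # assumed candidates

    with open(METADATA, newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"))

    id_column = next((column for column in ID_COLUMNS if column in header), None)
    database_column = next((column for column in DATABASE_COLUMNS if column in header), None)
    if id_column is None or database_column is None:
        raise SystemExit(f"No strain ID or database ID column found in {header}")

    # Real code would shell-quote more carefully than this sketch does.
    subprocess.run(
        f"csvtk sort -t -k '{id_column}' -k '{database_column}:r' {METADATA}"
        f" | csvtk uniq -t -f '{id_column}'",
        shell=True,
        check=True,
    )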

huddlej commented Jul 7, 2022

While working on the seasonal flu builds with data from FluDB, I noticed that seqkit has an rmdup command that can remove duplicate sequences by name or sequence content and optionally log the duplicates to a file. Adding this command right after running augur parse (to get a FASTA file with strain names only in the header) seems to be enough for the flu workflow:

seqkit rmdup \
    --by-name \
    -D data/h3n2/ha.duplicates.tsv \
    data/h3n2/ha.fasta > data/h3n2/ha.deduplicated.fasta

@victorlin

Update: augur merge is a newly introduced subcommand (#1563) which handles metadata de-duplication across multiple inputs. It will still error out on duplicates within a single input. The plan is to extend this to sequences as well: #1579
