
Subcommand for data de-duplication #919

Open · victorlin opened this issue May 10, 2022 · 6 comments
Labels: enhancement (New feature or request)

victorlin commented May 10, 2022

Note: this was proposed in #860, but this issue is for discussion tailored to the de-duplication task.

Context

Currently, the ncov workflow has scripts for removing duplicates from

  1. metadata: sanitize_metadata.get_database_ids_by_strain() + sanitize_metadata.filter_duplicates()
  2. sequences: sanitize_sequences.drop_duplicate_sequences()

The goal is to add this functionality into Augur for more general use, especially to help users that run into situations such as #616 and #915.

Things to address

Name/placement of command

Both options were proposed in #860, but we need to choose one here:

  1. A sub-command under a clean/transform/sanitize command (the parent command's name is also undetermined):

    augur transform deduplicate \
    --sequences sequences.fasta \
    --output-sequences sequences-unique.fasta \
    ...
  2. A direct sub-command of augur

    augur deduplicate \
    --sequences sequences.fasta \
    --output-sequences sequences-unique.fasta \
    ...

Parameters

Here are some parameters and I/O combinations that may be appropriate for this new command.

  1. Metadata I/O:

    augur deduplicate \
    --metadata metadata.tsv \
    --output-metadata metadata-unique.tsv \
    ...

    Currently, in sanitize_metadata.get_database_ids_by_strain(), duplicates are detected by an ID column (e.g., strain) and the set of duplicate IDs is written to a file.
    Then, in sanitize_metadata.filter_duplicates(), duplicates are resolved by sorting records by strain ID and by database accession ID (e.g., Accession ID from GISAID, gisaid_epi_isl, or genbank_accession) in descending order and keeping the record with the most recent accession ID (see the first sketch after this list).

  2. Sequences I/O:

    augur deduplicate \
    --sequences sequences.fasta \
    --output-sequences sequences-unique.fasta \
    ...

    Currently, in sanitize_sequences.drop_duplicate_sequences(), duplicates are detected by sequence name (usually the strain ID, matching strain in the metadata) and only the first occurrence is kept (see the second sketch after this list). This function optionally raises a DuplicateSequenceError, if the user has requested it and if the function finds different sequences for the same sequence name. Multiple identical sequences for the same name are not considered an error.

  3. Metadata + sequences I/O:

    augur deduplicate \
    --metadata metadata.tsv \
    --sequences sequences.fasta \
    --output-metadata metadata-unique.tsv \
    --output-sequences sequences-unique.fasta \
    ...

    If the metadata and sequences are a 1:1 mapping, then we may be able to apply the metadata way of removing duplicates to the sequences counterpart (see the third sketch after this list). However, when there is not a 1:1 mapping between metadata and sequences, this implementation of a deduplicate command could effectively turn into a "filter" command. Common issues include presence of metadata but not sequences, presence of sequences but not metadata, or any number of duplicates in either data type.

  4. Current parameters from ncov/sanitize_metadata.py:

    • --metadata-id-columns: A list of potential id columns for strain names in the metadata.
    • --database-id-columns: A list of potential database id columns whose values can be used to deduplicate records with the same strain name.
    • --metadata-chunk-size: Number of records to read into memory at once from the metadata.
    • --error-on-duplicate-strains: Throw an error when duplicate records are detected.
  5. Current parameters from ncov/sanitize_sequences.py:

    • --error-on-duplicate-strains: Throw an error when duplicate records are detected.
  6. De-duplication method (similar to the keep parameter of pandas.DataFrame.drop_duplicates()):

    augur deduplicate \
    --metadata metadata.tsv \
    --sequences sequences.fasta \
    --output-metadata metadata-unique.tsv \
    --output-sequences sequences-unique.fasta \
    --keep first \
    ...

    In the ncov workflow, this is effectively last for metadata and first for sequences.

  7. Write duplicates to a custom file instead of the hardcoded {metadata}.duplicates.txt:

    augur deduplicate \
    --metadata metadata.tsv \
    --output-metadata metadata-unique.tsv \
    --output-duplicate-ids duplicates.txt \
    ...
  8. Anything else?
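
To make item 1 concrete, here is a minimal pandas sketch of that resolution logic; the column names strain and gisaid_epi_isl are assumptions and would in practice come from --metadata-id-columns and --database-id-columns:

    # Minimal sketch of item 1's logic, assuming hypothetical column names
    # "strain" (ID column) and "gisaid_epi_isl" (database accession column).
    import pandas as pd

    metadata = pd.read_csv("metadata.tsv", sep="\t")

    # Sort so the most recent accession comes first within each strain, then
    # keep the first record per strain.
    deduplicated = (
        metadata
        .sort_values(["strain", "gisaid_epi_isl"], ascending=[True, False])
        .drop_duplicates(subset="strain", keep="first")
    )

    deduplicated.to_csv("metadata-unique.tsv", sep="\t", index=False)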
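
Similarly, a minimal sketch of the sequence behavior described in item 2, assuming Biopython is available; this simplified version keeps each first-seen sequence in memory for the comparison:

    # Keep the first sequence seen per name; optionally error when the same
    # name maps to a *different* sequence. Identical duplicates are dropped
    # silently, matching the behavior described above.
    from Bio import SeqIO

    class DuplicateSequenceError(ValueError):
        pass

    def drop_duplicate_sequences(records, error_on_duplicates=False):
        seen = {}  # first-seen sequence string per name
        for record in records:
            sequence = str(record.seq)
            if record.id not in seen:
                seen[record.id] = sequence
                yield record
            elif error_on_duplicates and seen[record.id] != sequence:
                raise DuplicateSequenceError(record.id)

    unique = drop_duplicate_sequences(
        SeqIO.parse("sequences.fasta", "fasta"),
        error_on_duplicates=True,
    )
    SeqIO.write(unique, "sequences-unique.fasta", "fasta")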
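
And a minimal sketch of the combined mode in item 3 under the 1:1 assumption: deduplicate the metadata first, then keep only sequences whose names survive (column names and the Biopython dependency are again assumptions):

    # Resolve metadata duplicates first, then filter sequences to the
    # surviving strain names, keeping only the first occurrence of each.
    from Bio import SeqIO
    import pandas as pd

    metadata = pd.read_csv("metadata.tsv", sep="\t")
    deduplicated = (
        metadata
        .sort_values(["strain", "gisaid_epi_isl"], ascending=[True, False])
        .drop_duplicates(subset="strain", keep="first")
    )
    deduplicated.to_csv("metadata-unique.tsv", sep="\t", index=False)

    keep = set(deduplicated["strain"])
    seen = set()

    def surviving_sequences():
        for record in SeqIO.parse("sequences.fasta", "fasta"):
            if record.id in keep and record.id not in seen:
                seen.add(record.id)
                yield record

    SeqIO.write(surviving_sequences(), "sequences-unique.fasta", "fasta")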

@joverlee521

One question I have: do we want deduplication to be more general and not specifically tied to "metadata"?
For example, it would be nice to be able to use augur to deduplicate titer data. This would require using a hash of multiple columns as the "id", similar to how fauna creates an index before uploading titer data.

huddlej commented May 10, 2022

The point about a general deduplicate command might help us scope this issue. Deduplicating sequences (FASTA files) is a different process from deduplicating data frames. Sequences often don't include the metadata we need to optimally resolve duplicates, so we have to rely on methods like "keep first". When additional metadata does exist with the sequence content, it is stored in the sequence name as delimited values (e.g., >strain_A|2020-01-01|2020-01-05) where the meaning and names of the metadata are somehow known externally (e.g., "strain", "collection_date", and "submission_date" in the preceding example).

For data frames (usually CSV/TSV files), other tools already exist that could perform the deduplication process accurately and quickly. For example, one can apply csvtk and/or tsv-utils to deduplicate example SARS-CoV-2 metadata from the ncov workflow:

cd ncov/

# csvtk to sort by name and accession and get first record by name.
csvtk sort -t -k "Virus name" -k gisaid_epi_isl:r tests/unsanitized_metadata.tsv | csvtk -t uniq -f "Virus name"

# csvtk to sort by name and accession, tsv-utils to get first record by name.
csvtk sort -t -k "Virus name" -k gisaid_epi_isl:r tests/unsanitized_metadata.tsv | tsv-uniq -H -f "Virus\ name"

# UNIX sort with tsv-utils's keep-header to sort by name and accession and get first record by name.
keep-header tests/unsanitized_metadata.tsv -- sort -k 1,1 -k 4,4r | tsv-uniq -H -f "Virus\ name"

Along this same line of using existing tools to accomplish the deduplication task, we could use seqkit to convert FASTA files to TSVs, apply the same "unique" logic from tsv-uniq to the corresponding data frames, and then convert TSVs back to FASTA files like so:

seqkit fx2tab -Q data/example_sequences.fasta | tsv-uniq -f 1 | seqkit tab2fx

If detection of duplicates is our goal (as opposed to resolution), seqkit also includes options to generate hashes of sequences. The following command generates MD5 hashes of all sequences in the input. We could then pass the names and hashes to a command like tsv-uniq or tsv-summarize to find any duplicates by name and/or sequence content.

# Get sequence names and hashes in TSV format.
seqkit fx2tab -H -s -n data/example_sequences.fasta > hashes.tsv

# Get duplicate sequences by name.
tsv-uniq -H -r -f "#name" hashes.tsv > duplicate_sequences_by_name.tsv

# Get sequence names that map to multiple distinct sequences.
tsv-summarize -H -g "#name" --unique-count "seq.hash" hashes.tsv | tsv-filter -H --str-gt seq.hash_unique_count:1 > duplicate_distinct_sequences.tsv

You can imagine skipping the hash step completely and just applying tsv-summarize on the name and sequence directly.

I just tested out a couple of the above commands on 119,053 sequences from Africa.

# Find duplicate distinct sequences for the same sequence name.
# This ran in 14 seconds.
$ xz -c -d data/sequences_africa.fasta.xz | seqkit fx2tab -H -n -s | tsv-summarize -H -g "#name" --unique-count "seq.hash" | tsv-filter -H --str-gt seq.hash_unique_count:1
#name	seq.hash_unique_count
hCoV-19/Rwanda/U66872-RD/2020|2020-07|2020-11-05	2

# Find duplicate sequences by name (N=48).
# This ran in 10 seconds.
$ xz -c -d data/sequences_africa.fasta.xz | seqkit fx2tab -n | tsv-uniq -r -f 1 | wc -l
48

# Get distinct sequences by name (N=119,004).
# This ran for 1 minute.
$ xz -c -d data/sequences_africa.fasta.xz | seqkit fx2tab -Q | tsv-uniq -f 1 | seqkit tab2fx | xz -2 -c > unique.fasta.xz

These sequences from Africa are a great example of how metadata can appear in the "sequence name", depending on how it is downloaded. For example, I extracted the GISAID sequence named hCoV-19/South Africa/Tygerberg_1233/2020|2020-07-31|2021-07-09 from the full GISAID sequence download file. This sequence name contains the "strain", "submission_date", and "collection_date". We could use this information in the commands above to further resolve duplicates.

# Resolve duplicates by submission date.
# This ran in 1 minute, 46 seconds.
# Sorting like this won't be efficient for larger files.
xz -c -d data/sequences_africa.fasta.xz \
  | seqkit fx2tab -H -Q \
  | csvtk -C '$' -t sep -f '#name' -n strain,submission_date,collection_date -s "|" \
  | csvtk -C '$' -t sort -k strain -k submission_date:r \
  | tsv-uniq -H -f strain \
  | seqkit tab2fx \
  | xz -2 -c > unique.fasta.xz

The sanitize sequences script currently ignores this kind of metadata by throwing away whatever follows the first |. It could be more intelligent about this logic.
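
For example, a hypothetical sketch of parsing those delimited fields instead of discarding them; the field names and their order are assumptions based on the example above:

    # Split delimited metadata out of a sequence name instead of discarding
    # everything after the first "|". Field names/order are assumptions.
    fields = ["strain", "collection_date", "submission_date"]
    name = "hCoV-19/South Africa/Tygerberg_1233/2020|2020-07-31|2021-07-09"
    parsed = dict(zip(fields, name.split("|")))
    # {'strain': 'hCoV-19/South Africa/Tygerberg_1233/2020',
    #  'collection_date': '2020-07-31', 'submission_date': '2021-07-09'}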

We might spend a little time playing around with these approaches that rely on existing tools before we try to implement a command in Augur. We could find that these external tools are sufficient or that a light-weight wrapper command around these would be enough for our needs.

@victorlin

Thanks for compiling that @huddlej! Really useful to know that existing tools are sufficient for these use cases.

I think this sub-command would be more beneficial once we move from TSV/FASTA files to databases down the road (#866).

In the short term, how about:

  1. Make sure Augur is explicit and clear in erroring on duplicates (fix #616 and #915 with proper error messages).
  2. (maybe) Under the Augur docs FAQ, add a page on using external tools to resolve duplicates. Though your issue comment above is searchable via GitHub, so that might be good enough.

huddlej commented May 10, 2022

We should definitely implement the first short-term solution. I'm not sure about the best place to have the discussion about deduplicating data. A tutorial (or how-to guide, depending on the audience) might be easier for users to find and also allow us to contextualize deduplication as part of "data preparation". As we start to support users running workflows for seasonal flu and other pathogens, some of the content in the ncov tutorial will need to bubble up to the top-level docs.

I didn't mean to suggest that the existing tools are completely sufficient for our use cases, but that we should try them out and see where they fail to meet our needs. It's possible that an Augur subcommand could handle some of the Nextstrain-specific details and validation that were implicit in my examples above. For example, the subcommand might check the input metadata for the presence of valid accession id columns and strain id columns before piping the input to csvtk, etc. Similarly for the sequences example, we might want a subcommand that accepts a list of expected metadata fields to find in the sequence name (like augur parse does) so those fields can be passed through to an underlying seqkit command. This kind of wrapper command would allow us to change the implementation details in the future, for whichever data types start to live in a database.
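
To make that concrete, here is a hypothetical sketch of the validation such a wrapper might do before delegating to external tools; the candidate column names and the exact csvtk pipeline are assumptions:

    # Confirm expected ID columns exist in the metadata header, then delegate
    # the actual deduplication to csvtk, as in the examples above.
    import csv
    import subprocess

    METADATA = "metadata.tsv"
    ID_COLUMNS = ["strain", "name", "Virus name"]            # assumed candidates
    DATABASE_COLUMNS = ["gisaid_epi_isl", "genbank_accession"]  # assumed candidates

    with open(METADATA, newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"))

    id_column = next((column for column in ID_COLUMNS if column in header), None)
    database_column = next((column for column in DATABASE_COLUMNS if column in header), None)
    if id_column is None or database_column is None:
        raise SystemExit(f"No strain ID or database ID column found in {header}")

    # Real code would shell-quote more carefully than this sketch does.
    subprocess.run(
        f"csvtk sort -t -k '{id_column}' -k '{database_column}:r' {METADATA}"
        f" | csvtk uniq -t -f '{id_column}'",
        shell=True,
        check=True,
    )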

huddlej commented Jul 7, 2022

While working on the seasonal flu builds with data from FluDB, I noticed that seqkit has an rmdup command that can remove duplicate sequences by name or sequence content and optionally log the duplicates to a file. Adding this command right after running augur parse (to get a FASTA file with strain names only in the header) seems to be enough for the flu workflow:

seqkit rmdup \
    --by-name \
    -D data/h3n2/ha.duplicates.tsv \
    data/h3n2/ha.fasta > data/h3n2/ha.deduplicated.fasta

@victorlin

Update: augur merge is a newly introduced subcommand (#1563) which handles metadata de-duplication across multiple inputs. It will still error out on duplicates within a single input. The plan is to extend this to sequences as well: #1579
