Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly error on duplicates in parse and filter, handle AugurError globally #918

Merged
merged 6 commits into from
May 12, 2022

Conversation

victorlin
Copy link
Member

@victorlin victorlin commented May 10, 2022

Description of proposed changes

This PR introduces a new AugurError and uses it in parse and filter to properly error on duplicates. See individual commits.

Related issue(s)

Testing

Added tests for new error messages.

@victorlin victorlin self-assigned this May 10, 2022
Comment on lines +179 to +180
if sequence_record.id in meta_data:
raise AugurError(f"Duplicate found for '{sequence_record.id}'.")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another more comprehensive approach - the solution outlined in #616 - is to run through all sequences to capture a list of all duplicates, then error with a complete list of duplicates. However, this would require either:

  1. one pass through the data + large memory usage
  2. two passes with minimal memory usage (1 for duplicate detection, one for writing outputs if no duplicates)
  3. one pass through + temporary files that are moved to the proper location if no duplicates are found

but I'm reluctant to use that as the initial implementation unless a complete list of duplicates is really desired from a user perspective.

I think this small change is sufficient to address the inaccurate behavior described by #616.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if there are a large number of duplicates, it would not be user-friendly to print all to console output.

augur/__init__.py Outdated Show resolved Hide resolved
@victorlin victorlin requested a review from a team May 10, 2022 00:19
@codecov
Copy link

codecov bot commented May 10, 2022

Codecov Report

Merging #918 (182361a) into master (0db0a41) will decrease coverage by 0.06%.
The diff coverage is 31.25%.

❗ Current head 182361a differs from pull request most recent head dd783ec. Consider uploading reports for the commit dd783ec to get more accurate results

@@            Coverage Diff             @@
##           master     #918      +/-   ##
==========================================
- Coverage   34.66%   34.59%   -0.07%     
==========================================
  Files          42       42              
  Lines        6018     6044      +26     
  Branches     1542     1549       +7     
==========================================
+ Hits         2086     2091       +5     
- Misses       3859     3879      +20     
- Partials       73       74       +1     
Impacted Files Coverage Δ
augur/filter.py 68.55% <21.05%> (-1.83%) ⬇️
augur/__init__.py 30.76% <33.33%> (-0.49%) ⬇️
augur/parse.py 62.96% <33.33%> (-1.60%) ⬇️
augur/io.py 97.56% <66.66%> (-2.44%) ⬇️
augur/utils.py 44.31% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0db0a41...dd783ec. Read the comment docs.

@victorlin victorlin changed the title Error on duplicates in parse, handle AugurError globally Properly error on duplicates in parse and filter, handle AugurError globally May 11, 2022
Copy link
Member

@tsibley tsibley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 from me (ideally after applying [fix] commits as fixups). Given @huddlej chimed in too, would be good to get a 👍 from him maybe as well.

Whoops. Wrong PR.

Copy link
Member

@tsibley tsibley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

It's not clear to me why 5c7d6f3 and 44e1910 are separate commits, but not a big deal.

Rename the unused AugurException as AugurError…

Looking at why this is unused led me to f04f357. Seems like it'd be good to switch those places where AugurException was swapped for RuntimeError back now to AugurError.

tests/functional/parse.t Outdated Show resolved Hide resolved
tests/functional/filter.t Outdated Show resolved Hide resolved
augur/io.py Show resolved Hide resolved
@victorlin
Copy link
Member Author

@tsibley It's not clear to me why 5c7d6f3 and 44e1910 are separate commits, but not a big deal.

This is an artifact of me trying to cherry-pick changes from #914. I'll squash together in the rebase.

Previously, augur parse would let duplicates pass silently with the following behavior:

- sequence output contains all duplicates
- metadata output contains only the last occurrence of duplicates

The behavior is unintentional and undocumented.

This small change will immediately raise an error and exit when a duplicate is detected.
This isn't used for anything now, but would make things easier if there is any need to add properties/methods to all custom exceptions in the project.
print(file=sys.stderr) is used a lot in the codebase. This new function print_err acts as an alias to the original call.
Previously, augur filter would let fail on duplicates with a cryptic error:

    AttributeError: 'DataFrame' object has no attribute 'name'

This small change will immediately raise a more meaningful error and properly exit when a duplicate is detected.
@victorlin victorlin merged commit c3d3250 into master May 12, 2022
@victorlin victorlin deleted the victorlin/error-handling branch May 12, 2022 19:14
@victorlin victorlin added this to the Major release 16.0.0 milestone May 12, 2022
@victorlin victorlin added the breaking Makes a backwards incompatible change and should wait for major release label Jun 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Makes a backwards incompatible change and should wait for major release
Projects
No open projects
2 participants