Properly error on duplicates in parse and filter, handle AugurError globally #918
    if sequence_record.id in meta_data:
        raise AugurError(f"Duplicate found for '{sequence_record.id}'.")
Another more comprehensive approach - the solution outlined in #616 - is to run through all sequences to capture a list of all duplicates, then error with a complete list of duplicates. However, this would require either:
- one pass through the data + large memory usage
- two passes with minimal memory usage (one for duplicate detection, one for writing outputs if no duplicates are found)
- one pass through + temporary files that are moved to the proper location if no duplicates are found
but I'm reluctant to use that as the initial implementation unless a complete list of duplicates is really desired from a user perspective.
I think this small change is sufficient to address the inaccurate behavior described by #616.
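To make the trade-off above concrete, here is a minimal sketch contrasting the fail-fast check with the first pass of a two-pass approach that collects every duplicate. The `AugurError` class is redefined locally so the snippet runs on its own, and the function names `check_fail_fast` and `find_all_duplicates` are hypothetical, not part of Augur.

```python
from collections import Counter


class AugurError(Exception):
    """Local stand-in for Augur's error class (assumption, for illustration only)."""


def check_fail_fast(ids):
    """Raise as soon as an ID repeats; keeps a set of the IDs seen so far."""
    seen = set()
    for seq_id in ids:
        if seq_id in seen:
            raise AugurError(f"Duplicate found for '{seq_id}'.")
        seen.add(seq_id)


def find_all_duplicates(ids):
    """Detection pass of a two-pass approach: return every ID seen more than once."""
    counts = Counter(ids)
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)
```

With the fail-fast variant the user fixes one duplicate per run; with the collecting variant the outputs can only be written in a second pass, once the returned list is known to be empty.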
Also, if there are a large number of duplicates, it would not be user-friendly to print all to console output.
Codecov Report
@@            Coverage Diff             @@
##           master     #918      +/-   ##
==========================================
- Coverage   34.66%   34.59%   -0.07%
==========================================
  Files          42       42
  Lines        6018     6044      +26
  Branches     1542     1549       +7
==========================================
+ Hits         2086     2091       +5
- Misses       3859     3879      +20
- Partials       73       74       +1
Rename the unused AugurException as AugurError, to be consistent with NextstrainCliError: https://github.com/nextstrain/cli/blob/8ec779b489254c3abb7bbbf4a8f3694c4ade854e/nextstrain/cli/errors.py#L7
a62b3eb to 7b215a8
👍 from me (ideally after applying [fix] commits as fixups). Given @huddlej chimed in too, would be good to get a 👍 from him maybe as well.
Whoops. Wrong PR.
LGTM.
It's not clear to me why 5c7d6f3 and 44e1910 are separate commits, but not a big deal.
Rename the unused AugurException as AugurError…
Looking at why this is unused led me to f04f357. Seems like it'd be good to switch those places where AugurException was swapped for RuntimeError back now to AugurError.
Previously, augur parse would let duplicates pass silently with the following behavior:
- sequence output contains all duplicates
- metadata output contains only the last occurrence of duplicates
The behavior is unintentional and undocumented. This small change will immediately raise an error and exit when a duplicate is detected.
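A plausible mechanism for the old behavior described above, assuming (as the `if sequence_record.id in meta_data` check suggests) that metadata is collected in a dictionary keyed by sequence ID while sequences are written out one by one. The record values and the `sequences` list are made up for illustration.

```python
# Hypothetical input: "seqA" appears twice with different dates.
records = [("seqA", "2020-01-01"), ("seqB", "2020-02-01"), ("seqA", "2021-03-01")]

sequences = []  # stands in for the sequence output
meta_data = {}  # stands in for the metadata output, keyed by sequence ID

for seq_id, date in records:
    sequences.append(seq_id)  # every record is written, duplicates included
    meta_data[seq_id] = date  # a repeated key silently overwrites the earlier value

print(sequences)  # ['seqA', 'seqB', 'seqA']
print(meta_data)  # {'seqA': '2021-03-01', 'seqB': '2020-02-01'}
```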
This isn't used for anything now, but would make things easier if there is any need to add properties/methods to all custom exceptions in the project.
print(file=sys.stderr) is used a lot in the codebase. This new function print_err acts as an alias to the original call.
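A minimal sketch of what such an alias could look like. Only the name `print_err` comes from the commit message; the signature shown here is an assumption, and the real helper may accept more arguments.

```python
import sys


def print_err(*args):
    """Print to standard error; shorthand for print(*args, file=sys.stderr)."""
    print(*args, file=sys.stderr)
```

A call such as `print_err("ERROR:", "Duplicate found")` then replaces the longer `print("ERROR:", "Duplicate found", file=sys.stderr)`.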
Previously, augur filter would fail on duplicates with a cryptic error:
AttributeError: 'DataFrame' object has no attribute 'name'
This small change will immediately raise a more meaningful error and properly exit when a duplicate is detected.
182361a to dd783ec
Description of proposed changes
This PR introduces a new AugurError and uses it in parse and filter to properly error on duplicates. See individual commits.
Related issue(s)
Testing
Added tests for new error messages.
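For readers unfamiliar with the pattern named in the title, here is a rough sketch of handling one error type globally at a command-line entry point. It is not Augur's actual code: the `run` wrapper, the `ERROR:` prefix, and the exit code 2 are all assumptions chosen for illustration, and `AugurError` is redefined locally so the snippet is self-contained.

```python
import sys


class AugurError(Exception):
    """Local stand-in for Augur's error class (assumption, for illustration only)."""


def run(command):
    """Call a command; turn any AugurError into a stderr message and an exit code."""
    try:
        command()
    except AugurError as error:
        # Expected, user-facing failures get a short message instead of a traceback.
        print(f"ERROR: {error}", file=sys.stderr)
        return 2  # hypothetical nonzero exit code
    return 0


def failing_command():
    """Hypothetical subcommand that hits a duplicate."""
    raise AugurError("Duplicate found for 'seqA'.")
```

An entry point could then end with `sys.exit(run(failing_command))`, so individual subcommands only need to raise `AugurError` and never deal with exit codes themselves.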