Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

augur filter not identifying duplicates, causing pandas error #915

Closed
ammaraziz opened this issue May 6, 2022 · 6 comments · Fixed by #918
Closed

augur filter not identifying duplicates, causing pandas error #915

ammaraziz opened this issue May 6, 2022 · 6 comments · Fixed by #918
Assignees
Labels
bug Something isn't working

Comments

@ammaraziz
Copy link

Current Behavior

When filtering a metadata file that contains duplicates in the strain field, this error is produced:

AttributeError: 'DataFrame' object has no attribute 'name'

Expected behavior

Previous behavior was that the filter subcommand produces a helpful error (see #751 for an example), catching the error before processing data.

How to reproduce

Given a meta.tsv with contents:

strain	date
a	2010-10-10
a	2010-10-10
b	2011-10-10
c	2012-10-10
d	2013-10-10

Run this command:

augur filter --metadata meta.tsv --group-by year --sequences-per-group 2 --output-metadata meta2.tsv

It will produce this error and trace:

Traceback (most recent call last):
  File "/home/amar/miniconda3/bin/augur", line 10, in <module>
    sys.exit(main())
  File "/home/amar/miniconda3/lib/python3.8/site-packages/augur/__main__.py", line 10, in main
    return augur.run( argv[1:] )
  File "/home/amar/miniconda3/lib/python3.8/site-packages/augur/__init__.py", line 75, in run
    return args.__command__.run(args)
  File "/home/amar/miniconda3/lib/python3.8/site-packages/augur/filter.py", line 1544, in run
    subsampled_strains.add(record.name)
  File "/home/amar/miniconda3/lib/python3.8/site-packages/pandas/core/generic.py", line 5465, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'name'

The output of record when there is no duplicate:

date    2011-10-10
Name: b, dtype: object
date    2012-10-10
Name: c, dtype: object
date    2013-10-10
Name: d, dtype: object

The same with duplicates:

              date
strain            
a       2010-10-10
a       2010-10-10

Your environment: if running Nextstrain locally

  • Operating system: linux mint 20 + bash + conda
  • Test on Augur versions: 15.0.1 (latest), 14.0, 13.1.2, 12.1.1. Last version which worked was 11.3

Additional context

The bug is hard to detect when subsampling with grouping option. It will only result in an error when a duplicate strain name is in the queue. This means data is being written to the output but stops when duplicate is in the queue.

@ammaraziz ammaraziz added the bug Something isn't working label May 6, 2022
@victorlin victorlin self-assigned this May 6, 2022
@victorlin
Copy link
Member

victorlin commented May 6, 2022

Thanks for sending this! I can reproduce this on my end, noting that it's more predictable when passing --subsample-seed. This is what I ran to reproduce:

echo -e 'strain\tdate
a\t2010-10-10
a\t2010-10-10
b\t2010-10-10
c\t2010-10-10
d\t2010-10-10
' > meta.tsv

augur filter \
  --metadata meta.tsv \
  --group-by year \
  --sequences-per-group 2 \
  --subsample-seed 0 \
  --output-metadata meta2.tsv

Version 12.0.0 was the last to provide a meaningful output:

ERROR: Problem reading in meta.tsv:
Duplicated strain in metadata: a

And the versions after that come with varying warnings, lack of warnings, and the error AttributeError: 'DataFrame' object has no attribute 'name'.

This is due to the change in 12.1.0 where augur filter reads metadata in chunks - during the process of that refactor, which came with nice improvements, the feature to detect duplicates in metadata was silently lost.

I agree that this uncaught exception should not be exposed to the user. Two things that can be done here (not mutually exclusive):

  1. Show a proper error message on duplicates in augur filter, as previously done.
  2. Implement a new subcommand as outlined in Add augur curate subcommands #860, which include duplicate checking.

@victorlin
Copy link
Member

Another tricky thing here is that a collection of names must be maintained outside of the metadata chunk iteration. Notice that there is no error when using --metadata-chunk-size 1 since each chunk does not have duplicates:

augur filter \
  --metadata meta.tsv \
  --group-by year \
  --sequences-per-group 2 \
  --subsample-seed 0 \
  --metadata-chunk-size 1 \
  --output-metadata meta2.tsv
# 3 strains were dropped during filtering
# 	3 of these were dropped because of subsampling criteria
# 1 strains passed all filters

I believe we already have a similar collection being maintained, so this should not be difficult to fix.

@ammaraziz
Copy link
Author

Thanks for the super quick response Victor!

Any chance of a quick and dirty fix as a stop gap? This bug combined with #616 means duplicates are a hassle to deal with in the current version of augur.

@victorlin
Copy link
Member

@ammaraziz Regarding this exact issue, the "quick and dirty fix" would still result in an error to the user (ERROR: Duplicated strain in metadata vs. a cryptic AttributeError). So it would only be a cosmetic change to user experience, not a functional improvement.

However, I've started planning the new subcommand for removing duplicates, which should solve broader issue of duplicates being a hassle to deal with in Augur: #919

I can't think of anything that can be done quickly to alleviate the pains here, besides work on the de-duplication feature. Please let me know if I missed anything.

@ammaraziz
Copy link
Author

Very good point. I look forward to the new subcommand!

@huddlej
Copy link
Contributor

huddlej commented May 11, 2022

@ammaraziz In the short term, you may want to check out our discussion of how to resolve duplicates in metadata and sequences using other standard bioinformatics tools.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

3 participants