Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tables] Add flag to remove duplicate header rows #89

Merged
merged 2 commits into from
May 22, 2020

Conversation

paulopaixaoamaral
Copy link
Collaborator

@paulopaixaoamaral paulopaixaoamaral commented May 22, 2020

Description

This will add a flag to the table extraction functions to enable the removal of rows which are duplicats of the header row.

Linked issues

Closes #76

Testing

Tests have been added to the table extraction tests to make sure the duplicate header rows are removed.

Checklist

  • I have provided a good description of the change above
  • I have added any necessary tests
  • I have added all necessary type hints
  • I have checked my linting (docker-compose run --rm lint)
  • I have added/updated all necessary documentation
  • I have updated CHANGELOG.md, following the format from
    Keep a Changelog.

Copy link
Owner

@jstockwin jstockwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @paulopaixaoamaral - looks generally good to me.

I've added two suggestions. Also:

  • A test where the element text (but not the font) matches, and also one where the font (but not the text) matches might be good, just to ensure these elements don't get removed. (I think I could e.g. remove font comparison from your function and still have tests pass).
  • However, maybe it's cleaner to pull your inner function outside, since then you can test that in isolation?

@@ -6,6 +6,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]
### Changed
- Added flag to `extract_simple_table` and `extract_table` functions to remove duplicate header rows. ([#89](https://github.com/jstockwin/py-pdf-parser/pull/89))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should probably be in a section called ### Added

Comment on lines 468 to 471
if (elem_1 is None or elem_2 is None) or (
elem_2 is None and elem_1 is not None
):
return False
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the above check, we know they're not BOTH None, so I think this can be simplified to if elem1 is None or elem2 is None?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I did that originally, but I thought it would be better to be explicit for readability purposes. But I guess you're right, it will be readable enough, I am going to change it 👍

Copy link
Owner

@jstockwin jstockwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add feature to remove duplicate header rows
2 participants