[tables] Add flag to remove duplicate header rows #89

paulopaixaoamaral · 2020-05-22T10:55:39Z

Description

This adds a flag to the table extraction functions that removes rows which are duplicates of the header row.
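For illustration, the intended behaviour can be sketched on a plain list-of-lists table. This is only a minimal sketch, not the code in this PR: `drop_repeated_header_rows` is a hypothetical name, and it compares plain strings, whereas the real extraction functions work on PDF elements.

```python
from typing import List

Row = List[str]


def drop_repeated_header_rows(table: List[Row]) -> List[Row]:
    """Return the table without later rows that repeat the header row.

    Hypothetical stand-in: rows are lists of strings here, not PDF elements.
    """
    if not table:
        return []
    header = table[0]
    # Always keep the header itself, then drop any later row equal to it.
    return [header] + [row for row in table[1:] if row != header]


table = [
    ["Name", "Qty"],
    ["apple", "3"],
    ["Name", "Qty"],  # header repeated, e.g. at the top of a new page
    ["pear", "5"],
]
cleaned = drop_repeated_header_rows(table)
print(cleaned)
```

The first row is treated as the header and is always kept; only later rows that are equal to it are dropped.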

Linked issues

Closes #76

Testing

Tests have been added to the table extraction tests to make sure the duplicate header rows are removed.

Checklist

I have provided a good description of the change above
I have added any necessary tests
I have added all necessary type hints
I have checked my linting (docker-compose run --rm lint)
I have added/updated all necessary documentation
I have updated CHANGELOG.md, following the format from Keep a Changelog.

jstockwin

Thanks @paulopaixaoamaral - looks generally good to me.

I've added two suggestions. Also:

A test where the element text (but not the font) matches, and one where the font (but not the text) matches, might be good, just to ensure these elements don't get removed. (I think I could, e.g., remove the font comparison from your function and still have the tests pass.)
However, maybe it's cleaner to pull your inner function outside, since then you can test that in isolation?
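As a rough sketch of what that could look like, assuming the comparison is pulled out into a module-level helper: `FakeElement` and `elements_match` below are hypothetical names, the element is reduced to a `text` and a `font` attribute, and returning `True` when both elements are `None` is an assumption rather than something taken from the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FakeElement:
    """Hypothetical stand-in for a PDF element: just text and font."""

    text: str
    font: str


def elements_match(
    elem_1: Optional[FakeElement], elem_2: Optional[FakeElement]
) -> bool:
    """Return True only if both text and font agree."""
    if elem_1 is None and elem_2 is None:
        # Assumption: two empty cells count as matching.
        return True
    if elem_1 is None or elem_2 is None:
        return False
    return elem_1.text == elem_2.text and elem_1.font == elem_2.font


header = FakeElement(text="Total", font="Helvetica-Bold,12")

# Same text and same font: a duplicate of the header cell.
same = elements_match(header, FakeElement("Total", "Helvetica-Bold,12"))
# Text matches but font differs: must not be treated as a header cell.
text_only = elements_match(header, FakeElement("Total", "Helvetica,10"))
# Font matches but text differs: must not be treated as a header cell.
font_only = elements_match(header, FakeElement("Other", "Helvetica-Bold,12"))
```

With the helper at module level, the text-only and font-only cases can each be asserted directly, so dropping either comparison would make a test fail.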

jstockwin · 2020-05-22T11:11:13Z

CHANGELOG.md

@@ -6,6 +6,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]
 ### Changed
+- Added flag to `extract_simple_table` and `extract_table` functions to remove duplicate header rows. ([#89](https://github.com/jstockwin/py-pdf-parser/pull/89))


I think this should probably be in a section called `### Added`.

jstockwin · 2020-05-22T11:12:27Z

py_pdf_parser/tables.py

+        if (elem_1 is None or elem_2 is None) or (
+            elem_2 is None and elem_1 is not None
+        ):
+            return False


Given the above check, we know they're not BOTH None, so I think this can be simplified to `if elem_1 is None or elem_2 is None`?

paulopaixaoamaral

Yeah, I did that originally, but I thought it would be better to be explicit for readability. I guess you're right, though; it will be readable enough. I'm going to change it 👍
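A quick way to confirm the simplification is safe is to enumerate every None / not-None combination and compare the two conditions. The sketch below copies the condition from the diff as quoted; the function names and the `"element"` placeholder value are only illustrative.

```python
from itertools import product


def quoted_condition(elem_1, elem_2):
    # Condition exactly as it appears in the quoted diff.
    return (elem_1 is None or elem_2 is None) or (
        elem_2 is None and elem_1 is not None
    )


def simplified_condition(elem_1, elem_2):
    # Suggested simplification from the review comment.
    return elem_1 is None or elem_2 is None


# All four combinations of "missing" and "present" for the two elements.
cases = list(product([None, "element"], repeat=2))
results = [
    (quoted_condition(a, b), simplified_condition(a, b)) for a, b in cases
]
all_equal = all(quoted == simplified for quoted, simplified in results)
print(all_equal)
```

The two conditions agree in every case, so the second clause of the quoted condition adds nothing.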

jstockwin

LGTM, thanks!

[tables] Add flag to remove duplicate header rows

d75001f

paulopaixaoamaral force-pushed the remove-duplicate-header-rows branch from 6247868 to d75001f May 22, 2020 10:57

paulopaixaoamaral marked this pull request as ready for review May 22, 2020 10:59

paulopaixaoamaral requested a review from jstockwin May 22, 2020 10:59

paulopaixaoamaral assigned jstockwin May 22, 2020

jstockwin requested changes May 22, 2020

Changed according to CR

0bda4c2

paulopaixaoamaral requested a review from jstockwin May 22, 2020 12:11

jstockwin approved these changes May 22, 2020

paulopaixaoamaral merged commit b4a61a8 into master May 22, 2020

paulopaixaoamaral deleted the remove-duplicate-header-rows branch May 22, 2020 14:22

jstockwin mentioned this pull request Jun 22, 2020

Add remove_duplicate_header_rows flag to a documentation example #97

Open
