Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update form-parsing example in README #958

Merged
merged 8 commits into from
Aug 12, 2023

Conversation

jeremybmerrill
Copy link
Contributor

This form-parsing example handles form fields recursively contained within other form fields, removes the incorrect-assumption that field-names are unique and includes the alternate field name in output (which is often a very useful guide to what's in a field).

the prior form-parsing example used the field-name ("T" entry) as the key in the form_data dict, implicitly assuming that the field name is globally unique within a document. That's not a correct assumption; nested field names are often simply a numeric index like 1 or 0. (Nested fields were previously ignored.)

The prior example also entirely ignored the TU entry alternate field name.

This form-parsing example handles form fields recursively contained within other form fields, removes the incorrect-assumption that field-names are unique and includes the alternate field name in output (which is often a very useful guide to what's in a field).

the prior form-parsing example used the field-name ("T" entry) as the key in the form_data dict, implicitly assuming that the field name is globally unique within a document. That's not a correct assumption; nested field names are often simply a numeric index like 1 or 0. The prior example also entirely ignored the TU entry alternate field name.
@jsvine
Copy link
Owner

jsvine commented Aug 8, 2023

Thanks! A few questions:

  • Given that this does seem to handle nested fields, should we cut this line from right above the code, or is it still true in some way?: "You may have to modify this script to handle cases like nested fields (see page 676 of the specification)."

  • Given your (helpful) familiarity with forms, do you think it'd be possible/wise/unwise to decode the bytes-strings in the results to standard unicode strings?

  • Would you be open to adding a bit of documentation re. what some sample output looks like and what each item in a result represents?

@jeremybmerrill
Copy link
Contributor Author

Given that this does seem to handle nested fields, should we cut this line from right above the code, or is it still true in some way?: "You may have to modify this script to handle cases like nested fields (see page 676 of the specification)."

Yes!

Given your (helpful) familiarity with forms, do you think it'd be possible/wise/unwise to decode the bytes-strings in the results to standard unicode strings?

Let's not overstate my familiarity with this domain :)

I have code to do this. Happy to contribute it, it just seemed a bit out-of-domain. Is there built-in code to pdfplumber for decoding bytestring in the general case? (I just have some munged-together garbage; unclear what's general and what's specific to my specific set of PDFs).

Would you be open to adding a bit of documentation re. what some sample output looks like and what each item in a result represents?

sure!

@jsvine
Copy link
Owner

jsvine commented Aug 10, 2023

Thanks for these updates! One bit of follow-up:

Is there built-in code to pdfplumber for decoding bytestring in the general case?

There is, although I don't know whether it'd work equally well with form data. I'm not 100% confident it would work equally well for form data, but want to give it a shot?

If this feels out of scope or doesn't work, I'm happy just to merge as-is.

@jsvine
Copy link
Owner

jsvine commented Aug 10, 2023

Thanks for that commit! I think the example output might now need tweaking, though?

@codecov
Copy link

codecov bot commented Aug 10, 2023

Codecov Report

Merging #958 (a97ff07) into stable (7c2d46b) will not change coverage.
The diff coverage is n/a.

❗ Current head a97ff07 differs from pull request most recent head 441acdc. Consider uploading reports for the commit 441acdc to get more accurate results

@@            Coverage Diff            @@
##            stable      #958   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           18        18           
  Lines         1588      1588           
=========================================
  Hits          1588      1588           

@jeremybmerrill
Copy link
Contributor Author

adjusted to use resolve_and_decode (which simplifies by project code substantially too, so ty! :) )

it mostly worked for my project as a drop-in replacement for my custom cleaners. It failed when some form field names include other garbage, e.g. 'DATE Of ACCIDENT\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10/Section 2. Accident information vehicle one enter DATE Of ACCIDENT\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e',

Is that common, in your experience? would it be worth removing that \x10 and \x0e crap in resolve_and_decode?

(The only PDF forms I've ever parsed is this one specific set of PDF forms, so I'm deep but not broad)

@jeremybmerrill
Copy link
Contributor Author

@jsvine this is what I'm using to remove the garbage:

def remove_control_characters(s):
    if s is None: return s
    return "".join(ch for ch in s if unicodedata.category(ch)[0]!="C")

curious if that would better belong in resolve_and_decode, or at least in the utils file (rather than living just in the README)

@jsvine
Copy link
Owner

jsvine commented Aug 12, 2023

Thanks! Couple of follow-ups below.

Is that common, in your experience? would it be worth removing that \x10 and \x0e crap in resolve_and_decode?

Hmmm! I haven't come across that much, if at all. Maybe something specific-ish to your set of forms? For that reason, at least for now, I'd lean toward leaving that cruft-reduction out of the README example, though I'm persuadable otherwise. What do you think?

Other than that, just one more bit I noticed, after testing the new README code out on a couple of PDFs (1, 2):

# I.e., adding resolve to imports
from pdfplumber.utils.pdfinternals import resolve, resolve_and_decode

# This line seems to have been accidentally cut, 
# though now realizing it needed to be tweaked slightly, 
# to resolve(...) instead of [...].resolve()
fields = resolve(pdf.doc.catalog["AcroForm"])["Fields"]

@jeremybmerrill
Copy link
Contributor Author

Done!

Let's leave out the control-character removal stuff for now, probably idiosyncratic.

@jsvine jsvine merged commit 4149831 into jsvine:stable Aug 12, 2023
@jsvine
Copy link
Owner

jsvine commented Aug 12, 2023

Great, thanks, and merging!

@jeremybmerrill jeremybmerrill deleted the jeremybmerrill-patch-1 branch August 12, 2023 21:28
@jeremybmerrill
Copy link
Contributor Author

Thank you for building and maintaining this tool!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants