update form-parsing example in README #958

jeremybmerrill · 2023-08-08T00:09:45Z

This form-parsing example handles form fields recursively contained within other form fields, removes the incorrect-assumption that field-names are unique and includes the alternate field name in output (which is often a very useful guide to what's in a field).

the prior form-parsing example used the field-name ("T" entry) as the key in the form_data dict, implicitly assuming that the field name is globally unique within a document. That's not a correct assumption; nested field names are often simply a numeric index like 1 or 0. (Nested fields were previously ignored.)

The prior example also entirely ignored the TU entry alternate field name.

This form-parsing example handles form fields recursively contained within other form fields, removes the incorrect-assumption that field-names are unique and includes the alternate field name in output (which is often a very useful guide to what's in a field). the prior form-parsing example used the field-name ("T" entry) as the key in the form_data dict, implicitly assuming that the field name is globally unique within a document. That's not a correct assumption; nested field names are often simply a numeric index like 1 or 0. The prior example also entirely ignored the TU entry alternate field name.

jsvine · 2023-08-08T14:54:21Z

Thanks! A few questions:

Given that this does seem to handle nested fields, should we cut this line from right above the code, or is it still true in some way?: "You may have to modify this script to handle cases like nested fields (see page 676 of the specification)."
Given your (helpful) familiarity with forms, do you think it'd be possible/wise/unwise to decode the bytes-strings in the results to standard unicode strings?
Would you be open to adding a bit of documentation re. what some sample output looks like and what each item in a result represents?

jeremybmerrill · 2023-08-08T16:04:08Z

Given that this does seem to handle nested fields, should we cut this line from right above the code, or is it still true in some way?: "You may have to modify this script to handle cases like nested fields (see page 676 of the specification)."

Yes!

Given your (helpful) familiarity with forms, do you think it'd be possible/wise/unwise to decode the bytes-strings in the results to standard unicode strings?

Let's not overstate my familiarity with this domain :)

I have code to do this. Happy to contribute it, it just seemed a bit out-of-domain. Is there built-in code to pdfplumber for decoding bytestring in the general case? (I just have some munged-together garbage; unclear what's general and what's specific to my specific set of PDFs).

Would you be open to adding a bit of documentation re. what some sample output looks like and what each item in a result represents?

sure!

jsvine · 2023-08-10T14:52:41Z

Thanks for these updates! One bit of follow-up:

Is there built-in code to pdfplumber for decoding bytestring in the general case?

There is, although I don't know whether it'd work equally well with form data. I'm not 100% confident it would work equally well for form data, but want to give it a shot?

If this feels out of scope or doesn't work, I'm happy just to merge as-is.

jsvine · 2023-08-10T16:57:49Z

Thanks for that commit! I think the example output might now need tweaking, though?

codecov · 2023-08-10T17:01:41Z

Codecov Report

Merging #958 (a97ff07) into stable (7c2d46b) will not change coverage.
The diff coverage is n/a.

❗ Current head a97ff07 differs from pull request most recent head 441acdc. Consider uploading reports for the commit 441acdc to get more accurate results

@@            Coverage Diff            @@
##            stable      #958   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           18        18           
  Lines         1588      1588           
=========================================
  Hits          1588      1588

jeremybmerrill · 2023-08-10T17:23:46Z

adjusted to use resolve_and_decode (which simplifies by project code substantially too, so ty! :) )

it mostly worked for my project as a drop-in replacement for my custom cleaners. It failed when some form field names include other garbage, e.g. 'DATE Of ACCIDENT\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10/Section 2. Accident information vehicle one enter DATE Of ACCIDENT\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e',

Is that common, in your experience? would it be worth removing that \x10 and \x0e crap in resolve_and_decode?

(The only PDF forms I've ever parsed is this one specific set of PDF forms, so I'm deep but not broad)

jeremybmerrill · 2023-08-10T17:59:27Z

@jsvine this is what I'm using to remove the garbage:

def remove_control_characters(s):
    if s is None: return s
    return "".join(ch for ch in s if unicodedata.category(ch)[0]!="C")

curious if that would better belong in resolve_and_decode, or at least in the utils file (rather than living just in the README)

jsvine · 2023-08-12T13:02:02Z

Thanks! Couple of follow-ups below.

Is that common, in your experience? would it be worth removing that \x10 and \x0e crap in resolve_and_decode?

Hmmm! I haven't come across that much, if at all. Maybe something specific-ish to your set of forms? For that reason, at least for now, I'd lean toward leaving that cruft-reduction out of the README example, though I'm persuadable otherwise. What do you think?

Other than that, just one more bit I noticed, after testing the new README code out on a couple of PDFs (1, 2):

# I.e., adding resolve to imports
from pdfplumber.utils.pdfinternals import resolve, resolve_and_decode

# This line seems to have been accidentally cut, 
# though now realizing it needed to be tweaked slightly, 
# to resolve(...) instead of [...].resolve()
fields = resolve(pdf.doc.catalog["AcroForm"])["Fields"]

jeremybmerrill · 2023-08-12T14:10:00Z

Done!

Let's leave out the control-character removal stuff for now, probably idiosyncratic.

jsvine · 2023-08-12T21:04:16Z

Great, thanks, and merging!

jeremybmerrill · 2023-08-12T21:28:40Z

Thank you for building and maintaining this tool!

jeremybmerrill added 2 commits August 7, 2023 14:49

replace opaque lambda thingy with clearer ternary

ceea4cb

jeremybmerrill added 2 commits August 9, 2023 09:54

readme form demo: remove reference to modifications for nested fields

9fea4b2

README form demo: add example output

d2236ea

use resolve_and_decode to decode binary values to unicode

a97ff07

output is now a string not a bytestring

2c0f1f6

jeremybmerrill added 2 commits August 12, 2023 10:02

add back accidentally-removed line

89814b3

use resolve() instead of .resolve()

441acdc

jsvine merged commit 4149831 into jsvine:stable Aug 12, 2023

jeremybmerrill deleted the jeremybmerrill-patch-1 branch August 12, 2023 21:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

update form-parsing example in README #958

update form-parsing example in README #958

jeremybmerrill commented Aug 8, 2023

jsvine commented Aug 8, 2023

jeremybmerrill commented Aug 8, 2023

jsvine commented Aug 10, 2023

jsvine commented Aug 10, 2023

codecov bot commented Aug 10, 2023 •

edited

Loading

jeremybmerrill commented Aug 10, 2023

jeremybmerrill commented Aug 10, 2023

jsvine commented Aug 12, 2023

jeremybmerrill commented Aug 12, 2023

jsvine commented Aug 12, 2023

jeremybmerrill commented Aug 12, 2023

update form-parsing example in README #958

update form-parsing example in README #958

Conversation

jeremybmerrill commented Aug 8, 2023

jsvine commented Aug 8, 2023

jeremybmerrill commented Aug 8, 2023

jsvine commented Aug 10, 2023

jsvine commented Aug 10, 2023

codecov bot commented Aug 10, 2023 • edited Loading

Codecov Report

jeremybmerrill commented Aug 10, 2023

jeremybmerrill commented Aug 10, 2023

jsvine commented Aug 12, 2023

jeremybmerrill commented Aug 12, 2023

jsvine commented Aug 12, 2023

jeremybmerrill commented Aug 12, 2023

codecov bot commented Aug 10, 2023 •

edited

Loading