update form-parsing example in README #958
This form-parsing example handles form fields recursively contained within other form fields, removes the incorrect assumption that field names are unique, and includes the alternate field name in the output (which is often a very useful guide to what's in a field). The prior form-parsing example used the field name ("T" entry) as the key in the form_data dict, implicitly assuming that the field name is globally unique within a document. That's not a correct assumption; nested field names are often simply a numeric index like 1 or 0. (Nested fields were previously ignored.) The prior example also entirely ignored the "TU" entry, the alternate field name.
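To make the naming problem concrete, here is a minimal sketch of the recursive approach described above. It is an illustration under simplifying assumptions, not the README code itself: plain Python dicts stand in for resolved PDF field objects, so no pdfplumber calls are involved, and `parse_field` and the sample data are hypothetical.

```python
from typing import Any, Dict, List, Optional

Field = Dict[str, Any]


def parse_field(rows: List[list], field: Field, prefix: Optional[str] = None) -> None:
    """Append [qualified_name, alternate_name, value] rows for `field` and its kids."""
    # Build a dotted, qualified name so that nested names such as "0" or "1"
    # stay distinguishable when they appear under different parents.
    parts = [p for p in (prefix, field.get("T")) if p]
    name = ".".join(parts)
    # Recurse into child fields, passing the qualified name down as the prefix.
    for kid in field.get("Kids", []):
        parse_field(rows, kid, prefix=name)
    # Record the field itself if it has a name ("T") or an alternate name ("TU").
    if "T" in field or "TU" in field:
        rows.append([name, field.get("TU"), field.get("V")])


# Hypothetical sample data: two parents whose children reuse the name "0".
fields = [
    {"T": "applicant", "TU": "Applicant details", "Kids": [
        {"T": "0", "TU": "First name", "V": "Ada"},
        {"T": "1", "TU": "Last name", "V": "Lovelace"},
    ]},
    {"T": "spouse", "Kids": [
        {"T": "0", "TU": "First name", "V": "William"},
    ]},
]

form_data: List[list] = []
for f in fields:
    parse_field(form_data, f)
```

Collecting rows in a list rather than a dict keyed by "T" means two children both named "0" no longer overwrite each other, and the "TU" column carries the human-readable label alongside the value.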
Thanks! A few questions:
Yes!
Let's not overstate my familiarity with this domain :) I have code to do this. Happy to contribute it; it just seemed a bit out-of-domain. Is there code built into pdfplumber for decoding bytestrings in the general case? (I just have some munged-together garbage; it's unclear what's general and what's specific to my set of PDFs.)
sure! |
Thanks for these updates! One bit of follow-up:
There is, although I'm not 100% confident it would work equally well for form data. Want to give it a shot? If this feels out of scope or doesn't work, I'm happy just to merge as-is.
Thanks for that commit! I think the example output might now need tweaking, though? |
Codecov Report
@@            Coverage Diff            @@
##            stable     #958   +/-   ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files           18       18
  Lines         1588     1588
=========================================
  Hits          1588     1588
Adjusted to use it. It mostly worked for my project as a drop-in replacement for my custom cleaners. It failed when some form field names included other garbage, such as \x10 and \x0e control characters. Is that common, in your experience? Would it be worth removing that \x10 and \x0e crap in resolve_and_decode? (The only PDF forms I've ever parsed are this one specific set, so I'm deep but not broad.)
@jsvine this is what I'm using to remove the garbage:
curious if that would better belong in |
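For readers weighing the same question, one way to strip such control characters from an already-decoded string is sketched below. This is an assumption-laden illustration, not the code used in this pull request and not part of pdfplumber: `strip_control_chars` is a hypothetical helper, and the choice to keep tab, newline, and carriage return is arbitrary.

```python
import re

# ASCII control characters, excluding tab (\x09), newline (\x0a),
# and carriage return (\x0d), which are kept on purpose.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_chars(text: str) -> str:
    """Return `text` with ASCII control characters removed (hypothetical helper)."""
    return CONTROL_CHARS.sub("", text)


# Example input with the kinds of bytes mentioned in the thread.
cleaned = strip_control_chars("\x10Name of applicant\x0e")
```

Applying a step like this after decoding, rather than inside the decoder, keeps the cleanup optional for documents that do not need it.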
Thanks! Couple of follow-ups below.
Hmmm! I haven't come across that much, if at all. Maybe something specific-ish to your set of forms? For that reason, at least for now, I'd lean toward leaving that cruft-reduction out of the README example, though I'm persuadable otherwise. What do you think?
Other than that, just one more bit I noticed, after testing the new README code out on a couple of PDFs (1, 2):
# I.e., adding resolve to imports
from pdfplumber.utils.pdfinternals import resolve, resolve_and_decode
# This line seems to have been accidentally cut,
# though now realizing it needed to be tweaked slightly,
# to resolve(...) instead of [...].resolve()
fields = resolve(pdf.doc.catalog["AcroForm"])["Fields"]
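One plausible reason a function-style `resolve(...)` is safer than calling `.resolve()` on the value: the "AcroForm" entry may be either an indirect reference or a direct dictionary, and only the former has a `.resolve()` method. The sketch below illustrates that idea with stand-in types; `FakeRef` and `resolve_like` are hypothetical names and are not pdfplumber's implementation.

```python
from typing import Any


class FakeRef:
    """Hypothetical stand-in for an indirect PDF object reference."""

    def __init__(self, target: Any) -> None:
        self._target = target

    def resolve(self) -> Any:
        return self._target


def resolve_like(obj: Any) -> Any:
    """Return the referenced object for a reference, or `obj` itself otherwise."""
    return obj.resolve() if isinstance(obj, FakeRef) else obj


# The same catalog entry, once stored directly and once behind a reference.
direct_catalog = {"AcroForm": {"Fields": ["field-a"]}}
indirect_catalog = {"AcroForm": FakeRef({"Fields": ["field-a"]})}

# The function form works in both cases; calling .resolve() on the
# direct dict would raise AttributeError.
fields_direct = resolve_like(direct_catalog["AcroForm"])["Fields"]
fields_indirect = resolve_like(indirect_catalog["AcroForm"])["Fields"]
```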
Done! Let's leave out the control-character removal stuff for now, probably idiosyncratic. |
Great, thanks, and merging! |
Thank you for building and maintaining this tool! |