Skip to content

Commit

Permalink
Merge pull request #320 from jsvine/issue-316
Browse files Browse the repository at this point in the history
Code and commits by @samkit-jain:

* Treat invalid/unparseable metadata values as warnings — Certain invalid values if parseable don't throw a warning and only unparseable (always invalid) throw

* Recursively parse metadata values to handle nested `PDFObjRef` objects — Fixes #316

* Resolve lint issues and remove unused imports

* Make metadata parse failure handling behaviour configurable

* Update tests to bump up test coverage

* Update changelog

Co-authored-by: Samkit Jain <15127115+samkit-jain@users.noreply.github.com>
  • Loading branch information
jsvine and samkit-jain committed Dec 9, 2020
2 parents d3b84da + e849cbb commit f2c510d
Show file tree
Hide file tree
Showing 8 changed files with 66 additions and 19 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,11 +3,17 @@
All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](http://keepachangelog.com/).

## [0.5.25] — Unreleased
## Added
- Add new boolean argument `strict_metadata` (default `False`) to `pdfplumber.open(...)` method for handling metadata resolve failures ([#320](https://github.com/jsvine/pdfplumber/pull/320))

### Fixed
- Fix metadata extraction to handle integer/floating-point values ([#297](https://github.com/jsvine/pdfplumber/issues/297))
- Explicitly load text as utf-8 in `setup.py` ([#304](https://github.com/jsvine/pdfplumber/issues/304))
- Fix `pdfplumber.open(...)` so that it does not close file objects passed to it ([#312](https://github.com/jsvine/pdfplumber/issues/312))

### Changed
- Extend metadata resolver to handle more data types ([#320](https://github.com/jsvine/pdfplumber/pull/320))

## [0.5.24] — 2020-10-20
### Added
- Added `extra_attrs=[...]` parameter to `.extract_text(...)` ([c8b200e](https://github.com/jsvine/pdfplumber/commit/c8b200e)) ([#28](https://github.com/jsvine/pdfplumber/issues/28))
Expand Down
2 changes: 2 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,8 @@ The `open` method returns an instance of the `pdfplumber.PDF` class.

To load a password-protected PDF, pass the `password` keyword argument, e.g., `pdfplumber.open("file.pdf", password = "test")`.

Invalid metadata values are treated as a warning by default. If that is not intended, pass `strict_metadata=True` to the `open` method and `pdfplumber.open` will raise an exception if it is unable to parse the metadata.

### The `pdfplumber.PDF` class

The top-level `pdfplumber.PDF` class represents a single PDF and has two main properties:
Expand Down
38 changes: 25 additions & 13 deletions pdfplumber/pdf.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,8 @@
from .container import Container
from .page import Page
from .utils import decode_text
from .utils import resolve_and_decode

import logging
import pathlib
import itertools
from pdfminer.pdfparser import PDFParser
Expand All @@ -10,13 +11,22 @@
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
from pdfminer.psparser import PSLiteral

logger = logging.getLogger(__name__)


class PDF(Container):
cached_properties = Container.cached_properties + ["_pages"]

def __init__(self, stream, pages=None, laparams=None, precision=0.001, password=""):
def __init__(
self,
stream,
pages=None,
laparams=None,
precision=0.001,
password="",
strict_metadata=False,
):
self.laparams = None if laparams is None else LAParams(**laparams)
self.stream = stream
self.pages_to_parse = pages
Expand All @@ -27,16 +37,18 @@ def __init__(self, stream, pages=None, laparams=None, precision=0.001, password=
for info in self.doc.info:
self.metadata.update(info)
for k, v in self.metadata.items():
if hasattr(v, "resolve"):
v = v.resolve()
if type(v) == list:
self.metadata[k] = list(map(decode_text, v))
elif isinstance(v, PSLiteral):
self.metadata[k] = decode_text(v.name)
elif isinstance(v, (str, bytes)):
self.metadata[k] = decode_text(v)
else:
self.metadata[k] = v
try:
self.metadata[k] = resolve_and_decode(v)
except Exception as e:
if strict_metadata:
# Raise an exception since unable to resolve the metadata value.
raise
# This metadata value could not be parsed. Instead of failing the PDF
# read, treat it as a warning only if ``strict_metadata=False``.
logger.warning(
f'[WARNING] Metadata key "{k}" could not be parsed due to '
f"exception: {str(e)}"
)
self.device = PDFPageAggregator(rsrcmgr, laparams=self.laparams)
self.interpreter = PDFPageInterpreter(rsrcmgr, self.device)

Expand Down
18 changes: 18 additions & 0 deletions pdfplumber/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -78,6 +78,24 @@ def decode_text(s):
return "".join(PDFDocEncoding[o] for o in ords)


def resolve_and_decode(obj):
"""Recursively resolve the metadata values."""
if hasattr(obj, "resolve"):
obj = obj.resolve()
if isinstance(obj, list):
return list(map(resolve_and_decode, obj))
elif isinstance(obj, PSLiteral):
return decode_text(obj.name)
elif isinstance(obj, (str, bytes)):
return decode_text(obj)
elif isinstance(obj, dict):
for k, v in obj.items():
obj[k] = resolve_and_decode(v)
return obj

return obj


def decode_psl_list(_list):
return [
decode_text(value.name) if isinstance(value, PSLiteral) else value
Expand Down
6 changes: 3 additions & 3 deletions requirements-dev.txt
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
pytest
pytest-cov
pytest-parallel
pytest==6.1.2
pytest-cov==2.10.1
pytest-parallel==0.1.0
flake8==3.8.3
black==20.8b0
Binary file added tests/pdfs/issue-316-example.pdf
Binary file not shown.
4 changes: 3 additions & 1 deletion tests/test_display.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,9 +22,11 @@ def teardown_class(self):

def test_basic_conversion(self):
self.im.reset()
self.im.draw_rect(self.im.page.rects[0])
self.im.draw_rects(self.im.page.rects)
self.im.draw_circle(self.im.page.chars[0])
self.im.draw_line(self.im.page.edges[0])
self.im.draw_vlines([10])
self.im.draw_hlines([10])

def test_debug_tablefinder(self):
self.im.reset()
Expand Down
11 changes: 9 additions & 2 deletions tests/test_issues.py
Original file line number Diff line number Diff line change
@@ -1,8 +1,7 @@
#!/usr/bin/env python
import unittest
import pdfplumber
import sys, os
import six
import os

import logging
logging.disable(logging.ERROR)
Expand Down Expand Up @@ -168,3 +167,11 @@ def test_issue_297(self):
path = os.path.join(HERE, "pdfs/issue-297-example.pdf")
with pdfplumber.open(path) as pdf:
assert isinstance(pdf.metadata["Copies"], int)

def test_issue_316(self):
"""
Handle invalid metadata
"""
path = os.path.join(HERE, "pdfs/issue-316-example.pdf")
with pdfplumber.open(path) as pdf:
assert pdf.metadata["Changes"][0]["CreationDate"] == "D:20061207105020Z00'00'"

3 comments on commit f2c510d

@Abdur-rahmaanJ
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsvine @samkit-jain tell me how you resolved the issue, just as a curiosity!

@samkit-jain
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Abdur-rahmaanJ I'll focus on the core issue that affected the PDF you shared. Prior to this change, pdfplumber expected a metadata value of type list to contain only strings. In the PDF you shared, that wasn't the case. There was a list, not of strings, but of metadata objects itself. Think of it like this, say, a metadata object looks like

{
  "k1": "v1",
  "k2": ["v2", "v3"]
}

In your case, the object came out to be

{
  "k1": "v1",
  "k2": [
    {
      "k1": "vv1",
      "k2": ["vv2", "vv3"]
    },
    {
      "k1": "vvv1",
      "k2": ["vvv2", "vvv3"]
    }
  ]
}

That is, instead of being

M = {
  "k1": "v1",
  "k2": ["v2", "v3"]
}

it was

M1 = {
  "k1": "v1",
  "k2": [M, M]
}

So, I made the resolver method recursive to account for such cases. The method would continue resolving Ms at all nested levels until all are resolved.

Was I able to satisfy your curiosity?

@Abdur-rahmaanJ
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samkit-jain so it was not in the PDF specs ^^, thank you very much for this answer!

Please sign in to comment.