Visual debugging raise ValueError: Decompressed Data Too Large #413

Closed
holytony opened this issue Apr 15, 2021 · 6 comments

holytony commented Apr 15, 2021

Describe the bug

While trying to use the visual debugging tool in pdfplumber, the module failed to convert a certain PDF file to a PageImage object and raised ValueError: Decompressed Data Too Large, which seems to be caused by an issue in the Pillow module.

By the way, here is the PDF file I was dealing with:
pdf_file_link

Interestingly enough, if I drag several pages out of the original PDF to make a new PDF using the Preview app on Mac, the problem is gone and everything works fine. Even after setting the resolution to a very high value (500), it still runs smoothly.

Just for the sake of testing the limit of systems, after setting the resolution to 1500, I did manage to raise a DecompressionBombError.
PIL.Image.DecompressionBombError: Image size (217518678 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack.
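
For reference, the numbers in that message can be checked by hand. The 178956970 limit is twice Pillow's default `Image.MAX_IMAGE_PIXELS` (89478485): Pillow warns above the limit and raises above double it. The reported 217518678 pixels equals 12402 × 17539, which is what an A4 page rendered at 1500 DPI would give; the A4 page size is an assumption here, not something stated in the report, and the helper names below are made up for illustration. A stdlib-only sketch of the arithmetic:

```python
# Mirrors Pillow's default Image.MAX_IMAGE_PIXELS.
DEFAULT_MAX_IMAGE_PIXELS = 1024 * 1024 * 1024 // 4 // 3  # 89478485


def rendered_pixels(width_pt, height_pt, resolution):
    """Pixel dimensions of a page whose size is given in PDF points (1/72 inch)."""
    scale = resolution / 72
    return round(width_pt * scale), round(height_pt * scale)


def bomb_check(pixels, limit=DEFAULT_MAX_IMAGE_PIXELS):
    """Classify a pixel count the way Pillow's decompression-bomb check does."""
    if pixels > 2 * limit:
        return "error"    # DecompressionBombError
    if pixels > limit:
        return "warning"  # DecompressionBombWarning
    return "ok"


# Assumed A4 page: 595.276 x 841.890 points.
for resolution in (72, 500, 1500):
    w, h = rendered_pixels(595.276, 841.890, resolution)
    print(resolution, w, h, w * h, bomb_check(w * h))
```

Under these assumptions only the 1500 DPI render crosses the hard limit, which matches the behaviour described above.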

So, to summarize, when using .to_image:

original file -> ValueError: Decompressed Data Too Large
dragged_pages_using_Preview -> no problem whatsoever
original file + very high resolution (1500) -> DecompressionBombError
dragged_pages_using_Preview + very high resolution (1500) -> DecompressionBombError

Is there something special about the original file?
pdf_file_link

Code to reproduce the problem

import pdfplumber

# pdf_filename: path to the PDF linked above
with pdfplumber.open(pdf_filename) as pdf:
    for page_num, page in enumerate(pdf.pages):
        im = page.to_image(resolution=72)

Actual behavior

Traceback (most recent call last):
  File "/Users/Tc/PycharmProjects/plumber_test_3/pdf_visual_debugging.py", line 188, in <module>
    visualize('test_page.pdf')
  File "/Users/Tc/PycharmProjects/plumber_test_3/pdf_visual_debugging.py", line 172, in visualize
    im = page.to_image(resolution=72)
  File "/Users/Tc/PycharmProjects/plumber_test_3/lib/python3.8/site-packages/pdfplumber/page.py", line 299, in to_image
    return PageImage(self, **kwargs)
  File "/Users/Tc/PycharmProjects/plumber_test_3/lib/python3.8/site-packages/pdfplumber/display.py", line 53, in __init__
    self.original = get_page_image(
  File "/Users/Tc/PycharmProjects/plumber_test_3/lib/python3.8/site-packages/pdfplumber/display.py", line 45, in get_page_image
    im = PIL.Image.open(BytesIO(png.make_blob()))
  File "/Users/Tc/PycharmProjects/plumber_test_3/lib/python3.8/site-packages/PIL/Image.py", line 2944, in open
    im = _open_core(fp, filename, prefix, formats)
  File "/Users/Tc/PycharmProjects/plumber_test_3/lib/python3.8/site-packages/PIL/Image.py", line 2930, in _open_core
    im = factory(fp, filename)
  File "/Users/Tc/PycharmProjects/plumber_test_3/lib/python3.8/site-packages/PIL/ImageFile.py", line 121, in __init__
    self._open()
  File "/Users/Tc/PycharmProjects/plumber_test_3/lib/python3.8/site-packages/PIL/PngImagePlugin.py", line 694, in _open
    s = self.png.call(cid, pos, length)
  File "/Users/Tc/PycharmProjects/plumber_test_3/lib/python3.8/site-packages/PIL/PngImagePlugin.py", line 187, in call
    return getattr(self, "chunk_" + cid.decode("ascii"))(pos, length)
  File "/Users/Tc/PycharmProjects/plumber_test_3/lib/python3.8/site-packages/PIL/PngImagePlugin.py", line 545, in chunk_zTXt
    v = _safe_zlib_decompress(v[1:])
  File "/Users/Tc/PycharmProjects/plumber_test_3/lib/python3.8/site-packages/PIL/PngImagePlugin.py", line 133, in _safe_zlib_decompress
    raise ValueError("Decompressed Data Too Large")
ValueError: Decompressed Data Too Large
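
The last two frames show where the error comes from: while reading a compressed text (zTXt) chunk of the PNG, Pillow decompresses only up to `PngImagePlugin.MAX_TEXT_CHUNK` bytes and raises if compressed input is left over. Below is a minimal stdlib sketch of that kind of bounded decompression. It is modeled on the behaviour visible in the traceback, not copied from Pillow; the 1 MiB default and the name `bounded_decompress` are assumptions made for illustration.

```python
import zlib

MAX_TEXT_CHUNK = 1024 * 1024  # assumed default limit, in bytes


def bounded_decompress(data, limit=MAX_TEXT_CHUNK):
    """Decompress zlib data, refusing to produce more than `limit` bytes."""
    decompressor = zlib.decompressobj()
    plaintext = decompressor.decompress(data, limit)
    # Compressed input left unconsumed means the output limit was hit first.
    if decompressor.unconsumed_tail:
        raise ValueError("Decompressed Data Too Large")
    return plaintext


# Stand-in for a large text chunk (e.g. embedded metadata), roughly 19 KB.
payload = "".join(str(i) for i in range(5000)).encode("ascii")
compressed = zlib.compress(payload)

print(len(bounded_decompress(compressed)))  # fits under the default limit
try:
    bounded_decompress(compressed, limit=1024)
except ValueError as exc:
    print("rejected:", exc)
```

This would explain why the error depends on the file and not on the requested resolution: the limit applies to text stored in the PNG, not to its pixel data.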

Environment

  • pdfplumber version: [e.g., 0.5.22]
  • Python version: [ 3.8]
  • OS: [ Mac]

holytony added the bug label Apr 15, 2021
samkit-jain self-assigned this Apr 15, 2021
samkit-jain (Collaborator) commented

Hi @holytony, I appreciate your interest in the library. Can you try modifying the memory values in the file at /etc/ImageMagick-6/policy.xml? (The actual path to policy.xml might be different on your computer.) The portion of interest would be something like:

<policymap>
  <policy domain="resource" name="memory" value="256MiB"/>
  <policy domain="resource" name="map" value="512MiB"/>
  <policy domain="resource" name="width" value="16KP"/>
  <policy domain="resource" name="height" value="16KP"/>
  <policy domain="resource" name="area" value="128MB"/>
  <policy domain="resource" name="disk" value="1GiB"/>
</policymap>

I also found this StackOverflow question that might be helpful.
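
To see which limits in a policy.xml are actually in effect, it can help to list only the uncommented resource entries. The following is a rough sketch using Python's standard XML parser; the XML string is example content, not anyone's real configuration, and `active_resource_limits` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Example content only; a real policy.xml lives wherever ImageMagick was
# installed, and the path differs between systems and versions.
POLICY_XML = """\
<policymap>
  <!-- <policy domain="resource" name="disk" value="1GiB"/> -->
  <policy domain="resource" name="memory" value="256MiB"/>
  <policy domain="resource" name="width" value="16KP"/>
</policymap>
"""


def active_resource_limits(xml_text):
    """Return {name: value} for every uncommented resource policy."""
    root = ET.fromstring(xml_text)  # XML comments are dropped by the parser
    return {
        policy.get("name"): policy.get("value")
        for policy in root.iter("policy")
        if policy.get("domain") == "resource"
    }


print(active_resource_limits(POLICY_XML))
```

Entries wrapped in XML comments do not show up in the result, so a file consisting only of commented-out defaults yields an empty dictionary.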

holytony (Author) commented Apr 15, 2021

This is the policy.xml I have with my ImageMagick 7.
Judging from the default values in the file, it doesn't seem to be the cause.

<!-- <policy domain="resource" name="temporary-path" value="/tmp"/> -->
<!-- <policy domain="resource" name="memory" value="2GiB"/> -->
<!-- <policy domain="resource" name="map" value="4GiB"/> -->
<!-- <policy domain="resource" name="width" value="10KP"/> -->
<!-- <policy domain="resource" name="height" value="10KP"/> -->
<!-- <policy domain="resource" name="list-length" value="128"/> -->
<!-- <policy domain="resource" name="area" value="100MP"/> -->
<!-- <policy domain="resource" name="disk" value="16EiB"/> -->
<!-- <policy domain="resource" name="file" value="768"/> -->
<!-- <policy domain="resource" name="thread" value="4"/> -->
<!-- <policy domain="resource" name="throttle" value="0"/> -->
<!-- <policy domain="resource" name="time" value="3600"/> -->
<!-- <policy domain="coder" rights="none" pattern="MVG" /> -->
<!-- <policy domain="module" rights="none" pattern="{PS,PDF,XPS}" /> -->
<!-- <policy domain="delegate" rights="none" pattern="HTTPS" /> -->
<!-- <policy domain="path" rights="none" pattern="@*" /> -->
<!-- <policy domain="cache" name="memory-map" value="anonymous"/> -->
<!-- <policy domain="cache" name="synchronize" value="True"/> -->
<!-- <policy domain="cache" name="shared-secret" value="passphrase" stealth="true"/> -->
<!-- <policy domain="system" name="max-memory-request" value="256MiB"/> -->
<!-- <policy domain="system" name="shred" value="2"/> -->
<!-- <policy domain="system" name="precision" value="6"/> -->
<policy domain="system" name="font" value="/path/to/unicode-font.ttf"/>

samkit-jain (Collaborator) commented Apr 17, 2021

I tried replicating the issue on my machine but couldn't using the code you provided and the PDF. Did you try the answers provided at https://stackoverflow.com/q/42671252/7760998?

holytony (Author) commented

> I tried replicating the issue on my machine but couldn't using the code you provided and the PDF. Did you try the answers provided at https://stackoverflow.com/q/42671252/7760998?

Nah, unfortunately that didn't work.

jsvine (Owner) commented Aug 31, 2021

I found myself in a similar situation, and this worked for me:

from PIL import PngImagePlugin
LARGE_ENOUGH_NUMBER = 100
PngImagePlugin.MAX_TEXT_CHUNK = LARGE_ENOUGH_NUMBER * (1024**2)
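
Assigning to `PngImagePlugin.MAX_TEXT_CHUNK` changes the limit for the whole process. If that is broader than wanted, one option is to raise it only around the call that needs it. The helper below is a hypothetical sketch (`temporary_attr` is not part of Pillow or pdfplumber), demonstrated on a stand-in object so that it runs without Pillow installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_attr(obj, name, value):
    """Set obj.name to value inside the with-block, then restore the old value."""
    old = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield
    finally:
        setattr(obj, name, old)


# Stand-in for the PngImagePlugin module, with an assumed 1 MiB default.
fake_plugin = SimpleNamespace(MAX_TEXT_CHUNK=1024 * 1024)

with temporary_attr(fake_plugin, "MAX_TEXT_CHUNK", 100 * 1024**2):
    print("inside:", fake_plugin.MAX_TEXT_CHUNK)
print("after:", fake_plugin.MAX_TEXT_CHUNK)
```

With Pillow installed, the same helper could presumably wrap the call from the report, for example `with temporary_attr(PngImagePlugin, "MAX_TEXT_CHUNK", 100 * 1024**2): im = page.to_image(resolution=72)`.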

jsvine (Owner) commented Jul 20, 2022

Update — this more reliably fixes things for me:

from PIL import Image
Image.MAX_IMAGE_PIXELS = 1000000000  # Or sufficiently large number

Will be adding a try/except/raise to get_page_image(...), with the error message suggesting this fix and pointing people to this thread.
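
A rough sketch of what such a wrapper might look like follows. This is not pdfplumber's actual code: `open_with_hint` and `fake_loader` are hypothetical names, and `ValueError` is caught only because that is the exception type in the traceback at the top of this thread. Pillow's `DecompressionBombError` would likely need its own `except` clause.

```python
HINT = (
    "Image too large for Pillow's current limits. Consider raising "
    "PIL.Image.MAX_IMAGE_PIXELS or PIL.PngImagePlugin.MAX_TEXT_CHUNK."
)


def open_with_hint(load_image, *args, **kwargs):
    """Call load_image and re-raise size-limit errors with a suggested fix."""
    try:
        return load_image(*args, **kwargs)
    except ValueError as exc:
        # Chain the original exception so its traceback is preserved.
        raise ValueError(f"{exc}. {HINT}") from exc


def fake_loader(_data):
    # Simulates the failure reported in this issue.
    raise ValueError("Decompressed Data Too Large")


try:
    open_with_hint(fake_loader, b"")
except ValueError as exc:
    print(exc)
```

Chaining with `from exc` keeps the original Pillow frames visible, so the hint adds information without hiding where the failure happened.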
