'_io.BytesIO' object has no attribute 'name' #124

OisinMoran · 2019-06-27T13:45:25Z

The to_image() method does not seem to work if the pdfplumber.PDF object was created using a BytesIO stream. The rest of the functionality seems unaffected.

The problem seems to arise in the call to wand.image.Image() in the get_page_image() function in display.py. wand.image.Image() can take file objects via its file argument (explained here), but get_page_image() only ever uses the filename parameter. Line 42 of display.py, in the PageImage class, also looks up the name of the stream, but BytesIO objects do not have a name. Extracting characters, rectangles, etc. can still be done with these BytesIO objects.
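To make the missing attribute concrete, here is a minimal sketch using only the standard library. The helper stream_name and the file name sample.pdf are hypothetical, introduced just for illustration; they are not part of pdfplumber.

```python
import io
import os
import tempfile


def stream_name(stream):
    """Return the stream's file name, or None if it has no usable one."""
    name = getattr(stream, "name", None)
    # open() on a file descriptor sets .name to an int, so accept only strings.
    return name if isinstance(name, str) else None


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.pdf")  # hypothetical file name
    with open(path, "wb") as out:
        out.write(b"%PDF-1.4\n")  # placeholder bytes, not a valid PDF

    with open(path, "rb") as on_disk:           # _io.BufferedReader
        disk_name = stream_name(on_disk)        # the path passed to open()
        in_memory = io.BytesIO(on_disk.read())  # _io.BytesIO

memory_name = stream_name(in_memory)            # None: BytesIO has no .name
```

Any code that reads stream.name unconditionally will therefore fail on the in-memory object, which matches the AttributeError reported here.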

The MWE:

import pdfplumber
from io import BytesIO

file_path = "file.pdf"

# This example successfully extracts chars and makes an image
file_like_object = open(file_path, "rb") # _io.BufferedReader object
first_page = pdfplumber.load(file_like_object).pages[0]
chars = first_page.chars
im = first_page.to_image()

# This example successfully extracts chars but does not make an image
file_like_object.seek(0)
different_file_like_object = BytesIO(file_like_object.read()) # _io.BytesIO object
first_page_2 = pdfplumber.load(different_file_like_object).pages[0]
chars_2 = first_page_2.chars
im = first_page_2.to_image()

Gives the error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-3f3cda6c1277> in <module>
      5 first_page_2 = pdfplumber.load(different_file_like_object).pages[0]
      6 chars_2 = first_page_2.chars
----> 7 im = first_page_2.to_image()

~/web_app/lib/python3.6/site-packages/pdfplumber/page.py in to_image(self, **conversion_kwargs)
    256         if "resolution" not in conversion_kwargs:
    257             kwargs["resolution"] = DEFAULT_RESOLUTION
--> 258         return PageImage(self, **kwargs)
    259 
    260 class DerivedPage(Page):

~/web_app/lib/python3.6/site-packages/pdfplumber/display.py in __init__(self, page, original, resolution)
     40         if original == None:
     41             self.original = get_page_image(
---> 42                 page.pdf.stream.name,
     43                 page.page_number - 1,
     44                 resolution

AttributeError: '_io.BytesIO' object has no attribute 'name'

Not sure how best to fix this issue.

ubmarco · 2020-04-27T09:36:57Z

Hi, I just want to flag that PR #179 breaks the to_image function for me.
Ghostscript runs for dozens of seconds on a page crop, and once it returns, Python eats up all 16GB on my Linux machine until it becomes unresponsive. I'm still debugging why this happens, but I wanted to communicate that early.

ubmarco · 2020-04-27T10:44:58Z

So I pinpointed the issue. It is related to the resolution and the number of pages.
I modified the test function to this:

import io
import pdfplumber

TEST_PDF = 'WARN-Report-for-7-1-2015-to-03-25-2016_times3.pdf'


def bytes_stream_to_image():
    page = pdfplumber.PDF(io.BytesIO(open(TEST_PDF, 'rb').read())).pages[0]
    im = page.to_image(resolution=300)
    im.save('out.png', 'png')


bytes_stream_to_image()

The TEST_PDF is the one from pdfplumber 'WARN-Report-for-7-1-2015-to-03-25-2016.pdf'. I extended it by using PDFSAM and merging it 3 times with itself, so instead of 16 pages it now has 48 pages. Here is the file:
WARN-Report-for-7-1-2015-to-03-25-2016_times3.pdf

With this, GS runs for approx. a minute at full load on one core; once it returns, Python starts eating all my memory while CPU usage stays very high.

This doesn't happen with the old implementation; there, the test function above exits after approx. 5 seconds on the same machine.

ubmarco · 2020-04-27T11:19:56Z

The difference comes in lib/python3.7/site-packages/wand/image.py in function def read() on line 8716. The former implementation called MagickReadImage, whereas the bytes-based implementation now calls MagickReadImageBlob:

        if blob is not None:
            if not isinstance(blob, abc.Iterable):
                raise TypeError('blob must be iterable, not ' +
                                repr(blob))
            if not isinstance(blob, binary_type):
                blob = b''.join(blob)
            r = library.MagickReadImageBlob(self.wand, blob, len(blob))
        elif filename is not None:
            filename = encode_filename(filename)
            r = library.MagickReadImage(self.wand, filename)

MagickReadImage instantiates the reader within the C extension while for MagickReadImageBlob the binary data comes from the Python interface. I don't know how to debug that any further, any advice?

jsvine · 2020-04-27T13:04:00Z

Thank you for flagging this, @ubmarco! I'm not terribly familiar with Wand's internals, so may have to do some additional research. But, in the meantime, what do you think of this short-term solution?:

Revert .to_image to its previous state, but check hasattr(page.pdf.stream, "name")
If hasattr(page.pdf.stream, "name") is False (because user is working directly with a stream, rather than a saved file), then process the stream bytes instead
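The two steps above could be sketched roughly as follows. This is not the code that ended up in pdfplumber; image_source_kwargs is a hypothetical helper that only decides which keyword argument (filename or blob, the two branches visible in the Wand read() excerpt quoted earlier) a caller would pass on.

```python
import io


def image_source_kwargs(stream):
    """Choose how to hand a PDF to the renderer: by path if the stream
    has one, otherwise by its raw bytes."""
    name = getattr(stream, "name", None)
    if isinstance(name, str):
        # Saved file: let the renderer open it by path (the previous behaviour).
        return {"filename": name}

    # No usable name (e.g. io.BytesIO): fall back to the stream's bytes,
    # restoring the read position afterwards so other readers are unaffected.
    position = stream.tell()
    stream.seek(0)
    try:
        return {"blob": stream.read()}
    finally:
        stream.seek(position)


# Usage with an in-memory stream (placeholder bytes, not a valid PDF):
kwargs = image_source_kwargs(io.BytesIO(b"%PDF-1.4 demo"))
```

The isinstance check is stricter than a bare hasattr, since a stream opened from a file descriptor has an integer name that cannot be used as a path.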

ubmarco · 2020-04-28T10:17:41Z

Hi @jsvine, yes, that would be an option. We're working on a fork anyway because we needed some further adaptations, so I just reverted the PR on our side. No need to rush. However, this problem might affect others, and the CPU/memory leak does not obviously correlate with reading the PDF as bytes. It's definitely worth investigating further.
I actually don't know if this is a platform-specific problem or related to the ImageMagick version. I'm on an up-to-date 64-bit Manjaro with imagemagick 7.0.10.8-1.
Can the issue be confirmed by others?

jsvine · 2020-04-29T13:36:37Z

Thanks again, @ubmarco. I checked and noticed I was seeing the same memory problems. (I hadn't noticed it in the test suite because the tested PDFs are intentionally small/short.) In v0.5.20, just now released, .to_image uses the filename if possible. If only bytes are available, though, it can still handle that.

Still, the CPU/memory leak when using bytes is not ideal. If anyone has suggestions on how to resolve that in pdfplumber (or whether it requires changes to ImageMagick or Wand), I'd be very interested to hear them. Thanks in advance.
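One workaround a caller could try in the meantime is to spool in-memory bytes to a temporary file, so that a path-based reader can be used. This is only a sketch: as_named_file is a hypothetical helper, and nothing in this thread confirms that it avoids the slowdown.

```python
import io
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def as_named_file(stream, suffix=".pdf"):
    """Yield a filesystem path for `stream`, writing a temporary copy
    when the stream is not backed by a named file."""
    name = getattr(stream, "name", None)
    if isinstance(name, str) and os.path.exists(name):
        yield name  # already on disk: nothing to copy or clean up
        return

    position = stream.tell()
    stream.seek(0)
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(stream.read())
        stream.seek(position)  # leave the caller's read position untouched
        yield path
    finally:
        os.remove(path)  # the temporary copy is deleted on exit


# Usage (placeholder bytes, not a valid PDF):
with as_named_file(io.BytesIO(b"%PDF-1.4 demo")) as pdf_path:
    size_on_disk = os.path.getsize(pdf_path)
```

The trade-off is an extra copy on disk for the lifetime of the with block, in exchange for being able to pass a path instead of a blob.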

ubmarco · 2020-04-30T07:04:22Z

Thanks for that quick fix @jsvine, I will try it out. Defaulting to Wand/ImageMagick file reading is a good option for me. And generally thanks for being so responsive.

jsvine · 2020-04-30T13:25:11Z

Thank you for the very clear bug reports!

jsvine · 2022-07-20T19:55:44Z

Closing this issue on the realization that the core bug here has been fixed and that the CPU / memory leak issue is being tracked in #193

cheungpat added a commit to cheungpat/pdfplumber that referenced this issue Feb 5, 2020

Fix jsvine#124 opening PDF with bytes stream

9d67aa9

cheungpat mentioned this issue Feb 5, 2020

Fix #124 opening PDF with bytes stream #179

Merged

jsvine closed this as completed in 8d6e52a Mar 30, 2020

jsvine added a commit that referenced this issue Mar 30, 2020

Merge pull request #179 from cheungpat/read-bytes-stream

bfeb454

Fix #124 opening PDF with bytes stream

ubmarco added a commit to useblocks/pdfplumber that referenced this issue Apr 27, 2020

Reverted PR jsvine#179 to use streams for page images

65789ee

The PR jsvine#179 leads to a CPU load and memory leak. The problem is documented here jsvine#124

jsvine reopened this Apr 27, 2020

jsvine added a commit that referenced this issue Apr 29, 2020

Fix .get_page_image to prefer paths over streams

ab957de

See discussion in #124 for details. h/t @ubmarco

jsvine added the help wanted label Apr 29, 2020

jsvine removed the help wanted label Jul 20, 2022

jsvine closed this as completed Jul 20, 2022
