'_io.BytesIO' object has no attribute 'name' #124

Closed
OisinMoran opened this issue Jun 27, 2019 · 9 comments

Comments

@OisinMoran
Contributor

The to_image() method does not seem to work if the pdfplumber.PDF object was created using a BytesIO stream. The rest of the functionality seems unaffected.

The problem seems to arise in the call to wand.image.Image() in the get_page_image() function in display.py. wand.image.Image() can take file objects through its file argument (explained here), but get_page_image() only ever uses the filename parameter. Line 42 of display.py, in PageImage.__init__, also reads the name of the stream, but BytesIO objects do not have a name attribute. Extracting characters, rectangles, etc. still works with these BytesIO objects.

The MWE:

import pdfplumber
from io import BytesIO

file_path = "file.pdf"

# This example successfully extracts chars and makes an image
file_like_object = open(file_path, "rb") # _io.BufferedReader object
first_page = pdfplumber.load(file_like_object).pages[0]
chars = first_page.chars
im = first_page.to_image()

# This example successfully extracts chars but does not make an image
file_like_object.seek(0)
different_file_like_object = BytesIO(file_like_object.read()) # _io.BytesIO object
first_page_2 = pdfplumber.load(different_file_like_object).pages[0]
chars_2 = first_page_2.chars
im = first_page_2.to_image()

Gives the error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-3f3cda6c1277> in <module>
      5 first_page_2 = pdfplumber.load(different_file_like_object).pages[0]
      6 chars_2 = first_page_2.chars
----> 7 im = first_page_2.to_image()

~/web_app/lib/python3.6/site-packages/pdfplumber/page.py in to_image(self, **conversion_kwargs)
    256         if "resolution" not in conversion_kwargs:
    257             kwargs["resolution"] = DEFAULT_RESOLUTION
--> 258         return PageImage(self, **kwargs)
    259 
    260 class DerivedPage(Page):

~/web_app/lib/python3.6/site-packages/pdfplumber/display.py in __init__(self, page, original, resolution)
     40         if original == None:
     41             self.original = get_page_image(
---> 42                 page.pdf.stream.name,
     43                 page.page_number - 1,
     44                 resolution

AttributeError: '_io.BytesIO' object has no attribute 'name'

Not sure how best to fix this issue.
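
One possible workaround until the library handles this, given that the failure comes only from the missing name attribute: copy the in-memory stream to a named temporary file and load the PDF from there. This is a minimal sketch using only the standard library; spool_to_named_file is a hypothetical helper, not part of pdfplumber, and the bytes below are a stand-in for real PDF content.

```python
import io
import os
import tempfile


def spool_to_named_file(stream, suffix=".pdf"):
    """Copy a file-like object to a named temporary file and return its path.

    Hypothetical helper, not part of pdfplumber. The caller is responsible
    for deleting the file when done.
    """
    position = stream.tell()  # remember where the caller left the stream
    stream.seek(0)
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(stream.read())
        path = tmp.name
    stream.seek(position)  # restore the original position
    return path


# Stand-in bytes; in practice this would be the PDF held in memory.
in_memory = io.BytesIO(b"%PDF-1.4 placeholder bytes")
path = spool_to_named_file(in_memory)
try:
    # A file opened from a path has a .name attribute, unlike BytesIO.
    with open(path, "rb") as named_file:
        has_name = hasattr(named_file, "name")
finally:
    os.remove(path)
```

With a real PDF, the file object opened from path could be passed to pdfplumber.load() as in the first example above, which is the case that already works. The cost is one extra copy of the file on disk.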

cheungpat added a commit to cheungpat/pdfplumber that referenced this issue Feb 5, 2020
@jsvine jsvine closed this as completed in 8d6e52a Mar 30, 2020
jsvine added a commit that referenced this issue Mar 30, 2020
@ubmarco

ubmarco commented Apr 27, 2020

Hi, I just want to report that PR #179 breaks the to_image function for me.
Ghostscript runs for dozens of seconds on a page crop, and once it returns, Python eats up all 16 GB of memory on my Linux machine until it becomes unresponsive. I'm still debugging why this happens, but I wanted to report it early.

@ubmarco

ubmarco commented Apr 27, 2020

So I pinpointed the issue. It is related to the resolution and the number of pages.
I modified the test function to this:

import io
import pdfplumber

TEST_PDF = 'WARN-Report-for-7-1-2015-to-03-25-2016_times3.pdf'


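def bytes_stream_to_image():
    # Read the whole PDF into memory so pdfplumber only sees a BytesIO stream
    with open(TEST_PDF, 'rb') as f:
        stream = io.BytesIO(f.read())
    page = pdfplumber.PDF(stream).pages[0]
    im = page.to_image(resolution=300)
    im.save('out.png', 'png')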


bytes_stream_to_image()

The TEST_PDF is the one from pdfplumber 'WARN-Report-for-7-1-2015-to-03-25-2016.pdf'. I extended it by using PDFSAM and merging it 3 times with itself, so instead of 16 pages it now has 48 pages. Here is the file:
WARN-Report-for-7-1-2015-to-03-25-2016_times3.pdf

This lets GS run for approximately a minute at full load on one core; after it returns, Python starts eating all my memory while showing huge CPU usage.

This doesn't happen with the old implementation; there, the test function above exits after approximately 5 seconds on the same machine.

@ubmarco

ubmarco commented Apr 27, 2020

The difference arises in lib/python3.7/site-packages/wand/image.py, in the function read() on line 8716. The former implementation called MagickReadImage, whereas the new bytes-based implementation calls MagickReadImageBlob:

        if blob is not None:
            if not isinstance(blob, abc.Iterable):
                raise TypeError('blob must be iterable, not ' +
                                repr(blob))
            if not isinstance(blob, binary_type):
                blob = b''.join(blob)
            r = library.MagickReadImageBlob(self.wand, blob, len(blob))
        elif filename is not None:
            filename = encode_filename(filename)
            r = library.MagickReadImage(self.wand, filename)

MagickReadImage instantiates the reader within the C extension, while for MagickReadImageBlob the binary data comes from the Python interface. I don't know how to debug this any further. Any advice?
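
For comparing the two code paths, a small measurement harness may help put numbers on the difference. The following is a rough sketch using only the standard library (the resource module is Unix-only); measure is a hypothetical helper, and the workload shown is a stand-in for a function that calls page.to_image().

```python
import resource
import time


def measure(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds, peak_rss).

    peak_rss is ru_maxrss for the current process: kilobytes on Linux,
    bytes on macOS. It is a high-water mark, so it never decreases
    between calls within the same process.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, elapsed, peak_rss


# Stand-in workload; replace with the function under test.
result, elapsed, peak_rss = measure(sum, range(1_000_000))
print(f"elapsed={elapsed:.3f}s peak_rss={peak_rss}")
```

Because the peak is per process, each variant (filename versus bytes) is best run in a fresh interpreter. If Ghostscript runs as a child process, its usage would presumably show up under resource.RUSAGE_CHILDREN rather than RUSAGE_SELF.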

ubmarco added a commit to useblocks/pdfplumber that referenced this issue Apr 27, 2020
The PR jsvine#179
leads to a CPU load and memory leak.
The problem is documented here
jsvine#124
@jsvine
Owner

jsvine commented Apr 27, 2020

Thank you for flagging this, @ubmarco! I'm not terribly familiar with Wand's internals, so I may have to do some additional research. In the meantime, what do you think of this short-term solution?

  • Revert .to_image to its previous state, but check hasattr(page.pdf.stream, "name")
  • If hasattr(page.pdf.stream, "name") is False (because the user is working directly with a stream, rather than a saved file), then process the stream bytes instead
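
The two steps above could be sketched roughly as follows. This is an illustration of the proposal only, not pdfplumber's actual code; choose_image_source is a hypothetical helper. It uses getattr plus a string check rather than a bare hasattr, on the assumption that a name which is not a path (for example, an integer file descriptor) should also fall back to bytes.

```python
import io


def choose_image_source(stream):
    """Return ("filename", path) when the stream is backed by a named file,
    otherwise ("blob", data) holding the stream's full contents.

    Hypothetical helper illustrating the proposal; not pdfplumber's code.
    """
    name = getattr(stream, "name", None)
    if isinstance(name, str):
        # Saved file: let ImageMagick read it from disk by name.
        return "filename", name
    # In-memory stream: read all bytes, then restore the caller's position.
    position = stream.tell()
    stream.seek(0)
    data = stream.read()
    stream.seek(position)
    return "blob", data


kind, value = choose_image_source(io.BytesIO(b"in-memory bytes"))
```

The returned pair maps onto the filename and blob keyword arguments of wand.image.Image(), so the fast file-based path stays the default and bytes are used only when no filename is available.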

@jsvine jsvine reopened this Apr 27, 2020
@ubmarco

ubmarco commented Apr 28, 2020

Hi @jsvine, yes, that would be an option. We're working on a fork anyway because we needed some further adaptations, so I just reverted the PR on our side. No need to rush. However, this problem might affect others, and the CPU/memory leak does not obviously correlate with reading the PDF as bytes. It's definitely worth investigating further.
I actually don't know whether this is a platform-specific problem or related to the ImageMagick version. I'm on an up-to-date 64-bit Manjaro with imagemagick 7.0.10.8-1.
Can the issue be confirmed by others?

jsvine added a commit that referenced this issue Apr 29, 2020
@jsvine
Owner

jsvine commented Apr 29, 2020

Thanks again, @ubmarco. I checked and noticed I was seeing the same memory problems. (I hadn't noticed it in the test suite because the tested PDFs are intentionally small/short.) In v0.5.20, just now released, .to_image uses the filename if possible. If only bytes are available, though, it can still handle that.

Still, the CPU/memory leak when using bytes is not ideal. If anyone has suggestions on how to resolve that in pdfplumber (or whether it requires changes to ImageMagick or Wand), I'd be very interested to hear them. Thanks in advance.

@ubmarco

ubmarco commented Apr 30, 2020

Thanks for that quick fix @jsvine, I will try it out. Defaulting to Wand/ImageMagick file reading is a good option for me. And generally thanks for being so responsive.

@jsvine
Owner

jsvine commented Apr 30, 2020

Thank you for the very clear bug reports!

@jsvine
Owner

jsvine commented Jul 20, 2022

Closing this issue on the realization that the core bug here has been fixed and that the CPU / memory leak issue is being tracked in #193.

@jsvine jsvine closed this as completed Jul 20, 2022