extracting text from a two columns page #244

Closed
fdq09eca opened this issue Aug 7, 2020 · 16 comments

Comments

@fdq09eca

fdq09eca commented Aug 7, 2020

I am extracting the text of the following page:
image

I used the following code

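import requests, pdfplumber
from io import BytesIO

url = 'https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0428/2020042801961.pdf'
rq = requests.get(url)
# pdfplumber.open accepts a file-like object; older releases exposed this as pdfplumber.load
pdf = pdfplumber.open(BytesIO(rq.content))
page = pdf.pages[62]  # zero-based index, i.e. page 63 of the PDF
txt = page.extract_text()
print(txt)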

It produces this:
image

I want to turn the two-column ending section (see below) into two consecutive blocks, left column first and then right column:
image

so that the output is

Mazars CPA Limited
Certified Public Accountants
Hong Kong, 22 April 2020
The engagement director of Mazars CPA Limited on the
audit resulting in this independent joint auditors report
is:
She Shing Pang Yan Tat Wah, Joseph
Practising Certificate number: P05510
LKY China
Certified Public Accountants
Hong Kong, 22 April 2020
The engagement partner of LKY China on the audit 
resulting in this independent joint auditors report is:

I'm not sure whether this is possible. It would be great if there were a function that returns a boolean indicating whether the ending section is laid out in two columns. Any suggestions will be appreciated!

@samkit-jain
Collaborator

Hi @fdq09eca, thanks for showing interest in the library. This isn't possible out of the box, because text extraction goes top to bottom and left to right across the whole page. To get the text in the order you want, you would have to come up with your own logic. One possible workaround (assuming that the 2-column section is detected as a table) is to extract the table and then read its text column by column.
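
The table workaround could be sketched roughly as follows. This is only an illustration, assuming the 2-column section really is detected as a table and that `page.extract_table()` returns it as a list of rows; `read_column_by_column` is a hypothetical helper name, not part of pdfplumber.

```python
from typing import List, Optional

Row = List[Optional[str]]


def read_column_by_column(table: List[Row]) -> List[str]:
    """Return the non-empty cell texts of `table`, one column at a time.

    `table` is a list of rows, each row a list of cell strings (or None),
    which is the shape pdfplumber's extract_table() produces.
    """
    if not table:
        return []
    n_cols = max(len(row) for row in table)
    lines = []
    for col in range(n_cols):
        for row in table:
            # Rows may be ragged, and empty cells may be None.
            cell = row[col] if col < len(row) else None
            if cell and cell.strip():
                lines.append(cell.strip())
    return lines


# Hypothetical usage with a pdfplumber page:
#   table = page.extract_table()  # None when no table is detected
#   print("\n".join(read_column_by_column(table or [])))
```

Whether a table is detected at all depends on the PDF; if there are no ruling lines, the table settings may need tuning or this approach may not apply.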

@samkit-jain
Collaborator

samkit-jain commented Aug 7, 2020

Another possible solution, specific to page 63, could be:

  1. Crop the page into 2 equal halves vertically, removing the top 40% and the bottom 10%, on the assumption that the region of interest is only that 2-column section.
    left = page.crop((0, 0.4 * float(page.height), 0.5 * float(page.width), 0.9 * float(page.height)))
    right = page.crop((0.5 * float(page.width), 0.4 * float(page.height), page.width, 0.9 * float(page.height)))
  2. Extract text from the left half.
    left.extract_text()
  3. Extract text from the right half.
    right.extract_text()
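
The three steps above can be wrapped into one small helper. This is a sketch, not library code: `extract_two_columns` and its fraction parameters are names made up for illustration, and the defaults simply mirror the 40% / 90% / 50% values used for this particular page.

```python
def extract_two_columns(page, top_frac=0.4, bottom_frac=0.9, split_frac=0.5):
    """Return the left-column text followed by the right-column text.

    `page` is expected to behave like a pdfplumber page (width, height,
    crop). The fractions describe the region of interest and are
    assumptions that must be tuned per document.
    """
    width, height = float(page.width), float(page.height)
    top, bottom = top_frac * height, bottom_frac * height
    split = split_frac * width

    # Bounding boxes are (x0, top, x1, bottom), as in the steps above.
    left = page.crop((0, top, split, bottom))
    right = page.crop((split, top, width, bottom))

    # extract_text() can return None for an empty region in some versions.
    parts = [(half.extract_text() or "").strip() for half in (left, right)]
    return "\n".join(part for part in parts if part)


# Hypothetical usage:
#   print(extract_two_columns(pdf.pages[62]))
```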

@fdq09eca
Author

fdq09eca commented Aug 7, 2020

@samkit-jain excellent work-around!

@samkit-jain
Collaborator

Glad it helped you out. Thanks to you as well, @fdq09eca, since I found a bug (#245) while working on this.

@cmicek1

cmicek1 commented Apr 2, 2021

Is there no general solution to this problem though? I have a dataset of 100 or so scientific papers I'm trying to parse, and having text extraction just smush the columns together is a bit problematic when I'm trying to see certain parts in context.

@samkit-jain
Collaborator

@cmicek1 Could you please share a sample PDF that you are dealing with and elaborate on the problem you are facing, along with reproducible code and the result you are getting?

@jsvine
Owner

jsvine commented Apr 8, 2021

Is there no general solution to this problem though?

This StackOverflow page provides some interesting insight into this very question: https://stackoverflow.com/questions/22675690/if-identifying-text-structure-in-pdf-documents-is-so-difficult-how-do-pdf-reade

@danielbellhv

danielbellhv commented Nov 30, 2021

  1. Crop the page into 2 equal halves vertically, removing the top 40% and the bottom 10%, on the assumption that the region of interest is only that 2-column section.

How might I do this conditionally, that is, only when a page actually has column text? @samkit-jain

@samkit-jain
Collaborator

@danielbellhv It would depend on the PDFs you are dealing with. A sophisticated solution might be to use a layout analysis algorithm to identify whether a page is multi column or not.
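
As a very rough illustration of what such layout analysis might look like (a simple projection heuristic swapped in for a real algorithm), one could look for the widest empty vertical band between word boxes. The sketch assumes word dicts with `text`, `x0`, `x1` and `top` keys, as returned by `page.extract_words()`; the function names and the `min_gap` / `line_tol` thresholds are hypothetical.

```python
def find_column_split(words, min_gap=10.0):
    """Return the x position of the widest empty vertical band, or None."""
    spans = sorted((float(w["x0"]), float(w["x1"])) for w in words)
    if not spans:
        return None
    best_gap, best_x = 0.0, None
    covered_to = spans[0][1]
    for x0, x1 in spans[1:]:
        gap = x0 - covered_to
        if gap > best_gap:
            best_gap, best_x = gap, covered_to + gap / 2
        covered_to = max(covered_to, x1)
    return best_x if best_gap >= min_gap else None


def words_to_text(words, line_tol=3.0):
    """Group words into lines by their `top` value and join them as text."""
    lines, current, current_top = [], [], None
    for w in sorted(words, key=lambda w: float(w["top"])):
        top = float(w["top"])
        if current and abs(top - current_top) > line_tol:
            lines.append(current)
            current = []
        if not current:
            current_top = top
        current.append(w)
    if current:
        lines.append(current)
    return "\n".join(
        " ".join(w["text"] for w in sorted(line, key=lambda w: float(w["x0"])))
        for line in lines
    )


def read_in_column_order(words):
    """Read the left column first, then the right one, if a split is found."""
    split = find_column_split(words)
    if split is None:
        return words_to_text(words)
    left = [w for w in words if (float(w["x0"]) + float(w["x1"])) / 2 < split]
    right = [w for w in words if (float(w["x0"]) + float(w["x1"])) / 2 >= split]
    return words_to_text(left) + "\n" + words_to_text(right)


# Hypothetical usage: read_in_column_order(page.extract_words())
```

A full-width heading or footer bridges the gap and hides it, so this only works on a region that is two-column throughout; cropping to that region first is probably necessary.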

A simpler solution could be to crop the page down to its middle 5% (a narrow vertical strip) and run text extraction on it to see whether there's any text. If there is no text, the page could be in 2 columns. Of course, you will have to tweak the 5% and see what best fits your needs. This also assumes that the full page is in a 2-column layout.
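
A minimal sketch of that gutter check, returning a boolean. `looks_two_column` and `gutter_frac` are illustrative names rather than pdfplumber API, and the blank-page guard is an extra assumption not mentioned above.

```python
def looks_two_column(page, gutter_frac=0.05):
    """Guess whether `page` is laid out in two columns.

    Crops a vertical strip of width `gutter_frac * page.width` centred on
    the page and reports True when that strip contains no text.
    """
    width, height = float(page.width), float(page.height)
    half_strip = gutter_frac * width / 2
    centre = width / 2

    strip = page.crop((centre - half_strip, 0, centre + half_strip, height))
    # extract_text() can return None for an empty region in some versions.
    gutter_text = (strip.extract_text() or "").strip()

    # A blank page also has an empty gutter, so require some text overall.
    page_text = (page.extract_text() or "").strip()
    return bool(page_text) and not gutter_text


# Hypothetical usage:
#   if looks_two_column(page):
#       ...crop into halves and extract each...
#   else:
#       text = page.extract_text()
```

Centred titles, page numbers or full-width figures captions will land in the strip and make the check return False, so restricting it to the body region of the page may work better.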

@danielbellhv

danielbellhv commented Nov 30, 2021

Would you mind copying and pasting that as a comment under my SO post, please? I will try middle 5% :)

@danielbellhv

danielbellhv commented Dec 1, 2021

I have since solved my own sub-issue, using pdfminer. Answered here. Thanks for your input

@ameymn

ameymn commented Nov 3, 2022

You can use PyPDF2 instead of pdfplumber. It reads the left side of the PDF first and then the right side, just like humans do, so if you use PyPDF2 there is no need to split the page.

@abubelinha

You can use PyPDF2 instead of pdfplumber. It reads the left side of the PDF first and then the right side, just like humans do, so if you use PyPDF2 there is no need to split the page.

@ameymn can you post a link to example code for that specific PyPDF2 usage?
Thanks
@abubelinha

@ameymn

ameymn commented Jun 26, 2023

This is the code I used:
Screenshot 2023-06-26 154357

You can read this documentation if you need any help

https://pypdf2.readthedocs.io/en/3.0.0/index.html

@abubelinha

@eshwarbc30

Is there no general solution to this problem though? I have a dataset of 100 or so scientific papers I'm trying to parse, and having text extraction just smush the columns together is a bit problematic when I'm trying to see certain parts in context.

Did you find a solution for this? I need to extract text from a PDF in a 2-column format.

@eshwarbc30

Is there no general solution to this problem though? I have a dataset of 100 or so scientific papers I'm trying to parse, and having text extraction just smush the columns together is a bit problematic when I'm trying to see certain parts in context.

Hi @cmicek1, did you find a solution for this?
