extracting text from a two columns page #244

fdq09eca · 2020-08-07T14:48:43Z

I extract the text of the following page:

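I used the following code:

```
import requests, pdfplumber
from io import BytesIO

url = 'https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0428/2020042801961.pdf'
rq = requests.get(url)
# pdfplumber.load() was removed in later releases; pdfplumber.open() also accepts a file-like object
pdf = pdfplumber.open(BytesIO(rq.content))
page = pdf.pages[62]  # zero-based index, i.e. page 63 of the document
txt = page.extract_text()
print(txt)
```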

It produces this:

I want to turn the two-column ending section (see below) into two rows,

such that it is:

Mazars CPA Limited
Certified Public Accountants
Hong Kong, 22 April 2020
The engagement director of Mazars CPA Limited on the
audit resulting in this independent joint auditors report
is:
She Shing Pang Yan Tat Wah, Joseph
Practising Certificate number: P05510
LKY China
Certified Public Accountants
Hong Kong, 22 April 2020
The engagement partner of LKY China on the audit 
resulting in this independent joint auditors report is:

I am not sure whether this is possible. It would also be great if there were a function that returns a boolean indicating whether the ending is a two-column section. Any suggestions will be appreciated!

samkit-jain · 2020-08-07T15:40:42Z

Hi @fdq09eca, thanks for showing interest in the library. This is not possible out of the box, since text extraction goes top to bottom and left to right. To get the text in the order you want, you would have to come up with your own logic. One possible workaround (assuming that the two-column section is a table) could be to extract the table and then read the text column by column.
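
A minimal sketch of the "read the table column by column" idea follows. It assumes the rows come from `page.extract_table()`, which returns a list of rows (each a list of cell strings or `None`) or `None` when no table is found. The helper name `table_text_by_column` is hypothetical and not part of pdfplumber, and whether the two-column section is detected as a table at all depends on the PDF.

```python
from itertools import zip_longest


def table_text_by_column(rows):
    """Join table cells column by column instead of row by row.

    `rows` is a list of rows, each a list of cell strings (or None),
    e.g. the result of pdfplumber's page.extract_table().
    Hypothetical helper for illustration only.
    """
    lines = []
    # zip_longest transposes rows into columns and pads short rows with None
    for column in zip_longest(*rows):
        # Skip empty cells (None or "")
        lines.extend(cell for cell in column if cell)
    return "\n".join(lines)
```

With a real page this might be called as `table_text_by_column(page.extract_table() or [])`.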

samkit-jain · 2020-08-07T15:54:22Z

Another possible solution, specific to page 63, could be:

Crop the page vertically into two equal halves, removing the top 40% and the bottom 10%, assuming the region of interest is only that two-column section.

```
left = page.crop((0, 0.4 * float(page.height), 0.5 * float(page.width), 0.9 * float(page.height)))
right = page.crop((0.5 * float(page.width), 0.4 * float(page.height), page.width, 0.9 * float(page.height)))
```

Extract text from the left half.
```
left.extract_text()
```
Extract text from the right half.
```
right.extract_text()
```
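
The cropping steps above can be wrapped into a small helper, sketched here under a few assumptions: the page's bounding box starts at (0, 0), the column boundary sits at the horizontal midpoint, and the 40%/90% limits suit the page in question. The names `column_bboxes` and `extract_two_columns` are hypothetical; only `page.width`, `page.height`, `page.crop()` and `extract_text()` come from pdfplumber.

```python
def column_bboxes(width, height, top=0.4, bottom=0.9, split=0.5):
    """Return (left_bbox, right_bbox), each as (x0, top, x1, bottom).

    `top`, `bottom` and `split` are fractions of the page size.
    """
    width, height = float(width), float(height)
    y0, y1 = top * height, bottom * height
    x_mid = split * width
    return (0.0, y0, x_mid, y1), (x_mid, y0, width, y1)


def extract_two_columns(page, **kwargs):
    """Extract the left column's text, then the right column's text."""
    parts = []
    for bbox in column_bboxes(page.width, page.height, **kwargs):
        text = page.crop(bbox).extract_text()
        if text:  # extract_text() may return None or "" for an empty region
            parts.append(text)
    return "\n".join(parts)
```

For the page in this issue, usage would look like `print(extract_two_columns(pdf.pages[62]))`; the fractions will likely need tuning for other documents.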

fdq09eca · 2020-08-07T16:36:43Z

@samkit-jain excellent work-around!

samkit-jain · 2020-08-07T16:41:53Z

Glad it helped you out. Thanks to you as well, @fdq09eca: I found a bug (#245) while working on this.

cmicek1 · 2021-04-02T03:20:17Z

Is there no general solution to this problem though? I have a dataset of 100 or so scientific papers I'm trying to parse, and having text extraction just smush the columns together is a bit problematic when I'm trying to see certain parts in context.

samkit-jain · 2021-04-06T16:29:28Z

@cmicek1 Could you please share a sample PDF that you are dealing with and elaborate on the problem you are facing, along with reproducible code and the result you are getting?

jsvine · 2021-04-08T03:34:33Z

> Is there no general solution to this problem though?

This StackOverflow page provides some interesting insight into this very question: https://stackoverflow.com/questions/22675690/if-identifying-text-structure-in-pdf-documents-is-so-difficult-how-do-pdf-reade

danielbellhv · 2021-11-30T09:18:59Z

> Crop the page into 2 equal halves vertically and removing the top 40% portion and bottom 10% assuming the interested region is only that 2-column section.

How might I do this conditionally, only when a page actually has column text, @samkit-jain?

samkit-jain · 2021-11-30T14:48:05Z

@danielbellhv It would depend on the PDFs you are dealing with. A sophisticated solution might be to use a layout analysis algorithm to identify whether a page is multi column or not.

A simpler solution could be to crop the page keeping the middle 5% and run text extraction on it to see if there's any text or not. If no text, then there could be 2 columns. Of course, you will have to tweak the 5% and see what best fits your need. This also assumes that the full page is in a 2 column layout.
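
The "middle 5%" check can be sketched as follows. This is only a heuristic under stated assumptions: the gutter is centered on the page, the whole page is laid out in two columns, and `page.crop()` keeps characters that overlap the strip. The function name `has_empty_gutter` is hypothetical and not part of pdfplumber.

```python
def has_empty_gutter(page, gutter=0.05):
    """Return True if a centered vertical strip of the page contains no text.

    `gutter` is the strip width as a fraction of the page width (5% by default).
    An empty strip suggests, but does not prove, a two-column layout.
    """
    width, height = float(page.width), float(page.height)
    half = gutter * width / 2
    x_mid = width / 2
    # Full-height strip around the horizontal midpoint of the page
    strip = page.crop((x_mid - half, 0, x_mid + half, height))
    text = strip.extract_text()
    return not (text and text.strip())
```

A full-width title or a centered page number crossing the strip makes this return False, so in practice the strip may need to be limited to the body of the page.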

danielbellhv · 2021-11-30T14:50:24Z

Would you mind copying and pasting that as a comment under my SO post, please? I will try middle 5% :)

danielbellhv · 2021-12-01T16:04:08Z

I have since solved my own sub-issue, using pdfminer. Answered here. Thanks for your input

ameymn · 2022-11-03T18:06:41Z

You can use PyPDF2 instead of pdfplumber. It reads the PDF's left side first and then the right side, just like humans do, so if you use PyPDF2 there is no need to split the page.

abubelinha · 2023-06-26T08:26:56Z

> You can use PyPDF2 instead of pdfplumber there it reads the pdf left side first and then the right side just like humans do. Hence if you use PyPDF2 no need splitting the page

@ameymn can you post a link to an example of that specific PyPDF2 usage?
Thanks
@abubelinha

ameymn · 2023-06-26T10:20:08Z

This is my code which I used

You can read this documentation if you need any help

https://pypdf2.readthedocs.io/en/3.0.0/index.html

@abubelinha

eshwarbc30 · 2024-01-09T05:51:36Z

> Is there no general solution to this problem though? I have a dataset of 100 or so scientific papers I'm trying to parse, and having text extraction just smush the columns together is a bit problematic when I'm trying to see certain parts in context.

Did you find a solution for this? I need to extract text from a PDF with a two-column format.

eshwarbc30 · 2024-01-09T05:53:11Z

> Is there no general solution to this problem though? I have a dataset of 100 or so scientific papers I'm trying to parse, and having text extraction just smush the columns together is a bit problematic when I'm trying to see certain parts in context.

Hi @cmicek1, did you find a solution for this?

samkit-jain closed this as completed Aug 7, 2020

samkit-jain mentioned this issue Aug 7, 2020

Inconsistent results when cropping an already cropped page #245

Closed

roger-mahler mentioned this issue Jan 26, 2021

Inventory of candidate tools to extract text from multicolumn page (Courier) inidun/unesco_data_collection#5

Closed
