word-level font names and heights #28

Closed
jsfenfen opened this issue Mar 8, 2017 · 21 comments

@jsfenfen
Contributor

jsfenfen commented Mar 8, 2017

Having a font for an entire word helps parsing. A lot. Height also helps some.

I took a crack at this here, with some settings. Defaults also may need adjustment.

If you've got thoughts, @jsvine, lemme know and I can clean this up into a pr. Haven't gotten the testing set up yet.

jsfenfen@847a3bb

@jsfenfen
Contributor Author

jsfenfen commented Mar 8, 2017

I guess with word heights I'm going back and forth on averaging them or taking the mode; left the latter in for the moment.
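
To make the averaging-versus-mode trade-off concrete, here is a minimal sketch on made-up character heights. The function names and the sample data are hypothetical and are not taken from the linked commit; the point is only that a single oversized glyph shifts the mean but not the mode.

```python
from collections import Counter
from statistics import mean


def word_height_mean(heights):
    """Average of the character heights in one word."""
    return mean(heights)


def word_height_mode(heights, ndigits=2):
    """Most common character height, rounded first to absorb float noise."""
    rounded = [round(h, ndigits) for h in heights]
    return Counter(rounded).most_common(1)[0][0]


# Hypothetical heights for the characters of one word, with one outlier glyph.
heights = [12.0, 12.0, 12.0, 12.0, 18.0]
mean_height = word_height_mean(heights)  # pulled upward by the outlier
mode_height = word_height_mode(heights)  # unaffected by the outlier
```

The mode is the more robust choice when one stray character has an odd height, while the mean reacts to every character in the word.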

@jsvine
Owner

jsvine commented Mar 8, 2017

Thanks! I like this. For testing's sake: Do you have shareable examples of PDFs where chars that should belong to the same word either have different heights or fontnames?

@jsfenfen
Contributor Author

So I still haven't heard back about the files that originally required this. I could pretty easily just make up a sample pdf that fails the font height test, though obviously having a real example would be better... The other time this can come up is when the word tolerance is set too high and words run together inadvertently, though only if adjacent cells have different fonts. Will look around a bit.

@jsvine
Owner

jsvine commented Mar 15, 2017

No worries. Thinking through this a bit. I'm tempted to, by default, group words by fonts, size, and color. (Yes, upcoming versions of pdfplumber will include font color!) Boolean params could turn them off. I.e., defaults would be:

def extract_words(chars,
  x_tolerance=DEFAULT_X_TOLERANCE,
  y_tolerance=DEFAULT_Y_TOLERANCE,
  keep_blank_chars=False,
  match_fontsize=True,
  match_fontcolor=True,
  match_fontname=True
):
    ...  # body omitted; only the proposed defaults matter here

That'd mean losing some of the flexibility of, e.g., DEFAULT_FONT_HEIGHT_TOLERANCE, but might make the options clearer. It'd also mean avoiding having to calculate the average/mode values for tolerance-ed attributes. For instance, this ...

page.extract_words()

... might return ...

[ {
  "text": "Hello",
  "fontsize": 12,
  "fontname": "ArialBold",
  "fontcolor": "#000000"
} ]

... while ...

page.extract_words(match_fontsize=False)

... would return ...

[ {
  "text": "Hello",
  "fontname": "ArialBold",
  "fontcolor": "#000000"
} ]

What do you think? Too inflexible?
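
A rough sketch of how the proposed boolean flags could behave, assuming single-line input and character dicts with `text`, `x0`, `x1`, `size`, and `fontname` keys. The function name `extract_words_sketch`, the default `x_tolerance=3`, and the sample data are all hypothetical; font color is left out for brevity. This is an illustration of the idea, not the library's implementation.

```python
def extract_words_sketch(chars, x_tolerance=3,
                         match_fontsize=True, match_fontname=True):
    """Group characters on one line into words (illustrative only).

    A new word starts at whitespace, at a horizontal gap larger than
    x_tolerance, or where a matched attribute changes between characters.
    """
    # Output key -> char key; "fontsize" follows the naming proposed above.
    attrs = {}
    if match_fontsize:
        attrs["fontsize"] = "size"
    if match_fontname:
        attrs["fontname"] = "fontname"

    words, current = [], []

    def flush():
        # Emit the characters collected so far as one word, if any.
        if current:
            word = {"text": "".join(c["text"] for c in current)}
            for out_key, char_key in attrs.items():
                word[out_key] = current[0][char_key]
            words.append(word)
            current.clear()

    for char in sorted(chars, key=lambda c: c["x0"]):
        if char["text"].isspace():
            flush()
            continue
        if current:
            prev = current[-1]
            gap_too_wide = char["x0"] - prev["x1"] > x_tolerance
            attr_changed = any(char[k] != prev[k] for k in attrs.values())
            if gap_too_wide or attr_changed:
                flush()
        current.append(char)
    flush()
    return words


# Hypothetical characters: "Hi" in bold followed directly by "!" in regular.
chars = [
    {"text": "H", "x0": 0, "x1": 5, "size": 12, "fontname": "ArialBold"},
    {"text": "i", "x0": 5, "x1": 8, "size": 12, "fontname": "ArialBold"},
    {"text": "!", "x0": 8, "x1": 11, "size": 12, "fontname": "Arial"},
]
strict = extract_words_sketch(chars)
loose = extract_words_sketch(chars, match_fontname=False)
```

With the defaults, the font change splits the run into two words; with `match_fontname=False` the characters stay together and the `fontname` key is dropped from the result, mirroring the behavior described for `match_fontsize=False`.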

@jsfenfen
Contributor Author

I think that's great!

Also, I think whatever adjustments might be needed will become more obvious the more pdfs we trawl through...

@jsfenfen
Contributor Author

I got a different sample of the docs with the font height thing! Going through them, uh, soonish.

@jsfenfen
Contributor Author

Ok, I have this working in the word_fonts branch here using made up pdfs as tests. Trying to dig up the sample observed in the wild.

Am doing this with a custom WordFontError subclassed from RuntimeError, but am open to suggestions...
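
For readers who have not seen the branch, a minimal sketch of what such an exception might look like. Only the class name and its `RuntimeError` base come from the comment above; the helper `word_fontname` and the way the error is raised are assumptions for illustration.

```python
class WordFontError(RuntimeError):
    """Raised when the characters of one word do not share a font."""


def word_fontname(word_chars):
    """Return the single font name shared by all characters of a word."""
    names = {c["fontname"] for c in word_chars}
    if len(names) != 1:
        raise WordFontError(
            f"expected one font per word, found {sorted(names)}"
        )
    return names.pop()


# Hypothetical word whose two characters disagree on the font.
mixed = [
    {"text": "a", "fontname": "Arial"},
    {"text": "b", "fontname": "ArialBold"},
]
try:
    word_fontname(mixed)
except WordFontError as err:
    message = str(err)
```

Subclassing `RuntimeError` lets callers catch the font mismatch specifically while broader `except RuntimeError` handlers still work.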

No idea if this will be at all helpful ahead of 0.60 rewrite, but...

@jsvine
Owner

jsvine commented Apr 28, 2017

Ooh, thanks! Will definitely aim to incorporate this (or something close to it) into the next big release.

@problemsniper

Is this in the current version? I am looking for font name and font size per word, not per letter.

@jsfenfen
Contributor Author

hey @krishnakt031990 I don't think so, though the version I did of it is still here: https://github.com/jsfenfen/pdfplumber/tree/master . I guess there's a minor release that's been added since, I will update when I've got a sec.
@jsvine it looks like the pr doesn't have squashed commits? this isn't a big change, though would be clearer if I could squash those. Hmm.

@problemsniper

Works perfectly! thanks @jsfenfen. Just have another question regarding the document. Did you try to reverse engineer to build a pdf out of the extracted properties of text? Just wanted some tips to create one if you did look into doing it.

@jsfenfen
Contributor Author

"Did you try to reverse engineer to build a pdf out of the extracted properties of text?"
No.... I'm not sure I get the use case--couldn't you just use the original pdf? But if you really want to create a pdf from objects of your choosing, maybe https://bitbucket.org/rptlab/reportlab ?

@jsfenfen
Contributor Author

jsfenfen commented Oct 3, 2017

@krishnakt031990 is this a pdf that's been OCR'ed? Fonts aren't very reliable in most of the OCR I've seen--could this have been set there? Also possible this is a pdfminer thing? Can you share a doc that does this?

@problemsniper

problemsniper commented Oct 23, 2017

For the font size: the reported point size is about 4-5 pts larger than the actual font size. I can give an example with an image here.

[image not preserved]

See that extra spacing on top of My?

@Saqhas

Saqhas commented Mar 19, 2020

@jsvine Has this issue been resolved and the functionality added?

@jsvine
Owner

jsvine commented Apr 1, 2020

This functionality has not yet been added. I'm certainly open to adding it, but haven't had the time quite yet.

@Saqhas

Saqhas commented Apr 1, 2020

I wanted this functionality in one of my projects. I have made some changes to the repo code to support it. Should I push them to a branch and create a pull request so that we can discuss and add it?

@jsvine
Owner

jsvine commented Apr 1, 2020

Thanks, @Saqhas! It's definitely worth a discussion and opening a pull request. I'm not certain I'll use your code, but it could definitely be helpful inspiration and I would certainly credit you for that.

@ibrahimshuail

Can we capture words based on font size? E.g., if the font size is 12, I need only the words set in that size.

@jsvine
Owner

jsvine commented Jul 24, 2020

@ibrahimshuail See my response to the separate issue you opened, #234

@jsvine
Owner

jsvine commented Jan 27, 2021

Closing this now-done issue. Per merged PR above, this feature was added last year! 🎉
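
As a hedged illustration of the font-size question raised earlier in the thread: once each word carries a `size` value, selecting words of a given size is a simple filter. The helper `words_with_size`, its `tolerance` default, and the sample data are hypothetical. The commented-out lines assume the `extra_attrs` parameter of `extract_words` in released pdfplumber versions and a placeholder file name; check the documentation for the version you have installed.

```python
def words_with_size(words, target, tolerance=0.5):
    """Keep words whose font size is within `tolerance` points of `target`."""
    return [w for w in words if abs(w["size"] - target) <= tolerance]


# With pdfplumber installed, `words` might come from something like:
#   import pdfplumber
#   with pdfplumber.open("example.pdf") as pdf:  # hypothetical file
#       words = pdf.pages[0].extract_words(extra_attrs=["fontname", "size"])
# Hand-written stand-in data so the sketch runs without a PDF:
words = [
    {"text": "Heading", "fontname": "ArialBold", "size": 18.0},
    {"text": "body", "fontname": "Arial", "size": 11.98},
    {"text": "text", "fontname": "Arial", "size": 12.0},
]
size_12 = [w["text"] for w in words_with_size(words, 12)]
```

A tolerance is used because extracted sizes are floats and are often slightly off from the nominal point size.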

jsvine closed this as completed Jan 27, 2021