The font name from an embedded font contains a strange prefix #349

Closed
2 tasks done
pietermarsman opened this issue Dec 30, 2019 · 1 comment · Fixed by #357

pietermarsman commented Dec 30, 2019

Describe the bug

This is a follow-up to #72. When converting test.pdf to HTML, the font is not recognized because its name is extracted incorrectly.

There are actually two issues here:

  • The font name has a strange prefix, e.g. "VZWISY+Georgia" instead of "Georgia".
  • The font-family CSS property has a bytes literal as its value, including the b'' prefix.
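
The second point is what Python 3 does when a bytes object is formatted into a str without decoding. A minimal sketch of the symptom and one possible fix follows; `css_font_family` is a hypothetical helper, not part of pdfminer, and the UTF-8 decoding is an assumption about how the name is encoded:

```python
def css_font_family(fontname):
    """Build a CSS font-family declaration from a font name.

    Hypothetical helper for illustration only. It accepts bytes or str
    and assumes that bytes can be decoded as UTF-8 (undecodable bytes
    are replaced rather than raising).
    """
    if isinstance(fontname, bytes):
        fontname = fontname.decode("utf-8", errors="replace")
    return "font-family: %s" % fontname


raw_name = b"VZWISY+Georgia"

# Formatting bytes directly inserts its repr, including the b'' prefix.
broken = "font-family: %s" % raw_name

# Decoding first yields plain text.
fixed = css_font_family(raw_name)

print(broken)
print(fixed)
```

Decoding fixes only the b'' wrapper. The "VZWISY+" prefix is still present afterwards and needs separate handling.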

To Reproduce

Convert test.pdf to HTML:

pdf2txt.py test.pdf -t html -o test.html

Observe the span element:

<span style="font-family: b'VZWISY+Georgia'; font-size:12px">The Portable Document Format (PDF) is the world’s leading language for describing 
<br>the printed page</span>

Expected behavior

Extract the font name correctly, or rename it in HTMLConverter so that the browser recognizes Georgia.

pietermarsman commented

I figured out what the strange prefix is. According to Section 5.5.3 of the PDF Reference:

For a font subset, the PostScript name of the font - the value of the font's BaseFont entry and the font descriptor's FontName entry - begins with a tag followed by a plus sign (+). The tag consists of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets in the same PDF file must have different tags. For example, EOODIA+Poetica is the name of a subset of Poetica, a Type 1 font.
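
Given that rule, the tag could be removed with a small regular expression before the name is written to CSS. This is only a sketch under the quoted definition (exactly six uppercase letters followed by a plus sign); `strip_subset_tag` is a hypothetical name, and this is not necessarily how #357 resolves the issue:

```python
import re

# Subset tag as described in the quoted passage: exactly six uppercase
# letters followed by "+", at the start of the name.
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def strip_subset_tag(fontname: str) -> str:
    """Return the font name without a leading subset tag, if one is present.

    Hypothetical helper for illustration; names that do not match the
    six-letter pattern are returned unchanged.
    """
    return SUBSET_TAG.sub("", fontname, count=1)


print(strip_subset_tag("VZWISY+Georgia"))
print(strip_subset_tag("EOODIA+Poetica"))
print(strip_subset_tag("Georgia"))
```

Anchoring the pattern at the start and requiring exactly six letters keeps names that merely contain a plus sign intact.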
