Switch to using fonttools #418

Closed
gmischler opened this issue May 7, 2022 · 13 comments

@gmischler
Collaborator

gmischler commented May 7, 2022

Problem
Currently fpdf2 uses its own ttfonts.py module to read and process the TrueType family of font files.
This obviously works for what we're doing so far, and with the most common types of such files.
But there are several problems with that approach.

  1. There's a significant variety of font file types out there that all sail under the general "TrueType" flag, but contain either different types of information or the same information stored in different ways. While we cover the most common cases, users can easily run into fonts that we are unable to support.
  2. There can be a lot of different information to be found in a font file and we currently only support a small subset of that. I've seen several feature requests just in the last few months where the implementation would require access to more detailed information from fonts than we currently have.
  3. Both the variety and the amount of information in font files are constantly expanding, as the standards are continuously enhanced.

Solution
Fortunately, other people have already dealt with those issues. There are libraries available that can be used to access the data in font files without having to worry about how it is actually stored and how that might change over time.
In the Python world, Fonttools seems to be the weapon of choice. According to its description, it is implemented in pure Python, and it seems to be under very active, continuous development.
Fonttools actually does a lot more than we need; what we would use is essentially just the fontTools.ttLib package.
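
As a rough illustration of the level of abstraction involved, the sketch below opens a font with `fontTools.ttLib.TTFont` and reads a few values by table tag. The helper names `open_font` and `summarize_font` are made up for this example; `summarize_font` only relies on dictionary-style table access, so it also accepts a stand-in object.

```python
from types import SimpleNamespace


def open_font(path):
    """Open a TrueType/OpenType file; tables are parsed lazily on access."""
    from fontTools.ttLib import TTFont  # third-party: pip install fonttools

    return TTFont(path)


def summarize_font(font):
    """Return the table tags present and the scale to 1000 units per em."""
    units_per_em = font["head"].unitsPerEm
    return {
        "tables": sorted(font.keys()),
        "units_per_em": units_per_em,
        "scale": 1000 / units_per_em,
    }


# Stand-in for a TTFont, so the sketch runs without a font file at hand.
fake_font = {"head": SimpleNamespace(unitsPerEm=2048), "hhea": SimpleNamespace()}
summary = summarize_font(fake_font)
print(summary["tables"], summary["scale"])
```

With a real file this would be something like `summarize_font(open_font("SomeFont.ttf"))`, where the file name is a placeholder.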

Additional context

  • Issue #210 (Vertically align Text in a Cell or Multicell), Discussion #411 (Migrating codebase from old FPDF to fpdf2), and probably others, request the ability to position text vertically with more precision (in a variety of contexts). This requires access to more font metrics data than we currently use.
  • Issue #365 (Hindi text is rendered incorrectly) notes that writing systems with mandatory ligatures don't render correctly (the same is true for contextual glyph selection, and possibly other features). Doing this right would require access to the substitution tables (there are several types) in supporting font files. I've had a very quick look at the file format details, and found that it would take a lot of time and effort to add that ourselves. With fonttools, at least the data retrieval would come right out of the box, even if we'd still need to figure out the actual text transformations based on that.
  • Issue #224 (CBDT/CBLC font support for color emojis) could probably be resolved by this.
  • Apparently this is not a new topic at all, and has been discussed for years and years, even in the old fpdf repository (can't find the links right now).
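
To give an idea of what "out of the box" retrieval could look like for the substitution tables mentioned above, this sketch lists the feature tags stored in a font's `GSUB` table, using the attribute path fontTools exposes for it (`table.FeatureList.FeatureRecord[i].FeatureTag`). The function name `gsub_feature_tags` is hypothetical, and applying the substitutions to text is a separate problem that this does not address.

```python
from types import SimpleNamespace


def gsub_feature_tags(font):
    """Return the sorted set of GSUB feature tags, e.g. 'liga' or 'akhn'.

    `font` is expected to behave like fontTools.ttLib.TTFont: tables are
    looked up by tag, and `"GSUB" in font` tells whether the table exists.
    """
    if "GSUB" not in font:
        return []
    feature_list = font["GSUB"].table.FeatureList
    if feature_list is None:  # a GSUB table may carry no features at all
        return []
    return sorted({record.FeatureTag for record in feature_list.FeatureRecord})


# Stand-in mimicking the shape of a parsed GSUB table (not real font data).
fake_gsub = SimpleNamespace(
    table=SimpleNamespace(
        FeatureList=SimpleNamespace(
            FeatureRecord=[
                SimpleNamespace(FeatureTag="liga"),
                SimpleNamespace(FeatureTag="akhn"),
                SimpleNamespace(FeatureTag="liga"),
            ]
        )
    )
)
tags = gsub_feature_tags({"GSUB": fake_gsub})
print(tags)
```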

Open questions

  • Are there any alternatives to fonttools out there that I've missed?
  • Given the apparently quite dynamic development over there, should we pin the dependency to a specific major version?

@Lucas-C
Member

Lucas-C commented May 9, 2022

Thank you for initiating this issue @gmischler!
I planned to do the same following discussion #411.

Currently, the ttfonts.TTFontFile class contains all the font-parsing logic.
Its usage is confined to a few places:

  • among its public methods, the only ones used in other parts of the code are .getMetrics() & .makeSubset()
  • several attributes are also directly read on TTFontFile instance objects: .ascent, .descent, .capHeight, .flags, .bbox, .defaultWidth...
  • all those usages are made inside FPDF.add_font & FPDF._putfonts

I think a starting point could be to rewrite FPDF.add_font & FPDF._putfonts to use the fonttools lib,
and then check that we can successfully generate a PDF with text.

Then, in a second phase, care should be taken to ensure backward compatibility
and try to make all the existing text-related unit tests pass with minimum visual changes.

Contributions are welcome!

@gmischler
Collaborator Author

and try to make all the existing text-related unit tests pass with minimum visual changes.

If the transition is done right (and assuming the current solution works correctly), I don't see why there should be any difference in the output at all.

@RedShy

RedShy commented Jun 11, 2022

Contributions are welcome!

I would like to help with this, even though I'm new to fonts and to embedding fonts in PDFs, so I expect to make some mistakes along the way.

I think a starting point could be to rewrite FPDF.add_font & FPDF._putfonts to use the fonttools

I managed to use fonttools inside FPDF.add_font and drop ttfonts.py completely, I will open a draft PR for these changes.

I'm running into problems using fonttools inside FPDF._putfonts. I see that there is a method ttf.makeSubset(), whose purpose I don't really understand, and then I see that ttfontstream is generated.
From what I know, this is a sequence of bytes embedded in the PDF. I have some doubts here:

  • Why does ttfontstream have to be created, and why can't we embed the .ttf file directly in the PDF?
  • I observed that only the tables ("OS/2","cmap","cvt","fpgm","gasp","glyf","head","hhea","hmtx","loca","maxp","name","post","prep") are included in ttfontstream; the others are dropped, and I don't get why.

Where can I find more information to keep going?
What I could do is try to assemble the ttfontstream byte sequence using only fonttools, but I'm not sure how to do that for now.

@gmischler
Collaborator Author

  • Why does ttfontstream have to be created, and why can't we embed the .ttf file directly in the PDF?

Simply put: We only want to include the data that is actually needed to render the PDF.
For 8-bit codepage based ttfs, this apparently results in the tables you list in the next point.
For Unicode font files, the volume needs to get reduced further. Those can get arbitrarily large, dozens of megabytes are not uncommon. Because of that, only the glyphs that are actually used in the file are included. Each glyph gets a local index number for that purpose, which is usually different from its Unicode code point.
I'm not sure if you need to worry about those details too much, though. As a first step, it should be enough to just find the currently used data through the new library. The level of abstraction between our ttfonts.TTFontFile and fonttools is likely to be quite different, so you'll have to do a little research to find the respective equivalent calls.
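
The subsetting idea described above can be sketched in two parts. `assign_local_ids` is a hypothetical helper showing how used characters could get compact local indices; `subset_font_bytes` uses the `fontTools.subset` module (`Options`, `Subsetter.populate`, `Subsetter.subset`) to produce a reduced font as bytes. Whether this matches what `makeSubset()` emits byte for byte is not something this sketch claims.

```python
from io import BytesIO


def assign_local_ids(text):
    """Map each distinct code point in `text` to a small local index.

    Index 0 is left free here for the missing-glyph slot (an assumption
    of this sketch), so numbering starts at 1 in order of first use.
    """
    mapping = {}
    for char in text:
        mapping.setdefault(ord(char), len(mapping) + 1)
    return mapping


def subset_font_bytes(path, text):
    """Return a font file reduced to the glyphs needed for `text`."""
    from fontTools import subset  # third-party: pip install fonttools
    from fontTools.ttLib import TTFont

    font = TTFont(path)
    subsetter = subset.Subsetter(subset.Options())
    subsetter.populate(unicodes=[ord(char) for char in set(text)])
    subsetter.subset(font)  # modifies `font` in place
    buffer = BytesIO()
    font.save(buffer)
    return buffer.getvalue()


local_ids = assign_local_ids("hello")
print(local_ids)
```

The local index and the Unicode code point differ for every character here, which is the point made above about glyph numbering inside the embedded subset.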

  • I observed that only the tables ("OS/2","cmap","cvt","fpgm","gasp","glyf","head","hhea","hmtx","loca","maxp","name","post","prep") are included in ttfontstream; the others are dropped, and I don't get why.

TTF files can contain a large number of different tables, some of which are only used on a particular OS, or serve some other special purposes. Many of the "dropped" tables could be used to help select the right glyph (e.g. "GSUB"), or to position it optimally (e.g. kerning). These are tasks that need to happen when the file is created (if at all), so there's no benefit in including that data in there, since a PDF reader would have no use for it.

In fact, making it easier to access some of the other tables (e.g. "GSUB" for solving #365) is one of the primary purposes of using fonttools in the first place.
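
One way to picture the table filtering discussed here: compute which tags fall outside a keep-list and delete them from the font object. The keep-list below copies the tags quoted earlier in this thread; `tables_to_drop` and `strip_tables` are hypothetical names, and `"GlyphOrder"` is skipped because fontTools reports it among the keys although it is not a real table.

```python
KEEP_TABLES = {
    "OS/2", "cmap", "cvt", "fpgm", "gasp", "glyf", "head",
    "hhea", "hmtx", "loca", "maxp", "name", "post", "prep",
}


def tables_to_drop(present_tags, keep=KEEP_TABLES):
    """Return the tags that are not on the keep-list.

    Tags are compared after stripping padding, because four-character
    tags such as "cvt " carry a trailing space in the font file.
    """
    return [
        tag
        for tag in present_tags
        if tag.strip() not in keep and tag != "GlyphOrder"
    ]


def strip_tables(font, keep=KEEP_TABLES):
    """Delete unwanted tables in place; works on a TTFont or a plain dict."""
    for tag in tables_to_drop(list(font.keys()), keep):
        del font[tag]
    return font


# Plain dict standing in for a font, so the sketch runs without fontTools.
demo = {"head": 1, "cvt ": 2, "GSUB": 3, "kern": 4, "glyf": 5}
dropped = tables_to_drop(demo.keys())
strip_tables(demo)
print(dropped, sorted(demo))
```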

@Lucas-C
Member

Lucas-C commented Jun 14, 2022

I managed to use fonttools inside FPDF.add_font and drop ttfonts.py completely, I will open a draft PR for these changes.

Good job @RedShy!
Were you able to produce a PDF using a font coming from a .ttf file?

there is the method ttf.makeSubset() that I don't get what really does

It comes directly from the PHP original code:
https://github.com/Setasign/tFPDF/blob/master/font/unifont/ttfonts.php#L494

I couldn't explain its role clearly...

@gmischler already provided an excellent answer. I don't have more useful information to share here...
There is a lot of code exploration to do.
Maybe fonttools documentation would be helpful in understanding tables roles:
https://fonttools.readthedocs.io/en/latest/ttLib/index.html

@gmischler
Collaborator Author

Where can I find more information to keep going?

The most comprehensive information I've found on the font file format and the meaning of the various tables is from Microsoft: OpenType Specification Version 1.9
Apple also has some information: TrueType Reference Manual

@RedShy

RedShy commented Jun 16, 2022

Thank you both! It's really encouraging and motivating to receive such thorough answers!

We only want to include the data that is actually needed to render the PDF.

That makes sense, and it's clearer now!

As a first step, it should be enough to just find the currently used data through the new library.

I managed to do that inside FPDF.add_font(). I looked at every piece of data extracted with ttfonts.TTFontFile and searched for its equivalent in fonttools, then I ran the tests and they are all green. I would like to organize the code better and make it more self-explanatory, but for example, this is the code I put inside FPDF.add_font():

# font tools
from fontTools import ttLib

ft = ttLib.TTFont(ttffilename)

scale = 1000 / ft["head"].unitsPerEm
ascent = ft["hhea"].ascent * scale
descent = ft["hhea"].descent * scale
try:
    capHeight = ft["OS/2"].sCapHeight * scale
except AttributeError:
    capHeight = ascent
bbox = (
    f"[{ft['head'].xMin * scale:.0f} {ft['head'].yMin * scale:.0f}"
    f" {ft['head'].xMax * scale:.0f} {ft['head'].yMax * scale:.0f}]"
)
stemV = 50 + int(pow((ft["OS/2"].usWeightClass / 65), 2))
italicAngle = ft["post"].italicAngle
underlinePosition = ft["post"].underlinePosition * scale
underlineThickness = ft["post"].underlineThickness * scale

flags = 4
if ft["post"].isFixedPitch:
    flags |= 1
if ft["post"].italicAngle != 0:
    flags |= 64
if ft["OS/2"].usWeightClass >= 600:
    flags |= 262144

aw = ft["hmtx"].metrics[".notdef"][0]
defaultWidth = scale * aw

name = ft["name"].getBestFullName()

cmap = ft.getBestCmap()  # maps Unicode code point -> glyph name
charWidths = [len(cmap) - 1]
for char, glyph in cmap.items():
    if char in (0, 65535) or char >= 196608:
        continue

    aw = ft["hmtx"].metrics[glyph][0]  # advance width in font units

    if char >= len(charWidths):
        size = (((char + 1) // 1024) + 1) * 1024
        delta = size - len(charWidths)
        if delta > 0:
            charWidths += [defaultWidth] * delta

    w = round(scale * aw + 0.001) or 65535  # ROUND_HALF_UP
    charWidths[char] = w

# Cross-check every value against the existing ttfonts.py parser
ttf = TTFontFile()
ttf.getMetrics(ttffilename)

assert ascent == ttf.ascent
assert descent == ttf.descent
assert capHeight == ttf.capHeight
assert bbox == (
    f"[{ttf.bbox[0]:.0f} {ttf.bbox[1]:.0f}"
    f" {ttf.bbox[2]:.0f} {ttf.bbox[3]:.0f}]"
)
assert italicAngle == ttf.italicAngle
assert stemV == ttf.stemV
assert underlinePosition == ttf.underlinePosition
assert underlineThickness == ttf.underlineThickness
assert flags == ttf.flags
assert defaultWidth == ttf.defaultWidth
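
The magic numbers in the flags computation above correspond to bit positions of the font descriptor `Flags` entry in the PDF specification (FixedPitch, Symbolic, Italic, ForceBold). A small sketch with named constants, using the hypothetical helper names `descriptor_flags` and `stem_v`, might make that part more self-explanatory:

```python
# Font descriptor flag bits (PDF numbers bits from 1, so bit n is 1 << (n - 1)).
FLAG_FIXED_PITCH = 1 << 0   # bit 1  -> 1
FLAG_SYMBOLIC = 1 << 2      # bit 3  -> 4
FLAG_ITALIC = 1 << 6        # bit 7  -> 64
FLAG_FORCE_BOLD = 1 << 18   # bit 19 -> 262144


def descriptor_flags(is_fixed_pitch, italic_angle, weight_class):
    """Combine the flag bits the same way as the snippet above."""
    flags = FLAG_SYMBOLIC
    if is_fixed_pitch:
        flags |= FLAG_FIXED_PITCH
    if italic_angle != 0:
        flags |= FLAG_ITALIC
    if weight_class >= 600:
        flags |= FLAG_FORCE_BOLD
    return flags


def stem_v(weight_class):
    """Heuristic stem width from the snippet above: 50 + (weight / 65)^2."""
    return 50 + int((weight_class / 65) ** 2)


regular = descriptor_flags(is_fixed_pitch=0, italic_angle=0, weight_class=400)
bold_italic_mono = descriptor_flags(is_fixed_pitch=1, italic_angle=-12, weight_class=700)
print(regular, bold_italic_mono, stem_v(400))
```

The inputs would come from `ft["post"].isFixedPitch`, `ft["post"].italicAngle` and `ft["OS/2"].usWeightClass`, as in the code above.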

After this, I wanted to do the same with FPDF._putfonts(). I see that the only data used there are ttfontstream and codeToGlyph, both of which are initialized inside ttf.makeSubset(). So my idea was to produce them with fonttools in exactly the way they are currently built, but I was not able to do that. It's hard for me to read that part of the code and understand what's really going on.

What I have understood so far about ttfontstream is that it is basically a "cleaned" font file embedded inside the PDF, containing only the information relevant for rendering the font.

But how exactly does the font have to be "cleaned" (in order to create it with fonttools)?

  • Which tables have to stay, and which have to be dropped?
  • Do the kept tables have to be modified?
  • Are there other modifications to make besides working on the tables?

@gmischler
Collaborator Author

But how exactly does the font have to be "cleaned" (in order to create it with fonttools)?

"Use the source, luke!" 😉

I guess there's no other way to figure it out than to step through the existing code, look where it gets its data from, and then replace that source with fonttools. Anyone else would have to go through the same steps to give you a better answer, and whoever originally wrote that code is probably not following the project anymore.

If it turns out to be too confusing for the direct approach, you could try to refactor the existing code first. Try to simplify it by farming out the code dealing with individual tables (or other data structures) to separate methods with descriptive names. In a second step, you can then transition one of those at a time. Such a refactoring might also help to simplify future modifications and extensions.
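
A sketch of what such a refactoring could look like: one small function per table, each with a descriptive name, combined by a thin coordinator. All names here (`FontMetrics`, `read_head_bbox`, and so on) are invented for illustration and are not taken from ttfonts.py; the table attributes are the ones used in the snippet earlier in this thread.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class FontMetrics:
    ascent: float
    descent: float
    bbox: tuple
    italic_angle: float


def read_scale(font):
    """'head' table: factor converting font units to 1000 units per em."""
    return 1000 / font["head"].unitsPerEm


def read_vertical_extents(font, scale):
    """'hhea' table: scaled ascent and descent."""
    hhea = font["hhea"]
    return hhea.ascent * scale, hhea.descent * scale


def read_head_bbox(font, scale):
    """'head' table: scaled, rounded bounding box of all glyphs."""
    head = font["head"]
    corners = (head.xMin, head.yMin, head.xMax, head.yMax)
    return tuple(round(value * scale) for value in corners)


def read_italic_angle(font):
    """'post' table: italic angle in degrees."""
    return font["post"].italicAngle


def collect_metrics(font):
    """Coordinator: each table is handled by exactly one helper."""
    scale = read_scale(font)
    ascent, descent = read_vertical_extents(font, scale)
    return FontMetrics(ascent, descent, read_head_bbox(font, scale), read_italic_angle(font))


# Stand-in tables with made-up numbers; a fontTools TTFont is indexed the same way.
fake_font = {
    "head": SimpleNamespace(unitsPerEm=2000, xMin=-100, yMin=-400, xMax=2200, yMax=1800),
    "hhea": SimpleNamespace(ascent=1800, descent=-400),
    "post": SimpleNamespace(italicAngle=0),
}
metrics = collect_metrics(fake_font)
print(metrics)
```

With this shape, each helper could be switched from the old parser to fonttools on its own, which is the one-at-a-time transition suggested above.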

@RedShy

RedShy commented Jun 27, 2022

Loved the Star Wars reference! 😁

I guess there's no other way to figure it out than to step through the existing code, look where it gets its data from, and then replace that source with fonttools.

Okay then, I'm a bit busy these days, but I will try to do it in the coming weeks.

If it turns out to be too confusing for the direct approach, you could try to refactor the existing code first.

Yes, it's a good idea.

@Lucas-C
Member

Lucas-C commented Jul 21, 2022

Hi @RedShy!
I'd like to rekindle this ^^
Have you been blocked by anything that I could help with?

@RedShy

RedShy commented Jul 25, 2022

Hi! Unfortunately I could not work much on this in the last few days, but I would still like to contribute! I hope to have more time to dedicate to it in the coming days.

@Lucas-C
Member

Lucas-C commented Sep 7, 2022

Given that this migration has been beautifully made by @RedShy in #477,
do you think that we can close this @gmischler?

@gmischler
Collaborator Author

Well, this task looks quite finished, so I guess we can declare it as such.
