Hindi text is rendered incorrectly #365

Closed
namastevis opened this issue Mar 14, 2022 · 20 comments · Fixed by #820

Comments

@namastevis

namastevis commented Mar 14, 2022

While trying to generate a PDF using fpdf2, the Hindi text is not rendered correctly. I have tried different fonts (Gargi, Mangal, Arjun-Wide, Mukta, Lohit), but all give a wrong result similar to what is shown below.
Correct hindi text: इण्टरनेट पर हिन्दी के साधन
What is printed:
[Screenshot 2022-03-14 at 10 40 13 PM]

It seems the issue happens in the following two scenarios:

  1. [Screenshot 2022-03-14 at 10 44 12 PM] When this appears before a character, while printing it moves to the next character.
  2. When two consonants are merged to generate a ligature in Hindi, they get split into two.
@namastevis namastevis added the bug label Mar 14, 2022
@gmischler
Collaborator

Unfortunately, this is not a trivial problem to solve, and fpdf is a deliberately simple PDF generation library.

What you're seeing is a lack of support for automatic ligatures, more specifically Devanagari conjuncts. There are hundreds of those, many more than normal characters. A supporting font will include a table of character sequences that are supposed to be substituted with a ligature glyph. The most complex example in your text is this (separated with spaces on the left side, so the browser does not combine them):
न ् द ी ▶️ न्दी
Yes, that's four (4) individual Unicode characters in the text that together should result in a single glyph.
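
To make those four code points visible, here is a minimal sketch using only Python's standard `unicodedata` module. It is purely illustrative and not part of fpdf2; the helper name `describe` is made up for this example.

```python
import unicodedata

# The conjunct from the example above, written with explicit escapes so that
# no editor or browser can recombine the characters.
cluster = "\u0928\u094d\u0926\u0940"


def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code_point, name in describe(cluster):
    print(code_point, name)
```

A renderer that works character by character sees four separate items here, which is why the conjunct falls apart without a substitution step.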

Unfortunately, fpdf currently operates on a character-by-character basis, first determining the width of each character and later printing a suitable glyph from the selected font. Supporting ligatures would require their substitution to happen as the very first step. We would also need a custom data structure to represent them, because they cannot be represented by a single Python Unicode character.

Technically all of that is certainly possible, but I wouldn't hold my breath for it right now. Anyone who knows enough about the internal structure of TTF fonts is of course welcome to contribute...

Btw: Ligatures exist in many other writing systems. Another peculiarity that might also be interesting is contextual forms, where a different glyph is used for the same character depending on whether it appears at the beginning, the middle, or the end of a word, or isolated (common e.g. in Arabic, Hebrew, Mongolian, etc.).

@Lucas-C
Member

Lucas-C commented Apr 23, 2022

This issue also affects Tamil text: global-healthy-liveable-cities/global_scorecards#7

@MayankFawkes

MayankFawkes commented Aug 11, 2022

@gmischler @Lucas-C Unfortunately, the problem is still the same. Pillow had the same problem rendering fonts; they then added a font layout engine, ImageFont.Layout.RAQM, which solves it. I am not an expert, but I think libraqm could help if someone can add it to fpdf2. Useful link: https://github.com/python-pillow/Pillow/blob/main/src/_imagingft.c#L118
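
As a side note for readers, a Pillow installation can be asked whether it was built with Raqm support. The sketch below assumes Pillow may or may not be installed and uses its `PIL.features` module; the function name `pillow_has_raqm` is made up for this example.

```python
def pillow_has_raqm():
    """Return True if the installed Pillow build can use the Raqm layout engine."""
    try:
        from PIL import features  # third-party: pip install Pillow
    except ImportError:
        return False  # Pillow itself is not installed
    # features.check() returns a falsy value when the feature is unavailable.
    return bool(features.check("raqm"))


print("Raqm available in Pillow:", pillow_has_raqm())
```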

@Lucas-C
Member

Lucas-C commented Aug 11, 2022

This is an interesting lead, thank you @MayankFawkes.
One limitation is that libraqm is a C library that hasn't been packaged as a Python package, AFAIK.
Hence it won't be straightforward for fpdf2 to have a dependency on it.

@gmischler
Collaborator

Interesting indeed!

Especially since we already have Pillow as a dependency...
Could we possibly "borrow" their layout engine? Or can that only be used to add text to an image?

@MayankFawkes

MayankFawkes commented Aug 11, 2022

@Lucas-C There are some ways we can use it. Prebuilt binaries of libraqm are available: on Linux it is as simple as apt-get install libraqm0 libraqm-dev, and for Windows there are third-party builds. We could release an optional update that supports the libraqm engine, which would solve the problem of rendering Hindi and other Unicode fonts.

First, there is the ctypes module in the standard library. It allows you to load a dynamic-link library (DLL on Windows, shared libraries .so on Linux) and call functions from these libraries, directly from Python. Such libraries are usually written in C. -- source

Dependency problem: Pillow uses libraqm but does not install it along with Pillow, because it is optional. If we want Pillow to render such fonts properly, we have to install libraqm manually. We can do the same, and if we want to provide it as a dependency, the best way is to make our own builds for the different architectures and put them in the pip wheel files.
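
The optional-dependency idea could look roughly like the sketch below, which uses only the standard library. It only tries to locate and load the shared library and does not call any libraqm function; the helper name `load_optional_library` is an assumption made for this example.

```python
import ctypes
import ctypes.util


def load_optional_library(name):
    """Try to load a shared library by its short name; return None if unavailable."""
    # find_library maps a short name such as "raqm" to a platform-specific
    # file name or path, or returns None when nothing is found.
    path = ctypes.util.find_library(name)
    if path is None:
        return None
    try:
        return ctypes.CDLL(path)
    except OSError:
        # The file exists but could not be loaded (wrong architecture, etc.).
        return None


raqm = load_optional_library("raqm")
print("libraqm found:", raqm is not None)
```

Code using this pattern can fall back to unshaped output when the library is missing, which is how an optional feature would have to behave.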

To add the libraqm dependency: there is a library written in C/C++ for decoding QR codes and barcodes called zbar, which also ships binary files. Someone made a wrapper for it called pyzbar, and this is how they build the wheel files to include the zbar binaries (link): just adding the binaries to the wheel.

Here are some more links on adding C support in Python with ctypes:
digitalocean
betterprogramming

@gmischler
Collaborator

gmischler commented Aug 12, 2022

If fpdf2 were Linux-only, using stuff like ctypes would be no problem.
But for Windows, macOS, and potentially other systems, only dependencies that can be installed via pip are practically realistic.

This specific issue here is "only" about ligatures, which are primarily necessary for Indic scripts.
For scripts derived from Aramaic (Arabic, Hebrew, Mongolian, etc.), a more complete solution would indeed also handle bidi text and positional context, so having a feature-complete "layout engine" would be nice.

A Python implementation of the bidi algorithm is available in python-bidi, though it doesn't look particularly complicated, so we could easily roll our own.
There's also python-arabic-reshaper. It reverses the direction and does the necessary contextual and ligature substitutions. Unfortunately, all the substitutions are hardcoded, so it is suitable for Arabic and Kurdish text only. But it can still serve as an example of a possible way to proceed and of what issues to look out for.
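
For orientation, the raw input to any bidi implementation is the bidirectional class of each character, which Python's standard `unicodedata` module already exposes. The sketch below only classifies characters; it does not implement the reordering algorithm itself, and the helper names are made up for this example.

```python
import unicodedata


def bidi_classes(text):
    """Return the Unicode bidirectional class of every character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


def has_rtl(text):
    """Return True if text contains strongly right-to-left characters."""
    # "R" covers e.g. Hebrew letters, "AL" covers Arabic letters.
    return any(cls in ("R", "AL") for cls in bidi_classes(text))


# Latin, then two Hebrew letters, then two Arabic letters.
sample = "abc \u05d0\u05d1 \u0627\u0628"
print(bidi_classes(sample))
print("needs bidi handling:", has_rtl(sample))
```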

A general solution to ligatures requires a lookup of the substitutions in the font data. This seems straightforward, but we'll have to see what pitfalls we run into with it.
Note that we have some requirements that I haven't yet seen satisfied with any of the existing modules. Among other things, for any ligature glyph sequence, we need to preserve information about the original unicode code points. Those will be added as supplementary info to the PDF text, so that when you copy from a PDF viewer, you get the original unsubstituted text back.
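
The requirement of keeping the original code points next to each substituted glyph can be illustrated with a toy sketch. Everything in it is hypothetical: the `Glyph` class, the glyph names, and the hardcoded Latin ligature table stand in for data that would really come from the font, and this is not how fpdf2 implements it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    name: str    # hypothetical glyph name, as a font might call it
    source: str  # the original Unicode text this glyph stands for


# Toy substitution table: character sequence -> ligature glyph name.
# In a real implementation this would be looked up in the font data.
LIGATURES = {
    ("f", "f", "i"): "f_f_i",
    ("f", "i"): "f_i",
}
MAX_LEN = max(len(key) for key in LIGATURES)


def substitute(text):
    """Greedily replace the longest matching sequences with ligature glyphs."""
    glyphs = []
    i = 0
    while i < len(text):
        # Try the longest candidate sequence first, down to length 2.
        for length in range(min(MAX_LEN, len(text) - i), 1, -1):
            key = tuple(text[i:i + length])
            if key in LIGATURES:
                glyphs.append(Glyph(LIGATURES[key], text[i:i + length]))
                i += length
                break
        else:
            # No ligature starts here: emit the character as its own glyph.
            glyphs.append(Glyph(text[i], text[i]))
            i += 1
    return glyphs


glyphs = substitute("office")
print([g.name for g in glyphs])
print("".join(g.source for g in glyphs))
```

Because every glyph carries its `source`, joining the sources always reproduces the input text, which is the property needed for copy and paste from a PDF viewer.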

So I suspect we can't just slap on a few more dependencies and let those do the work for us.
I'd suggest a step-by-step approach, that we essentially have already begun with:

  1. switch to fonttools (currently being worked on: Rewriting add_font() and _putfonts() using Fonttools library #477)
  2. support ligature substitutions
  3. support contextual substitutions
  4. support bidi text
  5. support vertical text

Obviously all of this won't happen within a few weeks. Care should also be taken at any step to take the possible requirements of the following steps into account, at least as far as can be predicted at the given time.

@Lucas-C
Member

Lucas-C commented Aug 2, 2023

@andersonhc PR #820 has been merged today.

Could you test if that solved your issue @namastevis?

You can install fpdf2 directly from the master branch of this repo with this command:

pip install git+https://github.com/PyFPDF/fpdf2.git@master

The documentation is there: https://pyfpdf.github.io/fpdf2/TextShaping.html

@mohindra9211

mohindra9211 commented Sep 30, 2023

I continue to encounter the same problem, and unfortunately, it remains unresolved. I have experimented with various Hindi Devanagari fonts, but the text still does not render correctly.

Original Text = परी कथाएँ काल्पनिक होते हुए भी मन को उड़ान देने वाली और शिक्षाप्रद होती हैं।

FPDF2 output:
[Screenshot 2023-09-30 193610]

I'm perplexed by this issue. When I copy the output from FPDF2 and paste it into a web browser, it displays the correct output. I'm struggling to comprehend the source of this problem.

Browser output: परी कथाएँ काल्पनिक होते हुए भी मन को उड़ान देने वाली और शिक्षाप्रद होती हैं।

It seems the issue happens in the following two scenarios:

  1. [image] When this appears before a character, while printing it moves to the next character.
  2. When two consonants are merged to generate a ligature in Hindi, they get split into two.

I humbly seek your support. In my role as a data scientist, I have explored different libraries for PDF generation and have observed that FPDF consistently delivers better outcomes in comparison to ReportLab, which encounters the same issue.

Python Version 3.11.4
FPDF2 Latest Version

@andersonhc
Collaborator

@mohindra9211 did you try "set_text_shaping()"?

Here is the small test I did:

from fpdf import FPDF

text = "परी कथाएँ काल्पनिक होते हुए भी मन को उड़ान देने वाली और शिक्षाप्रद होती हैं।"

pdf = FPDF()
pdf.add_page()
# Register a TrueType font with Devanagari glyphs (adjust the path to your system).
pdf.add_font(family="Mangal", fname="C:\\Apps\\fpdf2\\test\\text_shaping\\Mangal 400.ttf")
pdf.set_font("Mangal", size=40)
# First paragraph: text shaping disabled, for comparison.
pdf.set_text_shaping(False)
pdf.multi_cell(w=pdf.epw, txt=text, new_x="LEFT", new_y="NEXT")
pdf.ln()
# Second paragraph: the same text with text shaping enabled.
pdf.set_text_shaping(True)
pdf.multi_cell(w=pdf.epw, txt=text, new_x="LEFT", new_y="NEXT")
pdf.output("hindi.pdf")

And the results with text shaping enabled look correct.

@mohindra9211

Dear AndersonHC,

I want to express my sincere gratitude for your assistance. You've helped me resolve a significant issue. However, I've noticed a minor problem in the output, and I suspect it might be related to the font. I'll try using different fonts to see if that resolves the issue.

Thank you once again for your valuable help.

@andersonhc
Collaborator

Can you tell me what font and text you used?
I'd love to have all those glitches corrected.

@mohindra9211

This problem is solved
The "Karma" font (in the file "Karma-Regular.ttf") is the most suitable choice for displaying Hindi text. I have included a sample for your reference.

Thank you once again for your valuable help.

@mohindra9211

Tomorrow, I'll provide a list of fonts that correctly support Hindi text. Please incorporate this information into your document. It will be particularly beneficial for FPDF2 users, especially those in India. I appreciate your support and prompt response. Thank you.

@gmischler
Collaborator

Those fonts that don't work with fpdf2, do they produce correct results with other software?
If so, then maybe you can provide a list of that category as well.
If we can't make them work on our own, then it may actually be that harfbuzz (the library that does the actual text shaping) is unable to handle them. In that case, the developers there might be interested to learn about it.
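
One way to check a font against HarfBuzz directly, outside of fpdf2, is the `uharfbuzz` Python binding. The sketch below assumes `uharfbuzz` is installed and that the caller supplies a path to a real font file; the function name `shape` is chosen for this example.

```python
def shape(font_path, text):
    """Shape text with HarfBuzz and return (glyph id, cluster index) pairs."""
    # Imported lazily so that this module loads even without the dependency.
    import uharfbuzz as hb  # third-party: pip install uharfbuzz

    blob = hb.Blob.from_file_path(font_path)
    face = hb.Face(blob)
    font = hb.Font(face)

    buf = hb.Buffer()
    buf.add_str(text)
    buf.guess_segment_properties()  # infer script, language and direction
    hb.shape(font, buf)

    # After shaping, "codepoint" holds the glyph id inside the font, and
    # "cluster" points back to the position in the original text.
    return [(info.codepoint, info.cluster) for info in buf.glyph_infos]
```

Calling it with the font in question and a short Hindi string shows which glyphs HarfBuzz picks. If the same font renders correctly in other software but gives unexpected glyphs here, that would be worth reporting to the HarfBuzz developers.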

@sanjaykare

I tried the given code with the Mangal_Regular font, but it is not working. It may be a font problem. Please correct it.
[image]

@mohindra9211

Read "AttributeError" carefully and write the correct path
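
Since the advice here is to double-check the font path, a small guard can turn a confusing error into a clear one before the font is passed to add_font(). This is a sketch with a made-up helper name, `require_font_file`, using only the standard library.

```python
from pathlib import Path


def require_font_file(path):
    """Return the path as a string, or raise a clear error if it is not a file."""
    font_path = Path(path)
    if not font_path.is_file():
        # resolve() shows the absolute location that was actually checked.
        raise FileNotFoundError(f"Font file not found: {font_path.resolve()}")
    return str(font_path)
```

With it, the call from the test script becomes pdf.add_font(family="Mangal", fname=require_font_file(...)), and a wrong path is reported immediately.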

@sanjaykare

sanjaykare commented Oct 25, 2023

Fixed, the file path was not correct. Thank you!
I have two more questions: 1) Can we use Hindi text with HTML tags? 2) Can we use Hindi and English text together?

@mohindra9211

To gain a better understanding of fpdf2, it is advisable to peruse the fpdf2 documentation along with its tutorials. It's worth noting that you can incorporate both Hindi and English text into your documents, depending on your coding proficiency.
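
As a rough illustration of mixing both languages, the sketch below writes one paragraph containing English and Hindi. It assumes that fpdf2 and uharfbuzz are installed and that the chosen font contains both Latin and Devanagari glyphs; the function name, the "HindiFont" family name and the output file name are made up for this example.

```python
def write_mixed_text(font_path, output_path="mixed.pdf"):
    """Write a single paragraph mixing English and Hindi into a PDF file."""
    # Imported lazily so that this module loads even without the dependency.
    from fpdf import FPDF  # third-party: pip install fpdf2

    pdf = FPDF()
    pdf.add_page()
    # Assumption: this one font covers both Latin and Devanagari.
    pdf.add_font(family="HindiFont", fname=font_path)
    pdf.set_font("HindiFont", size=24)
    pdf.set_text_shaping(True)  # needed for correct conjuncts and vowel signs

    text = "Hindi and English together: इण्टरनेट पर हिन्दी के साधन"
    # Arguments are passed positionally (width, line height, text), because
    # the keyword for the text argument may differ between fpdf2 releases.
    pdf.multi_cell(pdf.epw, None, text)
    pdf.output(output_path)
```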
