Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write() refactor to use new line wrapping code #346

Merged
merged 87 commits into from
Mar 7, 2022
Merged

Write() refactor to use new line wrapping code #346

merged 87 commits into from
Mar 7, 2022

Conversation

gmischler
Copy link
Collaborator

@gmischler gmischler commented Mar 3, 2022

Fixes #340

Checklist:

  • The GitHub pipeline is OK (green),
    meaning that both pylint (static code analyzer) and black (code formatter) are happy with the changes of this PR.
  • A unit test is covering the code added / modified by this PR
  • This PR is ready to be merged
  • In case of a new feature, docstrings have been added, with also some documentation in the docs/ folder
  • A mention of the change is present in CHANGELOG.md
  • The PR description or comment contains a picture of a cute animal (not mandatory but encouraged 🦒 )

After changing .write() to use the new line wrapping code, it now handles soft hyphens.

Besides .write() itself, changes were necessary two other methods:

  • ._render_styled_cell_text() needed to learn how to set the current position to the end of the rendered text (for .write() really a bit left of the actual end).
  • When MultiLineBreak.get_line_of_given_width() ends up with a single word taking more space than the whole line, it just splits it apart unceremonously (with not even a hyphen). Since .write() often starts on an already partially populated line where the remaining available space may be very short, this could happen frequently but is not desired in this context. I added an option no_wordsplit=False to prevent that when set to True. If the current line has characters added but no break hint, then it will rewind its internal state and return an empty line, causing .write() to switch to a new line.

I have refactored the purely internal ._render_styled_cell_text() to use newpos_x=X.RIGHT and newpos_y=Y.TOP in place of ln=0. This opens the way for many other options.

	newpos_x: Current position in x after the call.
		X.LEFT    - left end of the cell
		X.RIGHT   - right end of the cell (default)
		X.START   - start of actual text
		X.END     - end of actual text
		X.WCONT   - for write() to continue next (slightly left of X.END
		X.CENTER  - center of actual text
		X.LMARGIN - left page margin (start of printable area)
		X.RMARGIN - right page margin (end of printable area)
	newpos_y: Current position in y after the call.
		Y.TOP     - top of the first line (default)
		Y.LAST    - top of the last line (same as TOP for single-line text)
		Y.NEXT    - top of next line (bottom of current text)
		Y.TMARGIN - top page margin (start of printable area)
		Y.BMARGIN - bottom page margin (end of printable area)

Not all of those have an immediately obvious use case, but they're essentially free and someone might find them practical for their purposes.
This change demonstrates that the concept works very nicely and intuitively. The next step will be to add them to the API of the public methods and deprecating ln=# (along with center=) in a follow-up PR.

There used to be no tests whatsoever for .write() itself, it only got tested incidentally because it is used by html.py. I've written some tests to verify the basic functionality, as well as that it handles page breaks and soft hyphens correctly.
At that opportunity, I've renamed the directory test/cells to test/text and collected all text related tests in there.
Additionally I created tests to verify that '._render_styled_cell_text()' determines the new positions correctly, since not all of them are currently exposed to the public API.

To make the code more flexible for future additions, I had to move the logic that sets the word spacing for justified text from .multi_cell() to ._render_styled_cell_text(). While doing so, I managed to optimize the PDF file size a bit, by only emitting "Tw" commands when actually necessary, and shortening the explicit word positioning values for unicode fonts to three decimals (analog to "Tw").
I also noticed that ._perform_page_break() caused unnecessary "Tw" entries in the PDF. It makes sense to reset "Tw" to the default 0 at the end of each page so that each page starts with a clean slate (and other software can extract individual pages without causing trouble). But recreating a non-0 value at the beginning of the new page is pointless, since typically each line will use a different value anyway. I removed that part, resulting in cleaner output.

The following PDFs needed to be changed to pass existing tests:

  • text/test_multi_cell_justified_with_unicode_font.pdf - Eliminated unnecessary "Tw" entries and shortened word positioning values
  • text/multi_cell_markdown_with_ttf_fonts.pdf - Shortened word positioning values
  • html/html_headings_line_height.pdf - Visible change! The new algorithm managed to fit one more word onto a line in the "P" paragraph
  • outline/2_pages_outline.pdf - Fewer unnecessary "Tw" entries
  • outline/html_toc.pdf - Minimal change in invisible whitespace (write_html() produces a lot of spurious whitespace, which should probably get fixed...).

*Values of csv files are converted by position, instead of content
* Updated tests to check for regression
* Updated documentation and tests to include multiline text.
restrict decimal seperator replacement to float fields
@gmischler gmischler requested a review from Lucas-C as a code owner March 3, 2022 16:10
@gmischler
Copy link
Collaborator Author

=========================== short test summary info ============================
FAILED test/end_to_end_legacy/charmap/test_charmap.py::test_first_999_chars[DejaVuSans.ttf]
FAILED test/end_to_end_legacy/charmap/test_charmap.py::test_first_999_chars[DroidSansFallback.ttf]
FAILED test/end_to_end_legacy/charmap/test_charmap.py::test_first_999_chars[Roboto-Regular.ttf]
FAILED test/end_to_end_legacy/charmap/test_charmap.py::test_first_999_chars[cmss12.ttf]
================= 4 failed, 846 passed, 11 warnings in 45.36s ==================

I had wondered why those tests failed on my system, and assumed it was because I have different versions of these fonts installed.
Now they fail here as well.
What's going on there?

@gmischler
Copy link
Collaborator Author

gmischler commented Mar 3, 2022

What's going on there?

Heh, now I had a closer look...

Those tests use .write() to list all characters in the respective fonts, including \u00ad, aka. soft-hyphen.

...which now gets eaten and never appears in the file. And to make things worse, eliminating that character causes all following characters to be indexed differently, which means the files are not different by just one character, but actually in most of the lines.

What is the best solution here?
a) Do we simply replace the files? Apparently PDF files show a soft-hyphen in the data as an actual hyphen, so there is also a visible difference.
b) Do we need a print_sh parameter to .write() and .multi_cell(), which just treats soft hyphens as printable characters (default False)? I guess that would be the cleaner solution.

Edit: I went with the second option. It was easy enough to implement.

@codecov
Copy link

codecov bot commented Mar 3, 2022

Codecov Report

Merging #346 (3a72364) into master (ef745ca) will increase coverage by 0.26%.
The diff coverage is 95.26%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #346      +/-   ##
==========================================
+ Coverage   90.58%   90.85%   +0.26%     
==========================================
  Files          20       20              
  Lines        5885     5936      +51     
  Branches     1182     1199      +17     
==========================================
+ Hits         5331     5393      +62     
+ Misses        331      320      -11     
  Partials      223      223              
Impacted Files Coverage Δ
fpdf/__init__.py 100.00% <ø> (ø)
fpdf/fpdf.py 86.60% <93.16%> (+0.72%) ⬆️
fpdf/line_break.py 99.25% <100.00%> (+0.23%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ef745ca...3a72364. Read the comment docs.

@Lucas-C
Copy link
Member

Lucas-C commented Mar 4, 2022

I'm reviewieng the code right now.
The change in test/html/html_headings_line_height.pdf seems a bit suspicious to me...
The text seemed better justified before. Now the "P" paragraph seems to bleed a little on the right.
Do you agree?

fpdf/__init__.py Outdated
@@ -39,6 +41,8 @@
"__license__",
# Classes
"FPDF",
"X",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really short for a class name...
It's not good if you read from fpdf import X, Y and you really don't know what are those X & Y objects!
I suggest to rename those enums XAlign & YAlign.
What do you think @gmischler?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. They're not really alignments, though...
XPos & YPos ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'm fine with that!

fpdf/fpdf.py Outdated
)

def _render_styled_cell_text(
self,
text_line,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter class (TextLine) could be added as annotation here, to improve readability

Copy link
Collaborator Author

@gmischler gmischler Mar 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "annotation"?
It is mentioned in the docstring.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant type hints, like this:

    def _render_styled_cell_text(
        self,
        text_line : TextLine,

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I've never done that before...
I'll figure it out!

@@ -75,6 +76,8 @@ def __init__(self):
# SpaceHint is used fo this purpose.
# 3 - position of last inserted soft-hyphen
# HyphenHint is used fo this purpose.
# If print_sh=True, soft-hyphen is treated as
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this big comment be converted to a docstring,
so that it gets rendered in the API docs?
https://pyfpdf.github.io/fpdf2/fpdf/

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this class is part of the public API, and most of the comment is about its internals.
But the part I added about the print_sh parameter should indeed go into a docstring.

# HYPHEN is inserted instead of SOFT_HYPHEN
character = HYPHEN
return self.size_by_style(character, style)

def get_line_of_given_width(self, maximum_width):
# pylint: disable=too-many-return-statements
def get_line_of_given_width(self, maximum_width, no_wordsplit=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parameters with "no" in their names should be avoided, regarding code readability...
Could we replace this by wordsplit=True maybe?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I named it this way because I like it when the default for an unusual condition is False... 😉
But yeah, wordsplit=True is probably easier on the eyes.

fpdf/fpdf.py Outdated
# Font styles preloading must be performed before any call to FPDF.get_string_width:
txt = self.normalize_text(txt)
styled_txt_frags = self._preload_font_styles(txt, markdown)
return self._render_styled_cell_text(
TextLine(styled_txt_frags, 0.0, 0, False),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text_width, number_of_spaces_between_words & justify parameters could be passed by name here, to improve readability.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do.

fpdf/fpdf.py Outdated
@@ -2446,6 +2530,8 @@ def multi_cell(
max_line_height (int): optional maximum height of each sub-cell generated
markdown (bool): enable minimal markdown-like markup to render part
of text as bold / italics / underlined. Default to False.
print_sh (bool): Treat a soft-hyphen (\\u00ad) as a normal printable
character, instead of a line breaking opportunity. Default value: False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a detail: could you align the start of this costring line to match the "of text as bold" line above please?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vim getting confused about something... will do.

test/text/test_render_styled.py Show resolved Hide resolved
test/text/test_write.py Show resolved Hide resolved
fpdf/fpdf.py Outdated
Possible values are: `L` or empty string: left align (default value) ;
`C`: center ; `R`: right align
newpos_x (Enum X): New current position in x after the call.
X.LEFT - left end of the cell
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't those details about all possible values be moved into docstrings in the enums definitions?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the Enums are meant for (eventual) public consumption...
Yeah, sounds like a good idea.

fpdf/fpdf.py Outdated
# adjustment before each space
if self.ws and self.unifontsubset:

word_spacing = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the relation of this variable with self.ws?
Could a short comment be added here please, explaning things a little?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do.

@gmischler
Copy link
Collaborator Author

gmischler commented Mar 4, 2022

Do you agree?

Hard to tell by eye. I was hoping that the new algorithm is just particularly efficient... 😉

On closer inspection, it does seem that the right margin is about one mm (~= c_margin) narrower than the left there. If I understand the workings of write_html() correctly, the the whole paragraph gets fed to .write() in one piece. In that case, and starting at ' l_margin' , it should behave exactly the same as .multi_cell(). I'll have to check if that produces the same effect with this font size, or what the difference might be.

Edit:
Ok, found the problem. I need to subtract c_margin from the printable width at least one more time than I actually did..
This is actually a quite confusing topic, because c_margin gets handled in many different places throughout the pipeline. I'll see if I can simplify this a bit. The user facing methods really shouldn't have to deal with it.

fpdf/fpdf.py Show resolved Hide resolved
Copy link
Member

@Lucas-C Lucas-C left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!
Great work there @gmischler, thank you.
Ping me when you'd like this PR to be merged 😊

@Lucas-C
Copy link
Member

Lucas-C commented Mar 5, 2022

Also, I have just merged #347 so you will have to rebase

@gmischler
Copy link
Collaborator Author

Also, I have just merged #347 so you will have to rebase

I've noticed... 😉

@gmischler
Copy link
Collaborator Author

gmischler commented Mar 5, 2022

Ok, I think this is ready to be merged now.

I changed newpos_[xy] to new[xy] for convenience, which should be equally understandable.
Once I had figured out annotations (I hope...), I also added them to other methods that I had already touched anyway.
I also fixed all the many instances where the docstrings incorrectly indicated argument types as "int" (often probably used as a placeholder for "number").

And after black had annoyed me about it all that time, I finally gave in and let it fix the (entirely unrelated) spacing on two lines of drawing.py.

There's an error with black in the 3.10 tests, but unfortunately it doesn't tell us what it thinks is wrong. Black doesn't complain here locally, so what's this about?

@Lucas-C
Copy link
Member

Lucas-C commented Mar 6, 2022

I made some tests with black and Python 3.10 on the code from your write_refactor branch:

# python --version
Python 3.10.2

# pip install --upgrade black
...
# black --version
black, 22.1.0 (compiled: yes)

# black fpdf/drawing.py
reformatted fpdf/drawing.py

All done! ✨ 🍰 ✨
1 file reformatted.

# git diff
diff --git a/fpdf/drawing.py b/fpdf/drawing.py
index b3776be..738df9f 100644
--- a/fpdf/drawing.py
+++ b/fpdf/drawing.py
@@ -512,7 +512,7 @@ class Point(NamedTuple):
             The scalar result of the distance computation.
         """

-        return (self.x ** 2 + self.y ** 2) ** 0.5
+        return (self.x**2 + self.y**2) ** 0.5

     @force_document
     def __add__(self, other):
@@ -2660,7 +2660,7 @@ class Arc(NamedTuple):
         lam_da = (prime.x / radii.x) ** 2 + (prime.y / radii.y) ** 2

         if lam_da > 1:
-            radii = Point(x=(lam_da ** 0.5) * radii.x, y=(lam_da ** 0.5) * radii.y)
+            radii = Point(x=(lam_da**0.5) * radii.x, y=(lam_da**0.5) * radii.y)

         sign = (self.large != self.sweep) - (self.large == self.sweep)
         rxry2 = (radii.x * radii.y) ** 2

@Lucas-C
Copy link
Member

Lucas-C commented Mar 6, 2022

I think you should leave fpdf/drawing.py as it was and/or maybe try to upgrade your local version of black?

@gmischler
Copy link
Collaborator Author

gmischler commented Mar 6, 2022

I think you should leave fpdf/drawing.py as it was and/or maybe try to upgrade your local version of black?

Weird.
black 22.1.0 just reverted the changes that black 21.9b0 had insisted on...
Wasn't the purpose of such a tool to bring more consistency?

Oh, and "ping", @Lucas-C 🔔

@Lucas-C Lucas-C merged commit a9ccc9f into py-pdf:master Mar 7, 2022
@gmischler gmischler deleted the write_refactor branch March 7, 2022 08:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make FPDF.write() use _render_styled_cell_text()
2 participants