Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

i18n: upgrade to latest icu formatter #13834

Merged
merged 15 commits into from
Oct 5, 2023
Merged

i18n: upgrade to latest icu formatter #13834

merged 15 commits into from
Oct 5, 2023

Conversation

connorjclark
Copy link
Collaborator

@connorjclark connorjclark commented Apr 7, 2022

(Oct 2023)

Updated to intl-messageformat to 10.5.3 and @formatjs/icu-messageformat-parser to 2.6.2.


(original message)

This is a big jump:

  • intl-messageformat: 4.4.0 -> 9.12.0 changelog
  • @formatjs/icu-messageformat-parser: (1.8.1 of old package) -> 2.0.19 changelog (the entire changelog is our diff b/c new package)

The new parser/ICU spec has made some of our strings invalid because:

  • new rich text support means literals with HTML tags is now syntax that needs to be escaped. see this
  • also apostrophes need to be escaped now

To resolve this I replaced some apostrophes with backticks (they are technical keywords so they shouldn't be translated anyway), and added a regex to preprocess our message strings before parsing them (so we don't have to go escape-crazy). This solution should work fine up until the day we actually want to use this new syntax.

The parser emits a more easily consumable AST, ex: nested nodes in select/plural nodes are nodes themselves (I think before they were just ranges specifying start/end)

Some stuff we might like to try out:

build/build-bundle.js Outdated Show resolved Hide resolved
@@ -4516,13 +4555,23 @@ intl-messageformat-parser@^1.8.1:
resolved "https://registry.yarnpkg.com/intl-messageformat-parser/-/intl-messageformat-parser-1.8.1.tgz#0eb14c5618333be4c95c409457b66c8c33ddcc01"
integrity sha512-IMSCKVf0USrM/959vj3xac7s8f87sc+80Y/ipBzdKy4ifBv5Gsj2tZ41EAaURVg01QU71fYr77uA8Meh6kELbg==

intl-messageformat@^4.1.2, intl-messageformat@^4.4.0:
intl-messageformat@^4.1.2:
version "4.4.0"
Copy link
Collaborator Author

@connorjclark connorjclark Apr 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pub ads has this old version as a dependency ... I think we could move it to their devDeps?

} catch (err) {
if (err.name !== 'SyntaxError') throw err;
// Improve the intl-messageformat-parser syntax error output.
/** @type {Array<{text: string}>} */
const expected = err.expected;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't exist any more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the error messages definitely seem less useful (at least looking at the test changes). Is there any kind of equivalent to replace this?

Copy link
Collaborator Author

@connorjclark connorjclark Apr 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a "location" field with begin/end markers, could use that to point to the specific characters that are problematic.

Can't promise fancy ascii pointers like you'd see elsewhere though :P

Copy link
Member

@brendankenny brendankenny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I've looked at updating intl-messageformat in the past, I've ended up holding off because

  • it's bigger
  • our strings would need updating
  • nothing's broken

it seems like these things are all still true, so it seems like it's still worth putting off?

The ignoreTag option looks like it would let us dodge most of the escaping pain and just deal with escaping apostrophes, however:

  • from the built files, it looks like the new version will add 18KiB to viewer and 19KiB to flow.js. If/when we want message formatting in the regular report, it'll also be included there.
  • the advantages of updating don't seem that important to us. Memoized formatters are actually supported in the version we use, and while we won't have skeleton support, our current format style does support mapping style names to Intl.NumberFormat options (e.g. extendedPercent uses style: 'percent'), so we can get the same thing, just without the auto-parsing from skeletons to formatting options. We can expand our formatting types (so if we wanted a narrowTenthsOfAKibibit we could do it :)

Finally, Intl.MessageFormat is now stage 1 in TC39 and should eventually be included without a library. It's using the (in progress) Unicode MessageFormat 2.0 syntax, though, so there's probably not a gradual migration path there.

} catch (err) {
if (err.name !== 'SyntaxError') throw err;
// Improve the intl-messageformat-parser syntax error output.
/** @type {Array<{text: string}>} */
const expected = err.expected;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the error messages definitely seem less useful (at least looking at the test changes). Is there any kind of equivalent to replace this?


for (const option of Object.values(element.options)) {
validate(option.value);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For lhl validation purposes, It's possible testing that the actual substrings were valid lhlMessages may have been intentional? option.value.elements was available in the old version of intl-messageformat-parser as well. I think this was checking that it would be TC compatible, not just intl-messageformat compatible

Copy link
Collaborator Author

@connorjclark connorjclark Apr 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it the same? I think that part of the grammar is pretty recursive (as in, if this lib can parse it into a value, and that field is equivalent in structure to the root field, then the raw string should be valid as a string we'd send to TC?). I'll try to inspect the original PR and see if it was brought up.

FWIW, you can see the recursive test case failing if you comment this line out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily the same. For instance what's checked right above here: the ICU docs recommend that there be no content outside of plurals, but TC requires that there not be. Unfortunately everyone uses their own subset of the ICU/message format syntax.

The original PR doesn't list it's reasoning :( However, the fact that I wrote the "Each option value must also be a valid lhlMessage" comment above it to me means either there was a known mismatch, or the invariant is just what that comment describes and we should just check it directly rather than hoping there are no corner cases where they differ. Since this is in collect-strings, not in a lighthouse run or the report, seems reasonable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The option value is really an array of messages. if one of the messages is a plural, then the validate function checks that it is the only entry (array length is 1 === no content outside of it). It seems like we're checking that correctly, and there's no failing test to say we aren't.

I do preferred this structured approach to reparsing the options... but we could do both just in case?

@connorjclark
Copy link
Collaborator Author

the advantages of updating don't seem that important to us

What of the desire to drop the report/renderer/i18n.js formatter file and localize our units in a potentially more consistent manner? Granted we have an unresolved discussion regarding this from #13830, and should probably attempt an analysis of our strings to determine if/how often implicit display widths are respected (ie do we ever get a language's equivalent of milliseconds when ms would have been preferred).

I'm having a difficult time comparing the latest version capabilities with the newest (and the CHANGELOG was kinda of terse), are there versioned docs? Are you perhaps looking at docs packaged in node_modules?

I don't find anything to disagree with out about on the individual points you've mentioned, but to me I feel that using such an old version of a package core to our i18n story is something we should resolve.

from the built files, it looks like the new version will add 18KiB to viewer and 19KiB to flow.js

gzip or disk?

Finally, Intl.MessageFormat is now stage 1 in TC39 and should eventually be included without a library. It's using the (in progress) Unicode MessageFormat 2.0 syntax, though, so there's probably not a gradual migration path there.

Does / will our translation pipeline support this syntax? Probably too early to tell, but yeah that'd be great if we didn't need any library, let's tag that LH 15.0 :P

@connorjclark
Copy link
Collaborator Author

What of the desire to drop the report/renderer/i18n.js formatter ...

oh i see, we can do this today by configuring the formatters. skeleton syntax is just shorthand for all that.

Base automatically changed from i18n-formatter-units to master April 19, 2022 20:58
@jaylinski
Copy link

Would be nice to get this merged in order to get rid of this npm deprecation warning:
npm WARN deprecated intl-messageformat-parser@1.8.1: We've written a new parser that's 6x faster and is backwards compatible. Please use @formatjs/icu-messageformat-parser

@hamirmahal
Copy link
Contributor

Would be nice to get this merged in order to get rid of this npm deprecation warning: npm WARN deprecated intl-messageformat-parser@1.8.1: We've written a new parser that's 6x faster and is backwards compatible. Please use @formatjs/icu-messageformat-parser

I also ran into this.

$  npm install
npm WARN deprecated intl-messageformat-parser@1.8.1: We've written a new parser that's 6x faster and is backwards compatible. Please use @formatjs/icu-messageformat-parser

added 631 packages, and audited 632 packages in 10s

133 packages are looking for funding
  run `npm fund` for details

found 0 vulnerabilities

@hamirmahal
Copy link
Contributor

I just wanted to bump this because I ran into a related issue again.

@hamirmahal
Copy link
Contributor

image

@vercel
Copy link

vercel bot commented Oct 3, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
lighthouse ✅ Ready (Inspect) Visit Preview 💬 Add feedback Oct 3, 2023 9:48pm

@connorjclark connorjclark marked this pull request as ready for review October 3, 2023 21:48
@connorjclark connorjclark requested a review from a team as a code owner October 3, 2023 21:48
@connorjclark connorjclark requested review from brendankenny and removed request for a team October 3, 2023 21:48
shared/localization/format.js Show resolved Hide resolved
shared/localization/format.js Outdated Show resolved Hide resolved
shared/localization/format.js Outdated Show resolved Hide resolved
core/lib/i18n/i18n.js Outdated Show resolved Hide resolved
shared/localization/format.js Show resolved Hide resolved
@connorjclark connorjclark merged commit 3c57e70 into main Oct 5, 2023
29 checks passed
@connorjclark connorjclark deleted the i18n-update-icu-pkg branch October 5, 2023 19:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants