Transliterator IDs with unknown BCP47 IDs #3891

skius · 2023-08-18T14:18:25Z

Currently, transliterator IDs are represented in DataLocales by putting their -t- ID into the aux key of und (#3765). This works for transliterators that exist at datagen-time and are not overridden by custom code-based impls at runtime.

Issues arise in the following:

A user writes a rule-based transliterator A-C, that recursively calls B-C. B-C does not exist at datagen, because the user wants to provide their ML implementation at runtime.
When compiling A-C, an ID for B-C needs to be stored. Transliterators available at datagen time (e.g., shipped ones in ICU4X¹) use the mentioned und+...-t- ID... representation with the aux key. B-C does not have this metadata available, because the user does not put this transliterator through the datagen pipeline (B-C is ML), so there is no metadata needed.
How do we refer to B-C in the datastruct for A-C? Keeping in mind that at runtime, the user will need to be able to hook into dataloading using this ID with the lookup function they pass to Transliterator::new.

Assuming we have some non-BCP47 representation for these non-existent-at-datagen-time transliterators, maybe und+x-LegacyID or similar, we run into issues when users want to override an existing (e.g., shipped) ICU4X datagen transliterator. Because there is no strong signal during compilation that a certain transliterator will be overriden, the compiler compiles the BCP47 ID, so at runtime, the lookup function also needs to support the BCP47 ID.

One fix for this is to take "this transform does not exist at datagen time" as a signal for "I want to override it during runtime", i.e., X-Z and Y-Z (which is used by X-Z) are shipped transliterators, the user wants to custom-impl Y-Z, so they have to delete the source files for Y-Z. Now during runtime the lookup function only needs to support one kind of ID, one that's easily derivable from the Legacy ID (e.g., und+x-LegacyID). A downside to this approach is that it's (potentially?) cumbersome for the user, and quite confusing to debug if they forget to delete the shipped Y-Z source file.

(BCP47 <> Legacy ID conversion is not obvious which is why this is an issue in the first place)

Discuss with:

Discuss:

How should B-C be compiled? (Transliterators for which there is no BCP47 ID available)
What should the interface of the lookup function be? FnMut(source: &str, target: &str, variant: Option<&str>) -> impl Translit? (aka LegacyID)
How to solve the issue with Y-Z? (Overriding shipped transliterators)

visible ones, internal ones have a different representation ↩

The text was updated successfully, but these errors were encountered:

sffc · 2023-08-21T19:15:53Z

How should B-C be compiled? (Transliterators for which there is no BCP47 ID available)

Do you have an example? I thought the idea with BCP-47 is that it is generalizable enough to represent anything you throw at it.

What should the interface of the lookup function be? FnMut(source: &str, target: &str, variant: Option<&str>) -> impl Translit? (aka LegacyID)

Why not just FnMut(id: &Locale) -> Option<impl Translit> ?

How to solve the issue with Y-Z? (Overriding shipped transliterators)

We should check the lookup function first, and then load from the data if we didn't find anything. I don't have a problem when a power user injects a runtime transliterator that is redundant with one from the data file. They are a power user so it's fine to expect them to know how to clean up their datagen to not generate redundant transliterators. Since everything is being handled with aux keys, it should be easy to filter these using the same machinery for filtering other aux keys.

skius · 2023-08-21T20:43:04Z

Do you have an example? I thought the idea with BCP-47 is that it is generalizable enough to represent anything you throw at it.

Right, I should have gone into more detail there:

The current approach does not do any programmatic conversion from legacy IDs (that are used in rule sources) to BCP47 IDs (that we use in the data struct). The way we can still use the BCP47 IDs is by relying on CLDR metadata and store a mapping using that metadata. BCP47 metadata is stored in the alias and backwardAlias fields: https://github.com/unicode-org/cldr/blob/main/common/transforms/de-ASCII.xml. Sample transliterators that do not have BCP47 metadata available are:

visibility: internal ones, like https://github.com/unicode-org/cldr/blob/main/common/transforms/Bengali-InterIndic.xml
Completely custom ones that the user knows they will provide at runtime, like B-C.

We do not do any programmatic conversion because the conversion seems a bit underspecified, and legacy IDs need data for full length + 4-length script names (Latn vs Latin), since legacy IDs can support both. If we have a programmatic (and user-understandable) conversion from legacy IDs, this issue is a non-issue. I suppose this might be the best way to solve this, but it might be associated with some UTS#35 work.

sffc · 2023-08-21T23:43:23Z

OK. Well I think the main thing we should do is to make any public APIs, including the plugin API, use BCP-47, because otherwise these internal CLDR strings get exposed to clients, which we should try to avoid.

For any transforms that don't have BCP-47 equivalents, can you just make up what they should be? There probably aren't more than a few dozen at most.

So, like, at runtime, there should not be any legacy IDs found anywhere. I still am not clear on why you think we should keep them around.

This is how we deal with IANA time zone names. We use BCP-47 time zone IDs everywhere at runtime, and we export a separate class that converts from IANA ot BCP-47 (#3499).

skius · 2023-08-22T00:25:21Z

OK. Well I think the main thing we should do is to make any public APIs, including the plugin API, use BCP-47, because otherwise these internal CLDR strings get exposed to clients, which we should try to avoid.

My suggestion in the OP is to have some easily convertible representation for legacy IDs (e.g., x-B-C) that would allow public construction from just "B-C" or "B" and "C", which is not an internal CLDR string because that's exactly the notation appearing in transform rule source syntax.

For any transforms that don't have BCP-47 equivalents, can you just make up what they should be? There probably aren't more than a few dozen at most.
So, like, at runtime, there should not be any legacy IDs found anywhere. I still am not clear on why you think we should keep them around.

This works for CLDR ones, like the linked Bengali-InterIndic, but not for cases where the user writes a rule based transliterator that uses a custom transliterator like :: B-C¹, which does not exist at all during datagen. Currently, the only way we can get a BCP47 ID from B-C in this scenario is to programmatically convert the arbitrary legacy ID B-C to BCP47.

However, we could require the user to provide (at datagen time) BCP47 IDs for all custom transliterators they intend to use, which would also fix the problem.

A quick note why programmatic legacy ID => BCP47 ID conversion is nontrivial:

Legacy ID	BCP47 according to Unicode	Comment
`Greek-Latin/UNGEGN`	`und-Latn-t-und-grek-m0-ungegn`	`und` language
`Greek-Latin/BGN`	`el-Latn-t-el-m0-bgn`	`el` language on source and target
`Russian-Latin/BGN`	`ru-Latn-t-ru-m0-bgn`	`ru` language for both source (understandable, it's `Russian`), but also target (why, it's `Latin`?)

If there's a better definition of the conversion that the one at the very top here: https://unicode.org/reports/tr35/tr35-general.html#Transforms, I haven't found it. I'm not saying it's impossible to programmatically convert, but I'm not sure if we want to invest the upstream work to get clarity on the conversion process right now?

assuming UTS#35 is not outdated, it would be more like :: x-B-x-C or :: x_B-x_C, because the literal B and C are not valid locales. ↩

sffc · 2023-08-22T15:44:16Z

Discussion notes:

@eggrobin - Since ICU only supports the legacy APIs we need to make sure there is a path for a compatible transition between ICU and ICU4X
@skius - Users can add custom transliterator data files at datagen time, and if they do, we need to make sure that we know how to convert those IDs
We may need a supplemental data file to include any ID mappings that don't come from XML
We will discuss these topics again with ICU-TC and CLDR-TC
@sffc - When reading the recursive IDs from the rule file can we try to parse them as both BCP-47 and legacy IDs?
@skius - Maybe but that is not clearly spec compliant
@robertbastian - You can create an empty rule file with a metadata file that contains the custom legacy ID to BCP-47 ID mapping
@sffc - If datagen gets messy, that's easy to clean up later. I'd rather have clean runtime and messy datagen.

Resolution:

CLDR transforms that don't yet have a modern ID should get one that first lives in ICU4X and should eventually be upstreamed
If the user wants a runtime-only transliterator, they can create a dummy transliterator at datagen time that has metadata for the ID mapping but no source

LGTM: @sffc @skius @robertbastian @eggrobin

skius · 2023-08-29T17:26:11Z

The code part is done, users are only exposed to BCP-47-T Locale's. Documenting the approach users should follow if the edge cases documented here apply to them as part of #3776

eggrobin · 2023-08-29T17:31:56Z

Since ICU only supports the legacy APIs we need to make sure there is a path for a compatible transition between ICU and ICU4X

On that subject, the path is to make ICU support the new IDs, and then to migrate CLDR data. See ICU-22474.

skius added discuss Discuss at a future ICU4X-SC meeting C-unicode Component: Props, sets, tries labels Aug 18, 2023

skius mentioned this issue Aug 18, 2023

Checklist for Transliteration #3736

Closed

41 tasks

sffc removed the discuss Discuss at a future ICU4X-SC meeting label Aug 22, 2023

skius self-assigned this Aug 22, 2023

skius mentioned this issue Aug 29, 2023

Document transliteration data struct format #3776

Closed

5 tasks

skius closed this as completed Aug 29, 2023

skius mentioned this issue Sep 1, 2023

Chained Transliterator ID parsing #3991

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Transliterator IDs with unknown BCP47 IDs #3891

Transliterator IDs with unknown BCP47 IDs #3891

skius commented Aug 18, 2023

sffc commented Aug 21, 2023

skius commented Aug 21, 2023

sffc commented Aug 21, 2023

skius commented Aug 22, 2023 •

edited

Loading

sffc commented Aug 22, 2023

skius commented Aug 29, 2023

eggrobin commented Aug 29, 2023

Transliterator IDs with unknown BCP47 IDs #3891

Transliterator IDs with unknown BCP47 IDs #3891

Comments

skius commented Aug 18, 2023

Footnotes

sffc commented Aug 21, 2023

skius commented Aug 21, 2023

sffc commented Aug 21, 2023

skius commented Aug 22, 2023 • edited Loading

A quick note why programmatic legacy ID => BCP47 ID conversion is nontrivial:

Footnotes

sffc commented Aug 22, 2023

skius commented Aug 29, 2023

eggrobin commented Aug 29, 2023

skius commented Aug 22, 2023 •

edited

Loading