Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transliterator IDs with unknown BCP47 IDs #3891

Closed
skius opened this issue Aug 18, 2023 · 7 comments
Closed

Transliterator IDs with unknown BCP47 IDs #3891

skius opened this issue Aug 18, 2023 · 7 comments
Assignees
Labels
C-unicode Component: Props, sets, tries

Comments

@skius
Copy link
Member

skius commented Aug 18, 2023

Currently, transliterator IDs are represented in DataLocales by putting their -t- ID into the aux key of und (#3765). This works for transliterators that exist at datagen-time and are not overridden by custom code-based impls at runtime.

Issues arise in the following:

A user writes a rule-based transliterator A-C, that recursively calls B-C. B-C does not exist at datagen, because the user wants to provide their ML implementation at runtime.
When compiling A-C, an ID for B-C needs to be stored. Transliterators available at datagen time (e.g., shipped ones in ICU4X1) use the mentioned und+...-t- ID... representation with the aux key. B-C does not have this metadata available, because the user does not put this transliterator through the datagen pipeline (B-C is ML), so there is no metadata needed.
How do we refer to B-C in the datastruct for A-C? Keeping in mind that at runtime, the user will need to be able to hook into dataloading using this ID with the lookup function they pass to Transliterator::new.

Assuming we have some non-BCP47 representation for these non-existent-at-datagen-time transliterators, maybe und+x-LegacyID or similar, we run into issues when users want to override an existing (e.g., shipped) ICU4X datagen transliterator. Because there is no strong signal during compilation that a certain transliterator will be overriden, the compiler compiles the BCP47 ID, so at runtime, the lookup function also needs to support the BCP47 ID.

One fix for this is to take "this transform does not exist at datagen time" as a signal for "I want to override it during runtime", i.e., X-Z and Y-Z (which is used by X-Z) are shipped transliterators, the user wants to custom-impl Y-Z, so they have to delete the source files for Y-Z. Now during runtime the lookup function only needs to support one kind of ID, one that's easily derivable from the Legacy ID (e.g., und+x-LegacyID). A downside to this approach is that it's (potentially?) cumbersome for the user, and quite confusing to debug if they forget to delete the shipped Y-Z source file.

(BCP47 <> Legacy ID conversion is not obvious which is why this is an issue in the first place)

Discuss with:

Discuss:

  1. How should B-C be compiled? (Transliterators for which there is no BCP47 ID available)
  2. What should the interface of the lookup function be? FnMut(source: &str, target: &str, variant: Option<&str>) -> impl Translit? (aka LegacyID)
  3. How to solve the issue with Y-Z? (Overriding shipped transliterators)

Footnotes

  1. visible ones, internal ones have a different representation

@skius skius added discuss Discuss at a future ICU4X-SC meeting C-unicode Component: Props, sets, tries labels Aug 18, 2023
@sffc
Copy link
Member

sffc commented Aug 21, 2023

  1. How should B-C be compiled? (Transliterators for which there is no BCP47 ID available)

Do you have an example? I thought the idea with BCP-47 is that it is generalizable enough to represent anything you throw at it.

  1. What should the interface of the lookup function be? FnMut(source: &str, target: &str, variant: Option<&str>) -> impl Translit? (aka LegacyID)

Why not just FnMut(id: &Locale) -> Option<impl Translit> ?

  1. How to solve the issue with Y-Z? (Overriding shipped transliterators)

We should check the lookup function first, and then load from the data if we didn't find anything. I don't have a problem when a power user injects a runtime transliterator that is redundant with one from the data file. They are a power user so it's fine to expect them to know how to clean up their datagen to not generate redundant transliterators. Since everything is being handled with aux keys, it should be easy to filter these using the same machinery for filtering other aux keys.

@skius
Copy link
Member Author

skius commented Aug 21, 2023

Do you have an example? I thought the idea with BCP-47 is that it is generalizable enough to represent anything you throw at it.

Right, I should have gone into more detail there:

The current approach does not do any programmatic conversion from legacy IDs (that are used in rule sources) to BCP47 IDs (that we use in the data struct). The way we can still use the BCP47 IDs is by relying on CLDR metadata and store a mapping using that metadata. BCP47 metadata is stored in the alias and backwardAlias fields: https://github.com/unicode-org/cldr/blob/main/common/transforms/de-ASCII.xml. Sample transliterators that do not have BCP47 metadata available are:

We do not do any programmatic conversion because the conversion seems a bit underspecified, and legacy IDs need data for full length + 4-length script names (Latn vs Latin), since legacy IDs can support both. If we have a programmatic (and user-understandable) conversion from legacy IDs, this issue is a non-issue. I suppose this might be the best way to solve this, but it might be associated with some UTS#35 work.

@sffc
Copy link
Member

sffc commented Aug 21, 2023

OK. Well I think the main thing we should do is to make any public APIs, including the plugin API, use BCP-47, because otherwise these internal CLDR strings get exposed to clients, which we should try to avoid.

For any transforms that don't have BCP-47 equivalents, can you just make up what they should be? There probably aren't more than a few dozen at most.

So, like, at runtime, there should not be any legacy IDs found anywhere. I still am not clear on why you think we should keep them around.

This is how we deal with IANA time zone names. We use BCP-47 time zone IDs everywhere at runtime, and we export a separate class that converts from IANA ot BCP-47 (#3499).

@skius
Copy link
Member Author

skius commented Aug 22, 2023

OK. Well I think the main thing we should do is to make any public APIs, including the plugin API, use BCP-47, because otherwise these internal CLDR strings get exposed to clients, which we should try to avoid.

My suggestion in the OP is to have some easily convertible representation for legacy IDs (e.g., x-B-C) that would allow public construction from just "B-C" or "B" and "C", which is not an internal CLDR string because that's exactly the notation appearing in transform rule source syntax.

For any transforms that don't have BCP-47 equivalents, can you just make up what they should be? There probably aren't more than a few dozen at most.
So, like, at runtime, there should not be any legacy IDs found anywhere. I still am not clear on why you think we should keep them around.

This works for CLDR ones, like the linked Bengali-InterIndic, but not for cases where the user writes a rule based transliterator that uses a custom transliterator like :: B-C1, which does not exist at all during datagen. Currently, the only way we can get a BCP47 ID from B-C in this scenario is to programmatically convert the arbitrary legacy ID B-C to BCP47.

However, we could require the user to provide (at datagen time) BCP47 IDs for all custom transliterators they intend to use, which would also fix the problem.

A quick note why programmatic legacy ID => BCP47 ID conversion is nontrivial:

Legacy ID BCP47 according to Unicode Comment
Greek-Latin/UNGEGN und-Latn-t-und-grek-m0-ungegn und language
Greek-Latin/BGN el-Latn-t-el-m0-bgn el language on source and target
Russian-Latin/BGN ru-Latn-t-ru-m0-bgn ru language for both source (understandable, it's Russian), but also target (why, it's Latin?)

If there's a better definition of the conversion that the one at the very top here: https://unicode.org/reports/tr35/tr35-general.html#Transforms, I haven't found it. I'm not saying it's impossible to programmatically convert, but I'm not sure if we want to invest the upstream work to get clarity on the conversion process right now?

Footnotes

  1. assuming UTS#35 is not outdated, it would be more like :: x-B-x-C or :: x_B-x_C, because the literal B and C are not valid locales.

@sffc
Copy link
Member

sffc commented Aug 22, 2023

Discussion notes:

  • @eggrobin - Since ICU only supports the legacy APIs we need to make sure there is a path for a compatible transition between ICU and ICU4X
  • @skius - Users can add custom transliterator data files at datagen time, and if they do, we need to make sure that we know how to convert those IDs
  • We may need a supplemental data file to include any ID mappings that don't come from XML
  • We will discuss these topics again with ICU-TC and CLDR-TC
  • @sffc - When reading the recursive IDs from the rule file can we try to parse them as both BCP-47 and legacy IDs?
  • @skius - Maybe but that is not clearly spec compliant
  • @robertbastian - You can create an empty rule file with a metadata file that contains the custom legacy ID to BCP-47 ID mapping
  • @sffc - If datagen gets messy, that's easy to clean up later. I'd rather have clean runtime and messy datagen.

Resolution:

  • CLDR transforms that don't yet have a modern ID should get one that first lives in ICU4X and should eventually be upstreamed
  • If the user wants a runtime-only transliterator, they can create a dummy transliterator at datagen time that has metadata for the ID mapping but no source

LGTM: @sffc @skius @robertbastian @eggrobin

@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Aug 22, 2023
@skius skius self-assigned this Aug 22, 2023
@skius
Copy link
Member Author

skius commented Aug 29, 2023

The code part is done, users are only exposed to BCP-47-T Locale's. Documenting the approach users should follow if the edge cases documented here apply to them as part of #3776

@skius skius closed this as completed Aug 29, 2023
@eggrobin
Copy link
Member

Since ICU only supports the legacy APIs we need to make sure there is a path for a compatible transition between ICU and ICU4X

On that subject, the path is to make ICU support the new IDs, and then to migrate CLDR data. See ICU-22474.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-unicode Component: Props, sets, tries
Projects
None yet
Development

No branches or pull requests

3 participants