Checklist for Transliteration #3736

skius · 2023-07-25T11:55:30Z

Opening this issue to track the outstanding issues and features needed to land experimental transliteration. Items marked (2) can be done after an initial end-to-end version has landed.

Prework
- Modern UnicodeSet support - (UnicodeSet parsing with new spec #3670)
- Figure out ICU transliteration behavior
Data
- [ ] Add data source for transform rules (@robertbastian)
- [ ] After that, adjust the compiled_data config.json.
- [ ] After that, remove workaround in download-repo-sources.rs
~~- [ ] (2) Once we know more about the properties used in CLDR data, update Support (loose) string-to-property-map matching in icu_properties #3559~~
Datagen
- Design the data struct - [WIP] Transliteration data structs #3627
  - Rethink cursor serialization - a cursor without placeholders could be a single hardcoded, reserved code point (e.g., <CURSOR>), and a cursor with an offset (placeholders off either end of the text) could be <CURSOR><offset as char>. We likely always need the <offset as char> data for easier deserialization, so an inline cursor would be <CURSOR>\0 - Compile transliterator cursors #3937
  - Change segments representation - Precompute segments' index fields for transliterators #3943
    ~~- [ ] Polish data struct - Polish transliterator data struct #3850~~
    - [ ] (2) Inlining UnicodeSets derived from properties can lead to duplication of data across transliterators (e.g., two transliterators that both use [:Lowercase:] would have a lot of duplicate data). Maybe think about some constant-sized serialization special case for properties, e.g., just serializing "property: Lowercase", and then loading that property with the property's datamarker, meaning only some ID for Lowercase is duplicated across transliterators, not the actual property data, which lives with the usual property provider.
- Parse source files for rules - Add Parsing for Rule-Based Transliterators #3730
  ~~- [ ] (2) Test parse error messages~~
  ~~- [ ] (2) Remove associated #[allow(unused)]~~
  - (2) Handle (ignore) ICU transform rule pragmas, e.g., use variable range ... - Ignore ICU pragmas when parsing transform rules #3995
- Compile parsed sources into datastructs - Transliterator data struct generation #3824, Add rule-group generation to transliterator compiler #3822
  - Validate rule items are well-formed per direction - Compile-time transliterator validation #3819
  - Decide handling of bidirectional sources - Deduplicate Transliterator VarTables across directions #3646
  - Remove associated TODOs in parse.rs - Transliterator parser cleanup #3827
  - Return list of dependency locales (strings) - Transliterator DatagenProvider #3877
  - Check that PUA range suffices for all specials (including backrefs and anchors) - Transliterator data struct generation #3824
  - Check for backref encoding overflows - Transliterator parser cleanup #3827
  - Add compilation for cursors - Compile transliterator cursors #3937
  - Add compilation for recursive transliterators (SingleId) by passing/creating the (datagen-global) mapping from legacy ID to internal ID - Transliterator DatagenProvider #3877
    ~~- [ ] (2) Errors with source location (adding a usize to parse::* types, whether directly or with a generic "SpanWrapper", is probably the easiest way)~~
    ~~- [ ] (2) Similarly, logging with source location/source text?~~
    ~~- [ ] (2) Decide what (non-critical) validation checks should be performed. Questions include:~~
    ~~- Enforce no special replacers (backreferences, function calls, cursors) in (implicitly ignored) target contexts?~~
    ~~- Enforce empty target contexts for unidirectional rules?~~
    ~~- Enforce no anchors on target-side and no cursors on source-side for unidirectional rules?~~
    ~~- [ ] (2) Decide if validation should be done for both directions even if source file defines only one direction~~
    ~~- [ ] (2) Similarly, decide if non-source-file-defined directions are even allowed (is a < b a valid rule in a forward transliterator?)~~
- Add the right crate features to transliteration and transliterator_parser.
  - transliteration - Transliterator DatagenProvider #3877
  - transliterator_parser
- Add bakeddata support to transliteration - Transliterator DatagenProvider #3877
  ~~- [ ] (2) Unify parse+compile tests (currently difficult to judge where an edge case is tested)~~
- Datagen glue code - Transliterator DatagenProvider #3877
  - Parse provided metadata sources (ID, visibility, ...) and pass to compilation
  - Provide direction (part of metadata) to parsing/compilation, and act accordingly
    - Parse: log warnings (or even error?) when a rule uses an unspecified direction (e.g. b < a with forward metadata)
    - Compile: compile datastructs only for the directions the metadata specifies
      ~~- [ ] (2) Build and use dependency graph for transliterators for datagen~~
  - Skippable in the initial version by generating data for everything
    ~~- [ ] (2) Generate data for transitive dependencies when specifying a certain locale~~
    ~~- [ ] (2) Handle (potentially) special-cased datagen for builtin transliterators (Upper, Title, Lower, ...)~~
    ~~- [ ] (2) Some builtin transliterators might also not require any data, like Any-Remove~~
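
The "(2) Generate data for transitive dependencies" item above could be sketched roughly as follows. This is only an illustration, not the datagen implementation: the function name `transitive_closure`, the map shape (transliterator ID to the IDs it directly depends on), and the IDs in the usage note are all hypothetical.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns `root` plus every transliterator ID reachable from it through `deps`.
/// `deps` maps an ID to the IDs it depends on directly (hypothetical shape).
fn transitive_closure(root: &str, deps: &BTreeMap<String, Vec<String>>) -> BTreeSet<String> {
    let mut seen = BTreeSet::new();
    let mut stack = vec![root.to_string()];
    while let Some(id) = stack.pop() {
        // `insert` returns false for IDs already visited, which also breaks cycles.
        if !seen.insert(id.clone()) {
            continue;
        }
        if let Some(children) = deps.get(&id) {
            stack.extend(children.iter().cloned());
        }
    }
    seen
}
```

With a map where "x" depends on "y" and "y" depends on "z" and "x", calling `transitive_closure("x", &deps)` yields the set of "x", "y", and "z"; the cycle back to "x" is ignored because it was already visited.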
Locales
- Design internal representation of transliterator IDs (source/target) - Internal representation for Transliteration IDs #3765
  - Parsing in DataLocale (add aux: String and update strict_cmp, ...)
  - Hardcode the reverse of special transliterators (see these) - Add hardcoded reverses for hardcoded transliterators #3994
  - Representation of custom transliterators - Transliterator IDs with unknown BCP47 IDs #3891
    ~~- [ ] (2) Parsing for legacy UTS#35 IDs (e.g., und_Source-und_Target)~~
  - Skippable in the initial version by matching on metadata purely during datagen (keep track of a legacy ID => transliterator map during datagen/parsing) - Transliterator DatagenProvider #3877
Runtime
- Implement data struct - Add experimental transliteration component #3775
  - (2) Detailed documentation of the format - Document transliteration data struct format #3776
    ~~- [ ] Open issue detailing discussed API~~
    ~~- [ ] Separate fallback chain for single locales for transliteration~~
    ~~- [ ] Lockstep transliterator fallback mechanism (UTS#35) in Transliterator constructor~~
    ~~- [ ] Answer: How to do fallback on composite special+regular source/target locales? - Transliterator fallback #3950~~
    ~~- [ ] Can users override internal transliterators? - Overriding of internal transliterators #3911~~
    ~~- [ ] Add hardcoded ICU transliterators (e.g., Any-Hex, see special classes that exist in ICU4J) - Implement hardcoded ICU transliterators #3910~~
    ~~- [ ] Handle BCP-47 for them at runtime + datagen time - Invent BCP47 IDs for hardcoded transliterators #3909~~
- Use (copy-pasted, offline preprocessed) CLDR testData for integration tests.
  ~~- [ ] Once we have this, how do we keep this up to date?~~
  ~~- [ ] Steal ICU4C tests~~
Polish
~~- [ ] Add detailed comments to parser~~
~~- [ ] Factor out escape parsing together with unicodeset_parser's escape handling somehow~~
~~- [ ] Data-gen optimizations - Apply transliterator data struct optimizations at datagen-time #3825~~
~~- [ ] DatagenProvider impl only compute transliterator map once instead of per load~~
~~- [ ] Pretty-print intermediate structs back to source syntax (would allow for round-trip testing)~~
~~- [ ] Apply parse error conclusions from Decide expressiveness of UnicodeSet parsing errors #3558~~
- Add a &'static str to PEK::Internal for better bug error messages - Add &'static str with debug information to internal transliterator parse errors #3996
  ~~- [ ] Test all data structs built from our data to be "valid" in the sense that the VarTable layout is applicable to the encoded rules~~
  ~~- [ ] Add benchmarking for transliteration and resolve optimization comments in codebase - Optimize Transliteration runtime (and add better benchmarks) #3957~~
  ~~- [ ] Cleanup - Cleanup transliteration runtime #3958~~
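
For the "Rethink cursor serialization" item under Datagen, a minimal sketch of the <CURSOR><offset as char> idea might look like this. It rests on assumptions: the marker code point `'\u{F0000}'`, the function names, and the unsigned offset are all hypothetical, and the issue does not say how the direction of the placeholders would be encoded.

```rust
/// Hypothetical reserved marker; the actual code point is not specified in this issue.
const CURSOR: char = '\u{F0000}';

/// Appends a cursor to `out`: the marker followed by the offset stored as a char.
/// Returns `None` if `offset` is not a valid `char` (e.g., a surrogate value).
fn push_cursor(out: &mut String, offset: u32) -> Option<()> {
    let offset_char = char::from_u32(offset)?;
    out.push(CURSOR);
    out.push(offset_char);
    Some(())
}

/// Splits an encoded replacement into its plain text and, if a cursor is present,
/// the cursor's byte index in that text together with its offset.
fn decode_cursor(encoded: &str) -> (String, Option<(usize, u32)>) {
    let mut text = String::new();
    let mut cursor = None;
    let mut chars = encoded.chars();
    while let Some(c) = chars.next() {
        if c == CURSOR {
            // The char after the marker always carries the offset (`\0` for inline cursors).
            let offset = chars.next().map_or(0, |o| o as u32);
            cursor = Some((text.len(), offset));
        } else {
            text.push(c);
        }
    }
    (text, cursor)
}
```

Because the offset char is always present, the decoder never needs to look ahead to decide whether a cursor carries an offset; an inline cursor is simply the marker followed by `\0`.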

skius · 2023-09-03T18:14:36Z

Initial transliteration support has landed. See here for the road to stabilization: #3961

skius added the labels T-core (Type: Required functionality), C-unicode (Component: Props, sets, tries), and S-large (Size: A few weeks (larger feature, major refactoring)) Jul 25, 2023

skius self-assigned this Jul 25, 2023

robertbastian added this to the 1.4 Blocking ⟨P1⟩ milestone Jul 25, 2023

Hywan mentioned this issue Jul 27, 2023

feat(ui): Implement the “fuzzy match room name” filter matrix-org/matrix-rust-sdk#2335

Merged

This was referenced Aug 19, 2023

Transliterator DatagenProvider #3877

Merged

Compile transliterator cursors #3937

Merged

This was referenced Aug 26, 2023

Precompute segments' index fields for transliterators #3943

Merged

Transliterator runtime #3946

Merged

skius closed this as completed Sep 3, 2023

robertbastian removed this from the 1.4 Blocking ⟨P1⟩ milestone Oct 17, 2023
