Opening this issue to keep track of outstanding issues/features to land experimental transliteration. Things marked as (2) can be done after an initial (end-to-end) version has landed.
Rethink cursor serialization - could also be a hardcoded and reserved single code point for cursors without placeholders off either end of the text (e.g., `<CURSOR>`), and could be `<CURSOR><offset as char>` for cursors with an offset/placeholders off either end of the text. Likely always need the `<offset as char>` data for easier deserialization, so for an inline cursor we would have `<CURSOR>\0` - Compile transliterator cursors #3937
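To make the proposed cursor encoding concrete, here is a minimal sketch of writing and reading `<CURSOR><offset as char>`. The marker value `CURSOR_MARKER`, the helper names `push_cursor` and `parse_cursor`, and the choice of an unsigned offset are all assumptions made for illustration; none of them come from the actual ICU4X code.

```rust
/// Hypothetical reserved code point standing in for `<CURSOR>`
/// (a private-use character picked only for this sketch).
const CURSOR_MARKER: char = '\u{F0000}';

/// Appends `<CURSOR><offset as char>` to `out`.
/// Returns `None` (and writes nothing) if `offset` is not a valid `char`.
fn push_cursor(out: &mut String, offset: u32) -> Option<()> {
    let offset_char = char::from_u32(offset)?;
    out.push(CURSOR_MARKER);
    out.push(offset_char);
    Some(())
}

/// Splits a serialized replacement into its text with the cursor removed,
/// plus `(byte position in the text, offset)` if a cursor was present.
fn parse_cursor(serialized: &str) -> (String, Option<(usize, u32)>) {
    let mut text = String::new();
    let mut cursor = None;
    let mut chars = serialized.chars();
    while let Some(c) = chars.next() {
        if c == CURSOR_MARKER {
            // The marker is always followed by exactly one offset char,
            // so an inline cursor is simply `<CURSOR>\0`.
            if let Some(offset_char) = chars.next() {
                cursor = Some((text.len(), offset_char as u32));
            }
        } else {
            text.push(c);
        }
    }
    (text, cursor)
}
```

Because the offset char is always present, the reader never has to decide whether the character after the marker is data or text, which is the "easier deserialization" argument made above.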
Change segments representation - Precompute segments' index fields for transliterators #3943
- [ ] Polish data struct - Polish transliterator data struct #3850
- [ ] (2) Inlining UnicodeSets derived from properties can lead to duplication of data across transliterators (e.g., two transliterators that both use `[:Lowercase:]` would have a lot of duplicate data). Maybe think about some constant-sized serialization special case for properties, e.g., just serializing "property: Lowercase", and then loading that property with the property's datamarker, meaning only some ID for `Lowercase` is duplicated across transliterators, not the actual property data, which lives with the usual property provider.
Add compilation for recursive transliterators (`SingleId`) by passing/creating the (datagen-global) mapping from legacy ID to internal ID - Transliterator DatagenProvider #3877
- [ ] (2) Errors with source location (adding a `usize` to `parse::*` types, whether directly or with a generic "SpanWrapper", is probably the easiest way)
- [ ] (2) Similarly, logging with source location/source text?
- [ ] (2) Decide what (non-critical) validation checks should be performed. Questions include:
  - Enforce no special replacers (backreferences, function calls, cursors) in (implicitly ignored) target contexts?
  - Enforce empty target contexts for unidirectional rules?
  - Enforce no anchors on target-side and no cursors on source-side for unidirectional rules?
- [ ] (2) Decide if validation should be done for both directions even if source file defines only one direction
- [ ] (2) Similarly, decide if non-source-file-defined directions are even allowed (is `a < b` a valid rule in a `forward` transliterator?)
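The "SpanWrapper" idea for errors with source location could look roughly like the sketch below. `Spanned`, `Direction`, `find_direction`, and `line_col` are hypothetical names invented for this example, and the toy scanner ignores quoting and escapes, so it is not how the real parser works.

```rust
/// Generic wrapper pairing a value with the byte offset where it starts in the source.
#[derive(Debug, Clone, PartialEq)]
struct Spanned<T> {
    value: T,
    offset: usize,
}

#[derive(Debug, Clone, PartialEq)]
enum Direction {
    Forward,
    Reverse,
    Both,
}

/// Toy scanner: finds the first direction operator (`>`, `<`, or `<>`) in a rule.
/// On failure, the error message carries the offset where scanning stopped.
fn find_direction(rule: &str) -> Result<Spanned<Direction>, Spanned<&'static str>> {
    let bytes = rule.as_bytes();
    for (i, &b) in bytes.iter().enumerate() {
        // `<` and `>` are ASCII, so a byte scan cannot split a multi-byte character.
        let value = match b {
            b'<' if bytes.get(i + 1) == Some(&b'>') => Direction::Both,
            b'<' => Direction::Reverse,
            b'>' => Direction::Forward,
            _ => continue,
        };
        return Ok(Spanned { value, offset: i });
    }
    Err(Spanned { value: "rule has no direction operator", offset: rule.len() })
}

/// Converts a byte offset into a 1-based (line, column) pair, counting columns in chars.
fn line_col(source: &str, offset: usize) -> (usize, usize) {
    let (mut line, mut col) = (1, 1);
    for (i, c) in source.char_indices() {
        if i >= offset {
            break;
        }
        if c == '\n' {
            line += 1;
            col = 1;
        } else {
            col += 1;
        }
    }
    (line, col)
}
```

Storing only a `usize` keeps the wrapper cheap; the line and column are derived on demand when an error or log message is actually rendered.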
Add the right crate features to `transliteration` and `transliterator_parser`.
Add `bakeddata` support to `transliteration` - Transliterator DatagenProvider #3877
- [ ] (2) Unify parse+compile tests (currently difficult to judge where an edge case is tested)
Parse provided metadata sources (ID, visibility, ...) and pass to compilation
Provide direction (part of metadata) to parsing/compilation, and act accordingly
Parse: log warnings (or even errors?) when a rule uses an unspecified direction (e.g. `b < a` with `forward` metadata)
Compile: compile datastructs only for the directions the metadata specifies
- [ ] (2) Build and use dependency graph for transliterators for datagen
Skippable in the initial version by generating data for everything
- [ ] (2) Generate data for transitive dependencies when specifying a certain locale
- [ ] (2) Handle (potentially) special-cased datagen for builtin transliterators (Upper, Title, Lower, ...)
- [ ] (2) Some builtin transliterators might also not require any data, like Any-Remove
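Generating data for transitive dependencies amounts to a reachability walk over the dependency graph. The sketch below assumes the graph is available as a map from a transliterator ID to the IDs it references; the function name `transitive_closure` and the placeholder IDs in the usage note are hypothetical.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns every ID reachable from `roots` (roots included) in `deps`,
/// where `deps` maps a transliterator ID to the IDs its rules reference.
fn transitive_closure<'a>(
    deps: &BTreeMap<&'a str, Vec<&'a str>>,
    roots: &[&'a str],
) -> BTreeSet<String> {
    let mut seen = BTreeSet::new();
    let mut stack: Vec<&'a str> = roots.to_vec();
    while let Some(id) = stack.pop() {
        // `insert` returns false for IDs already visited, which also makes cycles terminate.
        if seen.insert(id.to_string()) {
            if let Some(children) = deps.get(id) {
                stack.extend(children.iter().copied());
            }
        }
    }
    seen
}
```

With a graph where `A` depends on `B` and `C`, `B` on `C`, and `C` back on `A`, requesting `B` yields `A`, `B`, and `C`, while an unrelated `D` is left out of the generated data.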
Skippable in the initial version by matching on metadata purely during datagen (keep track of `legacy ID => transliterator` map during datagen/parsing) - Transliterator DatagenProvider #3877
Use (copy-pasted, offline preprocessed) CLDR testData for integration tests.
- [ ] Once we have this, how do we keep this up to date?
- [ ] Steal ICU4C tests
Polish
- [ ] Add detailed comments to parser
- [ ] Factor out escape parsing together with `unicodeset_parser`'s escape handling somehow
- [ ] Data-gen optimizations - Apply transliterator data struct optimizations at datagen-time #3825
- [ ] DatagenProvider impl only compute transliterator map once instead of per `load`
- [ ] Pretty-print intermediate structs back to source syntax (would allow for round-trip testing)
- [ ] Apply parse error conclusions from Decide expressiveness of UnicodeSet parsing errors #3558
- [ ] Add data source for transform rules (@robertbastian)
- [ ] After that, adjust the compiled_data `config.json`.
- [ ] After that, remove workaround in `download-repo-sources.rs`
- [ ] (2) Once we know more about the properties used in CLDR data, update Support (loose) string-to-property-map matching in `icu_properties` #3559
- [ ] (2) Test parse error messages
- [ ] (2) Remove associated `#[allow(unused)]`
`use variable range ...` - Ignore ICU pragmas when parsing transform rules #3995
`parse.rs` - Transliterator parser cleanup #3827
`transliteration` - Transliterator DatagenProvider #3877
`transliterator_parser`
(`source/target`) - Internal representation for Transliteration IDs #3765
`aux: String` and update `strict_cmp`, ...)
- [ ] (2) Parsing for legacy UTS#35 IDs (e.g., `und_Source-und_Target`)
- [ ] Open issue detailing discussed API
- [ ] Separate fallback chain for single locales for transliteration
- [ ] Lockstep transliterator fallback mechanism (UTS#35) in Transliterator constructor
- [ ] Answer: How to do fallback on composite special+regular source/target locales? - Transliterator fallback #3950
- [ ] Can users override internal transliterators? - Overriding of internal transliterators #3911
- [ ] Add hardcoded ICU transliterators (e.g., Any-Hex, see special classes that exist in ICU4J) - Implement hardcoded ICU transliterators #3910
- [ ] Handle BCP-47 for them at runtime + datagen time - Invent BCP47 IDs for hardcoded transliterators #3909
`&'static str` to `PEK::Internal` for better bug error messages - Add &'static str with debug information to internal transliterator parse errors #3996
- [ ] Test all data structs built from our data to be "valid" in the sense that the VarTable layout is applicable to the encoded rules
- [ ] Add benchmarking for transliteration and resolve optimization comments in codebase - Optimize Transliteration runtime (and add better benchmarks) #3957
- [ ] Cleanup - Cleanup transliteration runtime #3958