Add basic hardcoded Any-Hex transliterators #3965

Merged on Sep 4, 2023 (101 commits)
e0dcb85
wip
skius Aug 24, 2023
8c79370
wip
skius Aug 25, 2023
0a86ca8
wip
skius Aug 25, 2023
59ddc1f
Merge branch 'main' into tl-runtime
skius Aug 25, 2023
9976099
wip
skius Aug 25, 2023
be4359b
fmt
skius Aug 25, 2023
60ce069
wip
skius Aug 25, 2023
cc8532a
simple things are working :)
skius Aug 25, 2023
8b52bc2
add test transform rules
skius Aug 25, 2023
7713ec6
test in bakeddata
skius Aug 25, 2023
15397da
more
skius Aug 25, 2023
e11279a
wip
skius Aug 25, 2023
6e8eed1
switch to input str wrapper for anchor support
skius Aug 25, 2023
5097871
support segments + backrefs
skius Aug 25, 2023
0006bb9
quantifiers, but broken because of dynamic segment numbering
skius Aug 26, 2023
2419524
Squash of segments with indices
skius Aug 26, 2023
2e80fa9
make quantifiers work (including with segments)
skius Aug 26, 2023
d384afc
add function call + Any-Remove test
skius Aug 26, 2023
57b2e1a
setup for function call replacement
skius Aug 26, 2023
9f29e9c
add ignore_len to replaceable and therefore make function calls work
skius Aug 26, 2023
7bafbe6
fmt
skius Aug 26, 2023
9abee53
useless assert
skius Aug 26, 2023
d820a5c
regenerate testdata
skius Aug 26, 2023
5d8f97e
Merge branch 'main' into tl-runtime
skius Aug 26, 2023
9fcfffe
fix cursor placeholder filter interaction
skius Aug 26, 2023
2b90c2c
fix rule replacement application
skius Aug 26, 2023
b289160
fix run generation
skius Aug 26, 2023
035cb9f
add empty match test
skius Aug 26, 2023
2809374
fmt
skius Aug 26, 2023
9ff0945
typo
skius Aug 26, 2023
3f818f5
switch from raw_cursor to visible_content-relative cursor
skius Aug 26, 2023
70275a3
refactor set_ignore_len into constructor
skius Aug 26, 2023
3b60b3f
no unnecessary allocations
skius Aug 26, 2023
7e7cb4d
fix safety violations with cursor offset
skius Aug 26, 2023
7a14dd6
add replacement size hints
skius Aug 26, 2023
67d0775
reuse work when checking for encoded chars
skius Aug 26, 2023
66be38b
add custom transliterator support
skius Aug 26, 2023
437443b
add notes
skius Aug 26, 2023
2b3d91e
make normalization tests pass, but baked is probably wrong
skius Aug 26, 2023
2c12542
use other crates' baked providers
skius Aug 26, 2023
fe5ed45
add cldr testData tests for supported locales
skius Aug 27, 2023
a270ce5
300k lines is maybe a bit much
skius Aug 27, 2023
41b1fd0
add notes
skius Aug 27, 2023
a03f2f8
add convenience function
skius Aug 27, 2023
2255447
cleanup
skius Aug 27, 2023
d1f880d
cleanup
skius Aug 27, 2023
3ccc422
cursoroffset::byte safety
skius Aug 27, 2023
283ba60
notes
skius Aug 27, 2023
cd7091d
switch to safer Matcher API
skius Aug 27, 2023
5e0e3b2
refactor RepMatcher to avoid some index recomputations
skius Aug 27, 2023
7d1fb9e
benchmarking
skius Aug 27, 2023
5370bc2
remove eprintlns
skius Aug 27, 2023
ba16b35
enforce no key matching after post matching
skius Aug 27, 2023
88d037d
avoid potential cursor safety issue
skius Aug 27, 2023
c88e2f4
remove duplicate license header
skius Aug 27, 2023
7822602
improve perf by '25%'
skius Aug 27, 2023
0ef369f
fix backref bug
skius Aug 27, 2023
3932436
safe replaceable-from-insertable getting
skius Aug 28, 2023
c33bc8d
replaceable api changes
skius Aug 28, 2023
c5f521b
factor out partially-invisible-vec semantics into Hide
skius Aug 28, 2023
74e1e8f
doc changes
skius Aug 28, 2023
cf82d85
remove last unsafe in transliterator module
skius Aug 28, 2023
f942dfb
more docs for replaceable.rs
skius Aug 28, 2023
5ba561d
Merge branch 'main' into tl-runtime
skius Aug 28, 2023
6286c28
regenerate testdata
skius Aug 28, 2023
8dff4d9
more comments
skius Aug 28, 2023
7e08042
trait docs
skius Aug 28, 2023
595fd38
more notes
skius Aug 28, 2023
738b19c
fmt
skius Aug 28, 2023
3f293de
doc additions
skius Aug 29, 2023
85b9f23
docs
skius Aug 29, 2023
f7ad0a5
turn non-actionable todos into thoughts and questions
skius Aug 29, 2023
690d62b
todos
skius Aug 29, 2023
99fb3d5
todos
skius Aug 29, 2023
f607ad3
more todos
skius Aug 29, 2023
529b598
todos
skius Aug 29, 2023
feb1aeb
todos => thought
skius Aug 29, 2023
683256d
clippy lints
skius Aug 29, 2023
dd94b00
doc + std errors + fmt
skius Aug 29, 2023
1445763
license?
skius Aug 29, 2023
5c22c95
license?
skius Aug 29, 2023
97a3610
fix failing test
skius Aug 29, 2023
07be76d
fix license header
skius Aug 29, 2023
cf4839a
update lib docs
skius Aug 29, 2023
c48fcec
link to locale
skius Aug 29, 2023
81ba8be
add hardcoded Hex example
skius Aug 29, 2023
d36bb58
add XML, Perl and Plain hex variants
skius Aug 29, 2023
2beb660
fix typo
skius Aug 29, 2023
c58dacd
Merge branch 'tl-runtime' into hex-translit
skius Aug 29, 2023
230f3b9
typo
skius Aug 29, 2023
633cb4b
Merge branch 'tl-runtime' into hex-translit
skius Aug 29, 2023
e2e2d07
don't use unstable ilog2
skius Aug 30, 2023
57ee4fa
add docs
skius Aug 30, 2023
3745d5c
as_replaceable make_contiguous thoughts
skius Aug 30, 2023
c14389e
typo
skius Aug 30, 2023
fe3e6bc
Merge branch 'tl-runtime' into hex-translit
skius Aug 30, 2023
f83058a
Merge branch 'main' into hex-translit
skius Sep 3, 2023
dcab898
fmt-free Hex transliterator
skius Sep 3, 2023
c7e8023
comments
skius Sep 3, 2023
927c1d0
bad merge...
skius Sep 3, 2023
a9887bd
doc tutorial lock
skius Sep 3, 2023

11 changes: 11 additions & 0 deletions docs/tutorials/Cargo.lock

86 changes: 86 additions & 0 deletions experimental/transliteration/src/transliterator/hardcoded.rs
@@ -0,0 +1,86 @@
// This file is part of ICU4X. For terms of use, please see the file
// called LICENSE at the top level of the ICU4X source tree
// (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).

//! This module defines implementations for code-based transliterators that are part of
//! transform rules.

use crate::transliterator::replaceable::{Forward, Replaceable, Utf8Matcher};

/// A transliterator that replaces every character with its `case`-case hexadecimal representation,
/// 0-padded to `min_length`, and surrounded by `prefix` and `suffix`.
#[derive(Debug)]
pub(super) struct HexTransliterator {
    prefix: &'static str,
    suffix: &'static str,
    min_length: u8,
    case: Case,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub(super) enum Case {
    Upper,
    Lower,
}

impl HexTransliterator {
    pub(super) fn new(
        prefix: &'static str,
        suffix: &'static str,
        min_length: u8,
        case: Case,
    ) -> Self {
        Self {
            prefix,
            suffix,
            min_length,
            case,
        }
    }

    pub(super) fn transliterate(&self, mut rep: Replaceable) {
        while !rep.is_finished() {
            let mut matcher = rep.start_match();
            // Thought: ok this fully specified path is annoying, maybe a separate API surface is
            // better for Forward vs Reverse matching.
            let c = Utf8Matcher::<Forward>::next_char(&matcher);
            // there must always be a char, because we just checked that `rep` is not finished yet.
            let c = c.unwrap();
            Utf8Matcher::<Forward>::match_and_consume_char(&mut matcher, c);
            let mut dest = matcher.finish_match();

            let c_u32 = c as u32;
            // number of hex digits: significant bits divided by 4, rounded up (0 for U+0000)
            let length = (u32::BITS - c_u32.leading_zeros() + 3) / 4;
            let padding = self.min_length.saturating_sub(length as u8);
            dest.apply_size_hint(
                self.prefix.len() + padding as usize + length as usize + self.suffix.len(),
            );

            dest.push_str(self.prefix);
            for _ in 0..padding {
                dest.push_str("0");
            }
            let mut remaining_c = c_u32;
            // temporary buffer because forward iteration through a u32's nibbles is easier and
            // we need the reverse order
            let mut buf = [0; 6];
            for slot in buf.iter_mut() {
                // stop once all significant nibbles have been written; exactly `length` slots
                // are filled at this point
                if remaining_c == 0 {
                    break;
                }
                *slot = match remaining_c & 0xF {
                    x @ 0x0..=0x9 => b'0' + x as u8,
                    x @ 0xA..=0xF if self.case == Case::Lower => b'a' + (x - 0xA) as u8,
                    x => b'A' + (x - 0xA) as u8,
                };
                remaining_c >>= 4;
            }
            // only `length` hex digits are actually from the char
            for c in buf[..length as usize].iter().rev() {
                dest.push(*c as char);
            }
            dest.push_str(self.suffix);
        }
    }
}
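The digit-count and nibble-extraction logic in `transliterate` can be sketched outside of ICU4X as follows. This is a minimal illustration under stated assumptions: `hex_escape` and its parameters are hypothetical names that do not exist in the crate, and the sketch writes into a plain `String` instead of going through the `Replaceable` and `Insertable` types used by the PR.

```rust
/// Hypothetical standalone helper (not part of ICU4X): hex-encodes a single
/// char, zero-padded to `min_length` digits and wrapped in `prefix`/`suffix`.
fn hex_escape(c: char, prefix: &str, suffix: &str, min_length: usize, upper: bool) -> String {
    let code = c as u32;
    // Number of hex digits needed: significant bits divided by 4, rounded up.
    // For U+0000 this is 0, so the output consists of padding zeros only.
    let length = ((u32::BITS - code.leading_zeros() + 3) / 4) as usize;
    let padding = min_length.saturating_sub(length);

    let mut out = String::with_capacity(prefix.len() + padding + length + suffix.len());
    out.push_str(prefix);
    for _ in 0..padding {
        out.push('0');
    }
    // Emit nibbles from most significant to least significant, without `fmt`.
    for i in (0..length).rev() {
        let nibble = ((code >> (4 * i)) & 0xF) as u8;
        let digit = match nibble {
            0..=9 => b'0' + nibble,
            _ if upper => b'A' + (nibble - 10),
            _ => b'a' + (nibble - 10),
        };
        out.push(digit as char);
    }
    out.push_str(suffix);
    out
}
```

For instance, `hex_escape('ä', "\\u{", "}", 2, false)` yields `\u{e4}`, and `hex_escape('\0', "U+", "", 4, true)` yields `U+0000`, in line with the expectations in the PR's tests. Unlike the PR's code, the sketch indexes nibbles directly by shift amount rather than filling a temporary buffer and reversing it.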
37 changes: 37 additions & 0 deletions experimental/transliteration/src/transliterator/mod.rs
@@ -2,11 +2,13 @@
// called LICENSE at the top level of the ICU4X source tree
// (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).

mod hardcoded;
#[allow(clippy::indexing_slicing, clippy::unwrap_used)] // TODO(#3958): Remove.
mod replaceable;

use crate::provider::{FunctionCall, Rule, RuleULE, SimpleId, VarTable};
use crate::provider::{RuleBasedTransliterator, Segment, TransliteratorRulesV1Marker};
use crate::transliterator::hardcoded::Case;
use crate::TransliteratorError;
use alloc::boxed::Box;
use alloc::string::{String, ToString};
@@ -123,6 +125,7 @@ enum InternalTransliterator {
    RuleBased(DataPayload<TransliteratorRulesV1Marker>),
    Composing(ComposingTransliterator),
    Decomposing(DecomposingTransliterator),
    Hex(hardcoded::HexTransliterator),
    Null,
    Remove,
    Dyn(Box<dyn CustomTransliterator>),
@@ -135,6 +138,7 @@ impl InternalTransliterator {
            // TODO(#3910): internal hardcoded transliterators
            Self::Composing(t) => t.transliterate(rep, env),
            Self::Decomposing(t) => t.transliterate(rep, env),
            Self::Hex(t) => t.transliterate(rep),
            Self::Null => (),
            Self::Remove => rep.replace_modifiable_with_str(""),
            Self::Dyn(custom) => {
@@ -392,6 +396,21 @@ impl Transliterator {
            )),
            "any-null" => Ok(InternalTransliterator::Null),
            "any-remove" => Ok(InternalTransliterator::Remove),
            "any-hex-unicode" => Ok(InternalTransliterator::Hex(
                hardcoded::HexTransliterator::new("U+", "", 4, Case::Upper),
            )),
            "any-hex-rust" => Ok(InternalTransliterator::Hex(
                hardcoded::HexTransliterator::new("\\u{", "}", 2, Case::Lower),
            )),
            "any-hex-xml" => Ok(InternalTransliterator::Hex(
                hardcoded::HexTransliterator::new("&#x", ";", 1, Case::Upper),
            )),
            "any-hex-perl" => Ok(InternalTransliterator::Hex(
                hardcoded::HexTransliterator::new("\\x{", "}", 1, Case::Upper),
            )),
            "any-hex-plain" => Ok(InternalTransliterator::Hex(
                hardcoded::HexTransliterator::new("", "", 4, Case::Upper),
            )),
            s => Err(DataError::custom("unavailable transliterator")
                .with_debug_context(s)
                .into()),
@@ -1364,4 +1383,22 @@ mod tests {
        let output = "aa";
        assert_eq!(t.transliterate(input.to_string()), output);
    }

    #[test]
    fn test_hex_rust() {
        let t = Transliterator::try_new("und-t-und-s0-test-d0-test-m0-hexrust".parse().unwrap())
            .unwrap();
        let input = "\0äa\u{10FFFF}❤!";
        let output = r"\u{00}\u{e4}\u{61}\u{10ffff}\u{2764}\u{21}";
        assert_eq!(t.transliterate(input.to_string()), output);
    }

    #[test]
    fn test_hex_unicode() {
        let t = Transliterator::try_new("und-t-und-s0-test-d0-test-m0-hexuni".parse().unwrap())
            .unwrap();
        let input = "\0äa\u{10FFFF}❤!";
        let output = "U+0000U+00E4U+0061U+10FFFFU+2764U+0021";
        assert_eq!(t.transliterate(input.to_string()), output);
    }
}
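To make the five hardcoded variants easier to compare, the sketch below mirrors the `(prefix, suffix, min_length, case)` table from the diff and applies it to a whole string. It is an illustration only: `hex_variant` and `encode_with_variant` are hypothetical names, and the digits are produced with the standard library's `format!` rather than the fmt-free approach the PR uses.

```rust
/// Hypothetical mirror of the variant table in the diff:
/// (prefix, suffix, minimum digit count, uppercase?).
fn hex_variant(id: &str) -> Option<(&'static str, &'static str, usize, bool)> {
    match id {
        "any-hex-unicode" => Some(("U+", "", 4, true)),
        "any-hex-rust" => Some(("\\u{", "}", 2, false)),
        "any-hex-xml" => Some(("&#x", ";", 1, true)),
        "any-hex-perl" => Some(("\\x{", "}", 1, true)),
        "any-hex-plain" => Some(("", "", 4, true)),
        _ => None,
    }
}

/// Hex-encodes every char of `input` according to the variant `id`.
/// Returns `None` for an unknown variant. Uses `format!` for brevity,
/// which is an assumption of this sketch and not how the PR does it.
fn encode_with_variant(id: &str, input: &str) -> Option<String> {
    let (prefix, suffix, min_length, upper) = hex_variant(id)?;
    let mut out = String::new();
    for c in input.chars() {
        let code = c as u32;
        out.push_str(prefix);
        if upper {
            out.push_str(&format!("{:0width$X}", code, width = min_length));
        } else {
            out.push_str(&format!("{:0width$x}", code, width = min_length));
        }
        out.push_str(suffix);
    }
    Some(out)
}
```

With this table, the heart character `❤` (U+2764) becomes `U+2764`, `\u{2764}`, `&#x2764;`, `\x{2764}` and `2764` for the unicode, rust, xml, perl and plain variants respectively.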
Original file line number Diff line number Diff line change
@@ -427,6 +427,10 @@ impl<'a, 'b> RepMatcher<'a, 'b, true> {

// we can only finish matching the key once
impl<'a, 'b> RepMatcher<'a, 'b, false> {
    pub(super) fn finish_match(self) -> Insertable<'a, 'b> {
        Insertable::from_matcher(self.finish_key())
    }

    pub(super) fn finish_key(self) -> RepMatcher<'a, 'b, true> {
        RepMatcher {
            rep: self.rep,

2 changes: 2 additions & 0 deletions provider/datagen/tests/data/postcard/fingerprints.csv
@@ -2057,6 +2057,8 @@ transliterator/rules@1, und+und-t-und-Beng-d0-intindic, 2621B, 24a04df29d08559d
transliterator/rules@1, und+und-t-und-Latn-d0-ascii, 27110B, c66743617e3238ff
transliterator/rules@1, und+und-t-und-d0-test-m0-cursfilt-s0-test, 93B, ac67e05bc986cd23
transliterator/rules@1, und+und-t-und-d0-test-m0-emtymach-s0-test, 105B, 12b65cade4ce4468
transliterator/rules@1, und+und-t-und-d0-test-m0-hexrust-s0-test, 98B, b8802989a6bfec0f
transliterator/rules@1, und+und-t-und-d0-test-m0-hexuni-s0-test, 104B, 4335c71013bd81d
transliterator/rules@1, und+und-t-und-d0-test-m0-niels-s0-test, 1800B, 6a560a4143a4b60c
transliterator/rules@1, und+und-t-und-d0-test-m0-rectesta-s0-test, 370B, af652bcb33e1038b
transliterator/rules@1, und+und-t-und-d0-test-m0-rectestr-s0-test, 281B, 51be7571fd233bd6