Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transliterator runtime #3946

Merged
merged 95 commits into from
Aug 31, 2023
Merged

Transliterator runtime #3946

merged 95 commits into from
Aug 31, 2023

Conversation

skius
Copy link
Member

@skius skius commented Aug 26, 2023

Depends on #3943
Part of #3736

Implements the runtime for transliterators:

  • Loading data structs
  • Transliterating on data structs
  • Transliterating some hardcoded transliterators
  • Transliterating with components (NFC, NFD...)
  • Transliterating with user-provided transliterators
  • Utility structs for in-place transliteration

Has tests:

  • the part of the CLDR testData suite that is supported by the locales currently in ICU4X, but it's easy to add the full suite once we have the full CLDR transform data
  • custom unit tests

Has a setup for benchmarks. This is very much just the structure, the benchmarks are not well-thought out yet. #3957

I will turn the commented-out eprintln's into log::info!'s in a follow-up: #3958

Note to reviewers: The commit history contains a lot of old designs, starting from the main module being unsafe + allocate-y to now the main module being safe + less allocate-y. Probably best to review the files individually instead of going through the commits.

commit b62edfe
Author: Niels Saurer <nielssaurer@google.com>
Date:   Sat Aug 26 02:30:56 2023 +0200

    regenerate test+bakeddata

commit ba7a333
Author: Niels Saurer <nielssaurer@google.com>
Date:   Sat Aug 26 02:23:33 2023 +0200

    add idx to segments

commit bddf819
Author: Robert Bastian <robertbastian@users.noreply.github.com>
Date:   Fri Aug 25 21:45:28 2023 +0200

    pen Cleaning up the workspace (unicode-org#3938)

commit beae962
Author: Shane F. Carr <shane@unicode.org>
Date:   Fri Aug 25 12:22:25 2023 -0700

    Add some scripts and regions to HelloWorldProvider (unicode-org#3936)
/// The used indices in method parameters are all compatible with each other.
// Thought: I don't think this needs to be called *Utf8* matcher, maybe just Matcheable

pub(super) trait Utf8Matcher<D: MatchDirection>: Debug {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question:
Looking for advice for a different API than implementing two traits (Utf8M<Forward> and <Reverse>) with the same method API surface for the same type. Leads to conflicts that need to be fully qualified such as here (in a follow-up PR): https://github.com/unicode-org/icu4x/pull/3965/files#diff-6dd89dfb5721915aafe24a449bc4d2d499df79d63e2f7687e2eccd9384307cddR44

Maybe just explicit method names? match_str_forward, match_str_reverse, etc?

Copy link
Member

@robertbastian robertbastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have not looked at replaceable yet

experimental/transliteration/benches/bench.rs Outdated Show resolved Hide resolved
experimental/transliteration/src/lib.rs Outdated Show resolved Hide resolved
experimental/transliteration/Cargo.toml Show resolved Hide resolved
experimental/transliteration/src/provider.rs Outdated Show resolved Hide resolved
experimental/transliteration/src/transliterator/mod.rs Outdated Show resolved Hide resolved
experimental/transliteration/src/transliterator/mod.rs Outdated Show resolved Hide resolved
experimental/transliteration/src/transliterator/mod.rs Outdated Show resolved Hide resolved
experimental/transliteration/src/transliterator/mod.rs Outdated Show resolved Hide resolved
if idx < next_base {
return Some(VarTableElement::Quantifier(
QuantifierKind::ZeroOrOne,
&self.quantifiers_opt[idx],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clippy will want to know that these indexing operations are non-panicking

@skius
Copy link
Member Author

skius commented Aug 30, 2023

@Manishearth if you want to, I'd appreciate your comments on the use of unsafe/the general memory-safety in replaceable.rs

@robertbastian
Copy link
Member

I have a rough understanding of what's going on in replaceable.rs, so I'll merge this for now. Will need to do a more in-depth (safety) review before graduation.

Copy link
Member

@robertbastian robertbastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

@robertbastian robertbastian merged commit cb32488 into unicode-org:main Aug 31, 2023
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants