feat(core): segmented marker normalization 🙀 #10369
a generic algorithm that handles
Thinking aloud here... (fyi @mcdurdin): (yes this calls the {
  marker_map map;
  std::cout << __FILE__ << ":" << __LINE__ << " - support 2-segment markers " << std::endl;
  // e\m{1}`\m{2}_E\m{3}`\m{4}_
  const std::u32string src =
      U"e\uFFFF\u0008\u0001\u0300\uFFFF\u0008\u0002\u0320E\uFFFF\u0008\u0003\u0300\uFFFF\u0008\u0004\u0320";
  // e\m{2}_\m{1}`E\m{4}_\m{3}`
  const std::u32string expect =
      U"e\uFFFF\u0008\u0002\u0320\uFFFF\u0008\u0001\u0300E\uFFFF\u0008\u0004\u0320\uFFFF\u0008\u0003\u0300";
  std::u32string dst = src;
  assert(normalize_nfd_markers_segment(dst, map));
  if (dst != expect) {
    std::cout << "dst: " << Debug_UnicodeString(dst) << std::endl;
    std::cout << "exp: " << Debug_UnicodeString(expect) << std::endl;
  }
  zassert_string_equal(dst, expect);
  marker_map expm({{0x300, 0x1L}, {0x320, 0x2L}, {0x300, 0x3L}, {0x320, 0x4L}});
  assert_marker_map_equal(map, expm);
} the stream of markers may be parsed as shown on the
to:
so then how can the 'add back markers' operate?
OK. Might work for this case. But what about this one as an input string:
The problem is that, when adding back markers, how do we know that So, I think we need segmentation. Then the problem looks like this: (Character boundaries shown by
The middle segment has no markers. So now, processed in three segments, each one will already work. The branch I started on with the 'linear' marker_map above will handle duplicates and doubled markers well. But with the segment boundaries splitting up the problem, we don't have to worry about the marker being on the wrong segment.
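To make the 'remove markers, normalize, add back markers' idea concrete for a single segment, here is a minimal sketch. It is not the code from the branch. `MarkerEntry`, `MarkerList`, `strip_markers`, `add_back_markers` and `kEndOfSegment` are hypothetical names, and the list stores one entry per marked character (with all markers glued in front of it), which is an assumed variant of the linear marker_map shown in the test above. The marker encoding U+FFFF U+0008 <index> is taken from the test strings. The NFD reordering itself is left to the caller.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for marker_map: one entry per marked character,
// in input order, holding every marker that was glued in front of it.
struct MarkerEntry {
  char32_t ch;
  std::vector<std::uint32_t> markers;
};
using MarkerList = std::vector<MarkerEntry>;

constexpr char32_t kSentinel = 0xFFFF;      // from the test strings in this thread
constexpr char32_t kMarkerCode = 0x0008;    // from the test strings in this thread
constexpr char32_t kEndOfSegment = 0xFFFF;  // assumed placeholder for trailing markers

// Remove markers from `in`, recording which character each marker preceded.
std::u32string strip_markers(const std::u32string &in, MarkerList &list) {
  std::u32string text;
  std::vector<std::uint32_t> pending;
  std::size_t i = 0;
  while (i < in.size()) {
    if (i + 2 < in.size() && in[i] == kSentinel && in[i + 1] == kMarkerCode) {
      pending.push_back(static_cast<std::uint32_t>(in[i + 2]));
      i += 3;
      continue;
    }
    char32_t ch = in[i++];
    if (!pending.empty()) {
      list.push_back({ch, pending});
      pending.clear();
    }
    text += ch;
  }
  if (!pending.empty()) list.push_back({kEndOfSegment, pending});
  return text;
}

// Emit and consume the first remaining entry recorded for `ch`, if any.
static void emit_markers_for(char32_t ch, MarkerList &list, std::u32string &out) {
  for (auto it = list.begin(); it != list.end(); ++it) {
    if (it->ch != ch) continue;
    for (std::uint32_t m : it->markers) {
      out += kSentinel;
      out += kMarkerCode;
      out += static_cast<char32_t>(m);
    }
    list.erase(it);
    return;
  }
}

// Re-insert markers into text that has already been reordered by normalization.
std::u32string add_back_markers(const std::u32string &text, MarkerList list) {
  std::u32string out;
  for (char32_t ch : text) {
    emit_markers_for(ch, list, out);
    out += ch;
  }
  emit_markers_for(kEndOfSegment, list, out);
  return out;
}
```

Because each entry is consumed once, two occurrences of the same combining character in one segment pick up their markers in input order, and a character with two markers in front of it gets both back together. The sketch only stays unambiguous if the text passed in is a single segment, which is the point of segmenting first.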
Fixes #9999.

Note TODO items:
- [ ] Renormalize cached_context across action boundary. Blocked by #10369.
- [ ] Add extra tests for surrogate pairs
- [ ] Move set_context_from_string into helper module
- [ ] if we don't apply normalization, we still need to fixup the app_context, to keep it coherent with cached_context (or at least we need to verify that app_context is never used in this situation)

- split out parse_next_marker()
- use NFD safe boundaries to segment marker interaction

For: #10369

- easier to understand loop
- comments

For: #10369

The marker normalization needs to break the text into 'segments' at user-perceived character boundaries. For example, in
e\m{a}\u{0301}a\m{b}\u{0301}
the markers a and b are in two different segments and so do not interact.
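A rough sketch of that segmentation, for illustration only: `split_segments` and `is_combining` are hypothetical names, the marker encoding (U+FFFF U+0008 <index>) is taken from the test strings in this thread, and the boundary test is deliberately crude. It treats only U+0300..U+036F as combining, where real code would ask a normalizer for NFD-safe boundaries.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumption: a marker is the three-unit sequence U+FFFF U+0008 <index>.
constexpr char32_t kSentinel = 0xFFFF;
constexpr char32_t kMarkerCode = 0x0008;

// Simplification: only the Combining Diacritical Marks block counts as
// combining. A real implementation would use proper Unicode data.
static bool is_combining(char32_t ch) {
  return ch >= 0x0300 && ch <= 0x036F;
}

static bool is_marker_at(const std::u32string &s, std::size_t i) {
  return i + 2 < s.size() && s[i] == kSentinel && s[i + 1] == kMarkerCode;
}

// Split `s` so that every segment starts at a non-combining character.
// Markers stay glued to the character that follows them.
std::vector<std::u32string> split_segments(const std::u32string &s) {
  std::vector<std::u32string> segments;
  std::u32string current;
  std::u32string pending;  // markers waiting for the character they attach to
  std::size_t i = 0;
  while (i < s.size()) {
    if (is_marker_at(s, i)) {
      pending.append(s, i, 3);
      i += 3;
      continue;
    }
    char32_t ch = s[i++];
    if (!is_combining(ch) && !current.empty()) {
      // A base character closes the previous segment.
      segments.push_back(current);
      current.clear();
    }
    current += pending;
    current += ch;
    pending.clear();
  }
  current += pending;  // trailing markers stay with the last segment
  if (!current.empty()) segments.push_back(current);
  return segments;
}
```

Applied to the example string, this yields two segments, one starting at `e` and one starting at `a`, so each marker is only ever reordered relative to the characters in its own segment.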