feat(core): segmented marker normalization 🙀 #10369

srl295 · 2024-01-12T03:29:10Z

Marker normalization needs to break the text into 'segments' at user-perceived character boundaries. For example, in e\m{a}\u{0301}a\m{b}\u{301}, the markers a and b are in two different segments and so do not interact.


srl295 · 2024-01-12T03:36:05Z

A generic algorithm that handles \m{a}\u{0301}\m{b}\u{0301} (which will be unchanged) should also handle the above case.


srl295 · 2024-01-16T02:34:11Z

a generic algorithm that handles \m{a}\u{0301}\m{b}\u{0301} (which will be unchanged) should handle the above case also.

Thinking aloud here... (FYI @mcdurdin):

(Yes, this calls the _segment() API, but it's illustrative.)

  {
    marker_map map;
    std::cout << __FILE__ << ":" << __LINE__ << " - support 2-segment markers " << std::endl;
    // e\m{1}`\m{2}_E\m{3}`\m{4}_
    const std::u32string src =
        U"e\uFFFF\u0008\u0001\u0300\uFFFF\u0008\u0002\u0320E\uFFFF\u0008\u0003\u0300\uFFFF\u0008\u0004\u0320";
    // e\m{2}_\m{1}`E\m{4}_\m{3}`
    const std::u32string expect =
        U"e\uFFFF\u0008\u0002\u0320\uFFFF\u0008\u0001\u0300E\uFFFF\u0008\u0004\u0320\uFFFF\u0008\u0003\u0300";
    std::u32string dst = src;
    assert(normalize_nfd_markers_segment(dst, map));
    if (dst != expect) {
      std::cout << "dst: " << Debug_UnicodeString(dst) << std::endl;
      std::cout << "exp: " << Debug_UnicodeString(expect) << std::endl;
    }
    zassert_string_equal(dst, expect);
    marker_map expm({{0x300, 0x1L}, {0x320, 0x2L}, {0x300, 0x3L}, {0x320, 0x4L}});
    assert_marker_map_equal(map, expm);
  }
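For illustration, here is a minimal sketch of how a marker list like expm in the test above might be collected. It assumes the encoding visible in the test strings (U+FFFF U+0008 n stands for \m{n}) and that each marker is recorded against the character that follows it. The names strip_markers, marker_list and MARKER_AT_END are hypothetical, and this is not the Keyman Core implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Assumed encoding, read off the test strings: U+FFFF U+0008 <n> is \m{n}.
constexpr char32_t SENTINEL = 0xFFFF;
constexpr char32_t CODE_MARKER = 0x0008;
// Hypothetical placeholder for a marker at the very end of the text.
constexpr char32_t MARKER_AT_END = 0xFFFFFFFF;

// (character the marker is attached to, marker number), in text order.
using marker_list = std::vector<std::pair<char32_t, std::uint32_t>>;

// Returns src with all markers removed and appends one entry to out per marker.
std::u32string strip_markers(const std::u32string &src, marker_list &out) {
  std::u32string text;
  std::vector<std::uint32_t> pending;  // markers seen since the last character
  for (std::size_t i = 0; i < src.size(); ++i) {
    if (src[i] == SENTINEL && i + 2 < src.size() && src[i + 1] == CODE_MARKER) {
      pending.push_back(static_cast<std::uint32_t>(src[i + 2]));
      i += 2;  // skip the code and the marker number
      continue;
    }
    // An ordinary character: every pending marker is attached to it.
    for (std::uint32_t m : pending) {
      out.emplace_back(src[i], m);
    }
    pending.clear();
    text.push_back(src[i]);
  }
  // Markers with nothing after them.
  for (std::uint32_t m : pending) {
    out.emplace_back(MARKER_AT_END, m);
  }
  return text;
}
```

On the src string from the test, this sketch yields the marker-free text e U+300 U+320 E U+300 U+320 and the same four pairs as expm.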

The stream of markers may be parsed as shown in the expm (expected markers) list. However, normalization reorders the text from:

e U+300 U+320 E U+300 U+320

to:

e U+320 U+300 E U+320 U+300
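The reordering happens because U+0320 has canonical combining class 220 and U+0300 has class 230, and canonical ordering sorts each run of combining marks by class. A toy sketch, assuming a hard-coded two-entry class table in place of a real Unicode library (toy_ccc and toy_canonical_reorder are illustrative names only):

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Toy combining-class lookup covering only the two marks in this thread.
// Every other code point is treated as a starter (class 0).
int toy_ccc(char32_t c) {
  switch (c) {
    case 0x0300: return 230;  // COMBINING GRAVE ACCENT
    case 0x0320: return 220;  // COMBINING MINUS SIGN BELOW
    default:     return 0;
  }
}

// Stable-sorts each run of non-starters by combining class, in place.
void toy_canonical_reorder(std::u32string &s) {
  auto it = s.begin();
  while (it != s.end()) {
    if (toy_ccc(*it) == 0) {
      ++it;  // starters stay where they are
      continue;
    }
    auto run_end = std::find_if(it, s.end(), [](char32_t c) { return toy_ccc(c) == 0; });
    std::stable_sort(it, run_end, [](char32_t a, char32_t b) { return toy_ccc(a) < toy_ccc(b); });
    it = run_end;
  }
}
```

The sort never moves a mark across a starter, which is why e and E each keep their own pair of marks.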

So how can the 'add back markers' step operate?

expm = {{0x300, 0x1L}, {0x320, 0x2L}, {0x300, 0x3L}, {0x320, 0x4L}}

Going from the right, it will first hit the U+300.
It could search for the {0x300, 0x3L} entry, use it, and remove it.
That leaves the list as {{0x300, 0x1L}, {0x320, 0x2L}, /*deleted*/, {0x320, 0x4L}}
Now when we hit the U+320, how do we know which entry applies? Both of them? If we keep the 'deleted' state, then /*deleted*/ could act as a boundary, so we consume only the {0x320, 0x4L}.
That leaves the list as {{0x300, 0x1L}, {0x320, 0x2L}, /*deleted*/, /*deleted*/}
Next we hit the other U+300, and repeat.

OK. Might work for this case.

But what about this one as an input string:

e \m{1} U+300 \m{2} U+320 a U+300 U+320 E \m{3} U+300 \m{4} U+320

The problem: when adding back markers, how do we know that \m{1} and \m{2} apply to e’s NSMs (nonspacing marks), and not a’s? There’s no information to distinguish them.
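To make the ambiguity concrete, here is a simplified sketch of the right-to-left matching idea. It assumes at most one marker per character and simply takes the rightmost unused entry for each character; the 'deleted as boundary' refinement is left out. greedy_add_back and marker_list are hypothetical names, not the code on the branch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

constexpr char32_t SENTINEL = 0xFFFF;     // assumed marker encoding:
constexpr char32_t CODE_MARKER = 0x0008;  // U+FFFF U+0008 <n> is \m{n}

using marker_list = std::vector<std::pair<char32_t, std::uint32_t>>;

// Walks the normalized text from the right. For each character, the rightmost
// unused entry with the same character is consumed and its marker re-inserted
// in front of that character.
std::u32string greedy_add_back(const std::u32string &text, const marker_list &markers) {
  std::vector<bool> used(markers.size(), false);
  std::u32string out;  // built back to front, reversed at the end
  for (auto ch = text.rbegin(); ch != text.rend(); ++ch) {
    out.push_back(*ch);
    for (std::size_t i = markers.size(); i-- > 0;) {
      if (!used[i] && markers[i].first == *ch) {
        used[i] = true;
        // Pushed in reverse order so the final string reads sentinel, code, number.
        out.push_back(static_cast<char32_t>(markers[i].second));
        out.push_back(CODE_MARKER);
        out.push_back(SENTINEL);
        break;
      }
    }
  }
  std::reverse(out.begin(), out.end());
  return out;
}
```

With the four-entry list and the text e U+320 U+300 E U+320 U+300, this reproduces the expect string from the test. With the a U+300 U+320 cluster in the middle, the same list puts \m{1} and \m{2} on a's marks and leaves e's marks bare, which is exactly the misassignment described above.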

So, I think we need segmentation.

Then the problem looks like this: (Character boundaries shown by |)

e \m{1} U+300 \m{2} U+320 | a U+300 U+320 | E \m{3} U+300 \m{4} U+320

The middle segment has no markers. The text is now processed as three segments, and each one already works on its own.

The branch I started on with the 'linear' marker_map above will handle duplicates and doubled markers well. But with the segment boundaries splitting up the problem, we don't have to worry about the marker being on the wrong segment.
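A rough sketch of the splitting step, under two assumptions: a new segment starts before every character with combining class 0 (a stand-in for a proper NFD-safe boundary test), and any markers directly in front of that character move into the new segment with it. The class table is a toy, and split_segments is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr char32_t SENTINEL = 0xFFFF;     // assumed marker encoding:
constexpr char32_t CODE_MARKER = 0x0008;  // U+FFFF U+0008 <n> is \m{n}

// Toy combining-class lookup; a real implementation would query Unicode data.
int toy_ccc(char32_t c) {
  switch (c) {
    case 0x0300: return 230;
    case 0x0320: return 220;
    default:     return 0;
  }
}

// Splits a marker-encoded string into segments, one per starter.
std::vector<std::u32string> split_segments(const std::u32string &s) {
  std::vector<std::u32string> segments;
  const std::size_t npos = std::u32string::npos;
  std::size_t seg_start = 0;
  std::size_t marker_run = npos;  // where the current run of markers began
  std::size_t i = 0;
  while (i < s.size()) {
    const bool is_marker =
        s[i] == SENTINEL && i + 2 < s.size() && s[i + 1] == CODE_MARKER;
    if (is_marker) {
      if (marker_run == npos) marker_run = i;
      i += 3;
      continue;
    }
    if (toy_ccc(s[i]) == 0) {
      // Boundary goes before the starter, or before the markers attached to it.
      const std::size_t boundary = (marker_run != npos) ? marker_run : i;
      if (boundary > seg_start) {
        segments.push_back(s.substr(seg_start, boundary - seg_start));
        seg_start = boundary;
      }
    }
    marker_run = npos;
    ++i;
  }
  if (seg_start < s.size()) segments.push_back(s.substr(seg_start));
  return segments;
}
```

Applied to the three-cluster input above, this yields three segments, with the marker-free a U+300 U+320 in the middle. Each segment can then be stripped, normalized and re-marked on its own, so a marker can never be matched against a mark in a neighbouring segment.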

srl295 · 2024-01-16T02:51:34Z

è̠à̠È̠ segments to è̠ à̠ È̠


Fixes #9999. Note TODO items:
- [ ] Renormalize cached_context across action boundary. Blocked by #10369.
- [ ] Add extra tests for surrogate pairs
- [ ] Move set_context_from_string into helper module
- [ ] If we don't apply normalization, we still need to fix up the app_context, to keep it coherent with cached_context (or at least we need to verify that app_context is never used in this situation)


srl295 added this to the A17S30 milestone Jan 12, 2024

srl295 self-assigned this Jan 12, 2024

keymanapp-test-bot bot added core/ Keyman Core feat labels Jan 12, 2024

srl295 mentioned this issue Jan 15, 2024

feat(core): unescape u 🙀 #10356

Merged

srl295 added a commit that referenced this issue Jan 16, 2024

feat(core): WIP attempt at cross segment markers 🙀

dac7dd9

#10369

srl295 mentioned this issue Jan 16, 2024

feat(core): cross-segment markers 🙀 #10394

Merged

srl295 added a commit that referenced this issue Jan 17, 2024

feat(core): charwise segments for markers 🙀

b0fd4f4

#10369

srl295 mentioned this issue Jan 17, 2024

feat(developer): normalization on the developer side 🙀 #10317

Closed

2 tasks

This was referenced Jan 17, 2024

feat(core): infrastructure for normalization of output 🌱 #10403

Merged

feat(core): renormalize cached_context across action boundary with markers #10422

Closed

srl295 added a commit that referenced this issue Jan 18, 2024

feat(core): WIP attempt at cross segment markers 🙀

990635e

#10369

srl295 added a commit that referenced this issue Jan 18, 2024

feat(core): ldml marker: refactor, use nfd props 🙀

3ccffe9

- split out parse_next_marker()
- use NFD safe boundaries to segment marker interaction

For: #10369

srl295 added a commit that referenced this issue Jan 18, 2024

feat(core): ldml marker: improve segment algorithm 🙀

2c56223

- easier to understand loop
- comments

For: #10369

srl295 added a commit that referenced this issue Jan 18, 2024

feat(core): ldml marker segments: fix build errs 🙀

64c808c

- remove unused function

For: #10369

srl295 linked a pull request Jan 18, 2024 that will close this issue

feat(core): cross-segment markers 🙀 #10394

Merged

srl295 added a commit that referenced this issue Jan 19, 2024

feat(core): ldml marker: assert that all markers were re-added 🙀

ade29e0

For: #10369

srl295 closed this as completed in #10394 Jan 19, 2024
