Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(core): segmented marker normalization 🙀 #10369

Closed
srl295 opened this issue Jan 12, 2024 · 3 comments · Fixed by #10394
Closed

feat(core): segmented marker normalization 🙀 #10369

srl295 opened this issue Jan 12, 2024 · 3 comments · Fixed by #10394
Assignees
Labels
core/ Keyman Core feat
Milestone

Comments

@srl295
Copy link
Member

srl295 commented Jan 12, 2024

The marker normalization needs to break into 'segments' of user perceived character boundaries. For example, e\m{a}\u{0301}a\m{b}\u{301} the a and b are in two different segments and so do not interact.

@srl295 srl295 added this to the A17S30 milestone Jan 12, 2024
@srl295 srl295 self-assigned this Jan 12, 2024
@keymanapp-test-bot keymanapp-test-bot bot added core/ Keyman Core feat labels Jan 12, 2024
@srl295
Copy link
Member Author

srl295 commented Jan 12, 2024

a generic algorithm that handles \m{a}\u{0301}\m{b}\u{0301} (which will be unchanged) should handle the above case also.

@srl295
Copy link
Member Author

srl295 commented Jan 16, 2024

a generic algorithm that handles \m{a}\u{0301}\m{b}\u{0301} (which will be unchanged) should handle the above case also.

Thinking alound here.. (fyi @mcdurdin ):

(yes this calls the _segment() api but it's illustrative)

  {
    marker_map map;
    std::cout << __FILE__ << ":" << __LINE__ << " - support 2-segment markers " << std::endl;
    // e\m{1}`\m{2}_E\m{3}`\m{4}_
    const std::u32string src =
        U"e\uFFFF\u0008\u0001\u0300\uFFFF\u0008\u0002\u0320E\uFFFF\u0008\u0003\u0300\uFFFF\u0008\u0004\u0320";
    // e\m{2}_\m{1}`E\m{4}_\m{3}`
    const std::u32string expect =
        U"e\uFFFF\u0008\u0002\u0320\uFFFF\u0008\u0001\u0300E\uFFFF\u0008\u0004\u0320\uFFFF\u0008\u0003\u0300";
    std::u32string dst = src;
    assert(normalize_nfd_markers_segment(dst, map));
    if (dst != expect) {
      std::cout << "dst: " << Debug_UnicodeString(dst) << std::endl;
      std::cout << "exp: " << Debug_UnicodeString(expect) << std::endl;
    }
    zassert_string_equal(dst, expect);
    marker_map expm({{0x300, 0x1L}, {0x320, 0x2L}, {0x300, 0x3L}, {0x320, 0x4L}});
    assert_marker_map_equal(map, expm);
  }

the stream of markers may be parsed as shown on the expm (expected markers) stream. However, the text is reordered by normalization from:

e U+300 U+320 E U+300 U+320

to:

e U+320 U+300 E U+320 U+300

so then how can the 'add back markers' operate?

expm = {{0x300, 0x1L}, {0x320, 0x2L}, {0x300, 0x3L}, {0x320, 0x4L}}

  1. so going from the right, it will first hit the U+300.
  2. It could search and find {0x300, 0x3L} entry and use that and remove it, ok
  3. That leaves the list as {{0x300, 0x1L}, {0x320, 0x2L}, /*deleted*/, {0x320, 0x4L}}
  4. now when we hit the U+320 how do we know which one applies? Both of them? If we keep the 'deleted' state, then /*deleted*/ could be a boundary then we consume only the {0x320,0x4L}
  5. That leaves the list as {{0x300, 0x1L}, {0x320, 0x2L}, /*deleted*/, /*deleted*/}
  6. next we hit the 0x300, and repeat..

OK. Might work for this case.


But what about this one as an input string:

e \m{1} U+300 \m{2} U+320 a U+300 U+320 E \m{3} U+300 \m{4} U+320

The problem is that, when adding back markers, how do we know that \m{1} and \m{2} apply to e’s NSM, and not a’s? There’s no information to distinguish.

So, I think we need segmentation.

Then the problem looks like this: (Character boundaries shown by |)

e \m{1} U+300 \m{2} U+320 | a U+300 U+320 | E \m{3} U+300 \m{4} U+320

The middle segment has no markers. So now processed in three segments, each one will already work.

The branch I started on with the 'linear' marker_map above will handle duplicates and doubled markers well. But with the segment boundaries splitting up the problem, we don't have to worry about the marker being on the wrong segment.

@srl295
Copy link
Member Author

srl295 commented Jan 16, 2024

è̠à̠È̠ segments to è̠ à̠ È̠

image

mcdurdin added a commit that referenced this issue Jan 17, 2024
Fixes #9999.

Note TODO items:
- [ ] Renormalize cached_context across action boundary. Blocked by #10369.
- [ ] Add extra tests for surrogate pairs
- [ ] Move set_context_from_string into helper module
- [ ] if we don't apply normalization, we still need to fixup the
      app_context, to keep it coherent with cached_context (or at least
      we need to verify that app_context is never used in this
      situation)
srl295 added a commit that referenced this issue Jan 18, 2024
- split out parse_next_marker()
- use NFD safe boundaries to segment marker interaction

For: #10369
srl295 added a commit that referenced this issue Jan 18, 2024
- easier to understand loop
- comments

For: #10369
srl295 added a commit that referenced this issue Jan 18, 2024
@srl295 srl295 linked a pull request Jan 18, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core/ Keyman Core feat
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant