Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(core): support normalization in Core 🌱 #9999

Closed
7 tasks done
mcdurdin opened this issue Nov 14, 2023 · 0 comments · Fixed by #10390 or #10403
Closed
7 tasks done

feat(core): support normalization in Core 🌱 #9999

mcdurdin opened this issue Nov 14, 2023 · 0 comments · Fixed by #10390 or #10403
Assignees
Labels
core/ Keyman Core feat
Milestone

Comments

@mcdurdin
Copy link
Member

mcdurdin commented Nov 14, 2023

Impacted modules

  • Core Consumers:
    • Keyman Engine for Windows
    • Keyman Engine for macOS
    • Keyman Engine for Linux
    • Keyman Developer Debugger
  • Keyman Core
  • Core Keyboard Processors:
    • LDML Keyboard Processor
  • Unit tests

Steps

  1. Add a Core API to convert km_core_actions to km_core_action_item* (mirrors action_item_list_to_actions_object)

  2. Update Engines to always call km_core_state_context_set_if_needed, and stop managing their own cached context.

  3. LDML processor can then generate km_core_actions and existing engines can consume it with no further modification.

  4. Normalization support for LDML keyboards:

  • input context (km_core_state_set_context_if_needed)
  • and action output (km_core_state_get_actions and actions_object_to_action_item_list)

17.0 target API usage is:

  • kmn: emit queue
  • ldml: emit struct
  • windows: read queue
  • mac: read struct
  • linux: read queue
  • debugger: read queue
  • ldml unit tests: read struct
  • Core will translate only as needed, so this way, the big blocker (debugger) can wait until 18.0.

17.0 TODO

Move to km_core_state_context_set_if_needed

Longer term

  1. Move to action struct in Engine for Windows, Developer Debugger, Engine for Linux (#10353) {tech debt}
  • Removes dead wood apis
  • Possible difficulties with the Keyman Developer debugger (will need sideband data on cached context as debugger is, and must continue to be, marker-aware)
  1. Move to action struct for output from kmn Keyboard Processor. Then the concept of Action Queue moves to an internal component of kmn Keyboard Processor (and the debugger will query kmn keyboard processor directly).

Implementation notes

in km_core_state_context_set_if_needed we'll need to do something like this:

  // TODO: normalization
  km_core_cp* text;
  switch(get_keyboard_processor_normalization()) {
    case NFD: text = normalize_nfd(application_context); break;
    case NFC: text = normalize_nfc(application_context); break;
    default: /*NFW*/ text = application_context;
  }

In km_core_state_get_actions and actions_object_to_action_item_list, Core will need to track the context back to the last base codepoint (per Unicode spec) to give a starting anchor for normalization of output.

Core will need to maintain a copy of the exact text from the app as well as the NFD cached context for LDML, and need to sync those up to determine the number of codepoints to delete from the app. Remember that the input context is in NFU "Normalization Form Unknown" -- and may well have mixed content, and we need to know exactly how many codepoints to delete from that mixed content.

Application context synchronization

Note: this discussion is working in UTF32; if working in UTF16 then consideration needs to be given to whether counts are in codepoints or codeunits.

Given an application context in NFU and a cached context in NFD (for the purposes of this discussion, stripped of markers), we need to be able to calculate the number of characters to delete.

  • The transform function has two parameters: del (number of characters) and ins (string to insert).
  • cached_context is always NFD.
  • app_context is always NFU.

The LDML keyboard processor will return a number of NFD codepoints to delete. However, this number cannot be passed straight to the app, because the normalization of the string may not not match up. We also need to consider normalization with string concatenation. This requires a two step process:

  1. Normalize insertion side of the concatenation boundary (https://unicode.org/reports/tr15/#Concatenation):
  • remove del characters from cached_context
  • while cached_context is not empty, and the last character of cached_context is not a stable character, prepend it to the ins and increment del.
let chopped_cached_context = cached_context.substr(0, cached_context.length - del)
while(chopped_cached_context != "" && !is_stable(chopped_cached_context.last())) {
  let ch = chopped_cached_context.pop()
  ins.prepend(ch)
  del++
}
  1. Calculate the number of characters to delete from app_context, and additional fixups that need to be prepended to the inserted text ins because the change happened over a normalization boundary:
let app_del = 0
let new_app_context = app_context
let chopped_cached_context = cached_context.substr(0, cached_context.length - del)

while(nfd(new_app_context).length > chopped_cached_context.length) {
  new_app_context.pop()
  app_del++
}

ins.prepend(substr(chopped_cached_context, nfd(new_app_context).length))
  • app_del will then be the number of characters to delete from the document
  • ins will be the text to insert

UTF-16 Considerations

Core has enough data to provide the number of codepoints to delete and number of UTF-16 codeunits to delete. Recommend that this information is provided in the action object to consumers, as that covers both of our deletion use cases (injecting backspaces vs direct string manipulation).

@keymanapp-test-bot keymanapp-test-bot bot added core/ Keyman Core feat labels Nov 14, 2023
@mcdurdin mcdurdin self-assigned this Nov 14, 2023
@mcdurdin mcdurdin added this to the A17S26 milestone Nov 23, 2023
@darcywong00 darcywong00 modified the milestones: A17S26, A17S27 Nov 27, 2023
@mcdurdin mcdurdin modified the milestones: A17S27, A17S28 Dec 6, 2023
@mcdurdin mcdurdin modified the milestones: A17S28, A17S30 Dec 13, 2023
mcdurdin added a commit that referenced this issue Jan 15, 2024
First half of #9999.

Adds support for normalization (to NFD) of input app context into the
cached context. The keyboard processor will work with the NFD cached
context.

Adds unit tests for the normalization as part of the LDML keyboard
processor test suite.

TODO:
* Comparing modified cached context to app context to determine the
  transform required to send to the app
* Handling illegal unicode and unpaired surrogates on input context
mcdurdin added a commit that referenced this issue Jan 15, 2024
@mcdurdin mcdurdin changed the title feat(core): support normalization in Core feat(core): support normalization in Core 🌱 Jan 16, 2024
mcdurdin added a commit that referenced this issue Jan 16, 2024
mcdurdin added a commit that referenced this issue Jan 16, 2024
Relates to #9999.
Fixes #10384.

The context API endpoints should no longer be considered as part of the
standard Core API. The only consumers that have a need to access these
APIs are the IMX integration in Engine for Windows, and the Keyman
Developer Debugger.

These symbols are currently used by Developer:
* `km_core_context` struct
* `km_core_context_type` enum
* `km_core_context_item` struct
* `KM_CORE_CONTEXT_ITEM_END` macro
* `km_core_state_context()`
* `km_core_context_set()`
* `km_core_context_clear()`

These symbols are currently used by Windows IMX:
* `km_core_context` struct
* `km_core_context_type` enum
* `km_core_context_item` struct
* `KM_CORE_CONTEXT_ITEM_END` macro
* `km_core_context_items_dispose()`
* `km_core_context_item_list_size()`
* `km_core_state_get_intermediate_context()`

The following functions and symbols are moving to
keyman_core_api_context.h:
* `km_core_context` struct
* `km_core_context_type` enum
* `km_core_context_item` struct
* `KM_CORE_CONTEXT_ITEM_END` macro
* `km_core_state_context()` function
* `km_core_state_get_intermediate_context()` function
* `km_core_context_set()` function
* `km_core_context_clear()` function
* `km_core_context_get()` function
* `km_core_context_items_from_utf16()` function
* `km_core_context_items_from_utf8()` function
* `km_core_context_items_to_utf8()` function
* `km_core_context_items_to_utf16()` function
* `km_core_context_items_to_utf32()` function
* `km_core_context_items_dispose()` function
* `km_core_context_length()` function
* `km_core_context_append()` function
* `km_core_context_shrink()` function
* `km_core_context_item_list_size()` function
mcdurdin added a commit that referenced this issue Jan 16, 2024
mcdurdin added a commit that referenced this issue Jan 16, 2024
Relates to #9999.

Establishes functions, unit test sources, normalization entry point and
an effectively no-op unit test for normalization support.
mcdurdin added a commit that referenced this issue Jan 17, 2024
Fixes #9999.

Note TODO items:
- [ ] Renormalize cached_context across action boundary. Blocked by #10369.
- [ ] Add extra tests for surrogate pairs
- [ ] Move set_context_from_string into helper module
- [ ] if we don't apply normalization, we still need to fixup the
      app_context, to keep it coherent with cached_context (or at least
      we need to verify that app_context is never used in this
      situation)
mcdurdin added a commit that referenced this issue Jan 17, 2024
mcdurdin added a commit that referenced this issue Jan 17, 2024
Relates to #9999.
Fixes #10384.

The context API endpoints should no longer be considered as part of the
standard Core API. The only consumers that have a need to access these
APIs are the IMX integration in Engine for Windows, and the Keyman
Developer Debugger.

These symbols are currently used by Developer:
* `km_core_context` struct
* `km_core_context_type` enum
* `km_core_context_item` struct
* `KM_CORE_CONTEXT_ITEM_END` macro
* `km_core_state_context()`
* `km_core_context_set()`
* `km_core_context_clear()`

These symbols are currently used by Windows IMX:
* `km_core_context` struct
* `km_core_context_type` enum
* `km_core_context_item` struct
* `KM_CORE_CONTEXT_ITEM_END` macro
* `km_core_context_items_dispose()`
* `km_core_context_item_list_size()`
* `km_core_state_get_intermediate_context()`

The following functions and symbols are moving to
keyman_core_api_context.h:
* `km_core_context` struct
* `km_core_context_type` enum
* `km_core_context_item` struct
* `KM_CORE_CONTEXT_ITEM_END` macro
* `km_core_state_context()` function
* `km_core_state_get_intermediate_context()` function
* `km_core_context_set()` function
* `km_core_context_clear()` function
* `km_core_context_get()` function
* `km_core_context_items_from_utf16()` function
* `km_core_context_items_from_utf8()` function
* `km_core_context_items_to_utf8()` function
* `km_core_context_items_to_utf16()` function
* `km_core_context_items_to_utf32()` function
* `km_core_context_items_dispose()` function
* `km_core_context_length()` function
* `km_core_context_append()` function
* `km_core_context_shrink()` function
* `km_core_context_item_list_size()` function
mcdurdin added a commit that referenced this issue Jan 17, 2024
@mcdurdin mcdurdin linked a pull request Jan 18, 2024 that will close this issue
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core/ Keyman Core feat
Projects
None yet
2 participants