System.ConsoleKeyInfo can not handle Unicode surrogate pair and Emoji Sequences #27828

WenceyWang · 2018-11-06T17:24:44Z

Windows has a soft keyboard that can type emoji directly.

An emoji is a sequence of chars that cannot be stored in a single char, but System.ConsoleKeyInfo uses a single char to store the content of the pressed key:

public char KeyChar { get; }

In my test, System.Console.ReadKey returns a System.ConsoleKeyInfo whose KeyChar is the first char of the emoji (a sequence of surrogate pairs, possibly 10+ chars), and I have no way to get the other chars.

The problem with ReadKey is that it may return the first half of a surrogate pair, and the next ReadKey call returns the next keypress, not the remaining part of the surrogate pair.

This produces garbage output and is impossible to work around in user code.

This problem also applies to keyboards for other scripts and languages.
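To make the "first half of a surrogate pair" concrete, here is a minimal sketch of how one supplementary code point splits into two UTF-16 code units. The class name `SurrogateHalves` and its methods are made up for illustration; only `char.ConvertFromUtf32` and `char.ConvertToUtf32` are real .NET calls.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class SurrogateHalves
{
    // Splits a supplementary code point (above U+FFFF) into the two
    // UTF-16 code units that a char-typed property can only hold one at a time.
    public static (char High, char Low) Split(int codePoint)
    {
        string utf16 = char.ConvertFromUtf32(codePoint);
        if (utf16.Length != 2)
            throw new ArgumentException("Code point fits in one char.", nameof(codePoint));
        return (utf16[0], utf16[1]);
    }

    // Recombines the two halves into the original code point.
    public static int Join(char high, char low) => char.ConvertToUtf32(high, low);
}
```

For U+1F600 (grinning face), `Split` yields the pair U+D83D and U+DE00. A KeyChar holding only U+D83D is not a valid character on its own.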


Clockwork-Muse · 2018-11-06T19:38:46Z

...this can't be unique to emoji, presumably? Shouldn't some upper-range characters and combining-characters suffer the same fate?

Also, I'm assuming this would be triggered by copy/paste input, so I don't think it's just soft keyboards we have to worry about.

carlossanlop · 2020-04-01T17:50:45Z

This is a problem with surrogate pairs in general. It's not clear how we would address this issue without introducing some form of a breaking change.

Serentty · 2020-07-03T04:25:03Z

To me, it seems like the logical course of action would be to introduce a new API based around Rune instead of char, and gradually deprecate ReadKey() in favour of ReadRune(). The large existing code bases would mean it would probably never be able to be removed completely, but the deprecation strikethrough in Visual Studio could help encourage the use of that new API.

Now that I think about it, it would probably also make sense to have a key-reading API that returns a string, since many languages such as Guarani use characters which one might input with a single keypress, and yet are composed of multiple code points. Having ReadRune() is more pressing though, I think. Not handling combining characters is a matter of functionality, but treating code points as 16-bit is just a matter of being outdated.

However, both seem to be an issue for PSReadLine, which leads to a longstanding bug in PowerShell.

BDisp · 2020-11-10T19:23:07Z

Is there a way to get the surrogate pair from System.ConsoleKeyInfo? I'm only getting the first char. Or is there some method to get the next chars?

Edit:
Console.KeyAvailable is the answer. For surrogate pairs we need to use Rune, but sometimes I only have its length value after writing, and it would be better to get it before.
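A rough sketch of the stitching this workaround implies, assuming the console really does deliver the second half as a later KeyChar (the original report says that in some cases it does not). `KeyStitcher` is a hypothetical name. It works on any sequence of chars so that it can be tried without a console; in a live loop the chars would come from `Console.ReadKey(true).KeyChar` while `Console.KeyAvailable` is true.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, for illustration only.
public static class KeyStitcher
{
    // Combines consecutive UTF-16 code units into whole Unicode scalar values.
    // Unpaired surrogates are replaced with U+FFFD.
    public static IEnumerable<Rune> ToRunes(IEnumerable<char> keyChars)
    {
        char? pendingHigh = null;
        foreach (char c in keyChars)
        {
            if (pendingHigh is char high)
            {
                pendingHigh = null;
                if (char.IsLowSurrogate(c))
                {
                    yield return new Rune(high, c);
                    continue;
                }
                // High surrogate was not followed by its partner.
                yield return Rune.ReplacementChar;
            }

            if (char.IsHighSurrogate(c))
                pendingHigh = c;                    // wait for the low half
            else if (char.IsLowSurrogate(c))
                yield return Rune.ReplacementChar;  // stray low surrogate
            else
                yield return new Rune(c);           // ordinary BMP char
        }

        if (pendingHigh.HasValue)
            yield return Rune.ReplacementChar;      // input ended mid-pair
    }
}
```

This only recovers code points. It does nothing for multi-code-point emoji sequences, and it cannot help if the second half never arrives.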

WenceyWang · 2021-04-11T16:36:17Z

> To me, it seems like the logical course of action would be to introduce a new API based around Rune instead of char, and gradually deprecate ReadKey() in favour of ReadRune(). The large existing code bases would mean it would probably never be able to be removed completely, but the deprecation strikethrough in Visual Studio could help encourage the use of that new API.
>
> Now that I think about it, it would probably also make sense to have a key-reading API that returns a string, since many languages such as Guarani use characters which one might input with a single keypress, and yet are composed of multiple code points. Having ReadRune() is more pressing though, I think. Not handling combining characters is a matter of functionality, but treating code points as 16-bit is just a matter of being outdated.
>
> However, both seem to be an issue for PSReadLine, which leads to a longstanding bug in PowerShell.

ReadKey is important for us, as there are many keys on a keyboard that do not map directly to Unicode characters, and we do need to know whether the Shift, Ctrl, or Alt key is pressed.

The key code and key char design dates back to the IBM PC keyboard controller. It is not the *nix way, where the console device and the program encode these key presses as escape sequences and pass them as a stream.

I think the proper way is to add a string or ReadOnlySpan typed property to System.ConsoleKeyInfo that contains all the content produced by a key press. This would not break currently working code and would let new code be aware of Unicode keyboards.
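As a rough illustration of the shape such a property could take, here is a sketch written as a separate wrapper type rather than a change to ConsoleKeyInfo itself. Everything named here (`KeyPressText`, `Text`, `FirstChar`) is hypothetical and not part of System.Console; the actual proposal is the one filed as #51085.

```csharp
using System;

// Hypothetical type, NOT an existing .NET API: pairs the existing
// ConsoleKeyInfo with the full text produced by the key press.
public readonly struct KeyPressText
{
    private readonly string _text;

    public KeyPressText(ConsoleKeyInfo key, string text)
    {
        Key = key;
        _text = text;
    }

    // Key code and modifier state, exactly as ReadKey reports them today.
    public ConsoleKeyInfo Key { get; }

    // All UTF-16 code units produced by this key press.
    public string Text => _text ?? string.Empty;

    // Mirrors today's KeyChar so existing char-based logic keeps working.
    public char FirstChar => Text.Length > 0 ? Text[0] : '\0';
}
```

Existing callers would keep reading the single char, while new code could read `Text` to get the whole surrogate pair or sequence.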

GrabYourPitchforks · 2021-04-12T00:18:49Z

I'm assuming the scenario is "I want to be able to get an entire grapheme cluster via ReadKey." (See this doc for the difference between char, Rune, and "grapheme cluster".) There are going to be a few complications with this scenario, regardless of the proposed API.

If you return char or Rune, you may only be returning part of the information. As mentioned earlier, char by itself isn't sufficient to represent supplementary code points. And Rune by itself isn't sufficient to represent many Emoji, which often consist of a base character + extra modifiers like skin tone, gender, etc.

An alternative solution might be to return string, but this runs into an interesting edge case. Say that I type an 'e' character, then I paste the accent modifier so that it is displayed as "é" (that's 2 chars!). What should the behavior here be? Should ReadKey return "e" by itself because that's what I typed first, then return "\u0301" on the second call because that's the accent modifier I pasted after typing 'e'? Similar situation to entering a base Emoji code point, then later entering a skin tone modifier. The second entry affects how the first one was intended to be displayed.

Now, is this acceptable? Maybe your application doesn't interpret these as individual to-be-displayed characters and stitches them all together after the fact anyway. But if you're stitching them together, presumably you could stitch together char-by-char using the current API, and no string-by-string / Rune-by-Rune API is needed.

This is one of those weird things where the requested change needs to be laid out very specifically. Things that might seem obvious to one person might not seem obvious to another, and it could have a ripple effect which upends the proposed solution.
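The three levels discussed above (char, Rune, grapheme cluster) can be counted side by side. This is a minimal sketch using `string.Length`, `string.EnumerateRunes`, and `StringInfo.LengthInTextElements`; the class name `TextUnits` is made up for illustration.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, for illustration only.
public static class TextUnits
{
    // UTF-16 code units: what KeyChar delivers one at a time.
    public static int CountChars(string s) => s.Length;

    // Unicode scalar values: what a Rune-based API would deliver.
    public static int CountRunes(string s)
    {
        int count = 0;
        foreach (Rune rune in s.EnumerateRunes())
            count++;
        return count;
    }

    // Text elements as the runtime's StringInfo segments them.
    public static int CountTextElements(string s) =>
        new StringInfo(s).LengthInTextElements;
}
```

For "e\u0301" this gives 2 chars, 2 runes, and 1 text element. For a thumbs-up with a skin tone modifier ("\U0001F44D\U0001F3FD") it gives 4 chars and 2 runes. The text element count depends on the runtime: I would expect 1 on .NET 5 and later, where StringInfo follows extended grapheme cluster rules, while older runtimes may report 2.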

WenceyWang · 2021-04-12T05:09:03Z

The problem with ReadKey is that it may return the first half of a surrogate pair and drop the next char, which produces garbage output and is impossible to work around in user code.

In short, I think we should at least make ReadKey work correctly without breaking currently running code.

I think it's OK to get "e" and then "\u0301" from ReadKey if they are entered as two key presses.

The problem is that we do need a keypress-by-keypress API, which lets programs respond to keypresses instead of a char-based input stream. This is how console input is handled on Windows: https://docs.microsoft.com/en-us/windows/console/readconsoleinput returns an array of key press or mouse click events instead of a char stream.

Console.Read cannot report the key code or modifier key status, and we cannot rely on escape sequences.

I have submitted my API suggestion at #51085 which I think makes sense for us.

WenceyWang · 2021-04-12T05:27:39Z

@DHowett I think the problem is rooted in https://docs.microsoft.com/en-us/windows/console/key-event-record-str, which uses a single WCHAR to store the translated Unicode character, whereas the input can now be a surrogate pair or a sequence of Unicode code points.

DHowett · 2021-04-12T12:24:12Z

Other applications seem capable of handling surrogate pairs in the WCHAR-typed field of two KEY_EVENT_RECORDs just fine.

BDisp · 2021-04-12T12:34:19Z

@DHowett that's true, we can handle any surrogate pair spread over two or more KEY_EVENT_RECORDs, but how do we deal with the same Unicode code point returning a different string length with other font types?

DHowett · 2021-04-12T12:36:09Z

The font cannot change the length of a string.

If you're talking about the column count (perceived space taken up by the string if printed to a console), that's just off topic for this issue :). That's also one of the harder issues in terminal emulation.

GrabYourPitchforks · 2021-04-13T05:01:49Z

If you really want your mind to melt, spec out what behavior your app will have when the user hits BACKSPACE immediately after entering a complex multi-scalar emoji. :)
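One possible answer, sketched under the assumption that BACKSPACE should delete one whole text element as the runtime's StringInfo defines it. `LineEditor` and `RemoveLastTextElement` are hypothetical names; `StringInfo.ParseCombiningCharacters` is the real call and returns the start index of each text element.

```csharp
using System.Globalization;

// Hypothetical helper, for illustration only.
public static class LineEditor
{
    // Removes the last text element from the buffer, so that a base
    // character and its combining marks are deleted together.
    public static string RemoveLastTextElement(string buffer)
    {
        if (string.IsNullOrEmpty(buffer))
            return string.Empty;

        // Start index of every text element; never empty for non-empty input.
        int[] starts = StringInfo.ParseCombiningCharacters(buffer);
        return buffer.Substring(0, starts[starts.Length - 1]);
    }
}
```

With the buffer "ae\u0301" this returns "a", removing the 'e' together with its accent. How a multi-scalar emoji is segmented depends on the runtime version, and whether an application would rather peel off one modifier at a time is exactly the kind of decision that needs to be specified.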

WenceyWang · 2021-04-13T08:52:51Z

Let me restate the issue:

The problem with ReadKey is that it may return the first half of a surrogate pair, and the next ReadKey call returns the next keypress, not the remaining part of the surrogate pair.

This produces garbage output and is impossible to work around in user code.

BDisp · 2021-04-13T09:03:24Z

@WenceyWang you may need to deal with escape sequences to get all the bytes needed for the surrogate pair.

msftgits transferred this issue from dotnet/corefx Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

maryamariyan added the untriaged label Feb 23, 2020

carlossanlop removed the untriaged label Apr 1, 2020

WenceyWang mentioned this issue Apr 11, 2021

Let System.ConsoleKeyInfo able to represent keypress in the Unicode world #51085

Closed

WenceyWang changed the title ~~System.ConsoleKeyInfo can not handle Emoji key on soft keyboard~~ System.ConsoleKeyInfo can not handle Unicode surrogate pair and Emoji Sequences Apr 12, 2021

adamsitnik mentioned this issue May 6, 2021

[Discussion] System.Console re-design #52374

Open
