System.ConsoleKeyInfo can not handle Unicode surrogate pair and Emoji Sequences #27828

Open
WenceyWang opened this issue Nov 6, 2018 · 14 comments

Comments

@WenceyWang

WenceyWang commented Nov 6, 2018

Windows has a soft keyboard that can type emoji directly.

An emoji is a sequence of chars that cannot be stored in a single char, yet System.ConsoleKeyInfo uses a single char to store the content of the pressed key.

public char KeyChar { get; }

In my test, System.Console.ReadKey returns a System.ConsoleKeyInfo whose KeyChar is the first char of the emoji (a sequence of surrogate pairs, possibly 10+ chars long), and I have no way to get the other chars.

The problem with ReadKey is that it may return the first half of a surrogate pair, and the next ReadKey call then returns the next keypress, not the remaining half of the surrogate pair.

This produces garbage output and is impossible to work around in user code.

This problem also applies to soft keyboards for other scripts and languages.
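
To make the report concrete, here is a minimal C# sketch (not taken from the report) of why a lone KeyChar is unusable for characters outside the Basic Multilingual Plane. The class and method names are made up for illustration, and the sketch assumes .NET Core 3.0 or later, where System.Text.Rune is available.

```csharp
using System;
using System.Text;

// Illustrative helper; the names are hypothetical.
public static class SurrogateDemo
{
    // Number of UTF-16 code units (chars) needed to store the text.
    public static int CodeUnitCount(string text) => text.Length;

    // True when a single char is a complete Unicode scalar value on its own.
    // A lone high or low surrogate is not.
    public static bool IsCompleteScalar(char c) => Rune.TryCreate(c, out _);

    // Recombines both halves of a surrogate pair into the code point.
    public static int Combine(char high, char low) => char.ConvertToUtf32(high, low);
}
```

For the string "\U0001F600" (grinning face), CodeUnitCount returns 2, IsCompleteScalar on the first char returns false, and Combine on both chars yields 0x1F600. A KeyChar that holds only the first of the two chars therefore carries no displayable character by itself.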

@Clockwork-Muse
Contributor

...this can't be unique to emoji, presumably? Shouldn't some upper-range characters and combining-characters suffer the same fate?

Also, I'm assuming this would be triggered by copy/paste input, so I don't think it's just soft keyboards we have to worry about.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@carlossanlop
Member

This is a problem with surrogate pairs in general. It's not clear how we would address this issue without introducing some form of a breaking change.

@carlossanlop carlossanlop removed the untriaged New issue has not been triaged by the area owner label Apr 1, 2020
@Serentty
Contributor

Serentty commented Jul 3, 2020

To me, it seems like the logical course of action would be to introduce a new API based around Rune instead of char, and gradually deprecate ReadKey() in favour of ReadRune(). The large existing code bases would mean it would probably never be able to be removed completely, but the deprecation strikethrough in Visual Studio could help encourage the use of that new API.

Now that I think about it, it would probably also make sense to have a key-reading API that returns a string, since many languages such as Guarani use characters which one might input with a single keypress, and yet are composed of multiple code points. Having ReadRune() is more pressing though, I think. Not handling combining characters is a matter of functionality, but treating code points as 16-bit is just a matter of being outdated.

However, both seem to be an issue for PSReadLine, which leads to a longstanding bug in PowerShell.

@BDisp

BDisp commented Nov 10, 2020

Is there a way to get a surrogate pair from System.ConsoleKeyInfo? I'm only getting the first char. Is there some method to get the following chars?

Edit:
Console.KeyAvailable is the answer. For surrogate pairs we need to use Rune, but sometimes I only learn its length after writing it, and it would be better to know it beforehand.
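
One way to read the Console.KeyAvailable suggestion is sketched below. RuneReader is a hypothetical helper, not part of System.Console, and it takes the key source as delegates so that it can be driven by Console.ReadKey and Console.KeyAvailable or by canned input. Whether the second half of a surrogate pair is actually delivered as a separate key event depends on the platform and runtime, which is exactly what this thread disputes, so treat this as a sketch under that assumption.

```csharp
using System;
using System.Text;

// Hypothetical helper; not part of System.Console.
public sealed class RuneReader
{
    private readonly Func<ConsoleKeyInfo> _readKey;
    private readonly Func<bool> _keyAvailable;
    private ConsoleKeyInfo? _pending;   // a key read ahead that was not a low surrogate

    public RuneReader(Func<ConsoleKeyInfo> readKey, Func<bool> keyAvailable)
    {
        _readKey = readKey;
        _keyAvailable = keyAvailable;
    }

    public Rune ReadRune()
    {
        ConsoleKeyInfo key = _pending ?? _readKey();
        _pending = null;

        char first = key.KeyChar;
        if (!char.IsHighSurrogate(first))
        {
            // BMP character, or a stray low surrogate (mapped to U+FFFD).
            return Rune.TryCreate(first, out Rune single) ? single : Rune.ReplacementChar;
        }

        // Only look ahead when input is already queued, so the call never
        // blocks waiting for a second half that may not come.
        if (_keyAvailable())
        {
            ConsoleKeyInfo next = _readKey();
            if (Rune.TryCreate(first, next.KeyChar, out Rune pair))
            {
                return pair;
            }
            _pending = next;   // unrelated key: keep it for the next call
        }
        return Rune.ReplacementChar;
    }
}
```

With a real console this could be constructed as new RuneReader(() => Console.ReadKey(intercept: true), () => Console.KeyAvailable). Note that it drops the key code and modifier state, so it complements ReadKey rather than replacing it.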

@WenceyWang
Author

WenceyWang commented Apr 11, 2021

> To me, it seems like the logical course of action would be to introduce a new API based around Rune instead of char, and gradually deprecate ReadKey() in favour of ReadRune(). The large existing code bases would mean it would probably never be able to be removed completely, but the deprecation strikethrough in Visual Studio could help encourage the use of that new API.
>
> Now that I think about it, it would probably also make sense to have a key-reading API that returns a string, since many languages such as Guarani use characters which one might input with a single keypress, and yet are composed of multiple code points. Having ReadRune() is more pressing though, I think. Not handling combining characters is a matter of functionality, but treating code points as 16-bit is just a matter of being outdated.
>
> However, both seem to be an issue for PSReadLine, which leads to a longstanding bug in PowerShell.

ReadKey is important for us, as there are many keys on a keyboard that are not directly mapped to Unicode characters, and we do need to know whether the Shift, Ctrl, or Alt key is pressed.

The design of key code plus key char dates back to the IBM PC keyboard controller. It is not the *nix way, where the console device and the program encode these key presses as escape sequences and pass them as a stream.

I think the proper way is to add a string- or ReadOnlySpan-typed property to System.ConsoleKeyInfo that contains all the content produced by a key press. This would not break currently working code and would let new code be aware of Unicode keyboards.

@GrabYourPitchforks
Member

I'm assuming the scenario is "I want to be able to get an entire grapheme cluster via ReadKey." (See this doc for the difference between char, Rune, and "grapheme cluster".) There are going to be a few complications with this scenario, regardless of the proposed API.

If you return char or Rune, you may only be returning part of the information. As mentioned earlier, char by itself isn't sufficient to represent supplementary code points. And Rune by itself isn't sufficient to represent many Emoji, which often consist of a base character + extra modifiers like skin tone, gender, etc.
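
The three levels mentioned here can be compared with a short C# sketch. TextUnits is a hypothetical name, and the grapheme count assumes .NET 5 or later, where StringInfo follows the Unicode extended grapheme cluster rules; older runtimes may report more text elements for emoji sequences.

```csharp
using System;
using System.Globalization;
using System.Text;

// Illustrative helper; the name is hypothetical.
public static class TextUnits
{
    // Counts the same text at three levels: UTF-16 code units (char),
    // Unicode scalar values (Rune), and text elements (grapheme clusters).
    public static (int Chars, int Runes, int Graphemes) Count(string text)
    {
        int runes = 0;
        foreach (Rune _ in text.EnumerateRunes())
        {
            runes++;
        }
        return (text.Length, runes, new StringInfo(text).LengthInTextElements);
    }
}
```

On .NET 5 or later, "e\u0301" (e plus combining acute accent) should count as 2 chars, 2 runes, and 1 grapheme, while "\U0001F44D\U0001F3FD" (thumbs up plus a skin tone modifier) should count as 4 chars, 2 runes, and 1 grapheme. Neither a char-based nor a Rune-based return value covers the second case in one call.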

An alternative solution might be to return string, but this runs into an interesting edge case. Say that I type an 'e' character, then I paste the accent modifier so that it is displayed as "é" (that's 2 chars!). What should the behavior here be? Should ReadKey return "e" by itself because that's what I typed first, then return "\u0301" on the second call because that's the accent modifier I pasted after typing 'e'? Similar situation to entering a base Emoji code point, then later entering a skin tone modifier. The second entry affects how the first one was intended to be displayed.

Now, is this acceptable? Maybe your application doesn't interpret these as individual to-be-displayed characters and stitches them all together after the fact anyway. But if you're stitching them together, presumably you could stitch together char-by-char using the current API, and no string-by-string / Rune-by-Rune API is needed.

This is one of those weird things where the requested change needs to be laid out very specifically. Things that might seem obvious to one person might not seem obvious to another, and it could have a ripple effect which upends the proposed solution.

@WenceyWang
Author

WenceyWang commented Apr 12, 2021

The problem with ReadKey is that it may return the first half of a surrogate pair and drop the next char, which produces garbage output and is impossible to work around in user code.

In short, I think we should at least make ReadKey work correctly without breaking currently running code.

I think it's OK to get "e" and then "\u0301" from ReadKey if they are entered as two key presses.

The problem is that we do need a keypress-by-keypress API, one that lets programs respond to key presses instead of char-based stream input. This is the way we handle console input on Windows: https://docs.microsoft.com/en-us/windows/console/readconsoleinput returns an array of key press or mouse click events instead of a char stream.

Console.Read cannot read the key code, modifier key state, and so on, and we cannot rely on escape sequences.

I have submitted my API suggestion at #51085 which I think makes sense for us.

@WenceyWang
Author

WenceyWang commented Apr 12, 2021

@DHowett I think the problem is rooted in https://docs.microsoft.com/en-us/windows/console/key-event-record-str, which uses a single WCHAR to store the translated Unicode character, whereas the input can now be a surrogate pair or a sequence of Unicode code points.

@WenceyWang WenceyWang changed the title from "System.ConsoleKeyInfo can not handle Emoji key on soft keyboard" to "System.ConsoleKeyInfo can not handle Unicode surrogate pair and Emoji Sequences" Apr 12, 2021
@DHowett

DHowett commented Apr 12, 2021

Other applications seem capable of handling surrogate pairs in the WCHAR-typed field of two KEY_EVENT_RECORDs just fine.

@BDisp

BDisp commented Apr 12, 2021

@DHowett that's true, we can handle surrogate pairs spread over two or more KEY_EVENT_RECORDs, but how do we deal with the same Unicode code point yielding a different string length with other fonts?

@DHowett

DHowett commented Apr 12, 2021

The font cannot change the length of a string.

If you're talking about the column count (the perceived space taken up by the string if printed to a console), that's just off topic for this issue :). That's also one of the harder issues in terminal emulation.

@GrabYourPitchforks
Member

If you really want your mind to melt, spec out what behavior your app will have when the user hits BACKSPACE immediately after entering a complex multi-scalar emoji. :)
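
As one possible answer among several, an app could treat BACKSPACE as "delete the last text element". The C# sketch below shows only that policy; LineBuffer and RemoveLastTextElement are hypothetical names, and the result for emoji sequences assumes .NET 5 or later, where StringInfo uses extended grapheme clusters. Other policies, such as deleting one code point at a time, are equally defensible.

```csharp
using System;
using System.Globalization;

// Illustrative helper; the names are hypothetical.
public static class LineBuffer
{
    // Removes the last text element (grapheme cluster) from the buffer,
    // so one BACKSPACE erases one user-perceived character.
    public static string RemoveLastTextElement(string buffer)
    {
        if (string.IsNullOrEmpty(buffer))
        {
            return string.Empty;
        }

        // Start indexes of every text element; never empty for non-empty input.
        int[] starts = StringInfo.ParseCombiningCharacters(buffer);
        return buffer.Substring(0, starts[starts.Length - 1]);
    }
}
```

Under this policy, backspacing over "a" followed by thumbs up plus a skin tone modifier leaves "a", and backspacing over "e\u0301" leaves an empty buffer.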

@WenceyWang
Author

Let me restate the issue:

The problem with ReadKey is that it may return the first half of a surrogate pair, and the next ReadKey call then returns the next keypress, not the remaining half of the surrogate pair.

This produces garbage output and is impossible to work around in user code.

@BDisp

BDisp commented Apr 13, 2021

@WenceyWang you may need to deal with escape sequences to get all the bytes needed for the surrogate pair.
