UTF8 String Literals - Draft Specification #2911

Open
gafter opened this issue Oct 25, 2019 · 4 comments

@gafter
Member

gafter commented Oct 25, 2019

See also #184

We add language support for a new platform type, Utf8String (dotnet/corefxlab#2350). This name is tentative and subject to a decision by the corert team. For now we use the name Utf8String as a placeholder for whatever name it ends up being.

Section numbers below refer to the ECMA version of the specification.

We propose adding the following sections to the specification:

9.2.N Utf8String (in section Types)

The type System.Utf8String is a sealed class type that inherits directly from object. In the remainder of the spec, we use the name Utf8String to refer to this specific type. Instances of Utf8String represent Unicode strings stored internally using the Unicode UTF-8 encoding (https://en.wikipedia.org/wiki/UTF-8).

11.2.N Utf8String conversion (in section Implicit Conversions)

An implicit conversion exists from a constant expression of type string to the type Utf8String. This conversion produces a null value if the expression's value is null. Otherwise the conversion produces an instance of Utf8String that represents the same sequence of Unicode code points. It is a compile-time error if the characters of the string constant cannot be represented as a valid Unicode UTF-8 sequence. This would occur, for example, if the input string constant contains an unmatched surrogate. The result of the conversion is a constant expression of type Utf8String.
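
As a rough illustration of the validity condition above, the check can be approximated at run time with the existing System.Text.UTF8Encoding class in strict mode. This is only a sketch: the helper name IsValidForUtf8 is made up for this example, and the proposal itself describes a compile-time check in the compiler, not a library call.

```csharp
using System;
using System.Text;

static class SurrogateCheck
{
    // Returns true if s can be encoded as a valid UTF-8 sequence,
    // i.e. it contains no unmatched surrogate code units.
    // (Hypothetical helper; not part of the proposal.)
    public static bool IsValidForUtf8(string s)
    {
        // A null constant converts to a null Utf8String, so it is not an error.
        if (s == null) return true;

        // throwOnInvalidBytes: true makes the encoder throw instead of
        // silently substituting U+FFFD for an unmatched surrogate.
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strict.GetBytes(s);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}
```

Under this sketch, a string such as "\uD83D\uDE00" (a matched surrogate pair) passes, while "\uD800" on its own does not.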

Concatenation

The following addition (no pun intended) is made to 12.9.5 Addition operator:

  • Utf8String concatenation:
    System.Utf8String operator +(System.Utf8String x, System.Utf8String y);
    System.Utf8String operator +(System.Utf8String x, object y);
    System.Utf8String operator +(object x, System.Utf8String y);

These overloads of the binary + operator perform Utf8String concatenation. If an operand of Utf8String concatenation is null, an empty Utf8String is substituted. Otherwise, any operand that is not a Utf8String is converted to its Utf8String representation by invoking the virtual ToString method inherited from type object and then encoding the result as a Utf8String. If ToString returns null, an empty Utf8String is substituted. If the string returned by ToString is not representable as a Utf8String, a System.ArgumentException is thrown.

The result of the Utf8String concatenation operator is a Utf8String that consists of the characters of the left operand followed by the characters of the right operand. The Utf8String concatenation operator never returns a null value. A System.OutOfMemoryException may be thrown if there is not enough memory available to allocate the resulting Utf8String.
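
The concatenation rules above can be sketched with a stand-in type. Everything here is an assumption made for illustration: Utf8Text is a hypothetical byte-array wrapper, not the real Utf8String, and Concat is an ordinary method standing in for the proposed + overloads (it accepts two object operands, which is looser than the three operator signatures listed above).

```csharp
using System;
using System.Text;

// Hypothetical stand-in for the proposed Utf8String; not a real platform type.
sealed class Utf8Text
{
    // Strict encoder: throws EncoderFallbackException (an ArgumentException)
    // when the input contains an unmatched surrogate.
    private static readonly UTF8Encoding Strict = new UTF8Encoding(false, true);
    private readonly byte[] _bytes;

    private Utf8Text(byte[] bytes) { _bytes = bytes; }

    public static readonly Utf8Text Empty = new Utf8Text(new byte[0]);

    public int ByteLength => _bytes.Length;

    public static Utf8Text FromString(string s) =>
        string.IsNullOrEmpty(s) ? Empty : new Utf8Text(Strict.GetBytes(s));

    public override string ToString() => Strict.GetString(_bytes);

    // Applies the operand rules: null becomes empty, a Utf8Text is used as is,
    // anything else goes through ToString() and is then encoded.
    private static Utf8Text Coerce(object operand)
    {
        if (operand is null) return Empty;
        if (operand is Utf8Text u) return u;
        return FromString(operand.ToString()); // a null ToString() result becomes empty
    }

    // Left operand's bytes followed by the right operand's bytes; never returns null.
    public static Utf8Text Concat(object x, object y)
    {
        Utf8Text left = Coerce(x), right = Coerce(y);
        var result = new byte[left._bytes.Length + right._bytes.Length];
        Buffer.BlockCopy(left._bytes, 0, result, 0, left._bytes.Length);
        Buffer.BlockCopy(right._bytes, 0, result, left._bytes.Length, right._bytes.Length);
        return new Utf8Text(result);
    }
}
```

With this sketch, Utf8Text.Concat(Utf8Text.FromString("ab"), 42) yields the text "ab42", and passing a string with an unmatched surrogate as either operand surfaces as an ArgumentException because EncoderFallbackException derives from it.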

Constant Expressions

The following changes are made to 12.20 Constant expressions:

Change this sentence

If a constant expression is a reference type, it must be the string type, a default value expression (§12.7.15) for some reference type, or the value of the expression must be null.

to this

If a constant expression is a reference type, it must be the string type, the Utf8String type, a default value expression (§12.7.15) for some reference type, or the value of the expression must be null.

We add the Utf8String conversion to the set of conversions permitted in a constant expression.

Open Issues

Concatenation

String concatenation may be somewhat problematic. A + operation between a Utf8String and a string would be ambiguous due to the presence of the following two operators:

  • Utf8String operator +(Utf8String x, object y);
  • string operator +(object x, string y);

It isn't clear what semantics are desired. Do we need concatenation for Utf8String values?

Interpolation

There is no easy way to use interpolation to get a Utf8String value. One approach would be to define a new interpolated string conversion from an interpolated string to the type Utf8String. That would permit us to issue a compile-time error if the format string contains unmatched surrogates.

@TonyValenti

I’m really interested in learning more about the challenge that has sparked the need for this new type. Where would I go to learn more?

@CyrusNajmabadi
Member

I’m really interested in learning more about the challenge that has sparked the need for this new type. Where would I go to learn more?

@TonyValenti there are likely many resources on the web about this. One thing to look into is what UTF-8 is and how it generally stores things in memory versus the 16-bit encodings that C#/Java have used since the start.

@gafter
Member Author

gafter commented Oct 30, 2019

I suggest seeing dotnet/corefxlab#2350 for discussion of Utf8String.

@john-h-k

john-h-k commented Nov 1, 2019

String concatenation may be somewhat problematic. A + operation between a Utf8String and a string would be ambiguous due to the presence of the following two operators:

Utf8String operator +(Utf8String x, object y);
string operator +(object x, string y);

It isn't clear what semantics are desired. Do we need concatenation for Utf8String values?

If we don't want the compiler simply special casing this, a Utf8String operator +(Utf8String x, string y) and equivalent reverse could fix that surely?
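
The effect of such an extra overload can be illustrated by analogy with ordinary static methods, which go through the same overload resolution rules that are applied to a set of candidate operators. This is a sketch under assumptions: Utf8Text is an empty hypothetical stand-in for Utf8String, and Plus stands in for the + operator; it does not model how the compiler would actually define the predefined operators.

```csharp
using System;

// Hypothetical stand-in for the proposed Utf8String.
sealed class Utf8Text { }

static class Overloads
{
    // Analogue of: Utf8String operator +(Utf8String x, object y);
    public static string Plus(Utf8Text x, object y) => "Utf8Text + object";

    // Analogue of: string operator +(object x, string y);
    public static string Plus(object x, string y) => "object + string";

    // The suggested extra overload. With only the two methods above, the call
    // Plus(new Utf8Text(), "s") is rejected as ambiguous (CS0121), because
    // each candidate is better on one argument and worse on the other.
    // This overload is better on both arguments, so it is chosen.
    public static string Plus(Utf8Text x, string y) => "Utf8Text + string";
}
```

In this sketch Overloads.Plus(new Utf8Text(), "s") selects the most specific overload, while Plus(new Utf8Text(), 1) and Plus(1, "s") continue to bind to the original two.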
