RFC 3349 precursors #120329

The parser already does a check-only unescaping which catches all errors. So the checking done in `from_token_lit` never hits. But literals causing warnings can still occur in `from_token_lit`. So the commit changes `str-escape.rs` to use byte string literals and C string literals as well, to give better coverage and ensure the new assertions in `from_token_lit` are correct.

The `CString` handling code is erroneously identical to the `ByteString` handling code.

The `T` type in these functions took me some time to understand, and I find the explicit `T` in the use of `from` makes the code easier to read, as does the `u8` annotation in `scan_escape`.

- Rename it as `MixedUnit`, because it will soon be used in more than just C string literals.
- Change the `Byte` variant to `HighByte` and use it only for `\x80`..`\xff` cases. This fixes the old inexactness where ASCII chars could be encoded with either `Byte` or `Char`.
- Add useful comments.
- Remove `is_ascii`, in favour of `u8::is_ascii`.
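As a rough illustration of the `HighByte`/`Char` split described in this list, here is a minimal sketch. It is not copied from the commit: the derives, the exact shape of the `From` impls, and the `push_unit` helper are assumptions made for the example. The point it shows is that every ASCII value has exactly one representation (`Char`), while `HighByte` is reserved for `\x80`..`\xff`.

```rust
/// Sketch of a unit in a "mixed" literal: either a Unicode char or a
/// single non-ASCII byte. The derives are assumptions for this example.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MixedUnit {
    /// Any Unicode char, including every ASCII char.
    Char(char),
    /// A byte in the range 0x80..=0xff only.
    HighByte(u8),
}

impl From<char> for MixedUnit {
    fn from(c: char) -> Self {
        MixedUnit::Char(c)
    }
}

impl From<u8> for MixedUnit {
    fn from(n: u8) -> Self {
        // ASCII bytes are always stored as `Char`, so there is no second
        // way to encode them; only high bytes become `HighByte`.
        if n.is_ascii() {
            MixedUnit::Char(n as char)
        } else {
            MixedUnit::HighByte(n)
        }
    }
}

/// Hypothetical helper (not part of the commit): append one unit to a
/// byte buffer, UTF-8 encoding chars and copying high bytes verbatim.
pub fn push_unit(buf: &mut Vec<u8>, unit: MixedUnit) {
    match unit {
        MixedUnit::Char(c) => {
            let mut tmp = [0u8; 4];
            buf.extend_from_slice(c.encode_utf8(&mut tmp).as_bytes());
        }
        MixedUnit::HighByte(b) => buf.push(b),
    }
}
```

With this shape, `MixedUnit::from(b'a')` and `MixedUnit::from('a')` produce the same value, which is the exactness the `HighByte` change is after.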

I find it easier if they describe what's allowed, rather than what's forbidden. Also, consistent naming makes them easier to understand.

`unescape_literal` becomes `unescape_unicode`, and `unescape_c_string` becomes `unescape_mixed`, because RFC 3349 will mean that C string literals are no longer the only mixed UTF-8 literals.

They can't contain `\x` escapes, which means they can't contain high bytes, which means we can use `unescape_unicode` instead of `unescape_mixed` to unescape them. This avoids unnecessary use of `MixedUnit`.

Commits on Jan 25, 2024