Reserve guarded string literals (RFC 3593) #123951

pitaj · 2024-04-15T04:02:44Z

Implementation for RFC 3593, including:

lexer / parser changes
diagnostics
migration lint
tests

We reserve #", ##", ###", ####, and any other string of four or more repeated #. This avoids infinite lookahead in the lexer, though we still use infinite lookahead in the parser to provide better forward compatibility diagnostics.
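For illustration, the reserved set above can be decided by peeking at most three characters past the initial #. This is a minimal sketch, not the PR's code: the function name `starts_reserved_guard` and the string-slice interface are assumptions made for the example.

```rust
/// Hypothetical helper (not taken from the PR): returns true when the text
/// starting at `src` begins with one of the reserved forms listed above:
/// `#"`, `##"`, `###"`, or four or more `#` in a row.
fn starts_reserved_guard(src: &str) -> bool {
    let mut chars = src.chars();
    if chars.next() != Some('#') {
        return false;
    }
    // Peek at most three characters past the initial `#`.
    let (first, second, third) = (chars.next(), chars.next(), chars.next());
    matches!(
        (first, second, third),
        (Some('"'), _, _)
            | (Some('#'), Some('"'), _)
            | (Some('#'), Some('#'), Some('"' | '#'))
    )
}
```

Under this rule a bare `#`, `##`, or `###` that is not followed by a quote stays a sequence of separate pound tokens, while anything longer is treated as reserved without scanning further.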

This PR does not implement any special lexing of the string internals:

strings preceded by one or more # are denied, regardless of the number of trailing #
string contents are lexed as if they were a bare "string"

Tracking issue: #123735
RFC: rust-lang/rfcs#3593

rustbot · 2024-04-15T04:02:54Z

r? @estebank

rustbot has assigned @estebank.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

mattheww · 2024-04-22T19:49:44Z

In edition 2024 I'm seeing a panic inside cook_unicode if I give this an empty guarded string without enough closing hashes (e.g. #"" or ##""#).

mattheww · 2024-04-22T20:06:18Z

In edition 2024 if I give this a string like blah"xx" I get both a warning for an unknown prefix and a warning for an unprefixed guarded string literal.

I think it would be better if I got only the first warning.

pitaj · 2024-04-30T05:51:43Z

@mattheww good catch with the empty strings. I wasn't able to reproduce the "double-warning" issue with blah"xx", but just managed to do so with blah#"xx"#.

I think it's actually the correct behavior. When you fix the issue with blah #"xx"# as recommended, you'll end up getting the second error anyway. Better to get both at once, so that you can make a more educated decision.

Regardless, that case will be exceedingly rare if it exists in the wild at all, so I'm not going to spend any time on it.

mattheww · 2024-04-30T18:56:32Z

My stress-tester is now giving an ICE for £#""# in editions older than 2024.

tests/ui/rust-2024/reserved-guarded-strings.stderr

mattheww · 2024-05-02T18:46:17Z

My lexer stress-tests don't find any problems with this now.

(They're checking for ICEs, cases where the lexer output changes in previous editions, and cases where the lexer output is not as expected in the 2024 edition.)

Experiment: Reserve guarded string literal syntax (RFC 3593) on all editions

Purpose: crater run to see if we even need to make this change on an edition boundary.

This syntax change applies to all editions, because the particular syntax `#"foo"#` is unlikely to exist in the wild.

Subset of rust-lang#123951
Tracking issue: rust-lang#123735
RFC: rust-lang/rfcs#3593

pitaj · 2024-05-19T17:43:38Z

The crater run for #124605 found some real regressions. So that proves the effort made here for backwards compatibility is warranted.

compiler/rustc_lexer/src/lib.rs

traviscross · 2024-05-28T19:32:00Z

We reviewed this in the edition call today. What's the next step here, e.g. is this waiting on further review or on updates from @pitaj?

cc @estebank

estebank · 2024-05-28T23:43:40Z

@traviscross I'm just concerned with introducing the snapshotting behavior into the lexer. There's no precedent for that approach in that stage of the compiler. It might be that there's no alternative, but I would like more people in @rust-lang/compiler to double check the implementation.

oli-obk · 2024-05-29T10:16:08Z

cc @Nilstrieb @compiler-errors fun lexer things

Noratrieb · 2024-05-29T15:23:14Z

forwarding my ping to @nnethercote

compiler/rustc_lint/src/context/diagnostics.rs

Co-authored-by: Esteban Kuber <estebank@users.noreply.github.com>

traviscross · 2024-08-13T19:26:25Z

r? @petrochenkov

@petrochenkov, you reviewed #113476 which took a similar approach to the one here. Do you mind having a look at this?

This is an edition item, and we're trying to finish and mark off the ones that we can.

petrochenkov · 2024-08-14T13:07:43Z

tests/ui/rust-2024/reserved-guarded-strings-via-macro-2.rs

+fn main() {
+    // Ok:
+    m2021::number_of_tokens_in_a_guarded_string_literal!();
+    m2021::number_of_tokens_in_a_guarded_unterminated_string_literal!();


The token counts (3 and 2) are produced by these macros, but ignored.
They should either be checked (preferably), or not produced.

Ah, I see, they are checked in a different test.
I think the assert can be put into the macros themselves, then you will not need a separate test.
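A minimal sketch of what putting the assert into the macro could look like. The macro names `count_tts` and `assert_token_count` are hypothetical and are not the ones used in the PR's tests; the example passes the tokens with spaces between them so that it compiles on every edition.

```rust
/// Hypothetical helper: counts the token trees it is given.
macro_rules! count_tts {
    () => { 0usize };
    ($head:tt $($tail:tt)*) => { 1usize + count_tts!($($tail)*) };
}

/// Hypothetical helper: fails compilation if the token count is not the
/// expected one, so no separate test has to inspect the result.
macro_rules! assert_token_count {
    ($expected:expr; $($tokens:tt)*) => {
        const _: () = assert!(count_tts!($($tokens)*) == $expected);
    };
}

// `#`, `"abc"`, `#` are three separate tokens when written with spaces.
assert_token_count!(3; # "abc" #);
// `#`, `"abc"` are two tokens.
assert_token_count!(2; # "abc");
```

Because the check runs in a `const` item, a wrong count is reported at compile time in the crate that defines the macro invocation.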

petrochenkov · 2024-08-14T14:11:59Z

The PR description needs an update.

petrochenkov · 2024-08-14T14:26:20Z

The current implementation still has an infinite lexer lookahead, even if it's called from report_guarded_str and not from advance_token.

Suppose we don't have to care about backward compatibility or editions.
How are the "guarded string literals" supposed to be lexed in that case?

I don't see how it can be done without the infinite lookahead.
###############"abc"############### is a string, and ############### without the following string is still 15 separate hashes, but you cannot discern between them without looking arbitrarily far.
This is different from r# and friends where you know that it's certainly not just hashes, but rather a string or an error.
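The lookahead concern can be made concrete with a small sketch. It assumes a hypothetical rule in which a run of hashes is a guarded string only if a quote follows the whole run; `HashRun` and `classify_hash_run` are names invented for this example and do not appear in the compiler.

```rust
/// Hypothetical classification of a run of `#` characters.
#[derive(Debug, PartialEq)]
enum HashRun {
    /// The run is immediately followed by `"`.
    GuardedStringStart { hashes: usize },
    /// The run is followed by anything else (or by end of input).
    SeparateHashes { hashes: usize },
}

/// Classifies the hash run at the start of `src`, or returns `None` if
/// `src` does not start with `#`.
fn classify_hash_run(src: &str) -> Option<HashRun> {
    // The whole run has to be consumed before the answer is known, so the
    // number of characters inspected is unbounded.
    let hashes = src.chars().take_while(|&c| c == '#').count();
    if hashes == 0 {
        return None;
    }
    // `#` is a single byte in UTF-8, so `hashes` is also a byte offset.
    match src[hashes..].chars().next() {
        Some('"') => Some(HashRun::GuardedStringStart { hashes }),
        _ => Some(HashRun::SeparateHashes { hashes }),
    }
}
```

With 15 hashes the function has to look at 16 characters before it can tell the two cases apart, and the same holds for any longer run.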

traviscross · 2024-08-14T14:48:42Z

@rustbot labels +I-lang-nominated

Interesting question. That may be one for us as a design consideration. Let's nominate it.

pitaj · 2024-08-14T15:42:24Z

@petrochenkov

> The current implementation still has an infinite lexer lookahead, even if it's called from report_guarded_str and not from advance_token.

The infinite lookahead is only for better diagnostics. The current implementation will issue a hard error (or forward compatibility warning) on any of the following:

#"
##"
###
####+

There is no case in edition 2024+ that GuardedStrMaybe can be broken into individual pounds.

(If there aren't already tests for this I should add some)

> Suppose we don't have to care about backward compatibility or editions.
> How are the "guarded string literals" supposed to be lexed in that case?

Like this implementation, they can choose to error on #", ##", or any length of repeated # greater than or equal to 3. This only requires 3-lookahead. (Or they could choose a greater arbitrary N).

> ###############"abc"############### is a string, and ############### without the following string is still 15 separate hashes, but you cannot discern between them without looking arbitrarily far.

This implementation considers the second an invalid guarded string literal.

Technically this implementation doesn't follow either RFC exactly. The lang team should consider the implication of reserving any ###+ for this purpose. From a quick search on GitHub, it doesn't appear to exist in the wild.

joshtriplett · 2024-08-14T16:56:09Z

We reviewed this in today's @rust-lang/lang meeting. Given @pitaj's proposal to fully reserve three or more hashes, which would only need 3-lookahead (not infinite lookahead), we'd like to go forward with this.

We're relying here on the assumption that ### doesn't occur in the wild; if it does we'd want to re-evaluate this. (If ## didn't occur in the wild we'd be fine with reserving that too; if ###+ turns up in the wild, we can change it to ####.)

To be explicit: we're also fine reserving three-or-more hashes in all editions if they don't occur in the wild, if that's easier and reduces edition-dependence.

traviscross · 2024-08-14T22:47:35Z

We were a bit unclear in the lang meeting today about the current behavior of the PR. Specifically, we were unclear whether it was proposing to reserve ### in Rust 2024 or in all editions.

In testing this branch, I see the following:

//@ compile-flags: -Zunstable-options
//@ revisions: e2021 e2024
//@ [e2021] edition:2021
//@ [e2024] edition:2024
//@ [e2021] check-pass

macro_rules! m {
    ( _, $($x:tt)* ) => {};
    ( 1, $x:tt ) => {};
    ( 2, $x:tt $y:tt ) => {};
    ( 3, $x:tt $y:tt $z:tt ) => {};
}

fn main() {
    m!(1, #);
    m!(2, ##);
    m!(3, ###);
    m!(_, ####);
    //[e2024]~^ ERROR invalid string literal
}

That is, it seems to reserve four or more, and only on Rust 2024.

@pitaj: Some questions:

Does the impl mean to reserve 3 or more as the comment above suggested?
Why 3? If we're doing this over an edition anyway, how much churn would we incur by reserving two or more instead?
Do we gain anything, in terms of avoiding lookahead, by making a reservation like this in all editions?

pitaj · 2024-08-15T00:34:06Z

> Does the impl mean to reserve 3 or more as the comment above suggested?

Sorry, it's been a moment. Misunderstanding my own code, I fell for the off-by-one error. I should have said 4. My apologies for the confusion. I had the amount of lookahead correct, but failed to account for the initial #.

> Why ~~3~~ 4? If we're doing this over an edition anyway, how much churn would we incur by reserving two or more instead?

A set of 4 is what you get using the maximum lookahead of 3 currently implemented in the lexer. I'm not opposed to reserving 2+. I can do a code search and see what I find, but I strongly suspect it would be dwarfed by the number of #" out there (already affected by the changes here).

> Do we gain anything, in terms of avoiding lookahead, by making a reservation like this in all editions?

I think it would just be converting future-compatibility warnings into hard errors. I don't see how it would improve the lookahead story because we'd still want to provide good diagnostics.

pitaj · 2024-08-15T19:00:07Z

> I'm not opposed to reserving 2+. I can do a code search and see what I find

Not exhaustive by any means, but I looked for a while and couldn't find a single example on GitHub.

traviscross · 2024-08-18T22:32:56Z

@pitaj: Based on that, and how we'd felt about it in discussion, I'd suggest changing this to reserve two or more unprefixed # tokens in Rust 2024.

(This is still nominated, and we'll confirm in triage that this sounds right to everyone.)

@petrochenkov: Would this and the comments from @pitaj above resolve the questions you had raised?

traviscross · 2024-08-21T16:38:36Z

@rustbot labels -I-lang-nominated

We discussed this in lang triage today and confirmed that we would like to reserve the 2+ #s in Rust 2024.

petrochenkov · 2024-08-22T15:37:36Z

Ok, so ## is always considered a string, similarly to how r# is always considered a string, and bare ####### becomes impossible. That resolves the question.
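Under the 2+ reservation the decision needs only one character of lookahead after the initial #. A minimal sketch, assuming a hypothetical free function `is_guarded_str_prefix` over a string slice rather than the lexer's cursor:

```rust
/// Hypothetical helper: true when `src` starts with `#"` or `##`, which is
/// the prefix reserved in Rust 2024 under the 2+ rule.
fn is_guarded_str_prefix(src: &str) -> bool {
    let mut chars = src.chars();
    // One character for the `#` itself, one character of lookahead.
    chars.next() == Some('#') && matches!(chars.next(), Some('"' | '#'))
}
```

A single `#` followed by anything else, such as the `[` of an attribute, is unaffected.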

petrochenkov · 2024-08-22T15:46:02Z

compiler/rustc_lexer/src/lib.rs

@@ -393,6 +410,27 @@ impl Cursor<'_> {
                TokenKind::Literal { kind: literal_kind, suffix_start }
            }

+            '#' if matches!(self.first(), '"' | '#') => {
+                match (self.first(), self.second(), self.third()) {


This needs to be updated for reserving 2+ hashes (#123951 (comment)).

petrochenkov · 2024-08-22T15:50:16Z

compiler/rustc_lexer/src/lib.rs

+    ///
+    /// Used for reserving "guarded strings" (RFC 3593) in edition 2024.
+    /// Split into the component tokens on older editions.
+    GuardedStrMaybe,


I suggest removing Maybe and moving this to LiteralKind for consistency, next to LiteralKind::CStr.
CStr isn't called "maybe" even if it's not always a C string (it is not on old editions).

GuardedStr -> GuardedStrPrefix though.
There's a difference with C strings: C strings get to the parser fully lexed, but for guarded strings we only lex the prefix, and then retrieve the rest through a callback in the parser.
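As a rough sketch of the second step, the function below scans the rest of a guarded string once the prefix has been recognized. It is an assumption-laden illustration, not the PR's callback: `GuardedStr` and `scan_guarded_str` are hypothetical names, and the contents are scanned like a bare "string" (ending at the first unescaped quote), as the PR description states.

```rust
/// Hypothetical result of scanning a guarded string (names are illustrative).
#[derive(Debug, PartialEq)]
struct GuardedStr {
    n_hashes: usize,     // number of leading `#`
    terminated: bool,    // whether a closing `"` was found
    n_end_hashes: usize, // trailing `#` consumed, at most `n_hashes`
    token_len: usize,    // total length in bytes
}

/// Scans `src`, which is assumed to start at the first `#` of the prefix.
/// Returns `None` if the hashes are not followed by an opening quote.
fn scan_guarded_str(src: &str) -> Option<GuardedStr> {
    let bytes = src.as_bytes();
    let n_hashes = bytes.iter().take_while(|&&b| b == b'#').count();
    if n_hashes == 0 || bytes.get(n_hashes) != Some(&b'"') {
        return None;
    }
    // Scan the contents like an ordinary string literal.
    let mut pos = n_hashes + 1;
    let mut terminated = false;
    while pos < bytes.len() {
        match bytes[pos] {
            b'\\' => pos += 2, // skip the escaped byte
            b'"' => {
                pos += 1;
                terminated = true;
                break;
            }
            _ => pos += 1,
        }
    }
    // A trailing backslash can push `pos` one past the end.
    let pos = pos.min(bytes.len());
    // Consume closing hashes, but never more than were opened.
    let n_end_hashes = bytes[pos..]
        .iter()
        .take(n_hashes)
        .take_while(|&&b| b == b'#')
        .count();
    Some(GuardedStr { n_hashes, terminated, n_end_hashes, token_len: pos + n_end_hashes })
}
```

A caller could compare `n_end_hashes` with `n_hashes` to report inputs such as ##""#, where the string is closed but too few hashes follow.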

petrochenkov · 2024-08-22T16:02:42Z