Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reserve guarded string literals (RFC 3593) #123951

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

pitaj
Copy link
Contributor

@pitaj pitaj commented Apr 15, 2024

Implementation for RFC 3593, including:

  • lexer / parser changes
  • diagnostics
  • migration lint
  • tests

We reserve #", ##", ###", ####, and any other string of four or more repeated #. This avoids infinite lookahead in the lexer, though we still use infinite lookahead in the parser to provide better forward compatibility diagnostics.

This PR does not implement any special lexing of the string internals:

  • strings preceded by one or more # are denied
  • regardless of the number of trailing #
  • string contents are lexed as if it was just a bare "string"

Tracking issue: #123735
RFC: rust-lang/rfcs#3593

@rustbot
Copy link
Collaborator

rustbot commented Apr 15, 2024

r? @estebank

rustbot has assigned @estebank.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. labels Apr 15, 2024
@rust-log-analyzer

This comment has been minimized.

@rustbot

This comment was marked as off-topic.

@mattheww
Copy link
Contributor

In edition 2024 I'm seeing a panic inside cook_unicode if I give this an empty guarded string without enough closing hashes (eg #"" or ##""#).

@mattheww
Copy link
Contributor

In edition 2024 if I give this a string like blah"xx" I get both a warning for an unknown prefix and a warning for an unprefixed guarded string literal.

I think it would be better if I got only the first warning.

@rust-log-analyzer

This comment has been minimized.

@pitaj
Copy link
Contributor Author

pitaj commented Apr 30, 2024

@mattheww good catch with the empty strings. I wasn't able to reproduce the "double-warning" issue with blah"xx", but just managed to do so with blah#"xx"#.

I think it's actually the correct behavior. When you fix the issue with blah #"xx"# as recommended, you'll end up getting the second error anyways. Better to get both at once, then you can make a more educated decision.

Regardless, that case will be exceedingly rare if it exists in the wild at all, so I'm not going to spend any time on it.

@mattheww
Copy link
Contributor

My stress-tester is now giving an ICE for £#""# in editions older than 2024.

@bors

This comment was marked as resolved.

@pitaj pitaj force-pushed the reserve-guarded-strings branch from bb1c797 to 8f57684 Compare May 2, 2024 04:05
@rust-log-analyzer

This comment has been minimized.

@mattheww
Copy link
Contributor

mattheww commented May 2, 2024

My lexer stress-tests don't find any problems with this now.

(They're checking for ICEs, cases where the lexer output changes in previous editions, and cases where the lexer output is not as expected in the 2024 edition.)

@bors

This comment was marked as resolved.

bors added a commit to rust-lang-ci/rust that referenced this pull request May 14, 2024
Experiment: Reserve guarded string literal syntax (RFC 3593) on all editions

Purpose: crater run to see if we even need to make this change on an edition boundary.

This syntax change applies to all editions, because the particular syntax `#"foo"#` is unlikely to exist in the wild.

Subset of rust-lang#123951

Tracking issue: rust-lang#123735
RFC: rust-lang/rfcs#3593
@pitaj
Copy link
Contributor Author

pitaj commented May 19, 2024

The crater run for #124605 found some real regressions. So that proves the effort made here for backwards compatibility is warranted.

@bors

This comment was marked as resolved.

@traviscross
Copy link
Contributor

We reviewed this in the edition call today. What's the next step here, e.g. is this waiting on further review or on updates from @pitaj?

cc @estebank

@estebank
Copy link
Contributor

@traviscross I'm just concerned with introducing the snapshotting behavior into the lexer. There's no precedent for that approach in that stage of the compiler. It might be that there's no alternative, but would like more people in @rust-lang/compiler to double check on the implementation.

@oli-obk
Copy link
Contributor

oli-obk commented May 29, 2024

cc @Nilstrieb @compiler-errors fun lexer things

@Noratrieb
Copy link
Member

forwarding my ping to @nnethercote

Co-authored-by: Esteban Kuber <estebank@users.noreply.github.com>
@traviscross
Copy link
Contributor

r? @petrochenkov

@petrochenkov, you reviewed #113476 which took a similar approach to the one here. Do you mind having a look at this?

This is an edition item, and we're trying to finish and mark off the ones that we can.

@rustbot rustbot assigned petrochenkov and unassigned nnethercote Aug 13, 2024
fn main() {
// Ok:
m2021::number_of_tokens_in_a_guarded_string_literal!();
m2021::number_of_tokens_in_a_guarded_unterminated_string_literal!();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The token counts (3 and 2) are produced by these macros, but ignored.
They should either be checked (preferably), or not produced.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see, they are checked in a different test.
I think the assert can be put into the macros themselves, then you will not need a separate test.

@petrochenkov
Copy link
Contributor

The PR description needs an update.

@petrochenkov
Copy link
Contributor

The current implementation still has an infinite lexer lookahead, even if it's called from report_guarded_str and not from advance_token.

Suppose we don't have to care about backward compatibility or editions.
How are the "guarded string literals" supposed to be lexed in that case?

I don't see how it can be done without the infinite lookahead.
###############"abc"############### is a string, and ############### without the following string is still 15 separate hashes, but you cannot discern between them without looking arbitrarily far.
This is different from r# and friends where you know that it's certainly not just hashes, but rather a string or an error.

@petrochenkov petrochenkov added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Aug 14, 2024
@traviscross
Copy link
Contributor

@rustbot labels +I-lang-nominated

Interesting question. That may be one for us as a design consideration. Let's nominate it.

@rustbot rustbot added the I-lang-nominated Nominated for discussion during a lang team meeting. label Aug 14, 2024
@pitaj
Copy link
Contributor Author

pitaj commented Aug 14, 2024

@petrochenkov

The current implementation still has an infinite lexer lookahead, even if it's called from report_guarded_str and not from advance_token.

The infinite lookahead is only for better diagnostics. The current implementation will issue a hard error (or forward compatibility warning) on any of the following:

#"
##"
###
####+

There is no case in edition 2024+ that GuardedStrMaybe can be broken into individual pounds.

(If there aren't already tests for this I should add some)

Suppose we don't have to care about backward compatibility or editions.
How are the "guarded string literals" supposed to be lexed in that case?

Like this implementation, they can choose to error on #", ##", or any length of repeated # greater than or equal to 3. This only requires 3-lookahead. (Or they could choose a greater arbitrary N).

###############"abc"############### is a string, and ############### without the following string is still 15 separate hashes, but you cannot discern between them without looking arbitrarily far.

This implementation considers the second an invalid guarded string literal.


Technically this implementation doesn't follow either RFC exactly. The lang team should consider the implication of reserving any ###+ for this purpose. From a quick search on GitHub, it doesn't appear to exist in the wild.

@joshtriplett
Copy link
Member

We reviewed this in today's @rust-lang/lang meeting. Given @pitaj's proposal to fully reserve three or more hashes, which would only need 3-lookahead (not infinite lookahead), we'd like to go forward with this.

We're relying here on the assumption that ### doesn't occur in the wild; if it does we'd want to re-evaluate this. (If ## didn't occur in the wild we'd be fine with reserving that too; if ###+ turns up in the wild, we can change it to ####.)

To be explicit: we're also fine reserving three-or-more hashes in all editions if they don't occur in the wild, if that's easier and reduces edition-dependence.

@traviscross
Copy link
Contributor

We were a bit unclear in the lang meeting today about the current behavior of the PR. Specifically, we were unclear whether it was proposing to reserve ### in Rust 2024 or in all editions.

In testing this branch, I see the following:

//@ compile-flags: -Zunstable-options
//@ revisions: e2021 e2024
//@ [e2021] edition:2021
//@ [e2024] edition:2024
//@ [e2021] check-pass

macro_rules! m {
    ( _, $($x:tt)* ) => {};
    ( 1, $x:tt ) => {};
    ( 2, $x:tt $y:tt ) => {};
    ( 3, $x:tt $y:tt $z:tt ) => {};
}

fn main() {
    m!(1, #);
    m!(2, ##);
    m!(3, ###);
    m!(_, ####);
    //[e2024]~^ ERROR invalid string literal
}

That is, it seems to reserve four or more, and only on Rust 2024.

@pitaj: Some questions:

  • Does the impl mean to reserve 3 or more as the comment above suggested?
  • Why 3? If we're doing this over an edition anyway, how much churn would we incur by reserving two or more instead?
  • Do we gain anything, in terms of avoiding lookahead, by making a reservation like this in all editions?

@pitaj
Copy link
Contributor Author

pitaj commented Aug 15, 2024

Does the impl mean to reserve 3 or more as the comment above suggested?

Sorry, it's been a moment. Misunderstanding my own code, I fell for the off-by-one error. I should have said 4. My apologies for the confusion. I had the amount of lookahead correct, but failed to account for the initial #.

Why 3 4? If we're doing this over an edition anyway, how much churn would we incur by reserving two or more instead?

A set of 4 is what you get using the maximum lookahead of 3 currently implemented in the lexer. I'm not opposed to reserving 2+. I can do a code search and see what I find, but I strongly suspect it would be dwarfed by the number of #" out there (already affected by the changes here).

Do we gain anything, in terms of avoiding lookahead, by making a reservation like this in all editions?

I think it would just be converting future-compatibility warnings into hard errors. I don't see how it would improve the lookahead story because we'd still want to provide good diagnostics.

@pitaj
Copy link
Contributor Author

pitaj commented Aug 15, 2024

I'm not opposed to reserving 2+. I can do a code search and see what I find

Not exhaustive by any means, but I looked for a while and couldn't find a single example on GitHub.

@traviscross
Copy link
Contributor

@pitaj: Based on that, and how we'd felt about it in discussion, I'd suggest changing this to reserve two or more unprefixed # tokens in Rust 2024.

(This is still nominated, and we'll confirm in triage that this sounds right to everyone.)

@petrochenkov: Would this and the comments from @pitaj above resolve the questions you had raised?

@traviscross
Copy link
Contributor

@rustbot labels -I-lang-nominated

We discussed this in lang triage today and confirmed that we would like to reserve the 2+ #s in Rust 2024.

@rustbot rustbot removed the I-lang-nominated Nominated for discussion during a lang team meeting. label Aug 21, 2024
@petrochenkov petrochenkov added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Aug 22, 2024
@petrochenkov
Copy link
Contributor

Ok, so ## is always considered a string, similarly to how r# is always considered a string, and bare ####### becomes impossible. That resolves the question.

@@ -393,6 +410,27 @@ impl Cursor<'_> {
TokenKind::Literal { kind: literal_kind, suffix_start }
}

'#' if matches!(self.first(), '"' | '#') => {
match (self.first(), self.second(), self.third()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be updated for reserving 2+ hashes (#123951 (comment)).

///
/// Used for reserving "guarded strings" (RFC 3598) in edition 2024.
/// Split into the component tokens on older editions.
GuardedStrMaybe,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest removing Maybe and moving this to LiteralKind for consistency, next to LiteralKind::CStr.
CStr isn't called "maybe" even if it's not always a C string (it is not on old editions).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GuardedStr -> GuardedStrPrefix though.
There's a difference with C strings, C strings get to parser fully lexed, but for guarded strings only lex prefix, and then retrieve the rest through a callback in parser.

&& start > self.start_pos
&& matches!(self.src.as_bytes()[self.src_index(start) - 1], b'#' | b'"')
{
return self.split_guarded_str_maybe(start, str_before);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look like a common situation and the early return shouldn't be important for performance.
So it would be simpler to wrap the skipping condition around the buffer_lint below (and then inline and remove fn split_guarded_str_maybe).

@@ -243,6 +244,7 @@ impl<'psess, 'src> StringReader<'psess, 'src> {
let prefix_span = self.mk_sp(start, lit_start);
return (Token::new(self.ident(start), prefix_span), preceded_by_whitespace);
}
rustc_lexer::TokenKind::GuardedStrMaybe => self.report_guarded_str(start, str_before),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rustc_lexer::TokenKind::GuardedStrMaybe => self.report_guarded_str(start, str_before),
rustc_lexer::TokenKind::GuardedStrMaybe => self.maybe_report_guarded_str(start, str_before),

We do not always report a guarded str error/lint here.

return self.split_guarded_str_maybe(start, str_before);
}

let mut maybe_cursor = Cursor::new(str_before);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let mut maybe_cursor = Cursor::new(str_before);
let mut cursor = Cursor::new(str_before);

It is definitely a cursor, no?

if edition2024 {
let expn_data = span.ctxt().outer_expn_data();

let sugg = if expn_data.is_root() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let sugg = if expn_data.is_root() {
let sugg = if !span.from_expansion() {

"unterminated double quote string",
)
.with_code(E0765)
.emit()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a double error here, one for reservation and another for unterminated string, one would be enough (with the fatal one as a priority).

let mut maybe_cursor = Cursor::new(str_before);

let (span, space_span, unterminated) =
if let Some(rustc_lexer::GuardedStr { n_hashes, terminated, token_len }) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nit: this would look better as a match.

self.pos = end;
}

let unterminated = if edition2024 && !terminated { Some(str_start) } else { None };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let unterminated = if edition2024 && !terminated { Some(str_start) } else { None };
let unterminated = if !terminated { Some(str_start) } else { None };

Redundant condition, the result is only used on 2024 edition anyway.

pub struct GuardedStr {
pub n_hashes: u32,
pub terminated: bool,
pub token_len: u32,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Token lengh like this doesn't seem to be used in other literals.
Perhaps it is redundant and the end position can be calculated without it, using the cursor data?

@petrochenkov petrochenkov added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Aug 22, 2024
@Skgland
Copy link
Contributor

Skgland commented Aug 22, 2024

I was being reminded of #118825 by this.

As such I am wondering: Does expanding to two or more hashes rather than three or more interact properly with code examples?

@nnethercote
Copy link
Contributor

I try not to be a negative Nancy, and I rarely weigh in on language design issues. But I want to state clearly that I have a bad feeling about guarded strings, both the idea and the implementation. To echo my comments from above: yet another string literal type, when we already have a lot. All for what I consider a very minor functionality advancement that will have relatively few uses.

Implementation complexity is a particular interest of mine. I spend a lot of time cleaning up code in the compiler, making complex things simpler and more streamlined. I have merged multiple PRs doing exactly that kind of cleanup with the code that handles string literals. It's already more complex than you might expect. And the complexity this PR adds -- including edition-specific special casing of a certain number of leading # characters, yuk -- is exactly the kind of thing that makes my spidey senses tingle and that I try to fix or improve. And I believe implementation complexity is a useful feedback mechanism to language design. If something is difficult to implement, it's going to also be difficult to specify, and describe, and educate users on. It will also be a permanent knot of complexity in the language and implementation.

A follow-up question to this paragraph is "are there any changes we can make that would improve the situation?" But I feel like the answer is "not really". It's all a consequence of guarded string literals being sufficiently different to, and not interacting cleanly with, the existing string literals. If we were designing Rust from scratch it would be a different story.

If I were king of Rust I would just veto the feature entirely.

@fee1-dead fee1-dead self-requested a review August 23, 2024 05:11
@bors
Copy link
Contributor

bors commented Sep 8, 2024

☔ The latest upstream changes (presumably #130091) made this pull request unmergeable. Please resolve the merge conflicts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
F-unprefixed_guarded_strings `#![feature(unprefixed_guarded_strings)]` S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.