Reserve guarded string literals (RFC 3593) #123951
In edition 2024 I'm seeing a panic inside |
In edition 2024 if I give this a string like I think it would be better if I got only the first warning. |
@mattheww good catch with the empty strings. I wasn't able to reproduce the "double-warning" issue with I think it's actually the correct behavior. When you fix the issue with Regardless, that case will be exceedingly rare if it exists in the wild at all, so I'm not going to spend any time on it. |
My stress-tester is now giving an ICE for |
bb1c797
to
8f57684
Compare
My lexer stress-tests don't find any problems with this now. (They're checking for ICEs, cases where the lexer output changes in previous editions, and cases where the lexer output is not as expected in the 2024 edition.) |
Experiment: Reserve guarded string literal syntax (RFC 3593) on all editions
Purpose: crater run to see if we even need to make this change on an edition boundary.
This syntax change applies to all editions, because the particular syntax `#"foo"#` is unlikely to exist in the wild.
Subset of rust-lang#123951
Tracking issue: rust-lang#123735
RFC: rust-lang/rfcs#3593
The crater run for #124605 found some real regressions. So that proves the effort made here for backwards compatibility is warranted. |
029bc5a
to
01505bc
Compare
@traviscross I'm just concerned with introducing the snapshotting behavior into the lexer. There's no precedent for that approach in that stage of the compiler. It might be that there's no alternative, but would like more people in @rust-lang/compiler to double check on the implementation. |
cc @Nilstrieb @compiler-errors fun lexer things |
forwarding my ping to @nnethercote |
Co-authored-by: Esteban Kuber <estebank@users.noreply.github.com>
@petrochenkov, you reviewed #113476 which took a similar approach to the one here. Do you mind having a look at this? This is an edition item, and we're trying to finish and mark off the ones that we can. |
fn main() {
    // Ok:
    m2021::number_of_tokens_in_a_guarded_string_literal!();
    m2021::number_of_tokens_in_a_guarded_unterminated_string_literal!();
The token counts (3 and 2) are produced by these macros, but ignored.
They should either be checked (preferably), or not produced.
Ah, I see, they are checked in a different test.
I think the assert can be put into the macros themselves, then you will not need a separate test.
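As a rough illustration of that suggestion, here is a minimal sketch of a token-counting macro with the assertion folded into a wrapper macro. The names `count_tts` and `assert_token_count` are hypothetical and are not the macros from the PR's test suite; the tokens are spaced out (`# "foo" #`) so the sketch compiles on any edition.

```rust
// Hypothetical helper (not from the PR): counts the token trees it is given.
macro_rules! count_tts {
    () => { 0usize };
    ($head:tt $($tail:tt)*) => { 1usize + count_tts!($($tail)*) };
}

// Hypothetical wrapper: the expected count is checked inside the macro,
// so no separate test is needed to consume the produced value.
macro_rules! assert_token_count {
    ($expected:expr; $($tokens:tt)*) => {
        assert_eq!(count_tts!($($tokens)*), $expected)
    };
}
```

With this shape, a call such as `assert_token_count!(3; # "foo" #);` fails at runtime if the lexer ever produces a different number of tokens.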
The PR description needs an update. |
The current implementation still has an infinite lexer lookahead, even if it's called from Suppose we don't have to care about backward compatibility or editions. I don't see how it can be done without the infinite lookahead. |
The infinite lookahead is only for better diagnostics. The current implementation will issue a hard error (or forward compatibility warning) on any of the following:
There is no case in edition 2024+ that (If there aren't already tests for this I should add some)
Like this implementation, they can choose to error on
This implementation considers the second an invalid guarded string literal. Technically this implementation doesn't follow either RFC exactly. The lang team should consider the implication of reserving any |
We reviewed this in today's @rust-lang/lang meeting. Given @pitaj's proposal to fully reserve three or more hashes, which would only need 3-lookahead (not infinite lookahead), we'd like to go forward with this. We're relying here on the assumption that To be explicit: we're also fine reserving three-or-more hashes in all editions if they don't occur in the wild, if that's easier and reduces edition-dependence. |
We were a bit unclear in the lang meeting today about the current behavior of the PR. Specifically, we were unclear whether it was proposing to reserve In testing this branch, I see the following:
//@ compile-flags: -Zunstable-options
//@ revisions: e2021 e2024
//@ [e2021] edition:2021
//@ [e2024] edition:2024
//@ [e2021] check-pass
macro_rules! m {
    ( _, $($x:tt)* ) => {};
    ( 1, $x:tt ) => {};
    ( 2, $x:tt $y:tt ) => {};
    ( 3, $x:tt $y:tt $z:tt ) => {};
}
fn main() {
    m!(1, #);
    m!(2, ##);
    m!(3, ###);
    m!(_, ####);
    //[e2024]~^ ERROR invalid string literal
}
That is, it seems to reserve four or more, and only on Rust 2024. @pitaj: Some questions:
Sorry, it's been a moment. Misunderstanding my own code, I fell for the off-by-one error. I should have said 4. My apologies for the confusion. I had the amount of lookahead correct, but failed to account for the initial
A set of 4 is what you get using the maximum lookahead of 3 currently implemented in the lexer. I'm not opposed to reserving 2+. I can do a code search and see what I find, but I strongly suspect it would be dwarfed by the number of
I think it would just be converting future-compatibility warnings into hard errors. I don't see how it would improve the lookahead story because we'd still want to provide good diagnostics. |
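The off-by-one described above can be shown with a small model. This is only a sketch under the assumption that the lexer has already consumed the first `#` and may then peek at a fixed number of further characters; `hashes_visible` is a hypothetical name, not a function in `rustc_lexer`.

```rust
/// Hypothetical model: the first `#` is consumed, then up to `lookahead`
/// further characters are inspected. Returns how many consecutive hashes
/// the lexer can be sure about, including the consumed one.
fn hashes_visible(src: &str, lookahead: usize) -> usize {
    let mut chars = src.chars();
    if chars.next() != Some('#') {
        return 0;
    }
    // Count leading hashes among the peeked characters only.
    1 + chars.take(lookahead).take_while(|&c| c == '#').count()
}
```

With a lookahead of 3, the most that can be confirmed is the consumed hash plus three peeked ones, which is why the threshold comes out as four rather than three.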
Not exhaustive by any means, but I looked for a while and couldn't find a single example on GitHub. |
@pitaj: Based on that, and how we'd felt about it in discussion, I'd suggest changing this to reserve two or more unprefixed (This is still nominated, and we'll confirm in triage that this sounds right to everyone.) @petrochenkov: Would this and the comments from @pitaj above resolve the questions you had raised? |
@rustbot labels -I-lang-nominated We discussed this in lang triage today and confirmed that we would like to reserve the 2+ |
Ok, so |
@@ -393,6 +410,27 @@ impl Cursor<'_> {
                TokenKind::Literal { kind: literal_kind, suffix_start }
            }

            '#' if matches!(self.first(), '"' | '#') => {
                match (self.first(), self.second(), self.third()) {
This needs to be updated for reserving 2+ hashes (#123951 (comment)).
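For reference, reserving two or more hashes means the decision only depends on the single character after the first `#`. The following is a simplified sketch of that classification, not the PR's actual code; `HashStart` and `classify_hash` are hypothetical names, and `rest` is assumed to be the source text immediately following the first `#`.

```rust
/// Hypothetical classification of what follows a `#` that was just consumed.
#[derive(Debug, PartialEq, Eq)]
enum HashStart {
    /// Either `#"` (start of a guarded string prefix) or `##` (two or more hashes).
    Reserved,
    /// A lone `#` followed by anything else; lexed as an ordinary `#` token.
    Plain,
}

/// `rest` is the input after the first `#`. One character of lookahead suffices.
fn classify_hash(rest: &str) -> HashStart {
    match rest.chars().next() {
        Some('"') | Some('#') => HashStart::Reserved,
        _ => HashStart::Plain,
    }
}
```

Under this assumption, attribute syntax such as `#[derive(...)]` and `#![...]` is unaffected, because the character after `#` is `[` or `!`.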
///
/// Used for reserving "guarded strings" (RFC 3593) in edition 2024.
/// Split into the component tokens on older editions.
GuardedStrMaybe,
I suggest removing `Maybe` and moving this to `LiteralKind` for consistency, next to `LiteralKind::CStr`.
`CStr` isn't called "maybe" even if it's not always a C string (it is not on old editions).
`GuardedStr` -> `GuardedStrPrefix` though.
There's a difference with C strings: C strings reach the parser fully lexed, but for guarded strings we only lex the prefix, and then retrieve the rest through a callback in the parser.
    && start > self.start_pos
    && matches!(self.src.as_bytes()[self.src_index(start) - 1], b'#' | b'"')
{
    return self.split_guarded_str_maybe(start, str_before);
This doesn't look like a common situation and the early return shouldn't be important for performance.
So it would be simpler to wrap the skipping condition around the `buffer_lint` below (and then inline and remove `fn split_guarded_str_maybe`).
@@ -243,6 +244,7 @@ impl<'psess, 'src> StringReader<'psess, 'src> {
                    let prefix_span = self.mk_sp(start, lit_start);
                    return (Token::new(self.ident(start), prefix_span), preceded_by_whitespace);
                }
                rustc_lexer::TokenKind::GuardedStrMaybe => self.report_guarded_str(start, str_before),
- rustc_lexer::TokenKind::GuardedStrMaybe => self.report_guarded_str(start, str_before),
+ rustc_lexer::TokenKind::GuardedStrMaybe => self.maybe_report_guarded_str(start, str_before),
We do not always report a guarded str error/lint here.
    return self.split_guarded_str_maybe(start, str_before);
}

let mut maybe_cursor = Cursor::new(str_before);
- let mut maybe_cursor = Cursor::new(str_before);
+ let mut cursor = Cursor::new(str_before);
It is definitely a cursor, no?
if edition2024 {
    let expn_data = span.ctxt().outer_expn_data();

    let sugg = if expn_data.is_root() {
- let sugg = if expn_data.is_root() {
+ let sugg = if !span.from_expansion() {
        "unterminated double quote string",
    )
    .with_code(E0765)
    .emit()
There is a double error here: one for the reservation and another for the unterminated string. One would be enough (with the fatal one taking priority).
let mut maybe_cursor = Cursor::new(str_before);

let (span, space_span, unterminated) =
    if let Some(rustc_lexer::GuardedStr { n_hashes, terminated, token_len }) =
Style nit: this would look better as a `match`.
    self.pos = end;
}

let unterminated = if edition2024 && !terminated { Some(str_start) } else { None };
- let unterminated = if edition2024 && !terminated { Some(str_start) } else { None };
+ let unterminated = if !terminated { Some(str_start) } else { None };
Redundant condition, the result is only used on 2024 edition anyway.
pub struct GuardedStr {
    pub n_hashes: u32,
    pub terminated: bool,
    pub token_len: u32,
A token length like this doesn't seem to be used in other literals.
Perhaps it is redundant and the end position can be calculated without it, using the cursor data?
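One way to read this suggestion: if the cursor can report how much input it has consumed, the caller can derive the end position itself and the result struct does not need a length field. Below is a minimal sketch under that assumption. `MiniCursor` and `scan_guarded` are hypothetical stand-ins, not the real `rustc_lexer::Cursor` API, and the scanning rules (backslash escapes skipped, closing quote must be followed by `n_hashes` hashes) are simplified.

```rust
use std::str::Chars;

/// Hypothetical stand-in for the lexer cursor; not the real `rustc_lexer::Cursor`.
struct MiniCursor<'a> {
    len_at_start: usize,
    chars: Chars<'a>,
}

impl<'a> MiniCursor<'a> {
    fn new(src: &'a str) -> Self {
        MiniCursor { len_at_start: src.len(), chars: src.chars() }
    }
    fn peek(&self) -> Option<char> {
        self.chars.clone().next()
    }
    fn bump(&mut self) -> Option<char> {
        self.chars.next()
    }
    /// Bytes consumed so far, derived from what is left in the iterator.
    fn consumed(&self) -> usize {
        self.len_at_start - self.chars.as_str().len()
    }
}

/// Simplified result with no `token_len` field.
#[derive(Debug, PartialEq, Eq)]
struct GuardedStr {
    n_hashes: u32,
    terminated: bool,
}

/// Scans a guarded string starting at the cursor. Returns `None` if the input
/// does not start with one or more `#` followed by `"`. On `None` the cursor
/// may already have advanced, so a real caller would scan a throwaway cursor.
fn scan_guarded(cursor: &mut MiniCursor<'_>) -> Option<GuardedStr> {
    let mut n_hashes = 0u32;
    while cursor.peek() == Some('#') {
        cursor.bump();
        n_hashes += 1;
    }
    if n_hashes == 0 || cursor.bump() != Some('"') {
        return None;
    }
    let mut terminated = false;
    while let Some(c) = cursor.bump() {
        match c {
            // Skip the character after a backslash so `\"` does not close the string.
            '\\' => {
                cursor.bump();
            }
            '"' => {
                // The string ends only if the quote is followed by `n_hashes` hashes.
                let mut closing = 0u32;
                while closing < n_hashes && cursor.peek() == Some('#') {
                    cursor.bump();
                    closing += 1;
                }
                if closing == n_hashes {
                    terminated = true;
                    break;
                }
            }
            _ => {}
        }
    }
    Some(GuardedStr { n_hashes, terminated })
}
```

After a successful scan, `cursor.consumed()` gives the token length, so the caller can compute the end position as the start position plus that value.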
I was being reminded of #118825 by this. As such I am wondering: Does expanding to two or more hashes rather than three or more interact properly with code examples? |
I try not to be a negative Nancy, and I rarely weigh in on language design issues. But I want to state clearly that I have a bad feeling about guarded strings, both the idea and the implementation.

To echo my comments from above: yet another string literal type, when we already have a lot. All for what I consider a very minor functionality advancement that will have relatively few uses.

Implementation complexity is a particular interest of mine. I spend a lot of time cleaning up code in the compiler, making complex things simpler and more streamlined. I have merged multiple PRs doing exactly that kind of cleanup with the code that handles string literals. It's already more complex than you might expect. And the complexity this PR adds -- including edition-specific special casing of a certain number of leading

A follow-up question to this paragraph is "are there any changes we can make that would improve the situation?" But I feel like the answer is "not really". It's all a consequence of guarded string literals being sufficiently different to, and not interacting cleanly with, the existing string literals. If we were designing Rust from scratch it would be a different story. If I were king of Rust I would just veto the feature entirely.
☔ The latest upstream changes (presumably #130091) made this pull request unmergeable. Please resolve the merge conflicts. |
Implementation for RFC 3593, including:
We reserve `#"`, `##"`, `###"`, `####`, and any other string of four or more repeated `#`. This avoids infinite lookahead in the lexer, though we still use infinite lookahead in the parser to provide better forward compatibility diagnostics.
This PR does not implement any special lexing of the string internals:
#
are denied#
"string"
Tracking issue: #123735
RFC: rust-lang/rfcs#3593