Create an internal lint for detecting "Unicode-unaware" `BytePos` & `Span` manipulations #128790

fmease · 2024-08-07T16:16:17Z

Inspired by #128717 (comment). CC @jieyouxu.

Since we recover from lexically invalid tokens that are Unicode-confusable with tokens that are lexically valid (e.g., U+037E Greek Question Mark → U+003B Semicolon; U+066B Arabic Decimal Separator → U+002C Comma), (suggestion) diagnostic code down the line generally ought not make too many assumptions about the length and/or position in bytes that the Span of a supposed Rust token/lexeme "maps to" .

In reality however, all too often (suggestion) diagnostic code doesn't follow this 'rule' when performing low-level "span manipulations" defaulting to hard-coded lengths and/or positions. The compiler contains a bunch of snippets like - BytePos(1) or + BytePos(1) where the code guesses that the code point before/after corresponds to a certain token/lexeme like ,, ;, ). However, such code doesn't account for the aforesaid recovery which may have mapped a UTF-8 code unit with byte length > 1 to an ASCII character of length 1 which can lead to ICEs (internal assertions or indexing/slicing at non-char boundaries).

sigh
we have too many hard coded +1/-1 in the compiler
-- @compiler-errors

So it might be worth linting against these error-prone BytePos & Span manipulations.
I don't know how feasible it'd be to implement such a lint well (i.e., low false positive rate) or how the exact rules should look like.

This issue may serve a dual purpose as a tracking issue for eliminating this 'pattern' from the code base.

Uplifted from #128717 (comment):

It's so easy to find these kinds of ICEs:

One simply needs to look for /(\+|-) BytePos\(\d+\)/ inside compiler/,
Figure out which ASCII character is meant
Open one's favorite Unicode table website or program that can list Unicode-confusables
Pick a confusable whose .len_utf8() is >1 and 💥

Example ICE: #128717.

Example ICE: I just found this a minute ago while reviewing an unrelated PR:

This code uses a Medium Right Parenthesis Ornament (U+2769) which is confusable with Right Parenthesis.

fn f() {}

fn main() {
    f(0,1❩;
}

Leads to:

thread 'rustc' panicked at compiler/rustc_span/src/lib.rs:2119:17:
assertion failed: bpos.to_u32() >= mbc.pos.to_u32() + mbc.bytes as u32

Code in question:

rust/compiler/rustc_hir_typeck/src/fn_ctxt/checks.rs

Line 1144 in c9687a9

call_expr.span.with_lo(call_expr.span.hi() - BytePos(1))

Discussions

https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/Internal.20lint.20for.20.22non-Unicode-aware.22.20BytePos.20manipulations.3F

Related issues

The text was updated successfully, but these errors were encountered:

Use `SourceMap::end_point` instead of `- BytePos(1)` in arg removal suggestion Previously, we tried to remove extra arg commas when providing extra arg removal suggestions. One of the edge cases is having to account for an arg that has a closing delimiter `)` following it. However, the previous suggestion code assumed that the delimiter is in fact exactly the 1-byte `)` character. This assumption was proven incorrect, because we recover from Unicode-confusable delimiters in the parser, which means that the ending delimiter could be a multi-byte codepoint that looks *like* a `)`. Subtracing 1 byte could land us in the middle of a codepoint, triggering a codepoint boundary assertion. This is fixed by using `SourceMap::end_point` which properly accounts for codepoint boundaries. Fixes rust-lang#128717. cc `@fmease` and rust-lang#128790

Ensure let stmt compound assignment removal suggestion respect codepoint boundaries Previously we would try to issue a suggestion for `let x <op>= 1`, i.e. a compound assignment within a `let` binding, to remove the `<op>`. The suggestion code unfortunately incorrectly assumed that the `<op>` is an exactly-1-byte ASCII character, but this assumption is incorrect because we also recover Unicode-confusables like `➖=` as `-=`. In this example, the suggestion code used a `+ BytePos(1)` to calculate the span of the `<op>` codepoint that looks like `-` but the mult-byte Unicode look-alike would cause the suggested removal span to be inside a multi-byte codepoint boundary, triggering a codepoint boundary assertion. The fix is to use `SourceMap::start_point(token_span)` which properly accounts for codepoint boundaries. Fixes rust-lang#128845. cc rust-lang#128790 r? `@fmease`

Use `SourceMap::end_point` instead of `- BytePos(1)` in arg removal suggestion Previously, we tried to remove extra arg commas when providing extra arg removal suggestions. One of the edge cases is having to account for an arg that has a closing delimiter `)` following it. However, the previous suggestion code assumed that the delimiter is in fact exactly the 1-byte `)` character. This assumption was proven incorrect, because we recover from Unicode-confusable delimiters in the parser, which means that the ending delimiter could be a multi-byte codepoint that looks *like* a `)`. Subtracing 1 byte could land us in the middle of a codepoint, triggering a codepoint boundary assertion. This is fixed by using `SourceMap::end_point` which properly accounts for codepoint boundaries. Fixes rust-lang#128717. cc ``@fmease`` and rust-lang#128790

Ensure let stmt compound assignment removal suggestion respect codepoint boundaries Previously we would try to issue a suggestion for `let x <op>= 1`, i.e. a compound assignment within a `let` binding, to remove the `<op>`. The suggestion code unfortunately incorrectly assumed that the `<op>` is an exactly-1-byte ASCII character, but this assumption is incorrect because we also recover Unicode-confusables like `➖=` as `-=`. In this example, the suggestion code used a `+ BytePos(1)` to calculate the span of the `<op>` codepoint that looks like `-` but the mult-byte Unicode look-alike would cause the suggested removal span to be inside a multi-byte codepoint boundary, triggering a codepoint boundary assertion. The fix is to use `SourceMap::start_point(token_span)` which properly accounts for codepoint boundaries. Fixes rust-lang#128845. cc rust-lang#128790 r? ``@fmease``

Use `SourceMap::end_point` instead of `- BytePos(1)` in arg removal suggestion Previously, we tried to remove extra arg commas when providing extra arg removal suggestions. One of the edge cases is having to account for an arg that has a closing delimiter `)` following it. However, the previous suggestion code assumed that the delimiter is in fact exactly the 1-byte `)` character. This assumption was proven incorrect, because we recover from Unicode-confusable delimiters in the parser, which means that the ending delimiter could be a multi-byte codepoint that looks *like* a `)`. Subtracing 1 byte could land us in the middle of a codepoint, triggering a codepoint boundary assertion. This is fixed by using `SourceMap::end_point` which properly accounts for codepoint boundaries. Fixes rust-lang#128717. cc ```@fmease``` and rust-lang#128790

Ensure let stmt compound assignment removal suggestion respect codepoint boundaries Previously we would try to issue a suggestion for `let x <op>= 1`, i.e. a compound assignment within a `let` binding, to remove the `<op>`. The suggestion code unfortunately incorrectly assumed that the `<op>` is an exactly-1-byte ASCII character, but this assumption is incorrect because we also recover Unicode-confusables like `➖=` as `-=`. In this example, the suggestion code used a `+ BytePos(1)` to calculate the span of the `<op>` codepoint that looks like `-` but the mult-byte Unicode look-alike would cause the suggested removal span to be inside a multi-byte codepoint boundary, triggering a codepoint boundary assertion. The fix is to use `SourceMap::start_point(token_span)` which properly accounts for codepoint boundaries. Fixes rust-lang#128845. cc rust-lang#128790 r? ```@fmease```

Use `SourceMap::end_point` instead of `- BytePos(1)` in arg removal suggestion Previously, we tried to remove extra arg commas when providing extra arg removal suggestions. One of the edge cases is having to account for an arg that has a closing delimiter `)` following it. However, the previous suggestion code assumed that the delimiter is in fact exactly the 1-byte `)` character. This assumption was proven incorrect, because we recover from Unicode-confusable delimiters in the parser, which means that the ending delimiter could be a multi-byte codepoint that looks *like* a `)`. Subtracing 1 byte could land us in the middle of a codepoint, triggering a codepoint boundary assertion. This is fixed by using `SourceMap::end_point` which properly accounts for codepoint boundaries. Fixes rust-lang#128717. cc ````@fmease```` and rust-lang#128790

Ensure let stmt compound assignment removal suggestion respect codepoint boundaries Previously we would try to issue a suggestion for `let x <op>= 1`, i.e. a compound assignment within a `let` binding, to remove the `<op>`. The suggestion code unfortunately incorrectly assumed that the `<op>` is an exactly-1-byte ASCII character, but this assumption is incorrect because we also recover Unicode-confusables like `➖=` as `-=`. In this example, the suggestion code used a `+ BytePos(1)` to calculate the span of the `<op>` codepoint that looks like `-` but the mult-byte Unicode look-alike would cause the suggested removal span to be inside a multi-byte codepoint boundary, triggering a codepoint boundary assertion. The fix is to use `SourceMap::start_point(token_span)` which properly accounts for codepoint boundaries. Fixes rust-lang#128845. cc rust-lang#128790 r? ````@fmease````

Rollup merge of rust-lang#128865 - jieyouxu:unicurd, r=Urgau Ensure let stmt compound assignment removal suggestion respect codepoint boundaries Previously we would try to issue a suggestion for `let x <op>= 1`, i.e. a compound assignment within a `let` binding, to remove the `<op>`. The suggestion code unfortunately incorrectly assumed that the `<op>` is an exactly-1-byte ASCII character, but this assumption is incorrect because we also recover Unicode-confusables like `➖=` as `-=`. In this example, the suggestion code used a `+ BytePos(1)` to calculate the span of the `<op>` codepoint that looks like `-` but the mult-byte Unicode look-alike would cause the suggested removal span to be inside a multi-byte codepoint boundary, triggering a codepoint boundary assertion. The fix is to use `SourceMap::start_point(token_span)` which properly accounts for codepoint boundaries. Fixes rust-lang#128845. cc rust-lang#128790 r? ````@fmease````

Rollup merge of rust-lang#128864 - jieyouxu:funnicode, r=Urgau Use `SourceMap::end_point` instead of `- BytePos(1)` in arg removal suggestion Previously, we tried to remove extra arg commas when providing extra arg removal suggestions. One of the edge cases is having to account for an arg that has a closing delimiter `)` following it. However, the previous suggestion code assumed that the delimiter is in fact exactly the 1-byte `)` character. This assumption was proven incorrect, because we recover from Unicode-confusable delimiters in the parser, which means that the ending delimiter could be a multi-byte codepoint that looks *like* a `)`. Subtracing 1 byte could land us in the middle of a codepoint, triggering a codepoint boundary assertion. This is fixed by using `SourceMap::end_point` which properly accounts for codepoint boundaries. Fixes rust-lang#128717. cc ````@fmease```` and rust-lang#128790

jieyouxu · 2024-08-24T16:50:34Z

As for suggestions:

Suggestions should try to derive spans from existing spans using codepoint-boundary-aware operations available on Span, like Span::{to, between, until} etc.
If suggestions really need to manipulate by codepoint granularity, consider using codepoint-boundary-aware helpers on SourceMap: SourceMap::{start_point, end_point, next_point, span_through_char, span_until_whitespace, span_until_non_whitespace, span_until_char, find_width_of_character_at_span}.
If a suggestion really, really cannot get away with constructing the desired span from previous methods, then the BytePos manipulations need to carefully consider if a parser recovery scheme can reach the suggestion, or in the case of string literals and macros, can the user provide input that would cause codepoint boundary issues.

compiler-errors · 2024-08-24T16:58:36Z

Alternatively, if suggestions are often finding that a span needs to be recreated to match some subset of the AST (e.g. a keyword's span like async needs to be tracked for suggestions downstream), consider modifying the AST directly to track that span. But please only do this if it's generally useful, and not just needed for a single suggestion, and also be sure to audit the code for existing suggestions to simplify.

jieyouxu added the A-diagnostics Area: Messages for errors, warnings, and lints label Aug 7, 2024

fmease changed the title ~~Create an internal lint for detecting "non-Unicode-aware" BytePos manipulations~~ Create an internal lint for detecting "Unicode-unaware" BytePos & Span manipulations Aug 7, 2024

workingjubilee mentioned this issue Aug 8, 2024

ICE: assertion failed: bpos.to_u32() >= mbc.pos.to_u32() + mbc.bytes as u32 #128845

Closed

jieyouxu added the D-unicode-unaware Diagnostics: Diagnostics that are unaware of unicode and trigger codepoint boundary assertions label Aug 9, 2024

This was referenced Aug 9, 2024

Use SourceMap::end_point instead of - BytePos(1) in arg removal suggestion #128864

Merged

Ensure let stmt compound assignment removal suggestion respect codepoint boundaries #128865

Merged

jieyouxu mentioned this issue Aug 24, 2024

ICE: assertion failed: bpos.to_u32() >= mbc.pos.to_u32() + mbc.bytes as u32 #129503

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create an internal lint for detecting "Unicode-unaware" `BytePos` & `Span` manipulations #128790

Create an internal lint for detecting "Unicode-unaware" `BytePos` & `Span` manipulations #128790

fmease commented Aug 7, 2024 •

edited

Loading

jieyouxu commented Aug 24, 2024 •

edited

Loading

compiler-errors commented Aug 24, 2024

Create an internal lint for detecting "Unicode-unaware" BytePos & Span manipulations #128790

Create an internal lint for detecting "Unicode-unaware" BytePos & Span manipulations #128790

Comments

fmease commented Aug 7, 2024 • edited Loading

Discussions

Related issues

jieyouxu commented Aug 24, 2024 • edited Loading

compiler-errors commented Aug 24, 2024

Create an internal lint for detecting "Unicode-unaware" `BytePos` & `Span` manipulations #128790

Create an internal lint for detecting "Unicode-unaware" `BytePos` & `Span` manipulations #128790

fmease commented Aug 7, 2024 •

edited

Loading

jieyouxu commented Aug 24, 2024 •

edited

Loading