Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create an internal lint for detecting "Unicode-unaware" BytePos & Span manipulations #128790

Open
fmease opened this issue Aug 7, 2024 · 2 comments
Labels
A-diagnostics Area: Messages for errors, warnings, and lints A-lint Area: Lints (warnings about flaws in source code) such as unused_mut. A-unicode Area: Unicode C-feature-request Category: A feature request, i.e: not implemented / a PR. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. D-unicode-unaware Diagnostics: Diagnostics that are unaware of unicode and trigger codepoint boundary assertions P-low Low priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@fmease
Copy link
Member

fmease commented Aug 7, 2024

Inspired by #128717 (comment). CC @jieyouxu.


Since we recover from lexically invalid tokens that are Unicode-confusable with tokens that are lexically valid (e.g., U+037E Greek Question MarkU+003B Semicolon; U+066B Arabic Decimal SeparatorU+002C Comma), (suggestion) diagnostic code down the line generally ought not make too many assumptions about the length and/or position in bytes that the Span of a supposed Rust token/lexeme "maps to" .

In reality however, all too often (suggestion) diagnostic code doesn't follow this 'rule' when performing low-level "span manipulations" defaulting to hard-coded lengths and/or positions. The compiler contains a bunch of snippets like - BytePos(1) or + BytePos(1) where the code guesses that the code point before/after corresponds to a certain token/lexeme like ,, ;, ). However, such code doesn't account for the aforesaid recovery which may have mapped a UTF-8 code unit with byte length > 1 to an ASCII character of length 1 which can lead to ICEs (internal assertions or indexing/slicing at non-char boundaries).

sigh
we have too many hard coded +1/-1 in the compiler
-- @compiler-errors


So it might be worth linting against these error-prone BytePos & Span manipulations.
I don't know how feasible it'd be to implement such a lint well (i.e., low false positive rate) or how the exact rules should look like.

This issue may serve a dual purpose as a tracking issue for eliminating this 'pattern' from the code base.


Uplifted from #128717 (comment):

It's so easy to find these kinds of ICEs:

  1. One simply needs to look for /(\+|-) BytePos\(\d+\)/ inside compiler/,
  2. Figure out which ASCII character is meant
  3. Open one's favorite Unicode table website or program that can list Unicode-confusables
  4. Pick a confusable whose .len_utf8() is >1 and 💥

Example ICE: #128717.

Example ICE: I just found this a minute ago while reviewing an unrelated PR:

This code uses a Medium Right Parenthesis Ornament (U+2769) which is confusable with Right Parenthesis.

fn f() {}

fn main() {
    f(0,1;
}

Leads to:

thread 'rustc' panicked at compiler/rustc_span/src/lib.rs:2119:17:
assertion failed: bpos.to_u32() >= mbc.pos.to_u32() + mbc.bytes as u32

Code in question:

call_expr.span.with_lo(call_expr.span.hi() - BytePos(1))

Discussions

Related issues

@fmease fmease added A-lint Area: Lints (warnings about flaws in source code) such as unused_mut. A-unicode Area: Unicode P-low Low priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. C-feature-request Category: A feature request, i.e: not implemented / a PR. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. labels Aug 7, 2024
@jieyouxu jieyouxu added the A-diagnostics Area: Messages for errors, warnings, and lints label Aug 7, 2024
@fmease fmease changed the title Create an internal lint for detecting "non-Unicode-aware" BytePos manipulations Create an internal lint for detecting "Unicode-unaware" BytePos & Span manipulations Aug 7, 2024
@jieyouxu jieyouxu added the D-unicode-unaware Diagnostics: Diagnostics that are unaware of unicode and trigger codepoint boundary assertions label Aug 9, 2024
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Aug 9, 2024
Use `SourceMap::end_point` instead of `- BytePos(1)` in arg removal suggestion

Previously, we tried to remove extra arg commas when providing extra arg removal suggestions. One of
the edge cases is having to account for an arg that has a closing delimiter `)` following it.
However, the previous suggestion code assumed that the delimiter is in fact exactly the 1-byte `)`
character. This assumption was proven incorrect, because we recover from Unicode-confusable
delimiters in the parser, which means that the ending delimiter could be a multi-byte codepoint
that looks *like* a `)`. Subtracing 1 byte could land us in the middle of a codepoint, triggering a
codepoint boundary assertion.

This is fixed by using `SourceMap::end_point` which properly accounts for codepoint boundaries.

Fixes rust-lang#128717.

cc `@fmease` and rust-lang#128790
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Aug 9, 2024
Ensure let stmt compound assignment removal suggestion respect codepoint boundaries

Previously we would try to issue a suggestion for `let x <op>= 1`, i.e.
a compound assignment within a `let` binding, to remove the `<op>`. The
suggestion code unfortunately incorrectly assumed that the `<op>` is an
exactly-1-byte ASCII character, but this assumption is incorrect because
we also recover Unicode-confusables like `➖=` as `-=`. In this example,
the suggestion code used a `+ BytePos(1)` to calculate the span of the
`<op>` codepoint that looks like `-` but the mult-byte Unicode
look-alike would cause the suggested removal span to be inside a
multi-byte codepoint boundary, triggering a codepoint boundary
assertion.

The fix is to use `SourceMap::start_point(token_span)` which properly accounts for codepoint boundaries.

Fixes rust-lang#128845.

cc rust-lang#128790

r? `@fmease`
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Aug 9, 2024
Use `SourceMap::end_point` instead of `- BytePos(1)` in arg removal suggestion

Previously, we tried to remove extra arg commas when providing extra arg removal suggestions. One of
the edge cases is having to account for an arg that has a closing delimiter `)` following it.
However, the previous suggestion code assumed that the delimiter is in fact exactly the 1-byte `)`
character. This assumption was proven incorrect, because we recover from Unicode-confusable
delimiters in the parser, which means that the ending delimiter could be a multi-byte codepoint
that looks *like* a `)`. Subtracing 1 byte could land us in the middle of a codepoint, triggering a
codepoint boundary assertion.

This is fixed by using `SourceMap::end_point` which properly accounts for codepoint boundaries.

Fixes rust-lang#128717.

cc ``@fmease`` and rust-lang#128790
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Aug 9, 2024
Ensure let stmt compound assignment removal suggestion respect codepoint boundaries

Previously we would try to issue a suggestion for `let x <op>= 1`, i.e.
a compound assignment within a `let` binding, to remove the `<op>`. The
suggestion code unfortunately incorrectly assumed that the `<op>` is an
exactly-1-byte ASCII character, but this assumption is incorrect because
we also recover Unicode-confusables like `➖=` as `-=`. In this example,
the suggestion code used a `+ BytePos(1)` to calculate the span of the
`<op>` codepoint that looks like `-` but the mult-byte Unicode
look-alike would cause the suggested removal span to be inside a
multi-byte codepoint boundary, triggering a codepoint boundary
assertion.

The fix is to use `SourceMap::start_point(token_span)` which properly accounts for codepoint boundaries.

Fixes rust-lang#128845.

cc rust-lang#128790

r? ``@fmease``
GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this issue Aug 9, 2024
Use `SourceMap::end_point` instead of `- BytePos(1)` in arg removal suggestion

Previously, we tried to remove extra arg commas when providing extra arg removal suggestions. One of
the edge cases is having to account for an arg that has a closing delimiter `)` following it.
However, the previous suggestion code assumed that the delimiter is in fact exactly the 1-byte `)`
character. This assumption was proven incorrect, because we recover from Unicode-confusable
delimiters in the parser, which means that the ending delimiter could be a multi-byte codepoint
that looks *like* a `)`. Subtracing 1 byte could land us in the middle of a codepoint, triggering a
codepoint boundary assertion.

This is fixed by using `SourceMap::end_point` which properly accounts for codepoint boundaries.

Fixes rust-lang#128717.

cc ```@fmease``` and rust-lang#128790
GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this issue Aug 9, 2024
Ensure let stmt compound assignment removal suggestion respect codepoint boundaries

Previously we would try to issue a suggestion for `let x <op>= 1`, i.e.
a compound assignment within a `let` binding, to remove the `<op>`. The
suggestion code unfortunately incorrectly assumed that the `<op>` is an
exactly-1-byte ASCII character, but this assumption is incorrect because
we also recover Unicode-confusables like `➖=` as `-=`. In this example,
the suggestion code used a `+ BytePos(1)` to calculate the span of the
`<op>` codepoint that looks like `-` but the mult-byte Unicode
look-alike would cause the suggested removal span to be inside a
multi-byte codepoint boundary, triggering a codepoint boundary
assertion.

The fix is to use `SourceMap::start_point(token_span)` which properly accounts for codepoint boundaries.

Fixes rust-lang#128845.

cc rust-lang#128790

r? ```@fmease```
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Aug 9, 2024
Use `SourceMap::end_point` instead of `- BytePos(1)` in arg removal suggestion

Previously, we tried to remove extra arg commas when providing extra arg removal suggestions. One of
the edge cases is having to account for an arg that has a closing delimiter `)` following it.
However, the previous suggestion code assumed that the delimiter is in fact exactly the 1-byte `)`
character. This assumption was proven incorrect, because we recover from Unicode-confusable
delimiters in the parser, which means that the ending delimiter could be a multi-byte codepoint
that looks *like* a `)`. Subtracing 1 byte could land us in the middle of a codepoint, triggering a
codepoint boundary assertion.

This is fixed by using `SourceMap::end_point` which properly accounts for codepoint boundaries.

Fixes rust-lang#128717.

cc ````@fmease```` and rust-lang#128790
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Aug 9, 2024
Ensure let stmt compound assignment removal suggestion respect codepoint boundaries

Previously we would try to issue a suggestion for `let x <op>= 1`, i.e.
a compound assignment within a `let` binding, to remove the `<op>`. The
suggestion code unfortunately incorrectly assumed that the `<op>` is an
exactly-1-byte ASCII character, but this assumption is incorrect because
we also recover Unicode-confusables like `➖=` as `-=`. In this example,
the suggestion code used a `+ BytePos(1)` to calculate the span of the
`<op>` codepoint that looks like `-` but the mult-byte Unicode
look-alike would cause the suggested removal span to be inside a
multi-byte codepoint boundary, triggering a codepoint boundary
assertion.

The fix is to use `SourceMap::start_point(token_span)` which properly accounts for codepoint boundaries.

Fixes rust-lang#128845.

cc rust-lang#128790

r? ````@fmease````
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Aug 9, 2024
Rollup merge of rust-lang#128865 - jieyouxu:unicurd, r=Urgau

Ensure let stmt compound assignment removal suggestion respect codepoint boundaries

Previously we would try to issue a suggestion for `let x <op>= 1`, i.e.
a compound assignment within a `let` binding, to remove the `<op>`. The
suggestion code unfortunately incorrectly assumed that the `<op>` is an
exactly-1-byte ASCII character, but this assumption is incorrect because
we also recover Unicode-confusables like `➖=` as `-=`. In this example,
the suggestion code used a `+ BytePos(1)` to calculate the span of the
`<op>` codepoint that looks like `-` but the mult-byte Unicode
look-alike would cause the suggested removal span to be inside a
multi-byte codepoint boundary, triggering a codepoint boundary
assertion.

The fix is to use `SourceMap::start_point(token_span)` which properly accounts for codepoint boundaries.

Fixes rust-lang#128845.

cc rust-lang#128790

r? ````@fmease````
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Aug 9, 2024
Rollup merge of rust-lang#128864 - jieyouxu:funnicode, r=Urgau

Use `SourceMap::end_point` instead of `- BytePos(1)` in arg removal suggestion

Previously, we tried to remove extra arg commas when providing extra arg removal suggestions. One of
the edge cases is having to account for an arg that has a closing delimiter `)` following it.
However, the previous suggestion code assumed that the delimiter is in fact exactly the 1-byte `)`
character. This assumption was proven incorrect, because we recover from Unicode-confusable
delimiters in the parser, which means that the ending delimiter could be a multi-byte codepoint
that looks *like* a `)`. Subtracing 1 byte could land us in the middle of a codepoint, triggering a
codepoint boundary assertion.

This is fixed by using `SourceMap::end_point` which properly accounts for codepoint boundaries.

Fixes rust-lang#128717.

cc ````@fmease```` and rust-lang#128790
@jieyouxu
Copy link
Member

jieyouxu commented Aug 24, 2024

As for suggestions:

  • Suggestions should try to derive spans from existing spans using codepoint-boundary-aware operations available on Span, like Span::{to, between, until} etc.
  • If suggestions really need to manipulate by codepoint granularity, consider using codepoint-boundary-aware helpers on SourceMap: SourceMap::{start_point, end_point, next_point, span_through_char, span_until_whitespace, span_until_non_whitespace, span_until_char, find_width_of_character_at_span}.
  • If a suggestion really, really cannot get away with constructing the desired span from previous methods, then the BytePos manipulations need to carefully consider if a parser recovery scheme can reach the suggestion, or in the case of string literals and macros, can the user provide input that would cause codepoint boundary issues.

@compiler-errors
Copy link
Member

Alternatively, if suggestions are often finding that a span needs to be recreated to match some subset of the AST (e.g. a keyword's span like async needs to be tracked for suggestions downstream), consider modifying the AST directly to track that span. But please only do this if it's generally useful, and not just needed for a single suggestion, and also be sure to audit the code for existing suggestions to simplify.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-diagnostics Area: Messages for errors, warnings, and lints A-lint Area: Lints (warnings about flaws in source code) such as unused_mut. A-unicode Area: Unicode C-feature-request Category: A feature request, i.e: not implemented / a PR. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. D-unicode-unaware Diagnostics: Diagnostics that are unaware of unicode and trigger codepoint boundary assertions P-low Low priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

3 participants