antlr's lack of unicode is disturbing #15679

Closed
emberian opened this issue Jul 15, 2014 · 4 comments · Fixed by #24620
Labels
A-grammar Area: The grammar of Rust A-testsuite Area: The testsuite used to check the correctness of rustc

Comments

@emberian
Member

It's impossible to express XID_Start and XID_Continue correctly in antlr, so for now the reference lexer just ignores Unicode entirely.

@emberian
Member Author

In particular, antlr only knows about UCS-2, so characters outside of the BMP have to be encoded as surrogate pairs. However, the range syntax ('a' .. 'b') only accepts one character in each string, and a surrogate pair requires two.
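To make the surrogate arithmetic concrete, here is a minimal sketch of how a code point outside the BMP maps to two UTF-16 code units. The function names `surrogate_pair` and `to_utf16_units` are made up for illustration; only `char::encode_utf16` comes from the standard library, and it is used here as a cross-check.

```rust
/// Compute the (high, low) surrogate pair for a supplementary code point.
/// Returns None for code points that fit in a single UTF-16 unit or are out of range.
/// (Hypothetical helper, not part of rustc or antlr.)
fn surrogate_pair(cp: u32) -> Option<(u16, u16)> {
    if !(0x10000..=0x10FFFF).contains(&cp) {
        return None;
    }
    // Subtract the BMP offset, then split the remaining 20 bits into two 10-bit halves.
    let v = cp - 0x10000;
    let high = 0xD800 + (v >> 10) as u16;
    let low = 0xDC00 + (v & 0x3FF) as u16;
    Some((high, low))
}

/// The same encoding via the standard library.
fn to_utf16_units(c: char) -> Vec<u16> {
    let mut buf = [0u16; 2];
    c.encode_utf16(&mut buf).to_vec()
}
```

For U+1D7D8 both functions yield the units 0xD835 and 0xDFD8, which is why a single-character range endpoint cannot hold such a code point.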

@emberian
Member Author

Note that this is also going to affect spans, since spans are in bytes but antlr gives them to us in characters. We can possibly correct this by translating the BytePos to CharPos.
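A rough sketch of what such a translation involves, assuming the source text is available as a plain `&str`. This is a standalone illustration, not the codemap API in libsyntax; `byte_to_char_offset` is a hypothetical name.

```rust
/// Convert a byte offset into `src` to a character (Unicode scalar value) offset.
/// Returns None if the byte offset is out of bounds or falls inside a multi-byte character.
fn byte_to_char_offset(src: &str, byte_pos: usize) -> Option<usize> {
    // is_char_boundary is false for offsets past the end and for offsets
    // that point into the middle of a UTF-8 sequence.
    if !src.is_char_boundary(byte_pos) {
        return None;
    }
    // Count the characters that end before the requested byte offset.
    Some(src[..byte_pos].chars().count())
}
```

For the string "aé𝟘b" (1, 2, 4 and 1 bytes respectively), byte offset 7 maps to character offset 3, while byte offset 2 is rejected because it lies inside "é". This linear scan is the simplest approach; a real implementation would more likely keep a table of multi-byte character positions.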

@fhahn
Contributor

fhahn commented Jan 14, 2015

I am currently looking into this. I've added two additional files which contain the definitions of XID_Start and XID_Continue, like @jbclements used in https://github.com/jbclements/rust-antlr.

To generate the rules, I used Node.js, because https://github.com/mathiasbynens/unicode-4.0.0 provides all code points for XID_Start and XID_Continue as arrays (and I did not find anything similar for Python).

But when I try to convert a span's BytePos to a CharPos, I get the following panic:

'assertion failed: bpos.to_uint() >= mbc.pos.to_uint() + mbc.bytes', /home/flo/projects/rust/rust/src/libsyntax/codemap.rs:487

Here is the Rust code I use to convert them: fhahn@e85d830#diff-cc371bf2fbfcbeb87f125d3c45fd8fc3R237

I would really appreciate any hints about what could be wrong.

Note that so far I have only included symbols up to \uFFFF, but according to antlr/antlr4#276, ranges beyond that could be specified with surrogates. For example,

'\uD812\uDC34'..'\uD813\uDC56'

becomes

( '\uD812' '\uDC34'..'\uDFFF'
| '\uD813' '\uDC00'..'\uDC56'
)
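The splitting shown above could be automated roughly as follows. This is only a sketch under my own assumptions: `SurrogateAlt`, `split_range` and `render` are hypothetical names, the output format simply mirrors the notation used above, and no attempt is made to merge alternatives when a boundary happens to fall on \uDC00 or \uDFFF.

```rust
/// One alternative: a range of high surrogates followed by a range of low surrogates.
#[derive(Debug, PartialEq)]
struct SurrogateAlt {
    high: (u16, u16),
    low: (u16, u16),
}

/// Encode a supplementary code point (0x10000..=0x10FFFF) as a surrogate pair.
fn to_pair(cp: u32) -> (u16, u16) {
    let v = cp - 0x10000;
    (0xD800 + (v >> 10) as u16, 0xDC00 + (v & 0x3FF) as u16)
}

/// Split an inclusive range of supplementary code points into alternatives
/// that only use single-unit ranges.
fn split_range(lo: u32, hi: u32) -> Vec<SurrogateAlt> {
    assert!(0x10000 <= lo && lo <= hi && hi <= 0x10FFFF);
    let (h1, l1) = to_pair(lo);
    let (h2, l2) = to_pair(hi);
    if h1 == h2 {
        // Both ends share a high surrogate: one alternative is enough.
        return vec![SurrogateAlt { high: (h1, h1), low: (l1, l2) }];
    }
    // First high surrogate: from the starting low surrogate to the end of the block.
    let mut alts = vec![SurrogateAlt { high: (h1, h1), low: (l1, 0xDFFF) }];
    // High surrogates strictly in between cover their full low-surrogate block.
    if h2 - h1 > 1 {
        alts.push(SurrogateAlt { high: (h1 + 1, h2 - 1), low: (0xDC00, 0xDFFF) });
    }
    // Last high surrogate: from the start of the block to the ending low surrogate.
    alts.push(SurrogateAlt { high: (h2, h2), low: (0xDC00, l2) });
    alts
}

/// Render one alternative in the '\uXXXX'..'\uXXXX' notation used in this thread.
fn render(alt: &SurrogateAlt) -> String {
    let unit = |u: u16| format!("'\\u{:04X}'", u);
    let side = |(a, b): (u16, u16)| {
        if a == b { unit(a) } else { format!("{}..{}", unit(a), unit(b)) }
    };
    format!("{} {}", side(alt.high), side(alt.low))
}
```

Calling `split_range(0x14834, 0x14C56)` corresponds to the example range above and produces the same two alternatives.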

@pczarn
Contributor

pczarn commented Jan 17, 2015

@fhahn, I'm going to send you a patch with other symbols. But I have an issue where antlr sees 2 surrogates, but Rust's span counts them as one.
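One way to reconcile the two counts, sketched under the assumption that the antlr side reports positions in UTF-16 code units and the Rust side wants positions in characters. The helper `utf16_to_char_offset` is hypothetical and not taken from either code base.

```rust
/// Convert an offset measured in UTF-16 code units to an offset measured in
/// characters (Unicode scalar values). Returns None if the offset is past the
/// end of `src` or points between the two halves of a surrogate pair.
fn utf16_to_char_offset(src: &str, unit_pos: usize) -> Option<usize> {
    let mut units = 0;
    for (i, c) in src.chars().enumerate() {
        if units == unit_pos {
            return Some(i);
        }
        if units > unit_pos {
            // We stepped over the requested offset: it was inside a surrogate pair.
            return None;
        }
        // len_utf16 is 1 for BMP characters and 2 for supplementary ones.
        units += c.len_utf16();
    }
    if units == unit_pos {
        Some(src.chars().count())
    } else {
        None
    }
}
```

For "a𝟘b", UTF-16 offset 3 (the "b") maps to character offset 2, because the two surrogates of "𝟘" count as one character on the Rust side.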

fhahn added a commit to fhahn/rust that referenced this issue Jan 17, 2015
pczarn pushed a commit to pczarn/rust that referenced this issue Apr 19, 2015
bors added a commit that referenced this issue Apr 21, 2015
tormol added a commit to tormol/rust-ascii that referenced this issue May 7, 2016
The [issue it refers to](rust-lang/rust#15679) was closed a year ago.
tormol added a commit to tormol/rust-ascii that referenced this issue May 8, 2016
The [issue it refers to](rust-lang/rust#15679) was closed a year ago.
tormol added a commit to tormol/rust-ascii that referenced this issue May 9, 2016
The [issue it refers to](rust-lang/rust#15679) was closed a year ago.
bors added a commit to rust-lang-ci/rust that referenced this issue Nov 13, 2023
Downgrade `unused_variables` to experimental

I feel problems like rust-lang#15679 are common.