antlr's lack of unicode is disturbing #15679

emberian · 2014-07-15T00:42:00Z

It's impossible to correctly form XID_start and XID_continue in antlr, so for now the reference lexer just ignores unicode entirely.

emberian · 2014-07-15T00:47:46Z

In particular, it only knows about UCS-2, so characters outside of the BMP have to be encoded as surrogate pairs. However, the range syntax ('a' .. 'b') accepts only one character in each string, and a surrogate pair requires two.
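To make the surrogate problem concrete, here is a minimal Rust sketch of how one code point above U+FFFF turns into two UTF-16 code units. The function names `to_utf16_units` and `surrogate_pair` are made up for illustration and are not part of the reference lexer.

```rust
/// Encode a character as UTF-16 code units using the standard library.
/// BMP characters yield one unit; everything else yields a surrogate pair.
fn to_utf16_units(c: char) -> Vec<u16> {
    let mut buf = [0u16; 2];
    c.encode_utf16(&mut buf).to_vec()
}

/// Compute the (high, low) surrogate pair by hand.
/// Returns None for code points that fit in the BMP or exceed U+10FFFF.
fn surrogate_pair(cp: u32) -> Option<(u16, u16)> {
    if cp < 0x1_0000 || cp > 0x10_FFFF {
        return None;
    }
    let v = cp - 0x1_0000; // 20 significant bits remain
    let high = 0xD800 + (v >> 10) as u16; // top 10 bits
    let low = 0xDC00 + (v & 0x3FF) as u16; // bottom 10 bits
    Some((high, low))
}
```

Because each side of an antlr range literal holds a single code unit, a pair like this cannot be written as one endpoint of a range.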

emberian · 2014-07-15T01:32:41Z

Note that this is also going to affect spans, since spans are in bytes but antlr gives them to us in characters. We can possibly correct this by translating the BytePos to CharPos.
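As a rough illustration of the byte-to-character translation, the sketch below counts code points in front of a byte offset. It works on a plain `&str` and does not use libsyntax's codemap; `byte_to_char_offset` is a hypothetical helper name.

```rust
/// Convert a byte offset into `src` to a character (code point) offset.
/// Returns None if the offset is out of bounds or falls inside a
/// multi-byte UTF-8 sequence.
fn byte_to_char_offset(src: &str, byte_pos: usize) -> Option<usize> {
    if !src.is_char_boundary(byte_pos) {
        return None;
    }
    // Count the code points that start before `byte_pos`.
    Some(src[..byte_pos].chars().count())
}
```

This is linear in the offset. A real implementation would more likely keep a table of multi-byte characters per file, as the codemap assertion quoted later in this thread suggests.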

fhahn · 2015-01-14T23:00:40Z

I am currently looking into this. I've added two additional files which contain the definitions of XID_Start and XID_Continue, like @jbclements used in https://github.com/jbclements/rust-antlr.

To generate the rules, I used Node.js, because https://github.com/mathiasbynens/unicode-4.0.0 provides all code points for XID_Start and XID_Continue as arrays (and I did not find something similar for Python).
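The generation step amounts to collapsing a sorted array of code points into ranges and printing them as lexer alternatives. The Rust sketch below shows that idea only; it is not the Node.js script used here, the helper names are hypothetical, and it skips every range that extends above \uFFFF.

```rust
/// Collapse a sorted list of code points into inclusive (start, end) ranges.
fn collapse_ranges(sorted: &[u32]) -> Vec<(u32, u32)> {
    let mut ranges: Vec<(u32, u32)> = Vec::new();
    for &cp in sorted {
        if let Some(last) = ranges.last_mut() {
            if cp <= last.1 + 1 {
                // Adjacent or duplicate code point: extend the current range.
                last.1 = last.1.max(cp);
                continue;
            }
        }
        ranges.push((cp, cp));
    }
    ranges
}

/// Render BMP-only ranges as the alternatives of an antlr lexer rule.
/// Ranges that reach above U+FFFF are dropped (a simplification).
fn render_bmp_rule(name: &str, ranges: &[(u32, u32)]) -> String {
    let alts: Vec<String> = ranges
        .iter()
        .filter(|r| r.1 <= 0xFFFF)
        .map(|&(lo, hi)| {
            if lo == hi {
                format!("'\\u{:04X}'", lo)
            } else {
                format!("'\\u{:04X}'..'\\u{:04X}'", lo, hi)
            }
        })
        .collect();
    format!("{}\n  : {}\n  ;", name, alts.join("\n  | "))
}
```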

But when I try to convert a BytePos of the spans to a CharPos, I get the following panic:

'assertion failed: bpos.to_uint() >= mbc.pos.to_uint() + mbc.bytes', /home/flo/projects/rust/rust/src/libsyntax/codemap.rs:487

Here is the Rust code I use to convert them: fhahn@e85d830#diff-cc371bf2fbfcbeb87f125d3c45fd8fc3R237

I would really appreciate any hints as to what could be wrong.

Note that at the moment I have only included symbols up to \uFFFF, but according to antlr/antlr4#276, a range over surrogate pairs could be specified by splitting it by hand, so that

'\uD812\uDC34'..'\uD813\uDC56'

becomes

( '\uD812' '\uDC34'..'\uDFFF'
| '\uD813' '\uDC00'..'\uDC56'
)
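The splitting rule above can be mechanized. Below is a sketch in Rust under the assumption that both endpoints lie above U+FFFF; `SurrogateRange`, `pair`, and `split_range` are illustrative names and come from neither antlr nor the Rust tree.

```rust
/// One alternative: a range of high surrogates followed by a range of
/// low surrogates, both inclusive.
#[derive(Debug, PartialEq)]
struct SurrogateRange {
    high: (u16, u16),
    low: (u16, u16),
}

/// Surrogate pair for a code point in U+10000..=U+10FFFF.
fn pair(cp: u32) -> (u16, u16) {
    let v = cp - 0x1_0000;
    (0xD800 + (v >> 10) as u16, 0xDC00 + (v & 0x3FF) as u16)
}

/// Split a supplementary-plane range into alternatives that each use
/// single-code-unit ranges only.
fn split_range(start: u32, end: u32) -> Vec<SurrogateRange> {
    assert!(0x1_0000 <= start && start <= end && end <= 0x10_FFFF);
    let (h1, l1) = pair(start);
    let (h2, l2) = pair(end);
    if h1 == h2 {
        // Same high surrogate: one alternative is enough.
        return vec![SurrogateRange { high: (h1, h1), low: (l1, l2) }];
    }
    // First high surrogate: from the starting low unit to the end of the block.
    let mut out = vec![SurrogateRange { high: (h1, h1), low: (l1, 0xDFFF) }];
    if h2 - h1 > 1 {
        // High surrogates strictly in between cover every low surrogate.
        out.push(SurrogateRange { high: (h1 + 1, h2 - 1), low: (0xDC00, 0xDFFF) });
    }
    // Last high surrogate: from the start of the block to the ending low unit.
    out.push(SurrogateRange { high: (h2, h2), low: (0xDC00, l2) });
    out
}
```

For the example above, `'\uD812\uDC34'` is U+14834 and `'\uD813\uDC56'` is U+14C56, so `split_range(0x14834, 0x14C56)` yields the same two alternatives.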

pczarn · 2015-01-17T18:14:34Z

@fhahn, I'm going to send you a patch with other symbols. However, I have an issue: antlr sees two surrogates, while Rust's span counts them as one character.
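One way to reconcile the two counts is to map an offset measured in UTF-16 code units back to an offset measured in characters. This is only a sketch over a plain `&str`, assuming antlr's positions are UTF-16 unit indices; `utf16_to_char_offset` is a hypothetical helper.

```rust
/// Convert an offset counted in UTF-16 code units to an offset counted
/// in characters. Returns None if the offset is past the end of `src`
/// or points between the two halves of a surrogate pair.
fn utf16_to_char_offset(src: &str, utf16_pos: usize) -> Option<usize> {
    let mut units = 0;
    for (i, c) in src.chars().enumerate() {
        if units == utf16_pos {
            return Some(i);
        }
        units += c.len_utf16(); // 1 for BMP characters, 2 otherwise
        if units > utf16_pos {
            return None; // offset falls inside a surrogate pair
        }
    }
    if units == utf16_pos {
        Some(src.chars().count()) // offset at end of input
    } else {
        None
    }
}
```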


emberian added A-grammar labels Jul 15, 2014

emberian mentioned this issue Jul 21, 2014

Model lexer is still wrong #15883

Closed

7 tasks

fhahn added a commit to fhahn/rust that referenced this issue Jan 17, 2015

Add proper XID_Start and XID_Continue rules and use CharPos for span …

1a4a679

…comparison, closes rust-lang#15679

pczarn pushed a commit to pczarn/rust that referenced this issue Apr 19, 2015

Add proper XID_Start and XID_Continue rules and use CharPos for span …

be43713

…comparison, closes rust-lang#15679

pczarn mentioned this issue Apr 20, 2015

Model lexer: Fix remaining issues #24620

Merged

bors added a commit that referenced this issue Apr 21, 2015

Auto merge of #24620 - pczarn:model-lexer-issues, r=cmr

7397bdc

Fixes #15679 Fixes #15878 Fixes #15882 Closes #15883

bors closed this as completed in #24620 Apr 21, 2015

tormol added a commit to tormol/rust-ascii that referenced this issue May 7, 2016

Remove commented out code and a fixed FIXME

060b270

The [issue it refers to](rust-lang/rust#15679) was closed a year ago.

tormol added a commit to tormol/rust-ascii that referenced this issue May 8, 2016

Remove commented out code and a fixed FIXME

1de1a93

The [issue it refers to](rust-lang/rust#15679) was closed a year ago.

tormol added a commit to tormol/rust-ascii that referenced this issue May 9, 2016

Remove commented out code and a fixed FIXME

bdd816d

The [issue it refers to](rust-lang/rust#15679) was closed a year ago.

bors added a commit to rust-lang-ci/rust that referenced this issue Nov 13, 2023

Auto merge of rust-lang#15693 - HKalbasi:unused-var, r=HKalbasi

0840038

Downgrade `unused_variables` to experimental I feel problems like rust-lang#15679 are common.

RyanGlScott mentioned this issue Aug 27, 2024

language-rust lexer rejects Unicode symbols that rustc accepts GaloisInc/language-rust#3

Closed
