
UTF-8 and text 2.0 #211

Open
pnotequalnp opened this issue Mar 15, 2022 · 2 comments
@pnotequalnp

As the documentation says, Alex works over a stream of UTF-8 encoded bytes, retrieved one at a time by alexGetByte.

Lexer specifications are written in terms of Unicode characters, but Alex works internally on a UTF-8 encoded byte sequence.

Depending on how you use Alex, the fact that Alex uses UTF-8 encoding internally may or may not affect you. If you use one of the wrappers (below) that takes input from a Haskell String, then the UTF-8 encoding is handled automatically. However, if you take input from a ByteString, then it is your responsibility to ensure that the input is properly UTF-8 encoded.
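For context, the String wrappers handle this by expanding each Char into its UTF-8 byte sequence before handing bytes to alexGetByte. A simplified sketch of what such an encoding helper looks like (hedged: this is illustrative, not a verbatim copy of Alex's generated wrapper code):

```haskell
import Data.Word (Word8)
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)

-- Encode one Char as its UTF-8 byte sequence, so a byte-oriented
-- lexer core can consume a Char-based input one byte at a time.
utf8Encode :: Char -> [Word8]
utf8Encode c = map fromIntegral (go (ord c))
  where
    go oc
      | oc <= 0x7f   = [oc]                               -- 1 byte (ASCII)
      | oc <= 0x7ff  = [ 0xc0 + (oc `shiftR` 6)           -- 2 bytes
                       , 0x80 + (oc .&. 0x3f) ]
      | oc <= 0xffff = [ 0xe0 + (oc `shiftR` 12)          -- 3 bytes
                       , 0x80 + ((oc `shiftR` 6) .&. 0x3f)
                       , 0x80 + (oc .&. 0x3f) ]
      | otherwise    = [ 0xf0 + (oc `shiftR` 18)          -- 4 bytes
                       , 0x80 + ((oc `shiftR` 12) .&. 0x3f)
                       , 0x80 + ((oc `shiftR` 6) .&. 0x3f)
                       , 0x80 + (oc .&. 0x3f) ]

-- e.g. utf8Encode 'é' gives the two bytes [0xc3, 0xa9]
```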

From an external viewpoint as a consumer (I am not familiar with how Alex is implemented), this seems like a strange design decision. If my source is a String, a wrapper has to encode each Char into UTF-8 bytes just to feed Alex; and if it is already UTF-8, like Data.Text(.Lazy).Text with the new text 2.0 release, it seems I (or a wrapper) would have to decode the content into characters only for Alex to immediately re-encode it back into UTF-8 internally.

So I'm wondering if there's a reason that Alex needs to work over bytes and not Chars, or if that was perhaps done to support lexing ByteStrings directly, without first unpacking the data into Strings or decoding it into UTF-16 Texts (with text < 2.0), either of which would be unnecessary overhead.

If Alex has to work over bytes for internal reasons, I think it would be a good idea to implement new wrappers for the UTF-8 Text types, since I'd imagine that would be a pretty common use case.

Otherwise, would it be possible to expose an alexGetChar-based interface that simply skips the UTF-8 handling in Alex's internal logic, which would be more ergonomic and efficient for UTF-8-based types?
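If Alex keeps its byte-oriented interface, a wrapper for text 2.0 could at least feed the underlying UTF-8 bytes directly, with no per-character encode step. A minimal sketch, assuming text >= 2.0 (where Data.Text.Array stores Word8); the AlexInput field names and fromText helper are hypothetical, not part of Alex:

```haskell
import Data.Word (Word8)
import qualified Data.Text.Internal as TI
import qualified Data.Text.Array as TA

-- Hypothetical input state over text-2's internal UTF-8 buffer.
data AlexInput = AlexInput
  { inpArr :: TA.Array  -- underlying UTF-8 bytes
  , inpOff :: !Int      -- current byte offset
  , inpEnd :: !Int      -- end offset (exclusive)
  }

-- Serve the next UTF-8 byte straight from the Text buffer.
alexGetByte :: AlexInput -> Maybe (Word8, AlexInput)
alexGetByte (AlexInput arr off end)
  | off >= end = Nothing
  | otherwise  = Just (TA.unsafeIndex arr off, AlexInput arr (off + 1) end)

-- Build the input state from a strict Text without copying.
fromText :: TI.Text -> AlexInput
fromText (TI.Text arr off len) = AlexInput arr off (off + len)
```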

@andreasabel
Member

@pnotequalnp: I haven't looked in detail, but I think Alex generates transition arrays indexed by bytes (256 possible values) to make automaton transitions fast. That wouldn't work with Unicode characters, because of the sheer size such arrays would have.
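A back-of-the-envelope comparison of the two table shapes (the 0x110000 figure is just the Unicode code point count, nothing Alex-specific):

```haskell
-- Per-DFA-state transition table size, assuming one entry per
-- possible input symbol.
bytesPerState, codepointsPerState :: Int
bytesPerState      = 256        -- byte-indexed: 0x00..0xFF
codepointsPerState = 0x110000   -- Char-indexed: U+0000..U+10FFFF

-- codepointsPerState `div` bytesPerState == 4352, i.e. a
-- Char-indexed table is over four thousand times larger per state.
```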

@dpwiz

dpwiz commented Jul 7, 2022

Since text-2 uses UTF-8 byte arrays, it should be possible to produce a byte-level automaton and even zero-copy token slices.
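For illustration, a zero-copy token slice over text-2's representation might look like the following sketch. It uses the internal Text constructor; the byteStart/byteEnd offsets are hypothetical values a byte-level lexer would produce, and must fall on code-point boundaries for the result to be valid:

```haskell
import qualified Data.Text.Internal as TI

-- Return the token spanning byte offsets [byteStart, byteEnd) as a
-- Text that shares the original buffer instead of copying it.
-- Precondition: both offsets lie on UTF-8 code-point boundaries.
tokenSlice :: TI.Text -> Int -> Int -> TI.Text
tokenSlice (TI.Text arr off _len) byteStart byteEnd =
  TI.Text arr (off + byteStart) (byteEnd - byteStart)
```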
