Soft Keywords and How to Implement Them #138

Open
isidentical opened this issue Jun 25, 2020 · 45 comments

@isidentical
Collaborator

isidentical commented Jun 25, 2020

I've been thinking for a day about how we could possibly implement PEP 622 in case of acceptance. While thinking, I drafted a piece of code which might seem totally unreasonable (since it changes the parser from LL(1) to semi-LL(1)), but I wanted to share it as a possible last resort. I tried to document the strategy in inline comments, so feel free to read it and comment:

https://github.com/davidhalter/parso/compare/master...isidentical:soft-keywords-very-very-bad-demo?expand=1

@isidentical
Collaborator Author

CC: @davidhalter @bgw

@davidhalter
Owner

@isidentical Interesting work. I'm not sure we should go this route. An alternative way to deal with "soft keywords" is using something like name: NAME | 'match'. That would of course need a step where match is converted back to a name if it's not a keyword (i.e. if the parent is not a match_stmt). What do you think about that idea?
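
As an illustration only, such a conversion step could look roughly like this, assuming a parso-like tree where leaves have type, value and parent attributes (the names below are made up for the sketch, not parso's actual API):

SOFT_KEYWORDS = frozenset({'match', 'case'})

def demote_soft_keywords(node):
    """Turn soft-keyword leaves back into plain names when the parent
    node shows they are not actually used as keywords."""
    for child in getattr(node, 'children', ()):
        demote_soft_keywords(child)
    if (getattr(node, 'type', None) == 'keyword'
            and node.value in SOFT_KEYWORDS
            and node.parent.type not in ('match_stmt', 'case_block')):
        node.type = 'name'  # it was just an ordinary identifier after all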

However I would probably prefer a PEG parser at that point. But at the same time I haven't really thought about that either.

PS: I hope that the PEP gets rejected. IMO this is just additional syntax without major benefits. If pattern matching was at least an expression, then I would maybe see it, but this way it's just useless (you can do most things with if).

@isidentical
Collaborator Author

What do you think about that idea?

Will look into that (and possibly draft a PoC), but at least for now, I don't think it would be the right fit.

However I would probably prefer a PEG parser at that point. But at the same time I haven't really thought about that either.

Yes, me too (if we can). As I said, this is just a workaround for that problem. I, unfortunately, can't start such an effort but can participate whenever I'm free. (If anyone is interested in the idea, maybe with a Rust/C backend.)

PS: I hope that the PEP gets rejected. IMO this is just additional syntax without major benefits. If pattern matching was at least an expression, then I would maybe see it, but this way it's just useless (you can do most things with if).

I ~partially agree. I don't really mind if it gets accepted, but not as is. IMHO it needs major changes, including very big simplifications. It took me hours and hours to finally comprehend the whole document, which I believe shows it is a very complex and dense language addition, at least as a single change. It might be worth dividing the PEP into parts and gradually including/excluding some of them. Let's wait and see, no need to rush :P

@davidhalter
Owner

davidhalter commented Jun 27, 2020

Yes, me too (if we can). As I said, this is just a workaround for that problem. I, unfortunately, can't start such an effort but can participate whenever I'm free. (If anyone is interested in the idea, maybe with a Rust/C backend.)

Why can't you start? Because it would be written in a lower level language?

It took me hours and hours to finally comprehend the whole document, which I believe shows it is a very complex and dense language addition, at least as a single change

Yeah, I also didn't read the whole thing, it's simply too long and complicated. But that might just be this iteration of the draft.

@isidentical
Collaborator Author

Why can't you start? Because it would be written in a lower level language?

No, I'm currently in the process of getting a job for this summer, and I'm not sure I can dedicate time to it, at least on my own. If you or anybody else attempts it, I can help in my free time.

@isidentical
Collaborator Author

@davidhalter would you check this out (and comment if you have anything to say)? https://bugs.python.org/issue40360

@davidhalter
Owner

I have thought quite a bit about this. It's similar to your plan, but not exactly the same.

We can probably use our existing LL parser and make it PEG-compatible with a small trick:
Backtrack in case of ambiguities (otherwise use normal LL). At the backtracking point, just use the LL algorithm again for the alternative plan.

This is very similar to PEG and would need some work, but it's probably not even that hard to do. The biggest challenge will probably be keywords/soft keywords and backtracking.

It seems like your solution is quite similar to this, but you seem to look ahead at the transitions to make it clear which plan to take. That of course works as well.
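
As a rough sketch only, such an ambiguity-triggered backtracking step could look like this; the tokenizer and stack methods used here are invented for illustration and are not parso's real interface:

class ParseError(Exception):
    pass

def parse_ambiguous(parser, tokenizer, plans):
    """Try each alternative plan; on failure rewind the tokenizer and the
    partial tree to the saved checkpoint and try the next plan."""
    token_checkpoint = tokenizer.position()       # hypothetical helper
    stack_checkpoint = parser.stack_snapshot()    # hypothetical helper
    for plan in plans:
        try:
            return parser.follow_plan(plan)       # normal LL parsing inside
        except ParseError:
            tokenizer.rewind(token_checkpoint)    # backtrack the input
            parser.restore_stack(stack_checkpoint)  # and the partial tree
    raise ParseError('none of the alternatives matched')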

@isidentical
Collaborator Author

Interesting. Do you have time to come up with a draft that supports both soft keywords and multiline context managers? Obviously, we don't need an ideal, textbook parsing strategy, so we could just embed the backtracking into the LL parser. For reference, I've written about it: https://tree.science/what-the-backtracking.html

@davidhalter
Owner

davidhalter commented Oct 23, 2020

At the moment I have other priorities, but I will definitely come up with something before 3.10 releases if you don't do the work :)

My goal for parso is just to get it running so people can use it. My priorities lie with creating a parser that is both faster and better than parso (and is probably a different project).

@isidentical
Collaborator Author

if you don't do the work :)

:-) [inserts a november song]

My goal for parso is just to get it running so people can use it. My priorities lie with creating a parser that is both faster and better than parso (and is probably a different project).

I like hacky ways 😎

@davidhalter
Owner

davidhalter commented Oct 23, 2020

I think your general approach is pretty good, I would probably just use a slightly different approach in the generator (if possible): I would move some of the logic there to calculate the plans beforehand. But I haven't really thought about everything and your solution might be what we need.

@isidentical
Collaborator Author

I started some stuff, but calculating the rules beforehand is a bit tricky. The thing that makes backtracking harder for us is error-recovery. If we could assume all input is valid, we would iterate over all possible plans (e.g. the optional parens for the with-stmt), perform a dry-run, and see which one of the alternatives completed; as soon as one completes, we could return it. But since we also do error recovery, we need to take one token at a time, feed all the plans, get the next possible plans, and repeat this until we find a plan that takes more input than the others.

The proposed patch does this on a very small scale, but for ambiguous rules I might have an alternate patch (still working on it, but no luck so far). Basically what it does is, instead of keeping transitions as a 1-to-1 mapping from token_type -> DFA, it keeps them as a 1-to-many mapping from token_type -> [DFA1, DFA2, ...]. When we proceed in the parser, we check how many DFAs we have: if there is only one, just return it; if there is more than one, repeat the process I just mentioned above.
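
A condensed, self-contained sketch of that 1-to-many idea; DFAPlan and its dry_run method are made up here just to show the shape of the lookup:

class DFAPlan:
    """Stand-in for a generated DFA; dry_run() reports how many of the
    upcoming tokens the plan could consume before failing."""
    def __init__(self, name, expected_tokens):
        self.name = name
        self.expected_tokens = expected_tokens

    def dry_run(self, tokens):
        consumed = 0
        for actual, expected in zip(tokens, self.expected_tokens):
            if actual != expected:
                break
            consumed += 1
        return consumed

# token_type -> list of candidate plans instead of exactly one DFA
transitions = {
    'with': [DFAPlan('with_parens', ['with', '(', 'NAME']),
             DFAPlan('with_plain', ['with', 'NAME'])],
}

def pick_plan(token_type, upcoming_tokens):
    candidates = transitions[token_type]
    if len(candidates) == 1:
        return candidates[0]  # the common, unambiguous fast path
    # Ambiguous: feed the tokens to every plan and keep the one that
    # gets the furthest, as described above.
    return max(candidates, key=lambda plan: plan.dry_run(upcoming_tokens))

print(pick_plan('with', ['with', '(', 'NAME']).name)  # -> with_parens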

@davidhalter
Owner

I will answer later, but maybe you can answer something about CPython's PEG grammar. I don't really understand why the tilde in the following code is necessary. I have read the PEP about it and understand that it's usually used to avoid considering alternatives. But in this case there are no alternatives, so why is it there?

listcomp:
    | '[' named_expression ~ for_if_clauses ']'

@isidentical
Collaborator Author

isidentical commented Nov 3, 2020

I presume you found it on docs.python.org, which hosts the stripped version of the official grammar. That rule actually has 2 alternatives (one is only used for error reporting in cases like [a for a in *b]), but since it is an error-only alternative it got stripped away. See the original rule. The reason the tilde is required there is that if we have already parsed named_expression, that invalid_ alternative will no longer be needed, since its only purpose was checking whether the iterable is a starred item or not. So if the parser is able to recognize named_expression, it can continue forward without even considering alternatives, and if it then fails, it will raise a normal syntax error.
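
Purely to illustrate what the cut does, a hand-written version of that rule might look like the following; the parser object and its mark/reset/expect helpers are assumed for the sketch, and this is not Pegen's generated code:

def parse_listcomp(p):
    """'[' named_expression ~ for_if_clauses ']' with a cut after the
    named expression."""
    pos = p.mark()
    if p.expect('['):
        elem = p.named_expression()
        if elem is not None:
            # The cut (~): from here on, a failure raises a plain
            # SyntaxError instead of backtracking into the error-only
            # invalid_... alternative.
            clauses = p.for_if_clauses()
            if clauses is not None and p.expect(']'):
                return ('listcomp', elem, clauses)
            raise SyntaxError('invalid list comprehension')
    p.reset(pos)
    return p.invalid_comprehension()  # error-only alternative, e.g. for [a for a in *b]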

@davidhalter
Owner

Thanks! I was indeed looking at the wrong file. I was aware of the other file, but I thought the docs.python.org one was more readable, so I mostly looked at that. Thanks for the explanation.

About the proposed patch: I'm currently trying a technique in Rust that might work for us as well. I will keep you posted.

@isidentical
Collaborator Author

About the proposed patch: I'm currently trying a technique in Rust that might work for us as well. I will keep you posted.

👀

@isidentical
Collaborator Author

isidentical commented Dec 10, 2020

The thing that makes backtracking harder for us is error-recovery

I was reading a paper about error-recovery for PEG parsers, especially the ones used by IDEs for auto-completion, and noticed that I was missing an important aspect regarding backtracking on ambiguities in my fork that imitates parso's LL(1) parser. I believe that with something similar to the commit (~) operator in Pegen, the problem that I mentioned in my former comment can be easily resolved.

@isidentical
Collaborator Author

status update: PEP 634 accepted with soft keywords

@davidhalter
Owner

Unfortunately, yes. :)

If we don't have a good solution for PEG/LL(*) parsing in autumn we can maybe also use error recovery. It could work well enough.

@isidentical
Collaborator Author

Yeah, that is what I kind of used (not entirely). Do you have any opinions regarding my initial approach? I'm thinking about also introducing a ~ operator to our grammar, so that we can implement the multiline with statement.

@davidhalter
Owner

Give me one more month. I'm pretty far with my approach and would then show it to you.

@davidhalter
Owner

@isidentical I'm a bit confused by how the new Grammar uses lookaheads. Negative lookaheads seem to be used in some places for speedups (maybe like 2-3 times), but in all other places they are pretty much useless and might be omitted. Do you happen to know why they are there?

@isidentical
Collaborator Author

Can you show a concrete example?

@davidhalter
Owner

I realized that, again, I wasn't reading https://github.com/python/cpython/blob/master/Grammar/python.gram properly. I usually read https://docs.python.org/3.10/reference/grammar.html first to get a rough understanding of the grammar and then refer to the official grammar to see if I understood it right, but in this case I didn't see that most ! lookaheads have an effect on raising SyntaxError.

@davidhalter
Owner

davidhalter commented Mar 8, 2021

@isidentical I'm doing my first few benchmarks with my new Rust Parser. I'm using a 2 MB file that I repeat ten times. So 20 MB of code. Generally parso is like 30 times slower than both the CPython Parser and my Rust parser:

Type                                  | Time        | Peak Memory Usage
Parso                                 | 85s         | 0.85 GB
Python 3.9 parser.suite               | 2.9s        | 1.35 GB
Python 3.9 -X oldparser ast.parse     | 12s         | 1.48 GB
Python 3.9 ast.parse                  | 12.6s       | 1.93 GB
Python 3.9 -X oldparser compile exec  | 19s         | 1.48 GB
Python 3.9 compile exec               | 19s         | 1.46 GB
My Rust Parser                        | 2.6s 2.05s  | 0.185 GB

Numbers are not extremely precise. Just there for a general tendency.

I'm a bit confused by the memory usage. Can you explain this? Especially ast.parse seems to be pretty extreme. 20 MB of code should IMO not take up 1.93 GB of space.

Everything else seems to make sense. AST generation seems to take a lot of time though. My Rust parser still has some space for user definable stuff on the nodes, so that's essentially 250 MB that could be reduced even more if the nodes were compressed properly. I guess <200 MB would be feasible for a full syntax tree for 20 MB of code. It's especially interesting that a pure Python version (parso) of a syntax tree uses less RAM than the CPython version (however it's also extremely slow, not sure how much interning of values is going on there).

Is there a way to test the pure speed of the CPython PEG parser? I'm not interested in the AST stuff.

@isidentical
Collaborator Author

I'm a bit confused by the memory usage. Can you explain this?

The difference seems quite big, and the only thing that I can think of is the memoization cost, though it shouldn't cause that much of a difference. There is a way to ~dump memoization stats but it is not really explicit, though feel free to try (data/xxl.py is just a 100000-line file that consists of arithmetic stuff):
https://github.com/python/cpython/blob/bbba28212ce0f58096a4043f32442c6e727b74fc/Tools/peg_generator/Makefile#L66-L68

Is there a way to test the pure speed of the CPython PEG parser? I'm not interested in the AST stuff.

Since the parser's only output is the AST, I don't think there is an easy way to see that speed. The only overhead that I see between the two is the serialization phase, which shouldn't take much time. Though if you want to dive really deep, you could use a C profiler to profile the _PyPegen_parse function and get the actual time.
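
If a rough upper bound from pure Python is enough, something like this works; note that it times ast.parse as a whole (parser plus AST construction), and tracemalloc only tracks allocations routed through Python's allocator:

import ast
import time
import tracemalloc

with open('data/xxl.py') as f:  # any sufficiently large source file
    source = f.read()

tracemalloc.start()
start = time.perf_counter()
tree = ast.parse(source)
elapsed = time.perf_counter() - start
_, peak = tracemalloc.get_traced_memory()
print(f'ast.parse: {elapsed:.2f}s, peak traced memory: {peak / 1e6:.0f} MB')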

@davidhalter
Owner

davidhalter commented Mar 10, 2021

Thanks for your comment! After an hour or two of optimizing I got it down from 2.9s to 2.6s and from about 250 MB to 185 MB. It will probably stay there for a while. I could reduce a tree node further to get to 150 MB and then compress it to 120 MB. Not sure if that's all worth it, so I won't do it for now. A tree with ten million nodes would then use 85 MB RAM + 20 MB code + program. Pretty cool :). As a comparison: The Python standard library has ~10 MB of source code.

I'm almost at the place where I can show my concepts to you. I feel like they would be pretty nice for parso as well, but they might be a bit of work.

And I'm definitely too lazy to profile _PyPegen_parse, but good hint :).

@bgw
Contributor

bgw commented Mar 10, 2021

@davidhalter It's interesting that you say that you're working on a Rust parser. I've slowly been working on porting bits of LibCST to Rust for better performance (starting with Instagram/LibCST#452).

I've currently got a 1:1 port of CPython's tokenizer.c to Rust (the code isn't published yet, but I can put it up if you want to look at it), and I'm working on adding support for Parso's f-string splitting behavior. It doesn't currently support any sort of error recovery. I don't know if that would be useful to you. It'd be cool if we could continue to share some infrastructure.

I haven't started on trying to move the parser to Rust yet (though Python 3.10 gives it some urgency), but I've been considering rust-peg for the job. IMO, parsing between Parso and LibCST is different enough (LibCST doesn't need error recovery or incremental parsing) that I think it's fine if we diverge there.

@davidhalter
Owner

It'd be cool if we could continue to share some infrastructure.

I have also been enjoying working with you. However: I'm considering releasing the software as commercial software, so that will probably have to wait for a bit. :) It will probably also be released open source and for free for small companies and individuals, but your company obviously isn't part of that group :).

@bgw
Contributor

bgw commented Mar 14, 2021

I'm considering releasing the software as commercial software, so that will probably have to wait for a bit.

Up to you; commercial software can pull in PSF- or MIT-licensed dependencies! I've benefited immensely from your open-source work. It's totally fine if you want to take any of my code and use it in a commercial project, and it's okay if you choose not to share changes back. I don't really see Jedi/Parso and Fixit/LibCST as competing products: LibCST makes no attempt to target as-you-type IDE use-cases.

It sounds like you already have a tokenizer, but if you're interested, here's my Rust tokenizer implementation. It pretty closely matches the behavior of Parso's tokenizer, except that it doesn't have any error recovery.
https://github.com/bgw/LibCST/blob/native/tokenizer/libcst_native/src/tokenize/core/mod.rs

Best of luck on your possible commercial software endeavor. A faster and more capable version of Parso/Jedi would be a no-brainer for lots of companies.

@davidhalter
Owner

@bgw Thanks a lot for sharing. I'm really interested in comparing the two tokenizers in terms of performance. I might not do it now, but I will definitely compare it eventually.

@isidentical
Collaborator Author

For what it's worth, here is a (~decent-looking) Python parser written in Rust that I stumbled upon a few days ago: https://github.com/ProgVal/rust-python-parser/

@bgw
Contributor

bgw commented Apr 11, 2021

@isidentical, I've seen that. It does look decent, except:

  • It's GPLv3, which would require re-licensing anything that'd want to build on top of it. (Or someone would have to ask the author of that crate if they'd be willing to re-license their code)
  • Parsing and tokenization must happen in the same pass. Given how many tokenizer hacks Python has, that might make certain edge cases hard to handle. Any design differences from CPython's architecture may also make it harder to support future syntax changes.
  • It uses the nom parser combinator library. Nom looks like a very impressive and well-maintained library, but there's some work (and room for mistakes) involved in converting the grammar from PEG format to something that will work with nom's combinators. I'm looking at rust-peg for this reason.
  • Even if we chose to use that as a starting point, it'd still be a lot of work to adapt that codebase to a lossless syntax tree like Parso and LibCST use.

If you're looking for a Python AST parser in Rust, you should also consider rustpython_parser. They do tokenization in a separate pass. They use lalrpop though, which may not be able to support the Python 3.10 match syntax or other future syntax features that depend on PEG parsing.

I started by trying to wrap RustPython's lexer for LibCST, but I wanted the Rust tokenizer to match with our Python tokenizer's behavior exactly so that it could be a drop-in replacement. There were enough differences that it ultimately ended up being easier for me to write my own implementation instead.

@davidhalter
Owner

@bgw I agree. I'm not a big fan of parser combinators for language grammars. I like proper BNF grammars much more. Nom is definitely a good library, but I think language parsing is not what it's good for. I'm still a big fan of LL parsing, because of its mathematical beauty :) It also lends itself better towards error recovery than LR parsing. I have never really understood why people would prefer LR.

@isidentical I have given you access to my private project. Feel free to accept it. It seems like it took me a bit too long to write this :). I will however quickly summarize how I got it working and what I would see as the ideal way forward for parso. This does not mean that your approach is bad at all, it just does more at runtime than it would necessarily need to. And I would suspect that the approach you chose would be a bit slower than mine. In general, however, I would argue that our solutions are pretty similar. We both have a way to define soft keywords and then backtrack. The big difference is really where we calculate "alternative" plans.

My approach does pretty much the following:

  1. Normal LL parsing whenever possible, so all LL optimizations actually work.
  2. It handles direct left recursions by pushing something on the stack (this is not important for parso for now, so I won't explain it here).
  3. For PEG-like rules that have the same "first token" (see below, where bar starts with the same token, "hi"), we always start matching the first alternative (in this case "hi" "baz"). If it does not work out, we just continue with the second "branch". This works by saving a backtracking point in the tokenizer and in the tree. It's IMO not that hard, and the position where this occurs can be precalculated, as sketched right after the example below.
foo: "hi" "baz" | bar "foo"
bar: "hi"+
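
A toy illustration of that precalculation; the first sets are written out by hand here instead of being derived from a real grammar:

from itertools import combinations

def backtracking_points(rule):
    """Return pairs of alternatives whose first tokens overlap; only
    those pairs need a saved tokenizer/tree checkpoint at runtime."""
    points = []
    for a, b in combinations(rule['alternatives'], 2):
        if set(a['first']) & set(b['first']):
            points.append((a['name'], b['name']))
    return points

foo = {'alternatives': [
    {'name': '"hi" "baz"', 'first': ['hi']},
    {'name': 'bar "foo"', 'first': ['hi']},  # bar: "hi"+ also starts with "hi"
]}
print(backtracking_points(foo))  # [('"hi" "baz"', 'bar "foo"')] -> needs a checkpoint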

What do you think about this approach? It is IMO interesting, because it does not need to calculate the superior_plan (that's what you called it in your patch) at runtime. It does it at "parser create time". :) It's a small optimization, but I think it's worth it. What do you think?

Please also let me know if you don't understand what I just tried to explain. It's pretty hard to explain it well and I don't feel like having done a good job. :)

@isidentical
Collaborator Author

What do you think about this approach? It is IMO interesting, because it does not need to calculate the superior_plan (that's what you called it in your patch) at runtime. It does it at "parser create time". :) It's a small optimization, but I think it's worth it. What do you think?

I think it makes sense, especially considering that my approach was more of a brute force than a proper calculation.

@davidhalter
Owner

Do you want to implement it with a different approach? I wouldn't be too unhappy, since I've already implemented it in another language :)

@isidentical
Collaborator Author

Do you want to implement it with a different approach?

Nope, feel free to go with the pre-calculation. I might try to give it a shot, but I'm pretty stacked up until next month, so if anybody is up for the implementation, that would be great!

@frenzymadness

Hello. Do we have a plan for this feature needed for full Python 3.10 support?

@davidhalter
Owner

Currently no. I know how this would work with an LL grammar and I'm happy to help people work on it, but for now I have other priorities. For Jedi this is mostly not really an issue, since autocompletions/goto will still work. The error listing is probably broken in 3.10 unfortunately. For everyone who wants to parse 3.10 grammar and work with that, I guess you are out of luck at the moment.

@stnley

stnley commented Jul 2, 2022

@davidhalter Is there a possibility to sponsor this feature?

@davidhalter
Owner

Yes. Just note that this is a few days of work (might even be weeks), so giving me 5$ is not going to cut it :)

@stnley

stnley commented Jul 3, 2022

Totally understandable - if you don't mind I'll reach out via email?

@davidhalter
Owner

Yes, sure.

@davidhalter
Owner

@stnley Did my email answers reach you? I'm just rechecking here.
