[close #64 #76] Refactor CodeLine add CleanDocument "sweep" class #78

schneems · 2021-10-11T01:45:59Z

Refactor CodeLine

Previously CodeLine knew how to "lex" itself. This strategy worked for many cases but failed pretty hard in cases where context can make the same line return drastically different results.

For example, by itself this line:

EOM

Will be "lex"-d into a constant while more lines:

foo = <<~EOM
  Hello world
EOM

It would be understood as a heredoc.

Distinct "clean" phase added via `CleanDocument`

Even after we've "injected" the lexical information into CodeLine based on lexing the entire document, there are still problems with special cases such as #64, where multiple lines represent a single logical operation. Heredocs are also painful to work with.

Previously, a HeredocBlockParse would put all heredocs into a CodeBlock at the beginning of the CodeSearch. There was also a TrailingSlashJoin and a method for removing code comments.

All of that functionality has been moved to a single class, CleanDocument, responsible for prepping the source to be in a good state to be fed into CodeSearch.

Join consecutive logical blocks in `CleanDocument`

To close #64, the CleanDocument can use information in the lex line to join multiple "consecutive" lines like:

User.
  where(name: 'schneems')

This code style is a problem because the second line looks like valid ruby code in isolation, but if it's removed, an error might be introduced (due to the trailing dot on User.)

More notes around this here https://gist.github.com/schneems/6a7d7f988d3329fb3bd4b5be3e2efc0c and #76.

Essentially it uses lex information to join into a single line:

"User.\nwhere(name: 'schneems')\n"

Once in a single line, the search algorithm cannot accidentally introduce an error into the document by removing part of it.

CaptureCodeContext

This class is now being billed as a "third" phase for re-introducing ambiguity where it should logically exist. Previously we were handling the case where a keyword was missing an end at the end of a block like:

class Dog
  def bark
end

It was also determined there's a logical inverse of this that wasn't being covered:

class Dog
  end
end

It's now handled.

For the capture_before_after_kws method, it was determined that it only needed to remove ambiguity if there's only one visible line. Adding this check cleaned up several outputs and didn't remove any critical (or ambiguous) lines.

Other

Add a bunch of docs all over the place
Add deprecation to requiring dead_end/fyi
Move LexValue class to its own file

## Refactor CodeLine Previously CodeLine knew how to "lex" itself. This strategy worked for many cases but failed pretty hard in cases where context can make the same line return drastically different results. For example, by itself this line: ``` EOM ``` Will be "lex"-d into a constant while more lines: ``` foo = <<~EOM Hello world EOM ``` It would be understood as a heredoc. ## Distinct "clean" phase added via `CleanDocument` Even after we've "injected" the lexical information into `CodeLine` based on lexing the entire document, there are still problems with special cases such as #64, where multiple lines represent a single logical operation. Heredocs are also painful to work with. Previously, a HeredocBlockParse would put all heredocs into a `CodeBlock` at the beginning of the `CodeSearch`. There was also a `TrailingSlashJoin` and a method for removing code comments. All of that functionality has been moved to a single class, `CleanDocument`, responsible for prepping the source to be in a good state to be fed into `CodeSearch`. ## Join consecutive logical blocks in `CleanDocument` To close #64, the `CleanDocument` can use information in the lex line to join multiple "consecutive" lines like: ``` User. where(name: 'schneems') ``` This code style is a problem because the second line looks like valid ruby code in isolation, but if it's removed, an error might be introduced (due to the trailing dot on `User.`) More notes around this here https://gist.github.com/schneems/6a7d7f988d3329fb3bd4b5be3e2efc0c and #76. Essentially it uses lex information to join into a single line: ``` "User.\nwhere(name: 'schneems')\n" ``` Once in a single line, the search algorithm cannot accidentally introduce an error into the document by removing part of it. ## CaptureCodeContext This class is now being billed as a "third" phase for re-introducing ambiguity where it should logically exist. Previously we were handling the case where a keyword was missing an end at the end of a block like: ``` class Dog def bark end ``` It was also determined there's a logical inverse of this that wasn't being covered: ``` class Dog end end ``` It's now handled. For the `capture_before_after_kws` method, it was determined that it only needed to remove ambiguity if there's only one visible line. Adding this check cleaned up several outputs and didn't remove any critical (or ambiguous) lines. ## Other - Add a bunch of docs all over the place - Add deprecation to requiring `dead_end/fyi` - Move LexValue class to its own file

## Perf difference Benchmarking with `ruby_buildpack.rb.txt`: Before: 0.195756 0.005201 0.200957 ( 0.202021) After: 0.148460 0.003634 0.152094 ( 0.152692) ## Profile code To generate profiler output, run: ``` $ DEBUG_PERF=1 bundle exec rspec spec/integration/dead_end_spec.rb ``` See the readme for more details. You can do that against the commit before this one and this one to see the difference. Before sha: 58a8d74 After sha: 0b4daca74fab5dc2979813461c0f2649951185e5 ## How I found the issue I was using ruby prof to try to find perf opportunities when I saw this output from the `CallStackPrinter`: ![](https://www.dropbox.com/s/7m6wi502fqoel2g/Screen%20Shot%202021-11-03%20at%204.03.10%20PM.png?raw=1) I noticed a lot of time spent in `CleanDocument`, which was curious as it's an N=1 process that happens before we do any looping or iteration. I also saw output from the `RubyProf::GraphHtmlPrinter` that we were creating many many LexValue objects. ## The fix The `CleanDocument` class removes lines with a comment. It first runs lex over the document, then uses the lex output to determine which lines have comments that it can remove. Once the comments are removed, the lexer is rerun as you can produce different results with comments removed (specifically `:on_ignored_nl`)—more info on this PR #78. Using lex data to remove comments works but is brute force. It means we must lex the document twice, which is expensive and generates many LexValue objects. Instead of removing lines based on lex values, we can remove lines based on regex. One caveat is that we can't distinguish via regex if someone is in a heredoc or not. If a line looks like it could be a heredoc string embed like: ```ruby source <<~EOM #{here} < ==== Looks like a comment when viewed without context EOM ``` Then we don't remove the line as it's doubtful someone will create a comment with a `#{` pattern in a place we care about (i.e., in-between method calls) like: ```ruby User #{ this would be weird .first ``` With this approach, we reduce lexing passes from N=2 to N=1. ## Profiler thoughts As a side note, this didn't show up as a hotspot in the perf tools. Looking back at the screenshots, it's not obvious that there was a problem here, or I could get a 1.3x speed boost by making this change. The only reason I investigated is I knew the code well and believed that it shouldn't be spending much time here. The amount reported isn't outrageously high, it's just surprising to me. Even running the profiler before and after, it's not clear it got faster based on the output: ![](https://www.dropbox.com/s/j8t7xd0i5unyim3/Screen%20Shot%202021-11-03%20at%204.11.55%20PM.png?raw=1) (Before is top & after is bottom) You can see before was 1.1264283200034697 while after was 0.8452127130003646 at the top. It's not clear why the output for `CleanDocument` shows 17% after when it was only 12% before. Side by side, it maybe makes a bit more sense in that the problem dropped in relative priority down the list. However, I'm still not entirely clear what the numbers are indicating: ![](https://www.dropbox.com/s/jgxitymh2wtoxpr/Screen%20Shot%202021-11-03%20at%204.30.08%20PM.png?raw=1) (Before left, After, right) You can see the HTML results for yourself https://www.dropbox.com/sh/kueagukss7ho2rm/AABk3V8FQXaFWugKI21ADZJ0a?dl=0

schneems force-pushed the schneems/trailing-leading-dot branch from ecc431d to baf8bf9 Compare October 11, 2021 02:41

schneems changed the title ~~[close #64] WIP~~ [close #64 #76] Refactor CodeLine add CleanDocument "sweep" class Oct 11, 2021

schneems marked this pull request as ready for review October 11, 2021 02:43

schneems mentioned this pull request Oct 11, 2021

Perf timeout #77

Closed

schneems merged commit 1a28a81 into main Oct 11, 2021

schneems mentioned this pull request Nov 4, 2021

~1.32x Faster checking with CleanDocument regex #105

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[close #64 #76] Refactor CodeLine add CleanDocument "sweep" class #78

[close #64 #76] Refactor CodeLine add CleanDocument "sweep" class #78

schneems commented Oct 11, 2021 •

edited

Loading

[close #64 #76] Refactor CodeLine add CleanDocument "sweep" class #78

[close #64 #76] Refactor CodeLine add CleanDocument "sweep" class #78

Conversation

schneems commented Oct 11, 2021 • edited Loading

Refactor CodeLine

Distinct "clean" phase added via CleanDocument

Join consecutive logical blocks in CleanDocument

CaptureCodeContext

Other

schneems commented Oct 11, 2021 •

edited

Loading

Distinct "clean" phase added via `CleanDocument`

Join consecutive logical blocks in `CleanDocument`