Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: use reverse post-order (RPO) traversal for morph #93246

Closed
9 of 12 tasks
AndyAyersMS opened this issue Oct 9, 2023 · 8 comments
Closed
9 of 12 tasks

JIT: use reverse post-order (RPO) traversal for morph #93246

AndyAyersMS opened this issue Oct 9, 2023 · 8 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI User Story A single user-facing feature. Can be grouped under an epic.
Milestone

Comments

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Oct 9, 2023

Morph, like many JIT phases, visits all the blocks in a method by following the bbNext chain. There's a missed opportunity here to cheaply propagate some information from block to block.

For "forward" phases like morph it is often preferable to visit the blocks in reverse post-order (RPO). An RPO ensures that for most blocks all the predecessors of the block have been visited before the block itself.

Currently value numbering implements an RPO visit. It's also possible to create an RPO using fgDfsReversePostorder.

The initial goal of this work is to modify morph to rely on RPO, and then to enable a simple form of global assertion prop for Morph (aka "cross-block local assertion prop") that can push facts forward across block boundaries. #86822 has a prototype implementation. The main remaining challenges there are to make the RPO efficient and to properly handle cases where control flow is altered.

There will be extra TP cost from the RPO and from the assertion changes; the hope is that these are paid back by improved optimizations and that this entire change can be zero-cost.

Note for "backwards" phases post-order provides similar benefits (all successors likely visited before a block is visited).

cc @dotnet/jit-contrib

@AndyAyersMS AndyAyersMS added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 9, 2023
@AndyAyersMS AndyAyersMS added this to the 9.0.0 milestone Oct 9, 2023
@ghost
Copy link

ghost commented Oct 9, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Morph, like many JIT phases, visits all the blocks in a method by following the bbNext chain. There's a missed opportunity here to cheaply propagate some information from block to block.

For "forward" phases like morph it is often preferable to visit the blocks in reverse post-order (RPO). An RPO ensures that for most blocks all the predecessors of the block have been visited before the block itself.

Currently value numbering implements an RPO visit. It's also possible to create an RPO using fgDfsReversePostorder.

The initial goal of this work is to modify morph to rely on RPO, and then to enable a simple form of global assertion prop for Morph (aka "cross-block local assertion prop") that can push facts forward across block boundaries. #86822 has a prototype implementation. The main remaining challenges there are to make the RPO efficient and to properly handle cases where control flow is altered.

  • (Q4 23) Implement RPO for Morph, and enable cross-block assertion prop
    • Revise local assertion prop to stop clearing the assertion table and track live assertions via bit vectors
    • Defer some of the control flow expansion (fgCreateGCPoll) from morph
    • Come up with a strategy to ensure that the remaining control flow changes in morph are "safe" in that they do not invalidate assumptions made earlier about the flow into a block
    • Implement the assertion merging algorithm
    • Revise morph to visit in RPO
    • Enable cross-block assertion prop
    • (stretch) avoid visiting unreachable blocks, and drop them from the RPO (perhaps delete the eagerly)
    • (stretch) refine RPO on the fly to avoid visiting blocks that become unreachable because of cross-block assertion prop
    • (stretch) do QMARK expansion before morph, remove bespoke handling in morph for QMARK assertions (see JIT: expand qmarks earlier #86778)
  • (stretch) Implement an efficient, reusable RPO traversal; look for other phases that can benefit from RPO
  • (stretch) Rework value numbering to use the reusable traversal

There will be extra TP cost from the RPO and from the assertion changes; the hope is that these are paid back by improved optimizations and that this entire change can be zero-cost.

Note for "backwards" phases post-order provides similar benefits (all successors likely visited before a block is visited).

cc @dotnet/jit-contrib

Author: AndyAyersMS
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: 9.0.0

@AndyAyersMS AndyAyersMS self-assigned this Oct 9, 2023
@JulieLeeMSFT JulieLeeMSFT added the User Story A single user-facing feature. Can be grouped under an epic. label Oct 9, 2023
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Oct 25, 2023
Instead of merging returns to the common return block in morph, do all the
merging in `fgAddInternal` (where we already did some merging). This removes
a case where morph would add a control flow edge in a way that might disrupt
an ongoing RPO.

Earlier merging also opens up the possibility of tail merging some of the copies
into the canonical return local, and possibly even some of the computations that
feed the copies.

Modify the flow alterations done by morph. Previously if a tail call was expressed
via a call to a `CORINFO_TAILCALL_HELPER`, morph would change the block kind to
`BBJ_RETURN` and then merge the return, changing the block kind to `BBJ_ALWAYS`.
Since merging now happens before moprh, morph needs to leave the block kind alone.

Generalize the post-tail-call sanity check in morph to recognize one new case
that can come up.

Contributes to dotnet#93246.
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Oct 28, 2023
Adapt `fgValidateIRForTailCall` to use as a utility to verify that the
JIT has not added IR that would make a tail call invalid. Currently all
our cases pass this check so no tail calls are invalidated.

Remove `fgCheckStmtAfterTailCall` as it did less thorough and less
correct checking.

Contributes to dotnet#93246.
AndyAyersMS added a commit that referenced this issue Oct 30, 2023
Remove `fgCheckStmtAfterTailCall` as it did less thorough and less
correct checking.

Contributes to #93246.
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Oct 30, 2023
@AndyAyersMS
Copy link
Member Author

For the "dynamic RPO" -- implementing this for morph seems problematic. Value Numbering (VN) has one, but it relies on the loop table, which (roughly speaking) encodes the results of a prior DFS traversal.

VN keeps track of two sets of blocks, those not yet visited with all preds visited, and those not yet visited with some preds visited. As a visit finishes it walks the pred lists of all successors, looking to see if any of those are unvisited and have all preds visited. If so, the block is added to the first set; if not, it is added to the second.

Blocks are preferentially processed from the "all preds visited" set, until this is empty. At that point, if the "some preds visited" set is not empty, VN needs to choose a block from this set, and to do so it relies on loop structure to find an outermost loop block, to maximize the odds of subsequent visits being all preds visited cases.

Without this hint there's no guarantee that the visitation order for VN resembles an RPO.

(Aside: it also seems like VN's approach could be streamlined somewhat by keeping track of the number of unvisited preds; when a visit finishes and we're enumerating the block successors we can just decrement their waiting numbers, moving them to the zero list where appropriate, rather than scanning all their preds. But I don't recall the value number state manipulation being especially costly, so perhaps not worth the trouble?)

For morph there's no loop structure around to consult; we'd need a DFS or equivalent to establish one, at which point we may as well just use the DFS itself to establish an RPO.

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Oct 30, 2023

In the "dynamic" case in the absence of a loop table, you could rely on the evolving spanning tree to figure out which of the "some preds visited" set you should visit first; preferentially choose based on (a) fewer unvisited preds; then (b) closest to root visited pred (or maybe just closest to root for the pred that was visited first).

Say in a doubly nested loop both loop entries have one unvisited pred: the outer loop entry would have had preds visited before any inner loop pred, so it should go first (the outer loop entry should be a spanning tree ancestor of the inner loop entry).

This would require a bit of extra state or you could perhaps reconstruct the spanning tree from the existing data (or leverage the slots in the blocks with pre/post order numbers and do it that way, and/or manipulate state on edges or give edges an ID and track which ones are in the spanning tree via a bit vector or something).

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Oct 31, 2023

Thinking about it some more, it seems difficult to get the dynamic case to work properly without some kind of prior analysis of the graph structure. EG for

image - 2023-10-31T081148 374

Once A has been processed, a dynamic RPO needs to decide if B or C should come next. The right answer is B, but both B and C are at the same depth in the spanning tree, and have the same spanning tree parent, so there's no way to use the spanning tree to decide which node should come next.

Here B has an obvious self-loop, and C has no successors, but straightforward elaborations of the graph above preserve the problem of choosing B or C without having those features.

It seems plausible that replacing the "dynamic PGO" done in value numbering with an actual RPO might both improve throughput and give better results, since not all loops are in the loop table (eg cold clones). Worth trying someday.

AndyAyersMS added a commit that referenced this issue Oct 31, 2023
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Oct 31, 2023
When optimizing, process blocks in RPO. Disallow creation of new blocks
and new flow edges (the latter with certain preapproved exceptions).

Morph does not yet take advantage of the RPO to enable more optimization.

Contributes to dotnet#93246.
AndyAyersMS added a commit that referenced this issue Nov 2, 2023
When optimizing, process blocks in RPO. Disallow creation of new blocks
and new flow edges (the latter with certain preapproved exceptions).

Morph does not yet take advantage of the RPO to enable more optimization.

Contributes to #93246.
@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Nov 2, 2023

Challenges in modifying local assertion prop:

  • Currently assumes any assertion in the table is valid. Instead, it needs to consult the appropriate bit vector.
  • In places where it does consult a bit vector, has extra checks to ensure bit vector indices remain in bounds, this is needed as we use apFull for local prop, so it can have bits set that are beyond the actual current assertion count. This will no longer be needed.
  • Table size is 64, refreshed for each block.
    • Some blocks have more than 64 live assertion gens and we're losing them now.
    • Globally some methods have more than 64 assertion gens, so when we stop clearing the table it will sometimes need to be larger than 64 (in asp.net there are cases that need 960 or so assertions). This will have a TP and memory cost. Empirically it seems like the number of local vars is a decent predictor of the table size, eg # assertions ~ 0.6 * # locals.
    • In the current scheme assertions are forcibly removed from the table when killed, so current scheme can actually handle blocks with > 64 assertions, provided they're not all live at once... new scheme will not be able to remove things from the table like this as (ultimately) other blocks may be keeping that assertion live
    • We need to guess the table size up front before we morph anything, and if we guess too small we'll lose some assertions. Unlike the current scheme where assertions are only lost within large basic blocks, in the new scheme we'll start to lose them no matter the block size when we reach the tail of the RPO. This may be undesirable.
    • We could try and curtail the number of assertions per block like we do now and see if that helps keep from running out of table space late in the RPO, but that seems problematic given the current compaction scheme. Also note eventually we may want to enable more assertion gen (say the JTRUE edge cases...) so we need to be able to handle lots of assertions somehow.
  • Initial swag on TP impact doesn't look good; need to drill in here.

The plan for now:

  • get things correct and (near) zero diff with an apLocal bit vector tracking live assertions, reset along with the assertion table at every block boundary, and fix any serious TP issues (JIT: revise local assertion prop to use bit vectors #94322)
  • introduce a global table (remove the per-block table reset, but keep the per-block live assertion reset). This will introduce diffs from table sizing issues as noted above, and may also need some TP work.
  • record the out assertion sets on each block, and implement a block merging algorithm taking into account the special blocks that must have empty live in sets... but still clear apLocal on entry. Fix TP issues.
  • stop resetting the apLocal on entry, instead use the merged pred set... hopefully wins in code size and TP!

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Nov 2, 2023
…n tracking

Track the set of active local assertions via a bit vector, rather than assuming
all entries in the table are live.

Doing so required a number of changes in assertion prop to ensure the vector
is consulted before deciding an assertion is valid.

This will (eventually) allow us to propagate assertions cross-block. For
now we reset the bit vector and assertion table back to empty at the start
of each block so nothing propagates past the end of a block.

The table can fill and cause the JIT to miss assertions in very large blocks
as morph will no longer remove assertions while processing a block. Previously
this would happen if there were more than 64 live assertions in a block, and
now it can happen if there are more than 64 assertions in block (so somewhat
more frequently).

Contributes to dotnet#93246.
AndyAyersMS added a commit that referenced this issue Nov 3, 2023
…n tracking (#94322)

Track the set of active local assertions via a bit vector, rather than assuming
all entries in the table are live.

Doing so required a number of changes in assertion prop to ensure the vector
is consulted before deciding an assertion is valid.

This will (eventually) allow us to propagate assertions cross-block. For
now we reset the bit vector and assertion table back to empty at the start
of each block so nothing propagates past the end of a block.

The table can fill and cause the JIT to miss assertions in very large blocks
as morph will no longer remove assertions while processing a block. Previously
this would happen if there were more than 64 live assertions in a block, and
now it can happen if there are more than 64 assertions in block (so somewhat
more frequently).

Contributes to #93246.
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Nov 3, 2023
During global morph, allow assertions to propagate to a block from the block's
predecessors. Handle special cases where we can't allow this to happen:
* block has preds that have not yet been morphed
* block has no preds
* block is specially flagged as one that might gain new preds during morph
* block is an EH handler entry

Contributes to dotnet#93246.
@AndyAyersMS
Copy link
Member Author

Thoughts about how to get this to the point where it can be merged

(copied from #94363 (comment))

Throughput

I think I can claw some of the TP back by

(1) optimizing search loops in assertion prop. We should generally never search the entire table (save when adding an assertion) and instead walk the live assertion set, and for local prop we should intersect that set with the dep vector.

The bit vector operations aren't free so it might make sense to do this only when there are sufficient numbers of assertions, though that makes the code uglier. Perhaps it can all be hidden behind a suitably smart enumerator.

(2) stop morphing blocks that are unreachable or become unreachable because of local assertion prop.

Aside from removing all the morph related overhead, I suspect sometimes this dead block morphing creates issues for live blocks, eg either useless assertions that cause the table to fill faster, or else global actions (say DNER) that are hard to undo.

Currently the RPO strategy won't allow for removal of dead EH and may or may not make it easy to remove unreachable cyclic flow.

Both of these can be spun off as preliminary checkins, though the full benefits may not be seen without this change, as main has small bit vectors and isn't able to propagate constants nearly as aggressively.

At the end of the day though I expect there will still be a TP impact.

Code Size / Code Quality

With the massive numbers of diffs any sort of manual assessment is going to be a random sampling at best. I will try and look at some of the bigger regressions, possibly suppressing cloning to help remove that as a factor.

The more aggressive copy prop that this enables puts pressure on LSRA, as copies to temps provide natural split points. I am not sure we can find effective heuristics to counterbalance this.

I think it makes sense to check all this in initially as off-by-default, and enable runs in the perf lab, and use that data to both identify CQ regressions and to decide if the CQ improvements justify the additional TP costs.

@AndyAyersMS
Copy link
Member Author

One other note here for posterity -- I spent some time trying to see if there was a way to remove assertions to try and keep the index set more compact. For example, if an assertion born in some block (created for the first time) is no longer live at the end of the block, it can be erased with somewhat minimal effort: scrub it from any dep vector, remove it from the table, and adjust the bits in the apLocal to compensate (easy if this removed assertion has a higher number than any live new assertion, a bit of work if not).

I gathered some data on this and it did not look promising. Most times assertions don't get killed within blocks; they die at merge points, and most of the time when they do get killed they're not numbered higher than any new live one. So I did not pursue this.

There is also one test case in coreclr_run that generates upwards of 40K assertions, none of which are killed. With a global table this method was regressing by 100K bytes as those assertions are useful. The fix for this was to cap the table size to the max tracked, and if there are more locals than this when morph runs, just run a non-cross block version with 64 entry table like we've been doing all along. The problematic case also has 40K locals.

It's still possible to fill the table with junk assertions; I will need to look at some data on how often we lose assertions with the new table resizing. But some of this is inevitable with a simplistic forward push of facts. One can imagine trying to prioritize locals (we have RCS_EARLY counts we could use say), but I don't know yet if something like that is worth the trouble.

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Nov 7, 2023
We don't need quite this broad of a start node set, as the DFS will
find unreachable blocks on its own.

Also lay the groundwork for removing unreachable blocks, by tracking
the postorder number of the last reachable block.

Contributes to dotnet#93246.
AndyAyersMS added a commit that referenced this issue Nov 10, 2023
During global morph, allow assertions to propagate to a block from the block's
predecessors. Handle special cases where we can't allow this to happen:
* block has preds that have not yet been morphed
* block has no preds
* block is specially flagged as one that might gain new preds during morph
* block is an EH handler entry

Contributes to #93246.

When enabled, size the assertion table based on the number of locals, up to the tracked limit.

Disabled by default; use 'DOTNET_JitEnableCrossBlockLocalAssertionProp=1` to enable.
AndyAyersMS added a commit that referenced this issue Nov 10, 2023
Remove the up-front computation of enter blocks and dom
start nodes from the DFS computations used for RPO.

Also lay the groundwork for removing unreachable blocks, by tracking
the postorder number of the last reachable block.

Contributes to #93246.
AndyAyersMS added a commit that referenced this issue Nov 13, 2023
Leverage the "dep vectors" to avoid the search the assertion table during local assertion prop. Helps the current (small table) behavior some, helps the future cross-block (larger table) behavior more.

Similar tricks may be possible for global AP, though the set of assertions there is more varied.

Avoid merging assertions from statically unreachable preds. For OSR methods ensure the original method entry is considered reachable (as it may be).

Contributes to #93246.
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Nov 15, 2023
Now that we can propagate assertions across block boundaries we can
generate assertions for true/false branch conditions and propagate
them along the corresponding edges.

Contributes to dotnet#93246.
AndyAyersMS added a commit that referenced this issue Nov 15, 2023
…94741)

Now that we can propagate assertions across block boundaries we can
generate assertions for true/false branch conditions and propagate
them along the corresponding edges.

Contributes to #93246.
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Nov 16, 2023
Fix condition under which we can share a pred's assertion out vector.
Add the ability to disable cross-block local assertion prop via range.

Contributes to dotnet#93246.
AndyAyersMS added a commit that referenced this issue Nov 17, 2023
Fix condition under which we can share a pred's assertion out vector.
Add the ability to disable cross-block local assertion prop via range.

Contributes to #93246.
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Nov 17, 2023
Float relop equality does not imply bitwise equality. So skip
making local jtrue assertions about floating types.

Contributes to dotnet#93246.
AndyAyersMS added a commit that referenced this issue Nov 18, 2023
Float relop equality does not imply bitwise equality. So skip
making local jtrue assertions about floating types.

Contributes to #93246.
@JulieLeeMSFT
Copy link
Member

Completed work items for .NET 9.

@github-actions github-actions bot locked and limited conversation to collaborators May 16, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI User Story A single user-facing feature. Can be grouped under an epic.
Projects
Status: Done
Development

No branches or pull requests

2 participants