Proposal: evolve RyuJIT emitter to improve flexibility #80589

BruceForstall · 2023-01-13T00:46:04Z

The RyuJIT emitter uses a machine-level program representation which is not designed for modification after it is generated, such as modification due to peephole optimization or other low-level code transformations. It has long been known that certain "easy" optimizations could be made by adding peephole optimization capability, and a few specific cases have been added with severe limitations due to the current representation. This issue describes some of the problems and proposes incremental changes to increase its flexibility.

Current status

RyuJIT IR (internal representation) for representing a program has three basic forms:

tree-based GenTrees, sometimes referred to as HIR (high-level IR) to distinguish it from LIR and reuse the terminology used in some compiler literature
linear GenTree LIR (linear IR, or low-level IR), used after Lowering
Instruction Groups + Instruction Descriptors (IG + ID; in code: insGroup + instrDesc). I propose to call this MIR for machine-level IR, since it represents machine instructions and arguments that can be used for binary encoding. (Note that some compiler literature instead uses MIR to stand for "medium-level IR".)

In RyuJIT, MIR is a single-linked list of IGs, each of which contains a packed memory buffer of variable-sized instrDescs. There is no facility to navigate to a previous instrDesc from an instrDesc pointer, delete an instrDesc, or insert a new one. An IG roughly corresponds to a BasicBlock in LIR and stores the GC state that exists at the beginning of the IG. Each instrDesc encodes enough information to describe how the instruction affects GC state.

During code generation (codegen), MIR is incrementally created by iterating over LIR and generating machine instructions as IG/ID lists. After codegen, branch tightening and alignment processing occurs. After this point, we know how much memory to request from the VM, and the final memory space for the generated code is allocated. MIR is traversed, encoding instructions into this memory, and also computing and updating GC information, collecting GC state changes at final instruction offsets to hand back to the VM.

Final instruction offsets are needed for other needs as well: (1) IP->native mapping, (2) variable live ranges, (3) unwind information.

One key mechanism used to compute final instruction offsets is the emitLocation type. During code generation, any time a location in the final code stream is needed, an emitLocation is created. It's basically a pointer to the current or specified location of generated instrDescs. Crucially, it can give the actual code offset of this location after branch tightening has changed the initial estimated instruction offsets and we have final offsets.

However, making changes to the instrDesc instruction stream after an emitLocation has captured a particular location might invalidate the emitLocation, thus requiring anyone making such modification to find and update all the appropriate captured emitLocation, which is currently difficult and costly. Thus, we currently do not have any machine-level optimizations that remove any already-generated instructions.

Proposal: Remove `emitLocation`

To improve emitter flexibility, eliminate the use of emitLocation. These are all cases where the code needs a final instruction offset, but during codegen, that is not available. Instead of emitLocation, create new instrDesc representing the thing for which location information is needed, such as an IP->native mapping or the birth or death of a variable. These "pseudo-instructions" (or "MIR nodes") would be represented within the instrDesc list but would not generate any code. Branch tightening would leave them alone, but they would retain their relative position. Then, either before or after instruction encoding, as needed, walk the MIR to process these pseudo-instructions as necessary, such as building up and reporting variable live range data structures.

With this, a peephole optimization would be free to delete an instruction without worrying about updating any "pointers" to the instruction being removed. The optimizations would need to be aware of non-code MIR nodes, however, and adjust behavior appropriately. This is good, though, as those nodes represent semantics that would need to be respected by the optimization.

Proposal: Support backwards navigation

To support peephole optimizations, the JIT currently maintains a "last instruction" pointer (and soon might maintain a larger, though fixed-size and limited, history of "last instruction" pointers to support looking backwards more than a single instruction).

An alternative is to introduce a "previous" pointer to the instrDesc.

Currently, given an instrDesc*, you can use emitAdvanceInstrDesc() or emitNextID() to move to the next instrDesc in the list. To move backwards, we could add a byte to an instrDesc that specifies the number of bytes to subtract from an instrDesc* to reach the previous one, or zero if there is no previous one in the ID list for this IG. This presumes that all instrDesc are less than 256 bytes (which I believe is true, but otherwise we could use a short). To support going backwards across IGs, we would need to add an insGroup* igPrev "previous" pointer to insGroup. Then, add emitter::emitPrevInstrDesc(insGroup** ig, instrDesc** id) for navigation.

Alternatives

This proposal is an incremental change over the existing RyuJIT data structures. An interesting alternative is to scrap using IG/ID as MIR altogether, and reuse the existing linear GenTree representation, and LIR operations, instead. This would be done by creating new machine-level GenTree nodes, with machine-level instructions / opcodes and formats, much as is currently stored in instrDescs. Codegen would transform from one kind of GenTree to another.

This would re-use much existing mechanism, and possibly make it easier to do something like re-use our SSA package on machine-level IR. However, transforming the current code to this would require care to avoid rewriting a significant amount of very complicated instruction encoding logic. Also, memory usage and throughput would need consideration.

The text was updated successfully, but these errors were encountered:

ghost · 2023-01-13T00:46:11Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

The RyuJIT emitter uses a machine-level program representation which is not designed for modification after it is generated, such as modification due to peephole optimization or other low-level code transformations. It has long been known that certain "easy" optimizations could be made by adding peephole optimization capability, and a few specific cases have been added with severe limitations due to the current representation. This issue describes some of the problems and proposes incremental changes to increase its flexibility.

Current status

RyuJIT IR (internal representation) for representing a program has three basic forms:

tree-based GenTrees, sometimes referred to as HIR (high-level IR) to distinguish it from LIR and reuse the terminology used in some compiler literature
linear GenTree LIR (linear IR, or low-level IR), used after Lowering
Instruction Groups + Instruction Descriptors (IG + ID; in code: insGroup + instrDesc). I propose to call this MIR for machine-level IR, since it represents machine instructions and arguments that can be used for binary encoding. (Note that some compiler literature instead uses MIR to stand for "medium-level IR".)

In RyuJIT, MIR is a single-linked list of IGs, each of which contains a packed memory buffer of variable-sized instrDescs. There is no facility to navigate to a previous instrDesc from an instrDesc pointer, delete an instrDesc, or insert a new one. An IG roughly corresponds to a BasicBlock in LIR and stores the GC state that exists at the beginning of the IG. Each instrDesc encodes enough information to describe how the instruction affects GC state.

During code generation (codegen), MIR is incrementally created by iterating over LIR and generating machine instructions as IG/ID lists. After codegen, branch tightening and alignment processing occurs. After this point, we know how much memory to request from the VM, and the final memory space for the generated code is allocated. MIR is traversed, encoding instructions into this memory, and also computing and updating GC information, collecting GC state changes at final instruction offsets to hand back to the VM.

Final instruction offsets are needed for other needs as well: (1) IP->native mapping, (2) variable live ranges, (3) unwind information.

One key mechanism used to compute final instruction offsets is the emitLocation type. During code generation, any time a location in the final code stream is needed, an emitLocation is created. It's basically a pointer to the current or specified location of generated instrDescs. Crucially, it can give the actual code offset of this location after branch tightening has changed the initial estimated instruction offsets and we have final offsets.

However, making changes to the instrDesc instruction stream after an emitLocation has captured a particular location might invalidate the emitLocation, thus requiring anyone making such modification to find and update all the appropriate captured emitLocation, which is currently difficult and costly. Thus, we currently do not have any machine-level optimizations that remove any already-generated instructions.

Proposal: Remove `emitLocation`

To improve emitter flexibility, eliminate the use of emitLocation. These are all cases where the code needs a final instruction offset, but during codegen, that is not available. Instead of emitLocation, create new instrDesc representing the thing for which location information is needed, such as an IP->native mapping or the birth or death of a variable. These "pseudo-instructions" (or "MIR nodes") would be represented within the instrDesc list but would not generate any code. Branch tightening would leave them alone, but they would retain their relative position. Then, either before or after instruction encoding, as needed, walk the MIR to process these pseudo-instructions as necessary, such as building up and reporting variable live range data structures.

With this, a peephole optimization would be free to delete an instruction without worrying about updating any "pointers" to the instruction being removed. The optimizations would need to be aware of non-code MIR nodes, however, and adjust behavior appropriately. This is good, though, as those nodes represent semantics that would need to be respected by the optimization.

Proposal: Support backwards navigation

To support peephole optimizations, the JIT currently maintains a "last instruction" pointer (and soon might maintain a larger, though fixed-size and limited, history of "last instruction" pointers to support looking backwards more than a single instruction).

An alternative is to introduce a "previous" pointer to the instrDesc.

Currently, given an instrDesc*, you can use emitAdvanceInstrDesc() to move to the next instrDesc in the list (it currently can only move within a single IG, and not move to subsequent IGs). To move backwards, we could add a byte to an instrDesc that specifies the number of bytes to subtract from an instrDesc* to reach the previous one, or zero if there is no previous one in the ID list for this IG. This presumes that all instrDesc are less than 256 bytes (which I believe is true, but otherwise we could use a short). To support going backwards across IGs, we would need to add an insGroup* igPrev "previous" pointer to insGroup. Then, add emitter::emitPrevInstrDesc(insGroup** ig, instrDesc** id) for navigation.

Alternatives

This proposal is an incremental change over the existing RyuJIT data structures. An interesting alternative is to scrap using IG/ID as MIR altogether, and reuse the existing linear GenTree representation, and LIR operations, instead. This would be done by creating new machine-level GenTree nodes, with machine-level instructions / opcodes and formats, much as is currently stored in instrDescs. Codegen would transform from one kind of GenTree to another.

This would re-use much existing mechanism, and possibly make it easier to do something like re-use our SSA package on machine-level IR. However, transforming the current code to this would require care to avoid rewriting a significant amount of very complicated instruction encoding logic. Also, memory usage and throughput would need consideration.

Author:	BruceForstall
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

BruceForstall · 2023-01-13T00:47:02Z

@dotnet/jit-contrib

markples · 2023-01-13T23:23:40Z

Thanks for writing this helpful summary and proposal, Bruce.

One thing that stands out to me is the specific issue of stale pointers (emitLocation) vs other concerns, and this leads me to wonder what kinds of abstractions could help in general, be incremental changes, etc. For example, an abstract location could be implemented today by a pointer, then changed to something more symbolic, potentially back to a pointer to a linked list node (if we ended up there), and so on. If it's abstract, then the group could control it (update it when necessary). A sufficient abstraction could even allow backwards traversal to be implemented by rewalking from the group start (not really suggesting exactly that, but it's an example of a time/space tradeoff that could happen).

Whether those abstract locations point to real instructions or a kind of pseudo-instruction would be TBD. Note that this framing is retaining the "external reference to the IR view" - so the pseudo-instructions basically become labels. The need for a location is reduced or goes away completely if everything is in the IR itself.

I think the reason I've seen it this way is that it feels like the stale pointer and "it's hard to update" from become intertwined. In a very abstract sense, it doesn't matter to a peephole optimizer if the "additional information" is on an instruction or in a separate one or even in a side table - assuming that the optimizer has access to the info in order to manipulate it. I suppose some cases "just work" (ldr -> ldp might be able to ignore/persist the effects of the first load, add the effects of the second, delete the ldr, add a ldp), but is that generally true?

Sufficient abstraction would allow additional storage tricks to be used. For example, if updates are fairly few, insertion can be implemented by replacing an instrDesc with an "outline" instrDesc that points to a buffer (or whatever) with N instructions. If next, prev, traversal, etc., all just work, then this is possible. If every piece of code needs to watch out for the "outline" case, then it becomes quite messy.

BruceForstall · 2023-01-14T00:03:21Z

Thanks for the comments Mark.

emitLocation today is actually fairly abstract: it holds a <IG*, instruction number within IG> tuple. The problem we have is that these are effectively "external pointers" to the MIR (that is, IG/ID stream) that might need to be updated if MIR modifications/optimizations take place. E.g., for the "ldr(1)+ldr(2) => ldp" optimization, if the ldr(1) instruction is deleted, and some emitLocation pointed after it, we need to not only find that emitLocation, but we need to determine what the appropriate action is w.r.t. the optimization: update it to point before ldr(1), update it to point after the replaced ldp, or delete it entirely? There is a semantic question to be answered. If the locations represented today by emitLocation are instead interleaved in the IR as "labels", it is certainly easier to find them, but the same semantic questions need to be addressed.

AndyAyersMS · 2023-01-18T03:22:41Z

I was chatting with Bruce and think it would be helpful to keep possible longer-term ambitions in mind. Some examples:

a declarative way of implementing late peephole optimizations (along the lines of Implement machine dependent peephole optimizer at lower time #8035)
revisiting the interplay between alignment and branch tensioning
being able to reason about / optimize prolog and epilog code
fixing the issues with instruction size estimation (see Common place to calculate instruction size so we do not mispredict it #57368 for some suggestions)

These all involve (to lesser/greater degree) the ability to analyze and perhaps modify instructions very late.

The current implementation keeps a lot of implicit ambient state as it streams through codegen and emission, making it tricky to alter what's already been done, or anticipate what is coming next. Moving towards a more explicit representation seems like a good way to open the door to more flexible working arrangements.

I also am not sure how much emphasis should be placed on saving memory. Presumably all the jit-allocated memory gets reclaimed/recycled so any additional footprint is transient. Back in the days of full framework and especially 32 bits it could be that private bytes / address space was at a premium. It might make sense to opt for a simpler but less memory efficient design, if it leads to easier to reason about data or faster overall processing.

BruceForstall added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 13, 2023

ghost added the untriaged New issue has not been triaged by the area owner label Jan 13, 2023

BruceForstall added this to the Future milestone Jan 13, 2023

BruceForstall removed the untriaged New issue has not been triaged by the area owner label Jan 13, 2023

BruceForstall mentioned this issue Jan 13, 2023

Replace successive "ldr" and "str" instructions with "ldp" and "stp" #77540

Merged

BruceForstall mentioned this issue Jan 19, 2023

Introduce insGroup/instrDesc backwards navigation #80840

Merged

JulieLeeMSFT added the Status:Proposed label Jan 23, 2023

tannergooding mentioned this issue Feb 16, 2023

Runtime<->JIT reloc hinting mechanism needs work #60712

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: evolve RyuJIT emitter to improve flexibility #80589

Proposal: evolve RyuJIT emitter to improve flexibility #80589

BruceForstall commented Jan 13, 2023 •

edited

Loading

ghost commented Jan 13, 2023

Current status

Proposal: Remove `emitLocation`

Proposal: Support backwards navigation

Alternatives

BruceForstall commented Jan 13, 2023

markples commented Jan 13, 2023

BruceForstall commented Jan 14, 2023

AndyAyersMS commented Jan 18, 2023

Proposal: evolve RyuJIT emitter to improve flexibility #80589

Proposal: evolve RyuJIT emitter to improve flexibility #80589

Comments

BruceForstall commented Jan 13, 2023 • edited Loading

Current status

Proposal: Remove emitLocation

Proposal: Support backwards navigation

Alternatives

ghost commented Jan 13, 2023

Current status

Proposal: Remove emitLocation

Proposal: Support backwards navigation

Alternatives

BruceForstall commented Jan 13, 2023

markples commented Jan 13, 2023

BruceForstall commented Jan 14, 2023

AndyAyersMS commented Jan 18, 2023

BruceForstall commented Jan 13, 2023 •

edited

Loading

Proposal: Remove `emitLocation`

Proposal: Remove `emitLocation`