
Common place to calculate instruction size so we do not mispredict it #57368

Open · kunalspathak opened this issue Aug 13, 2021 · 5 comments

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
Milestone: Future

@kunalspathak (Member) commented Aug 13, 2021

During codegen, we calculate the size of each instruction to determine how much memory is needed to store the machine code. We then allocate that memory from the runtime and output the instructions into it. In some situations we mispredict (overestimate) the instruction size, and during output we end up emitting a smaller instruction. The root of the problem is that there are two separate code paths that estimate (emitIns* methods) and output (emitOut* methods) instructions. We should unify this code so we know the accurate instruction size, which would reduce memory wastage and speed up throughput by not going through separate estimate and output phases.

Relevant discussion in #12840, #57041, #12178, #8748 (comment)

category:design
theme:intrinsics
skill-level:expert
cost:large
impact:medium

@kunalspathak kunalspathak added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 13, 2021
@kunalspathak kunalspathak added this to the Future milestone Aug 13, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Aug 13, 2021

@BruceForstall (Member) commented:

Related: #50054

@kunalspathak (Member, Author) commented:

I am just pasting here the design proposal that I made in #50054 (comment).

Existing design

Currently there are 4 steps that occur in code generation; here is a rough outline of what happens:

  1. genGenerateMachineCode(): This phase iterates over all the GenTree* nodes (see genCodeForTreeNode()) and translates them into the machine-dependent data structure instrDesc. The mapping is not necessarily 1-to-1: some GenTree* nodes produce more than one instrDesc*, while others produce none. Just as GenTree* maps to instrDesc*, during this phase BasicBlock* maps to instrGroup* (instruction group), a collection of the instrDesc* we create. The instrDesc* are first created in a temporary buffer. Depending on the instruction, various fields are set in the instrDesc*, including registers, instruction, instruction format, etc. We also estimate the size of each instruction and save it in the instrDesc*. Once we complete processing all the instructions for a single instrGroup*, we save the raw data of all its instrDesc* in a memory region allocated in the ig->igData field (see emitSavIG()). Basically, we memcpy all the instrDesc* data (contiguous raw memory contents) created so far into ig->igData.

  2. At this point, since we have the instruction sizes of all the instrDesc*, we allocate the equivalent amount of memory from the runtime. This is the final code buffer where the instruction encodings will be stored for execution.

  3. genEmitMachineCode(): This phase iterates over each instrGroup* and then over each instrDesc* (present in ig->igData). It extracts all the metadata present in the instrDesc* (see emitIssue1Instr()) and, depending on the insFmt (instruction format), calls the appropriate emitOutput*() method to generate the raw encoding for that instruction. The raw encoding is saved in the code buffer that we requested in step 2 above. In this step, we also update the GC refs to reflect the true code addresses where GC refs go live/dead.

  4. Finally, we adjust any forward jump addresses to reflect the final code location.

Proposal

Notice that step 1 already has all the information needed to encode an instruction, and step 3 uses exactly this information to produce the encoding. The proposal is to combine steps 1 and 3 in the following manner:

A1. emitOutputInstr() contains a big switch-case over the instruction format stored in the instrDesc*. During step 1, since we already have all the information, we could just call the various emitOutput*() methods directly. The only challenge at that point is that we do not yet have the final code memory where the encoded instructions will be stored. We can get around this by having a growable memory region, say ig->finalCode, at the instrGroup* level (see C1 below). The encoding can then be accumulated in ig->finalCode during step 1 itself.

A2. After the above step, we would perform step 2 just as it happens currently, except that we would now have the accurate memory size from ig->finalCode to allocate from the runtime; let's call the allocation finalCodeBuffer.

A3. Traverse ig->finalCode of every instrGroup* and memcpy() it into the final code memory that we allocated in A2 above.

A4. In step A1, we will have to somehow track GC liveness by building a table (see C2 below) of which code addresses (in ig->finalCode memory) need updating. Later, once the code is moved to the final code memory, we would write a routine that efficiently maps each code address from ig->finalCode to finalCodeBuffer and searches/replaces all the addresses based on that mapping.

A5. Forward jump addresses will be fixed up the same way they are today.

Refactoring

B1. emitOutputInstr() should be segregated into methods based upon the instruction format they handle. For example, here is a piece of code from the emitOutputInstr() method in emitxarch.cpp:

        case IF_ARD_RRD:
        case IF_AWR_RRD:
        case IF_ARW_RRD:
            code    = insCodeMR(ins);
            code    = AddSimdPrefixIfNeeded(ins, code, size);
            regcode = (insEncodeReg345(ins, id->idReg1(), size, &code) << 8);
            dst     = emitOutputAM(dst, id, code | regcode);
            sz      = emitSizeOfInsDsc(id);
            break;
            

IMO, we should just move such code into a separate method with an appropriate name. In A1, we would then call the relevant method directly, since at that point we already know the insFmt and the instruction, the two things the above code checks.

B2. GC-update-related code should just be in a separate file, IMO; today it is mixed with the emitter code.

Data structures

C1. There are two ways finalCode could be stored. [1] A growable contiguous memory of X bytes: once X bytes are filled, we allocate 2X bytes, copy the contents of the current memory to the new memory, and continue writing there, growing to 4X, 8X, etc. This option has the downside of over-allocating memory after a point, plus the additional cost of a memcpy every time the buffer is grown. [2] A simple linked list of fixed-size chunks: when a chunk fills up, we just allocate the next chunk and continue writing instruction encodings into it. In step A3, we would just traverse the linked list and copy the contents into finalCodeBuffer.

C2. To track GC refs, the addr is currently used just to find the offset from the start. Once we start using ig->finalCode, we should be able to track the offset from the start directly and might not need any readjustment, as far as I understand the code. Maybe Bruce can correct me here.

Open Questions

We will still have to do branch tightening, and will have to see how we can compact the already-encoded jmp instructions in the finalCode or finalCodeBuffer region.

@kunalspathak (Member, Author) commented:

I gathered some data on how much misestimation we are doing today.

| Collection | # of methods | Misestimated bytes |
| --- | --- | --- |
| benchmarks-win-x64 | 128 | 561 |
| libraries-pmi-win-x64 | 181 | 637 |
| asp-win-x64 | 269 | 1193 |
| benchmarks-unix-x64 | 95 | 440 |
| libraries-pmi-unix-x64 | 302 | 1112 |

I will post the individual instruction distribution shortly.

@kunalspathak (Member, Author) commented Jan 12, 2023

Benchmarks windows/x64

| Instruction | Number of occurrences | Total bytes |
| --- | --- | --- |
| ja | 4 | 16 |
| jae | 14 | 56 |
| jb | 1 | 4 |
| jbe | 4 | 16 |
| je | 63 | 252 |
| jg | 7 | 28 |
| jge | 11 | 44 |
| jl | 32 | 128 |
| jle | 28 | 112 |
| jmp | 21 | 63 |
| jne | 30 | 120 |
| jo | 1 | 4 |
| mov | 3 | 9 |
| movsxd | 22 | 22 |
| Grand Total | 241 | 874 |

Libraries.pmi windows/x64

| Instruction | Number of occurrences | Total bytes |
| --- | --- | --- |
| crc32 | 4 | 4 |
| imul | 1 | 1 |
| ja | 14 | 56 |
| jae | 21 | 84 |
| jb | 3 | 12 |
| jbe | 4 | 16 |
| je | 55 | 220 |
| jg | 7 | 28 |
| jge | 3 | 12 |
| jl | 26 | 104 |
| jle | 23 | 92 |
| jmp | 33 | 99 |
| jne | 32 | 128 |
| jo | 5 | 20 |
| js | 2 | 8 |
| mov | 4 | 12 |
| movsxd | 53 | 53 |
| vmovnti | 4 | 4 |
| vprefetchnta | 1 | 1 |
| vprefetcht0 | 1 | 1 |
| vprefetcht1 | 1 | 1 |
| vprefetcht2 | 1 | 1 |
| Grand Total | 298 | 957 |

coreclr_run windows/x64

| Instruction | Number of occurrences | Total bytes |
| --- | --- | --- |
| crc32 | 43 | 43 |
| idiv | 2 | 6 |
| ja | 1 | 4 |
| jae | 15 | 60 |
| jb | 6 | 24 |
| jbe | 10 | 40 |
| je | 72 | 288 |
| jg | 9 | 36 |
| jge | 10 | 40 |
| jl | 71 | 284 |
| jle | 44 | 176 |
| jmp | 24 | 72 |
| jne | 186 | 744 |
| jp | 17 | 68 |
| js | 3 | 12 |
| mov | 24 | 72 |
| movsxd | 105 | 105 |
| vmovnti | 20 | 20 |
| vmovsd | 10 | 30 |
| vmovupd | 42 | 126 |
| vmulsd | 4 | 12 |
| vprefetchnta | 3 | 3 |
| vprefetcht0 | 3 | 3 |
| vprefetcht1 | 3 | 3 |
| vprefetcht2 | 3 | 3 |
| Grand Total | 730 | 2274 |

crossgen windows/x64

| Instruction | Number of occurrences | Total bytes |
| --- | --- | --- |
| crc32 | 4 | 4 |
| imul | 1 | 1 |
| mov | 138 | 414 |
| movsxd | 70 | 70 |
| Grand Total | 213 | 489 |

Note: The difference between the grand totals in the above tables and the one in #57368 (comment) is because we add NOP compensation bytes, which reduce the gap between the actual code size and the memory allocated.
