
Fix emitGetVexPrefixSize to support detecting the 2-byte prefix #79478

Merged (15 commits) on Dec 14, 2022

Conversation

tannergooding
Member

As raised on #79363 and in various past PRs/issues, we did not correctly estimate the size of the VEX prefix. This had negative side effects, such as allocating more memory than necessary, and once loop alignment support was added, it meant that we could no longer use the 2-byte prefix when also aligning loops.

This resolves that by updating emitGetVexPrefixSize to check the relevant instrDesc inputs to determine if the 2-byte or 3-byte prefix will be used.

As a side effect, this also removes some code that was dead and does some other minor cleanup to improve the general handling of the VEX prefix.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 9, 2022
@ghost ghost assigned tannergooding Dec 9, 2022
@ghost

ghost commented Dec 9, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@tannergooding
Member Author

CC. @kunalspathak, @BruceForstall

This passed the full HardwareIntrinsics_r and HardwareIntrinsics_ro tests locally for "default", TieredCompilation=0, and ReadyToRun=0.

We'll want to also run the JitStress tests once CI shows there aren't any missed edge cases.

//
unsigned emitter::emitOutputSimdPrefixIfNeeded(instruction ins, BYTE* dst, code_t& code)
emitter::code_t emitter::emitExtractEvexPrefix(instruction ins, code_t& code)
tannergooding (Member Author):

I initially extracted these methods as I thought the easiest thing was going to be to just pass the code_t through to emitGetVexPrefixSize and then do something like:

code_t vexPrefix = emitExtractVexPrefix(ins, code);
assert(vexPrefix != 0);

if ((vexPrefix & 0xFFFF7F80) == 0x00C46100)
{
    return 2;
}

return 3;

However, quite a bit more logic goes into correctly building up code, and since we don't cache it anywhere, recomputing it here was going to negatively impact throughput.

I ended up leaving the helper method here as it may still be useful in the future and it isolates a large chunk of complex logic that was just splatted inline before.

Comment on lines 1777 to 1781
unsigned emitter::emitGetEvexPrefixSize(instrDesc* id)
{
instruction ins = id->idIns();
assert(IsEvexEncodedInstruction(ins));
return 4;
tannergooding (Member Author):

We only called this from emitGetAdjustedSize and only under an existing IsEvexEncodedInstruction check, so I simplified it to just assert and return the constant.

// code -- The current opcode and any known prefixes
//
// Returns:
// Updated size.
//
unsigned emitter::emitGetAdjustedSizeEvexAware(instruction ins, emitAttr attr, code_t code)
tannergooding (Member Author):

When emitGetAdjustedSizeEvexAware was added, emitGetAdjustedSize became fully dead code.

It duplicated quite a bit of complex logic, and if we really need it again we can grab it from the git history, so I removed the dead code and renamed the EvexAware method back to the original name.

// Returns:
// Prefix size in bytes.
//
unsigned emitter::emitGetVexPrefixSize(instrDesc* id)
tannergooding (Member Author):

This function had to be moved "down" so it could access hasCodeMR.

Like I mentioned above, I was originally going to go a different route, but settled on the simpler approach here, where we switch on the insFmt and do a couple of minor checks rather than trying to build up the full code_t.

If we were caching the code_t somewhere so we didn't need to rebuild it 2-3 times, then another approach would be better. That is also a much more involved/complex change, but one that may be worthwhile long term.
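The shape of the format-based approach can be sketched as follows. This is a hypothetical, heavily simplified illustration, not the actual emitter code; the real implementation handles many more formats and checks:

```cpp
#include <cassert>

// Hypothetical sketch of the insFmt-based approach described above
// (names and formats are illustrative only). The idea: instead of
// rebuilding the full code_t, switch on the instruction format to
// determine which operand (if any) lands in the ModRM r/m field, and
// let only the extended-register checks decide between 2 and 3 bytes.
enum insFormat
{
    IF_RWR_RRD, // reg, reg: second operand is in r/m
    IF_RWR_MRD, // reg, mem
    IF_MWR_RRD  // mem, reg
};

unsigned vexPrefixSizeForFormat(insFormat fmt, bool rmOperandIsExtendedReg)
{
    switch (fmt)
    {
        case IF_RWR_RRD:
        case IF_RWR_MRD:
        case IF_MWR_RRD:
            // An extended register in r/m needs REX.B, which only the
            // 3-byte VEX form can express.
            return rmOperandIsExtendedReg ? 3 : 2;
        default:
            // Conservatively assume the 3-byte form.
            return 3;
    }
}
```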


if (EncodedBySSE38orSSE3A(ins))
{
// When the prefix is 0x0F38 or 0x0F3A, we must use the 3-byte encoding
tannergooding (Member Author):

This filters out a majority of the complex instructions, particularly those that take 3 inputs or do other "special things" with register representation.

Comment on lines +2455 to +2465
if ((regForSibBits != REG_NA) && IsExtendedReg(regForSibBits))
{
// When the REX.X bit is present, we must use the 3-byte encoding
return 3;
}

if ((regFor012Bits != REG_NA) && IsExtendedReg(regFor012Bits))
{
// When the REX.B bit is present, we must use the 3-byte encoding
return 3;
}
tannergooding (Member Author):

insEncodeReg345 uses the REX.R bit and is always available.
insEncodeReg3456 uses the vvvv field and is always available.

insEncodeReg012 uses the REX.B bit for extended registers and is only available in the 3-byte encoding.
insEncodeRegSIB uses the REX.X bit for extended registers and is only available in the 3-byte encoding.

Since SIB is only used for address encodings, we typically don't need to worry about it. Likewise, we normally only have to worry about the 012 case for scenarios where an operand can come from a register or memory.

For VEX encoded binary instructions, like vaddps, this is normally the second operand:

  • ins tgt, op1, op2/mem scenario.

However, there are also some unary instructions, like vmovd, where this can be the destination or first operand:

  • ins tgt/mem, op1
  • ins tgt, op1/mem
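The rules above can be sketched as a small standalone predicate. This is a hypothetical illustration (names invented, not the actual JIT helpers): the 2-byte VEX form (0xC5) can only express the implied 0F opcode map and the REX.R-equivalent bit, so anything beyond that forces the 3-byte form (0xC4):

```cpp
#include <cassert>

// Hypothetical sketch of the 2-byte vs. 3-byte VEX decision described
// above. The 3-byte form is required for an 0F38/0F3A opcode map
// (EncodedBySSE38orSSE3A in the jit), for VEX.W, or for an extended
// register that would need REX.X (SIB index) or REX.B (base/rm).
unsigned vexPrefixSize(bool uses0F38or0F3AMap, bool needsVexW, bool needsRexX, bool needsRexB)
{
    if (uses0F38or0F3AMap || needsVexW || needsRexX || needsRexB)
    {
        return 3; // 3-byte prefix: 0xC4, R.X.B.mmmmm, W.vvvv.L.pp
    }
    return 2; // 2-byte prefix: 0xC5, R.vvvv.L.pp
}
```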

break;
}

case IF_RRW_RRW_CNS:
tannergooding (Member Author) commented Dec 10, 2022:

I wanted to call out this format in particular.

It seems like we have some cases where the IF_* defined doesn't quite "make sense". This one, for example, should probably be IF_RWR_RRD_CNS.

It is currently used by emitIns_R_R_I and applies to instructions like:

  • pextrb
  • pextrd
  • pextrq
  • pextrw_sse41
  • extractps
  • vextractf128
  • vextracti128
  • shld
  • shrd
  • psrldq
  • pslldq

There are other formats I saw as well that don't exactly match the semantics of the instruction that's using them.

Member:

We should definitely identify and fix cases where the IF_ formats are incorrectly used; they are complicated enough as-is, without some being wrong.

@tannergooding
Member Author

Diffs are hugely positive, with similar diffs on Linux x64

Overall (-22,919 bytes)
| Collection | Base size (bytes) | Diff size (bytes) |
| --- | ---: | ---: |
| benchmarks.run.windows.x64.checked.mch | 25,002,891 | -1,718 |
| coreclr_tests.run.windows.x64.checked.mch | 362,743,027 | -14,183 |
| libraries.pmi.windows.x64.checked.mch | 52,033,070 | -4,440 |
| libraries_tests.pmi.windows.x64.checked.mch | 114,282,976 | -2,578 |

MinOpts (-4,778 bytes)
| Collection | Base size (bytes) | Diff size (bytes) |
| --- | ---: | ---: |
| benchmarks.run.windows.x64.checked.mch | 1,717,456 | +0 |
| coreclr_tests.run.windows.x64.checked.mch | 266,521,814 | -4,774 |
| libraries.pmi.windows.x64.checked.mch | 1,500,480 | +0 |
| libraries_tests.pmi.windows.x64.checked.mch | 6,882,261 | -4 |

FullOpts (-18,141 bytes)
| Collection | Base size (bytes) | Diff size (bytes) |
| --- | ---: | ---: |
| benchmarks.run.windows.x64.checked.mch | 23,285,435 | -1,718 |
| coreclr_tests.run.windows.x64.checked.mch | 96,221,213 | -9,409 |
| libraries.pmi.windows.x64.checked.mch | 50,532,590 | -4,440 |
| libraries_tests.pmi.windows.x64.checked.mch | 107,400,715 | -2,574 |


There is a throughput regression, which is to be expected since we now have to do more checks/computation:

Overall (+0.03%)
| Collection | PDIFF |
| --- | ---: |
| benchmarks.run.windows.x64.checked.mch | +0.05% |
| coreclr_tests.run.windows.x64.checked.mch | -0.02% |
| libraries.crossgen2.windows.x64.checked.mch | +0.12% |
| libraries.pmi.windows.x64.checked.mch | +0.08% |
| libraries_tests.pmi.windows.x64.checked.mch | +0.07% |

MinOpts (-0.05%)
| Collection | PDIFF |
| --- | ---: |
| benchmarks.run.windows.x64.checked.mch | +0.37% |
| coreclr_tests.run.windows.x64.checked.mch | -0.05% |
| libraries.crossgen2.windows.x64.checked.mch | +0.42% |
| libraries.pmi.windows.x64.checked.mch | +0.16% |
| libraries_tests.pmi.windows.x64.checked.mch | +0.20% |

FullOpts (+0.05%)
| Collection | PDIFF |
| --- | ---: |
| benchmarks.run.windows.x64.checked.mch | +0.04% |
| coreclr_tests.run.windows.x64.checked.mch | +0.01% |
| libraries.crossgen2.windows.x64.checked.mch | +0.12% |
| libraries.pmi.windows.x64.checked.mch | +0.08% |
| libraries_tests.pmi.windows.x64.checked.mch | +0.07% |

@tannergooding
Member Author

Could probably always estimate the 3-byte encoding in min-opts to save time and reduce the min-opts impact.

@tannergooding tannergooding marked this pull request as ready for review December 10, 2022 05:24
// prefix if optimizations are enabled or we know we won't negatively impact the
// estimated alignment sizes.

if (emitComp->opts.OptimizationEnabled() || (emitCurIG->igNum > emitLastAlignedIgNum))
tannergooding (Member Author):

Talked with @kunalspathak and this is currently needed as we still try to do alignment when optimizations are disabled.

We may want to revisit that since OSR + TC should handle all the important cases and aligning debug code likely isn't worth the cycles required.

@tannergooding
Member Author

-- Don't try to estimate the 2-byte VEX prefix when optimizations are disabled

This commit actually regressed throughput even more, which was unexpected. For example, MinOpts throughput changed from the above to

MinOpts (+0.25%)
| Collection | PDIFF |
| --- | ---: |
| benchmarks.run.windows.x64.checked.mch | +0.41% |
| coreclr_tests.run.windows.x64.checked.mch | +0.25% |
| libraries.crossgen2.windows.x64.checked.mch | +0.42% |
| libraries.pmi.windows.x64.checked.mch | +0.32% |
| libraries_tests.pmi.windows.x64.checked.mch | +0.23% |

I've pushed a new commit that tries just MinOpts rather than OptimizationsDisabled to see if that helps at all. I'd guess, but haven't actually investigated yet, that the emitComp->opts.OptimizationsDisabled() call wasn't being inlined...

@tannergooding
Member Author

tannergooding commented Dec 10, 2022

Turns out the OptimizationsDisabled call prevented MSVC from reordering the emitAttr size = id->idOpSize(); load.

I manually reordered it and the codegen is a lot better. That being said, idOpSize is unnecessarily expensive: it accesses a lookup table to compute what is effectively 1 << _idOpSize. Going to submit a separate PR to fix that -- see #79493
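The idOpSize observation can be shown with a toy comparison (hypothetical code, assuming the encoded field stores log2 of the operand size, which are powers of two): the table lookup and the shift compute the same value, but the shift needs no memory access.

```cpp
#include <cassert>
#include <cstdint>

// Toy illustration of the idOpSize point above (not the actual JIT
// code). Operand sizes are powers of two, so if the encoded field is
// log2 of the size, a shift replaces the lookup-table load.
static const uint32_t opSizeTable[] = {1, 2, 4, 8, 16, 32, 64};

uint32_t opSizeViaTable(uint32_t encoded)
{
    return opSizeTable[encoded]; // memory load from a static table
}

uint32_t opSizeViaShift(uint32_t encoded)
{
    return 1u << encoded; // single shift, no memory access
}
```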

@tannergooding
Member Author

Better, but this still regressed throughput for minopts, more so than for full-opts, which doesn't really make sense since the check should mean we're doing "less work".

I'd guess it's negatively interacting with something else, like the alignment support, and so the 4k savings we get makes up for the difference in time.

@tannergooding
Member Author

/azp run runtime-coreclr jitstress, Fuzzlyn

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@tannergooding
Copy link
Member Author

Fuzzlyn failures are happening in morph and are unrelated. They are:

JIT assert failed:
Assertion failed '((tree->gtDebugFlags & GTF_DEBUG_NODE_MORPHED) == 0) && "ERROR: Already morphed this node!"' in 'S0:M7():short:this' during 'Morph - Global' (IL size 1578; hash 0xb09b63d3; FullOpts)

    File: /__w/1/s/src/coreclr/jit/morph.cpp Line: 12875

@tannergooding
Member Author

CC. @dotnet/jit-contrib this is ready for review.

Good size savings in known hot code at the cost of a small TP regression. Resolving the TP regression would require some non-trivial work/refactorings in the emitter.

@jakobbotsch
Member

Have you looked at a detailed throughput trace (e.g. using @SingleAccretion's tool)? +0.2% to +0.4% in MinOpts is quite a bit (e.g. it is more than we spent in FullOpts on tail merging recently, which was a rather large optimization).

@tannergooding
Member Author

Have you looked at a detailed throughput trace (e.g. using @SingleAccretion's tool)?

What is the tool and where is the documentation for running it, etc?

@SingleAccretion
Contributor

What is the tool and where is the documentation for running it, etc?

The general tool is the pintool, building documented here: https://github.com/SingleAccretion/Dotnet-Runtime.Dev#dotnet-runtimedev.

The particular part which Jakob refers to is this script: https://github.com/SingleAccretion/Dotnet-Runtime.Dev#analyze-pin-trace-diffps1---diff-the-traces-produced-by-the-pin-tool, which compares two traces captured using PIN and prints statistics on which methods are most responsible for regressions / improvements.

@tannergooding
Member Author

tannergooding commented Dec 13, 2022

Numbers for 4d0c099 show the following (noting some methods were renamed and one method is new, so I tried to break it apart slightly):

Base: 99141678103, Diff: 99192173532, +0.0509%

?emitIns_R_I@emitter@@QEAAXW4instruction@@W4emitAttr@@W4_regNumber_enum@@_J@Z  : 17705226   : +17.85%  : 1.71%  : +0.0179%
memset                                                                         : 4455056    : +0.68%   : 0.43%  : +0.0045%
?TakesRexWPrefix@emitter@@SA_NW4instruction@@W4emitAttr@@@Z                    : 3581705    : +2.37%   : 0.35%  : +0.0036%
?emitIns_R_R@emitter@@QEAAXW4instruction@@W4emitAttr@@W4_regNumber_enum@@2@Z   : -1244214   : -4.76%   : 0.12%  : -0.0013%
?emitEndCodeGen@emitter@@QEAAIPEAVCompiler@@_N11IPEAI2PEAPEAX33@Z              : -1310818   : -0.58%   : 0.13%  : -0.0013%
?emitIns_Mov@emitter@@QEAAXW4instruction@@W4emitAttr@@W4_regNumber_enum@@2_N@Z : -1994274   : -0.78%   : 0.19%  : -0.0020%
?genAllocLclFrame@CodeGen@@IEAAXIW4_regNumber_enum@@PEA_NI@Z                   : -2293818   : -46.75%  : 0.22%  : -0.0023%
?emitInsSizeSVCalcDisp@emitter@@QEAAIPEAUinstrDesc@1@_KHH@Z                    : -10589308  : -30.60%  : 1.02%  : -0.0107%
?emitFindOffset@emitter@@IEAAIPEAUinsGroup@@I@Z                                : -51499536  : -23.77%  : 4.96%  : -0.0519%

?EncodedBySSE38orSSE3A@emitter@@QEBA_NW4instruction@@@Z                        : 14487091   : NA       : 1.40%  : +0.0146%
?EncodedBySSE38orSSE3A@emitter@@QEAA_NW4instruction@@@Z                        : -14487091  : -100.00% : 1.40%  : -0.0146%

?emitInsSize@emitter@@QEAAIPEAUinstrDesc@1@_K_N@Z                              : 56890150   : NA       : 5.48%  : +0.0574%
?emitInsSize@emitter@@QEAAI_K_N@Z                                              : -36429961  : -100.00% : 3.51%  : -0.0367%

?emitGetAdjustedSize@emitter@@QEBAIPEAUinstrDesc@1@_K@Z                        : 110127564  : NA       : 10.61% : +0.1111%
?emitGetAdjustedSizeEvexAware@emitter@@QEAAIW4instruction@@W4emitAttr@@_K@Z    : -66319632  : -100.00% : 6.39%  : -0.0669%

?emitInsSizeRR@emitter@@QEAAIPEAUinstrDesc@1@@Z                                : 103959389  : NA       : 10.02% : +0.1049%
?emitInsSizeRR@emitter@@QEAAIW4instruction@@W4_regNumber_enum@@1W4emitAttr@@@Z : -85630059  : -100.00% : 8.25%  : -0.0864%

?emitOutputRexOrSimdPrefixIfNeeded@emitter@@QEAAIW4instruction@@PEAEAEA_K@Z    : 225521710  : NA       : 21.73% : +0.2275%
?emitOutputSimdPrefixIfNeeded@emitter@@QEAAIW4instruction@@PEAEAEA_K@Z         : -220829033 : -100.00% : 21.28% : -0.2227%

?emitGetVexPrefixSize@emitter@@QEBAIPEAUinstrDesc@1@@Z                         : 7245402    : NA       : 0.70%  : +0.0073%

A slight refactoring (57d5725) changes it instead to be:

Base: 99141678103, Diff: 99179548037, +0.0382%

?emitIns_R_I@emitter@@QEAAXW4instruction@@W4emitAttr@@W4_regNumber_enum@@_J@Z  : 17705226   : +17.85%  : 1.73%  : +0.0179%
?TakesRexWPrefix@emitter@@SA_NW4instruction@@W4emitAttr@@@Z                    : 3581705    : +2.37%   : 0.35%  : +0.0036%
?emitIns_R_R@emitter@@QEAAXW4instruction@@W4emitAttr@@W4_regNumber_enum@@2@Z   : -1244214   : -4.76%   : 0.12%  : -0.0013%
?emitEndCodeGen@emitter@@QEAAIPEAVCompiler@@_N11IPEAI2PEAPEAX33@Z              : -1310818   : -0.58%   : 0.13%  : -0.0013%
?emitIns_Mov@emitter@@QEAAXW4instruction@@W4emitAttr@@W4_regNumber_enum@@2_N@Z : -1994274   : -0.78%   : 0.19%  : -0.0020%
?genAllocLclFrame@CodeGen@@IEAAXIW4_regNumber_enum@@PEA_NI@Z                   : -2293818   : -46.75%  : 0.22%  : -0.0023%
?emitInsSizeSVCalcDisp@emitter@@QEAAIPEAUinstrDesc@1@_KHH@Z                    : -10589308  : -30.60%  : 1.03%  : -0.0107%
?emitFindOffset@emitter@@IEAAIPEAUinsGroup@@I@Z                                : -51499536  : -23.77%  : 5.02%  : -0.0519%

?EncodedBySSE38orSSE3A@emitter@@QEBA_NW4instruction@@@Z                        : 14487091   : NA       : 1.41%  : +0.0146%
?EncodedBySSE38orSSE3A@emitter@@QEAA_NW4instruction@@@Z                        : -14487091  : -100.00% : 1.41%  : -0.0146%

?emitInsSize@emitter@@QEAAIPEAUinstrDesc@1@_K_N@Z                              : 56890150   : NA       : 5.55%  : +0.0574%
?emitInsSize@emitter@@QEAAI_K_N@Z                                              : -36429961  : -100.00% : 3.55%  : -0.0367%

?emitGetAdjustedSize@emitter@@QEBAIPEAUinstrDesc@1@_K@Z                        : 105018266  : NA       : 10.24% : +0.1059%
?emitGetAdjustedSizeEvexAware@emitter@@QEAAIW4instruction@@W4emitAttr@@_K@Z    : -66319632  : -100.00% : 6.47%  : -0.0669%

?emitInsSizeRR@emitter@@QEAAIPEAUinstrDesc@1@@Z                                : 101120612  : NA       : 9.86%  : +0.1020%
?emitInsSizeRR@emitter@@QEAAIW4instruction@@W4_regNumber_enum@@1W4emitAttr@@@Z : -85630059  : -100.00% : 8.35%  : -0.0864%

?emitOutputRexOrSimdPrefixIfNeeded@emitter@@QEAAIW4instruction@@PEAEAEA_K@Z    : 225521710  : NA       : 21.99% : +0.2275%
?emitOutputSimdPrefixIfNeeded@emitter@@QEAAIW4instruction@@PEAEAEA_K@Z         : -220829033 : -100.00% : 21.53% : -0.2227%

?emitGetVexPrefixSize@emitter@@QEBAIPEAUinstrDesc@1@@Z                         : 7245402    : NA       : 0.71%  : +0.0073%

It doesn't look to be profitable to skip the exact size estimation, even if the optimizationsDisabled checks are outlined/etc. This looks to be because of the indirect benefits in genAllocLclFrame, emitFindOffset, and others, where having smaller estimated code sizes results in less overall work.

Like I mentioned in Discord, the three biggest regressions are:

emitIns_R_I is more expensive because every instruction now has one more branch so that the id can be passed through to emitInsSize. This one could probably be refactored to avoid that branch with a goto. The function in general needs some cleanup, though, as it's doing work that is normally handled by the emitInsSize* helpers, and doing it in a non-optimal way.

emitIns_R_R is going through emitInsSizeRR(instrDesc*), which probably needs to be merged with emitInsSizeRR(instrDesc*, code_t) and which in general needs some cleanup (also pre-existing).

TakesRexWPrefix is doing a non-inlined jump-table lookup for what is just a static bit of data for a few instructions. It should probably be an instruction flag, which would be quite a bit more efficient.

All three of these are really unrelated to this change and are pre-existing issues. They show up as regressions because they aren't doing the same thing as all the other paths, and so the VEX-only changes appear where you wouldn't expect them. We should ideally work on cleaning these up so that the only real impact is in the new code paths.
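The TakesRexWPrefix suggestion amounts to replacing a per-instruction switch (compiled as a jump table) with a bit in a per-instruction flags word. A hypothetical sketch (names and layout invented for illustration; not the actual emitter design):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the flag-based approach suggested for
// TakesRexWPrefix. Instead of switching over the instruction, a
// per-instruction flags word is loaded and tested: one load plus a
// bit test, with no jump table.
enum InsFlagBits : uint32_t
{
    INS_FLAGS_TakesRexW = 1u << 0,
};

// One flags word per instruction id (illustrative data only).
static const uint32_t insFlags[] = {0, INS_FLAGS_TakesRexW, 0, INS_FLAGS_TakesRexW};

bool takesRexWPrefix(unsigned ins)
{
    return (insFlags[ins] & INS_FLAGS_TakesRexW) != 0;
}
```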

@BruceForstall
Member

[unrelated to PR]

The general tool is the pintool, building documented here: https://github.com/SingleAccretion/Dotnet-Runtime.Dev#dotnet-runtimedev.

@SingleAccretion Looks like an awesome set of scripts for working with .NET and the JIT. I'm sure everyone on the CodeGen team has their own similar set. It's too bad we don't share these kinds of scripts more broadly, e.g., in jitutils, where they could end up in jitutils/bin, which will (likely) be on our PATH.

@BruceForstall (Member) left a comment:

LGTM. Thanks for doing this.
