Use AVX2 in codegen for GT_STORE_BLK #55604
Conversation
…, resolves failing tests.
… of allocating a GPR
…avx2-gt-store-blk
Related: #33665

src/coreclr/jit/codegenxarch.cpp (Outdated)

```cpp
    regSize /= 2;
}
// Copy the remainder by moving the last regSize bytes of the buffer
unsigned remainder = regSize - size;
```
Is it worth an assert that `(remainder <= regSize)`?
Sure, that makes sense. I'll add that.
I think the change to support it is kind of small; you can use the same register, just:
```diff
- unsigned remainder = regSize - size;
+ if ((regSize > XMM_REGSIZE_BYTES) && (size <= XMM_REGSIZE_BYTES))
+ {
+     regSize = XMM_REGSIZE_BYTES;
+ }
+ unsigned remainder = regSize - size;
```
Nit: `remainder` looks confusing to me, because I was expecting `size` to be called the remainder, but I am not sure what would be a better name.
`size` really means `remainingSize` if I understand correctly, so I would at least rename `remainder`, because the real remainder is `size`.
```diff
- unsigned remainder = regSize - size;
+ unsigned shiftBack = regSize - size; // so we could use another full move [totalSize - regSize; totalSize).
```
…o depend on AVX, remove vzeroupper inserts.
@tannergooding I've finished updating the PR with your suggestions, please let me know if there is anything else I can change.
The changes look good/correct to me. CC @dotnet/jit-contrib, community PR from our friends at AMD.
Hi @alexcovington, thanks for the PR. The changes look good to me, with one question:
Could you please check if it fixes #33617?
src/coreclr/jit/codegenxarch.cpp (Outdated)

```cpp
unsigned remainder = regSize - size;
assert(remainder <= regSize);

srcOffset -= remainder;
```
Is the idea that after it, `srcOffset + regSize == totalSize`?

If we have blockSize = 48, would it be beneficial to copy as [0, 32) YMM, [32, 48) XMM, not as [0, 32) YMM, [16, 48) YMM?
> Could you please check if it fixes #33617?
Sure, I'll take a look and add a separate comment after I check.
> is the idea that after it `srcOffset + regSize == totalSize`?
Yes, that's the idea. If there are any remaining bytes leftover after the main copy loop, back up the offset enough so that we can do a single SIMD move with whatever register was allocated to finish moving the remaining bytes.
> would it be beneficial to copy as [0, 32) YMM [32, 48) XMM, not as [0, 32) YMM [16, 48) YMM?
I don't expect this to have a perf difference, although I'd have to benchmark it to be sure. It seemed easier/cleaner to just reuse the register and regSize that was setup pre-copy. Let me know if you think it's better to break up cases like this instead.
@sandreenko Unfortunately, it looks like #33617 is not fixed by this change. I'm seeing what you described: the IL looks the same for both of the Vector inits:
But the generated code still uses XMM registers to complete the assignment to the field:
Here's the full dump: Full dump
I see. It is because you are changing only one path; see `src/coreclr/jit/codegenxarch.cpp` line 2707 (commit b412aac). Could I ask you to do the same change there if you have time? If not, I am ok with merging this; it is still a nice improvement.
@sandreenko Sure, I'll take a look.
Yeah, the only difference on modern CPUs should be if the read/write crosses a cache-line or page boundary.
It occurs to me that generating a zero into a register means the "const zero" node should not be marked contained, so codegen puts it into a register. So maybe it shouldn't be contained, but should be marked as a type that codegen would use to create an xmm or ymm zero. It's a bit tricky here because if we actually needed a zero in the GPR and xmm/ymm, we only have a single "const zero" node.
… remainder instead of step-down
…ft-back for GPR in init instead of step-down register size
@BruceForstall I've updated the PR with your suggested changes. Now the remainder is handled with a shift-back instead of stepping down the register size.

Now for the test case:

```csharp
[StructLayout(LayoutKind.Sequential, Size = 33)]
private struct Block33 { }

private static Block33 a, b;

static void ReplaceBlock33()
{
    a = new();
    b = a;
}
```

The following code is produced:
Updated ASM diffs

aspnet.run.windows.x64.checked.mch:
Detail diffs
benchmarks.run.windows.x64.checked.mch:
Detail diffs
coreclr_tests.pmi.windows.x64.checked.mch:
Detail diffs
libraries.crossgen2.windows.x64.checked.mch:
Detail diffs
libraries.pmi.windows.x64.checked.mch:
Detail diffs
Please let me know if there are any other changes I can make. Thanks!
Not sure why x86 is failing, currently investigating.
…avx2-gt-store-blk
…avx2-gt-store-blk
…esolve test failures on x86
Resolved x86 failures. Here is the updated ASM diff:

ASM Diffs

benchmarks.run.windows.x64.checked.mch:
Detail diffs
coreclr_tests.pmi.windows.x64.checked.mch:
Detail diffs
libraries.crossgen2.windows.x64.checked.mch:
Detail diffs
libraries.pmi.windows.x64.checked.mch:
Detail diffs
…avx2-gt-store-blk
LGTM. Thanks for the contribution!
Improvements on alpine 3.12: dotnet/perf-autofiling-issues#1506
It looks like a couple of the linked issues are tracked as regressions. We should investigate further and determine why.
Modifies the JIT to emit AVX2 instructions when generating code for GT_STORE_BLK nodes, when AVX2 is available and the compiler allows it.
Test case I have been using:
Test Source
Results
Please let me know if I can add any other information. Thanks!