
Ensure that TryGetContainableHWIntrinsicOp is non-mutating #79363

Merged

Conversation

@tannergooding (Member)

This removes a possible mutating side-effect from TryGetContainableHWIntrinsicOp by allowing CreateScalarUnsafe to be directly contained.

This helps ensure accidental side effects don't happen and improves codegen for cases like the following:

Unsafe.SkipInit(out this);

Vector128<float> x = Vector128.Create(m11, m12, m13, m14);
x.StoreUnsafe(ref M11);

Vector128<float> y = Vector128.Create(m21, m22, m23, m24);
y.StoreUnsafe(ref M21);

Vector128<float> z = Vector128.Create(m31, m32, m33, m34);
z.StoreUnsafe(ref M31);

Vector128<float> w = Vector128.Create(m41, m42, m43, m44);
w.StoreUnsafe(ref M41);

Before we generated:

3809                 cmp      byte  ptr [rcx], cl
C4E37121C210         vinsertps xmm0, xmm1, xmm2, 16
C4E37921C320         vinsertps xmm0, xmm0, xmm3, 32
C5FA104C2428         vmovss   xmm1, dword ptr [rsp+28H]
C4E37921C130         vinsertps xmm0, xmm0, xmm1, 48
C5F81101             vmovups  xmmword ptr [rcx], xmm0
C5FA10442430         vmovss   xmm0, dword ptr [rsp+30H]
C5FA104C2438         vmovss   xmm1, dword ptr [rsp+38H]
C4E37921C110         vinsertps xmm0, xmm0, xmm1, 16
C5FA104C2440         vmovss   xmm1, dword ptr [rsp+40H]
C4E37921C120         vinsertps xmm0, xmm0, xmm1, 32
C5FA104C2448         vmovss   xmm1, dword ptr [rsp+48H]
C4E37921C130         vinsertps xmm0, xmm0, xmm1, 48
C5F8114110           vmovups  xmmword ptr [rcx+10H], xmm0
C5FA10442450         vmovss   xmm0, dword ptr [rsp+50H]
C5FA104C2458         vmovss   xmm1, dword ptr [rsp+58H]
C4E37921C110         vinsertps xmm0, xmm0, xmm1, 16
C5FA104C2460         vmovss   xmm1, dword ptr [rsp+60H]
C4E37921C120         vinsertps xmm0, xmm0, xmm1, 32
C5FA104C2468         vmovss   xmm1, dword ptr [rsp+68H]
C4E37921C130         vinsertps xmm0, xmm0, xmm1, 48
C5F8114120           vmovups  xmmword ptr [rcx+20H], xmm0
C5FA10442470         vmovss   xmm0, dword ptr [rsp+70H]
C5FA104C2478         vmovss   xmm1, dword ptr [rsp+78H]
C4E37921C110         vinsertps xmm0, xmm0, xmm1, 16
C5FA108C2480000000   vmovss   xmm1, dword ptr [rsp+80H]
C4E37921C120         vinsertps xmm0, xmm0, xmm1, 32
C5FA108C2488000000   vmovss   xmm1, dword ptr [rsp+88H]
C4E37921C130         vinsertps xmm0, xmm0, xmm1, 48
C5F8114130           vmovups  xmmword ptr [rcx+30H], xmm0

Now we generate:

3809                 cmp      byte  ptr [rcx], cl
C4E37121C210         vinsertps xmm0, xmm1, xmm2, 16
C4E37921C320         vinsertps xmm0, xmm0, xmm3, 32
C4E3792144242830     vinsertps xmm0, xmm0, dword ptr [rsp+28H], 48
C5F81101             vmovups  xmmword ptr [rcx], xmm0
C5FA10442430         vmovss   xmm0, dword ptr [rsp+30H]
C4E3792144243810     vinsertps xmm0, xmm0, dword ptr [rsp+38H], 16
C4E3792144244020     vinsertps xmm0, xmm0, dword ptr [rsp+40H], 32
C4E3792144244830     vinsertps xmm0, xmm0, dword ptr [rsp+48H], 48
C5F8114110           vmovups  xmmword ptr [rcx+10H], xmm0
C5FA10442450         vmovss   xmm0, dword ptr [rsp+50H]
C4E3792144245810     vinsertps xmm0, xmm0, dword ptr [rsp+58H], 16
C4E3792144246020     vinsertps xmm0, xmm0, dword ptr [rsp+60H], 32
C4E3792144246830     vinsertps xmm0, xmm0, dword ptr [rsp+68H], 48
C5F8114120           vmovups  xmmword ptr [rcx+20H], xmm0
C5FA10442470         vmovss   xmm0, dword ptr [rsp+70H]
C4E3792144247810     vinsertps xmm0, xmm0, dword ptr [rsp+78H], 16
C4E3792184248000000020 vinsertps xmm0, xmm0, dword ptr [rsp+80H], 32
C4E3792184248800000030 vinsertps xmm0, xmm0, dword ptr [rsp+88H], 48
C5F8114130           vmovups  xmmword ptr [rcx+30H], xmm0

dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Dec 7, 2022
ghost assigned tannergooding on Dec 7, 2022
@ghost commented on Dec 7, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@tannergooding (Member, Author)

Ideally we'd be able to improve this even more and recognize n sequential inserts that all load from sequential addresses, so that we could replace them with a single LoadVector128.

However, I've been told this is complicated since we apparently do not surface whether a given parameter is passed by register or via a shadow copy, and so we cannot trivially see that these loads come from sequential addresses.
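For illustration, here is a minimal hand-written sketch of what that single-load shape looks like at the source level (the `Row` struct and `Load` helper are made up for this example; `Vector128.LoadUnsafe` is the existing API, and the pattern is only valid because the fields are contiguous):

```csharp
using System.Runtime.Intrinsics;

struct Row
{
    public float M11, M12, M13, M14;

    // One 128-bit load instead of four scalar loads plus three inserts.
    // This is only legal because M11..M14 are laid out sequentially in memory,
    // which is exactly the fact the JIT cannot currently see for parameters
    // passed via a shadow copy.
    public static Vector128<float> Load(ref Row row)
        => Vector128.LoadUnsafe(ref row.M11);
}
```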

tannergooding marked this pull request as ready for review on December 8, 2022
@tannergooding (Member, Author)

CC. @dotnet/jit-contrib

@BruceForstall (Member)

Diffs

Why are there some regressions?

@tannergooding (Member, Author)

@BruceForstall, this looks like possibly a bug in the diff reporting? If I had to guess, it's an issue with a peephole optimization not correctly updating the computed instruction size.

If we look at 173746.dasm from asm.coreclr_tests.run.Linux.x64.checked:

We'll see G_M25977_IG01 has the following diff even though there were zero code changes.

-						;; size=91 bbWeight=1    PerfScore 18.08
+						;; size=97 bbWeight=1    PerfScore 18.08

For IG02, it drops 5 movaps instructions (-4 bytes each, so -20 bytes total) and gains a jg instruction (+6 bytes) since all instructions from the BB now fit into one IG. However, the diff still reports a +18 byte change in size, even though it should be -14.

Later instruction groups largely just change the IG block number they jump to and occasionally insert new alignment bytes, but even when the diff is a single alignment byte, they often report significantly larger size increases.

@tannergooding (Member, Author)

@BruceForstall, @kunalspathak: after looking at the raw disassembly (which includes the codegen bytes), this is actually the alignment code pessimizing SIMD codegen.

Basically, if we decide to emit alignment blocks at all, then SIMD codegen gets forced to always emit the 3-byte encoding rather than the more optimal 2-byte encoding: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/emitxarch.cpp#L1560-L1569

This can mean that otherwise profitable changes appear to no longer be profitable in any method that is SIMD- or floating-point-heavy.

@kunalspathak (Member)

> This can mean that otherwise profitable changes appear to no longer be profitable in any method that is SIMD- or floating-point-heavy.

That is unfortunately true. We could potentially have a heuristic to not align blocks beyond a point where we have seen XXX SIMD instructions?

@tannergooding (Member, Author)

> That is unfortunately true. We could potentially have a heuristic to not align blocks beyond a point where we have seen XXX SIMD instructions?

I'm going to see how easy it is to "fix" the size estimation to not statically return 3.

The biggest "missing" piece should really be the checks on which registers are used.

@tannergooding (Member, Author)

I don't think it's worth blocking this PR on that change, however.

This change is a net positive for the number of instructions emitted, with the "regressions" being an external factor caused by loop alignment.

@tannergooding (Member, Author)

No changes, just rerunning CI and getting updated diffs now that #79478 is merged. I expect the SPMI diffs to look even better. I plan on merging after those come back, assuming nothing unexpected crops up.

@tannergooding (Member, Author)

Indeed, better diffs.

But they show a couple of small regressions on vfma***ps that need to be looked into and resolved. These look to have existed in the original as well, but were harder to spot due to all the alignment diffs.

@tannergooding (Member, Author)

@BruceForstall, resolved all the regressions with the last two commits. Since they were a little more complex, it might be worth a second review/sign-off:

The first handles the FMA case by ensuring we recheck containment after removing the NEG node. Basically, we had CreateScalarUnsafe(-value), which blocked containment. When we have such a case, we decide to change the instruction being emitted to something like FusedMultiplySubtract and remove the NEG. We then just need to recheck containment on CreateScalarUnsafe(value).

The second was a little more complex and involves vpbroadcast* being a special instruction where the scalar integer operand can come from memory or from a SIMD register (whereas typically it comes from a general-purpose register). This means that for byte we effectively have vpbroadcastb xmm, xmm/m8. To resolve this one, I opted to always remove CreateScalarUnsafe and then specially handle the case where it is regOptional but not spilled in codegen.
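For reference, a minimal sketch of the user-visible shapes that exercise these two paths (the class and method names here are made up for illustration; the intrinsic APIs themselves are the real ones):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class ContainmentShapes
{
    // FMA case: MultiplyAddScalar(a, b, CreateScalarUnsafe(-c)) can be retargeted
    // by the JIT to a fused multiply-subtract with the NEG removed, after which
    // CreateScalarUnsafe(c) needs its containment re-checked.
    static Vector128<float> FmaShape(Vector128<float> a, Vector128<float> b, float c)
        => Fma.MultiplyAddScalar(a, b, Vector128.CreateScalarUnsafe(-c));

    // vpbroadcastb case: the scalar can come from memory or from a SIMD register
    // (vpbroadcastb xmm, xmm/m8), which is why CreateScalarUnsafe is always removed
    // and the regOptional-but-not-spilled case is handled specially in codegen.
    static Vector128<byte> BroadcastShape(byte value)
        => Avx2.BroadcastScalarToVector128(Vector128.CreateScalarUnsafe(value));
}
```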

@BruceForstall (Member) left a comment


LGTM

tannergooding merged commit c6045ad into dotnet:main on Dec 16, 2022
ghost locked as resolved and limited conversation to collaborators on Jan 16, 2023