
Improve the codegen of the vector accelerated System.Numerics.* types #81335

Merged
merged 32 commits into dotnet:main from numerics-rewrite
Feb 2, 2023

Conversation

tannergooding
Member

This improves the codegen for more scenarios involving Vector2/3/4, Quaternion, and Plane.

Note about Quaternion and Plane

In an ideal world, particularly for Quaternion and Plane, a large portion of the JIT side changes of this PR would be unnecessary. However, we don't live in that world and these types were all exposed with public fields which limits the ability to change their internal implementation details.
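To make the constraint concrete, here is a purely illustrative sketch (both type names are hypothetical; Quaternion's real public surface is the four float fields):

using System.Numerics;

// The public surface Quaternion must keep exposing: four public float fields.
public struct QuaternionAsShipped
{
    public float X, Y, Z, W;
}

// The "ideal" internal shape, which would let most operations defer to
// Vector4, but which cannot be adopted without breaking the public fields.
public struct QuaternionIdealSketch
{
    private Vector4 _value;

    public static QuaternionIdealSketch operator +(QuaternionIdealSketch left, QuaternionIdealSketch right)
        => new QuaternionIdealSketch { _value = left._value + right._value };
}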

What this means is that even though Quaternion should be wrapping a single Vector4 field and deferring most of its operations to Vector4, it can't. Unlike Matrix4x4, which is sufficiently large that the limitation could be worked around by using Unsafe.As to convert the byref, avoid the copy, and still end up getting promoted, the same is not trivially true for Quaternion or Plane, which are enregisterable.
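As a minimal sketch of the Unsafe.As pattern described above (illustrative only, not the actual runtime source; it assumes only that M11..M14 are the first 16 bytes of Matrix4x4's sequential float layout, and MatrixSketch is a hypothetical name):

using System.Numerics;
using System.Runtime.CompilerServices;

internal static class MatrixSketch
{
    // Reinterpret the incoming byref so the first row can be read as a Vector4
    // without copying the 64-byte struct. This is viable for a large,
    // memory-backed type like Matrix4x4, but is not a free transformation for
    // enregisterable types like Quaternion or Plane.
    public static Vector4 GetRow0(ref Matrix4x4 value)
        => Unsafe.As<Matrix4x4, Vector4>(ref value);
}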

So what ends up happening is that we explicitly track Quaternion and Plane as TYP_SIMD16, much as we do for Vector4. This allows us to light up all the same methods with all the same intrinsic handling.

I considered limiting things to a couple of "bitcast" intrinsics, AsVector4() and AsQuaternion()/AsPlane(), as this would allow implementing operator + as (value1.AsVector4() + value2.AsVector4()).AsQuaternion(). However, given that we already have the intrinsic handling for operator + on Vector4 directly, and that this avoids needing to handle 3 intrinsics + inlining + eliding the copies, I opted to do it this way instead.
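For reference, a sketch of the rejected "bitcast intrinsic" shape (AsVector4/AsQuaternion were only considered, not added; the stand-ins below use Unsafe.As purely for illustration, and BitcastSketch is a hypothetical name):

using System.Numerics;
using System.Runtime.CompilerServices;

internal static class BitcastSketch
{
    // Stand-ins for the hypothetical bitcast intrinsics.
    public static Vector4 AsVector4(Quaternion value) => Unsafe.As<Quaternion, Vector4>(ref value);
    public static Quaternion AsQuaternion(Vector4 value) => Unsafe.As<Vector4, Quaternion>(ref value);

    // How operator + could have been expressed under that approach.
    public static Quaternion Add(Quaternion value1, Quaternion value2)
        => AsQuaternion(AsVector4(value1) + AsVector4(value2));
}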

Note about other changes

The other APIs are split between intrinsic and managed handling.

For the intrinsic handling, I limited it to APIs which already had intrinsic handling (e.g. Add and operator + can both be trivially done) and otherwise to APIs which are functionally "primitive" operations for these types such as Length. This allows us to avoid inlining, extra intrinsic recognition/matching, and better handle problematic scenarios such as these being instance methods (when ideally they'd be static extensions like they are for Vector64/128/256/512<T>) or returning scalar types just to immediately convert back to a vector.
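As a concrete example of the "primitive" shape involved, Vector4.Length is conceptually the following (a sketch; the point is that the JIT now recognizes such APIs directly as intrinsics rather than depending on inlining, and LengthSketch is a hypothetical helper name):

using System;
using System.Numerics;

internal static class PrimitiveOpSketch
{
    // Roughly what Vector4.Length computes: a dot product reduced to a scalar
    // followed by a scalar square root.
    public static float LengthSketch(Vector4 value)
        => MathF.Sqrt(Vector4.Dot(value, value));
}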

Other APIs such as Transform, Cross, Reflect, etc. are considered "complex" and are left entirely to managed code; they do not optimize down to 1-2 instruction sequences and are expected to be more expensive.
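For contrast, a sketch of one of the "complex" shapes, using the standard reflection formula (not necessarily the exact managed source; ReflectSketch is a hypothetical name):

using System.Numerics;

internal static class ComplexOpSketch
{
    // Reflect composes several simpler intrinsic operations rather than
    // mapping to a 1-2 instruction sequence, so it stays in managed code.
    public static Vector3 ReflectSketch(Vector3 vector, Vector3 normal)
        => vector - 2.0f * Vector3.Dot(vector, normal) * normal;
}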

@dotnet-issue-labeler

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@ghost ghost assigned tannergooding Jan 29, 2023
@ghost

ghost commented Jan 29, 2023

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

Issue Details


Author: tannergooding
Assignees: tannergooding
Labels: area-System.Numerics, new-api-needs-documentation

Milestone: -

@tannergooding
Member Author

Many diffs are cases such as:

-; 0 inlinees with PGO data; 4 single block inlinees; 0 inlinees without PGO data
+; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data

or

-;* V04 tmp2         [V04    ] (  0,  0   )   simd8  ->  zero-ref    ld-addr-op "NewObj constructor temp"
-;* V05 tmp3         [V05    ] (  0,  0   )   simd8  ->  zero-ref    ld-addr-op "NewObj constructor temp"
-;* V06 tmp4         [V06    ] (  0,  0   )   simd8  ->  zero-ref    V01.X(offs=0x00) P-INDEP "field V01.X (fldOffset=0x0)"
-;* V07 tmp5         [V07    ] (  0,  0   )   simd8  ->  zero-ref    V01.Y(offs=0x08) P-INDEP "field V01.Y (fldOffset=0x8)"
-;* V08 tmp6         [V08    ] (  0,  0   )   simd8  ->  zero-ref    V01.Z(offs=0x10) P-INDEP "field V01.Z (fldOffset=0x10)"
-;* V09 tmp7         [V09,T01] (  0,  0   )   simd8  ->  zero-ref    V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)"
-;* V10 tmp8         [V10,T02] (  0,  0   )   simd8  ->  zero-ref    V03.Y(offs=0x08) P-INDEP "field V03.Y (fldOffset=0x8)"
-;* V11 tmp9         [V11,T03] (  0,  0   )   simd8  ->  zero-ref    V03.Z(offs=0x10) P-INDEP "field V03.Z (fldOffset=0x10)"
+;* V04 tmp2         [V04    ] (  0,  0   )   simd8  ->  zero-ref    V01.X(offs=0x00) P-INDEP "field V01.X (fldOffset=0x0)"
+;* V05 tmp3         [V05    ] (  0,  0   )   simd8  ->  zero-ref    V01.Y(offs=0x08) P-INDEP "field V01.Y (fldOffset=0x8)"
+;* V06 tmp4         [V06    ] (  0,  0   )   simd8  ->  zero-ref    V01.Z(offs=0x10) P-INDEP "field V01.Z (fldOffset=0x10)"
+;* V07 tmp5         [V07,T01] (  0,  0   )   simd8  ->  zero-ref    V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)"
+;* V08 tmp6         [V08,T02] (  0,  0   )   simd8  ->  zero-ref    V03.Y(offs=0x08) P-INDEP "field V03.Y (fldOffset=0x8)"
+;* V09 tmp7         [V09,T03] (  0,  0   )   simd8  ->  zero-ref    V03.Z(offs=0x10) P-INDEP "field V03.Z (fldOffset=0x10)"

Both of these represent less work for the JIT to do.

@tannergooding
Member Author

Some of the instance methods do show small regressions. These occur because the instance method takes its first parameter as a byref, so slightly different code (most frequently an extra local) is generated to account for it.

In practice, such methods are intrinsic themselves, so this code isn't actually hit and doesn't represent any real regression.
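A sketch of the shape difference being described (illustrative only; the helper names are hypothetical and the locals just show where the extra copy comes from):

using System.Numerics;

internal static class ByRefShapeSketch
{
    // Instance methods receive 'this' as a byref, so the managed fallback body
    // typically copies it to a local before doing the vector math; that local
    // is where the small diff regressions come from.
    public static float LengthSquaredViaByRef(ref Vector4 self)
    {
        Vector4 value = self;   // extra local introduced to get a by-value copy
        return Vector4.Dot(value, value);
    }

    // The equivalent static, by-value shape needs no such copy.
    public static float LengthSquaredByValue(Vector4 value)
        => Vector4.Dot(value, value);
}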

@tannergooding
Member Author

The change to morph that recognizes Create(ScalarDot(..., ...)) and Create(ScalarSqrt(ScalarDot(..., ...))) gives some nice wins in various places:

-       vdpps    xmm3, xmm2, xmm2, 113
-       vsqrtss  xmm3, xmm3
-       vbroadcastss xmm3, xmm3
+       vdpps    xmm3, xmm2, xmm2, 127
+       vsqrtps  xmm3, xmm3
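For context, this is the kind of managed pattern that produces the Create(ScalarSqrt(ScalarDot(...))) shape: a scalar dot product and square root whose result is then broadcast back across the vector, as in a normalize-style helper (a sketch, not the exact source; NormalizeSketch is a hypothetical name):

using System.Numerics;

internal static class MorphShapeSketch
{
    // value.Length() is sqrt(dot(value, value)); dividing the vector by that
    // scalar broadcasts it across all lanes, which is roughly the shape the
    // new morph recognition targets.
    public static Vector4 NormalizeSketch(Vector4 value)
        => value / value.Length();
}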

Other functions show "regressions" that are just code-size increases: methods that were already supposed to be inlined are now actually being inlined, thanks to them being intrinsic.

And then, of course, APIs involving Quaternion and Plane themselves have improvements ranging from small to large. For example, Quaternion.get_IsIdentity was:

; Assembly listing for method System.Numerics.Quaternion:get_IsIdentity():bool:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
;  V00 this         [V00,T01] (  3,  3   )   byref  ->  rcx         this single-def
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T00] (  5, 10   )  struct (16) [rsp+08H]   do-not-enreg[SF] "impAppendStmt"
;* V03 tmp2         [V03    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] "struct address for call/obj"
;* V04 tmp3         [V04    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "NewObj constructor temp"
;  V05 tmp4         [V05,T02] (  3,  2   )    bool  ->  rax         "Inline return value spill temp"
;* V06 tmp5         [V06    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V07 tmp6         [V07    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V08 tmp7         [V08,T07] (  0,  0   )   float  ->  zero-ref    V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
;* V09 tmp8         [V09,T08] (  0,  0   )   float  ->  zero-ref    V04.Y(offs=0x04) P-INDEP "field V04.Y (fldOffset=0x4)"
;* V10 tmp9         [V10,T09] (  0,  0   )   float  ->  zero-ref    V04.Z(offs=0x08) P-INDEP "field V04.Z (fldOffset=0x8)"
;* V11 tmp10        [V11,T10] (  0,  0   )   float  ->  zero-ref    V04.W(offs=0x0c) P-INDEP "field V04.W (fldOffset=0xc)"
;  V12 tmp11        [V12,T03] (  2,  2   )   float  ->  mm0         V06.X(offs=0x00) P-INDEP "field V06.X (fldOffset=0x0)"
;  V13 tmp12        [V13,T04] (  2,  1.50)   float  ->  mm1         V06.Y(offs=0x04) P-INDEP "field V06.Y (fldOffset=0x4)"
;  V14 tmp13        [V14,T05] (  2,  1.50)   float  ->  mm2         V06.Z(offs=0x08) P-INDEP "field V06.Z (fldOffset=0x8)"
;  V15 tmp14        [V15,T06] (  2,  1.50)   float  ->  mm3         V06.W(offs=0x0c) P-INDEP "field V06.W (fldOffset=0xc)"
;* V16 tmp15        [V16    ] (  0,  0   )   float  ->  zero-ref    V07.X(offs=0x00) P-INDEP "field V07.X (fldOffset=0x0)"
;* V17 tmp16        [V17,T11] (  0,  0   )   float  ->  zero-ref    V07.Y(offs=0x04) P-INDEP "field V07.Y (fldOffset=0x4)"
;* V18 tmp17        [V18,T12] (  0,  0   )   float  ->  zero-ref    V07.Z(offs=0x08) P-INDEP "field V07.Z (fldOffset=0x8)"
;* V19 tmp18        [V19,T13] (  0,  0   )   float  ->  zero-ref    V07.W(offs=0x0c) P-INDEP "field V07.W (fldOffset=0xc)"
;
; Lcl frame size = 24

G_M51825_IG01:
       sub      rsp, 24
       vzeroupper 
						;; size=7 bbWeight=1 PerfScore 1.25
G_M51825_IG02:
       vmovups  xmm0, xmmword ptr [rcx]
       vmovups  xmmword ptr [rsp+08H], xmm0
       vmovss   xmm0, dword ptr [rsp+08H]
       vmovss   xmm1, dword ptr [rsp+0CH]
       vmovss   xmm2, dword ptr [rsp+10H]
       vmovss   xmm3, dword ptr [rsp+14H]
       vxorps   xmm4, xmm4
       vucomiss xmm0, xmm4
       jp       SHORT G_M51825_IG05
       jne      SHORT G_M51825_IG05
						;; size=46 bbWeight=1 PerfScore 21.33
G_M51825_IG03:
       vxorps   xmm0, xmm0
       vucomiss xmm1, xmm0
       jp       SHORT G_M51825_IG05
       jne      SHORT G_M51825_IG05
       vxorps   xmm0, xmm0
       vucomiss xmm2, xmm0
       jp       SHORT G_M51825_IG05
       jne      SHORT G_M51825_IG05
       xor      eax, eax
       vucomiss xmm3, dword ptr [reloc @RWD00]
       setnp    al
       jp       SHORT G_M51825_IG04
       sete     al
						;; size=42 bbWeight=0.50 PerfScore 7.96
G_M51825_IG04:
       jmp      SHORT G_M51825_IG06
						;; size=2 bbWeight=0.50 PerfScore 1.00
G_M51825_IG05:
       xor      eax, eax
						;; size=2 bbWeight=0.50 PerfScore 0.12
G_M51825_IG06:
       add      rsp, 24
       ret      
						;; size=5 bbWeight=1 PerfScore 1.25
RWD00  	dd	3F800000h		;         1


; Total bytes of code 104, prolog size 7, PerfScore 43.32, instruction count 29, allocated bytes for code 104 (MethodHash=46a9358e) for method System.Numerics.Quaternion:get_IsIdentity():bool:this
; ============================================================

But it is now:

; Assembly listing for method System.Numerics.Quaternion:get_IsIdentity():bool:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 this         [V00,T00] (  3,  3   )   byref  ->  rcx         this single-def
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M51825_IG01:
       vzeroupper 
						;; size=3 bbWeight=1 PerfScore 1.00
G_M51825_IG02:
       vmovups  xmm0, xmmword ptr [rcx]
       vcmpps   xmm0, xmm0, xmmword ptr [reloc @RWD00], 0
       vmovmskps xrax, xmm0
       cmp      eax, 15
       sete     al
       movzx    rax, al
						;; size=26 bbWeight=1 PerfScore 10.50
G_M51825_IG03:
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
RWD00  	dq	0000000000000000h, 3F80000000000000h


; Total bytes of code 30, prolog size 3, PerfScore 15.50, instruction count 8, allocated bytes for code 30 (MethodHash=46a9358e) for method System.Numerics.Quaternion:get_IsIdentity():bool:this
; ============================================================
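Roughly, the improved codegen corresponds to comparing all four lanes against the identity value (0, 0, 0, 1) at once and testing the mask, instead of four scalar compares with branches. A sketch of that shape (illustrative only; IsIdentitySketch is a hypothetical name):

using System.Numerics;
using System.Runtime.Intrinsics;

internal static class IsIdentityShapeSketch
{
    // Compare all four lanes to (0, 0, 0, 1) in one vector compare and check
    // that every lane matched, mirroring the vcmpps/vmovmskps sequence above.
    public static bool IsIdentitySketch(Quaternion value)
    {
        Vector128<float> v = Vector128.Create(value.X, value.Y, value.Z, value.W);
        Vector128<float> identity = Vector128.Create(0.0f, 0.0f, 0.0f, 1.0f);
        return Vector128.EqualsAll(v, identity);
    }
}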

@tannergooding
Member Author

There are still a few APIs that can be optimized further, but they are all on the managed side of things and involve a more complex refactoring of their code. I plan on handling that in a separate PR.

@tannergooding
Member Author

/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@tannergooding
Member Author

This just needs sign-off on the Mono side before it can be merged.

@tannergooding
Member Author

tannergooding commented Feb 1, 2023

Just tracking some of the Mono issues logged related to this PR:

As discussed with @fanyang-mono offline, my current plan after this PR is merged is to put up a PR which changes emit_vector_2_3_4 to actually handle Vector2/3 (today it only handles Vector4). I will optimistically be adding Quaternion/Plane support at the same time (#81483) and will update the issue with anything that blocks the latter two (I don't expect any blockers for Vector2/3).

@fanyang-mono
Member

/azp run runtime-extra-platforms

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


@fanyang-mono fanyang-mono left a comment


Thanks for taking care of Mono.

src/mono/mono/mini/simd-intrinsics.c
else
{
MonoClass *klass = mono_class_from_mono_type_internal (vector_type);
g_assert (
Member


I am not sure if we want to assert here, since Mono won't disable assertions in release builds. I think it is better to return false when the class is not one of the expected ones.

Member Author


👍, didn't realize Mono asserts were preserved in release code.

@tannergooding
Member Author

"I can see the SN_WithElement case is not being removed. Based on what you said earlier, it should be gone too?"

Yes, it looks like I didn't add the change to my commit. I'll fix this and the formatting issues after extra-platforms finishes, to avoid additional CI churn.

@tannergooding
Member Author

Link to extra platforms failures is here: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=156061

Other than a timeout due to an offline machine, the test failures match what the scheduled runs are hitting: https://dev.azure.com/dnceng-public/public/_build/results?buildId=155434&view=ms.vss-test-web.build-test-results-tab

@tannergooding
Member Author

Brought in the latest main to pick up the CI fix.

@tannergooding tannergooding merged commit 90401dd into dotnet:main Feb 2, 2023
@tannergooding tannergooding deleted the numerics-rewrite branch February 2, 2023 06:21
@ghost ghost locked as resolved and limited conversation to collaborators Mar 4, 2023