
Improve the codegen of the vector accelerated System.Numerics.* types #81335

Merged
merged 32 commits into dotnet:main from numerics-rewrite
Feb 2, 2023

Conversation

tannergooding
Member

This improves the codegen for more scenarios involving Vector2/3/4, Quaternion, and Plane.

Note about Quaternion and Plane

In an ideal world, particularly for Quaternion and Plane, a large portion of the JIT side changes of this PR would be unnecessary. However, we don't live in that world and these types were all exposed with public fields which limits the ability to change their internal implementation details.
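To make the constraint concrete, here is a purely illustrative sketch (both type names are hypothetical; Quaternion's real public surface is the four float fields):

using System.Numerics;

// The public surface Quaternion must keep exposing: four public float fields.
public struct QuaternionAsShipped
{
    public float X, Y, Z, W;
}

// The "ideal" internal shape, which would let most operations defer to
// Vector4, but which cannot be adopted without breaking the public fields.
public struct QuaternionIdealSketch
{
    private Vector4 _value;

    public static QuaternionIdealSketch operator +(QuaternionIdealSketch left, QuaternionIdealSketch right)
        => new QuaternionIdealSketch { _value = left._value + right._value };
}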

What this means is that even though Quaternion should be wrapping a single Vector4 field and deferring most of its operations to Vector4, it can't. Unlike Matrix4x4, which is sufficiently large that the limitation could be worked around by using Unsafe.As to convert the byref, avoid the copy, and still end up getting promoted, the same is not trivially true for Quaternion or Plane, which are enregisterable.
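As a minimal sketch of the Unsafe.As pattern described above (illustrative only, not the actual runtime source; it assumes only that M11..M14 are the first 16 bytes of Matrix4x4's sequential float layout, and MatrixSketch is a hypothetical name):

using System.Numerics;
using System.Runtime.CompilerServices;

internal static class MatrixSketch
{
    // Reinterpret the incoming byref so the first row can be read as a Vector4
    // without copying the 64-byte struct. This is viable for a large,
    // memory-backed type like Matrix4x4, but is not a free transformation for
    // enregisterable types like Quaternion or Plane.
    public static Vector4 GetRow0(ref Matrix4x4 value)
        => Unsafe.As<Matrix4x4, Vector4>(ref value);
}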

So what ends up happening is that we explicitly track Quaternion and Plane as TYP_SIMD16, much as we do for Vector4. This allows us to light up all the same methods with all the same intrinsic handling.

I considered limiting things to a couple of "bitcast" intrinsics, AsVector4() and AsQuaternion()/AsPlane(), as this would allow implementing operator + as (value1.AsVector4() + value2.AsVector4()).AsQuaternion(). However, given that we already have the intrinsic handling for operator + on Vector4 directly, and that this avoids needing to handle 3 intrinsics + inlining + eliding the copies, I opted to do it this way instead.
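For reference, a sketch of the rejected "bitcast intrinsic" shape (AsVector4/AsQuaternion were only considered, not added; the stand-ins below use Unsafe.As purely for illustration, and BitcastSketch is a hypothetical name):

using System.Numerics;
using System.Runtime.CompilerServices;

internal static class BitcastSketch
{
    // Stand-ins for the hypothetical bitcast intrinsics.
    public static Vector4 AsVector4(Quaternion value) => Unsafe.As<Quaternion, Vector4>(ref value);
    public static Quaternion AsQuaternion(Vector4 value) => Unsafe.As<Vector4, Quaternion>(ref value);

    // How operator + could have been expressed under that approach.
    public static Quaternion Add(Quaternion value1, Quaternion value2)
        => AsQuaternion(AsVector4(value1) + AsVector4(value2));
}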

Note about other changes

The other APIs are split between intrinsic and managed handling.

For the intrinsic handling, I limited it to APIs which already had intrinsic handling (e.g. Add and operator + can both be trivially done) and otherwise to APIs which are functionally "primitive" operations for these types such as Length. This allows us to avoid inlining, extra intrinsic recognition/matching, and better handle problematic scenarios such as these being instance methods (when ideally they'd be static extensions like they are for Vector64/128/256/512<T>) or returning scalar types just to immediately convert back to a vector.
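As a concrete example of the "primitive" shape involved, Vector4.Length is conceptually the following (a sketch; the point is that the JIT now recognizes such APIs directly as intrinsics rather than depending on inlining, and LengthSketch is a hypothetical helper name):

using System;
using System.Numerics;

internal static class PrimitiveOpSketch
{
    // Roughly what Vector4.Length computes: a dot product reduced to a scalar
    // followed by a scalar square root.
    public static float LengthSketch(Vector4 value)
        => MathF.Sqrt(Vector4.Dot(value, value));
}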

Other APIs such as Transform, Cross, Reflect, etc. are considered "complex" and are left entirely to managed code; they do not optimize down to 1-2 instruction sequences and are expected to be more expensive.
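For contrast, a sketch of one of the "complex" shapes, using the standard reflection formula (not necessarily the exact managed source; ReflectSketch is a hypothetical name):

using System.Numerics;

internal static class ComplexOpSketch
{
    // Reflect composes several simpler intrinsic operations rather than
    // mapping to a 1-2 instruction sequence, so it stays in managed code.
    public static Vector3 ReflectSketch(Vector3 vector, Vector3 normal)
        => vector - 2.0f * Vector3.Dot(vector, normal) * normal;
}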

@dotnet-issue-labeler

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@ghost ghost assigned tannergooding Jan 29, 2023
@ghost

ghost commented Jan 29, 2023

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

Issue Details


Author: tannergooding
Assignees: tannergooding
Labels: area-System.Numerics, new-api-needs-documentation

Milestone: -

@tannergooding
Member Author

Many diffs are cases such as:

-; 0 inlinees with PGO data; 4 single block inlinees; 0 inlinees without PGO data
+; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data

or

-;* V04 tmp2         [V04    ] (  0,  0   )   simd8  ->  zero-ref    ld-addr-op "NewObj constructor temp"
-;* V05 tmp3         [V05    ] (  0,  0   )   simd8  ->  zero-ref    ld-addr-op "NewObj constructor temp"
-;* V06 tmp4         [V06    ] (  0,  0   )   simd8  ->  zero-ref    V01.X(offs=0x00) P-INDEP "field V01.X (fldOffset=0x0)"
-;* V07 tmp5         [V07    ] (  0,  0   )   simd8  ->  zero-ref    V01.Y(offs=0x08) P-INDEP "field V01.Y (fldOffset=0x8)"
-;* V08 tmp6         [V08    ] (  0,  0   )   simd8  ->  zero-ref    V01.Z(offs=0x10) P-INDEP "field V01.Z (fldOffset=0x10)"
-;* V09 tmp7         [V09,T01] (  0,  0   )   simd8  ->  zero-ref    V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)"
-;* V10 tmp8         [V10,T02] (  0,  0   )   simd8  ->  zero-ref    V03.Y(offs=0x08) P-INDEP "field V03.Y (fldOffset=0x8)"
-;* V11 tmp9         [V11,T03] (  0,  0   )   simd8  ->  zero-ref    V03.Z(offs=0x10) P-INDEP "field V03.Z (fldOffset=0x10)"
+;* V04 tmp2         [V04    ] (  0,  0   )   simd8  ->  zero-ref    V01.X(offs=0x00) P-INDEP "field V01.X (fldOffset=0x0)"
+;* V05 tmp3         [V05    ] (  0,  0   )   simd8  ->  zero-ref    V01.Y(offs=0x08) P-INDEP "field V01.Y (fldOffset=0x8)"
+;* V06 tmp4         [V06    ] (  0,  0   )   simd8  ->  zero-ref    V01.Z(offs=0x10) P-INDEP "field V01.Z (fldOffset=0x10)"
+;* V07 tmp5         [V07,T01] (  0,  0   )   simd8  ->  zero-ref    V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)"
+;* V08 tmp6         [V08,T02] (  0,  0   )   simd8  ->  zero-ref    V03.Y(offs=0x08) P-INDEP "field V03.Y (fldOffset=0x8)"
+;* V09 tmp7         [V09,T03] (  0,  0   )   simd8  ->  zero-ref    V03.Z(offs=0x10) P-INDEP "field V03.Z (fldOffset=0x10)"

Both of these represent less work for the JIT to do.

@tannergooding
Member Author

Some of the instance methods do show small regressions. These occur because the instance method takes its first parameter as a byref, so slightly different code (most frequently an extra local) is generated to account for it.

In practice, such methods are intrinsic themselves, so this code isn't actually hit and doesn't represent any real regression.
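A sketch of the shape difference being described (illustrative only; the helper names are hypothetical and the locals just show where the extra copy comes from):

using System.Numerics;

internal static class ByRefShapeSketch
{
    // Instance methods receive 'this' as a byref, so the managed fallback body
    // typically copies it to a local before doing the vector math; that local
    // is where the small diff regressions come from.
    public static float LengthSquaredViaByRef(ref Vector4 self)
    {
        Vector4 value = self;   // extra local introduced to get a by-value copy
        return Vector4.Dot(value, value);
    }

    // The equivalent static, by-value shape needs no such copy.
    public static float LengthSquaredByValue(Vector4 value)
        => Vector4.Dot(value, value);
}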

@tannergooding
Member Author

The change to morph that recognizes Create(ScalarDot(..., ...)) and Create(ScalarSqrt(ScalarDot(..., ...))) gives some nice wins in various places:

-       vdpps    xmm3, xmm2, xmm2, 113
-       vsqrtss  xmm3, xmm3
-       vbroadcastss xmm3, xmm3
+       vdpps    xmm3, xmm2, xmm2, 127
+       vsqrtps  xmm3, xmm3
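For context, this is the kind of managed pattern that produces the Create(ScalarSqrt(ScalarDot(...))) shape: a scalar dot product and square root whose result is then broadcast back across the vector, as in a normalize-style helper (a sketch, not the exact source; NormalizeSketch is a hypothetical name):

using System.Numerics;

internal static class MorphShapeSketch
{
    // value.Length() is sqrt(dot(value, value)); dividing the vector by that
    // scalar broadcasts it across all lanes, which is roughly the shape the
    // new morph recognition targets.
    public static Vector4 NormalizeSketch(Vector4 value)
        => value / value.Length();
}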

Other functions show "regressions" that are just code-size increases: methods that were already supposed to be inlined are now actually being inlined, thanks to them being intrinsic.

And then, of course, APIs involving Quaternion and Plane themselves have improvements ranging from small to large. For example, Quaternion.get_IsIdentity was:

; Assembly listing for method System.Numerics.Quaternion:get_IsIdentity():bool:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
;  V00 this         [V00,T01] (  3,  3   )   byref  ->  rcx         this single-def
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T00] (  5, 10   )  struct (16) [rsp+08H]   do-not-enreg[SF] "impAppendStmt"
;* V03 tmp2         [V03    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] "struct address for call/obj"
;* V04 tmp3         [V04    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "NewObj constructor temp"
;  V05 tmp4         [V05,T02] (  3,  2   )    bool  ->  rax         "Inline return value spill temp"
;* V06 tmp5         [V06    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V07 tmp6         [V07    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
;* V08 tmp7         [V08,T07] (  0,  0   )   float  ->  zero-ref    V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
;* V09 tmp8         [V09,T08] (  0,  0   )   float  ->  zero-ref    V04.Y(offs=0x04) P-INDEP "field V04.Y (fldOffset=0x4)"
;* V10 tmp9         [V10,T09] (  0,  0   )   float  ->  zero-ref    V04.Z(offs=0x08) P-INDEP "field V04.Z (fldOffset=0x8)"
;* V11 tmp10        [V11,T10] (  0,  0   )   float  ->  zero-ref    V04.W(offs=0x0c) P-INDEP "field V04.W (fldOffset=0xc)"
;  V12 tmp11        [V12,T03] (  2,  2   )   float  ->  mm0         V06.X(offs=0x00) P-INDEP "field V06.X (fldOffset=0x0)"
;  V13 tmp12        [V13,T04] (  2,  1.50)   float  ->  mm1         V06.Y(offs=0x04) P-INDEP "field V06.Y (fldOffset=0x4)"
;  V14 tmp13        [V14,T05] (  2,  1.50)   float  ->  mm2         V06.Z(offs=0x08) P-INDEP "field V06.Z (fldOffset=0x8)"
;  V15 tmp14        [V15,T06] (  2,  1.50)   float  ->  mm3         V06.W(offs=0x0c) P-INDEP "field V06.W (fldOffset=0xc)"
;* V16 tmp15        [V16    ] (  0,  0   )   float  ->  zero-ref    V07.X(offs=0x00) P-INDEP "field V07.X (fldOffset=0x0)"
;* V17 tmp16        [V17,T11] (  0,  0   )   float  ->  zero-ref    V07.Y(offs=0x04) P-INDEP "field V07.Y (fldOffset=0x4)"
;* V18 tmp17        [V18,T12] (  0,  0   )   float  ->  zero-ref    V07.Z(offs=0x08) P-INDEP "field V07.Z (fldOffset=0x8)"
;* V19 tmp18        [V19,T13] (  0,  0   )   float  ->  zero-ref    V07.W(offs=0x0c) P-INDEP "field V07.W (fldOffset=0xc)"
;
; Lcl frame size = 24

G_M51825_IG01:
       sub      rsp, 24
       vzeroupper 
						;; size=7 bbWeight=1 PerfScore 1.25
G_M51825_IG02:
       vmovups  xmm0, xmmword ptr [rcx]
       vmovups  xmmword ptr [rsp+08H], xmm0
       vmovss   xmm0, dword ptr [rsp+08H]
       vmovss   xmm1, dword ptr [rsp+0CH]
       vmovss   xmm2, dword ptr [rsp+10H]
       vmovss   xmm3, dword ptr [rsp+14H]
       vxorps   xmm4, xmm4
       vucomiss xmm0, xmm4
       jp       SHORT G_M51825_IG05
       jne      SHORT G_M51825_IG05
						;; size=46 bbWeight=1 PerfScore 21.33
G_M51825_IG03:
       vxorps   xmm0, xmm0
       vucomiss xmm1, xmm0
       jp       SHORT G_M51825_IG05
       jne      SHORT G_M51825_IG05
       vxorps   xmm0, xmm0
       vucomiss xmm2, xmm0
       jp       SHORT G_M51825_IG05
       jne      SHORT G_M51825_IG05
       xor      eax, eax
       vucomiss xmm3, dword ptr [reloc @RWD00]
       setnp    al
       jp       SHORT G_M51825_IG04
       sete     al
						;; size=42 bbWeight=0.50 PerfScore 7.96
G_M51825_IG04:
       jmp      SHORT G_M51825_IG06
						;; size=2 bbWeight=0.50 PerfScore 1.00
G_M51825_IG05:
       xor      eax, eax
						;; size=2 bbWeight=0.50 PerfScore 0.12
G_M51825_IG06:
       add      rsp, 24
       ret      
						;; size=5 bbWeight=1 PerfScore 1.25
RWD00  	dd	3F800000h		;         1


; Total bytes of code 104, prolog size 7, PerfScore 43.32, instruction count 29, allocated bytes for code 104 (MethodHash=46a9358e) for method System.Numerics.Quaternion:get_IsIdentity():bool:this
; ============================================================

But it is now:

; Assembly listing for method System.Numerics.Quaternion:get_IsIdentity():bool:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 this         [V00,T00] (  3,  3   )   byref  ->  rcx         this single-def
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M51825_IG01:
       vzeroupper 
						;; size=3 bbWeight=1 PerfScore 1.00
G_M51825_IG02:
       vmovups  xmm0, xmmword ptr [rcx]
       vcmpps   xmm0, xmm0, xmmword ptr [reloc @RWD00], 0
       vmovmskps xrax, xmm0
       cmp      eax, 15
       sete     al
       movzx    rax, al
						;; size=26 bbWeight=1 PerfScore 10.50
G_M51825_IG03:
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
RWD00  	dq	0000000000000000h, 3F80000000000000h


; Total bytes of code 30, prolog size 3, PerfScore 15.50, instruction count 8, allocated bytes for code 30 (MethodHash=46a9358e) for method System.Numerics.Quaternion:get_IsIdentity():bool:this
; ============================================================
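Roughly, the improved codegen corresponds to comparing all four lanes against the identity value (0, 0, 0, 1) at once and testing the mask, instead of four scalar compares with branches. A sketch of that shape (illustrative only; IsIdentitySketch is a hypothetical name):

using System.Numerics;
using System.Runtime.Intrinsics;

internal static class IsIdentityShapeSketch
{
    // Compare all four lanes to (0, 0, 0, 1) in one vector compare and check
    // that every lane matched, mirroring the vcmpps/vmovmskps sequence above.
    public static bool IsIdentitySketch(Quaternion value)
    {
        Vector128<float> v = Vector128.Create(value.X, value.Y, value.Z, value.W);
        Vector128<float> identity = Vector128.Create(0.0f, 0.0f, 0.0f, 1.0f);
        return Vector128.EqualsAll(v, identity);
    }
}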

@tannergooding
Member Author

There are still a few APIs that can be optimized further, but they are all on the managed side of things and involve a more complex refactoring of their code. I plan on handling that in a separate PR.

@tannergooding
Member Author

/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@tannergooding
Member Author

This just needs sign-off on the Mono side before it can be merged.

@tannergooding
Member Author

tannergooding commented Feb 1, 2023

Just tracking some of the Mono issues logged related to this PR:

As discussed with @fanyang-mono offline, my current plan after this PR is merged is to put up a PR which changes emit_vector_2_3_4 to actually handle Vector2/3 (today it only handles Vector4). I will optimistically be adding Quaternion/Plane support at the same time (#81483) and will update the issue with anything that blocks the latter two (I don't expect any blockers for Vector2/3).

@fanyang-mono
Member

/azp run runtime-extra-platforms

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


@fanyang-mono fanyang-mono left a comment


Thanks for taking care of Mono.

src/mono/mono/mini/simd-intrinsics.c
else
{
MonoClass *klass = mono_class_from_mono_type_internal (vector_type);
g_assert (
Member


I am not sure if we want to assert here, since Mono won't disable assertions in release builds. I think it is better to return false when the class is not one of the expected ones.

Member Author


👍, didn't realize Mono asserts were preserved in release code.

@tannergooding
Member Author

"I can see the SN_WithElement case is not being removed. Based on what you said earlier, it should be gone too?"

Yes, it looks like I didn't add the change to my commit. I'll fix this and the formatting issues after extra-platforms finishes, to avoid additional CI churn.

@tannergooding
Member Author

Link to extra platforms failures is here: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=156061

Other than a timeout due to an offline machine, the test failures match what the scheduled runs are hitting: https://dev.azure.com/dnceng-public/public/_build/results?buildId=155434&view=ms.vss-test-web.build-test-results-tab

@tannergooding
Member Author

Brought in the latest main to pick up the CI fix.

@tannergooding tannergooding merged commit 90401dd into dotnet:main Feb 2, 2023
@tannergooding tannergooding deleted the numerics-rewrite branch February 2, 2023 06:21
@ghost ghost locked as resolved and limited conversation to collaborators Mar 4, 2023