JIT: ARM64 - Added SVE APIs ExtractLastVector, ExtractLastScalar, ExtractAfterLastVector, ExtractAfterLastScalar #103847

Closed
wants to merge 20 commits into from

Conversation

TIHan
Contributor

@TIHan TIHan commented Jun 22, 2024

Contributes to #99957

Adds:

  • ExtractLastVector
  • ExtractLastScalar
  • ExtractAfterLastVector
  • ExtractAfterLastScalar

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

Contributor

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

@TIHan TIHan marked this pull request as ready for review June 22, 2024 04:14
@TIHan
Contributor Author

TIHan commented Jun 22, 2024

@dotnet/arm64-contrib @kunalspathak this is ready.

Example codegen:

  • ExtractLastScalar for int
G_M47661_IG01:        ; func=00, offs=0x000000, size=0x0010, bbWeight=1, PerfScore 3.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG

IN0006: 000000      stp     fp, lr, [sp, #-0x20]!
IN0007: 000004      mov     fp, sp
IN0008: 000008      str     xzr, [fp, #0x10]    // [V00 loc0]
IN0009: 00000C      str     xzr, [fp, #0x18]    // [V00 loc0+0x08]

G_M47661_IG02:        ; offs=0x000010, size=0x0014, bbWeight=1, PerfScore 9.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB02 [0001], byref

IN0001: 000010      movi    v16.4s, #0
IN0002: 000014      str     q16, [fp, #0x10]    // [V00 loc0]
IN0003: 000018      ldr     q16, [fp, #0x10]    // [V00 loc0]
IN0004: 00001C      ptrue   p0.s
IN0005: 000020      lasta   w0, p0, z16.s

G_M47661_IG03:        ; offs=0x000024, size=0x0008, bbWeight=1, PerfScore 2.00, epilog, nogc, extend

IN000a: 000024      ldp     fp, lr, [sp], #0x20
IN000b: 000028      ret     lr
  • ExtractLastScalar for double
IN0006: 000000      stp     fp, lr, [sp, #-0x20]!
IN0007: 000004      mov     fp, sp
IN0008: 000008      str     xzr, [fp, #0x10]    // [V00 loc0]
IN0009: 00000C      str     xzr, [fp, #0x18]    // [V00 loc0+0x08]

G_M12491_IG02:        ; offs=0x000010, size=0x0014, bbWeight=1, PerfScore 7.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB02 [0001], byref

IN0001: 000010      movi    v16.4s, #0
IN0002: 000014      str     q16, [fp, #0x10]    // [V00 loc0]
IN0003: 000018      ldr     q16, [fp, #0x10]    // [V00 loc0]
IN0004: 00001C      ptrue   p0.d
IN0005: 000020      lasta   z0, p0, z16.d

G_M12491_IG03:        ; offs=0x000024, size=0x0008, bbWeight=1, PerfScore 2.00, epilog, nogc, extend

IN000a: 000024      ldp     fp, lr, [sp], #0x20
IN000b: 000028      ret     lr
  • ExtractLastVector for int
G_M39689_IG01:        ; func=00, offs=0x000000, size=0x0010, bbWeight=1, PerfScore 3.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG

IN0006: 000000      stp     fp, lr, [sp, #-0x20]!
IN0007: 000004      mov     fp, sp
IN0008: 000008      str     xzr, [fp, #0x10]    // [V00 loc0]
IN0009: 00000C      str     xzr, [fp, #0x18]    // [V00 loc0+0x08]

G_M39689_IG02:        ; offs=0x000010, size=0x0014, bbWeight=1, PerfScore 7.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB02 [0001], byref

IN0001: 000010      movi    v16.4s, #0
IN0002: 000014      str     q16, [fp, #0x10]    // [V00 loc0]
IN0003: 000018      ldr     q16, [fp, #0x10]    // [V00 loc0]
IN0004: 00001C      ptrue   p0.s
IN0005: 000020      lastb   z0, p0, z16.s

G_M39689_IG03:        ; offs=0x000024, size=0x0008, bbWeight=1, PerfScore 2.00, epilog, nogc, extend

IN000a: 000024      ldp     fp, lr, [sp], #0x20
IN000b: 000028      ret     lr

Stress tests all passed:

===================Running default===================
------------------- {} -------------------
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastVector_Byte() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastVector_Double() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastVector_Int16() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastVector_Int32() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastVector_Int64() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastVector_SByte() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastVector_Single() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastVector_UInt16() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastVector_UInt32() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastVector_UInt64() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastScalar_Byte() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastScalar_Double() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastScalar_Int16() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastScalar_Int32() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastScalar_Int64() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastScalar_SByte() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastScalar_Single() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastScalar_UInt16() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastScalar_UInt32() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractLastScalar_UInt64() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastVector_Byte() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastVector_Double() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastVector_Int16() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastVector_Int32() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastVector_Int64() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastVector_SByte() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastVector_Single() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastVector_UInt16() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastVector_UInt32() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastVector_UInt64() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastScalar_Byte() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastScalar_Double() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastScalar_Int16() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastScalar_Int32() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastScalar_Int64() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastScalar_SByte() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastScalar_Single() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastScalar_UInt16() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastScalar_UInt32() : 7
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveExtractAfterLastScalar_UInt64() : 7
===================Running jitstress===================
------------------- {'JitMinOpts': '1'} -------------------
------------------- {'JitStress': '1'} -------------------
------------------- {'JitStress': '2'} -------------------
------------------- {'JitStress': '1', 'TieredCompilation': '1'} -------------------
------------------- {'JitStress': '2', 'TieredCompilation': '1'} -------------------
------------------- {'TailcallStress': '1'} -------------------
------------------- {'ReadyToRun': '0'} -------------------
===================Running jitstressregs===================
------------------- {'JitStressRegs': '1'} -------------------
------------------- {'JitStressRegs': '2'} -------------------
------------------- {'JitStressRegs': '3'} -------------------
------------------- {'JitStressRegs': '4'} -------------------
------------------- {'JitStressRegs': '8'} -------------------
------------------- {'JitStressRegs': '0x10'} -------------------
------------------- {'JitStressRegs': '0x80'} -------------------
------------------- {'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStressRegs': '0x2000'} -------------------
===================Running jitstress2-jitstressregs===================
------------------- {'JitStress': '2', 'JitStressRegs': '1'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '2'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '3'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '4'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '8'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x10'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x80'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x2000'} -------------------

@TIHan
Contributor Author

TIHan commented Jun 24, 2024

@kunalspathak this is ready again. Hope this is a reasonable compromise. I didn't create a flag, but I created a function. There isn't any room for an additional flag.

Member

@kunalspathak kunalspathak left a comment

We should add ConditionalSelect tests for Extract*Vector API.

src/coreclr/jit/lowerarmarch.cpp Show resolved Hide resolved
src/coreclr/jit/hwintrinsic.h Outdated Show resolved Hide resolved
src/coreclr/jit/hwintrinsiccodegenarm64.cpp Show resolved Hide resolved
src/coreclr/jit/hwintrinsiccodegenarm64.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/hwintrinsiccodegenarm64.cpp Outdated Show resolved Hide resolved
@TIHan
Contributor Author

TIHan commented Jun 26, 2024

@kunalspathak this is almost done, but I'm having issues with the ConditionalSelectScenarios when TieredCompilation=0

@kunalspathak
Member

> @kunalspathak this is almost done, but I'm having issues with the ConditionalSelectScenarios when TieredCompilation=0

are you still blocked on this or is the PR working as expected?

@TIHan
Contributor Author

TIHan commented Jun 27, 2024

> are you still blocked on this or is the PR working as expected?

I'm making progress. Basically, we can't remove the sel instruction for ConditionalSelect(mask, ExtractLastVector, falseOp) and ConditionalSelect(mask, ExtractAfterLastVector, falseOp).

@TIHan
Contributor Author

TIHan commented Jun 27, 2024

@kunalspathak this is ready, apart from some ConditionalSelect issues that we are trying to address.

/// LASTA Wresult, Pg, Zop.B
/// LASTA Bresult, Pg, Zop.B
/// </summary>
public static unsafe byte ExtractAfterLastScalar(Vector<byte> value) => ExtractAfterLastScalar(value);
Member

The issue is that the managed signature of these APIs isn't correct (and wasn't caught in API review).

These all behave specially and need to take a mask explicitly, so should've been in a general shape that looked more like:

public static T ExtractAfterLastActiveToScalar(Vector<T> mask, Vector<T> value)
public static Vector<T> ExtractAfterLastActiveToVector(Vector<T> mask, Vector<T> value)

These are shipping as experimental in .NET 9, so we can go ahead and fix it now and follow up in API review later, but we should log an issue tracking that.

The general consideration is that LASTA and LASTB take a destination register, a mask register, and an input register. The mask determines which input element is extracted. LASTA (extract after last) finds the index of the last element in mask that is "active" (all bits set) and extracts index+1 from value; if the mask is zero, or the last active element is the final one (as with allbitsset), it uses index 0. LASTB (extract last) finds the index of the last "active" element in mask and extracts that index from value; if the mask is zero or allbitsset, it uses the last index.
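
To make the selection rule above concrete, here is a minimal scalar sketch in C++ (the language of the JIT sources this PR touches). It is only a model under stated assumptions: the mask is a `std::vector<bool>` with one flag per lane rather than a real predicate register, and the names `last_active_index`, `extract_last`, and `extract_after_last` are hypothetical stand-ins, not runtime or ACLE APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index of the last active lane, or -1 when no lane is active.
// (Hypothetical helper; a real predicate register is not a vector<bool>.)
static std::ptrdiff_t last_active_index(const std::vector<bool>& mask) {
    for (std::ptrdiff_t i = static_cast<std::ptrdiff_t>(mask.size()) - 1; i >= 0; --i) {
        if (mask[static_cast<std::size_t>(i)]) return i;
    }
    return -1;
}

// LASTB-style model: take the last active lane; with no active lane,
// fall back to the final lane.
template <typename T>
T extract_last(const std::vector<bool>& mask, const std::vector<T>& value) {
    assert(mask.size() == value.size() && !value.empty());
    std::ptrdiff_t i = last_active_index(mask);
    if (i < 0) i = static_cast<std::ptrdiff_t>(value.size()) - 1;
    return value[static_cast<std::size_t>(i)];
}

// LASTA-style model: take the lane after the last active one; with no
// active lane (-1 + 1 == 0) or when that lane would be past the end,
// use lane 0.
template <typename T>
T extract_after_last(const std::vector<bool>& mask, const std::vector<T>& value) {
    assert(mask.size() == value.size() && !value.empty());
    std::size_t next = static_cast<std::size_t>(last_active_index(mask) + 1);
    if (next >= value.size()) next = 0;
    return value[next];
}
```

With `value = {10, 20, 30, 40}` and a mask whose first two lanes are active, the LASTB model yields 20 and the LASTA model yields 30. An all-false or all-true mask yields 40 for the LASTB model and 10 for the LASTA model.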

So, the signatures must explicitly take a mask and they shouldn't participate in the general EmbeddedMaskOperation containment logic
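
One way to see why the usual containment does not apply is to compare the two readings in a scalar model. The sketch below is an illustration in C++ under assumptions, not the JIT's logic: masks are `std::vector<bool>`, lanes are `int`, and `extract_last_vector`, `conditional_select`, `select_unfolded`, and `select_folded` are hypothetical names. It models the vector-destination form as writing the extracted element to lane 0 and zeroing the rest, as described later in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Lanes = std::vector<int>;
using Mask = std::vector<bool>;

// LASTB-style, vector destination: the last active lane of x goes to
// lane 0 and every other lane is zero (modeling assumption).
Lanes extract_last_vector(const Mask& pg, const Lanes& x) {
    assert(pg.size() == x.size() && !x.empty());
    std::size_t idx = x.size() - 1;  // no active lane: use the final lane
    for (std::size_t i = x.size(); i-- > 0;) {
        if (pg[i]) { idx = i; break; }
    }
    Lanes result(x.size(), 0);
    result[0] = x[idx];
    return result;
}

// Per-lane select: take t where the mask is active, f elsewhere.
Lanes conditional_select(const Mask& mask, const Lanes& t, const Lanes& f) {
    assert(mask.size() == t.size() && t.size() == f.size());
    Lanes result(f);
    for (std::size_t i = 0; i < result.size(); ++i) {
        if (mask[i]) result[i] = t[i];
    }
    return result;
}

// What ConditionalSelect(mask, ExtractLastVector(x), falseOp) means when
// the extract runs under an all-true predicate and the select is kept.
Lanes select_unfolded(const Mask& mask, const Lanes& x, const Lanes& f) {
    return conditional_select(mask, extract_last_vector(Mask(x.size(), true), x), f);
}

// What would happen if the select's mask were folded into the
// instruction as its governing predicate: the mask now picks the lane.
Lanes select_folded(const Mask& mask, const Lanes& x) {
    return extract_last_vector(mask, x);
}
```

For `x = {1, 2, 3, 4}`, a mask with only lane 1 active, and `falseOp = {9, 9, 9, 9}`, the unfolded form gives `{9, 0, 9, 9}` while the folded form gives `{2, 0, 0, 0}`, so treating the ConditionalSelect mask as the instruction's governing predicate would change the result.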

Member

@tannergooding tannergooding Jun 28, 2024

-- The names I gave above maybe aren't the best.

The concept involved is that we have what is logically similar to:

  • T Extract(Vector<T> vector, int index)
  • Vector<T> ExtractVector(Vector<T> vector, int index)

There's then a modifier on top which in other APIs is exposed as LastActiveElement and AfterLastActiveElement

So the natural way to combine them would maybe instead be:

T ExtractAfterLastActiveElement(Vector<T> mask, Vector<T> vector);
Vector<T> ExtractVectorAfterLastActiveElement(Vector<T> mask, Vector<T> vector);

T ExtractLastActiveElement(Vector<T> mask, Vector<T> vector);
Vector<T> ExtractVectorLastActiveElement(Vector<T> mask, Vector<T> vector);

// or possibly
T ExtractFromLastActiveElement(Vector<T> mask, Vector<T> vector);
Vector<T> ExtractVectorFromLastActiveElement(Vector<T> mask, Vector<T> vector);

Member

Not super happy with ExtractVector* though since it's not actually extracting a vector

It's extracting a scalar and simply storing it in element 0 of a destination vector register

Member

Given that the vector form always zero-extends and there isn't a great name for it, we could support it by recognizing CreateScalar(ExtractFromLastActiveElement(mask, vector)).

That gives us a single understandable name and still lets us generate the vector form where it's appropriate.
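
The composition suggested here can be sketched with a scalar model in C++. This is an assumption-laden illustration rather than the proposed API: `extract_from_last_active_element` and `create_scalar` are hypothetical stand-ins for the managed calls named above, masks are `std::vector<bool>`, and lanes are `int`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Lanes = std::vector<int>;
using Mask = std::vector<bool>;

// Scalar form (LASTB-style): value of the last active lane, or of the
// final lane when no lane is active.
int extract_from_last_active_element(const Mask& mask, const Lanes& v) {
    assert(mask.size() == v.size() && !v.empty());
    for (std::size_t i = v.size(); i-- > 0;) {
        if (mask[i]) return v[i];
    }
    return v.back();
}

// Stand-in for CreateScalar: the value lands in lane 0 and all other
// lanes are zero.
Lanes create_scalar(int value, std::size_t lane_count) {
    assert(lane_count > 0);
    Lanes result(lane_count, 0);
    result[0] = value;
    return result;
}
```

In this model, `create_scalar(extract_from_last_active_element(mask, v), v.size())` with `v = {10, 20, 30, 40}` and the first two lanes active produces `{20, 0, 0, 0}`. A JIT that recognizes the nested call could, in principle, emit the vector-destination form directly, since the composed result already has the extracted element in lane 0 and zeros elsewhere.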

@TIHan
Contributor Author

TIHan commented Jul 1, 2024

We are skipping these APIs for now.

@TIHan TIHan added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Jul 1, 2024
@kunalspathak
Member

@TIHan - can we close this, since we will have to rewrite part of it anyway?

@TIHan TIHan closed this Jul 2, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Aug 2, 2024
Labels
  • area-System.Runtime.Intrinsics
  • arm-sve: Work related to arm64 SVE/SVE2 support
  • new-api-needs-documentation
  • NO-MERGE: The PR is not ready for merge yet (see discussion for detailed reasons)
3 participants