Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup some handling around Avx10v1 #103241

Merged
merged 5 commits into from
Jun 18, 2024
Merged

Conversation

tannergooding
Copy link
Member

This does some minor cleanup around the new AVX10v1 support introduced in #101938

Namely it:

  • fixes up the instructionsetdesc implications to not repeat the EVEX dependencies twice
  • ensures the ``instructionsetdesc` implications are correctly ordered
  • has the minipal_getcpufeatures and EEJitManager::SetCpuInfo functions only report Avx512F if F+BW+CD+DQ+VL are supported
  • handles a couple minor edge cases in the JIT that weren't handling Avx10v1 support

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 10, 2024
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Comment on lines 1371 to 1375
if (((cpuFeatures & XArchIntrinsicConstants_Avx512f) != 0) &&
((cpuFeatures & XArchIntrinsicConstants_Avx512f_vl) != 0)
((cpuFeatures & XArchIntrinsicConstants_Avx512bw) != 0)
((cpuFeatures & XArchIntrinsicConstants_Avx512bw_vl) != 0)
((cpuFeatures & XArchIntrinsicConstants_Avx512cd) != 0)
((cpuFeatures & XArchIntrinsicConstants_Avx512cd_vl) != 0)
((cpuFeatures & XArchIntrinsicConstants_Avx512dq) != 0)
((cpuFeatures & XArchIntrinsicConstants_Avx512dq_vl) != 0))
{
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might require a R2R breaking change, but it would be nice if we could just compress these down to 1 bit and cleanup the CorInfoInstructionSet handling a bit as well.

We're never going to support, in the JIT at least, having the baseline AVX512 ISAs independent, so it'd be nicer if we just had DOTNET_EnableAVX512, XArchIntrinsicConstants_Avx512, and InstructionSet_Avx512 instead. That would also allow us to do simpler queries in the JIT around ISA support

At the same time, the nested classes are implied by other information. For example, the X64, VL, and V512 nested classes are all implied by the enclosing type being supported and another implicit bit of information such as the target architecture or if the base class is supported. So we really don't "need" all the extra logic and instruction set IDs to support those.

@tannergooding
Copy link
Member Author

CC. @dotnet/jit-contrib for review and @dotnet/avx512-contrib as an FYI

Just some post cleanup work for the AVX10v1 support that went in last week.

@tannergooding
Copy link
Member Author

(Resolved merge conflicts)

@tannergooding
Copy link
Member Author

CC. @jkotas for the VM side changes.

Relatively straightforward as detailed above, mostly just ensuring that the instructionsetdesc are ordered correctly so the implications work out as expected. The recursive implications could actually cause problems in some edges depending on how they were disabled via the environment variable knobs, so enforcing it as part of the minipal cpufeature querying seems overall much cleaner and avoids that complexity.

As called out, it would be nice if we could simplify this even more but I expect that'd require a breaking change to the config knobs we expose and to the R2R format.

}
}
result |= XArchIntrinsicConstants_Evex;
result |= XArchIntrinsicConstants_Avx512f;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we bake the JIT policy of requiring all these features together at the same time into the minipal, we can fold all these bits into just a single bit XArchIntrinsicConstants_Avx512f.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code in JIT/EE interface https://github.com/dotnet/runtime/pull/103241/files#diff-eba17a72b6caa8b52b196f3b84bfd20dfeaf39a89f410943a692d1f33f569f3bR1366 does the same thing. What's the benefit of doing this in the minipal too?

Copy link
Member Author

@tannergooding tannergooding Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we bake the JIT policy of requiring all these features together at the same time into the minipal, we can fold all these bits into just a single bit XArchIntrinsicConstants_Avx512f.

Right, and I think that's desirable long term. I've updated the PR to fold it down to Avx512 and Avx512Vbmi

What's the benefit of doing this in the minipal too?

It's helping enforce the consistency everywhere that consumes the CPUID bits.

What I ended up finding is that the implication defined in InstructionSetDesc don't work well with recursive implication sets. This is because it basically does a naive ordering and so what happened is that Avx512F would be supported, which meant that Avx512BW/CD/DQ had their requirements met. Then at end when processing the recursive implications, if one of BW/CD/DQ/VL wasn't supported, F would be unset but the others wouldn't leading to a torn view.

While handling this in codeman is sufficient for the JIT, it doesn't necessarily mean we followed suite in NAOT or other scenarios as they could get disconnected. So simply enforcing this at the minipal level makes it a bit easier and ensures everywhere has a consistent view of what we require to be supported together.

More ideally we'd also clean this up at the InstructionSet_* level too as we don't need instructionset64bit (this is implied by the target architecture bitness), we don't need instructionset Avx512F_VL (this could be compressed down to just Avx512F), etc. I think doing this when we're ready for the next R2R breaking change would be good, as it should free up bits and reduce the overall complexity of the JIT by just requiring a little bit of handling in the lookupIntrinsicId function instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think doing this when we're ready for the next R2R breaking change would be good

We just took a R2R breaking change a few days ago, so if you would like to cleanup the CorInfoInstructionSet bits, it is a good time now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍, I'd like to do that in a follow up PR then so that the more core fixes can go in. The CorInfoInstructionSet cleanup is just nice to have at that point.

XArchIntrinsicConstants_Serialize = 0x20000,
XArchIntrinsicConstants_Avx10v1 = 0x40000,
XArchIntrinsicConstants_Avx10v1_V512 = 0x80000,
XArchIntrinsicConstants_Evex = 0x100000,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need the Evex bit here? It is implied by other bits.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Evex bit is intentional and a bit special (will work on getting an explanation of it added to the docs).

The consideration is that with Avx10v1 we have two different ways to say that certain 128-bit or 256-bit EVEX encoded instructions are supported. These instructions having been made available originally as part of AVX512VL (which requires 512-bit support) but now also being available in AVX10v1 (which does not require 512-bit support).

So, if code is using say Avx512F.VL.* (and never uses say Avx512F.*), then what we want is to actually track this as requiring the EVEX bit and not as requiring Avx10v1 or Avx512 specifically, since it will work provided that either of these two ISAs are supported. This allows the JIT to track and report the actual requirements for certain instructions more cleanly (we only need to check the EVEX bit for such instructions) and allows NAOT images to be generated that don't fail if you're code was written using one but was generated using another.

If we had known about Avx10v1 prior to shipping Avx512 support, I imagine we would've exposed the managed API surface differently. We probably would've intentionally designed the surface area a bit differently (even with it not matching hardware). In particular, the actual setup we have is basically the following, where AvxF+BW+CD+DQ and below that have a nested V512 type representing the 512-bit extensions and where we would've come up with an ISA name to represent AvxF+BW+CD+DQ (possibly just Evex, maybe something different):

graph TD;
    Sse2-->Aes;
    Vex-->Avx;
    Avx-->Avx2;
    Evex-->AvxF+BW+CD+DQ;
    AvxF+BW+CD+DQ-->AvxVbmi;
    AvxF+BW+CD+DQ-->AvxIfma;
    AvxF+BW+CD+DQ-->AvxBf16;
    AvxF+BW+CD+DQ-->AvxBitalg;
    AvxF+BW+CD+DQ-->AvxFp16;
    AvxF+BW+CD+DQ-->AvxVbmi2;
    AvxF+BW+CD+DQ-->AvxVpopcntdq;
    AvxVbmi-->Avx10v1;
    AvxIfma-->Avx10v1;
    AvxBf16-->Avx10v1;
    AvxBitalg-->Avx10v1;
    AvxFp16-->Avx10v1;
    AvxVbmi2-->Avx10v1;
    AvxVpopcntdq-->Avx10v1;
    Avx2-->AvxVnni;
    Vex-->Bmi1;
    Vex-->Bmi2;
    Avx2-->Evex;
    Fma-->Evex;
    Avx-->Fma;
    X86Base-->Lzcnt
    Sse2-->Pclmulqdq
    Sse42-->Popcnt
    X86Base-->Sse;
    Sse-->Sse2;
    Sse2-->Sse3;
    Sse3-->Ssse3;
    Ssse3-->Sse41;
    Sse41-->Sse42;
    Sse42-->Vex;
Loading

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine for the JIT to have internal methods that return true when it is fine to use Vex or Evex instruction encodings. I do not see why the Evex bit has to leak outside the JIT all the way to the minipal. The Vex bit does not leak all the way to the minipal either.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The consideration is that for EVEX it's not internal to the JIT.

Anytime the JIT does a compExactlyDependsOn(InstructionSet_*) or compOpportunisticallyDependsOn(InstructionSet_*) query, it gets reported back to the VM. The different VMs handle it as follows:

  • For JIT this reporting is a no-op
  • For R2R this reporting exactly tracks the InstructionSet_* queried and whether it was "exactly" or "opportunistic"
    • This is where the cleanup of InstructionSet_* would be beneficial, because we have lots more flags than actually needed
  • For AOT this reporting gets compressed back to XArchIntrinsicConstants which gets queried for an exact match on AOT startup

So, we need the JIT query for EVEX usage to map back to InstructionSet_EVEX for R2R and XArchIntrinsicConstants_EVEX for AOT that way if the code only uses Avx512F.VL.Abs, it can work when Avx10v1 without V512 support exists. Otherwise we'd need it to report as Avx10v1 or Avx512, both of which cut out hardware that would otherwise work with the binary. We only query the more exact ISA when it isn't part of that synthetic EVEX baseline and thus needs to restrict the target.

For VEX there isn't any need, because the hierarchy is logical and makes sense. There is no historical split where two separate ISAs give access to the same underlying instructions.

Copy link
Member

@jkotas jkotas Jun 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, got it.

EVEX is low-level instruction encoding. I think it would be best to confine it to the JIT. The mapping between EVEX and the ISAs is a bit of mental leap outside the JIT.

Would it be better to have bits for Avx512_VLOnly virtual ISA that is the common subset of Avx512 and Avx10v1? I assume that all Avx512*.VL ISAs are supported in Avx10v1. Is that correct?

if the code only uses Avx512F.VL.Abs, it can work when Avx10v1 without V512 support exists.

Is the user code going to look like this:

if (Avx512F.VL.IsSupported || Avx10v1.IsSupported)
{
    ... code that uses Avx512F.VL.Abs ...
}

Or are we going to make Avx512F.VL.IsSupported return true for Avx10v1 so that Avx10v1.IsSupported part of the condition is not needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, just as an idea, we could call it just Avx10 or Avx10v0 (representing the baseline common to Avx512 and Avx10v1)? Alternatively we could call it just AvxF_BW_CD_DQ?

I have looked around how other projects are solving this naming problem. LLVM seems to use EVEX256 name for the Avx10v1 and Avx512 intersection. I guess it is how we ended up with EVEX here. I like the 256 suffix. It helps with clarifying that the virtual ISA represents the non-512 intersection.

On a slightly related point, I believe it would it be a binary breaking change to change from an abstract class to an interface (i.e. change public abstract class Avx to instead be public interface Avx), correct?

It should not be a breaking change for typical uses, but as you have said there are likely cases that it would break.

I am not sure how well the ISAs modeling via inheritance hierarchy is going to work going forward. I would not be surprised if we see more places where new hardware revisions carve out subsets of existing instructions and give it a new ISA name. It just happened with Avx512/Avx10v1, Arm64 streaming mode is heading that way, and RISCV may end up there too given its development process. We may want to think about different approach to accommodate these trends.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it is how we ended up with EVEX here. I like the 256 suffix. It helps with clarifying that the virtual ISA represents the non-512 intersection.

👍

I am not sure how well the ISAs modeling via inheritance hierarchy is going to work going forward.

I agree we should think some more, but overall I do believe this is still the correct model. We've seen quite a bit of praise on how we've modeled everything as compared to how C/C++ or Rust do it. The overall separation of xplat APIs vs platform specific APIs and then the division of the latter into clear classes that are based on the underlying hardware checks has helped make discoverability, approachability, and usability of the feature in .NET one of the best.

It just happened with Avx512/Avx10v1

This one I think was more x64 having tried something new with Avx512 and then finding out that it didn't work in practice, so they are now fixing it by reconverging everything with Avx10v1. The architectural documents around Avx10v1 then cover how they plan to handle this moving forward and that it's effectively going back to the model we had seen previously with Sse and Avx (where you have a baseline encoding and then incremental functionality extended on top of this in a version like manner).

Arm64 streaming mode is heading that way

This is a case where there is still a clear hierarchy given we have SveStreaming and SveNonStreaming as the split.

Even if we had shipped class Sve prior to streaming support becoming defined, we could have inserted a class SveStreaming and had Sve derive from it, moving relevant methods down (since that is safe to do for classes).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-- What I'll plan on doing given the conversation so far is getting this in as is first as it fixes some bugs without the refactoring being too complex and puts us in a good position for .NET 9.

I'll then follow up with a PR that does a broader cleanup of the data used by R2R and changes EVEX to EVEX256 (which will itself require some refactoring in the JIT).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we had shipped class Sve prior to streaming support becoming defined, we could have inserted a class SveStreaming and had Sve derive from it, moving relevant methods down (since that is safe to do for classes).

I think this hierarchy would look weird since the streaming and non-streaming support do not form a hierarchy. You can have one without the other.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

streaming and non-streaming support do not form a hierarchy

I think this is more just the names aren't the "right" ones.

What I had meant is that you have two sets of instructions. That is, there are SVE instructions that are available both in regular SVE and in SVE streaming mode. Then there are SVE instructions which are not available in streaming mode unless another architectural bit is set.

@jkotas
Copy link
Member

jkotas commented Jun 15, 2024

@jkotas for the VM side changes.

LGTM

@tannergooding tannergooding merged commit cb5832d into dotnet:main Jun 18, 2024
184 of 186 checks passed
@tannergooding tannergooding deleted the avx10v1 branch June 18, 2024 03:19
@github-actions github-actions bot locked and limited conversation to collaborators Jul 18, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI new-api-needs-documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants