
Draft for Min/Max intrinsics xarch (#65625) #65700

Closed
wants to merge 2 commits

Conversation

SkiFoD
Contributor

@SkiFoD SkiFoD commented Feb 22, 2022

I made the changes using #65584 as a reference. It works with HelloWorld-style code, but I'm stuck building it via build.cmd. It doesn't compile and I can't figure out why.
I get these errors:

Assertion failed 'IsThreeOperandAVXInstruction(ins)' in 'System.MathF:Min(float,float):float' during 'Generate code' (IL size 13)
Assertion failed 'IsThreeOperandAVXInstruction(ins)' in 'System.Single:System.INumber<System.Single>.Min(float,float):float' during 'Generate code' (IL size 8)

@EgorBo Could you please help me to figure out what I implemented wrong?

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Feb 22, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed community-contribution Indicates that the PR has been added by a community member labels Feb 22, 2022
@ghost

ghost commented Feb 22, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.


@EgorBo
Member

EgorBo commented Feb 22, 2022

@EgorBo Could you please help me to figure out what I implemented wrong?

maxss and vmaxss are basically different instructions (vmaxss is the VEX encoding); only vmaxss can be used in the R_R_R (three-operand) form, and it's not available without AVX


case NI_System_Math_Max:
    genConsumeOperands(treeNode->AsOp());
    GetEmitter()->emitIns_R_R_R(INS_maxss, emitActualTypeSize(treeNode), treeNode->GetRegNum(),
Member

You want to use emitIns_SIMD_R_R_R which will take care of VEX vs non-VEX differences

Member

I'd actually also expect we want to handle things like containment/etc in which case lowering to a HWIntrinsic might be simpler overall

Member

agree, manual containment management is quite verbose

Contributor Author

@tannergooding Thank you for the advice. I'm sorry, but could you tell me more about containment?

Member

Normally each node is processed by the JIT and will individually get register assignments and code generated for it. Containment is the term used to mean that a node should not be handled that way and instead is handled directly by its parent node.

So by default, if you have something like GT_HWINTRINSIC NI_System_Math_Max and it has a child node like GT_IND float you would end up generating

movss reg1, [mem]
maxss reg2, reg1

However, if you mark the GT_IND as "contained", then you can handle these together and emit just maxss reg1, [mem].


There is a lot of existing logic for supporting containment (https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lowerxarch.cpp#L6030) and so it would likely be better to lower or import NI_System_Math_Max to be the relevant HWIntrinsic node(s) instead so that much of this handling implicitly lights up, rather than having to duplicate it just for the NI_System_Math_Max node.

If you were to do it the other way, by just handling NI_System_Math_Max, then it would likely be done around here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lowerxarch.cpp#L5371-L5387

@EgorBo
Member

EgorBo commented Mar 15, 2022

I assume the failures are related, e.g.:

Process terminated. Assertion failed.
   at System.Globalization.CalendricalCalculationsHelper.PersianNewYearOnOrBefore(Int64 numberOfDays) in /_/src/libraries/System.Private.CoreLib/src/System/Globalization/CalendricalCalculationsHelper.cs:line 396

https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-65700-merge-f505ab65165146859e/System.Runtime.Tests/1/console.e40913b6.log?sv=2019-07-07&se=2022-03-21T10%3A27%3A05Z&sr=c&sp=rl&sig=8MkYay3CV4QZ0iVUwk2q8n54366%2F0ilLMZ0UWu%2FbUms%3D

@SkiFoD
Contributor Author

SkiFoD commented Mar 15, 2022

@EgorBo Since I'm still learning to work with the project, I have a question. Let's say you have a piece of code failing (like the assertion in the log). How do you run the CLR against that piece of code to debug what is wrong? I believe there must be a better way than just copy-pasting the code into a simple "Hello world" project to reproduce the behaviour. Thank you in advance.

@SkiFoD
Contributor Author

SkiFoD commented Mar 31, 2022

@tannergooding Thank you for the explanation. Let's say I want to implement importing. I did some investigating, and HWIntrinsics are used if a type belongs to System.Numerics.Vector or System.Numerics. Is it OK to implement the import somewhere here

else if (strcmp(methodName, "Max") == 0)
{
    result = NI_System_Math_Max;
}
else if (strcmp(methodName, "Min") == 0)
{
    result = NI_System_Math_Min;
}

when the FEATURE_HW_INTRINSICS directive is turned on? The second question is: I can't figure out which ISA to use for converting NI_System_Math_Max to a relevant HW node. Could you give me a hand, please?

@ghost

ghost commented Apr 15, 2022

This pull request has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

@tannergooding
Member

The second question is: I can't figure out which ISA to use for converting NI_System_Math_Max to a relevant HW node. Could you give me a hand, please?

Sorry, I had missed this. You can generally see the two approaches if you look at NI_System_Math_Round and NI_System_Math_FusedMultiplyAdd. You'll see that Compiler::IsTargetIntrinsic reports if the given math call can be handled via special logic here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/importer.cpp#L20682-L20745

Compiler::impIntrinsic then specially handles importing Math_FusedMultiplyAdd directly as the relevant nodes here:

#ifdef FEATURE_HW_INTRINSICS
    case NI_System_Math_FusedMultiplyAdd:
    {
#ifdef TARGET_XARCH
        if (compExactlyDependsOn(InstructionSet_FMA))
        {
            assert(varTypeIsFloating(callType));

            // We are constructing a chain of intrinsics similar to:
            //    return FMA.MultiplyAddScalar(
            //        Vector128.CreateScalarUnsafe(x),
            //        Vector128.CreateScalarUnsafe(y),
            //        Vector128.CreateScalarUnsafe(z)
            //    ).ToScalar();

            GenTree* op3 = gtNewSimdHWIntrinsicNode(TYP_SIMD16, impPopStack().val,
                                                    NI_Vector128_CreateScalarUnsafe, callJitType, 16);
            GenTree* op2 = gtNewSimdHWIntrinsicNode(TYP_SIMD16, impPopStack().val,
                                                    NI_Vector128_CreateScalarUnsafe, callJitType, 16);
            GenTree* op1 = gtNewSimdHWIntrinsicNode(TYP_SIMD16, impPopStack().val,
                                                    NI_Vector128_CreateScalarUnsafe, callJitType, 16);
            GenTree* res =
                gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, op2, op3, NI_FMA_MultiplyAddScalar, callJitType, 16);

            retNode = gtNewSimdHWIntrinsicNode(callType, res, NI_Vector128_ToScalar, callJitType, 16);
            break;
        }
#elif defined(TARGET_ARM64)
        if (compExactlyDependsOn(InstructionSet_AdvSimd))
        {
            assert(varTypeIsFloating(callType));

            // We are constructing a chain of intrinsics similar to:
            //    return AdvSimd.FusedMultiplyAddScalar(
            //        Vector64.Create{ScalarUnsafe}(z),
            //        Vector64.Create{ScalarUnsafe}(y),
            //        Vector64.Create{ScalarUnsafe}(x)
            //    ).ToScalar();

            NamedIntrinsic createVector64 =
                (callType == TYP_DOUBLE) ? NI_Vector64_Create : NI_Vector64_CreateScalarUnsafe;

            constexpr unsigned int simdSize = 8;

            GenTree* op3 =
                gtNewSimdHWIntrinsicNode(TYP_SIMD8, impPopStack().val, createVector64, callJitType, simdSize);
            GenTree* op2 =
                gtNewSimdHWIntrinsicNode(TYP_SIMD8, impPopStack().val, createVector64, callJitType, simdSize);
            GenTree* op1 =
                gtNewSimdHWIntrinsicNode(TYP_SIMD8, impPopStack().val, createVector64, callJitType, simdSize);

            // Note that AdvSimd.FusedMultiplyAddScalar(op1,op2,op3) corresponds to op1 + op2 * op3
            // while Math{F}.FusedMultiplyAddScalar(op1,op2,op3) corresponds to op1 * op2 + op3
            retNode = gtNewSimdHWIntrinsicNode(TYP_SIMD8, op3, op2, op1, NI_AdvSimd_FusedMultiplyAddScalar,
                                               callJitType, simdSize);
            retNode = gtNewSimdHWIntrinsicNode(callType, retNode, NI_Vector64_ToScalar, callJitType, simdSize);
            break;
        }
#endif

        // TODO-CQ-XArch: Ideally we would create a GT_INTRINSIC node for fma, however, that currently
        // requires more extensive changes to valuenum to support methods with 3 operands

        // We want to generate a GT_INTRINSIC node in the case the call can't be treated as
        // a target intrinsic so that we can still benefit from CSE and constant folding.

        break;
    }
#endif // FEATURE_HW_INTRINSICS

It does this because FusedMultiplyAdd is a slightly more complex sequence of nodes.

Alternatively, many of the other Math_* intrinsics defer to impMathIntrinsic here:

case NI_System_Math_Abs:
case NI_System_Math_Acos:
case NI_System_Math_Acosh:
case NI_System_Math_Asin:
case NI_System_Math_Asinh:
case NI_System_Math_Atan:
case NI_System_Math_Atanh:
case NI_System_Math_Atan2:
case NI_System_Math_Cbrt:
case NI_System_Math_Ceiling:
case NI_System_Math_Cos:
case NI_System_Math_Cosh:
case NI_System_Math_Exp:
case NI_System_Math_Floor:
case NI_System_Math_FMod:
case NI_System_Math_ILogB:
case NI_System_Math_Log:
case NI_System_Math_Log2:
case NI_System_Math_Log10:
#ifdef TARGET_ARM64
// ARM64 has fmax/fmin which are IEEE754:2019 minimum/maximum compatible
// TODO-XARCH-CQ: Enable this for XARCH when one of the arguments is a constant
// so we can then emit maxss/minss and avoid NaN/-0.0 handling
case NI_System_Math_Max:
case NI_System_Math_Min:
#endif
case NI_System_Math_Pow:
case NI_System_Math_Round:
case NI_System_Math_Sin:
case NI_System_Math_Sinh:
case NI_System_Math_Sqrt:
case NI_System_Math_Tan:
case NI_System_Math_Tanh:
case NI_System_Math_Truncate:
{
    retNode = impMathIntrinsic(method, sig, callType, ni, tailCall);
    break;
}

Those get handling in a few phases throughout the JIT, including containment handling here:
void Lowering::ContainCheckIntrinsic(GenTreeOp* node)
{
    assert(node->OperIs(GT_INTRINSIC));

    NamedIntrinsic intrinsicName = node->AsIntrinsic()->gtIntrinsicName;

    if ((intrinsicName == NI_System_Math_Ceiling) || (intrinsicName == NI_System_Math_Floor) ||
        (intrinsicName == NI_System_Math_Truncate) || (intrinsicName == NI_System_Math_Round) ||
        (intrinsicName == NI_System_Math_Sqrt))
    {
        GenTree* op1 = node->gtGetOp1();

        if ((IsContainableMemoryOp(op1) && IsSafeToContainMem(node, op1)) || op1->IsCnsNonZeroFltOrDbl())
        {
            MakeSrcContained(node, op1);
        }
        else
        {
            // Mark the operand as reg optional since codegen can still
            // generate code if op1 is on stack.
            op1->SetRegOptional();
        }
    }
}

And then code generation here:
void CodeGen::genIntrinsic(GenTree* treeNode)
{
    // Handle intrinsics that can be implemented by target-specific instructions
    switch (treeNode->AsIntrinsic()->gtIntrinsicName)
    {
        case NI_System_Math_Abs:
            genSSE2BitwiseOp(treeNode);
            break;

        case NI_System_Math_Ceiling:
        case NI_System_Math_Floor:
        case NI_System_Math_Truncate:
        case NI_System_Math_Round:
            genSSE41RoundOp(treeNode->AsOp());
            break;

        case NI_System_Math_Sqrt:
        {
            // Both operand and its result must be of the same floating point type.
            GenTree* srcNode = treeNode->AsOp()->gtOp1;
            assert(varTypeIsFloating(srcNode));
            assert(srcNode->TypeGet() == treeNode->TypeGet());

            genConsumeOperands(treeNode->AsOp());

            const instruction ins = (treeNode->TypeGet() == TYP_FLOAT) ? INS_sqrtss : INS_sqrtsd;
            GetEmitter()->emitInsBinary(ins, emitTypeSize(treeNode), treeNode, srcNode);
            break;
        }

        default:
            assert(!"genIntrinsic: Unsupported intrinsic");
            unreached();
    }

    genProduceReg(treeNode);
}


@ghost ghost removed the no-recent-activity label Apr 15, 2022
@ghost ghost added the no-recent-activity label Apr 30, 2022
@ghost

ghost commented Apr 30, 2022

This pull request has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

@ghost

ghost commented May 14, 2022

This pull request will now be closed since it had been marked no-recent-activity but received no further activity in the past 14 days. It is still possible to reopen or comment on the pull request, but please note that it will be locked if it remains inactive for another 30 days.

@ghost ghost closed this May 14, 2022
@SkiFoD
Contributor Author

SkiFoD commented May 16, 2022

@EgorBo @tannergooding I'm sorry, I've been busy recently and the PR was closed. Can I reopen it?

@tannergooding
Member

Yep, feel free to reopen it whenever you have time again

@ghost ghost locked as resolved and limited conversation to collaborators Jun 15, 2022