[LLVM] Add `llvm.experimental.vector.compress` intrinsic #92289

lawben · 2024-05-15T16:14:03Z

This PR adds a new vector intrinsic @llvm.experimental.vector.compress to "compress" data within a vector based on a selection mask, i.e., it moves all selected values (i.e., where mask[i] == 1) to consecutive lanes in the result vector. A passthru vector can be provided, from which remaining lanes are filled.

The main reason for this is that the existing @llvm.masked.compressstore has very strong constraints in that it can only write values that were selected, resulting in guard branches for all targets except AVX-512 (and even there the AMD implementation is very slow). More instruction sets support "compress" logic, but only within registers. So to store the values, an additional store is needed. But this combination is likely significantly faster on many target as it avoids branches.

In follow up PRs, my plan is to add target-specific lowerings for x86, SVE, and possibly RISCV. I also want to combine this with a store instruction, as this is probably a common case and we can avoid some memory writes in that case.

See discussion in forum for initial discussion on the design.

llvmbot · 2024-05-15T16:14:37Z

@llvm/pr-subscribers-llvm-selectiondag
@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-aarch64

Author: Lawrence Benson (lawben)

Changes

This PR adds a new vector intrinsic @llvm.masked.compress to "compress" data within a vector based on a selection mask, i.e., it moves all selected values (i.e., where mask[i] == 1) to consecutive lanes in the result vector.

The main reason for this is that the existing @llvm.masked.compressstore has very strong constraints in that it can only write values that were selected, resulting in guard branches for all targets except AVX-512 (and even there the AMD implementation is very slow). More instruction sets support "compress" logic, but only within registers. So to store the values, an additional store is needed. But this combination is likely significantly faster on many target as it avoids branches.

There are some open question that I still have:

Am I missing something significant to introduce a new intrinsic?
Is there a place to put the duplicated code between the DagTypeLegalizer and the VectorLegalizer?
There are some legalizer cases that I have not implemented yet, as I'm not sure if they are needed. I tried to copy what similar intrinsics do, but it is not fully clear to me when which legalization step is needed, e.g., SplitVecOperand or WidenResultVector. Maybe someone could point me in the right direction here, as none of my test cases fall into these paths.

In follow up PRs, my plan is to add target-specific lowerings for x86, SVE, and possible RISCV. I also want to combine this with a store instruction, as this is probably a common case and we can avoid some memory writes in that case.

See discussion in forum for initial discussion on the design.

Patch is 30.65 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/92289.diff

12 Files Affected:

(modified) llvm/docs/LangRef.rst (+79)
(modified) llvm/include/llvm/CodeGen/ISDOpcodes.h (+5)
(modified) llvm/include/llvm/IR/Intrinsics.td (+5)
(modified) llvm/include/llvm/Target/TargetSelectionDAG.td (+6)
(modified) llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp (+20)
(modified) llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h (+3)
(modified) llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp (+47)
(modified) llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp (+60)
(modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp (+7)
(modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp (+1)
(modified) llvm/lib/CodeGen/TargetLoweringBase.cpp (+3)
(added) llvm/test/CodeGen/AArch64/masked-compress.ll (+299)

diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 06809f8bf445d..773893b83a5d7 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -24975,6 +24975,85 @@ The '``llvm.masked.compressstore``' intrinsic is designed for compressing data i
 
 Other targets may support this intrinsic differently, for example, by lowering it into a sequence of branches that guard scalar store operations.
 
+Masked Vector Compress Intrinsic
+--------------------------------
+
+LLVM provides an intrinsic for compressing data within a vector based on a selection mask.
+Semantically, this is similar to :ref:``@llvm.masked.compressstore <_int_compressstore>`` but with weaker assumptions
+and without storing the results to memory, i.e., the data remains in the vector.
+
+.. _int_masked_compress:
+
+'``llvm.masked.compress.*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic. A number of scalar values of integer, floating point or pointer data type are collected
+from an input vector and placed adjacently within the result vector. A mask defines which elements to collect from the vector.
+
+:: code-block:: llvm
+
+      declare <8 x i32> @llvm.masked.compress.v8i32(<8 x i32> <value>, <8 x i1> <mask>)
+      declare <16 x float> @llvm.masked.compress.v16f32(<16 x float> <value>, <16 x i1> <mask>)
+
+Overview:
+"""""""""
+
+Selects elements from input vector '``value``' according to the '``mask``'.
+All selected elements are written into adjacent lanes in the result vector, from lower to higher.
+The mask holds a bit for each vector lane, and is used to select elements to be kept.
+The number of valid lanes is equal to the number of active bits in the mask.
+The main difference to :ref:`llvm.masked.compressstore <_int_compressstore>` is that the remainder of the vector may
+contain undefined values.
+This allows for branchless code and better optimization for all targets that do not support the explicit semantics of
+:ref:`llvm.masked.compressstore <_int_compressstore>`.
+The result vector can be written with a similar effect, as all the selected values are at the lower positions of the
+vector, but without requiring branches to avoid writes where the mask is 0.
+
+
+Arguments:
+""""""""""
+
+The first operand is the input vector, from which elements are selected.
+The second operand is the mask, a vector of boolean values.
+The mask and the input vector must have the same number of vector elements.
+
+Semantics:
+""""""""""
+
+The '``llvm.masked.compress``' intrinsic is designed for compressing data within a vector, i.e., ideally within a register.
+It allows to collect elements from possibly non-adjacent lanes of a vector and place them contiguously in the result vector in one IR operation.
+It is useful for targets all that support compress operations (e.g., AVX-512, ARM SVE, RISCV V), which more instruction
+sets do than explicit compressstore, i.e., ``llvm.masked.compress`` may yield better performance on more targets than
+``llvm.masked.compressstore`` due to weaker constraints.
+This intrinsic allows vectorizing loops with cross-iteration dependencies like in the following example:
+
+.. code-block:: c
+
+    // Consecutively store selected values with branchless code.
+    int *in, *out; bool *mask; int pos = 0;
+    for (int i = 0; i < size; ++i) {
+      out[pos] = in[i];
+      // if mask[i] == 0, the current value is overwritten in the next iteration.
+      pos += mask[i];
+    }
+
+
+.. code-block:: llvm
+
+    ; Load elements from `in`.
+    %vec = load <4 x i32>, ptr %inPtr
+    %mask = load <4 x i1>, ptr %maskPtr
+    %compressed = call <4 x i32> @llvm.masked.compress.v4i32(<4 x i32> %vec, <4 x i1> %mask)
+    store <4 x i32> %compressed, ptr %outPtr
+
+    ; %outPtr should be increased in each iteration by the number of '1's in the mask.
+    %iMask = bitcast <4 x i1> %mask to i4
+    %popcnt = call i4 @llvm.ctpop.i4(i4 %iMask)
+    %zextPopcnt = zext i4 %popcnt to i64
+    %nextOut = add i64 %outPos, %zextPopcnt
+
 
 Memory Use Markers
 ------------------
diff --git a/llvm/include/llvm/CodeGen/ISDOpcodes.h b/llvm/include/llvm/CodeGen/ISDOpcodes.h
index d8af97957e48e..71dfd8b43b710 100644
--- a/llvm/include/llvm/CodeGen/ISDOpcodes.h
+++ b/llvm/include/llvm/CodeGen/ISDOpcodes.h
@@ -1294,6 +1294,11 @@ enum NodeType {
   MLOAD,
   MSTORE,
 
+  // Masked compress - consecutively place vector elements based on mask
+  // e.g., vec = {A, B, C, D} and mask = 1010
+  //         --> {A, C, ?, ?} where ? is undefined
+  MCOMPRESS,
+
   // Masked gather and scatter - load and store operations for a vector of
   // random addresses with additional mask operand that prevents memory
   // accesses to the masked-off lanes.
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index f1c7d950f9275..e924d28956b0a 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -2362,6 +2362,11 @@ def int_masked_compressstore:
             [IntrWriteMem, IntrArgMemOnly, IntrWillReturn,
              NoCapture<ArgIndex<1>>]>;
 
+def int_masked_compress:
+    DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+              [llvm_anyvector_ty, LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],
+              [IntrNoMem, IntrWillReturn]>;
+
 // Test whether a pointer is associated with a type metadata identifier.
 def int_type_test : DefaultAttrsIntrinsic<[llvm_i1_ty], [llvm_ptr_ty, llvm_metadata_ty],
                               [IntrNoMem, IntrWillReturn, IntrSpeculatable]>;
diff --git a/llvm/include/llvm/Target/TargetSelectionDAG.td b/llvm/include/llvm/Target/TargetSelectionDAG.td
index 1684b424e3b44..061330fb4e08f 100644
--- a/llvm/include/llvm/Target/TargetSelectionDAG.td
+++ b/llvm/include/llvm/Target/TargetSelectionDAG.td
@@ -266,6 +266,10 @@ def SDTMaskedScatter : SDTypeProfile<0, 4, [
   SDTCisSameNumEltsAs<0, 1>, SDTCisSameNumEltsAs<0, 3>
 ]>;
 
+def SDTMaskedCompress : SDTypeProfile<1, 2, [
+  SDTCisVec<0>, SDTCisVec<1>, SDTCisSameNumEltsAs<0, 1>,
+]>;
+
 def SDTVecShuffle : SDTypeProfile<1, 2, [
   SDTCisSameAs<0, 1>, SDTCisSameAs<1, 2>
 ]>;
@@ -731,6 +735,8 @@ def masked_gather : SDNode<"ISD::MGATHER", SDTMaskedGather,
 def masked_scatter : SDNode<"ISD::MSCATTER", SDTMaskedScatter,
                             [SDNPHasChain, SDNPMayStore, SDNPMemOperand]>;
 
+def masked_compress : SDNode<"ISD::MCOMPRESS", SDTMaskedCompress>;
+
 // Do not use ld, st directly. Use load, extload, sextload, zextload, store,
 // and truncst (see below).
 def ld         : SDNode<"ISD::LOAD"       , SDTLoad,
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
index 0aa36deda79dc..80f645b433cbe 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
@@ -87,6 +87,7 @@ void DAGTypeLegalizer::PromoteIntegerResult(SDNode *N, unsigned ResNo) {
     break;
   case ISD::MGATHER:     Res = PromoteIntRes_MGATHER(cast<MaskedGatherSDNode>(N));
     break;
+  case ISD::MCOMPRESS:   Res = PromoteIntRes_MCOMPRESS(N); break;
   case ISD::SELECT:
   case ISD::VSELECT:
   case ISD::VP_SELECT:
@@ -948,6 +949,11 @@ SDValue DAGTypeLegalizer::PromoteIntRes_MGATHER(MaskedGatherSDNode *N) {
   return Res;
 }
 
+SDValue DAGTypeLegalizer::PromoteIntRes_MCOMPRESS(SDNode *N) {
+  SDValue Vec = GetPromotedInteger(N->getOperand(0));
+  return DAG.getNode(ISD::MCOMPRESS, SDLoc(N), Vec.getValueType(), Vec, N->getOperand(1));
+}
+
 /// Promote the overflow flag of an overflowing arithmetic node.
 SDValue DAGTypeLegalizer::PromoteIntRes_Overflow(SDNode *N) {
   // Change the return type of the boolean result while obeying
@@ -1855,6 +1861,7 @@ bool DAGTypeLegalizer::PromoteIntegerOperand(SDNode *N, unsigned OpNo) {
                                                  OpNo); break;
   case ISD::MSCATTER: Res = PromoteIntOp_MSCATTER(cast<MaskedScatterSDNode>(N),
                                                   OpNo); break;
+  case ISD::MCOMPRESS: Res = PromoteIntOp_MCOMPRESS(N, OpNo); break;
   case ISD::VP_TRUNCATE:
   case ISD::TRUNCATE:     Res = PromoteIntOp_TRUNCATE(N); break;
   case ISD::BF16_TO_FP:
@@ -2335,6 +2342,19 @@ SDValue DAGTypeLegalizer::PromoteIntOp_MSCATTER(MaskedScatterSDNode *N,
                               N->getIndexType(), TruncateStore);
 }
 
+SDValue DAGTypeLegalizer::PromoteIntOp_MCOMPRESS(SDNode *N, unsigned OpNo) {
+  SDValue Vec = N->getOperand(0);
+  SDValue Mask = N->getOperand(1);
+  EVT VT = Vec.getValueType();
+
+  if (OpNo == 0)
+    Vec = GetPromotedInteger(Vec);
+  else
+    Mask = PromoteTargetBoolean(Mask, VT);
+
+  return DAG.getNode(ISD::MCOMPRESS, SDLoc(N), VT, Vec, Mask);
+}
+
 SDValue DAGTypeLegalizer::PromoteIntOp_TRUNCATE(SDNode *N) {
   SDValue Op = GetPromotedInteger(N->getOperand(0));
   if (N->getOpcode() == ISD::VP_TRUNCATE)
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h b/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h
index d925089d5689f..5fb14757f8991 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h
@@ -321,6 +321,7 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
   SDValue PromoteIntRes_LOAD(LoadSDNode *N);
   SDValue PromoteIntRes_MLOAD(MaskedLoadSDNode *N);
   SDValue PromoteIntRes_MGATHER(MaskedGatherSDNode *N);
+  SDValue PromoteIntRes_MCOMPRESS(SDNode *N);
   SDValue PromoteIntRes_Overflow(SDNode *N);
   SDValue PromoteIntRes_FFREXP(SDNode *N);
   SDValue PromoteIntRes_SADDSUBO(SDNode *N, unsigned ResNo);
@@ -390,6 +391,7 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
   SDValue PromoteIntOp_MLOAD(MaskedLoadSDNode *N, unsigned OpNo);
   SDValue PromoteIntOp_MSCATTER(MaskedScatterSDNode *N, unsigned OpNo);
   SDValue PromoteIntOp_MGATHER(MaskedGatherSDNode *N, unsigned OpNo);
+  SDValue PromoteIntOp_MCOMPRESS(SDNode *N, unsigned OpNo);
   SDValue PromoteIntOp_FRAMERETURNADDR(SDNode *N);
   SDValue PromoteIntOp_FIX(SDNode *N);
   SDValue PromoteIntOp_ExpOp(SDNode *N);
@@ -882,6 +884,7 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
   void SplitVecRes_MLOAD(MaskedLoadSDNode *MLD, SDValue &Lo, SDValue &Hi);
   void SplitVecRes_Gather(MemSDNode *VPGT, SDValue &Lo, SDValue &Hi,
                           bool SplitSETCC = false);
+  void SplitVecRes_MCOMPRESS(SDNode *N, SDValue &Lo, SDValue &Hi);
   void SplitVecRes_ScalarOp(SDNode *N, SDValue &Lo, SDValue &Hi);
   void SplitVecRes_STEP_VECTOR(SDNode *N, SDValue &Lo, SDValue &Hi);
   void SplitVecRes_SETCC(SDNode *N, SDValue &Lo, SDValue &Hi);
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
index 423df9ae6b2a5..ca32db26e511c 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
@@ -134,6 +134,7 @@ class VectorLegalizer {
   SDValue ExpandVSELECT(SDNode *Node);
   SDValue ExpandVP_SELECT(SDNode *Node);
   SDValue ExpandVP_MERGE(SDNode *Node);
+  SDValue ExpandMCOMPRESS(SDNode *Node);
   SDValue ExpandVP_REM(SDNode *Node);
   SDValue ExpandSELECT(SDNode *Node);
   std::pair<SDValue, SDValue> ExpandLoad(SDNode *N);
@@ -442,6 +443,7 @@ SDValue VectorLegalizer::LegalizeOp(SDValue Op) {
   case ISD::FP_TO_SINT_SAT:
   case ISD::FP_TO_UINT_SAT:
   case ISD::MGATHER:
+  case ISD::MCOMPRESS:
     Action = TLI.getOperationAction(Node->getOpcode(), Node->getValueType(0));
     break;
   case ISD::SMULFIX:
@@ -1101,6 +1103,9 @@ void VectorLegalizer::Expand(SDNode *Node, SmallVectorImpl<SDValue> &Results) {
       return;
 
     break;
+  case ISD::MCOMPRESS:
+    Results.push_back(ExpandMCOMPRESS(Node));
+    return;
   }
 
   SDValue Unrolled = DAG.UnrollVectorOp(Node);
@@ -1505,6 +1510,48 @@ SDValue VectorLegalizer::ExpandVP_MERGE(SDNode *Node) {
   return DAG.getSelect(DL, Node->getValueType(0), FullMask, Op1, Op2);
 }
 
+SDValue VectorLegalizer::ExpandMCOMPRESS(SDNode *Node) {
+  SDLoc DL(Node);
+  SDValue Vec = Node->getOperand(0);
+  SDValue Mask = Node->getOperand(1);
+
+  EVT VecVT = Vec.getValueType();
+  EVT ScalarVT = VecVT.getScalarType();
+  EVT MaskScalarVT = Mask.getValueType().getScalarType();
+
+  SDValue StackPtr = DAG.CreateStackTemporary(VecVT);
+  SDValue Chain = DAG.getEntryNode();
+  SDValue OutPos = DAG.getConstant(0, DL, MVT::i32);
+
+  unsigned NumElms = VecVT.getVectorNumElements();
+  // Skip element zero, as we always copy this to the output vector.
+  for (unsigned I = 0; I < NumElms; I++) {
+    SDValue Idx = DAG.getVectorIdxConstant(I, DL);
+
+    SDValue ValI =
+        DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, ScalarVT, Vec, Idx);
+    SDValue OutPtr =
+        TLI.getVectorElementPointer(DAG, StackPtr, VecVT, OutPos);
+    Chain = DAG.getStore(Chain, DL, ValI, OutPtr, MachinePointerInfo::getUnknownStack(DAG.getMachineFunction()));
+
+    // Skip this for last element.
+    if (I < NumElms - 1) {
+      // Get the mask value and add it to the current output position. This
+      // either increments by 1 if MaskI is true or adds 0 otherwise.
+      SDValue MaskI =
+          DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MaskScalarVT, Mask, Idx);
+      MaskI = DAG.getNode(ISD::TRUNCATE, DL, MVT::i1, MaskI);
+      MaskI = DAG.getNode(ISD::ZERO_EXTEND, DL, MVT::i32, MaskI);
+      OutPos = DAG.getNode(ISD::ADD, DL, MVT::i32, OutPos, MaskI);
+    }
+  }
+
+  int FI = cast<FrameIndexSDNode>(StackPtr.getNode())->getIndex();
+  MachinePointerInfo PtrInfo =
+      MachinePointerInfo::getFixedStack(DAG.getMachineFunction(), FI);
+  return DAG.getLoad(VecVT, DL, Chain, StackPtr, PtrInfo);
+}
+
 SDValue VectorLegalizer::ExpandVP_REM(SDNode *Node) {
   // Implement VP_SREM/UREM in terms of VP_SDIV/VP_UDIV, VP_MUL, VP_SUB.
   EVT VT = Node->getValueType(0);
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
index cd858003cf03b..62e7febed6568 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
@@ -1058,6 +1058,9 @@ void DAGTypeLegalizer::SplitVectorResult(SDNode *N, unsigned ResNo) {
   case ISD::VP_GATHER:
     SplitVecRes_Gather(cast<MemSDNode>(N), Lo, Hi, /*SplitSETCC*/ true);
     break;
+  case ISD::MCOMPRESS:
+    SplitVecRes_MCOMPRESS(N, Lo, Hi);
+    break;
   case ISD::SETCC:
   case ISD::VP_SETCC:
     SplitVecRes_SETCC(N, Lo, Hi);
@@ -2304,6 +2307,63 @@ void DAGTypeLegalizer::SplitVecRes_Gather(MemSDNode *N, SDValue &Lo,
   ReplaceValueWith(SDValue(N, 1), Ch);
 }
 
+void DAGTypeLegalizer::SplitVecRes_MCOMPRESS(SDNode *N, SDValue &Lo,
+                                             SDValue &Hi) {
+  // This is not "trivial", as there is a dependency between the two subvectors.
+  // Depending on the number of 1s in the mask, the elements from the Hi vector
+  // need to be moved to the Lo vector. So we just perform this as one "big"
+  // operation (analogously to the default MCOMPRESS expand implementation), by
+  // writing to memory and then loading the Lo and Hi vectors from that. This
+  // gets rid of MCOMPRESS and all other operands can be legalized later.
+  SDLoc DL(N);
+  SDValue Vec = N->getOperand(0);
+  SDValue Mask = N->getOperand(1);
+
+  EVT VecVT = Vec.getValueType();
+  EVT SubVecVT = VecVT.getHalfNumVectorElementsVT(*DAG.getContext());
+  EVT ScalarVT = VecVT.getScalarType();
+  EVT MaskScalarVT = Mask.getValueType().getScalarType();
+
+  // TODO: This code is duplicated here and in LegalizeVectorOps.
+  SDValue StackPtr = DAG.CreateStackTemporary(VecVT);
+  SDValue Chain = DAG.getEntryNode();
+  SDValue OutPos = DAG.getConstant(0, DL, MVT::i32);
+
+  unsigned NumElms = VecVT.getVectorNumElements();
+  // Skip element zero, as we always copy this to the output vector.
+  for (unsigned I = 0; I < NumElms; I++) {
+    SDValue Idx = DAG.getVectorIdxConstant(I, DL);
+
+    SDValue ValI = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, ScalarVT, Vec, Idx);
+    SDValue OutPtr = TLI.getVectorElementPointer(DAG, StackPtr, VecVT, OutPos);
+    Chain = DAG.getStore(
+        Chain, DL, ValI, OutPtr,
+        MachinePointerInfo::getUnknownStack(DAG.getMachineFunction()));
+
+    // Skip this for last element.
+    if (I < NumElms - 1) {
+      // Get the mask value and add it to the current output position. This
+      // either increments by 1 if MaskI is true or adds 0 otherwise.
+      SDValue MaskI =
+          DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MaskScalarVT, Mask, Idx);
+      MaskI = DAG.getNode(ISD::TRUNCATE, DL, MVT::i1, MaskI);
+      MaskI = DAG.getNode(ISD::ZERO_EXTEND, DL, MVT::i32, MaskI);
+      OutPos = DAG.getNode(ISD::ADD, DL, MVT::i32, OutPos, MaskI);
+    }
+  }
+
+  int FI = cast<FrameIndexSDNode>(StackPtr.getNode())->getIndex();
+  MachinePointerInfo PtrInfo =
+      MachinePointerInfo::getFixedStack(DAG.getMachineFunction(), FI);
+  SDValue HiPtr = TLI.getVectorElementPointer(
+      DAG, StackPtr, VecVT, DAG.getConstant(NumElms / 2, DL, MVT::i32));
+
+  Lo = DAG.getLoad(SubVecVT, DL, Chain, StackPtr, PtrInfo);
+  Hi = DAG.getLoad(
+      SubVecVT, DL, Chain, HiPtr,
+      MachinePointerInfo::getUnknownStack(DAG.getMachineFunction()));
+}
+
 void DAGTypeLegalizer::SplitVecRes_SETCC(SDNode *N, SDValue &Lo, SDValue &Hi) {
   assert(N->getValueType(0).isVector() &&
          N->getOperand(0).getValueType().isVector() &&
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index ca352da5d36eb..665bab6121837 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -6718,6 +6718,13 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
   case Intrinsic::masked_compressstore:
     visitMaskedStore(I, true /* IsCompressing */);
     return;
+  case Intrinsic::masked_compress:
+    setValue(&I, DAG.getNode(ISD::MCOMPRESS, sdl,
+                             getValue(I.getArgOperand(0)).getValueType(),
+                             getValue(I.getArgOperand(0)),
+                             getValue(I.getArgOperand(1)),
+                             Flags));
+    return;
   case Intrinsic::powi:
     setValue(&I, ExpandPowI(sdl, getValue(I.getArgOperand(0)),
                             getValue(I.getArgOperand(1)), DAG));
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
index 59742e90c6791..37288054b0e7b 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
@@ -416,6 +416,7 @@ std::string SDNode::getOperationName(const SelectionDAG *G) const {
   case ISD::MSTORE:                     return "masked_store";
   case ISD::MGATHER:                    return "masked_gather";
   case ISD::MSCATTER:                   return "masked_scatter";
+  case ISD::MCOMPRESS:                  return "masked_compress";
   case ISD::VAARG:                      return "vaarg";
   case ISD::VACOPY:                     return "vacopy";
   case ISD::VAEND:                      return "vaend";
diff --git a/llvm/lib/CodeGen/TargetLoweringBase.cpp b/llvm/lib/CodeGen/TargetLoweringBase.cpp
index 09b70cfb72278..5ee12be752b27 100644
--- a/llvm/lib/CodeGen/TargetLoweringBase.cpp
+++ b/llvm/lib/CodeGen/TargetLoweringBase.cpp
@@ -956,6 +956,9 @@ void TargetLoweringBase::initActions() {
     // Named vector shuffles default to expand.
     setOperationAction(ISD::VECTOR_SPLICE, VT, Expand);
 
+    // Only some target support this vector operation. Most need to expand it.
+    setOperationAction(ISD::MCOMPRESS, VT, Expand);
+
     // VP operations default to expand.
 #define BEGIN_REGISTER_VP_SDNODE(SDOPC, ...)                                   \
     setOperationAction(ISD::SDOPC, VT, Expand);
diff --git a/llvm/test/CodeGen/AArch64/masked-compress.ll b/llvm/test/CodeGen/AArch64/masked-compress.ll
new file mode 100644
index 0000000000000..a2f39b9620c95
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/masked-compress.ll
@@ -0,0 +1,299 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -mtriple=aarch64-apple-darwin -verify-machineinstrs < %s | FileCheck %s
+
+define <4 x i32> @test_compress_v4i32(<4 x i32> %vec, <4 x i1> %mask) {
+; CHECK-LABEL: test_compress_v4i32:
+; CHEC...
[truncated]

github-actions · 2024-05-15T16:17:02Z

✅ With the latest revision this PR passed the C/C++ code formatter.

arsenm

Missing globalisel implementation

lawben · 2024-05-15T16:17:59Z

Adding participants from discussion: @preames @topperc @efriedma-quic @lukel97 for feedback. Please let me know if you'd be willing to review this PR.

efriedma-quic · 2024-05-15T16:49:37Z

Is there a place to put the duplicated code between the DagTypeLegalizer and the VectorLegalizer?

TargetLowering has a bunch of expand* routines.

SplitVecOperand or WidenResultVector. Maybe someone could point me in the right direction here, as none of my test cases fall into these paths.

Suggested types to test with:

<3 x i32>.
<vscale x 4 x double>

It should be possible to exercise all forms of result legalization. Some forms of operand legalization might be impossible to hit: we legalize results before operands, so the possible operand types are restricted.

fhahn · 2024-05-15T17:19:53Z

also cc @nikolaypanchenko for the RISCV side

lawben · 2024-05-16T14:15:51Z

@arsenm I'm struggling a bit with the GlobalISel implementation, as I don't have experience with it. I've created a GenericOpcode for it in GenericOpcodes.td as follows:

// Generic masked compress.
def G_MCOMPRESS: GenericInstruction {
  let OutOperandList = (outs type0:$dst);
  let InOperandList = (ins type1:$vec, type2:$mask);
  let hasSideEffects = false;
}

and I've added

def : GINodeEquiv<G_MCOMPRESS, masked_compress>;

to SelectionDAGCompat.td by following some examples/tutorials.

However, I cannot get any further with this, as other instructions "fail" in GlobalISel before I even get to my masked compress. As I understand, AArch64 has one of the more sophisticated GISel implementations, but even with that, I'm getting errors like:

LLVM ERROR: unable to legalize instruction: %1:_(<4 x s1>) = G_TRUNC %2:_(<4 x s16>)

This truncation is unavoidable in my case, as I need a <N x s1> mask, which is not legal in itself.

Also, my current SelectionDAG-based implementation has a generic expand path in TargetLowering.cpp for all tagets that don't support it natively. I cannot seems to find a target-independent code path for GIsel in which I could add some logic to expand this.

Could you maybe point me in the right direction to continue here? I'm not really sure how to continue / what to actually implement for GIsel.

arsenm · 2024-05-16T14:26:11Z

def : GINodeEquiv<G_MCOMPRESS, masked_compress>;

This will help you with step 3 (select), but some work will be needed to define some legalizer rules (and regbankselect for the legal cases)

However, I cannot get any further with this, as other instructions "fail" in GlobalISel before I even get to my masked compress. As I understand, AArch64 has one of the more sophisticated GISel implementations, but even with that, I'm getting errors like:
LLVM ERROR: unable to legalize instruction: %1:_(<4 x s1>) = G_TRUNC %2:_(<4 x s16>)

AArch64 is fairly incomplete on vectors last I checked. It's still possible to write a MIR only test that shows your expansion without handling everything

This truncation is unavoidable in my case, as I need a <N x s1> mask, which is not legal in itself.

Also, my current SelectionDAG-based implementation has a generic expand path in TargetLowering.cpp for all tagets that don't support it natively. I cannot seems to find a target-independent code path for GIsel in which I could add some logic to expand this.

This would be in LegalizerHelper, conditionally called depending on the legalizer rules defined by the target (e.g.

llvm-project/llvm/lib/Target/AArch64/GISel/AArch64LegalizerInfo.cpp

Line 920 in c675a58

getActionDefinitionsBuilder(G_SHUFFLE_VECTOR)

)

llvm/docs/LangRef.rst

lawben · 2024-05-17T13:46:46Z

@arsenm Thanks. I've added some more code and got it to run for a basic test on AArch64. See commit 17004b9. The GISel path of this PR is still WIP at this stage.

It's still possible to write a MIR only test that shows your expansion without handling everything

I didn't really want to mix the initial "generic" implementation with target-specific ones for the initial PR. But it seems to me like I need to specify getActionDefinitionsBuilder rules for each target and cannot use a "global" version like I did with SelectionDAG and TargetLoweringBase. I currently have a small hack in AArch64LegalizerInfo.cpp for testing (note: there is a "DO NOT COMMIT!" comment here as this should not be committed).

Question: How can I add (and test) this for now in a way that does not require me to specify all legalizing logic in AArch64, as this intrinsic itself is never legal on AARch64 (except maybe with SVE)? If I only add this to AArch64 now, I guess all other targets will crash with this intrinsic. What is the preferred way to handle this?

I'll add fewerElementsVector()/moreElementsVector() in the LegalizeHelper once I know how to proceed here.

and regbankselect for the legal cases

I don't think this makes sense at this point, as there is no legal generic case. I'll need to add the regbankselect once I implement the target-specific versions. Please correct me if I'm wrong.

llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp

RKSimon

Maybe worth adding ISD::MCOMPRESS to VerifySDNode to ensure the types are correct?

RKSimon · 2024-05-17T14:33:55Z

llvm/include/llvm/CodeGen/ISDOpcodes.h

+  // Masked compress - consecutively place vector elements based on mask
+  // e.g., vec = {A, B, C, D} and mask = 1010
+  //         --> {A, C, ?, ?} where ? is undefined
+  MCOMPRESS,


Please document the operand order

Done. Is this enough? I've followed the examples of the other instructions.

Expand the name to exactly match the intrinsic name

lawben · 2024-05-17T15:27:55Z

Maybe worth adding ISD::MCOMPRESS to VerifySDNode to ensure the types are correct?

@RKSimon that's a good idea. As VerifySDNode is still quite new, I'm not sure if this is intended behavior or not, but when implementing MCOMPRESS in it, I get crashes in my program. Specifically, in SelectionDAG::getNode(), there is a step that tries to FoldConstantArithmetic, which creates an MCOMPRESS node with scalar input. As this is not valid, my verify code crashes. I assume this will happen for many other nodes that require vector inputs.

I could add a check at the beginning of FoldConstantArithmetic and bail, like CONCAT_VECTORS

llvm-project/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp

Lines 6279 to 6283 in 84aee95

    
           // We can't create a scalar CONCAT_VECTORS so skip it. It will break 
        
           // for concats involving SPLAT_VECTOR. Concats of BUILD_VECTORS are handled by 
        
           // foldCONCAT_VECTORS in getNode before this is called. 
        
           if (Opcode >= ISD::BUILTIN_OP_END || Opcode == ISD::CONCAT_VECTORS) 
        
             return SDValue();

but that feels like the wrong place/abstraction to do this at. Do you have any suggestions where to put logic to avoid this? In my opinion, the crash in VerifySDNode is correct here and the scalar path should never be tried, so we probably want to avoid it.

efriedma-quic · 2024-06-19T20:15:25Z

llvm/docs/LangRef.rst

+The mask holds an entry for each vector lane, and is used to select elements to be kept.
+If ``passthru`` is undefined, the number of valid lanes is equal to the number of ``true`` entries in the mask, i.e., all lanes >= number-of-selected-values are undefined.
+If a ``passthru`` vector is given, all remaining lanes are filled with the corresponding lane's value from ``passthru``.
+The main difference to :ref:`llvm.masked.compressstore <int_compressstore>` is that the we do not need to guard against memory access for unselected lanes.


Please wrap lines at 80 columns

Probably worth explicitly stating what happens if the mask has poison/undef bits. (And make sure the expansion freezes store indexes as necessary to implement those semantics.)

Good point, I've added this in the docs. I would interpret undef as "false" (this how I implemented it), because we cannot access that element and the caller says "I don't care about this value". I guess this does not need any freezing stores, as we handle this deterministically.

That doesn't work. There's no way at runtime to compute whether an input is undef/poison at runtime. (You can technically special-case a literal undef/poison, but that doesn't really do anything useful.)

Ah, okay. I think my understanding of when a freeze is needed was wrong. In this case, I'm not sure we can specify one or the other behavior if a mask element is undef/poison. Depending on the code path, we perform a vec_reduce to get a passthru value or not. So our result depends on how vec_reduce behaves with undef/poison in addition to how our extracted mask element if frozen, at which point this probably gets weird to handle consistently.

Before changing the docs again: Is it valid for us to say "If the mask is undef or contains undef elements, the entire result is undef"? That would give us a bit of wiggle room in the implementation.

If a mask element is poison, I think it's fine if the whole returned vector is poison. But preferably not undefined behavior; undefined behavior means we can't hoist, which makes everything more complicated, so I'd rather not deal with that. (I guess you could try to more narrowly state that elements after the poisoned mask bit are poison, but I don't think that's actually useful.)

If a mask element is undef... I guess we could similarly say the whole returned vector is undef? Not sure how much it matters exactly what rule we pick here.

What does @llvm.masked.compressstore currently do with undef or poison mask elements? It would be nice if they were consistent

@lukel97 I just played around a bit with @llvm.masked.compressstore and immediate values. It is does not seem very consistent (https://godbolt.org/z/xjeY5z9bb). Sometimes and undef value is treated as true, sometimes as false. I'm not sure what the takeaway from this is though.

@efriedma-quic Sorry, but I'm a bit lost in your distinction here. Could you maybe suggest some wording on how to treat this? I think we basically want to say that if a mask is/contains poison/undef bits, we don't care about the result vector. I just don't know how to express this with the correct LLVM terminology.

Strawman wording: "if any element of the mask is poison, all elements of the result are poison. Otherwise, if any element of the mask is undef, all elements of the result are undef."

This mostly only affects optimizations you haven't implemented yet... but probably the code to expand llvm.masked.compress to stores needs to freeze the mask, so we don't do a store with a poisoned pointer operand.

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

RKSimon

A couple of minors

llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp

lawben · 2024-07-01T09:29:00Z

@lukel97 I've renamed the intrinsic to llvm.experimental.vector.compress.
@efriedma-quic I've taken your wording and added it to the docs. I think it described the semantics clearly, so I don't see any reason to change it. I've also frozen the mask entries.

lukel97

The renaming looks good, thanks!

llvm/docs/LangRef.rst

llvm/docs/ReleaseNotes.rst

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp

lukel97

Can't speak for the GlobalISel/AArch64 stuff but the LangRef + SelectionDAG parts LGTM, thanks! Probably best if you wait for another LGTM.

llvm/include/llvm/Target/TargetSelectionDAG.td

lawben · 2024-07-08T07:05:21Z

@efriedma-quic @RKSimon is there anything else to add/do? as @lukel97 suggested, I think it would be better to have more than one LGTM here

RKSimon

LGTM

efriedma-quic

LGTM

RKSimon · 2024-07-17T11:41:29Z

@lawben You should be OK to commit now

llvm-ci · 2024-07-17T14:22:24Z

LLVM Buildbot has detected a new failure on builder mlir-nvidia-gcc7 running on mlir-nvidia while building llvm at step 5 "build-check-mlir-build-only".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/116/builds/1373

Here is the relevant piece of the build log for the reference:

Step 5 (build-check-mlir-build-only) failure: build (failure)
...
   return ((llvm::isa<Tys>(t1) && llvm::isa<Tys>(t2)) || ... || false);
           ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.src/mlir/examples/transform/Ch4/lib/MyExtension.cpp:63:31: warning: suggest parentheses around ‘&&’ within ‘||’ [-Wparentheses]
/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.src/mlir/examples/transform/Ch4/lib/MyExtension.cpp:63:31: warning: suggest parentheses around ‘&&’ within ‘||’ [-Wparentheses]
667.305 [51/16/4420] Linking CXX static library lib/libMyExtensionCh4.a
667.331 [50/16/4421] Building CXX object tools/mlir/examples/minimal-opt/CMakeFiles/mlir-minimal-opt.dir/mlir-minimal-opt.cpp.o
667.502 [49/16/4422] Linking CXX executable bin/mlir-minimal-opt
670.657 [48/16/4423] Linking CXX executable bin/mlir-transform-opt
682.453 [47/16/4424] Building CXX object tools/llc/CMakeFiles/llc.dir/llc.cpp.o
1636.700 [46/16/4425] Building CXX object tools/mlir/tools/mlir-reduce/CMakeFiles/mlir-reduce.dir/mlir-reduce.cpp.o
FAILED: tools/mlir/tools/mlir-reduce/CMakeFiles/mlir-reduce.dir/mlir-reduce.cpp.o 
CCACHE_CPP2=yes CCACHE_HASHDIR=yes /usr/bin/ccache /usr/bin/g++-7 -DBUILD_EXAMPLES -DGTEST_HAS_RTTI=0 -DMLIR_INCLUDE_TESTS -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.obj/tools/mlir/tools/mlir-reduce -I/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.src/mlir/tools/mlir-reduce -I/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.obj/include -I/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.src/llvm/include -I/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.src/mlir/include -I/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.obj/tools/mlir/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -fno-lifetime-dse -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -pedantic -Wno-long-long -Wimplicit-fallthrough -Wno-uninitialized -Wno-nonnull -Wno-noexcept-type -Wdelete-non-virtual-dtor -Wno-comment -Wno-misleading-indentation -fdiagnostics-color -ffunction-sections -fdata-sections -Wundef -Wno-unused-but-set-parameter -O3 -DNDEBUG  -fno-exceptions -funwind-tables -fno-rtti -fno-exceptions -funwind-tables -fno-rtti -UNDEBUG -std=c++1z -MD -MT tools/mlir/tools/mlir-reduce/CMakeFiles/mlir-reduce.dir/mlir-reduce.cpp.o -MF tools/mlir/tools/mlir-reduce/CMakeFiles/mlir-reduce.dir/mlir-reduce.cpp.o.d -o tools/mlir/tools/mlir-reduce/CMakeFiles/mlir-reduce.dir/mlir-reduce.cpp.o -c /vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.src/mlir/tools/mlir-reduce/mlir-reduce.cpp
In file included from /vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.src/mlir/include/mlir/Dialect/XeGPU/IR/XeGPU.h:34:0,
                 from /vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.src/mlir/include/mlir/InitAllDialects.h:95,
                 from /vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.src/mlir/tools/mlir-reduce/mlir-reduce.cpp:18:
/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.obj/tools/mlir/include/mlir/Dialect/XeGPU/IR/XeGPU.h.inc: In member function ‘llvm::ArrayRef<long int> mlir::xegpu::CreateNdDescOp::getStaticStrides()’:
/vol/worker/mlir-nvidia/mlir-nvidia-gcc7/llvm.obj/tools/mlir/include/mlir/Dialect/XeGPU/IR/XeGPU.h.inc:1218:26: warning: unused variable ‘offset’ [-Wunused-variable]
     auto [strides, offset] = getStridesAndOffset(memrefType);
                          ^
g++-7: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-7/README.Bugs> for instructions.
command timed out: 1200 seconds without output running [b'ninja', b'-j', b'16', b'check-mlir-build-only'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=4632.400717

This PR adds a new vector intrinsic `@llvm.experimental.vector.compress` to "compress" data within a vector based on a selection mask, i.e., it moves all selected values (i.e., where `mask[i] == 1`) to consecutive lanes in the result vector. A `passthru` vector can be provided, from which remaining lanes are filled. The main reason for this is that the existing `@llvm.masked.compressstore` has very strong constraints in that it can only write values that were selected, resulting in guard branches for all targets except AVX-512 (and even there the AMD implementation is _very_ slow). More instruction sets support "compress" logic, but only within registers. So to store the values, an additional store is needed. But this combination is likely significantly faster on many target as it avoids branches. In follow up PRs, my plan is to add target-specific lowerings for x86, SVE, and possibly RISCV. I also want to combine this with a store instruction, as this is probably a common case and we can avoid some memory writes in that case. See [discussion in forum](https://discourse.llvm.org/t/new-intrinsic-for-masked-vector-compress-without-store/78663) for initial discussion on the design.

Summary: This PR adds a new vector intrinsic `@llvm.experimental.vector.compress` to "compress" data within a vector based on a selection mask, i.e., it moves all selected values (i.e., where `mask[i] == 1`) to consecutive lanes in the result vector. A `passthru` vector can be provided, from which remaining lanes are filled. The main reason for this is that the existing `@llvm.masked.compressstore` has very strong constraints in that it can only write values that were selected, resulting in guard branches for all targets except AVX-512 (and even there the AMD implementation is _very_ slow). More instruction sets support "compress" logic, but only within registers. So to store the values, an additional store is needed. But this combination is likely significantly faster on many target as it avoids branches. In follow up PRs, my plan is to add target-specific lowerings for x86, SVE, and possibly RISCV. I also want to combine this with a store instruction, as this is probably a common case and we can avoid some memory writes in that case. See [discussion in forum](https://discourse.llvm.org/t/new-intrinsic-for-masked-vector-compress-without-store/78663) for initial discussion on the design. Test Plan: Reviewers: Subscribers: Tasks: Tags: Differential Revision: https://phabricator.intern.facebook.com/D60250979

) This is a follow-up to #92289 that adds custom lowering of the new `@llvm.experimental.vector.compress` intrinsic on AArch64 with SVE instructions. Some vectors have a `compact` instruction that they can be lowered to.

…#101015) This is a follow-up to llvm#92289 that adds custom lowering of the new `@llvm.experimental.vector.compress` intrinsic on AArch64 with SVE instructions. Some vectors have a `compact` instruction that they can be lowered to.

This is a follow-up to #92289 that adds lowering of the new `@llvm.experimental.vector.compress` intrinsic on x86 with AVX512 instructions. This intrinsic maps directly to `vpcompress`.

lawben added 6 commits May 15, 2024 14:08

Add initial code for @llvm.masked.compress intrinsics

3a7b064

Remove requirements for legal types

75abf0b

Add tests for AArch64

0329bc9

Add floating point test

73bfebb

Add documentation

e4423a1

Fix formatting

3e99678

llvmbot added backend:AArch64 llvm:SelectionDAG SelectionDAGISel as well llvm:ir labels May 15, 2024

arsenm reviewed May 15, 2024

View reviewed changes

lawben added 4 commits May 16, 2024 11:15

Fix references in docs

b686f83

Add widen for vector type legalization

73cc28f

Put expand logic in TargerLowering to avoid code duplication.

8a613f3

Fix formatting

a4df959

topperc reviewed May 16, 2024

View reviewed changes

llvm/docs/LangRef.rst Outdated Show resolved Hide resolved

llvm/docs/LangRef.rst Outdated Show resolved Hide resolved

llvm/docs/LangRef.rst Outdated Show resolved Hide resolved

Add basic lowering of MCOMPRESS in GlobalISel

17004b9

arsenm reviewed May 17, 2024

View reviewed changes

llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp Outdated Show resolved Hide resolved

llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp Outdated Show resolved Hide resolved

lawben added 2 commits May 17, 2024 15:52

Add basic AArch64 MIR test

984cad1

Address PR comments

0ea2415

RKSimon reviewed May 17, 2024

View reviewed changes

Update docs according to PR comments

1dc79b4

RKSimon mentioned this pull request Jun 19, 2024

suboptimal shufflevector codegen, shufflevector+trunc #88030

Open

efriedma-quic reviewed Jun 19, 2024

View reviewed changes

lawben added 2 commits June 21, 2024 13:12

Address PR comments

50cce23

Return passthru for undef mask or vec

89c3e9f

RKSimon reviewed Jun 24, 2024

View reviewed changes

llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp Outdated Show resolved Hide resolved

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp Outdated Show resolved Hide resolved

lawben added 3 commits June 25, 2024 10:20

Address PR comments

995c863

Rename masked.compress to experimental.vector.compress

adb9d9c

Add freeze to mask extract for poison/undef entries

ba2939e

lawben changed the title ~~[LLVM] Add llvm.masked.compress intrinsic~~ [LLVM] Add llvm.experimental.vector.compress intrinsic Jul 1, 2024

Fix docs

11e1742

lukel97 reviewed Jul 2, 2024

View reviewed changes

llvm/docs/LangRef.rst Outdated Show resolved Hide resolved

llvm/docs/ReleaseNotes.rst Outdated Show resolved Hide resolved

Fix docs

32cc27f

lukel97 reviewed Jul 3, 2024

View reviewed changes

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp Show resolved Hide resolved

lukel97 approved these changes Jul 3, 2024

View reviewed changes

llvm/include/llvm/Target/TargetSelectionDAG.td Outdated Show resolved Hide resolved

Address PR comments

99610a8

RKSimon approved these changes Jul 8, 2024

View reviewed changes

efriedma-quic approved these changes Jul 16, 2024

View reviewed changes

Merge branch 'main' into masked_compress

efa2e92

lawben merged commit 177ce19 into llvm:main Jul 17, 2024
8 checks passed

lawben mentioned this pull request Jul 29, 2024

[AArch64] Add lowering for @llvm.experimental.vector.compress #101015

Merged

lawben mentioned this pull request Aug 20, 2024

[x86] Add lowering for @llvm.experimental.vector.compress #104904

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[LLVM] Add `llvm.experimental.vector.compress` intrinsic #92289

[LLVM] Add `llvm.experimental.vector.compress` intrinsic #92289

lawben commented May 15, 2024 •

edited

Loading

llvmbot commented May 15, 2024 •

edited

Loading

github-actions bot commented May 15, 2024 •

edited

Loading

arsenm left a comment

lawben commented May 15, 2024

efriedma-quic commented May 15, 2024

fhahn commented May 15, 2024

lawben commented May 16, 2024

arsenm commented May 16, 2024

lawben commented May 17, 2024 •

edited

Loading

RKSimon left a comment

RKSimon May 17, 2024

lawben May 21, 2024

arsenm Jun 6, 2024

lawben commented May 17, 2024

efriedma-quic Jun 19, 2024

lawben Jun 21, 2024

efriedma-quic Jun 24, 2024

lawben Jun 25, 2024

efriedma-quic Jun 27, 2024

lukel97 Jun 28, 2024

lawben Jun 28, 2024

efriedma-quic Jun 28, 2024

RKSimon left a comment

lawben commented Jul 1, 2024

lukel97 left a comment

lukel97 left a comment

lawben commented Jul 8, 2024

RKSimon left a comment

efriedma-quic left a comment

RKSimon commented Jul 17, 2024

llvm-ci commented Jul 17, 2024

[LLVM] Add llvm.experimental.vector.compress intrinsic #92289

[LLVM] Add llvm.experimental.vector.compress intrinsic #92289

Conversation

lawben commented May 15, 2024 • edited Loading

llvmbot commented May 15, 2024 • edited Loading

github-actions bot commented May 15, 2024 • edited Loading

arsenm left a comment

Choose a reason for hiding this comment

lawben commented May 15, 2024

efriedma-quic commented May 15, 2024

fhahn commented May 15, 2024

lawben commented May 16, 2024

arsenm commented May 16, 2024

lawben commented May 17, 2024 • edited Loading

RKSimon left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lawben commented May 17, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RKSimon left a comment

Choose a reason for hiding this comment

lawben commented Jul 1, 2024

lukel97 left a comment

Choose a reason for hiding this comment

lukel97 left a comment

Choose a reason for hiding this comment

lawben commented Jul 8, 2024

RKSimon left a comment

Choose a reason for hiding this comment

efriedma-quic left a comment

Choose a reason for hiding this comment

RKSimon commented Jul 17, 2024

llvm-ci commented Jul 17, 2024

[LLVM] Add `llvm.experimental.vector.compress` intrinsic #92289

[LLVM] Add `llvm.experimental.vector.compress` intrinsic #92289

lawben commented May 15, 2024 •

edited

Loading

llvmbot commented May 15, 2024 •

edited

Loading

github-actions bot commented May 15, 2024 •

edited

Loading

lawben commented May 17, 2024 •

edited

Loading