[LLVM] Add llvm.experimental.vector.compress intrinsic (#92289)
Summary:
This PR adds a new vector intrinsic `@llvm.experimental.vector.compress`
to "compress" data within a vector based on a selection mask, i.e., it
moves all selected values (those where `mask[i] == 1`) to consecutive
lanes in the result vector. A `passthru` vector can be provided, from
which the remaining lanes are filled.
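
For illustration, a minimal sketch of the semantics (the values below are made up for exposition and are not taken from this PR's tests):

```llvm
; %vec = <i32 10, i32 20, i32 30, i32 40>, %mask = <i1 1, i1 0, i1 1, i1 0>, %passthru = zeroinitializer
%res = call <4 x i32> @llvm.experimental.vector.compress.v4i32(<4 x i32> %vec, <4 x i1> %mask, <4 x i32> %passthru)
; %res = <i32 10, i32 30, i32 0, i32 0> ; selected lanes first, tail from %passthru
```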

The main reason for this is that the existing
`@llvm.masked.compressstore` has very strong constraints: it can only
write the values that were selected, which results in guard branches on
all targets except AVX-512 (and even there, the AMD implementation is
_very_ slow). More instruction sets support "compress" logic, but only
within registers, so an additional store is needed to write the values
out. Even so, this combination is likely significantly faster on many
targets, as it avoids branches.

In follow-up PRs, my plan is to add target-specific lowerings for x86,
SVE, and possibly RISC-V. I also want to combine this with a store
instruction (sketched below), as this is probably a common case and we
can then avoid some memory writes.
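
A rough sketch of that combination (illustrative only; note that, unlike `@llvm.masked.compressstore`, this writes all lanes of the result, with the unselected tail coming from `%passthru`):

```llvm
%compressed = call <8 x i32> @llvm.experimental.vector.compress.v8i32(<8 x i32> %vec, <8 x i1> %mask, <8 x i32> %passthru)
store <8 x i32> %compressed, ptr %dst, align 32
```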

See the [discussion on the LLVM
forum](https://discourse.llvm.org/t/new-intrinsic-for-masked-vector-compress-without-store/78663)
for the initial design discussion.


Differential Revision: https://phabricator.intern.facebook.com/D60250979
lawben authored and yuxuanchen1997 committed Jul 25, 2024
1 parent 0b4efa4 commit 46e0dfe
Showing 27 changed files with 1,105 additions and 1 deletion.
7 changes: 7 additions & 0 deletions llvm/docs/GlobalISel/GenericOpcode.rst
@@ -726,6 +726,13 @@ The type of the operand must be equal to or larger than the vector element
type. If the operand is larger than the vector element type, the scalar is
implicitly truncated to the vector element type.

G_VECTOR_COMPRESS
^^^^^^^^^^^^^^^^^

Given an input vector, a mask vector, and a passthru vector, consecutively place
all selected (i.e., where mask[i] = true) input lanes in the output vector. All
remaining lanes in the output are taken from passthru, which may be undef.
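
For example (an illustrative sketch; the register numbers and types are arbitrary):

.. code-block:: none

  %3:_(<4 x s32>) = G_VECTOR_COMPRESS %0:_(<4 x s32>), %1:_(<4 x s1>), %2:_(<4 x s32>)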

Vector Reduction Operations
---------------------------

87 changes: 87 additions & 0 deletions llvm/docs/LangRef.rst
@@ -19525,6 +19525,93 @@ the follow sequence of operations:

The ``mask`` operand will apply to at least the gather and scatter operations.


.. _int_vector_compress:

'``llvm.experimental.vector.compress.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

LLVM provides an intrinsic for compressing data within a vector based on a selection mask.
Semantically, this is similar to :ref:`llvm.masked.compressstore <int_compressstore>` but with weaker assumptions
and without storing the results to memory, i.e., the data remains in the vector.

Syntax:
"""""""
This is an overloaded intrinsic. A number of scalar values of integer, floating point or pointer data type are collected
from an input vector and placed adjacently within the result vector. A mask defines which elements to collect from the vector.
The remaining lanes are filled with values from ``passthru``.

.. code-block:: llvm

    declare <8 x i32> @llvm.experimental.vector.compress.v8i32(<8 x i32> <value>, <8 x i1> <mask>, <8 x i32> <passthru>)
    declare <16 x float> @llvm.experimental.vector.compress.v16f32(<16 x float> <value>, <16 x i1> <mask>, <16 x float> undef)

Overview:
"""""""""

Selects elements from input vector ``value`` according to the ``mask``.
All selected elements are written into adjacent lanes in the result vector,
from lower to higher.
The mask holds an entry for each vector lane, and is used to select elements
to be kept.
If a ``passthru`` vector is given, all remaining lanes are filled with the
corresponding lane's value from ``passthru``.
The main difference to :ref:`llvm.masked.compressstore <int_compressstore>` is
that we do not need to guard against memory accesses for unselected lanes.
This allows for branchless code and better optimization on targets that do not
support, or only inefficiently support, the explicit semantics of
:ref:`llvm.masked.compressstore <int_compressstore>` but still provide some
form of compress operation.
The result vector can then be written out with a similar effect, as all
selected values occupy the lower lanes of the vector, without requiring
branches to avoid writes where the mask is ``false``.

Arguments:
""""""""""

The first operand is the input vector, from which elements are selected.
The second operand is the mask, a vector of boolean values.
The third operand is the passthru vector, from which elements are filled
into remaining lanes.
The mask and the input vector must have the same number of vector elements.
The input and passthru vectors must have the same type.

Semantics:
""""""""""

The ``llvm.experimental.vector.compress`` intrinsic compresses data within a vector.
It collects elements from possibly non-adjacent lanes of a vector and places
them contiguously in the result vector based on a selection mask, filling the
remaining lanes with values from ``passthru``.
This intrinsic performs the logic of the following C++ example.
If all entries in the ``mask`` are 0, the ``out`` vector is ``passthru``.
If any element of the mask is poison, all elements of the result are poison.
Otherwise, if any element of the mask is undef, all elements of the result are undef.
If ``passthru`` is undefined, only the first ``k`` lanes of the result are
defined, where ``k`` is the number of ``true`` entries in the mask; all lanes
>= ``k`` are undefined.

.. code-block:: cpp

    // Consecutively place selected values in a vector.
    using VecT __attribute__((vector_size(N))) = int;
    VecT compress(VecT vec, VecT mask, VecT passthru) {
      VecT out;
      int idx = 0;
      for (int i = 0; i < N / sizeof(int); ++i) {
        out[idx] = vec[i];
        idx += static_cast<bool>(mask[i]);
      }
      for (; idx < N / sizeof(int); ++idx) {
        out[idx] = passthru[idx];
      }
      return out;
    }
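
For example, with a constant ``passthru`` of all sevens (an illustrative sketch; the values are not part of the definition above):

.. code-block:: llvm

    %res = call <4 x i32> @llvm.experimental.vector.compress.v4i32(
               <4 x i32> <i32 1, i32 2, i32 3, i32 4>, <4 x i1> <i1 0, i1 1, i1 1, i1 0>, <4 x i32> <i32 7, i32 7, i32 7, i32 7>)
    ; %res is <4 x i32> <i32 2, i32 3, i32 7, i32 7>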


Matrix Intrinsics
-----------------

1 change: 1 addition & 0 deletions llvm/docs/ReleaseNotes.rst
@@ -79,6 +79,7 @@ Changes to the LLVM IR
* ``llvm.instprof.mcdc.tvbitmap.update``: 3rd argument has been
removed. The next argument has been changed from byte index to bit
index.
* Added ``llvm.experimental.vector.compress`` intrinsic.

Changes to LLVM infrastructure
------------------------------
1 change: 1 addition & 0 deletions llvm/include/llvm/CodeGen/GlobalISel/LegalizerHelper.h
@@ -412,6 +412,7 @@ class LegalizerHelper {
LegalizeResult lowerUnmergeValues(MachineInstr &MI);
LegalizeResult lowerExtractInsertVectorElt(MachineInstr &MI);
LegalizeResult lowerShuffleVector(MachineInstr &MI);
LegalizeResult lowerVECTOR_COMPRESS(MachineInstr &MI);
Register getDynStackAllocTargetPtr(Register SPReg, Register AllocSize,
Align Alignment, LLT PtrTy);
LegalizeResult lowerDynStackAlloc(MachineInstr &MI);
8 changes: 8 additions & 0 deletions llvm/include/llvm/CodeGen/ISDOpcodes.h
@@ -659,6 +659,14 @@
/// non-constant operands.
STEP_VECTOR,

/// VECTOR_COMPRESS(Vec, Mask, Passthru)
/// consecutively place vector elements based on mask
/// e.g., vec = {A, B, C, D} and mask = {1, 0, 1, 0}
/// --> {A, C, ?, ?} where ? is undefined
/// If passthru is defined, ?s are replaced with elements from passthru.
/// If passthru is undef, ?s remain undefined.
VECTOR_COMPRESS,

/// MULHU/MULHS - Multiply high - Multiply two integers of type iN,
/// producing an unsigned/signed value of type i[2*N], then return the top
/// part.
4 changes: 4 additions & 0 deletions llvm/include/llvm/CodeGen/TargetLowering.h
@@ -5496,6 +5496,10 @@ class TargetLowering : public TargetLoweringBase {
/// method accepts vectors as its arguments.
SDValue expandVectorSplice(SDNode *Node, SelectionDAG &DAG) const;

/// Expand a VECTOR_COMPRESS into a sequence that extracts each element, stores
/// it to a stack temporary while advancing the store position according to the
/// mask, and finally re-loads the resulting vector.
SDValue expandVECTOR_COMPRESS(SDNode *Node, SelectionDAG &DAG) const;

/// Legalize a SETCC or VP_SETCC with given LHS and RHS and condition code CC
/// on the current target. A VP_SETCC will additionally be given a Mask
/// and/or EVL not equal to SDValue().
5 changes: 5 additions & 0 deletions llvm/include/llvm/IR/Intrinsics.td
@@ -2398,6 +2398,11 @@ def int_masked_compressstore:
[IntrWriteMem, IntrArgMemOnly, IntrWillReturn,
NoCapture<ArgIndex<1>>]>;

def int_experimental_vector_compress:
DefaultAttrsIntrinsic<[llvm_anyvector_ty],
[LLVMMatchType<0>, LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>, LLVMMatchType<0>],
[IntrNoMem, IntrWillReturn]>;

// Test whether a pointer is associated with a type metadata identifier.
def int_type_test : DefaultAttrsIntrinsic<[llvm_i1_ty], [llvm_ptr_ty, llvm_metadata_ty],
[IntrNoMem, IntrWillReturn, IntrSpeculatable]>;
3 changes: 3 additions & 0 deletions llvm/include/llvm/Support/TargetOpcodes.def
@@ -754,6 +754,9 @@ HANDLE_TARGET_OPCODE(G_SHUFFLE_VECTOR)
/// Generic splatvector.
HANDLE_TARGET_OPCODE(G_SPLAT_VECTOR)

/// Generic masked compress.
HANDLE_TARGET_OPCODE(G_VECTOR_COMPRESS)

/// Generic count trailing zeroes.
HANDLE_TARGET_OPCODE(G_CTTZ)

7 changes: 7 additions & 0 deletions llvm/include/llvm/Target/GenericOpcodes.td
@@ -1548,6 +1548,13 @@ def G_SPLAT_VECTOR: GenericInstruction {
let hasSideEffects = false;
}

// Generic masked compress.
def G_VECTOR_COMPRESS: GenericInstruction {
let OutOperandList = (outs type0:$dst);
let InOperandList = (ins type0:$vec, type1:$mask, type0:$passthru);
let hasSideEffects = false;
}

//------------------------------------------------------------------------------
// Vector reductions
//------------------------------------------------------------------------------
1 change: 1 addition & 0 deletions llvm/include/llvm/Target/GlobalISel/SelectionDAGCompat.td
@@ -193,6 +193,7 @@ def : GINodeEquiv<G_VECREDUCE_UMAX, vecreduce_umax>;
def : GINodeEquiv<G_VECREDUCE_SMIN, vecreduce_smin>;
def : GINodeEquiv<G_VECREDUCE_SMAX, vecreduce_smax>;
def : GINodeEquiv<G_VECREDUCE_ADD, vecreduce_add>;
def : GINodeEquiv<G_VECTOR_COMPRESS, vector_compress>;

def : GINodeEquiv<G_STRICT_FADD, strict_fadd>;
def : GINodeEquiv<G_STRICT_FSUB, strict_fsub>;
8 changes: 8 additions & 0 deletions llvm/include/llvm/Target/TargetSelectionDAG.td
@@ -266,6 +266,12 @@ def SDTMaskedScatter : SDTypeProfile<0, 4, [
SDTCisSameNumEltsAs<0, 1>, SDTCisSameNumEltsAs<0, 3>
]>;

def SDTVectorCompress : SDTypeProfile<1, 3, [
SDTCisVec<0>, SDTCisSameAs<0, 1>,
SDTCisVec<2>, SDTCisSameNumEltsAs<1, 2>,
SDTCisSameAs<1, 3>
]>;

def SDTVecShuffle : SDTypeProfile<1, 2, [
SDTCisSameAs<0, 1>, SDTCisSameAs<1, 2>
]>;
@@ -757,6 +763,8 @@ def masked_gather : SDNode<"ISD::MGATHER", SDTMaskedGather,
def masked_scatter : SDNode<"ISD::MSCATTER", SDTMaskedScatter,
[SDNPHasChain, SDNPMayStore, SDNPMemOperand]>;

def vector_compress : SDNode<"ISD::VECTOR_COMPRESS", SDTVectorCompress>;

// Do not use ld, st directly. Use load, extload, sextload, zextload, store,
// and truncst (see below).
def ld : SDNode<"ISD::LOAD" , SDTLoad,
2 changes: 2 additions & 0 deletions llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
@@ -1994,6 +1994,8 @@ unsigned IRTranslator::getSimpleIntrinsicOpcode(Intrinsic::ID ID) {
return TargetOpcode::G_VECREDUCE_UMAX;
case Intrinsic::vector_reduce_umin:
return TargetOpcode::G_VECREDUCE_UMIN;
case Intrinsic::experimental_vector_compress:
return TargetOpcode::G_VECTOR_COMPRESS;
case Intrinsic::lround:
return TargetOpcode::G_LROUND;
case Intrinsic::llround:
89 changes: 89 additions & 0 deletions llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
@@ -4034,6 +4034,8 @@ LegalizerHelper::lower(MachineInstr &MI, unsigned TypeIdx, LLT LowerHintTy) {
return lowerExtractInsertVectorElt(MI);
case G_SHUFFLE_VECTOR:
return lowerShuffleVector(MI);
case G_VECTOR_COMPRESS:
return lowerVECTOR_COMPRESS(MI);
case G_DYN_STACKALLOC:
return lowerDynStackAlloc(MI);
case G_STACKSAVE:
@@ -7593,6 +7595,93 @@
return Legalized;
}

LegalizerHelper::LegalizeResult
LegalizerHelper::lowerVECTOR_COMPRESS(llvm::MachineInstr &MI) {
  auto [Dst, DstTy, Vec, VecTy, Mask, MaskTy, Passthru, PassthruTy] =
      MI.getFirst4RegLLTs();

  if (VecTy.isScalableVector())
    report_fatal_error("Cannot expand masked_compress for scalable vectors.");

  Align VecAlign = getStackTemporaryAlignment(VecTy);
  MachinePointerInfo PtrInfo;
  Register StackPtr =
      createStackTemporary(TypeSize::getFixed(VecTy.getSizeInBytes()), VecAlign,
                           PtrInfo)
          .getReg(0);
  MachinePointerInfo ValPtrInfo =
      MachinePointerInfo::getUnknownStack(*MI.getMF());

  LLT IdxTy = LLT::scalar(32);
  LLT ValTy = VecTy.getElementType();
  Align ValAlign = getStackTemporaryAlignment(ValTy);

  auto OutPos = MIRBuilder.buildConstant(IdxTy, 0);

  bool HasPassthru =
      MRI.getVRegDef(Passthru)->getOpcode() != TargetOpcode::G_IMPLICIT_DEF;

  if (HasPassthru)
    MIRBuilder.buildStore(Passthru, StackPtr, PtrInfo, VecAlign);

  Register LastWriteVal;
  std::optional<APInt> PassthruSplatVal =
      isConstantOrConstantSplatVector(*MRI.getVRegDef(Passthru), MRI);

  if (PassthruSplatVal.has_value()) {
    LastWriteVal =
        MIRBuilder.buildConstant(ValTy, PassthruSplatVal.value()).getReg(0);
  } else if (HasPassthru) {
    auto Popcount = MIRBuilder.buildZExt(MaskTy.changeElementSize(32), Mask);
    Popcount = MIRBuilder.buildInstr(TargetOpcode::G_VECREDUCE_ADD,
                                     {LLT::scalar(32)}, {Popcount});

    Register LastElmtPtr =
        getVectorElementPointer(StackPtr, VecTy, Popcount.getReg(0));
    LastWriteVal =
        MIRBuilder.buildLoad(ValTy, LastElmtPtr, ValPtrInfo, ValAlign)
            .getReg(0);
  }

  unsigned NumElmts = VecTy.getNumElements();
  for (unsigned I = 0; I < NumElmts; ++I) {
    auto Idx = MIRBuilder.buildConstant(IdxTy, I);
    auto Val = MIRBuilder.buildExtractVectorElement(ValTy, Vec, Idx);
    Register ElmtPtr =
        getVectorElementPointer(StackPtr, VecTy, OutPos.getReg(0));
    MIRBuilder.buildStore(Val, ElmtPtr, ValPtrInfo, ValAlign);

    LLT MaskITy = MaskTy.getElementType();
    auto MaskI = MIRBuilder.buildExtractVectorElement(MaskITy, Mask, Idx);
    if (MaskITy.getSizeInBits() > 1)
      MaskI = MIRBuilder.buildTrunc(LLT::scalar(1), MaskI);

    MaskI = MIRBuilder.buildZExt(IdxTy, MaskI);
    OutPos = MIRBuilder.buildAdd(IdxTy, OutPos, MaskI);

    if (HasPassthru && I == NumElmts - 1) {
      auto EndOfVector =
          MIRBuilder.buildConstant(IdxTy, VecTy.getNumElements() - 1);
      auto AllLanesSelected = MIRBuilder.buildICmp(
          CmpInst::ICMP_UGT, LLT::scalar(1), OutPos, EndOfVector);
      OutPos = MIRBuilder.buildInstr(TargetOpcode::G_UMIN, {IdxTy},
                                     {OutPos, EndOfVector});
      ElmtPtr = getVectorElementPointer(StackPtr, VecTy, OutPos.getReg(0));

      LastWriteVal =
          MIRBuilder.buildSelect(ValTy, AllLanesSelected, Val, LastWriteVal)
              .getReg(0);
      MIRBuilder.buildStore(LastWriteVal, ElmtPtr, ValPtrInfo, ValAlign);
    }
  }

  // TODO: Use StackPtr's FrameIndex alignment.
  MIRBuilder.buildLoad(Dst, StackPtr, PtrInfo, VecAlign);

  MI.eraseFromParent();
  return Legalized;
}

Register LegalizerHelper::getDynStackAllocTargetPtr(Register SPReg,
Register AllocSize,
Align Alignment,