-
Notifications
You must be signed in to change notification settings - Fork 790
[SYCL] Align Non-Uniform Groups with PR#14604: #19238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: sycl
Are you sure you want to change the base?
Conversation
WIP: -D_FORTIFY_SOURCE=2 * fixed_size_group → chunk * tangle_group → tangle * ballot_group → fragment * opportunistic_group → merged into fragment * get_ballot_group() → binary_partition() * get_fixed_size_group<N>() → chunked_partition<N>() * get_tangle_group() → entangle() + is_user_constructed_updated and HasExtensionWordBoundary tests
sycl/source/detail/device_impl.hpp
Outdated
@@ -1433,26 +1433,30 @@ class device_impl : public std::enable_shared_from_this<device_impl> { | |||
CASE(ext_intel_esimd) { | |||
return get_info_impl_nocheck<UR_DEVICE_INFO_ESIMD_SUPPORT>().value_or(0); | |||
} | |||
CASE(ext_oneapi_ballot_group) { | |||
CASE(ext_oneapi_fragment) { | |||
// check actual OpenCL extensions: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
each introduced entity has own aspect combination in opencl, so checking needed
UPD: left for separate PR
// fragment partitions its parent into two groups (0 and 1) | ||
// We have to force each group down different control flow | ||
// Work-items in the "false" group (0) may still be active | ||
if (g.get_group_id() == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised this logic still works now that we can create a fragment
using either binary_partition
or get_opportunistic_group
.
Does this work because we still know that our implementation has at most two groups? Do you agree that if we added some way to create more than two fragments from a function, we'd need to rewrite this as a loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this work because we still know that our implementation has at most two groups? Do you agree that if we added some way to create more than two fragments from a function, we'd need to rewrite this as a loop?
fragment
can be further subdivided by binary_partition
, but each fragment
instance will only "know" about its immediate parent.
BTW, why do we need the else
branch here in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fragment
can be further subdivided bybinary_partition
, but eachfragment
instance will only "know" about its immediate parent.
Right, so at each call to binary_partition
we're getting back a fragment
with at most 2 groups.
I'm just thinking aloud. If we added something like a labeled_partition
that returned more than 2 conceptual fragments, I don't think the current code would work. We might want to add a TODO or a note to that effect, is what I'm thinking. A ballot_group
could only represent the result of a binary split, which is why it was different before.
BTW, why do we need the
else
branch here in the first place?
This was the only way I could find to trick Clang into generating code with the correct semantics. I think if we want to replace this with something that's simpler and easier to understand, then we need a new SPIR-V extension.
Here's a very simple example of the sort of code that we have to worry about:
int x = ...;
auto fragment = syclx::binary_partition(sg, sg.get_local_id() % 2);
auto sum = sycl::reduce_over_group(fragment, x, sycl::plus<>());
Each work-item gets its own fragment
object, and the information stored inside that object describes which other work-items are part of the same conceptual fragment. But we haven't actually changed anything about the execution model: the hardware will continue to execute all the work-items in the sub-group "together" (typically in lock-step).
If we implement reduce_over_group
with no branch, our only option today would be to write something like this:
float reduce_over_group(fragment, float x, BinaryOperation BOp) {
return __spirv_OpGroupNonUniformFAdd(x, BOp);
}
...which wouldn't work. All the work-items in the same sub-group will execute the same "dynamic instance" of the GroupNonUniformFAdd
instruction, and so we'd end up reducing across the whole sub-group, instead of the two individual partitions.
The only way I could find to reduce over separate partitions using existing SPIR-V instructions like GroupNonUniformFAdd
was to convert the partitions into control flow. Each partition encounters a unique dynamic instance of the SPIR-V instruction, and the reduction works as expected.
If we had SPIR-V instructions that expected a mask, we could shift that complexity to the device compiler, and the headers would just have to say:
float reduce_over_group(fragment f, float x, BinaryOperation BOp) {
return __spirv_OpGroupMaskedFAdd(x, BOp, f.Mask);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added TODOs, please confirm it looks as assumed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @Pennycook, I understand why do we need the condition, but I'm still confused about the else
branch.
int x = ...;
auto fragment = syclx::binary_partition(sg, sg.get_local_id() % 2);
auto val = sycl::all_of(fragment, x > 42);
What we have right now is:
if (g.get_group_id() == 1) {
return __spirv_GroupNonUniformAll(group_scope<ParentGroup>::value, pred);
} else {
return __spirv_GroupNonUniformAll(group_scope<ParentGroup>::value, pred);
}
Shouldn't it be something like?
if (current work-item is active within the current fragment) {
return __spirv_GroupNonUniformAll(group_scope<ParentGroup>::value, pred);
}
I.e. we should check mask embedded into the fragment
and work-items which aren't a part of it just do nothing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's really subtle, because SPMD is weird. It took me a long time to wrap my head around this (or, at least, to convince myself that I understood what was happening -- I might still be wrong 😆).
Each fragment
object here represents two different groups with different masks. But if we wrote it as:
if (current work-item is active within the current fragment) {
return __spirv_GroupNonUniformAll(group_scope<ParentGroup>::value, pred);
}
...then there's only one branch, based on one condition. It might help to think of it like this:
bool condition = /* current work-item is active within the current fragment */;
if (condition) {
return __spirv_GroupNonUniformAll(group_scope<ParentGroup>::value, pred);
}
The compiler isn't smart enough to see that different work-items within the sub-group are computing condition
by comparing against different mask values. All it sees is a branch on condition
, and condition
always evaluates to true
(because each work-item will always be in the mask that represents which part of the sub-group it is in).
So, a branch like this effectively doesn't do anything, and we end up reducing over the whole sub-group.
Does that make sense?
Co-authored-by: John Pennycook <john.pennycook@intel.com>
Co-authored-by: John Pennycook <john.pennycook@intel.com>
// fragment partitions its parent into two groups (0 and 1) | ||
// We have to force each group down different control flow | ||
// Work-items in the "false" group (0) may still be active | ||
if (g.get_group_id() == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this work because we still know that our implementation has at most two groups? Do you agree that if we added some way to create more than two fragments from a function, we'd need to rewrite this as a loop?
fragment
can be further subdivided by binary_partition
, but each fragment
instance will only "know" about its immediate parent.
BTW, why do we need the else
branch here in the first place?
sycl/source/detail/device_impl.hpp
Outdated
CASE(ext_oneapi_fragment) { | ||
// check actual OpenCL extensions: | ||
if (this->getBackend() == backend::opencl) | ||
return has_extension("cl_khr_subgroup_non_uniform_vote") || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use NonUniformBitwiseAnd
and it seems like we need cl_khr_subgroup_non_uniform_arithmetic
for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so leaving for next PR?
// We have to force each group down different control flow | ||
// Work-items in the "false" group (0) may still be active | ||
if (g.get_group_id() == 1) { | ||
return __spirv_GroupNonUniformAny(group_scope<ParentGroup>::value, pred); | ||
} else { | ||
return __spirv_GroupNonUniformAny(group_scope<ParentGroup>::value, pred); | ||
} | ||
// TODO: adding support for fragments have partitioning into more than two |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as requested
// if (g.get_group_id() == i) | ||
// return __spirv_GroupNonUniformAll( | ||
// group_scope<ParentGroup>::value, | ||
// pred); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as requested
// return __spirv_GroupNonUniformBroadcast( | ||
// group_scope<ParentGroup>::value, | ||
// WideOCLX, | ||
// OCLId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as requested
sycl/include/sycl/detail/spirv.hpp
Outdated
@@ -1315,6 +1365,15 @@ ControlBarrier(Group g, memory_scope FenceScope, memory_order Order) { | |||
} else { \ | |||
return __spirv_GroupNonUniform##Instruction(Scope, OpInt, Arg); \ | |||
} \ | |||
// clang-format off \ | |||
/* TODO: add support for partitioning into more than two groups \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as requested
|
||
} // namespace ext::oneapi::experimental | ||
} // namespace _V1 | ||
} // namespace sycl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit;
} // namespace sycl | |
} // namespace sycl | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a right that clang-format shouldn't pass this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have any customizations to our clang-format to not pass these cases, but we normally require it anyway. I wouldn't be against a customization to our clang-format to enforce it though.
inline std::enable_if_t<sycl::is_group_v<std::decay_t<ParentGroup>> && | ||
std::is_same_v<ParentGroup, sycl::sub_group>, | ||
fragment<ParentGroup>> | ||
binary_partition(ParentGroup parent, bool predicate); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#14604 relaxed this to allow ParentGroup
to also allow chunk
and fragment
. Will this be done in a follow-up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tbd in next commit
inline std::enable_if_t<sycl::is_group_v<std::decay_t<ParentGroup>> && | ||
std::is_same_v<ParentGroup, sycl::sub_group>, | ||
chunk<ChunkSize, ParentGroup>> | ||
chunked_partition(ParentGroup parent); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also allow chunk
as a ParentGroup
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tbd in next commit
} // namespace sycl | ||
|
||
// chunk->fragment conversion | ||
// must be defined after fragment class is available |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why the fragment
class is available here and inside the definition of chunk
? Could this definition be moved into the body of fragment
above?
Also, seems like we need to include fragment.hpp
instead of doing a forward declaration. I suspect this works because we coincidentally end up including it at top-level or because instantiations are evaluated after the classes are made available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tbd in next feature commit. currently dealing with CI compilation erros
Co-authored-by: John Pennycook <john.pennycook@intel.com>
It looks like there is something mismatching with namespaces building via includes between Windows part(passed Build + LIT) and Linux (Build + Lit): ![]() ![]() |
|
||
namespace sycl { | ||
inline namespace _V1 { | ||
namespace detail { | ||
|
||
using syclex = ::sycl::ext::oneapi::experimental; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using syclex = ::sycl::ext::oneapi::experimental; | |
namespace syclex = sycl::ext::oneapi::experimental; |
I suspect that will fix the build failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a side note, I am not too keen on defining these sorts of aliases in public headers. Granted, it is in the detail namespace, I fear it will cause us problems down the line.
Also, it's a change that isn't really related to the other changes, so it takes away from the main goal of this patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only linux CI(only) jobs failed are saying that 'experimental' is not part of 'oneapi'
I was just trying to change the path of aimed namespace just at least to reach it on CI
I know it's not related - I'd replace it by working solution. Currently still I haven't done/reproduced such error locally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only linux CI(only) jobs failed are saying that 'experimental' is not part of 'oneapi'
More specifically, it says "error: no type named 'experimental' in namespace 'sycl::ext::oneapi'" which is true, there's no experimental
type in that namespace. 😉
Even if some compilers allow this, we need to use the variants that all our supported compilers allow. I also don't see this kind of using being allowed by the using-directives, but I may also just be reading the reference wrong.
This PR implements the non-uniform groups API redesign proposed in PR 14604
_group
suffix from class names