Skip to content

[SYCL] Align Non-Uniform Groups with PR#14604: #19238

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 19 commits into
base: sycl
Choose a base branch
from

Conversation

AndreiZibrov
Copy link
Contributor

@AndreiZibrov AndreiZibrov commented Jul 1, 2025

This PR implements the non-uniform groups API redesign proposed in PR 14604

  1. Removed redundant _group suffix from class names
  2. Used more descriptive function names that clearly indicate the partitioning behaviour
  3. United similar functionality to reduce cognitive overhead
  • fixed_size_group → chunk
  • tangle_group → tangle
  • ballot_group → fragment
  • opportunistic_group → merged into fragment
  • get_ballot_group() → binary_partition()
  • get_fixed_size_group() → chunked_partition()
  • get_tangle_group() → entangle()
  • is_user_constructed_updated and HasExtensionWordBoundary tests

WIP: -D_FORTIFY_SOURCE=2
  * fixed_size_group → chunk
  * tangle_group → tangle
  * ballot_group → fragment
  * opportunistic_group → merged into fragment
  * get_ballot_group() → binary_partition()
  * get_fixed_size_group<N>() → chunked_partition<N>()
  * get_tangle_group() → entangle()
  + is_user_constructed_updated and HasExtensionWordBoundary tests
@@ -1433,26 +1433,30 @@ class device_impl : public std::enable_shared_from_this<device_impl> {
CASE(ext_intel_esimd) {
return get_info_impl_nocheck<UR_DEVICE_INFO_ESIMD_SUPPORT>().value_or(0);
}
CASE(ext_oneapi_ballot_group) {
CASE(ext_oneapi_fragment) {
// check actual OpenCL extensions:
Copy link
Contributor Author

@AndreiZibrov AndreiZibrov Jul 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

each introduced entity has own aspect combination in opencl, so checking needed

UPD: left for separate PR

Comment on lines +182 to 185
// fragment partitions its parent into two groups (0 and 1)
// We have to force each group down different control flow
// Work-items in the "false" group (0) may still be active
if (g.get_group_id() == 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised this logic still works now that we can create a fragment using either binary_partition or get_opportunistic_group.

Does this work because we still know that our implementation has at most two groups? Do you agree that if we added some way to create more than two fragments from a function, we'd need to rewrite this as a loop?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work because we still know that our implementation has at most two groups? Do you agree that if we added some way to create more than two fragments from a function, we'd need to rewrite this as a loop?

fragment can be further subdivided by binary_partition, but each fragment instance will only "know" about its immediate parent.

BTW, why do we need the else branch here in the first place?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fragment can be further subdivided by binary_partition, but each fragment instance will only "know" about its immediate parent.

Right, so at each call to binary_partition we're getting back a fragment with at most 2 groups.

I'm just thinking aloud. If we added something like a labeled_partition that returned more than 2 conceptual fragments, I don't think the current code would work. We might want to add a TODO or a note to that effect, is what I'm thinking. A ballot_group could only represent the result of a binary split, which is why it was different before.

BTW, why do we need the else branch here in the first place?

This was the only way I could find to trick Clang into generating code with the correct semantics. I think if we want to replace this with something that's simpler and easier to understand, then we need a new SPIR-V extension.

Here's a very simple example of the sort of code that we have to worry about:

int x = ...;
auto fragment = syclx::binary_partition(sg, sg.get_local_id() % 2);
auto sum = sycl::reduce_over_group(fragment, x, sycl::plus<>());

Each work-item gets its own fragment object, and the information stored inside that object describes which other work-items are part of the same conceptual fragment. But we haven't actually changed anything about the execution model: the hardware will continue to execute all the work-items in the sub-group "together" (typically in lock-step).

If we implement reduce_over_group with no branch, our only option today would be to write something like this:

float reduce_over_group(fragment, float x, BinaryOperation BOp) {
  return __spirv_OpGroupNonUniformFAdd(x, BOp);
}

...which wouldn't work. All the work-items in the same sub-group will execute the same "dynamic instance" of the GroupNonUniformFAdd instruction, and so we'd end up reducing across the whole sub-group, instead of the two individual partitions.

The only way I could find to reduce over separate partitions using existing SPIR-V instructions like GroupNonUniformFAdd was to convert the partitions into control flow. Each partition encounters a unique dynamic instance of the SPIR-V instruction, and the reduction works as expected.

If we had SPIR-V instructions that expected a mask, we could shift that complexity to the device compiler, and the headers would just have to say:

float reduce_over_group(fragment f, float x, BinaryOperation BOp) {
  return __spirv_OpGroupMaskedFAdd(x, BOp, f.Mask);
}

Copy link
Contributor Author

@AndreiZibrov AndreiZibrov Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added TODOs, please confirm it looks as assumed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @Pennycook, I understand why do we need the condition, but I'm still confused about the else branch.

int x = ...;
auto fragment = syclx::binary_partition(sg, sg.get_local_id() % 2);
auto val = sycl::all_of(fragment, x > 42);

What we have right now is:

if (g.get_group_id() == 1) {                                               
  return __spirv_GroupNonUniformAll(group_scope<ParentGroup>::value, pred);
} else {                                                                   
  return __spirv_GroupNonUniformAll(group_scope<ParentGroup>::value, pred);
}                                                                          

Shouldn't it be something like?

if (current work-item is active within the current fragment) {                                               
  return __spirv_GroupNonUniformAll(group_scope<ParentGroup>::value, pred);
} 

I.e. we should check mask embedded into the fragment and work-items which aren't a part of it just do nothing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's really subtle, because SPMD is weird. It took me a long time to wrap my head around this (or, at least, to convince myself that I understood what was happening -- I might still be wrong 😆).

Each fragment object here represents two different groups with different masks. But if we wrote it as:

if (current work-item is active within the current fragment) {
  return __spirv_GroupNonUniformAll(group_scope<ParentGroup>::value, pred);
}

...then there's only one branch, based on one condition. It might help to think of it like this:

bool condition = /* current work-item is active within the current fragment */;
if (condition) {
  return __spirv_GroupNonUniformAll(group_scope<ParentGroup>::value, pred);
}

The compiler isn't smart enough to see that different work-items within the sub-group are computing condition by comparing against different mask values. All it sees is a branch on condition, and condition always evaluates to true (because each work-item will always be in the mask that represents which part of the sub-group it is in).

So, a branch like this effectively doesn't do anything, and we end up reducing over the whole sub-group.

Does that make sense?

Co-authored-by: John Pennycook <john.pennycook@intel.com>
Co-authored-by: John Pennycook <john.pennycook@intel.com>
Comment on lines +182 to 185
// fragment partitions its parent into two groups (0 and 1)
// We have to force each group down different control flow
// Work-items in the "false" group (0) may still be active
if (g.get_group_id() == 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work because we still know that our implementation has at most two groups? Do you agree that if we added some way to create more than two fragments from a function, we'd need to rewrite this as a loop?

fragment can be further subdivided by binary_partition, but each fragment instance will only "know" about its immediate parent.

BTW, why do we need the else branch here in the first place?

CASE(ext_oneapi_fragment) {
// check actual OpenCL extensions:
if (this->getBackend() == backend::opencl)
return has_extension("cl_khr_subgroup_non_uniform_vote") ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use NonUniformBitwiseAnd and it seems like we need cl_khr_subgroup_non_uniform_arithmetic for it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so leaving for next PR?

// We have to force each group down different control flow
// Work-items in the "false" group (0) may still be active
if (g.get_group_id() == 1) {
return __spirv_GroupNonUniformAny(group_scope<ParentGroup>::value, pred);
} else {
return __spirv_GroupNonUniformAny(group_scope<ParentGroup>::value, pred);
}
// TODO: adding support for fragments have partitioning into more than two
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// if (g.get_group_id() == i)
// return __spirv_GroupNonUniformAll(
// group_scope<ParentGroup>::value,
// pred);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// return __spirv_GroupNonUniformBroadcast(
// group_scope<ParentGroup>::value,
// WideOCLX,
// OCLId);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -1315,6 +1365,15 @@ ControlBarrier(Group g, memory_scope FenceScope, memory_order Order) {
} else { \
return __spirv_GroupNonUniform##Instruction(Scope, OpInt, Arg); \
} \
// clang-format off \
/* TODO: add support for partitioning into more than two groups \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AndreiZibrov AndreiZibrov changed the title WIP [SYCL] Align Non-Uniform Groups with PR#14604: [SYCL] Align Non-Uniform Groups with PR#14604: Jul 2, 2025

} // namespace ext::oneapi::experimental
} // namespace _V1
} // namespace sycl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit;

Suggested change
} // namespace sycl
} // namespace sycl

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a right that clang-format shouldn't pass this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have any customizations to our clang-format to not pass these cases, but we normally require it anyway. I wouldn't be against a customization to our clang-format to enforce it though.

inline std::enable_if_t<sycl::is_group_v<std::decay_t<ParentGroup>> &&
std::is_same_v<ParentGroup, sycl::sub_group>,
fragment<ParentGroup>>
binary_partition(ParentGroup parent, bool predicate);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#14604 relaxed this to allow ParentGroup to also allow chunk and fragment. Will this be done in a follow-up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbd in next commit

inline std::enable_if_t<sycl::is_group_v<std::decay_t<ParentGroup>> &&
std::is_same_v<ParentGroup, sycl::sub_group>,
chunk<ChunkSize, ParentGroup>>
chunked_partition(ParentGroup parent);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also allow chunk as a ParentGroup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbd in next commit

} // namespace sycl

// chunk->fragment conversion
// must be defined after fragment class is available
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why the fragment class is available here and inside the definition of chunk? Could this definition be moved into the body of fragment above?

Also, seems like we need to include fragment.hpp instead of doing a forward declaration. I suspect this works because we coincidentally end up including it at top-level or because instantiations are evaluated after the classes are made available.

Copy link
Contributor Author

@AndreiZibrov AndreiZibrov Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbd in next feature commit. currently dealing with CI compilation erros

Co-authored-by: John Pennycook <john.pennycook@intel.com>
@AndreiZibrov
Copy link
Contributor Author

It looks like there is something mismatching with namespaces building via includes between Windows part(passed Build + LIT) and Linux (Build + Lit):
https://github.com/intel/llvm/actions/runs/16032949268/job/45237784966?pr=19238#:~:text=sycl/ext/oneapi/experimental/cuda/non_uniform_algorithms.hpp%3A103%3A46%3A%20error%3A%20use,%23%20%7C%20%20%20103%20%7C%20inline%20__SYCL_ALWAYS_INLINE%20std%3A%3Aenable_if_t%3Csyclex%3A%3Ais_chunk_v%3CGroup%3E%2C%20T%3E it says that the ext::oneapi has no experimental in it. I've tried relative/system path type for includes and different combinations in latest commits... so leaving for later

image image


namespace sycl {
inline namespace _V1 {
namespace detail {

using syclex = ::sycl::ext::oneapi::experimental;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
using syclex = ::sycl::ext::oneapi::experimental;
namespace syclex = sycl::ext::oneapi::experimental;

I suspect that will fix the build failures.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a side note, I am not too keen on defining these sorts of aliases in public headers. Granted, it is in the detail namespace, I fear it will cause us problems down the line.

Also, it's a change that isn't really related to the other changes, so it takes away from the main goal of this patch.

Copy link
Contributor Author

@AndreiZibrov AndreiZibrov Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only linux CI(only) jobs failed are saying that 'experimental' is not part of 'oneapi'
I was just trying to change the path of aimed namespace just at least to reach it on CI

I know it's not related - I'd replace it by working solution. Currently still I haven't done/reproduced such error locally

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only linux CI(only) jobs failed are saying that 'experimental' is not part of 'oneapi'

More specifically, it says "error: no type named 'experimental' in namespace 'sycl::ext::oneapi'" which is true, there's no experimental type in that namespace. 😉

Even if some compilers allow this, we need to use the variants that all our supported compilers allow. I also don't see this kind of using being allowed by the using-directives, but I may also just be reading the reference wrong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants