Skip to content

Commit

Permalink
fix ROCm 6.x segfaults on MI50
Browse files Browse the repository at this point in the history
There is some weird interaction between inlining of shfl_xor and the
(otherwise unused) members of thread_block_tile.
The easiest way of working around it is to inline them explicitly as
__shfl_xor(_sync).
  • Loading branch information
upsj committed Aug 24, 2024
1 parent 3d035a0 commit 9c82da0
Showing 1 changed file with 7 additions and 3 deletions.
10 changes: 7 additions & 3 deletions common/cuda_hip/components/sorting.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -113,11 +113,15 @@ struct bitonic_warp {

__forceinline__ __device__ static void merge(ValueType* els, bool reverse)
{
auto tile =
group::tiled_partition<num_threads>(group::this_thread_block());
auto new_reverse = reverse != upper_half();
for (int i = 0; i < num_local; ++i) {
auto other = tile.shfl_xor(els[i], num_threads / 2);
// workaround for ROCm 6.x segfaults on gfx906
#ifdef GKO_COMPILING_CUDA
auto other = __shfl_xor_sync(config::full_lane_mask, els[i],
num_threads / 2);
#else
auto other = __shfl_xor(els[i], num_threads / 2);
#endif
bitonic_cas(els[i], other, new_reverse);
}
half::merge(els, reverse);
Expand Down

0 comments on commit 9c82da0

Please sign in to comment.