[Codegen][GPU] Allow iree_gpu.barrier_region to take multiple operands/results #18490
Conversation
So what does it mean to have multiple operations within the region?
I am actually not sure I fully understand why a region is needed and not just a value_barrier.
The main reason is to be able to better represent the barrier semantics of

```mlir
%0 = scf.forall {
  ...
}
scf.forall {
  tensor.extract_slice %0
  ...
}
```

After fusion we currently form something like this:

```mlir
scf.forall {
  %0 = scf.for {
    // %0 body
  }
  iree_gpu.barrier_region %0 {
  ^bb0(%s):
    tensor.extract_slice %s
  }
  ...
}
```

This lowering works out ok because of some later vectorization patterns, but it requires the loop fusion pattern to find an extract_slice, and for that extract_slice to actually be loading the thread-local slice of data. This is too much spooky action at a distance and isn't an accurate representation of the IR before fusion. What we want to generate instead is this:

```mlir
scf.forall {
  %0 = iree_gpu.barrier_region ins(%0_dest) {
    %1 = scf.for {
      // %0 body
    }
    %2 = tensor.insert_slice %1
    iree_gpu.yield %2
  }
  tensor.extract_slice %0
  ...
}
```

The idea is that we have a barrier before and after the body of the fused forall op, instead of before and after the read of the shared memory result. The reason this IR needs to be a region, instead of a value_barrier-based sequence like

```mlir
%0 = scf.forall {
  ...
}
scf.forall {
  %alloc = bufferization.alloc_tensor()
  %0_dest = iree_gpu.value_barrier %alloc
  %1 = scf.for {
    // %0 body
  }
  %2 = tensor.insert_slice %1
  %3 = iree_gpu.value_barrier %2
  tensor.extract_slice %3
  ...
}
```

is that unless we made …
Thanks @qedawkins for the explanation here and offline!
The way that barriers are currently inserted for forall fusion is fragile and tries to model "WaR" conflicts on tensors (kind of). We instead want to put the barrier around the body of the whole scf.forall. See this comment: iree-org#18490 (comment)
[Codegen][GPU] Allow iree_gpu.barrier_region to take multiple operands/results

The restriction to a single input and output was artificial, as this op simply represents synchronization on input and output values. Additionally, this removes the restriction to tensor/vector types, though for the time being this op is still only used with those types.
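For illustration, after this change the op can synchronize several values at once. The snippet below is only a sketch: the operand/result types, slice sizes, and exact textual syntax are made up for the example and may not match the final op assembly format.

```mlir
// Hypothetical multi-operand form: both inputs cross the barrier
// together, and both results are yielded from the single region.
%r:2 = iree_gpu.barrier_region ins(%a, %b : tensor<128xf32>, tensor<128xf32>) {
^bb0(%sa: tensor<128xf32>, %sb: tensor<128xf32>):
  %ea = tensor.extract_slice %sa[0] [16] [1] : tensor<128xf32> to tensor<16xf32>
  %eb = tensor.extract_slice %sb[0] [16] [1] : tensor<128xf32> to tensor<16xf32>
  iree_gpu.yield %ea, %eb : tensor<16xf32>, tensor<16xf32>
} : tensor<16xf32>, tensor<16xf32>
```

With a single region taking multiple block arguments, one barrier pair can guard all of the shared values at once instead of requiring one barrier op per value.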
Force-pushed from e460b43 to 9691229.
@MaheshRavishankar can you take a look when you get a chance? This is blocking a number of follow-up PRs I have.
This itself looks fine, but the vectorization of barrier ops is kind of annoying.
compiler/src/iree/compiler/Codegen/Dialect/GPU/Transforms/Transforms.cpp
Yeah, don't worry, the vectorization pattern is going away (it's broken anyway). I just needed a couple more changes dependent on this one before I could drop it, so I had to make it work for now.