Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Codegen][GPU] Allow iree_gpu.barrier_region to take multiple operands/results #18490

Merged

Conversation

qedawkins
Copy link
Contributor

The restriction to a single input and output was artificial as this op simply represents synchronization on input and output values. Additionally this removes the restriction on tensor/vector types, but for the time being this op is still only used with those types.

Copy link
Contributor

@MaheshRavishankar MaheshRavishankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what this mean to have multiple operations within the region?

I am actually not sure I fully understand why a region is needed and not just a value_barrier.

@qedawkins
Copy link
Contributor Author

The main reason is to be able to better represent the barrier semantics of scf.forall after doing loop fusion. Currently we start with something like this

%0 = scf.forall {
  ...
}
scf.forall {
  tensor.extract_slice %0
  ...
}

And after fusion we form something like this

scf.forall {
  %0 = scf.for {
    // %0 body
  }
  iree_gpu.barrier_region %0 {
   ^bb0(%s)
    tensor.extract_slice %s
  }
  ...
}

(this actually has a bug because tensor.extract_slice is not guaranteed to bufferize to a read-effecting op, it typically just bufferizes to a memref.subview, meaning the actual read can happen later on after the barrier, but that is a bug with the loop fusion pattern).

This lowering works out ok because of some later vectorization patterns, but this requires the loop fusion pattern to find an extract_slice and for that extract_slice to actually be loading the thread local slice of data. This is too much spooky action at a distance and isn't an accurate representation of the IR before fusion. What we want to generate is this

scf.forall {
  %0 = iree_gpu.barrier_region ins(%0_dest) {
    %1 = scf.for {
      // %0 body
    }
    %2 = tensor.insert_slice %1
    iree_gpu.yield %2
  }
  tensor.extract_slice %0
  ...
}

Where the idea is that we have a barrier before and after the body of the fused forall op, instead of before and after the read of the shared memory result.

The reason this IR needs to be a region instead of an iree_gpu.value_barrier is that the value barrier would look something like this

%0 = scf.forall {
  ...
}
scf.forall {
  %alloc = bufferization.alloc_tensor()
  %0_dest = iree_gpu.value_barrier %alloc
  %1 = scf.for {
    // %0 body
  }
  %2 = tensor.insert_slice %1
  %3 = iree_gpu.value_barrier %2
  tensor.extract_slice %3
  ...
}

Where unless we made iree_gpu.value_barrier non-speculatable, would produce incorrect IR due to hoisting of the value_barrier out of parent loops.

@MaheshRavishankar
Copy link
Contributor

Thanks @qedawkins for the explanation here and offline!

qedawkins added a commit to qedawkins/iree that referenced this pull request Sep 13, 2024
The way that barriers are currently inserted for forall fusion is
fragile and trying to model "WaR" conflicts on tensors (kind of). We
instead want to put the barrier around the body of the whole scf.forall.

See this comment: iree-org#18490 (comment)
…s/results

The restriction to a single input and output was artificial as this op
simply represents synchronization on input and output values.
Additionally this removes the restriction on tensor/vector types, but
for the time being this op is still only used with those types.
@qedawkins qedawkins force-pushed the barrier_region_multi_operands_and_results branch from e460b43 to 9691229 Compare September 13, 2024 16:27
qedawkins added a commit to qedawkins/iree that referenced this pull request Sep 17, 2024
The way that barriers are currently inserted for forall fusion is
fragile and trying to model "WaR" conflicts on tensors (kind of). We
instead want to put the barrier around the body of the whole scf.forall.

See this comment: iree-org#18490 (comment)
@qedawkins
Copy link
Contributor Author

@MaheshRavishankar can you take a look when you get a chance? This is blocking a number of follow up PRs I have.

Copy link
Contributor

@MaheshRavishankar MaheshRavishankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THis itself looks fine, but the vectorization of barrier ops is kind of annoying.

@qedawkins
Copy link
Contributor Author

Yeah don't worry the vectorization pattern is going away (it's broken anyway). I just needed a couple more changes dependent on this change before I could drop it so I had to make it work for now.

@qedawkins qedawkins merged commit c9eca66 into iree-org:main Sep 19, 2024
37 checks passed
@qedawkins qedawkins deleted the barrier_region_multi_operands_and_results branch September 19, 2024 14:16
qedawkins added a commit to qedawkins/iree that referenced this pull request Sep 19, 2024
The way that barriers are currently inserted for forall fusion is
fragile and trying to model "WaR" conflicts on tensors (kind of). We
instead want to put the barrier around the body of the whole scf.forall.

See this comment: iree-org#18490 (comment)
qedawkins added a commit to qedawkins/iree that referenced this pull request Sep 20, 2024
The way that barriers are currently inserted for forall fusion is
fragile and trying to model "WaR" conflicts on tensors (kind of). We
instead want to put the barrier around the body of the whole scf.forall.

See this comment: iree-org#18490 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants