[Codegen][GPU] Allow iree_gpu.barrier_region to take multiple operands/results #18490
Conversation
So what does it mean to have multiple operations within the region?
I am actually not sure I fully understand why a region is needed and not just a value_barrier.
The main reason is to be able to better represent the barrier semantics of

```mlir
%0 = scf.forall {
  ...
}
scf.forall {
  tensor.extract_slice %0
  ...
}
```

After fusion we currently form something like this:

```mlir
scf.forall {
  %0 = scf.for {
    // %0 body
  }
  iree_gpu.barrier_region %0 {
  ^bb0(%s):
    tensor.extract_slice %s
  }
  ...
}
```

This lowering works out ok because of some later vectorization patterns, but it requires the loop fusion pattern to find an extract_slice, and for that extract_slice to actually be loading the thread-local slice of data. This is too much spooky action at a distance and isn't an accurate representation of the IR before fusion. What we want to generate instead is this:

```mlir
scf.forall {
  %0 = iree_gpu.barrier_region ins(%0_dest) {
    %1 = scf.for {
      // %0 body
    }
    %2 = tensor.insert_slice %1
    iree_gpu.yield %2
  }
  tensor.extract_slice %0
  ...
}
```

The idea is that we have a barrier before and after the body of the fused forall op, instead of before and after the read of the shared memory result. The reason this IR needs to be a region, instead of a value_barrier-based sequence like

```mlir
%0 = scf.forall {
  ...
}
scf.forall {
  %alloc = bufferization.alloc_tensor()
  %0_dest = iree_gpu.value_barrier %alloc
  %1 = scf.for {
    // %0 body
  }
  %2 = tensor.insert_slice %1
  %3 = iree_gpu.value_barrier %2
  tensor.extract_slice %3
  ...
}
```

is that unless we made …
Thanks @qedawkins for the explanation here and offline!
The way that barriers are currently inserted for forall fusion is fragile and tries to model "WaR" conflicts on tensors (kind of). We instead want to put the barrier around the body of the whole scf.forall. See this comment: iree-org#18490 (comment)
[Codegen][GPU] Allow iree_gpu.barrier_region to take multiple operands/results

The restriction to a single input and output was artificial, as this op simply represents synchronization on input and output values. Additionally, this removes the restriction to tensor/vector types, though for the time being this op is still only used with those types.
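For illustration, after this change the op can synchronize several values at once. The snippet below is only a sketch: the operand/result types, slice sizes, and exact textual syntax are made up for the example and may not match the final op assembly format.

```mlir
// Hypothetical multi-operand form: both inputs cross the barrier
// together, and both results are yielded from the single region.
%r:2 = iree_gpu.barrier_region ins(%a, %b : tensor<128xf32>, tensor<128xf32>) {
^bb0(%sa: tensor<128xf32>, %sb: tensor<128xf32>):
  %ea = tensor.extract_slice %sa[0] [16] [1] : tensor<128xf32> to tensor<16xf32>
  %eb = tensor.extract_slice %sb[0] [16] [1] : tensor<128xf32> to tensor<16xf32>
  iree_gpu.yield %ea, %eb : tensor<16xf32>, tensor<16xf32>
} : tensor<16xf32>, tensor<16xf32>
```

With a single region taking multiple block arguments, one barrier pair can guard all of the shared values at once instead of requiring one barrier op per value.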
Force-pushed from e460b43 to 9691229.
@MaheshRavishankar can you take a look when you get a chance? This is blocking a number of follow-up PRs I have.
This itself looks fine, but the vectorization of barrier ops is kind of annoying.
compiler/src/iree/compiler/Codegen/Dialect/GPU/Transforms/Transforms.cpp
Yeah, don't worry, the vectorization pattern is going away (it's broken anyway). I just needed a couple more changes dependent on this one before I could drop it, so I had to make it work for now.