[Flow] Generalize horizontal contraction fusion to cover more cases. #17880
Conversation
I tried to split this work in two; it might help to review the first and second commits separately. The first is mostly a move of the pass from GlobalOpt -> Flow. The second is a functional change to the pass. (Still need to add a test for the enhancements, working on it.)
Please no :)
As I noted on @IanWood1's PR the other day, the fact that isBitTruncate is in RegionOpUtils is a bug - it should be in LinalgExt Utils/ - and this move seems to be motivated by its being in RegionOpUtils.
We want more things like this pass moved out of flow, not into it. As a general rule we really need to move towards a default where, unless something is operating on flow ops, it should not be in flow - something doing global optimizations on the program with tensor/linalg ops should be before flow, in global opt.
I saw the notification but didn't respond to it.
RE 1: Ian wanted to use Flow's RegionOpUtils - a file for utilities related to flow dispatch regions - in common Util dialect code operating strictly on linalg ops (not flow ops). Here you're using the same file in global opt, again operating strictly on linalg ops (not flow ops). So the reason is the same in both cases: whether a linalg op has a certain trait is something related to linalg, not to flow, and anywhere we may want to ask about that linalg op should not have to pull in RegionOpUtils or depend on the flow dialect. Upstream in the linalg dialect would be a logical place for it, but I know how tricky that can be, so LinalgExt (our place for stuff not in upstream linalg) is the closest appropriate place.

RE 2 makes sense to me, and as you say I think what we need to do is lop off the entire flow pipeline dealing with linalg prior to dispatch region formation and move that to global opt. If this is a transient step because of that then that's unfortunate but a good enough reason. We really should work towards moving all this out, though - the flow pipeline should essentially start after dispatch region formation, with the input to flow being formed dispatch regions. I suspect we want a new pipeline that sits between global opt and flow that is dedicated to dispatch region formation, but I don't know where to put it - global opt feels more appropriate as a staging ground. My hope is that at some point we start by splitting the passes into two pipelines inside of flow, then move all the passes in that first pipeline forming dispatch regions out wholesale. That'd be really good cleanup and would help avoid situations like this where something really belongs in global opt but can't live there due to things being in flow.
Found Ian's PR. I am happy to rebase on top of it and land this after that.
Yes, I think that's a good idea. I started doing this but haven't fully finished yet. Once I get some time to breathe, I can clean it up and improve its documentation. It would be really nice to have a pipeline saying "--iree-flow-form-default-dispatch-regions" that just does what it does today, and we can have others if need be. It is transient, but I will try to finish that up before I switch to the next thing. I am scared of GlobalOpt; it has become something of a dumping ground.
If we plan to do a split like this, I think it would be worth putting up an issue describing what we plan to gain by splitting.
I agree, and it seems that Mahesh agrees, but I don't think it will be easy to make progress without writing out the why somewhere. Echoing Benoit's statement in another thread, issues are a better place for broad design discussion.

One aspect of what I think @benvanik you're asking for is why we want to invest so heavily in splitting dependencies/adding interfaces. What I've gathered from public discussions is that much of the reason for preferring interfaces over a dependence, where possible, is to allow for plugging in downstream dialects. I haven't seen a design doc that describes what a downstream dialect would need to/could do to plug in to IREE, and adding a doc like that could be an excellent thing to point at in discussions like this (maybe something like that exists and I missed it).

There also is a certain degree of not having a good place to put things related to upstream dialects. If IREE is intended as an upstream (with an actual e2e flow instead of just a basket of lit tests), then we need to define a set of critical interfaces/transformations a dialect needs to implement for IREE to know how to work with it. I've seen good examples of this for Stream/HAL, but I haven't seen the same for the kinds of transformations Flow does (dispatch region formation). I see that as a good prerequisite to splitting apart Flow's transforms and also to starting to unwind the GlobalOptimization kitchen sink.
Force-pushed from eac0af6 to 16f7ebf
Force-pushed from 6947ab0 to fa85e73
Marking this ready for review. Also, this is in Flow for now, but it will need to be moved to a separate pass pipeline that deals with dispatch region formation. See #18063 (comment)
The regression test is failing because of a dominance error during the pass. Consider this example (actual MLIR here: IanWood1@de471fd):
Edit: The actual issue is the use-def relationship between the ops. %1's operand would get moved above the fused op if there was no dependency (this is tested in
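To make the pattern concrete, here is a minimal hypothetical reduction of that kind of use-def chain (op names and shapes are invented, not taken from the linked repro): the second contraction transitively consumes the first group's truncation, so a fused op hoisted above the first contraction would use a value before its definition.

```mlir
#map = affine_map<(d0, d1) -> (d0, d1)>
// Hypothetical sketch: horizontally fusing %m0 and %m1 is invalid if
// the fused op is hoisted above %m0, because %m1's LHS (%trunc) is
// defined only after %m0.
func.func @use_def_chain(%a: tensor<4x4xf32>, %b: tensor<4x4xf32>,
    %c: tensor<4x4xf16>, %init32: tensor<4x4xf32>,
    %init16: tensor<4x4xf16>) -> tensor<4x4xf16> {
  %m0 = linalg.matmul ins(%a, %b : tensor<4x4xf32>, tensor<4x4xf32>)
      outs(%init32 : tensor<4x4xf32>) -> tensor<4x4xf32>
  // Bit-truncating consumer of the first contraction.
  %trunc = linalg.generic {indexing_maps = [#map, #map],
      iterator_types = ["parallel", "parallel"]}
      ins(%m0 : tensor<4x4xf32>) outs(%init16 : tensor<4x4xf16>) {
  ^bb0(%in: f32, %out: f16):
    %t = arith.truncf %in : f32 to f16
    linalg.yield %t : f16
  } -> tensor<4x4xf16>
  // The second contraction transitively depends on the first one.
  %m1 = linalg.matmul ins(%trunc, %c : tensor<4x4xf16>, tensor<4x4xf16>)
      outs(%init16 : tensor<4x4xf16>) -> tensor<4x4xf16>
  return %m1 : tensor<4x4xf16>
}
```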
Thanks Ian! Could you post a snippet of the failing IR?
It's actually the truncation ops that have the dependency in this case: https://gist.github.com/IanWood1/f4ab6642601c6e005187f6e8e3c82a05

%3390 = linalg.matmul_transpose_b ins(%65, %_params.unet.up_blocks.2.resnets.0.time_emb_proj.weight : tensor<2x1280xf16>, tensor<320x1280xf16>) outs(%66 : tensor<2x320xf32>) -> tensor<2x320xf32>
%3391 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d3, d1)>, affine_map<(d0, d1, d2, d3) -> (d1)>, affine_map<(d0, d1, d2, d3) -> (d0, d1)>, affine_map<(d0, d1, d2, d3) -> (d1)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%3389, %__hoisted_tensor_320xf32_371, %3390, %__hoisted_tensor_320xf32_372 : tensor<2x128x128x320xf32>, tensor<320xf32>, tensor<2x320xf32>, tensor<320xf32>) outs(%49 : tensor<2x320x128x128xf16>) {
^bb0(%in: f32, %in_2002: f32, %in_2003: f32, %in_2004: f32, %out: f16):
%3459 = arith.addf %in_2003, %in_2004 : f32
%3460 = arith.addf %in, %in_2002 : f32
%3461 = arith.truncf %3459 : f32 to f16
%3462 = arith.truncf %3460 : f32 to f16
%3463 = arith.addf %3462, %3461 : f16
linalg.yield %3463 : f16
} -> tensor<2x320x128x128xf16>
// omitted long chain of IR that connects %3391 and %3411
// ...
%3411 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>} ins(%inserted_slice_1973, %__hoisted_tensor_3x3x640x320xf16 : tensor<2x130x130x640xf16>, tensor<3x3x640x320xf16>) outs(%45 : tensor<2x128x128x320xf32>) -> tensor<2x128x128x320xf32>
%3412 = linalg.matmul_transpose_b ins(%65, %_params.unet.up_blocks.2.resnets.1.time_emb_proj.weight : tensor<2x1280xf16>, tensor<320x1280xf16>) outs(%66 : tensor<2x320xf32>) -> tensor<2x320xf32>
%3413 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d3, d1)>, affine_map<(d0, d1, d2, d3) -> (d1)>, affine_map<(d0, d1, d2, d3) -> (d0, d1)>, affine_map<(d0, d1, d2, d3) -> (d1)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%3411, %__hoisted_tensor_320xf32_376, %3412, %__hoisted_tensor_320xf32_377 : tensor<2x128x128x320xf32>, tensor<320xf32>, tensor<2x320xf32>, tensor<320xf32>) outs(%49 : tensor<2x320x128x128xf16>) {
^bb0(%in: f32, %in_2002: f32, %in_2003: f32, %in_2004: f32, %out: f16):
%3459 = arith.addf %in_2003, %in_2004 : f32
%3460 = arith.addf %in, %in_2002 : f32
%3461 = arith.truncf %3459 : f32 to f16
%3462 = arith.truncf %3460 : f32 to f16
%3463 = arith.addf %3462, %3461 : f16
linalg.yield %3463 : f16
} -> tensor<2x320x128x128xf16>
Force-pushed from 1287aa6 to 782178f
Thanks @IanWood1. Took your snippet and created a stand-alone repro for it.
I can reproduce the error on this. Looking more.
Current implementation of horizontal fusion worked only for contraction (-> truncf)? cases. To generalize it to more cases, the following changes were needed:
1) Instead of looking for a `linalg.generic` with a single `arith.truncf`, use the `isBitTruncate` utility method.
2) Use `OperationEquivalence` to check that the contraction operations are "similar" and that any subsequent truncation operations are "similar" too.
3) Instead of trying to find an insertion point based on an existing dominance relationship between operations:
   - Always insert the horizontally fused contraction before the first contraction.
   - Always insert the horizontally fused truncation operation before the first truncation operation.
4) Instead of generating the fills/empty along with the horizontally fused operations, use separate patterns to fold concats of fills and concats of emptys into a fill and an empty, respectively.

Signed-off-by: MaheshRavishankar <mahesh.ravishankar@gmail.com>
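As a hypothetical illustration of change 4 (the function name and all shapes here are invented), a concat of two identical zero-fills can fold into a single fill of a single larger tensor.empty:

```mlir
// Hypothetical before/after of the concat-of-fills folding.
func.func @fold_concat_of_fills() -> tensor<4x4xf32> {
  %zero = arith.constant 0.0 : f32
  %e0 = tensor.empty() : tensor<2x4xf32>
  %e1 = tensor.empty() : tensor<2x4xf32>
  %f0 = linalg.fill ins(%zero : f32) outs(%e0 : tensor<2x4xf32>) -> tensor<2x4xf32>
  %f1 = linalg.fill ins(%zero : f32) outs(%e1 : tensor<2x4xf32>) -> tensor<2x4xf32>
  // The pattern would rewrite this concat to:
  //   %e   = tensor.empty() : tensor<4x4xf32>
  //   %cat = linalg.fill ins(%zero : f32) outs(%e : tensor<4x4xf32>) -> tensor<4x4xf32>
  %cat = tensor.concat dim(0) %f0, %f1
      : (tensor<2x4xf32>, tensor<2x4xf32>) -> tensor<4x4xf32>
  return %cat : tensor<4x4xf32>
}
```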
Force-pushed from e187f32 to 4c80b8f
@saienduri after this PR lands, the thresholds may need adjusting.
@benvanik could you take a look again? Once this and a few others land, we can move these all out of Flow.
Current implementation of horizontal fusion worked only for contraction (-> truncf)? cases. To generalize it to more cases, the following changes were needed (a hypothetical before/after sketch follows this description):

1. Instead of looking for a `linalg.generic` with a single `arith.truncf`, use the `isBitTruncate` utility method. To enable this the pass is moved to `Flow`.
2. Use `OperationEquivalence` to check that the contraction operations are "similar" and that any subsequent truncation operations are "similar" too.
3. Instead of trying to find an insertion point based on an existing dominance relationship between operations:
   - Always insert the horizontally fused contraction before the first contraction.
   - Always insert the horizontally fused truncation operation before the first truncation operation.
4. Instead of generating the fills/empty along with the horizontally fused operations, use separate patterns to fold concats of fills and concats of emptys into a fill and an empty, respectively.
Signed-off-by: MaheshRavishankar <mahesh.ravishankar@gmail.com>
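For concreteness, here is a hypothetical before/after sketch of the fusion described above (all names, shapes, and the choice of concatenation dimension are invented for illustration; the actual pass handles general contraction ops and their truncations):

```mlir
// Before (hypothetical): two "similar" matmuls sharing %b, each with
// its own zero-filled accumulator (and, not shown, a similar
// bit-truncating consumer).
func.func @before(%a0: tensor<2x8xf32>, %a1: tensor<2x8xf32>,
    %b: tensor<8x4xf32>) -> (tensor<2x4xf32>, tensor<2x4xf32>) {
  %zero = arith.constant 0.0 : f32
  %e0 = tensor.empty() : tensor<2x4xf32>
  %e1 = tensor.empty() : tensor<2x4xf32>
  %f0 = linalg.fill ins(%zero : f32) outs(%e0 : tensor<2x4xf32>) -> tensor<2x4xf32>
  %f1 = linalg.fill ins(%zero : f32) outs(%e1 : tensor<2x4xf32>) -> tensor<2x4xf32>
  %m0 = linalg.matmul ins(%a0, %b : tensor<2x8xf32>, tensor<8x4xf32>)
      outs(%f0 : tensor<2x4xf32>) -> tensor<2x4xf32>
  %m1 = linalg.matmul ins(%a1, %b : tensor<2x8xf32>, tensor<8x4xf32>)
      outs(%f1 : tensor<2x4xf32>) -> tensor<2x4xf32>
  return %m0, %m1 : tensor<2x4xf32>, tensor<2x4xf32>
}

// After (hypothetical): the differing operands are concatenated and a
// single fused matmul is inserted where the first contraction was.
// The concat of the two fills folds into one fill of one
// tensor.empty (change 4), and the results are sliced back out.
func.func @after(%a0: tensor<2x8xf32>, %a1: tensor<2x8xf32>,
    %b: tensor<8x4xf32>) -> (tensor<2x4xf32>, tensor<2x4xf32>) {
  %zero = arith.constant 0.0 : f32
  %lhs = tensor.concat dim(0) %a0, %a1
      : (tensor<2x8xf32>, tensor<2x8xf32>) -> tensor<4x8xf32>
  %e = tensor.empty() : tensor<4x4xf32>
  %f = linalg.fill ins(%zero : f32) outs(%e : tensor<4x4xf32>) -> tensor<4x4xf32>
  %m = linalg.matmul ins(%lhs, %b : tensor<4x8xf32>, tensor<8x4xf32>)
      outs(%f : tensor<4x4xf32>) -> tensor<4x4xf32>
  %m0 = tensor.extract_slice %m[0, 0] [2, 4] [1, 1]
      : tensor<4x4xf32> to tensor<2x4xf32>
  %m1 = tensor.extract_slice %m[2, 0] [2, 4] [1, 1]
      : tensor<4x4xf32> to tensor<2x4xf32>
  return %m0, %m1 : tensor<2x4xf32>, tensor<2x4xf32>
}
```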