Use alloca instead of malloc for 0-d memrefs? Getting scalar replacement working. #65
Description
One way to hoist loop-invariant memref loads/stores and replace such subscripted accesses with scalars is to introduce 0-d memrefs (load from / store to the main memref through those 0-d memrefs, then access the latter inside the loop), and rely on LLVM's mem2reg to eliminate the lowered counterparts of those 0-d memrefs, substituting (scalar) SSA values instead. However, the current lowering of MLIR's alloc to LLVM's malloc (in the -lower-to-llvm pass) gets in the way; this is trivially fixable as shown below.
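For context, here is a minimal sketch of what the pre-hoisting input could look like (my reconstruction based on the example that follows, not the actual test case; the load of %A[%c0] inside the loop is the loop-invariant access to be hoisted):

func @foo() {
  // Reconstructed pre-hoisting form (illustrative only).
  %c0 = constant 0 : index
  %cf0 = constant 0.0 : f32
  %A = alloc() : memref<1024xf32>
  affine.for %i = 0 to 1024 {
    // Loop invariant: the subscript %c0 does not depend on %i.
    %v = load %A[%c0] : memref<1024xf32>
    %r = addf %v, %cf0 : f32
  }
  return
}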
Here's simple MLIR obtained after hoisting a loop-invariant load (assume that the 0-d memref %s was automatically introduced by a pass/utility that performed the hoisting of a loop-invariant load on the original %A):
func @foo() {
  %c0 = constant 0 : index
  %cf0 = constant 0.0 : f32
  %A = alloc() : memref<1024xf32>
  // 0-d memref introduced to hold the hoisted loop-invariant value.
  %s = alloc() : memref<f32>
  %v = load %A[%c0] : memref<1024xf32>
  store %v, %s[] : memref<f32>
  affine.for %i = 0 to 1024 {
    %w = load %s[] : memref<f32>
    %r = addf %w, %cf0 : f32
  }
  return
}
The current lowering to LLVM yields:
(-lower-affine -lower-to-llvm | mlir-translate -mlir-to-llvmir)
; ModuleID = 'LLVMDialectModule'
source_filename = "LLVMDialectModule"
declare i8* @malloc(i64)
declare void @free(i8*)
define void @foo() {
  %1 = call i8* @malloc(i64 4096)
  %2 = bitcast i8* %1 to float*
  %3 = call i8* @malloc(i64 4)
  %4 = bitcast i8* %3 to float*
  %5 = getelementptr float, float* %2, i64 0
  %6 = load float, float* %5
  store float %6, float* %4
  br label %7

7:                                                ; preds = %10, %0
  %8 = phi i64 [ %13, %10 ], [ 0, %0 ]
  %9 = icmp slt i64 %8, 1024
  br i1 %9, label %10, label %14

10:                                               ; preds = %7
  %11 = load float, float* %4
  %12 = fadd float %11, 0.000000e+00
  %13 = add i64 %8, 1
  br label %7

14:                                               ; preds = %7
  ret void
}
Running 'mem2reg' on this doesn't yield the desired replacement (expected, since mem2reg only promotes allocas?).
However, using an alloca for the single-element memref makes the whole thing work.
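Concretely, here is my reconstruction of the hand-edited input fed to opt below (not the verbatim contents of /tmp/test4.mlir): only the bitcast feeding the 0-d memref's accesses is replaced by an alloca, and the now-dead malloc(4) is left in place.

; Reconstructed hand-modified input (assumption - not the author's verbatim file).
declare i8* @malloc(i64)
declare void @free(i8*)

define void @foo() {
  %1 = call i8* @malloc(i64 4096)
  %2 = bitcast i8* %1 to float*
  %3 = call i8* @malloc(i64 4)        ; now dead, but mem2reg won't remove calls
  %4 = alloca float                   ; replaces the bitcast of %3
  %5 = getelementptr float, float* %2, i64 0
  %6 = load float, float* %5
  store float %6, float* %4
  br label %7

7:                                                ; preds = %10, %0
  %8 = phi i64 [ %13, %10 ], [ 0, %0 ]
  %9 = icmp slt i64 %8, 1024
  br i1 %9, label %10, label %14

10:                                               ; preds = %7
  %11 = load float, float* %4
  %12 = fadd float %11, 0.000000e+00
  %13 = add i64 %8, 1
  br label %7

14:                                               ; preds = %7
  ret void
}

mem2reg then promotes %4 and forwards the hoisted load directly into the loop: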
$ opt -S /tmp/test4.mlir -mem2reg
; ModuleID = '/tmp/test4.mlir'
source_filename = "LLVMDialectModule"
declare i8* @malloc(i64)
declare void @free(i8*)
define void @foo() {
  %1 = call i8* @malloc(i64 4096)
  %2 = bitcast i8* %1 to float*
  %3 = call i8* @malloc(i64 4)
  %4 = getelementptr float, float* %2, i64 0
  %5 = load float, float* %4
  br label %6

6:                                                ; preds = %9, %0
  %7 = phi i64 [ %11, %9 ], [ 0, %0 ]
  %8 = icmp slt i64 %7, 1024
  br i1 %8, label %9, label %12

9:                                                ; preds = %6
  %10 = fadd float %5, 0.000000e+00
  %11 = add i64 %7, 1
  br label %6

12:                                               ; preds = %6
  ret void
}
One option here would be to have AllocOpLowering use 'alloca' for 0-d memrefs (they have a single element by definition) instead of a malloc + bitcast, but this would require the lowering to reason about lifetimes - a concern that shouldn't be part of the lowering. Instead, one would want a pass/utility that replaces allocs with llvm.alloca where appropriate, but that isn't "lowering path neutral" and thus not reusable infrastructure. Should there instead be a new standard alloca op for stack allocation, so that a pass/utility could rewrite allocs to it where appropriate, or so that such allocas could be generated in the first place instead of allocs? The current alloc op doesn't say where the memory gets allocated from.
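As a rough sketch (assuming a hypothetical standard alloca op with alloc-like syntax; the op name and form are illustrative, not something that exists today), the hoisting utility could then produce:

  // Hypothetical stack-allocation op (illustrative name/syntax, not an existing op);
  // the rest of the function stays as in the example at the top.
  %s = alloca() : memref<f32>

The -lower-to-llvm pass could map this directly to llvm.alloca, i.e. an LLVM 'alloca float' for the single element instead of the malloc(4) + bitcast pair, leaving mem2reg free to promote it.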
(On a side note, another way to accomplish such scalar replacement of accesses is to leverage support for loop live-out scalars when the IR supports it, but there are still good reasons for doing this via single-element memrefs and then relying on mem2reg, and for being able to go from one form to the other within MLIR itself.)