
Compiling for llvm-cpu without targeting a specific CPU is a bad experience #18561

Open
stellaraccident opened this issue Sep 19, 2024 · 30 comments
Assignees
Labels
codegen/llvm LLVM code generation compiler backend documentation ✏️ Improvements or additions to documentation

Comments

@stellaraccident
Collaborator

I've seen multiple people falling down this hole: they run iree-compile on their model, targeting CPU. Then they get performance that is 10x-100x off of any reasonable expectation. Then they either go away silently or report back about poor experiences (not always reporting flags and such).

There are good reasons why a compiler like IREE shouldn't make assumptions about what the CPU target is, but on the other hand, not specifying a target CPU will almost always produce a grossly subpar experience, since the generic target (at least on x86) lacks so many features as to be basically useless for any high-performance numerics.

I've even fallen down this hole recently and had to go remember the incantation to select a specific CPU. In the case I was working on (an f16 CPU LLM), performance was 100x different between not specifying a target CPU and specifying "host". We need to guide people better than this.
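For reference, the two invocations differ by a single flag (the model file name here is illustrative; the flags are the real ones discussed in this thread):

```shell
# Generic target (slow): no CPU specified, so codegen falls back to a
# baseline ISA lacking the vector extensions needed for fast numerics.
iree-compile model.mlir \
  --iree-hal-target-backends=llvm-cpu \
  -o model_generic.vmfb

# Host target (fast): tell LLVM to target the CPU the compiler runs on.
iree-compile model.mlir \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu=host \
  -o model_host.vmfb
```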

Proposal

As mentioned, there are good reasons for a compiler not to make too many assumptions without being told what to do. But I think we can and should actively warn, possibly with a link to the documentation site, when the compiler is invoked without the user specifying a target CPU. Whatever the warning is, it should be very explicit that the user should pass --iree-llvmcpu-target-cpu=host to target the precise CPU they are running on. We should possibly also accept "generic" or something like it for when the user really wants to target the default and not get the warning. I basically want to guard against the case where the user has not specified anything and the compiler silently generates code that is 100x too slow. In almost all cases it will be better for the user to say something, and we should guide them toward a proper choice.
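A sketch of what this could look like on the command line (the warning wording is hypothetical, not an existing diagnostic):

```shell
$ iree-compile model.mlir --iree-hal-target-backends=llvm-cpu -o model.vmfb
# warning: compiling for llvm-cpu without --iree-llvmcpu-target-cpu; using a
# generic CPU target, which is usually far slower than targeting a real CPU.
# Pass --iree-llvmcpu-target-cpu=host to target this machine, or
# --iree-llvmcpu-target-cpu=generic to accept the default and silence this.
# See https://iree.dev/guides/deployment-configurations/cpu/ for details.
```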

@ScottTodd
Member

We need to guide people better than this.

Yes please! #15487

@ScottTodd ScottTodd added documentation ✏️ Improvements or additions to documentation codegen/llvm LLVM code generation compiler backend labels Sep 20, 2024
@stellaraccident
Collaborator Author

Let's stop overthinking this and do something simple like I suggest. Open to other options, but I would like to see this improved.

@ScottTodd
Member

The suggested proposal SGTM. I might even want to default to having LLVM use the current host for the target CPU and available features (if those are different), then have users explicitly pass "generic" for the lowest common denominator.

We could apply similar logic to the GPU backends - try to detect devices on the system (shell out to vulkaninfo / rocm-smi / nvidia-smi?) and default to what is available, but still support cross compilation with explicit device info and a "generic" target where possible.

@benvanik
Collaborator

benvanik commented Sep 20, 2024

Yuck - that is not a cheap thing to do and has a high risk of flakes - I am still not sure why proper documentation is insufficient? You must specify your target device (--iree-hal-target-device=) when compiling so also specifying a "use my hardware info" is fine. Just change the documentation to include both flags and then a user has a choice and knows what to do if they want to change things. Anything automatic is going to have issues. Users aren't coming in to iree-compile command line invocations blind - if all the docs specify the flag and they choose not to copy/paste it that's on them.

@benvanik
Collaborator

Note that this is also what clang does - https://clang.llvm.org/docs/HIPSupport.html - you must pass --offload-arch= to compile HIP code and if you want the native host target you must pass --offload-arch=native. nvcc does the same thing - you pass -arch=[some gpu] or -arch=native.
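Concretely, both toolchains require the target to be stated rather than guessed (architecture names here, gfx90a and sm_80, are just examples):

```shell
# HIP with clang: the offload architecture must be named explicitly...
clang++ -x hip saxpy.cpp --offload-arch=gfx90a -o saxpy
# ...or resolved from the GPU present on the build machine.
clang++ -x hip saxpy.cpp --offload-arch=native -o saxpy

# nvcc is analogous: name an architecture, or ask for the host's GPU.
nvcc saxpy.cu -arch=sm_80 -o saxpy
nvcc saxpy.cu -arch=native -o saxpy
```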

@benvanik
Collaborator

(if there's such a big concern about documentation not fixing this issue then I'd be ok with making compilation fail if the user doesn't specify an arch for a backend - whether a particular arg, generic, native, etc - but guessing is bad)

@ScottTodd
Member

We can certainly update the docs (https://iree.dev/guides/deployment-configurations/cpu/#compile-a-program) and start with a warning from the compiler if information is omitted and generic is used as the default.

I'm seeing a proliferation of flags (mainly in rocm usage, but also cpu) and the documentation can't keep up. I want more of that to be captured somewhere - docs, samples, the compiler itself, etc.

See one example here:

ROCM_COMPILE_FLAGS = [
    "--iree-hal-target-backends=rocm",
    f"--iree-hip-target={rocm_chip}",
    "--iree-opt-const-eval=false",
    f"--iree-codegen-transform-dialect-library={iree_test_path_extension}/attention_and_matmul_spec.mlir",
    "--iree-global-opt-propagate-transposes=true",
    "--iree-dispatch-creation-enable-fuse-horizontal-contractions=true",
    "--iree-dispatch-creation-enable-aggressive-fusion=true",
    "--iree-opt-aggressively-propagate-transposes=true",
    "--iree-opt-outer-dim-concat=true",
    "--iree-vm-target-truncate-unsupported-floats",
    "--iree-llvmgpu-enable-prefetch=true",
    "--iree-opt-data-tiling=false",
    "--iree-codegen-gpu-native-math-precision=true",
    "--iree-codegen-llvmgpu-use-vector-distribution",
    "--iree-hip-waves-per-eu=2",
    "--iree-execution-model=async-external",
    "--iree-preprocessing-pass-pipeline=builtin.module(iree-preprocessing-transpose-convolution-pipeline,iree-preprocessing-pad-to-intrinsics)",
    "--iree-scheduling-dump-statistics-format=json",
    "--iree-scheduling-dump-statistics-file=compilation_info.json",
]

@benvanik
Collaborator

That's insanity - besides the debug flags (dumping statistics/etc) if any of those are required that's a bug. I think Mahesh has said it before: a feature is not done until it's on by default and if all of those flags are needed to make the model compile or perform then the engineering was never completed. The only two flags required there should be --iree-hal-target-device=hip (that's using the old deprecated flag) and --iree-hip-target= (which could be native if we wanted to do what clang does and invoke amdgpu-arch if it's present).
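Under that principle, the compile command above should shrink to something like the following (the gfx942 target name is illustrative, not taken from this thread):

```shell
# Everything else should be on by default; only the device and its
# architecture need to be stated (or `native` resolved via amdgpu-arch).
iree-compile model.mlir -o model.vmfb \
  --iree-hal-target-device=hip \
  --iree-hip-target=gfx942
```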

@stellaraccident
Collaborator Author

stellaraccident commented Sep 20, 2024

I'm fine with making --iree-llvmcpu-target-cpu=&lt;something&gt; required. Given the proliferation of usage out there, I think the way to get there may be to first do what I am suggesting: issue a warning when the flag is not specified and the compiler implicitly defaults to a generic CPU (with a note that this flag will soon be required).

Agreed on all of the other points. Need to burn down all of the other flags. I'm just starting with this one.

@ScottTodd
Member

I can take a pass at this, unless someone else wants to.

Plan:

  • Emit a warning if --iree-llvmcpu-target-cpu is omitted
  • (Maybe?) emit a warning if iree-llvmcpu-target-cpu-features is omitted
  • Update docs at https://iree.dev/guides/deployment-configurations/cpu/
  • Audit usage of --iree-hal-target-backends=llvm-cpu in-tree and set those flags explicitly
    • Also start switching the repo over to --iree-hal-target-device?
  • (Later) make one or both of those flags required

@ScottTodd ScottTodd self-assigned this Sep 20, 2024
@ScottTodd
Member

Can someone clarify why we have all three of these flags?

  • --iree-llvmcpu-target-triple
  • --iree-llvmcpu-target-cpu
  • --iree-llvmcpu-target-cpu-features

It seems like the triple could be a superset of the cpu? Is there some redundancy there? I see some riscv sample code setting both:

RISCV_64=(
--iree-llvmcpu-target-triple=riscv64-pc-linux-elf
--iree-llvmcpu-target-cpu=generic-rv64
--iree-llvmcpu-target-cpu-features=+m,+a,+f,+d,+c
--iree-llvmcpu-target-abi=lp64d
)

but even our microkernels blog post (highlighting cpu performance work) only includes a few of the flags:

Basic compilation command line:
```bash
$ iree-compile matmul.mlir -o /tmp/matmul.vmfb \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-cpu=znver4 \
--iree-llvmcpu-enable-ukernels=all
```

Oh, the linalg tutorial from @bjacob explains that target-cpu is for x86, but target-cpu-features is for other architectures?

* To run on GPU or other non-CPU targets, explore other values for
  `--iree-hal-target-backends=`. You will then need to pass a matching
  `--device=` to `iree-run-module` below.
* To cross-compile, explore `--iree-llvmcpu-target-triple=`.
* To enable higher CPU performance by enabling CPU features:
  * On x86, explore `--iree-llvmcpu-target-cpu=` (e.g.
    `--iree-llvmcpu-target-cpu=znver4` to target AMD Zen4).
  * On other architectures, explore `--iree-llvmcpu-target-cpu-features=`.
* To optimize for running on the same machine that the compilation ran
  on, pass `--iree-llvmcpu-target-cpu=host`. That works regardless of
  CPU architecture.

There is some complicated logic and then a few calls into LLVM itself in https://github.com/iree-org/iree/blob/main/compiler/plugins/target/LLVMCPU/LLVMTargetOptions.cpp.

(This is why I filed #15487 - I've wanted someone directly familiar with LLVM CPU to be driving this)

@stellaraccident
Collaborator Author

This stuff always grows into a bit of a hairball. The condition we are trying to guard against is that an empty iree-llvmcpu-target-cpu should never drive a decision (it should be a warning now and an error eventually). We will need to peel the decision tree back to that.

@benvanik
Collaborator

The CPU flags mirror LLVM - we can't remove them, but we could more intelligently populate them - maybe - triple is often not enough. I think we do ask for defaults from LLVM today. I'm hesitant to suggest we diverge from clang behavior as then we have to support that (and if the issue here is that our documentation sucks adding bespoke stuff only hurts that).

@ScottTodd
Member

Made some progress stepping through the details:

  • I tried this resnet50 ONNX model with and without --iree-llvmcpu-target-cpu=host on my system. I see about 140ms with the flag and 170ms without:

    See logs and commands used here
    # (Download the file and upgrade it to ONNX version >= 17)
    $ iree-compile.exe \
      resnet50-v2-7_version17.mlir \
      --iree-hal-target-backends=llvm-cpu \
      -o resnet_noflags.vmfb
    $ iree-compile.exe \
      resnet50-v2-7_version17.mlir \
      --iree-hal-target-backends=llvm-cpu \
      --iree-llvmcpu-target-cpu=host \
      -o resnet_targetcpu_host.vmfb
    
    $ iree-benchmark-module.exe \
      --module=resnet_noflags.vmfb \
      --device=local-task \
      --function=main \
      --input=1x3x224x224xf32
    2024-09-20T14:05:40-07:00
    Running iree-benchmark-module.exe
    Run on (64 X 3693 MHz CPU s)
    CPU Caches:
      L1 Data 32 KiB (x32)
      L1 Instruction 32 KiB (x32)
      L2 Unified 512 KiB (x32)
      L3 Unified 16384 KiB (x8)
    ***WARNING*** Library was built as DEBUG. Timings may be affected.
    -----------------------------------------------------------------------------------------
    Benchmark                               Time             CPU   Iterations UserCounters...
    -----------------------------------------------------------------------------------------
    BM_main/process_time/real_time        171 ms         2898 ms            4 items_per_second=5.86105/s
    
    $ iree-benchmark-module.exe \
      --module=resnet_targetcpu_host.vmfb \
      --device=local-task \
      --function=main \
      --input=1x3x224x224xf32
    2024-09-20T14:06:24-07:00
    Running iree-benchmark-module.exe
    Run on (64 X 3693 MHz CPU s)
    CPU Caches:
      L1 Data 32 KiB (x32)
      L1 Instruction 32 KiB (x32)
      L2 Unified 512 KiB (x32)
      L3 Unified 16384 KiB (x8)
    ***WARNING*** Library was built as DEBUG. Timings may be affected.
    -----------------------------------------------------------------------------------------
    Benchmark                               Time             CPU   Iterations UserCounters...
    -----------------------------------------------------------------------------------------
    BM_main/process_time/real_time        138 ms         2134 ms            5 items_per_second=7.24555/s
    
  • JitGlobals runs the CPU compilation pipeline for the host, regardless of flags and requested devices. That's fine, it's an implementation detail of the compiler. We shouldn't emit any warnings on this path. Relevant code:

    static std::string
    resolveTargetDevice(const IREE::HAL::TargetRegistry &targetRegistry) {
      if (clJitTargetDevice.empty()) {
        // Default - choose something we have.
        // First llvm-cpu then vmvx.
        if (targetRegistry.getTargetDevice("llvm-cpu")) {
          return std::string("llvm-cpu");
        } else {
          return std::string("vmvx");
        }
      }

    // Set the target.
    std::optional<IREE::HAL::DeviceTargetAttr> targetAttr =
        targetDevice->getHostDeviceTarget(&getContext(), *targetRegistry.value);
    if (!targetAttr) {
      emitError(UnknownLoc::get(&getContext()))
          << "consteval requested backend " << requestedTargetDevice
          << " cannot target the host";
      signalPassFailure();
      return;
    }

  • The targetCPU flag default is set here:

    // Default device options.
    std::string targetTriple = "";
    std::string targetCPU = "generic";
    std::string targetCPUFeatures = "";

    I'm thinking I'll replace that with an empty string, then add some logic that warns and sets it back to that default later. I still need to watch how JitGlobals, default target construction from CLI flags, and explicit target construction from --iree-hal-target-device or program IR interact across all the code paths.

@ScottTodd
Member

@marbre pointed out that for bare metal arm, the target handling is letting some "errors" fall through:

if (triple.isX86()) {
  llvm::SmallVector<llvm::StringRef> cpuFeatureList;
  addCpuFeatures(llvm::X86::getFeaturesForCPU, cpuFeatureList);
} else if (triple.isRISCV64()) {
  llvm::SmallVector<std::string> cpuFeatureList;
  addCpuFeatures(llvm::RISCV::getFeaturesForCPU, cpuFeatureList);
} else {
  llvm::errs()
      << "error: Resolution of target CPU to target CPU features is not "
         "implemented on this target architecture. Pass explicit CPU "
         "features instead of a CPU on this architecture, or implement "
         "that.\n";
  return false;
}

if (!resolveCPUAndCPUFeatures(cpu, cpuFeatures, llvm::Triple(triple),
                              target.cpu, target.cpuFeatures)) {
  // Something bad happened, and our target might not be what the user expects
  // but we need to continue to avoid breaking existing users. Hopefully
  // resolveCPUAndCPUFeatures logged a helpful error already.
}
return target;

Sample logs: https://github.com/iree-org/iree-bare-metal-arm/actions/runs/10923467370/job/30320173426#step:11:262

[158/258] Generating simple_mul_int_bytecode_module_static_c_module_emitc.h, simple_mul_int_bytecode_module_static_c_module.o, simple_mul_int_bytecode_module_static_c_module.h
error: Resolution of target CPU to target CPU features is not implemented on this target architecture. Pass explicit CPU features instead of a CPU on this architecture, or implement that.
error: Resolution of target CPU to target CPU features is not implemented on this target architecture. Pass explicit CPU features instead of a CPU on this architecture, or implement that.

Flags for those logs: https://github.com/iree-org/iree-bare-metal-arm/blob/23deb47d546786e7bd64fc6edd51a3095b6c1817/samples/simple_embedding/CMakeLists.txt#L98-L109

We may want to amend that logic here too. The concern about not "breaking existing users" potentially leaves performance on the table with that style of error reporting.

@ScottTodd
Member

More context for my previous comment: #15387

@bjacob
Contributor

bjacob commented Sep 23, 2024

Oh yeah, I only cared about ARM when I wrote that :-D

@ScottTodd
Member

I noticed that we override the targetTriple (--iree-llvm-target-triple=) string when using embedded linking:

if (target.linkEmbedded) {
  // Force the triple to something compatible with embedded linking.
  targetTriple.setVendor(llvm::Triple::VendorType::UnknownVendor);
  targetTriple.setEnvironment(llvm::Triple::EnvironmentType::EABI);
  targetTriple.setOS(llvm::Triple::OSType::UnknownOS);
  targetTriple.setObjectFormat(llvm::Triple::ObjectFormatType::ELF);
  target.triple = targetTriple.str();
}

However, we only override parts of the triple, not the full object/string. In particular, that code appears to leave the "arch" unchanged. Possible values for that are in https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/TargetParser/Triple.h. Does that mean that if you compile on x86_64, your code generated with llvm-cpu won't be compatible with aarch64?

I'm wondering if this other default should be changed to an explicit "host" too:

if (targetTriple.empty()) {
  targetTriple = llvm::sys::getProcessTriple();
}

@ScottTodd
Member

Does that mean that if you compile on x86_64, your code generated with llvm-cpu won't be compatible with aarch64?

Answering my own question - yes. Compiled with embedded linking and

  • --iree-llvmcpu-target-triple=aarch64-pc-linux-elf
    • This program failed to load on Windows x86_64 with <vm>:0: NOT_FOUND; HAL device `__device_0` not found or unavailable: #hal.device.target<"local", [#hal.executable.target<"llvm-cpu", "embedded-elf-arm_64", {cpu = "", cpu_features = "+reserve-x18", data_layout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32", native_vector_size = 16 : i64, target_triple = "aarch64-unknown-unknown-eabi-elf"}>]>;
  • --iree-llvmcpu-target-triple=x86_64-pc-linux-elf
    • I was able to run this program on Windows x86_64

I'm still wondering if we want to default the target triple to llvm::sys::getProcessTriple() or also make that an explicit "host lets LLVM decide what to do" option. At least there we have no generic choice to fall back on, right?

@ScottTodd
Member

Either way, we could have our docs explain OS, arch, features, etc.

  • OS: matters when using the system linker (for full integration with debug tools). Does not matter with embedded linking mode.
  • Arch: inferred from the hosting process or set explicitly.
  • Features: can be inferred from the CPU on some architectures (x86 and riscv64), otherwise must be set explicitly.
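Putting those three together, a cross-compilation invocation might look like this (the model name and the specific feature strings are illustrative; +dotprod and +i8mm are examples of LLVM AArch64 feature names, not flags taken from this thread):

```shell
# Cross-compile for 64-bit Arm with embedded linking. The OS/vendor parts of
# the triple are largely overridden in embedded linking mode, but the arch
# matters; features must be spelled out since host detection doesn't apply.
iree-compile model.mlir -o model_aarch64.vmfb \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-triple=aarch64-pc-linux-elf \
  --iree-llvmcpu-target-cpu-features=+dotprod,+i8mm
```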

@ScottTodd
Member

Ehhh... we only support the "host" CPU name on x86?

outCpu = triple.isX86() ? llvm::sys::getHostCPUName().str() : "";

The https://github.com/llvm/llvm-project/blob/main/llvm/lib/TargetParser/Host.cpp file supports plenty of other architectures though...

@benvanik
Collaborator

You're unfortunately finding the results of engineers not caring about anything but the exact configuration they are looking at in the moment they author something. Has been an issue for the lifetime of the project and probably always will be 😢

@stellaraccident
Collaborator Author

Thanks for digging into it, Scott. Feel free to loop one of the backend engineers in if you need help untangling it. I'm happy to nominate others to care in detail.

@ScottTodd
Member

I think I see enough of the pieces now to refactor the code a bit and add some helpful warnings and documentation.

I'm not sure how I'll test my changes though, since a fair portion of this is different depending on the architecture of the host machine running the compiler and I only have x86_64 dev machines.

It would be helpful to get some more eyes on the various configurations we want to support and then do some manual QA testing that the compiler either detects the right features and generates good code, or bails with a helpful error.

@stellaraccident
Collaborator Author

Maybe this is more of a unit test via some magic env var or test-only flag: --iree-testing-assume-host=, then a lit test variant for each arch branch that runs device assignment and validates. We're not looking to test LLVM here, just to ensure that we're not fumbling the flag parsing.

@ScottTodd
Member

That could work, yeah. When I say "test my changes" here, I'm still just referring to local development "testing", not automated CI testing - that would be a nice bonus.

@stellaraccident
Collaborator Author

Well, if you have the knobs to verify locally, then you're more than halfway to a lit test. That's how most of these things in llvm proper get tested.

@ScottTodd
Member

Pushed an initial attempt at reworking how the target init is handled: #18587 . I could pass that off to someone else and context switch to other tasks 🤔

@bjacob
Contributor

bjacob commented Sep 24, 2024

Sorry, I had not kept up with the discussion here, was heads down in GPU data tiling.

Here are the difficulties that I know of:

  1. Different concepts are more or less relevant on different CPU architectures:
    • On x86, people want to talk in terms of "CPU" (meaning microarchitecture) such as znver4 or cascadelake. People do not typically want to talk in terms of CPU features on x86 because that is very cumbersome. For example, just enabling the baseline AVX-512 feature set on x86 is a combination of 5 features, each with long names; a typical compilation relies on > 10 CPU features.
    • On RISC-V, people want to talk in terms of CPU features, and there are many, but they are not too cumbersome thanks to very short names, e.g. +z,+a,+m. The "CPU" string is not much used on RISC-V, according to RISC-V folks I asked back then, due to the very modular nature of the architecture.
    • On Arm, people typically specify baseline Arm architecture version plus a few CPU features, e.g. armv8.2-a+i8mm. The CPU names are also not much used on Arm; when targeting Android, fragmentation makes that hard anyway.
  2. To map a CPU name to CPU features, LLVM has a nice utility function doing that... on x86, but not on other architectures.
    • You know me, if that utility function had been available outside of x86, I would not have special-cased x86.
    • I tried implementing that on other architectures, but the ways I could see were either still architecture-specific in some way, or felt too heavy.
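Those per-architecture conventions translate to flag usage roughly like this (the specific CPU and feature names are illustrative examples drawn from the discussion, not an exhaustive or prescriptive list):

```shell
# x86: name a microarchitecture; the many long feature names are implied.
--iree-llvmcpu-target-cpu=znver4

# RISC-V: enumerate short feature letters; the CPU name is rarely used.
--iree-llvmcpu-target-cpu-features=+m,+a,+f,+d,+c

# Arm: architecture version plus a few named features.
--iree-llvmcpu-target-cpu-features=+v8.2a,+i8mm
```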

Here is what I would do:

  1. When --iree-llvmcpu-target-triple is host (or unspecified, so defaults to host), default to --iree-llvmcpu-target-cpu=host.
    • This matches the behavior of hipcc, so I presume also nvcc.
  2. When --iree-llvmcpu-target-triple is not host, your #18587 ("Warn when --iree-llvmcpu-target-cpu defaults to 'generic'") sounds like a good way to go. It could dump a list of recognized CPU names for the specified target architecture.
