Following `manual_clamp` suggestion results in slower code #12826

okaneco · 2024-05-20T05:15:58Z

Summary

I noticed that when clamping and casting from i32 to u8, using clamp(0, 255) as u8 produces unnecessary instructions compared to .max(0).min(255) as u8. If a loop is auto-vectorized, the branches in clamp result in slower code than manual clamping.

I couldn't find a label for this, but it would be akin to I-suggestion-causes-perf-regression.

Currently, the lint is set to warn but following the suggestion inhibits optimization. I don't believe it should fire on the "branchless" patterns which are semantically different.

// 1
input.max(min).min(max)

// 2
let mut x = input;
if x < min { x = min; }
if x > max { x = max; }
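
As a rough sketch of where the semantics differ, the two patterns above can be wrapped in functions and compared with the standard clamp. The function names are hypothetical and only for illustration. For in-order bounds all three agree. With inverted bounds (min > max) the manual patterns still return a value, whereas clamp is documented to panic.

```rust
/// Pattern 1 from the report: chained max/min.
/// `clamp_max_min` is a hypothetical name used only for this sketch.
pub fn clamp_max_min(input: i32, min: i32, max: i32) -> i32 {
    input.max(min).min(max)
}

/// Pattern 2 from the report: two independent `if` statements.
/// `clamp_ifs` is likewise a hypothetical name.
pub fn clamp_ifs(input: i32, min: i32, max: i32) -> i32 {
    let mut x = input;
    if x < min {
        x = min;
    }
    if x > max {
        x = max;
    }
    x
}

/// Returns true when both manual patterns agree with `Ord::clamp`.
/// Only call this with `min <= max`, since `clamp` panics otherwise.
pub fn agrees_with_clamp(input: i32, min: i32, max: i32) -> bool {
    let expected = input.clamp(min, max);
    clamp_max_min(input, min, max) == expected && clamp_ifs(input, min, max) == expected
}
```

Calling clamp_max_min(5, 10, 0) or clamp_ifs(5, 10, 0) returns 0, while 5.clamp(10, 0) would panic, so a lint suggestion that swaps one for the other is not always behavior-preserving.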

Lint Name

manual_clamp

Lint Description

I also had a small issue with the wording in the current description.

Why is this bad?
clamp is much shorter, easier to read, and doesn’t use any control flow.
https://rust-lang.github.io/rust-clippy/master/index.html#/manual_clamp

I slightly disagree with the reasoning here.
I understand the user doesn't have to add any control flow, but the control flow within the clamp implementation is different enough to affect performance in some cases. It is not strictly a "better" clamping method than manually clamping, especially for primitive integers.

Reproducer

#[inline(never)]
pub fn clamp(input: &[i32], output: &mut [u8]) {
    for (&i, o) in input.iter().zip(output.iter_mut()) {
        *o = i.clamp(0, 255) as u8;
    }
}

#[inline(never)]
pub fn manual_clamp(input: &[i32], output: &mut [u8]) {
    for (&i, o) in input.iter().zip(output.iter_mut()) {
        *o = i.max(0).min(255) as u8;
    }
}

Assembly output - https://rust.godbolt.org/z/rdoh97d3v (1.78, but same output on nightly)
The main difference is in the label .LBB0_4 where extra work is being done by the clamp code.
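
To confirm that the two reproducer functions differ only in generated code and not in results, a small check like the following could be used. It is a sketch: outputs_match is a hypothetical helper that is not part of the original report, and the #[inline(never)] attributes are omitted because they only matter when reading the assembly.

```rust
/// Reproducer version using `Ord::clamp`.
pub fn clamp(input: &[i32], output: &mut [u8]) {
    for (&i, o) in input.iter().zip(output.iter_mut()) {
        *o = i.clamp(0, 255) as u8;
    }
}

/// Reproducer version using the manual max/min chain.
pub fn manual_clamp(input: &[i32], output: &mut [u8]) {
    for (&i, o) in input.iter().zip(output.iter_mut()) {
        *o = i.max(0).min(255) as u8;
    }
}

/// Hypothetical helper: runs both versions on the same input and
/// reports whether the resulting byte buffers are identical.
pub fn outputs_match(input: &[i32]) -> bool {
    let mut a = vec![0u8; input.len()];
    let mut b = vec![0u8; input.len()];
    clamp(input, &mut a);
    manual_clamp(input, &mut b);
    a == b
}
```

Passing a slice that covers the edge cases, for example [i32::MIN, -1, 0, 255, 256, i32::MAX], exercises both saturation directions.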
(Both functions compute the same output for every input; only the generated instructions differ.)

Version

rustc 1.80.0-nightly (d84b90375 2024-05-19)
binary: rustc
commit-hash: d84b9037541f45dc2c52a41d723265af211c0497
commit-date: 2024-05-19
host: x86_64-pc-windows-msvc
release: 1.80.0-nightly
LLVM version: 18.1.4

Additional Labels

No response

blyxyas · 2024-05-22T11:14:18Z

I will look into this on the upstream compiler side. This really shouldn't happen.
Thanks for the report ❤️

okaneco · 2024-05-22T16:05:13Z

Thanks.
I have two examples of real code from the image-webp crate that helped motivate this report.

https://rust.godbolt.org/z/3rnY8d94v
https://rust.godbolt.org/z/53T7n9PGx

blyxyas · 2024-05-23T10:42:31Z

I'm currently working on a patch for the upstream compiler that should fix this behaviour (more than a bug in Clippy, this is a possible optimization in the standard library).

Note that a difference this big only happens with clamp(0, 255); using other ranges such as 40, 200 results in more similar assembly. I'll still open the PR to the standard library.
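
For anyone who wants to compare ranges in a compiler explorer, a parameterized variant of the reproducer might look like the sketch below. The const-generic helpers and the wrapper names are assumptions made for illustration. The concrete wrappers exist so that each range gets its own emitted function.

```rust
/// Hypothetical const-generic form of the `clamp` reproducer.
#[inline(always)]
fn clamp_range<const LO: i32, const HI: i32>(input: &[i32], output: &mut [u8]) {
    for (&i, o) in input.iter().zip(output.iter_mut()) {
        *o = i.clamp(LO, HI) as u8;
    }
}

/// Hypothetical const-generic form of the manual max/min reproducer.
#[inline(always)]
fn manual_range<const LO: i32, const HI: i32>(input: &[i32], output: &mut [u8]) {
    for (&i, o) in input.iter().zip(output.iter_mut()) {
        *o = i.max(LO).min(HI) as u8;
    }
}

// Concrete instantiations for the 40..=200 range mentioned above.
#[inline(never)]
pub fn clamp_40_200(input: &[i32], output: &mut [u8]) {
    clamp_range::<40, 200>(input, output)
}

#[inline(never)]
pub fn manual_40_200(input: &[i32], output: &mut [u8]) {
    manual_range::<40, 200>(input, output)
}
```

Adding a second pair of wrappers instantiated with 0 and 255 allows a side-by-side comparison of the two ranges.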

blyxyas · 2024-05-24T12:23:39Z

Okis, the PR has been merged; once that lands we should see an improvement (I'll test in a few on nightly). I think that this issue can now be closed, as the new inline clamp function results in smaller assembly than doing it manually.

What do you think?

Rollup merge of rust-lang#125455 - blyxyas:opt-clamp, r=joboet Make `clamp` inline Context: rust-lang/rust-clippy#12826 This results in slightly more optimized assembly. (And most important, it's now less than lines than just manually clamping a value)

okaneco · 2024-05-24T21:18:33Z

That sounds good, thanks for doing that.

I reported the "bug" here because when I inlined the clamp definition, it was still producing the selects. I assumed that with the way clamp is currently written, it wouldn't be possible to produce the same output as the manual clamp.
https://rust.godbolt.org/z/Ex4zvExsb
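
A minimal sketch of what hand-inlining the clamp definition can look like is shown below. It assumes the standard library's Ord::clamp has roughly this if / else-if shape. It is not a copy of the linked godbolt snippet, and clamp_inlined and clamp_loop_inlined are hypothetical names.

```rust
/// Hand-written approximation of `Ord::clamp` for `i32`.
/// Assumption: the real definition asserts `min <= max` and then uses
/// an if / else-if chain like this one.
#[inline(always)]
pub fn clamp_inlined(value: i32, min: i32, max: i32) -> i32 {
    assert!(min <= max);
    if value < min {
        min
    } else if value > max {
        max
    } else {
        value
    }
}

/// Reproducer loop rewritten to use the hand-inlined clamp.
#[inline(never)]
pub fn clamp_loop_inlined(input: &[i32], output: &mut [u8]) {
    for (&i, o) in input.iter().zip(output.iter_mut()) {
        *o = clamp_inlined(i, 0, 255) as u8;
    }
}
```

Because the else-if chain makes the second comparison depend on the first, this shape differs structurally from the max/min chain, where both comparisons are applied unconditionally.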

It's definitely more of a performance optimization upstream and probably most related to this specific saturating truncation case. Hopefully clamp and manual clamp can produce equivalent results soon for this.

I agree, it makes more sense to file an issue upstream so it can be tracked and closed there, or closed by regression tests being added if there isn't an improvement.

okaneco added the C-bug Category: Clippy is not doing the correct thing label May 20, 2024

blyxyas added the performance-project For issues and PRs related to the Clippy Performance Project label May 22, 2024

blyxyas mentioned this issue May 23, 2024

Make clamp inline rust-lang/rust#125455

Merged

okaneco closed this as completed May 24, 2024

okaneco mentioned this issue May 29, 2024

More instructions generated for Ord::clamp than manual max(X).min(Y) for saturating truncating cast from i32 to u8 rust-lang/rust#125738

Open
