
Should "pairs of scalars" use LLVM vectors? #48426

Closed
eddyb opened this issue Feb 22, 2018 · 14 comments
Labels
A-codegen Area: Code generation C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@eddyb (Member) commented Feb 22, 2018

As seen on reddit, we could pass e.g. (f32, f32) not as two floats, but as one <2 x float>.
It's unclear how well LLVM would be able to deal with it.

cc @dotdash @alexcrichton

@eddyb eddyb added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Feb 22, 2018
@hanna-kruppe (Contributor)

I'm going to assume this would only impact parameter passing, and that the callee would immediately extractelement into two scalars. Otherwise this seems too big a burden on our codegen (and most code would still need the extractelements, but ad hoc when needed instead of once on entry).

There are two aspects here: the impact on optimizations, and the impact on the backend/codegen.

For the former, I estimate that unless the value really is a logical vector and is used as such (as in the linked post), this just introduces redundant vector construction and destructuring for no benefit. More IR is not good, of course, and because these operations are dictated by the function signature, they can't easily be removed. Additionally, they might hamper inference of attributes such as nocapture (for pointers), returned, range metadata, etc., because there's no way to attach these to individual vector elements. That could harm some interprocedural optimizations.

The advantage for optimizations is really just that if you do vector-esque operations on the elements, instcombine or the SLP vectorizer may fold the "extract - do scalar ops - build vector" sequence into a vector op. But I don't think this will be very useful:

  • it is flaky: removing the need to build the vector from scalars only tilts the heuristics in favor of vectorizing; if you do anything with the pair that can't be done as a vector operation, the vectorizer might decide not to vectorize anything
  • it would only help for vectors of length two
  • we're gaining better support for explicitly using SIMD anyway
  • small functions like the one in the linked discussion should be inlined anyway, and once they are inlined, any difference the calling convention might have made is erased
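For concreteness, here is a minimal Rust sketch of my own (not taken from the linked post) of the kind of pair code in question: both lanes get the same independent operation, which is exactly the "extract scalars, do scalar ops, build vector" shape the SLP vectorizer looks for.

```rust
// Hypothetical example: each element of the pair undergoes the same
// independent arithmetic, so the two lanes could in principle be fused
// into one packed operation by the SLP vectorizer.
fn scale_and_offset(p: (f32, f32), s: f32, o: f32) -> (f32, f32) {
    (p.0 * s + o, p.1 * s + o)
}

fn main() {
    println!("{:?}", scale_and_offset((1.0, 2.0), 2.0, 0.5));
}
```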

As for the backend, unless the target has a suitable vector register, the value will have to be legalized (in this case, most likely split into scalars or smaller vectors, or promoted to a longer vector). Having illegal values in the function signature is a bit more tricky because these don't just pass through the general SelectionDAG legalizer (which is generally quite robust) but are also inspected by the calling convention code, which may not be equipped to handle such types if Clang doesn't use them in function signatures. I've certainly seen more bugs there than when legalizing types within a function, especially in less mainstream backends (i.e., pretty much everything except x86 and Arm).

However, when a suitable vector register is available (or the backend legalizes it well) then this would generally pass the pair in that register, which would be nice for avoiding stack traffic. But then again, passing in a vector register means you need extra instructions if the scalars in the pair ever need to, well, be in scalar registers. I expect that to be the most common case.


tl;dr Passing vectors maybe helps optimizing literal vector types of length two a bit, and passes them a bit more efficiently, but in all other cases makes the IR more opaque, risks unearthing backend bugs, and likely causes redundant data traffic before and after calls.

@hanna-kruppe (Contributor) commented Feb 22, 2018

I guess having extra registers for passing pairs could still be beneficial for functions that receive a scalar pair and forward it without ever operating on the contained scalars (i.e., only pass it on or copy it around). Not sure how common that is, though, and it still requires traffic between vector and scalar registers when the pair is built by the caller and when it's finally consumed by other code.
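A minimal sketch of that forwarding pattern (the function names are hypothetical, and #[inline(never)] is only there to keep the calls out-of-line so the calling convention actually matters):

```rust
// `relay` receives a scalar pair and passes it on without ever touching
// the contained scalars; with a vector-register convention, the pair
// could stay in one register across this call.
#[inline(never)]
fn sink(p: (f32, f32)) -> f32 {
    p.0 + p.1
}

#[inline(never)]
fn relay(p: (f32, f32)) -> f32 {
    // Only forwards the pair; never reads p.0 or p.1 itself.
    sink(p)
}

fn main() {
    println!("{}", relay((1.5, 2.5)));
}
```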

@eddyb eddyb added A-codegen Area: Code generation and removed A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. labels Feb 22, 2018
@gnzlbg (Contributor) commented Mar 28, 2018

@eddyb how isn't this just a bug in LLVM's SLP vectorizer?

@hanna-kruppe (Contributor)

@gnzlbg The vectorizers can't change the function's ABI, which is the core of the issue here. You're right that within a function you can recognize and vectorize straight-line code with data parallelism, but if the data is not yet in the right format for that, you pay for converting to and from the vector representation.

@gnzlbg (Contributor) commented Mar 28, 2018

The vectorizers can't change the function's ABI which is the core of the issue here.

Aren't both ABIs equivalent in this case? If so, LLVM should be able to leverage that. Otherwise that would be another LLVM bug.

For the case that @eddyb was talking about everything is inlined, so the ABI shouldn't really matter.

@hanna-kruppe (Contributor) commented Mar 28, 2018

ABI is a huge, complicated mess, but the short story is: if we emit a function in IR that takes two float arguments, then on x86 those arguments are passed in two different xmm registers, whereas if it takes one <2 x float> argument, it is passed in one xmm register. So the ABI is different, end of story.

For the case that @eddyb was talking about everything is inlined, so the ABI shouldn't really matter.

Which case are you referring to? The reddit discussion focuses on a trivial function in isolation (taken from the blog post under discussion, which contains a few more examples, but they're all similarly trivial):

```rust
pub struct Vec2 {
    pub x: f32,
    pub y: f32,
}

pub fn add(a: Vec2, b: Vec2) -> Vec2 {
    Vec2 {
        x: a.x + b.x,
        y: a.y + b.y,
    }
}
```

@gnzlbg (Contributor) commented Mar 28, 2018

Which case are you referring to?

This: https://godbolt.org/g/CyiLFs

Independently of the ABI specified for add (which is a problem in its own right), what bothers me is that in the foo function LLVM does not transform that scalar code into a single addps.

@hanna-kruppe (Contributor)

OK, good to know, but I don't think that's in any way related to the ABI decision that this issue is about.

@gnzlbg (Contributor) commented Mar 28, 2018

I think it is very relevant. If the code optimized properly, the ABI we choose would not matter for crate-private or #[inline] functions. It would only matter for crate-public non-#[inline] functions when ThinLTO or FatLTO is disabled.

The functions in the reddit post are all tiny; I would expect them to often be inlined, or to be marked #[inline] when exported by a crate, and thus not to be affected by this.
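A sketch of that argument, reusing the Vec2 example from above with a hypothetical #[inline] annotation: the attribute makes the body available to downstream crates, so once a call site is inlined, no pair-passing ABI is involved at all.

```rust
pub struct Vec2 {
    pub x: f32,
    pub y: f32,
}

// With #[inline], downstream crates can inline this call, so the
// question of how the (f32, f32) pairs cross the call boundary
// disappears at those call sites.
#[inline]
pub fn add(a: Vec2, b: Vec2) -> Vec2 {
    Vec2 { x: a.x + b.x, y: a.y + b.y }
}

fn main() {
    let r = add(Vec2 { x: 1.0, y: 2.0 }, Vec2 { x: 3.0, y: 4.0 });
    println!("{} {}", r.x, r.y);
}
```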

@hanna-kruppe (Contributor)

This issue matters for every call (that passes a scalar pair) that isn't inlined, of which there are many. What you describe has a completely different root cause and likely a completely different fix, so I don't see the point in discussing it here; it should be a separate issue.

@gnzlbg (Contributor) commented Mar 28, 2018

I fully agree with this:

The advantage for optimizations is really just that if you do vector-esque operations on the elements, instcombine or the SLP vectorizer may fold the "extract - do scalar ops - build vector" sequence into a vector op. But I don't think this will be very useful

@jkordish jkordish added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Apr 18, 2018
@steveklabnik (Member)

Triage: not sure if anything is going on here, or if this issue is pulling its weight. That is, this feels like a small discussion from a year ago, with no clear resolution.

@gnzlbg (Contributor) commented Jul 23, 2019

I think that the current default is better, and therefore we should not change it.

The current default is unsurprising: a user who writes scalar code gets scalar semantics, while a user who explicitly uses logical vectors gets vector semantics. One can change from one to the other, and what happens is explicit in the code.

If we change the default to use vectors for pairs, at best user code becomes faster, and at worst we generate extra IR that might end up producing slower scalar code.
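That contrast can be sketched as follows (my illustration, assuming an x86_64 target; _mm_add_ps is the packed-add intrinsic that compiles to a single addps):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128, _mm_add_ps, _mm_setr_ps, _mm_storeu_ps};

// Scalar semantics: the user wrote two independent f32 additions,
// and nothing in the source asks for SIMD.
fn add_scalar(a: (f32, f32), b: (f32, f32)) -> (f32, f32) {
    (a.0 + b.0, a.1 + b.1)
}

// Explicit vector semantics: the user opted into a packed type, so the
// use of a vector register is visible in the code itself.
#[cfg(target_arch = "x86_64")]
fn add_packed(a: __m128, b: __m128) -> __m128 {
    // SAFETY: SSE is part of the x86_64 baseline, so _mm_add_ps is
    // always available on this target.
    unsafe { _mm_add_ps(a, b) }
}

#[cfg(target_arch = "x86_64")]
fn main() {
    let s = add_scalar((1.0, 2.0), (3.0, 4.0));
    let p = unsafe {
        let r = add_packed(
            _mm_setr_ps(1.0, 2.0, 0.0, 0.0),
            _mm_setr_ps(3.0, 4.0, 0.0, 0.0),
        );
        let mut out = [0.0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), r);
        (out[0], out[1])
    };
    // Both styles compute the same values; only the register usage and
    // the explicitness in the source differ.
    assert_eq!(s, p);
    println!("{:?}", s);
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```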

@workingjubilee (Member)

I agree that this issue is not "pulling its weight", and it is grounded on a somewhat shaky basis. There has been some fussing about with the ABI recently in order to undo performance regressions induced by previously trying to play too many games with LLVM's understanding of the code; see #85265.

There is almost certainly some profit to be had, but it requires a more sophisticated plan-of-attack which would be, essentially, off-topic here.

@workingjubilee closed this as not planned Jul 6, 2022