Vc has bad performance with Intel C/C++ compiler on Linux #135
Thanks a lot @amadio ! May I use the code in the Vc benchmark suite (still under construction)? What license and copyright? I'd like to use BSD 3-clause, if possible. Here's the result running on my Skylake Desktop system (seems to be the same CPU you're using) compiled with GCC 5.2:
Hi Matthias, you can use the code with any open license you want, and modify it to your needs. A very similar version is in VecCore, where I also test performance with fake SIMD classes and UME::SIMD. The performance of Vc with GCC is usually very good, same for Clang, but with ICC I always see performance degradation, and the scalar version of Vc is oddly the fastest. If you do not have a license for ICC, I can run the benchmarks for you, but I strongly encourage you to also apply for an open source license so that you can test Vc with ICC too.
Thanks. I have access to ICC, it's just more tedious to get to compared to the system-provided compilers. My main problem is time constraints: working 12h weeks and prioritizing the standardization of SIMD programming in C++ currently leaves little time for Vc maintenance.
Same observation here. I've recently implemented some worst-case SIMD scenarios (branching, nesting, early returns, while loops) in Vc for a paper where we compared different implementation strategies for these cases. Vc did well with clang and gcc, but totally failed with the Intel compiler. The benchmark code will be public soon.
Actually, I'd rather say "the Intel compiler totally failed". There are numerous issues where ICC fails to parse C++11 (or even C++98) correctly, so Vc contains workarounds specifically for ICC. After all, correctness must come first: it has to compile and produce correct results, and only then can I look at performance. This is a really frustrating chapter of Vc development. I'll investigate as soon as I can. However, I'd be very thankful for any more in-depth analysis of where the Intel compiler's optimizer fails. Then we might be able to get it fixed via a bug report to Intel and/or a workaround in Vc.
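As an illustration of what such a workaround can look like (a made-up example, not an actual Vc macro), compiler-specific fallbacks are typically gated on the compiler's version macro:

```cpp
// ICC defines __INTEL_COMPILER (e.g. 1600 for version 16.0). A header that
// must stay compilable can replace a construct ICC fails to parse with a
// simpler fallback. MYLIB_CONSTEXPR is a hypothetical macro name.
#if defined(__INTEL_COMPILER) && __INTEL_COMPILER < 1700
#define MYLIB_CONSTEXPR          // fall back: plain runtime initialization
#else
#define MYLIB_CONSTEXPR constexpr
#endif

MYLIB_CONSTEXPR int simd_float_width = 8;  // compiles either way
```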
@noma Do you mind sending a link to your paper? I'm interested in this kind of work.
@mattkretz You are right, let's blame the compiler. :-) |
@mattkretz @noma Just to let you know, using … On a separate note, Vc-1.2.0 seems to be broken with ICC-16.0.3. The library compiles and installs, but I cannot compile anything against it.
@mattkretz This seems to indicate that somewhere, somehow, the Vc headers are switching ICC into a mode where it no longer optimizes as well. Do you have any clue what this 'switch' might be? I am hoping that it is a simple 'disable optimization' flag, that the latest version of ICC no longer needs it (to compile Vc), and that removing it would then help Vc's performance. At the moment, this means that any benchmarking done using ICC shows Vc as not being competitive (at all) with other vector libraries (e.g. UME::SIMD) or even with well-auto-vectorized scalar code, and thus makes Vc unfairly look bad. (I.e. I really encourage you to find what is triggering ICC to get into this not-so-great optimization mode.) Cheers,
@mattkretz FYI, Intel representatives are using this issue to claim to their HPC customers that "this [Vc] library is no longer reliable and must be avoided at all costs", i.e. this makes Vc look very bad. So I strongly recommend re-examining this issue, if only to be able to clearly explain why 'whatever is setting ICC into its bad mode' is necessary. Thanks.
Thanks for the warning @pcanal. I still don't know exactly what is causing it. But the major difference with ICC is the implementation of scalar element aliasing onto the intrinsic vector objects. With GCC and clang I can use either vector builtins or the …
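To illustrate what "scalar element aliasing" means here, a minimal sketch (type and function names are mine; Vc's actual implementation differs in detail):

```cpp
#include <immintrin.h>

// GCC/clang path: a may_alias-qualified scalar type may legally alias the
// intrinsic vector object, so element access compiles to a direct load.
typedef float aliasing_float __attribute__((__may_alias__));

inline float element_direct(const __m256 &v, int i) {
    return reinterpret_cast<const aliasing_float *>(&v)[i];
}

// Fallback path: a round trip through memory (e.g. via a union), which is
// where a compiler can end up producing extra stack traffic.
inline float element_memory(__m256 v, int i) {
    union { __m256 vec; float f[8]; } u = {v};
    return u.f[i];
}
```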
Yes, but in addition to this, 'just' including the Vc headers, while not using Vc at all, reduces the performance of the code produced by ICC. So there is also something lurking somewhere in the header files that is 'changing the mode' of ICC globally. At least (re)discovering what that is would be very helpful, thanks.
@pcanal do you have a testcase for this? This is very strange. I don't recall any …
@amadio Can you remind us of the file and compilation options to reproduce the 'icc-is-slowed-down-by-including-Vc' problem?
Below is an easy way to reproduce the problem. Notice that the performance of the …
@mattkretz @amadio Good news for Vc. It turns out that Guilherme's conclusion was incorrect. The problem is not linked to Vc at all, but to the number of testing loops/functions in Guilherme's example. The slowdown can be reproduced simply by applying the patch below and executing:

```sh
$ CC=icc CXX=icpc cmake $SRC_DIR/VecGeom/VecCore -DBUILD_TESTING=ON -DTARGET_ISA=native -DVC=OFF
```

Note in particular that the result for the scalarwrapper backend is not even stable (120ms vs 40ms). So this problem is solely a limitation of ICC. Cheers,

```diff
diff --git a/VecCore/bench/quadratic.cc b/VecCore/bench/quadratic.cc
#ifdef VECCORE_ENABLE_VC
```
This means that Vc has no unexpected slowdown. However, as @mattkretz mentioned, Vc is still slower on ICC, whereas with GCC the Vc vector backend is in the same ballpark as the optimized scalar and intrinsics versions (see the initial post).
After investigating things a bit deeper, I agree with @pcanal: there is no slowdown in unrelated code, although Vc's performance is still worse with ICC in general. ICC generates streaming stores in some situations, which makes some code seem faster. We had identified this before, but the test was not corrected properly to eliminate these streaming stores from the benchmark; only the intrinsics version was corrected to not use streaming stores. Now I have pretty good reason (from other tests, and after inspecting the assembly generated in this test) to believe that the performance problem between ICC and Vc is a stack alignment + ABI problem (using the stack vs. registers when passing things around). The solution is probably a combination of perfect forwarding in Vc and changes to the ABI conventions for unions and structs in ICC.
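To make the streaming-store point concrete, here is a minimal sketch of the two kinds of stores (my own illustration, not the benchmark's code); whether a loop's output is written with one or the other can visibly change its timing:

```cpp
#include <immintrin.h>

// Regular store: writes through the cache hierarchy.
// (Both intrinsics require p to be 32-byte aligned.)
void store_cached(float *p, __m256 v) {
    _mm256_store_ps(p, v);
}

// Streaming (non-temporal) store: bypasses the caches. For large,
// write-only buffers this can look much faster in a benchmark, so all
// compared versions must use the same kind of store to be comparable.
void store_streaming(float *p, __m256 v) {
    _mm256_stream_ps(p, v);
}
```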
The issue I see here is that ICC inserts lots of unnecessary unaligned loads and stores into the critical path. I have no idea yet what confuses ICC that much. The ABI issue is a good hypothesis, but I have already ensured that this never breaks again: there are ABI unit tests in Vc. I.e., unless the ABI tests fail for you, you can be fairly certain that Vc vector objects passed by value are actually passed via registers.
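A simple way to check this property by hand (an illustrative snippet of mine, not one of Vc's actual ABI unit tests): compile the following with optimization and inspect the disassembly.

```cpp
#include <Vc/Vc>

// If Vc::float_v is passed and returned in registers, the AVX build of
// this function should compile to just
//     vaddps %ymm0,%ymm0,%ymm0 ; retq
// Any stack traffic here would indicate an ABI problem as discussed above.
Vc::float_v twice(Vc::float_v x) { return x + x; }
```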
This is so frustrating...

```cpp
__m256 x = {};
```

compiles to

```asm
vmovups 0x2030c(%rip),%ymm0        # 42bc20
```

instead of

```asm
vxorps %ymm0,%ymm0,%ymm0
```

Likewise, any implicit load/store from/to the stack uses unaligned moves, even though the compiler correctly ensures 32-byte alignment of the stack pointer.
And to expand the test case:

```cpp
struct Storage { __m256 data; };

Storage foo() {
    __m256 tmp = {};
    Storage s{tmp};
    s.data = _mm256_add_ps(s.data, s.data);
    asm volatile("vmovaps %0,%0" : "+x"(s.data));
    return s;
}
```

compiles to:

```asm
000000000040b8d0 <foo()>:
  40b8d0:  push    %rbp
  40b8d1:  mov     %rsp,%rbp
  40b8d4:  and     $0xffffffffffffffe0,%rsp
  40b8d8:  vmovups 0x202c0(%rip),%ymm0   # 42bba0 <tmp.119499.0.0.76>
  40b8e0:  vaddps  %ymm0,%ymm0,%ymm1
  40b8e4:  vmovups %ymm1,-0x20(%rsp)
  40b8ea:  vmovaps %ymm1,%ymm1
  40b8ee:  vmovups %ymm1,-0x20(%rsp)
  40b8f4:  vmovups -0x20(%rsp),%ymm0
  40b8fa:  mov     %rbp,%rsp
  40b8fd:  pop     %rbp
  40b8fe:  retq
```

Sorry, but that just shows how confused ICC is by any minimal abstraction on top of SIMD intrinsics. I have no idea at this point what to do other than write bug reports.

Edit: compiler flags were …
And in case the inline asm case doesn't seem motivating enough, here's another deal-breaker:

```cpp
struct Storage {
    Storage() : data(_mm256_setzero_ps()) {}
    Storage(__m256 x) : data(x) {}
    __m256 data;
};

struct Vector {
    Vector(const float *mem) { d = _mm256_load_ps(mem); }
    Storage d;
};

Vector foo() {
    const float mem[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    Vector s(mem);
    return s;
}
```

Compiles to:

```asm
000000000040b8d0 <foo()>:
  40b8d0:  push    %rbp
  40b8d1:  mov     %rsp,%rbp
  40b8d4:  and     $0xffffffffffffffe0,%rsp
  40b8d8:  vmovups 0x202c0(%rip),%ymm0   # 42bba0 <mem.119470.0.0.79>
  40b8e0:  vmovups %ymm0,-0x20(%rsp)
  40b8e6:  vmovups -0x20(%rsp),%ymm0
  40b8ec:  mov     %rbp,%rsp
  40b8ef:  pop     %rbp
  40b8f0:  retq
```

WTF. The following is what I expect:

```asm
vmovaps 0xb1a0b1a(%rip),%ymm0
retq
```

(Note that GCC5 also emits lots of function call boilerplate, but at least the AVX code boils down to a single …)
Alright, I identified one issue that I can avoid: using …
Here's another finding. Test case:

```cpp
Vector foo(Vector a, Vector b, Mask k) {
    a(k) = b;
    return a;
}
```

GCC compiles this to:

```asm
0000000000000000 <foo(Vc_1::Vector<float, Vc_1::VectorAbi::Avx>, Vc_1::Vector<float, Vc_1::VectorAbi::Avx>, Vc_1::Mask<float, Vc_1::VectorAbi::Avx>)>:
   0:  vblendvps %ymm2,%ymm1,%ymm0,%ymm0
   6:  retq
```

ICC compiles it to:

```asm
0000000000000000 <foo(Vc_1::Vector<float, Vc_1::VectorAbi::Avx>, Vc_1::Vector<float, Vc_1::VectorAbi::Avx>, Vc_1::Mask<float, Vc_1::VectorAbi::Avx>)>:
   0:  push      %rbp
   1:  mov       %rsp,%rbp
   4:  and       $0xffffffffffffffe0,%rsp
   8:  sub       $0x100,%rsp
   f:  vmovups   %ymm0,0xc0(%rsp)
  18:  lea       0xc0(%rsp),%rax
  20:  vmovups   %ymm1,0x20(%rax)
  25:  vmovups   %ymm2,(%rsp)
  2a:  vmovups   %ymm2,-0x80(%rax)
  2f:  mov       %rax,0x20(%rsp)
  34:  vmovups   0x20(%rsp),%ymm3
  3a:  vmovups   %ymm3,-0x60(%rax)
  3f:  vmovups   -0x80(%rax),%ymm4
  44:  vmovups   %ymm4,-0x40(%rax)
  49:  mov       -0x60(%rax),%rdx
  4d:  mov       %rdx,-0x20(%rax)
  51:  mov       0xa0(%rsp),%rax
  59:  vmovups   0x80(%rsp),%ymm1
  62:  vmovups   (%rax),%ymm0
  66:  vblendvps %ymm1,0xe0(%rsp),%ymm0,%ymm2
  71:  vmovups   %ymm2,(%rax)
  75:  vmovups   0xc0(%rsp),%ymm0
  7e:  mov       %rbp,%rsp
  81:  pop       %rbp
  82:  retq
```
The use of __m256[id] as default constructed arguments to the load functions works fine for GCC and clang. ICC, however, generates a dead store and thus significant overhead. Refs: gh-135 Signed-off-by: Matthias Kretz <kretz@kde.org>
ICC can do proper static propagation when a SIMD object is initialized with _mm(256)_setzero. It fails to generate proper code for __m256[id](). Refs: gh-135 Signed-off-by: Matthias Kretz <kretz@kde.org>
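In other words, an illustrative reduction of the two commit messages above (not the actual patch):

```cpp
#include <immintrin.h>

// Fine everywhere: ICC recognizes the zero idiom and emits vxorps.
__m256 zero_fast() { return _mm256_setzero_ps(); }

// Problematic with ICC 16: value-initializing the builtin vector type was
// compiled to a load from a zeroed global, and as a defaulted function
// argument it additionally produced a dead store.
__m256 zero_slow() { return __m256(); }
```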
Vc master, compiled with ICC 16.0.2:
@mzyzak FYI. You reported ICC performance issues too. Can you please retest with Vc master? |
Another FYI: Version 17.0 of the Intel compiler was released yesterday (2016-09-06). |
Thanks @noma. I'll ask for the new version to get installed on our infrastructure. |
@mattkretz Thanks for working on this. I get much better results with ICC now. |
Did you report those issues to our compiler team? |
@rolandschulz A colleague at CERN took this up for me. (I still wrote the test cases.) It was as frustrating as every time I did it myself: the support engineer on Premier Support needs hand-holding to understand, reproduce, and escalate the issue. If only reporting such issues didn't require days of work and involved more motivating feedback. :-(
Vc has much worse performance when compiled with ICC than with GCC, as shown in the example below.
The test case solves many quadratic equations, given coefficients a, b, c. The source code can be found at http://pastebin.com/hr6nPDmJ (quadratic.cc).
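For readers who don't want to open the pastebin, here is a minimal sketch of the kind of Vc kernel such a benchmark exercises (illustrative only; the actual benchmark code is in quadratic.cc linked above):

```cpp
#include <Vc/Vc>

using Vc::float_v;

// Solves a*x^2 + b*x + c = 0 for one group of SIMD lanes. Lanes with a
// negative discriminant end up with NaN roots, which keeps this sketch
// branch-free.
inline void quad_solve(const float_v &a, const float_v &b, const float_v &c,
                       float_v &x1, float_v &x2)
{
    const float_v delta = b * b - 4.f * a * c;
    const float_v r = Vc::sqrt(delta);  // NaN where delta < 0
    x1 = (-b + r) / (2.f * a);
    x2 = (-b - r) / (2.f * a);
}
```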
Testcase
Here is a session on my computer:
Notice that the Vc code is slower than the auto-vectorized code that ICC generates.
The AVX intrinsics code is provided for reference.
Note: It used to be the case that even the scalar code would see its performance degraded. While this is not true for this particular example, I still see places in which merely including Vc degrades the performance of code that is not using Vc at all when using the Intel compiler, probably due to options changed in the Vc headers.