
Wrong optimization: instability of x87 floating-point results leads to nonsense #44218

Open
llvmbot opened this issue Feb 11, 2020 · 20 comments
Labels
backend:X86, bugzilla (Issues migrated from bugzilla), confirmed (Verified by a second party), floating-point (Floating-point math)

Comments

@llvmbot
Collaborator

llvmbot commented Feb 11, 2020

Bugzilla Link: 44873
Version: trunk
OS: Linux
Reporter: LLVM Bugzilla Contributor
CC: @topperc, @LebedevRI, @RKSimon, @rotateright, @vinc17fr

Extended Description

x87 floating-point results are effectively unstable due to possible excess precision. Without extra care, this instability can taint everything around it and lead to nonsense.

Instability is not limited to FP numbers; it extends to integers too:

#include <stdio.h>

__attribute__((noipa,optnone)) // imagine it in a separate TU
static int opaque(int i) { return i; }

int main()
{
    int z = opaque(1) + 0x1p-60 == 1;

    printf("z = %d\n", z);
    if (z) 
        puts("z is one");
}
$ clang -std=c11 -pedantic -Wall -Wextra -Wno-unknown-attributes -m32 -march=i686 -O3 test.c && ./a.out
z = 0
z is one
----------------------------------------------------------------------
clang x86-64 version: clang version 11.0.0 (https://github.com/llvm/llvm-project 14ecbd7b8ded18af6c95f6a9957da541d1ec0e80)
----------------------------------------------------------------------
@llvmbot
Collaborator Author

llvmbot commented Feb 11, 2020

And instability of integers then easily taints surrounding code:

#include <stdio.h>

__attribute__((optnone)) // imagine it in a separate TU
static int opaque(int i) { return i; }

static void f(int a)
{
    int z = opaque(0);

    printf("z = %d\n", z);
    if (z == a && a == 1)
        printf("z = %d\n", z);
}

int main()
{
    f(opaque(1) + 0x1p-60 == 1);
}
$ clang -std=c11 -pedantic -Wall -Wextra -m32 -march=i686 -O3 test.c && ./a.out
z = 0
z = 1
----------------------------------------------------------------------
clang x86-64 version: clang version 11.0.0 (https://github.com/llvm/llvm-project 14ecbd7b8ded18af6c95f6a9957da541d1ec0e80)
----------------------------------------------------------------------

gcc bug -- https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93681

@LebedevRI
Member

LebedevRI commented Feb 11, 2020

int z = opaque(1) + 0x1p-60 == 1;

0x1p-60 there is just a double
https://godbolt.org/z/mJS3RN
It's as expected if it is long double: https://godbolt.org/z/NVvJht

@llvmbot
Collaborator Author

llvmbot commented Feb 11, 2020

int z = opaque(1) + 0x1p-60 == 1;

0x1p-60 there is just a double
https://godbolt.org/z/mJS3RN
It's as expected if it is long double: https://godbolt.org/z/NVvJht

Yeah, 1 + 0x1p-60 is the simplest expression I could think of where both operands fit into a double exactly but the result doesn't.
x87 evaluates it in long double and gets the exact result. But if the result is spilled to memory (as a double), it's rounded to just 1.

If you force the expression to a genuine long double (e.g., by casting either operand to it), then everything becomes stable: the result in a register and in memory has the same type and the same value. The same happens with SSE: the result is stored as a double everywhere, so there are no discrepancies.
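
For illustration, here is a minimal variation of the original test case (the cast is the only change) that should print z = 0 consistently:

#include <stdio.h>

__attribute__((optnone)) // stand-in for a separate TU, as in the original report
static int opaque(int i) { return i; }

int main()
{
    // With an explicit long double operand, the register copy and any memory
    // spill have the same type and value, so the comparison is stable.
    int z = (long double)opaque(1) + 0x1p-60 == 1;

    printf("z = %d\n", z);
    if (z)
        puts("z is one"); // never printed: z is consistently 0
}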

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
@Muon

Muon commented Feb 16, 2023

This issue in particular should be resolvable by either always storing the result to memory before performing a comparison, or by setting the x87 FPU's precision to the destination's precision.
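
For reference, a source-level sketch of the first option, applied by hand to the original test case (the volatile double slot is my addition; a compiler-level fix would insert the equivalent store/reload itself):

#include <stdio.h>

__attribute__((optnone)) // stand-in for a separate TU
static int opaque(int i) { return i; }

int main()
{
    // Forcing the sum through a double-sized memory slot discards the excess
    // x87 precision before the comparison.
    volatile double sum = opaque(1) + 0x1p-60; // rounds to 1.0 on the store
    int z = sum == 1;

    printf("z = %d\n", z); // consistently 1
    if (z)
        puts("z is one");
}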

@Muon

Muon commented Oct 21, 2023

I'm trying to fix the contradictory control flow here by promoting f32 and f64 ops to intel_fp80 on the x87. This wouldn't make the numerical results conform to IEEE 754, but it would at least make them consistent. However, I'm running into trouble because it appears that LLVM doesn't actually properly support promotion for floating-point operations. I tried removing the f32/f64 registers from the x87, setting the operations to Promote, and then fiddling with this part of TargetLoweringBase.cpp:

if (!isTypeLegal(MVT::f32)) {
  NumRegistersForVT[MVT::f32] = NumRegistersForVT[MVT::i32];
  RegisterTypeForVT[MVT::f32] = RegisterTypeForVT[MVT::i32];
  TransformToType[MVT::f32] = MVT::i32;
  ValueTypeActions.setTypeAction(MVT::f32, TypeSoftenFloat);
}

But now I'm getting errors because it's trying to promote constants to intel_fp80, and I don't know how to tell it that the x87 can load those from memory as a widening conversion.

Can anyone point me in the right direction? Pinging @d0k @topperc since they were the last to touch that part of the code.

@Muon

Muon commented Nov 23, 2023

It is sufficient to read from a global variable to reproduce this (godbolt):

#include <stdio.h>

float c = 1.0f;

static void f(int z) {
    printf("z = %d\n", z);
    if (z) 
        printf("z = %d\n", z);
}

int main(void) {
    f(c + 0x1p-52f == c);
}

This version should reproduce on Windows as well.

@Endilll added the confirmed (Verified by a second party) and floating-point (Floating-point math) labels Nov 23, 2023
@hvdijk
Contributor

hvdijk commented Nov 28, 2023

I'm trying to fix the contradictory control flow here by promoting f32 and f64 ops to intel_fp80 on the x87.

I'm looking at this and realising that it can avoid the instability, but cannot generate correct results: for f32 it can work, but for f64 it cannot. Because intel_fp80 has less than double the precision of f64, it is possible to get double rounding errors, where adding two f64 values gives a different result than promoting both to intel_fp80, performing the addition there, and then rounding back to f64. It would still be an improvement, but the work needed to get it done might have to be thrown out again later.
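
Here is a concrete instance (my own example, using the standard double-rounding construction; it assumes SSE doubles for the direct computation, e.g. x86-64, and an 80-bit long double). The exact sum lies just above an f64 halfway point, so correct rounding goes up, but rounding first to fp80's 64-bit significand lands exactly on the halfway point, and the second rounding then goes down to even:

#include <stdio.h>

int main(void)
{
    double a = 1.0;
    double b = 0x1.0000000000001p-53;      // 2^-53 + 2^-105, exactly representable

    // Exact sum: 1 + 2^-53 + 2^-105.
    double direct = a + b;                 // correctly rounded: 1 + 2^-52
    long double ext = (long double)a + b;  // fp80 rounds this to 1 + 2^-53
    double twice = (double)ext;            // halfway case, ties-to-even: 1.0

    printf("direct = %a\ntwice  = %a\n", direct, twice);
    // Expected: direct = 0x1.0000000000001p+0, twice = 0x1p+0
}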

@Muon

Muon commented Nov 28, 2023

Yes, I'm only trying to fix the control flow issue for now. Actually getting correct results will involve a performance hit, and my previous inquiries suggest that this would only be merged if it was behind a compiler flag.

Also, are you thinking of products? Twice the precision is enough to represent products exactly. However, I'm almost certain you need much more than twice the precision to make double rounding give correct results for addition, if it's possible at all. (I don't have a specific counterexample on hand, but I can probably come up with something if I fiddle with the exponents and rounding breakpoints...)

@hvdijk
Contributor

hvdijk commented Nov 29, 2023

Yes, I'm only trying to fix the control flow issue for now. Actually getting correct results will involve a performance hit, and my previous inquiries suggest that this would only be merged if it was behind a compiler flag.

I do not see how it's possible to fix even that without a performance hit in general, but I can imagine that this hit may turn out to be lower than that of any alternative.

Also, are you thinking of products? Twice the precision is enough to represent products exactly. However, I'm almost certain you need much more than twice the precision to make double rounding give correct results for addition, if it's possible at all. (I don't have a specific counterexample on hand, but I can probably come up with something if I fiddle with the exponents and rounding breakpoints...)

For addition, first consider round-to-nearest, since it is the most complicated rounding mode. For double rounding to happen under round-to-nearest and produce a wrong result, the round to extended precision must be inexact and must produce a value exactly at a halfway point of the reduced precision, which the second rounding then rounds the wrong way. But producing a value exactly at a halfway point requires the most significant extended-precision bit to be set after the round to extended precision, which requires either that bit or the next one to be set in the infinitely precise original result. Meanwhile, for the round to extended precision to be inexact, a bit below the extended-precision bits must be set in the infinitely precise result. These cannot both be true when performing an addition if the extended precision adds sufficiently many bits. I'm not sure exactly how many extra bits of precision are needed, but it should be little more than twice the original precision.

The other rounding modes are much simpler: each is either round toward zero or round away from zero (round up and round down are one of those two, depending on the sign). Round toward zero just discards the extra precision bits, and round away from zero, if inexact, ensures that at least one of the extra precision bits is set, so that the second conversion rounds correctly. I don't believe double rounding should ever be an issue with fp32 additions promoted to fp80, in any rounding mode supported by the x87. Happy to be corrected if wrong, of course.
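
As a quick spot-check of the fp32 claim (my own harness, random sampling rather than a proof; it assumes SSE is used for the direct float arithmetic, e.g. on x86-64, and an 80-bit long double):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    srand(1);
    for (int i = 0; i < 10000000; i++) {
        // Draw two random bit patterns and reinterpret them as floats.
        unsigned ua = ((unsigned)rand() << 16) ^ (unsigned)rand();
        unsigned ub = ((unsigned)rand() << 16) ^ (unsigned)rand();
        float a, b;
        memcpy(&a, &ua, sizeof a);
        memcpy(&b, &ub, sizeof b);

        float direct = a + b;                                    // single rounding
        float twice = (float)((long double)a + (long double)b);  // two roundings

        if (direct != twice && direct == direct) { // skip NaN results
            printf("mismatch: %a + %a\n", a, b);
            return 1;
        }
    }
    puts("no mismatches in sample");
}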

That said, this is somewhat beside the point, which is not how much precision makes addition safe in extended precision; it's that fp80 is definitely not precise enough for fp64 operations to be safely promoted to fp80.

The only way I think we can get correct results is to let LLVM alter the x87 control word, setting the precision to what each operation needs. And I actually suspect that, done right, this will have a lower performance impact in code that does not mix different FP types, because LLVM can set the precision once at the start of a function, perform all the operations it needs, and restore the precision at the end, whereas leaving the x87 at its default precision would require extra instructions after every operation to ensure correct rounding. Code that does mix different FP types would need the control word changed throughout the function, though.
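
A minimal sketch of that control-word manipulation at the source level (using glibc's <fpu_control.h>; real codegen would emit the equivalent fnstcw/fldcw directly):

#include <fpu_control.h>

void f64_region(void)
{
    fpu_control_t saved, pc53;
    _FPU_GETCW(saved);                             // fnstcw
    pc53 = (saved & ~_FPU_EXTENDED) | _FPU_DOUBLE; // precision control = 53 bits
    _FPU_SETCW(pc53);                              // fldcw

    /* ... f64 operations: significands now round to 53 bits ... */

    _FPU_SETCW(saved);                             // restore on the way out
}

Note that this only fixes the significand width; as pointed out below, the exponent range of the registers is still that of fp80.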

@Muon

Muon commented Nov 29, 2023

I agree that a true fix would be immensely preferable. However, I want to add that it's not sufficient to set the control word to the right precision, because that doesn't modify the exponent range of the x87 registers. You still need to do extra ops to ensure correct handling of underflow, subnormals, and overflow.

Also, yes, there is a performance cost to promoting all ops to fp80. Since changing the control word plays havoc with pipelining, I suspect this is cheaper, but only profiling will tell.

(Also, indeed, double rounding is never a problem for directed rounding modes.)

@hvdijk
Contributor

hvdijk commented Nov 29, 2023

However, I want to add that it's not sufficient to set the control word to the right precision, because that doesn't modify the exponent range of the x87 registers.

Thanks for that addition. You are correct, we need that as well.

@RalfJung
Contributor

I'm looking at this and realising that it can avoid the instability, but cannot generate correct results: for f32 it can work, but for f64 it cannot. Because intel_fp80 has less than double the precision of f64, it is possible to get double rounding errors, where adding two f64 values gives a different result than promoting both to intel_fp80, performing the addition there, and then rounding back to f64. It would still be an improvement, but the work needed to get it done might have to be thrown out again later.

Yeah, a naive translation will give wrong results. But getting exact results does seem to be possible; Java supposedly emulates fp64 precisely on the x87. See here and here for details.

Also, are you thinking of products? Twice the precision is enough to represent products exactly. However, I'm almost certain you need much more than twice the precision to make double rounding give correct results for addition, if it's possible at all. (I don't have a specific counterexample on hand, but I can probably come up with something if I fiddle with the exponents and rounding breakpoints...)

There's a general theorem that twice the precision is good enough, AFAIK. Here's the paper for that: https://hal.science/hal-01091186/document

@Muon

Muon commented Nov 30, 2023

There's a general theorem that twice the precision is good enough, AFAIK. Here's the paper for that: https://hal.science/hal-01091186/document

Fantastic, thank you for this. Reading through it, the extended format needs to have 2p + 2 digits of precision to guarantee innocuous double rounding for all of (+, -, *, /, sqrt). I've also now got a little test case to demonstrate that double rounding doesn't work for emulating f64 addition using f80 addition: https://godbolt.org/z/j11xbqzr3

I've also come up with a spill-free testcase which miscompiles (godbolt). The following program should always exit with status code 0, but when compiled with -m32 -mno-sse and -O1 or higher, it exits with code 127:

int main(void) {
    volatile float c = 1.0f;
    float d = c + 0x1p-52f;
    c = d; // ensure that, regardless of the FP semantics, we should have c == d after this
    return c == d ? 0 : 127;
}

Accordingly, the following program should never segfault, but it does when compiled with -m32 -mno-sse and -O1 or higher (godbolt):

int* p = 0;

int sane(void) {
    volatile float c = 1.0f;
    float d = c + 0x1p-52f;
    c = d; // ensure that, regardless of the FP semantics, we should have c == d after this
    return c == d;
}


int main(void) {
    if (!sane()) {
        *p = 1;
    }
}

These aren't quite the same bug as initially reported: there's no contradictory control flow here. In fact, I don't think it's possible for contradictory control flow to happen unless there is a spill: if there are no spills, then all variables are still in registers, so repeating a comparison will give the same result as it did the first time.

These testcases also show that promoting all ops to intel_fp80 isn't necessarily a fix either, unless extra care is taken to reload values after a narrowing store to memory. If these promotions mean that d retains excess precision until the comparison (after all, it hasn't changed since its initial assignment), then c == d will still be false, because c = d forces a narrowing store to memory since c is volatile.

I think, in fact, that might be the crux of the matter: an extended precision value, once created, may either be narrowed, or used otherwise, but not both on the same path. If this is not the case, then there may be conflicts, where one part of the code sees the extended version of that value, and another part sees the narrowed version.

@jcranmer-intel
Contributor

Yeah a naive translation will give wrong results. But getting the exact results seems to be possible; Java is supposedly precisely emulating fp64 on the x87. See here and here for details.

The emulation works by using the precision control set to 53 bits (and using essentially fptrunc to clamp the resulting exponent). Which does make it trickier if you want to also support x86_fp80 at the same time, since now you have to juggle the FCW changes interspersed with FP instructions, and that requires strictfp if you're doing it at the level of LLVM IR.
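
In source-level terms, the clamp after each operation amounts to a store/reload through a double-sized slot (my sketch, not the actual Java sequence; note that this naive version still double-rounds in the subnormal range, which is what the more elaborate sequences discussed below have to address):

// Assumes the x87 precision control has already been set to 53 bits.
static double add_pc53(double a, double b)
{
    volatile double clamp = a + b; // fstp/fld pair clamps the exponent range
    return clamp;
}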

There's a general theorem that twice the precision is good enough, AFAIK. Here's the paper for that: https://hal.science/hal-01091186/document

In the interests of pedantry, you also need the exponent range to be slightly larger in the second type (two extra bits is sufficient, I believe). It might be worth having a helper method in fltSemantics that returns whether one type can emulate another without double rounding.
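
A sketch of what such a helper might check, in plain C (the struct and names are hypothetical, not LLVM's fltSemantics API; the precision condition is the 2p + 2 bound from the paper, and the exponent condition is a rough stand-in for the "slightly larger range" caveat above, which needs care around subnormals):

#include <stdbool.h>

struct fp_format {
    int precision; // significand bits, including the implicit bit
    int min_exp;   // smallest normal exponent
    int max_exp;   // largest exponent
};

// True if every 'narrow' operation can be emulated by performing it in
// 'wide' and rounding the result back, without double-rounding errors.
static bool emulates_without_double_rounding(struct fp_format wide,
                                             struct fp_format narrow)
{
    return wide.precision >= 2 * narrow.precision + 2
        && wide.min_exp <= 2 * narrow.min_exp  // rough exponent headroom
        && wide.max_exp >= 2 * narrow.max_exp;
}

For f32 (p = 24) emulated in fp80 (p = 64) the precision condition holds (64 >= 50); for f64 (p = 53) it fails (64 < 108), matching the discussion above.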

@RalfJung
Contributor

RalfJung commented Dec 5, 2023

The emulation works by using the precision control set to 53 bits (and using essentially fptrunc to clamp the resulting exponent). Which does make it trickier if you want to also support x86_fp80 at the same time, since now you have to juggle the FCW changes interspersed with FP instructions, and that requires strictfp if you're doing it at the level of LLVM IR.

Yeah I was assuming this would have to be done in the backend. But still it's probably not easy; I just wanted to raise the possibility.

@jyknight
Member

jyknight commented Dec 8, 2023

If LLVM were to set the precision control to 53-bits for operations on double values, I think it'll need to restore the mode before every external function call, as well as before executing any operation on an fp80 value.

And even that only enables the implementation of the complex code sequences listed in https://open-std.org/JTC1/SC22/JSG/docs/m3/docs/jsgn326.pdf which IMO are themselves well beyond what's worthwhile to implement. Anyone really wanting that might be better off implementing a correctly-rounded soft-float library which uses x87 instructions to accelerate its emulation, instead of putting that into LLVM directly.

@jcranmer-intel
Contributor

Anyone really wanting that might be better off implementing a correctly-rounded soft-float library which uses x87 instructions to accelerate its emulation, instead of putting that into LLVM directly.

I was going to argue against this, but on further reflection, the main value-add you would get from LLVM would be supporting optimizations on the FP control word in conjunction with an inlineable soft-float library, such that operations can be bucketed into those requiring the same FP control word bits and then ordered so as to minimize flag transitions. Such optimizations could, I think, be useful in other situations (do GPU libm libraries have flag-twiddling operations that might become omittable?), but there is still a lot of work needed to enable them, and if x87 is the only thing that benefits... it's not the highest-priority thing to solve.

@Muon

Muon commented Dec 21, 2023

And even that only enables the implementation of the complex code sequences listed in https://open-std.org/JTC1/SC22/JSG/docs/m3/docs/jsgn326.pdf which IMO are themselves well beyond what's worthwhile to implement. Anyone really wanting that might be better off implementing a correctly-rounded soft-float library which uses x87 instructions to accelerate its emulation, instead of putting that into LLVM directly.

In a certain sense, those code sequences are exactly what an x87-accelerated softfloat library would be. That said, it's not clear to me that this is too complex to put into LLVM. Each sequence is only 6-7 instructions. What's so bad about that?

@Muon

Muon commented Apr 23, 2024

We found another possibly-related miscompilation at rust-lang/rust#114479 (comment). This one triggers a segfault from a safe Rust program by exploiting the fact that signaling NaNs get quieted upon being loaded into an x87 register. Furthermore, it does not use bitcasting, and it also does not include any NaNs per se. Instead, LLVM merrily loads an integer with an SNaN bit pattern into a floating-point register.
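
A self-contained C illustration of the underlying quieting behavior (my own reduction, not the Rust test case; it assumes an i386 build where floats are returned in st(0), e.g. -m32, with FP exceptions masked):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

__attribute__((optnone)) // keep the value out of the optimizer's hands
static float through_x87(float x) { return x; } // i386 ABI returns floats in st(0)

int main(void)
{
    uint32_t snan = 0x7fa00000; // float SNaN: quiet bit (bit 22) clear, payload nonzero
    float f;
    memcpy(&f, &snan, sizeof f);

    f = through_x87(f); // fld of a single-precision SNaN yields the quieted NaN

    uint32_t out;
    memcpy(&out, &f, sizeof out);
    printf("0x%08x\n", (unsigned)out); // typically 0x7fe00000, not 0x7fa00000
}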

@arsenm
Contributor

arsenm commented Apr 23, 2024

LLVM merrily loads an integer with an SNaN bit pattern into a floating-point register.

This sounds like a different issue altogether
