Clarifications on comparisons with NaN #21936

kgpai · 2024-02-15T00:09:53Z

Current Behavior

Currently in Presto NaN() seems to behave like its > infinity,

select ARRAY_SORT(a) from  (values array[NAN(), INFINITY(), -INFINITY(), 1]) as T(a); 
—  [ "-Infinity", 1,  "Infinity",  "NaN" ]

select ARRAY_SORT_Desc(a) from  (values array[NAN(), INFINITY(), -INFINITY(), 1]) as T(a); 
—  ["NaN", "Infinity", 1, "-Infinity"]

This is not just limited to array_sort and you can see this behavior in order by too:

SELECT x, y from (values (1, 1.0), (2, NAN()), (3, -NAN()), (4, INFINITY()), (5, -INFINITY())) as T(x, y) order by y;

— 
5	-Infinity
1	1
4	Infinity
2	NaN
3	NaN

However running

SELECT NAN() > INFINITY()

-- False

gives a contradictory answer.

Expected Behavior

Comparison with NaN is not defined so we should confirm to https://en.wikipedia.org/wiki/NaN#Comparison_with_NaN.

Context

Velox is trying to ensure correctness with Presto Java but we are seeing inconsistencies in behavior and wanted clarifications on these.

The text was updated successfully, but these errors were encountered:

kgpai · 2024-02-15T00:10:37Z

cc: @mbasmanova @spershin @tdcmeehan @Yuhta @kagamiori

mbasmanova · 2024-02-15T00:11:52Z

CC: @kaikalur @feilong-liu

tdcmeehan · 2024-02-23T14:44:52Z

As discussed in #21877, I am not sure if it's most appropriate to follow the IEEE standard, or to follow the behavior of most engines. It appears that behavior of NaN is not defined in the spec, so we are actually free to follow whatever behavior is most intuitive to users--in my opinion, this should be consistent with that of other engines. The real requirement is to make them consistent, and document the behavior well for functions.

rschlussel · 2024-02-29T18:24:14Z

I've been looking into the NaN inconsistencies, and wanted to give an update/summary of my thinking. If we can get alignment on the solution we can start tackling the problem

reference document of presto’s treatment of NaN for comparison and ordering across all functions and operators: https://docs.google.com/spreadsheets/d/1KclsskH88CLHMiu_Q2P25S3QfPc9H3Tu7h_9MXP10Dk/edit#gid=128250498

What does IEEE-754 say about NaN

IEEE specifies that operations on NaN return false, and that comparison operations return false except for inequality which returns true. (I found a section online here after seeing references to total order: https://gist.github.com/leto/176888/e1bf1cf76d56a3da34b6be5a0f6b95139a747299)
It also specifies a total order of doubles as follows

-quietNan
-signalingNan
regular expected ordering of doubles
+signalingNaN
+quietNan

what people usually expect should be true about numbers

There are a couple concepts that people intuitively expect to be consistent:

equality/inequality and distinctness, groupings, etc.
<, >, <=, >= and sort order

Presto current behavior

comparison - basic operations follow IEEE754 (nan <> nan). joins do not match NaN keys, but distinct and group bys do consider NaN to be equal (deduplicates, considered a single group, etc.). Many map and array function do not consider NaN to be equal to itself, (will show up multiple times in maps, array distincts, unions, etc.), but the exact behavior is inconsistent.
ordering - basic operations follow IEEE754 (NaN < or > anything is false). Order by puts NaNs as biggest. min and max have inconsistent behavior. greatest and least throw an error. topn functions on arrays and maps are inconsistent, but generally sort NaN as biggest. File format min/max stats differ by file format and may also differ based on who wrote the file.

A lot of the Presto inconsistency comes from the following:
In Java, basic comparison operations on double type (<, >, =, !=, >-, <=) behave according to the IEEE754 definition. However, Double the object class has equals, compare() and hashcode all defined so that NaN are equal to each other and greater than anything else. Some of our functions and operators use the Object comparison or structures while others use the primitive definition (and some probably use a combo). There are still other operators that have special handling for nulls. Additionally, the primitive definition itself can lead to bugs when used for sorting if nan is not handled separately (e.g. #22040).

what to do:

How should we decide what the behavior of NaN should be?

useful and reasonably intuitive behavior for users
reasonably standard
reasonable to implement and maintain as new functions and features are added.

Options

IEEE definition without total order - Any basic comparisons with NaN return false, NaN<> NaN is true. NaN are not de-distincted. min, max etc. return nan. Would throw an error for sorting.
Pros:
a) IEEE754 compliant
cons:
a) not useful behavior for users if we throw an error for order by queries when input contains nan. (and possibly not sql compliant?)
b) implementation might be tricky. and keeping consistency might be tricky.
IEEE compliant with total order - Any basic comparisons with NaN return false, NaN<> NaN is true. NaN are not de-distincted. unclear what min/max should do. For sorting uses total order
Pros:
a) IEEE754 compliant
cons:
a) not intuitive that basic comparison operations don’t match sort order. operations like min/max or topn don't obviously fall into distinctions between sort and comparison
b) might be surprising to users that nans wouldn’t be deduplicated for distinct, group by, etc.
c) having comparison operations behave differently than sort makes this prone to inconsistency/bugs.
NaN=NaN and NaN is biggest.
Pros:
a) many other DBs use this definition
b) easy to maintain consistency across different kinds of functions and operations because of consistency with expectations that <> and distinct are the same and > is the same as > for the purposes of sorting.
c) behavior is reasonably useful for users
Cons:
a) not IEEE compliant

Conclusion

I think option 3 (nan=nan and sorts largest) will be simplest to implement and maintain consistently, is reasonable behavior for users, and is aligned with many other dbs, and therefore reasonably standard even though it's not IEEE compliant.

Yuhta · 2024-02-29T20:13:54Z

@rschlussel Not sure what users we have in mind but from a numerical computing background, treating NaN as a normal number is very counter intuitive (making it larger than Inf is even insane, how could something be larger than Inf?!). And it definitely will introduce more bugs and harder to maintain than option 1 (throwing is always the safest way; not alway the most friendly though).

Assume we want to go with option 3, do we want to fix the comparison operations to reflect this new definition of NaN?

rschlussel · 2024-03-01T16:31:17Z

@rschlussel Not sure what users we have in mind but from a numerical computing background, treating NaN as a normal number is very counter intuitive (making it larger than Inf is even insane, how could something be larger than Inf?!). And it definitely will introduce more bugs and harder to maintain than option 1 (throwing is always the safest way; not alway the most friendly though).

Assume we want to go with option 3, do we want to fix the comparison operations to reflect this new definition of NaN?

That's a great point. I guess I was assuming that most users are not using the db for very mathematical use cases, but mostly nan are bad data and they are doing basic things like getting the top x by price. But it would be a good idea to ask some customers their preferences as well.

I am very hesitant about option 1. I think throwing an exception would likely be a bad user experience in most cases, especially given that there is nothing mandating we do that, and it seems like most dbs don't.
I tried to see what the sql spec had to say about it. It says

If PVi and QVi are not the null value and the result of “PVi QVi” is Unknown, then the relative ordering of PVi and QVi is implementation-dependent."

That to me suggests that for NaN values (where by IEEE comparisons are considered unordered), the sort order would be implementation dependent, but it doesn't seem like it should throw an exception.

For option 3, yes, we would redefine comparison operations to reflect the new definition of NaN (what makes it simple is that notions of distinctnesss, equality/inequality, ordering comparisons, and sorting are all behaving the same. The definition isn't standard to what you'd expect in mathematical computing, but it is internally consistent and easy to understand).

Yuhta · 2024-03-01T18:10:39Z

@rschlussel OK then at least we can have some consistent design in this space. The implementation would be messy though, because you will need to normalize all the different NaNs to the same place as +NaN on every newly generated floating point blocks if you want it to be efficient, and need to be careful whenever doing floating number comparison, we need to use the custom comparator instead of the built in operators.

mbasmanova · 2024-03-05T17:51:09Z

@czentgr proposed the following behavior for Velox: facebookincubator/velox#7237

rschlussel · 2024-03-05T20:53:32Z

@czentgr proposed the following behavior for Velox: facebookincubator/velox#7237

Thank you! It sounds like we are in agreement then. Let me collect approvals so I can move forward with this as well.

tdcmeehan · 2024-03-05T21:07:23Z

I think option 3 (nan=nan and sorts largest) will be simplest to implement and maintain consistently, is reasonable behavior for users, and is aligned with many other dbs, and therefore reasonably standard even though it's not IEEE compliant.

Agreed on this--this also seems most consistent with other engines, and it's easy to document as we see above.

czentgr · 2024-03-05T22:03:21Z

Thanks for the comments on PR facebookincubator/velox#7237.
I will add the requested information and check.

kagamiori · 2024-03-19T17:54:09Z

Hi @rschlussel, not sure if you're already aware of this, but I found another "inconsistent" handling of NaN in Presto: distinct aggregation considers two NaN to be duplicates, while set_agg() function considers two NaN to be distinct.

SELECT
    SET_AGG(c0)
FROM (
    VALUES
        (1.1),
        (NAN()),
        (NAN())
) t(c0); --- result is [1.1, NaN, NaN]

SELECT DISTINCT
    c0
FROM (
    VALUES
        (1.1),
        (NAN()),
        (NAN())
) t(c0); --- result are two rows: 1.1 and NaN

Since the Presto documentation of set_agg says "Returns an array created from the distinct input x elements", I expect these two queries to treat NaN in the same way.

rschlussel · 2024-03-19T19:16:19Z

Thank you, yes, i did know about this difference. I made a wiki about how we handle nans across all of our functions that do comparison or ordering on doubles https://github.com/prestodb/presto/wiki/Presto-NaN-behavior

kgpai added the bug label Feb 15, 2024

kgpai mentioned this issue Feb 24, 2024

Ensure Spark Aggregation/Expression Fuzzers work with PR : #8732 facebookincubator/velox#8739

Open

rschlussel mentioned this issue Feb 28, 2024

map_top_n returns wrong results if NaN appears in the input #22040

Closed

rschlussel mentioned this issue Feb 29, 2024

Aggregate functions Max/Min return different results based on order of NaN for floating point types. #21877

Closed

kagamiori mentioned this issue Feb 29, 2024

Fix equality check for simple floating types in RowContainer facebookincubator/velox#7780

Closed

mbasmanova mentioned this issue Mar 5, 2024

Document floating point semantics facebookincubator/velox#7237

Closed

Real-Chen-Happy mentioned this issue Apr 2, 2024

Greatest/Least Function Check NaN on REAL data types? #22391

Closed

rschlussel mentioned this issue May 31, 2024

Implement new NaN behavior #22386

Merged

6 tasks

elharo closed this as completed Jul 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarifications on comparisons with NaN #21936

Clarifications on comparisons with NaN #21936

kgpai commented Feb 15, 2024

kgpai commented Feb 15, 2024

mbasmanova commented Feb 15, 2024

tdcmeehan commented Feb 23, 2024

rschlussel commented Feb 29, 2024

Yuhta commented Feb 29, 2024 •

edited

Loading

rschlussel commented Mar 1, 2024

Yuhta commented Mar 1, 2024 •

edited

Loading

mbasmanova commented Mar 5, 2024 •

edited

Loading

rschlussel commented Mar 5, 2024

tdcmeehan commented Mar 5, 2024

czentgr commented Mar 5, 2024

kagamiori commented Mar 19, 2024 •

edited

Loading

rschlussel commented Mar 19, 2024

Clarifications on comparisons with NaN #21936

Clarifications on comparisons with NaN #21936

Comments

kgpai commented Feb 15, 2024

Current Behavior

Expected Behavior

Context

kgpai commented Feb 15, 2024

mbasmanova commented Feb 15, 2024

tdcmeehan commented Feb 23, 2024

rschlussel commented Feb 29, 2024

What does IEEE-754 say about NaN

what people usually expect should be true about numbers

Presto current behavior

what to do:

How should we decide what the behavior of NaN should be?

Options

Conclusion

Yuhta commented Feb 29, 2024 • edited Loading

rschlussel commented Mar 1, 2024

Yuhta commented Mar 1, 2024 • edited Loading

mbasmanova commented Mar 5, 2024 • edited Loading

rschlussel commented Mar 5, 2024

tdcmeehan commented Mar 5, 2024

czentgr commented Mar 5, 2024

kagamiori commented Mar 19, 2024 • edited Loading

rschlussel commented Mar 19, 2024

Yuhta commented Feb 29, 2024 •

edited

Loading

Yuhta commented Mar 1, 2024 •

edited

Loading

mbasmanova commented Mar 5, 2024 •

edited

Loading

kagamiori commented Mar 19, 2024 •

edited

Loading