-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarifications on comparisons with NaN #21936
Comments
As discussed in #21877, I am not sure if it's most appropriate to follow the IEEE standard, or to follow the behavior of most engines. It appears that behavior of NaN is not defined in the spec, so we are actually free to follow whatever behavior is most intuitive to users--in my opinion, this should be consistent with that of other engines. The real requirement is to make them consistent, and document the behavior well for functions. |
I've been looking into the NaN inconsistencies, and wanted to give an update/summary of my thinking. If we can get alignment on the solution we can start tackling the problem reference document of presto’s treatment of NaN for comparison and ordering across all functions and operators: https://docs.google.com/spreadsheets/d/1KclsskH88CLHMiu_Q2P25S3QfPc9H3Tu7h_9MXP10Dk/edit#gid=128250498 What does IEEE-754 say about NaNIEEE specifies that operations on NaN return false, and that comparison operations return false except for inequality which returns true. (I found a section online here after seeing references to total order: https://gist.github.com/leto/176888/e1bf1cf76d56a3da34b6be5a0f6b95139a747299)
what people usually expect should be true about numbersThere are a couple concepts that people intuitively expect to be consistent:
Presto current behavior
A lot of the Presto inconsistency comes from the following: what to do:How should we decide what the behavior of NaN should be?
Options
ConclusionI think option 3 (nan=nan and sorts largest) will be simplest to implement and maintain consistently, is reasonable behavior for users, and is aligned with many other dbs, and therefore reasonably standard even though it's not IEEE compliant. |
@rschlussel Not sure what users we have in mind but from a numerical computing background, treating NaN as a normal number is very counter intuitive (making it larger than Inf is even insane, how could something be larger than Inf?!). And it definitely will introduce more bugs and harder to maintain than option 1 (throwing is always the safest way; not alway the most friendly though). Assume we want to go with option 3, do we want to fix the comparison operations to reflect this new definition of NaN? |
That's a great point. I guess I was assuming that most users are not using the db for very mathematical use cases, but mostly nan are bad data and they are doing basic things like getting the top x by price. But it would be a good idea to ask some customers their preferences as well. I am very hesitant about option 1. I think throwing an exception would likely be a bad user experience in most cases, especially given that there is nothing mandating we do that, and it seems like most dbs don't.
That to me suggests that for NaN values (where by IEEE comparisons are considered unordered), the sort order would be implementation dependent, but it doesn't seem like it should throw an exception. For option 3, yes, we would redefine comparison operations to reflect the new definition of NaN (what makes it simple is that notions of distinctnesss, equality/inequality, ordering comparisons, and sorting are all behaving the same. The definition isn't standard to what you'd expect in mathematical computing, but it is internally consistent and easy to understand). |
@rschlussel OK then at least we can have some consistent design in this space. The implementation would be messy though, because you will need to normalize all the different NaNs to the same place as +NaN on every newly generated floating point blocks if you want it to be efficient, and need to be careful whenever doing floating number comparison, we need to use the custom comparator instead of the built in operators. |
@czentgr proposed the following behavior for Velox: facebookincubator/velox#7237 |
Thank you! It sounds like we are in agreement then. Let me collect approvals so I can move forward with this as well. |
Agreed on this--this also seems most consistent with other engines, and it's easy to document as we see above. |
Thanks for the comments on PR facebookincubator/velox#7237. |
Hi @rschlussel, not sure if you're already aware of this, but I found another "inconsistent" handling of NaN in Presto: distinct aggregation considers two NaN to be duplicates, while set_agg() function considers two NaN to be distinct.
Since the Presto documentation of set_agg says "Returns an array created from the distinct input x elements", I expect these two queries to treat NaN in the same way. |
Thank you, yes, i did know about this difference. I made a wiki about how we handle nans across all of our functions that do comparison or ordering on doubles https://github.com/prestodb/presto/wiki/Presto-NaN-behavior |
Current Behavior
Currently in Presto NaN() seems to behave like its > infinity,
This is not just limited to array_sort and you can see this behavior in order by too:
However running
gives a contradictory answer.
Expected Behavior
Comparison with NaN is not defined so we should confirm to https://en.wikipedia.org/wiki/NaN#Comparison_with_NaN.
Context
Velox is trying to ensure correctness with Presto Java but we are seeing inconsistencies in behavior and wanted clarifications on these.
The text was updated successfully, but these errors were encountered: