Include NaN in Parquet stats (again) #264

Closed · crepererum opened this issue May 6, 2021 · 19 comments

@crepererum (Contributor)
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
After #256 we completely ignore NaNs in Parquet statistics. However, there are good reasons to fully order floats and include NaN somewhere.

Describe the solution you'd like
Put NaNs at the end of the float scale, so the order is:

  1. -inf
  2. "ordinary" numbers
  3. +inf
  4. NaN
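
A minimal Rust sketch of a comparator implementing the proposed order (a plain `f64` helper for illustration, not tied to any specific arrow-rs API):

```rust
use std::cmp::Ordering;

/// Total order per this proposal: -inf < "ordinary" numbers < +inf < NaN,
/// with all NaNs comparing equal to each other.
fn cmp_nan_last(a: f64, b: f64) -> Ordering {
    match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater, // NaN sorts after +inf
        (false, true) => Ordering::Less,
        // partial_cmp never fails once NaNs are excluded
        (false, false) => a.partial_cmp(&b).unwrap(),
    }
}
```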

Describe alternatives you've considered
Keeping the status quo (aka current master).

Additional context
None.

@crepererum changed the title from "NaN in Parquet stats (again)" to "Include NaN in Parquet stats (again)" on May 6, 2021
@NGA-TRAN (Contributor) commented May 6, 2021

I think a few links to related documents from other DBs will help us feel more confident that our customers won't ask us to do it differently in the future. A few names: SQL standard (SQL99, ...), PG, Oracle, Snowflake, Vertica.

@crepererum (Contributor, Author) commented May 6, 2021

Prior Art

PostgreSQL

In most implementations of the “not-a-number” concept, NaN is not considered equal to any other numeric value (including NaN). In order to allow numeric values to be sorted and used in tree-based indexes, PostgreSQL treats NaN values as equal, and greater than all non-NaN values.

https://www.postgresql.org/docs/13/datatype-numeric.html

=> Follows this proposal

CockroachDB

Follows PostgreSQL, see cockroachdb/cockroach#18860

=> Follows this proposal

Oracle

The floating-point value NaN is greater than any other numeric value and is equal to itself.

https://docs.oracle.com/database/121/TTSQL/types.htm#TTSQL165

=> Follows this proposal

Snowflake

Comparison semantics for 'NaN' differ from the IEEE 754 standard in the following ways:

| Condition | Snowflake | IEEE 754 | Comment |
|-----------|-----------|----------|---------|
| 'NaN' = 'NaN' | TRUE | FALSE | In Snowflake, 'NaN' values are all equal. |
| 'NaN' > X, where X is any FLOAT value, including infinity (other than NaN itself) | TRUE | FALSE | Snowflake treats 'NaN' as greater than any other FLOAT value, including infinity. |

https://docs.snowflake.com/en/sql-reference/data-types-numeric.html#float-float4-float8

=> Follows this proposal

Vertica

Vertica follows the IEEE definition of NaNs (IEEE 754). The SQL standards do not specify how floating point works in detail. [...]

Rules

  • -0 == +0
  • 1/0 = Infinity
  • 0/0 == NaN
  • NaN != anything (even NaN)

[...]

Sort Order (Ascending)

  • NaN
  • -Inf
  • numbers
  • +Inf
  • NULL

https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/DataTypes/Numeric/DOUBLEPRECISIONFLOAT.htm

=> Does NOT follow this proposal

SQL Standard

The standard is not openly available, and scanning through the drafts I've found does not reveal more than "implementation defined". I trust Vertica when they say:

The SQL standards do not specify how floating point works in detail.

https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/DataTypes/Numeric/DOUBLEPRECISIONFLOAT.htm

=> Undefined

Rust Std

Normally floats are only partially ordered, but there is some proposed nightly API that orders them:

The values are ordered in the following order:

  • Negative quiet NaN
  • Negative signaling NaN
  • Negative infinity
  • Negative numbers
  • Negative subnormal numbers
  • Negative zero
  • Positive zero
  • Positive subnormal numbers
  • Positive numbers
  • Positive infinity
  • Positive signaling NaN
  • Positive quiet NaN

https://doc.rust-lang.org/nightly/std/primitive.f64.html#method.total_cmp

=> Does NOT follow this proposal
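
For illustration, a minimal demo of that API (`f64::total_cmp`; nightly at the time of this discussion, stabilized since):

```rust
fn main() {
    let mut v = [f64::NAN, 1.0, f64::NEG_INFINITY, f64::INFINITY, -0.0];
    v.sort_by(|a, b| a.total_cmp(b));
    // A (positive, quiet) NaN sorts after +inf under total_cmp:
    // prints [-inf, -0.0, 1.0, inf, NaN]
    println!("{:?}", v);
}
```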

Rust ordered-float

NaN is sorted as greater than all other values and equal to itself, in contradiction with the IEEE standard.

https://docs.rs/ordered-float/2.2.0/ordered_float/struct.OrderedFloat.html

=> Follows this proposal

IEEE totalOrder

IEEE apparently also defines some total order. It's a bit more complicated and puts NaNs on both ends, based on the sign bit, the signaling/quiet bit, and the payload.

=> Does NOT follow this proposal

@alamb (Contributor) commented May 6, 2021

Some other potentially related prior work: apache/arrow#9416, where @ritchie46 was trying to sort out how to handle NaNs

@alamb (Contributor) commented May 6, 2021

@jorgecarleitao (Member)

IMO we should use the total order defined in arrow according to IEEE 754 totalOrder, which AFAIK is also used to compute the min and max of an array.

@crepererum (Contributor, Author)

For what it's worth: with IEEE semantics, if a database built on top of arrow / datafusion uses a dedicated NaN representation, it could achieve whatever sorting it wants by picking the "right" NaN (pick a positive one and you get the PostgreSQL behavior).
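
A small sketch of what "picking the right NaN" means at the bit level (standard IEEE 754 quiet-NaN bit patterns):

```rust
fn main() {
    // Under IEEE totalOrder, the sign bit decides which end a NaN sorts to.
    let positive_qnan = f64::from_bits(0x7ff8_0000_0000_0000); // sorts after +inf
    let negative_qnan = f64::from_bits(0xfff8_0000_0000_0000); // sorts before -inf
    assert!(positive_qnan.is_nan() && negative_qnan.is_nan());
    assert_eq!(positive_qnan.total_cmp(&negative_qnan), std::cmp::Ordering::Greater);
}
```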

@alamb (Contributor) commented May 6, 2021

I think keeping "NaN" as the min/max value in the stats is unlikely to be useful for query processing, much like storing NULLs as a min/max value is not super helpful. I would personally suggest keeping them out of the statistics and adding additional metadata (like "contains NaNs" or whatever) if that is important information to retain.

If we really want to add NaN to the stats, I think it would help to articulate an actual use case where having a NaN there would be useful.

@crepererum (Contributor, Author)

> If we really want to add NaN to the stats, I think it would help to articulate an actual use case where having a NaN there would be useful.

The following come to my mind:

  • PostgreSQL-style SQL (which IIRC DataFusion follows) allows X == NaN to be queried. For that, "contains NaN" is important information.
  • PostgreSQL-style SQL (again for DataFusion) sorts NaNs at the end of the spectrum (i.e., after +inf), so at least "contains NaN" would be useful there too.

However, "contains NaN" only makes sense for Parquet Float and Double and the current parquet standard stat do NOT provide us any way to express this. And since for most query cases the point can be made that NaN shall be included into the total order somewhere, I would rather put that info into min/max instead of creating some convoluted edge case, i.e.

Query processors rely on min/max to figure out the full value range. However for Float and Double (and only for these) there is also this special flag that does not exist for any other data type that tells you about the existence of that one value class.

@jorgecarleitao (Member)

Thinking about this, it seems to be an issue present in all implementations that use the parquet format. Maybe we should raise this on the parquet mailing list? The question seems to be what implementations expect from min/max for float types, and whether they themselves use NaNs, infinities, etc., or not.

Maybe this motivates a clarification on the parquet specification itself?

@alamb (Contributor) commented May 7, 2021

I am coming at this from a query-processing point of view: using statistics to rule out entire row groups (or equivalents). Having a clearly defined sort order for floats is important for sorting, but for statistics I feel like including NaN values in a column effectively "poisons" their effective use.

I'll copy / paste my example from https://github.com/influxdata/influxdb_iox/pull/1448#discussion_r628230059. In that case, your data was like:

f
---
1.1
2.1
... (1 Billion other values between 1.1 and 9.9)
NaN
9.9

If you have a predicate like f > 10.0 (which does not evaluate to true for the one NaN row), the query engine will have to scan 1 billion extra rows due to the presence of a single NaN value.
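
In code, the pruning decision looks roughly like this (a sketch; `contains_nan` stands in for the kind of extra metadata discussed above and is not an existing Parquet field):

```rust
/// Can we skip a row group for the predicate `f > threshold`, under
/// engine semantics where NaN sorts greater than all other values?
fn can_skip_gt(max: f64, contains_nan: bool, threshold: f64) -> bool {
    // A NaN row would satisfy `f > threshold` under these semantics,
    // so skipping is only safe when the group is known to be NaN-free.
    !contains_nan && max <= threshold
}
```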

Queries that specifically look for NaN are, I think, much less important to optimize.

I like the idea of getting some consensus on what to do with Nans and statistics on the parquet mailing list

@alamb (Contributor) commented May 7, 2021

I tried to see what Postgres keeps for statistics, and it turns out it keeps histograms, not just min/max:

alamb=# select * from example;
  a  
-----
   1
   2
   3
 NaN
(4 rows)

alamb=# select * from pg_stats where tablename = 'example';
 schemaname | tablename | attname | inherited | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation | most_common_elems | most_common_elem_freqs | elem_count_histogram 
------------+-----------+---------+-----------+-----------+-----------+------------+------------------+-------------------+------------------+-------------+-------------------+------------------------+----------------------
 public     | example   | a       | f         |         0 |         8 |         -1 |                  |                   | {1,2,3,NaN}      |           1 |                   |                        | 
(1 row)

@emkornfield (Contributor)

I believe there is an open issue for Parquet to define consistent ordering here, I'll see if I can dig it up tonight.

@nevi-me (Contributor) commented May 11, 2021

@emkornfield is it this one?

@crepererum (Contributor, Author)

I'm a bit low on resources this week. I can certainly write down the options (which are more than the proposal above) and how they link to use cases and prior art next week, and post it to the ticket.

@emkornfield (Contributor)

Yes, sorry, PARQUET-1222 is the one and I see it was already mentioned.

JFinis added a commit to JFinis/parquet-format that referenced this issue Mar 22, 2023
This commit proposes an improvement for handling of NaN values in FLOAT
and DOUBLE type columns. The goal is to allow reading engines,
regardless of how they order NaN w.r.t. other values, to efficiently use
statistics for scan pruning while NaN values are present, which
currently is impossible in most cases. This is to be accomplished in a
fully backward compatible manner, so that existing readers and writers
do not need to be updated immediately but can migrate over time to make
use of the improved semantics.

There was already [work on improving the handling of float and double
columns](apache#185) which laid
good groundwork for backward-compatible improvements, but it wasn't
sufficient to fix all the problems with NaN values, which are laid out
hereinafter.

Problem Description
===================

Currently, the way NaN values are to be handled in statistics inhibits
most scan pruning once NaN values are present in DOUBLE or FLOAT columns.
Concretely, the following problems exist:

Statistics don't tell whether NaNs are present
----------------------------------------------

As NaN values are not to be incorporated in min/max bounds, a reader
cannot know whether NaN values are present. This might not seem too
problematic, as most queries will not filter for NaNs. However,
NaN is ordered in most database systems. For example,
Postgres, DB2, and Oracle treat NaN as greater than any other value,
while MSSQL and MySQL treat it as less than any other value. An
overview of what different systems do can be found
[here](apache/arrow-rs#264 (comment)).
The gist is that systems differ in their NaN semantics, and most of
them do order NaNs, either less than or greater than all other values.

For example, if the semantics of the reading query engine mandate that
NaN is to be treated greater than all other values, the predicate
`x > 1.0` *should* include NaN values. If a page has `max = 0.0` now, the
engine would *not* be able to skip the page, as the page might contain
NaNs which would need to be included in the query result.

Likewise, the predicate `x < 1.0` should include NaN if NaN is treated
to be less than all other values by the reading engine. Again, a page
with `min = 2.0` couldn't be skipped in this case by the reader.

Thus, even if a user doesn't query for NaN explicitly, they might use
other predicates that need to filter or retain NaNs in the semantics of
the reading engine, so the fact that we currently can't know whether a
page or row group contains NaN is a bigger problem than it might seem at
first sight.

Currently, any predicate that needs to retain NaNs cannot use min and
max bounds in Parquet and therefore cannot be used for scan pruning at
all. It would be nice if Parquet instead enabled scan pruning in these
cases, regardless of whether the reader and writer agree on whether NaN
is smaller than, greater than, or incomparable to all other values.

Note that the problem exists especially if the Parquet file *doesn't*
include any NaNs, so this is not only a problem in the case where NaNs
are present; it is a problem for the far more common case of NaNs not
being present.

Handling NaNs in a ColumnIndex
------------------------------

There is currently no well-defined way to write a spec-conforming
ColumnIndex once a page has only NaN (and possibly null) values.
NaN values should not be
included in min/max bounds, but if a page contains only NaN values, then
there is no other value to put into the min/max bounds. However, bounds
in a ColumnIndex are non-optional, so we *have to* put something in here.
The spec does not describe what engines should do in this case.
Parquet-mr takes the safe route and does not write a column index once
NaNs are present. But this is a huge pessimization, as a single page
containing NaNs will prevent the emission of a column index for that
column chunk, so even pages in that chunk that don't contain NaNs will
not be indexed.

It would be nice if there was a defined way of writing the `ColumnIndex`
when NaNs (and especially only-NaN pages) are present.

Handling only-NaN pages & column chunks
---------------------------------------

Note: Hereinafter, whenever the term *only-NaN* is used, it refers to
a page or column chunk whose only non-null values are NaNs. E.g., an
only-NaN page is allowed to have a mixture of null values and NaNs or
only NaNs, but no non-NaN non-null values.

The `Statistics` objects stored in page headers and in the file footer
have a similar, albeit smaller problem: `min_value` and `max_value` are
optional here, so it is easier to not include NaNs in the min/max in
case of an only-NaN page or column chunk: Simply omit these optional
fields. However, this brings a semantic ambiguity with it, as it is now
unclear whether the min/max value wasn't written because there were only
NaNs, or simply because the writing engine decided to omit them for
whatever other reason, which is allowed by the spec as the field is
optional.

Consequently, a reader cannot know whether missing `min_value` and
`max_value` means "only NaNs, you can skip this page if you are looking
for only non-NaN values" or "no stats written, you have to read this
page as it is undefined what values it contains".

It would be nice if we could handle NaNs in a way that would allow scan
pruning for these only-NaN pages.

Proposed solution
=================

Adding NaN counts
-----------------

The proposed solution for handling NaNs in statistics is akin to what
[Apache Iceberg](https://iceberg.apache.org/spec/) does: add an
*optional* `nan_count` field to `Statistics` and an *optional*
`nan_counts` list to `ColumnIndex`. This way, all places where
statistics are being retained can specify whether NaN values are
present. This already solves the first problem, as now queries wanting
to retain NaNs can check whether the count is > 0 to see whether a page
or column chunk contains NaNs.
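
A sketch of what the additions amount to, rendered here as Rust structs for readability (the actual change would be optional Thrift fields in parquet.thrift; only the fields relevant here are shown):

```rust
/// Statistics of a page or column chunk (other fields elided).
struct Statistics {
    min_value: Option<Vec<u8>>,
    max_value: Option<Vec<u8>>,
    null_count: Option<i64>,
    nan_count: Option<i64>, // proposed: NaN count, FLOAT/DOUBLE only
}

/// Per-column-chunk page index (other fields elided).
struct ColumnIndex {
    null_pages: Vec<bool>,
    min_values: Vec<Vec<u8>>,
    max_values: Vec<Vec<u8>>,
    nan_counts: Option<Vec<i64>>, // proposed: one NaN count per page
}
```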

Handling NaN-only pages & column chunks
---------------------------------------

Adding `nan_count`/`nan_counts` fields does not yet solve the problem of
only-NaN pages. But since we have a new optional field in both the
`Statistics` object and the `ColumnIndex` object, we can tie a stricter
semantics to the existence of this field. I.e., we can mandate that writers
who write this optional field have to treat NaNs in a specific way.

We basically have two options for treating only-NaN pages or column
chunks:
* Require the writer to write NaN as min/max in this case.
* Require the writer to write nothing, i.e.,
    * omit the `min_value` / `max_value` in `Statistics`
    * write byte[0] in the min_values/max_values entry of the
      `ColumnIndex`

I propose to go with the first option of writing NaN as min/max in case
of only-NaN pages & column chunks. A section depicting the
decision process for this follows below.

Thus, to solve the problem of only-NaN pages, the comments in the spec
are extended to mandate the following behavior:

* Once a writer writes the `nan_count`/`nan_counts` fields, they have
  to: a) never write NaN into min/max if there are non-NaN non-null
  values, and b) always write min=max=NaN if the only non-null values in
  a page are NaN.
* A reader observing that the `nan_count`/`nan_counts` fields were
  written can then rely on the fact that if min or max is NaN, both have
  to be NaN, and this means that the only non-null values are NaN (see
  the sketch below).
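
A minimal sketch of both sides of that contract (plain `f64` helpers; `values` is assumed to hold a page's non-null values):

```rust
/// Writer side: min/max for a page under the proposed rules.
fn page_min_max(values: &[f64]) -> Option<(f64, f64)> {
    let mut min = f64::INFINITY;
    let mut max = f64::NEG_INFINITY;
    let mut saw_non_nan = false;
    for &v in values {
        if !v.is_nan() {
            saw_non_nan = true;
            min = min.min(v);
            max = max.max(v);
        }
    }
    if saw_non_nan {
        Some((min, max))           // a) NaNs never widen the bounds
    } else if !values.is_empty() {
        Some((f64::NAN, f64::NAN)) // b) only-NaN page: min = max = NaN
    } else {
        None                       // all-null page
    }
}

/// Reader side: only meaningful if the writer emitted `nan_count`.
fn page_is_all_nan(min: f64, max: f64, nan_count_written: bool) -> bool {
    nan_count_written && min.is_nan() && max.is_nan()
}
```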

Should we write NaN or nothing for only-NaN pages & column chunks?
------------------------------------------------------------------

Here are the cons of each approach and how to mitigate them:

CONs for writing NaN in this case:
* Writing NaN breaks with the "don't write NaN into min and max bounds"
  rule.
    * However, one could argue that breaking the rule in this edge case
      is okay, since if NaN is the only value in a page, it doesn't
      matter where NaN sorts w.r.t. other values, as there are no other
      values. It is the only value in the page, so it is the min and max
      of the page.
    * Breaking this rule has no consequences on existing readers, as they
      should ignore NaN anyway, i.e., treat it as if it wasn't written,
      so legacy readers should treat both cases the same anyway.
* There might be existing writers that have written NaN for min & max
  for pages that do not only contain NaN but also other values, so a
  reader couldn't rely on min=max=NaN to mean that the only non-null
  value is NaN.
    * However, as specified, we can enforce a stricter semantics once
      the `nan_count` field is written: We could define that once a
      writing engine writes this field, it has to a) never write NaN
      into min/max if there are non-NaN non-null values and b) always
      write min=max=NaN if the only non-null values in a page are NaN.
      Then, readers could rely on the semantics once they observe that
      the `nan_count` field was written.
* NaNs take more space than not writing the field or writing byte[0] in
  the column index. This space overhead should usually be negligible.

In conclusion, there is no big con for writing NaN. It can be
implemented in a fully backward compatible way that still allows future
writers and readers to apply a more strict semantics.

CONs for writing nothing in this case:
* Writing byte[0] to a ColumnIndex might break older readers who expect
  the `min_values`/`max_values` field to be a value with correct length
  unless `null_pages` is true for the entry.
  * Although readers should be lenient enough to handle wrongly sized
    min/max values gracefully by ignoring them, we cannot be sure this is
    the case for every reader. Thus, we might cause legacy spec-conforming
    readers to reject the new Parquet file, which is bad.
* Omitting the `min_value` / `max_value` in `Statistics` is suboptimal, as
  it first looks as if the writing engine decided not to write them
  for whatever reason. In this case, the page could never be pruned by a
  reader, as the reader couldn't know which values are in there. Yes, we
  could define that a writer may not omit min/max if they write
  `nan_count` and must only omit them if a page has only NaNs, but this
  seems to be quite fishy semantically.

In conclusion, the cons for the NaN approach have mitigations and
can be handled in a backward compatible way, while the cons for writing
nothing might be non-backward-compatible. Therefore, I propose to write
NaN as min/max for only-NaN pages & column chunks.

Considerations
==============

Backward compatibility
----------------------

The suggested change is fully backward compatible both in the read and
write direction:

Older readers not supporting `nan_count`/`nan_counts` yet can stay as is.
As the fields are optional, readers not supporting them will simply
ignore them.

The spec already today mandates that if a reader sees `NaN` in min or
max fields they should ignore it. They can continue doing so.

No older reader will have regressed performance; any page that an older
reader would have skipped before can still be skipped.

Older writers can continue writing files without
`nan_count`/`nan_counts`. Even if an old writer sets
min=max=NaN for a page that does contain non-NaN values, readers
supporting this new semantics will not misinterpret these bounds, as the
writer will not have written `nan_count`/`nan_counts`, so the new,
stricter semantics does not apply when reading.

As `nan_count`/`nan_counts` are local to the
scopes where they apply (column index, page, column chunk), even
stitching together row groups from a writer that didn't write them and a
writer that does write them works. This would result in a file where
some pages / column indexes / column chunks would have them set while
others wouldn't.

Versioning
----------

This proposal definitely does not require a *major* version bump, as it
is fully backward compatible.

I do not fully understand the versioning policy of parquet, so I don't
know whether this change would require a minor version bump. One could
argue that it is not necessary as the mere existence of the
`nan_count`/`nan_counts` field would be the "feature flag" that would
indicate whether a writer supported this change or not. There wouldn't
be a version check necessary in a reader; they would just need to
check for the existence of the `nan_count`/`nan_counts` fields.

No Unnecessary Overhead
-----------------------

As thrift encodes missing optional fields with zero bytes in the compact
protocol, non-FLOAT/DOUBLE columns will not incur any overhead for the
new optional fields.

Design alternatives and why they were not chosen
================================================

Ordering NaN before or after all other values
---------------------------------------------

We could simply define NaN to be smaller or greater than all other
values and then allow NaN in the respective bound.

This however has many drawbacks:
* NaN is the widest bound possible, so adding NaNs to min and max isn't
  very useful, as it makes pruning for non-NaN values almost impossible
  in the respective direction.
* As mentioned, not all systems agree on whether NaN is larger or
  smaller than all other values. If we decided on one, systems that
  choose the other semantics couldn't filter effectively.
* This contradicts the current spec of not writing NaN to min/max, so it
  would make older readers no longer skip pages they could skip before.

Adding a new ColumnOrder
------------------------

We could add a new ColumnOrder that specifies NaN to be smaller or
greater than all other values, or even supports -NaN and +NaN ordering
them as smaller and larger than all values, respectively. For example,
Iceberg mandates the following sort order:

-NaN < -Infinity < -value < -0 < 0 < value < Infinity < NaN

Once we defined such an order, we could allow NaN (and potentially
-NaN) in bounds again.

This however has the following drawbacks:
* As with the previous alternative: NaN is the widest bound possible, so
  adding NaNs to min and max isn't very useful, as it makes pruning for
  non-NaN values almost impossible in the respective direction. If we
  even allow -NaN and +NaN, a page containing both would have no
  meaningful min and max and wouldn't allow any pruning at all.
* Most systems don't support -NaN, as mathematically speaking, it is
  nonsense. The only reason why it exists is that the physical
  representation of floats has a sign bit that also exists for NaN
  representations.
* How unhelpful NaNs are as min/max bounds is underscored by the fact
  that even though Iceberg has such a well-defined sort order, it still
  does what this proposal suggests and does *not* include -NaN/NaN in
  min/max bounds, instead tracking them through nan_counts.

All in all, any alternative putting NaN into min/max bounds (except for
only-NaN-pages) suffers from the problem that NaN bounds are too wide
and therefore not useful for pruning.

Writing a second `value_counts` list in the ColumnIndex
-------------------------------------------------------

The column index does allow `byte[0]` values already, in case a page
contains only nulls. We could enable the same for only-NaN pages by
storing not only the `nan_counts` but also the `value_counts` of a page.
Then, one could check whether a page in the column index contains only
NaNs by checking `nan_count + null_count == value_count`.
However, this would mean yet another list in the column
index, making the column index bigger and more expensive to deserialize.
And while the `nan_counts` list only exists for FLOAT/DOUBLE columns,
the `value_counts` list would exist for all columns and therefore take
up considerably more space.
Also, this would not be backward compatible, as an older reader wouldn't
know of the new lists, so it would see a `byte[0]` and would need to treat
it as invalid.

All in all, the extra list doesn't seem to add enough value for its cost
and reduced backward compatibility.

Do nothing
----------

As long as we don't do anything, columns containing NaNs will be almost
useless for scan pruning. The problems outlined above will persist, making double
columns almost unprunable for some predicates. That is not satisfactory.

Why wasn't sort order tackled?
==============================

Even with this improvement which fixes the semantics of NaN values in statistics,
NaN values are still a problem in some other places as there is still not a
defined order for them, so the `boundary_order` in a column index and
the `SortingColumn` would still have undefined placement for `NaN`.

This mainly wasn't tackled for two reasons:
* It is an orthogonal issue. This improvement is about enabling NaNs in statistics,
  so after this change all statistics can handle NaN in a well-defined way.
  Sort ordering of columns to leverage `boundary_order` or `SortingColumn` needs
  to be solved in a different way anyway, as the mere information about whether
  (only or some) NaNs are present isn't enough here; it also needs to be defined
  whether they come before or after all other values.
  * Even though we could fix both statistics and sort order by just defining NaN
    to be smaller or greater than other values, doing so for statistics is *not*
    a good idea, as having NaN in bounds makes too wide bounds that aren't helpful
    for filtering.
  * If sort ordering is fixed by a different commit one day, the design of this
    commit shouldn't have a (negative) influence on that future design, as NaN counts
    and not including NaNs into bounds is a good thing to do anyway.
* While fixing statistics with NaN counts is pretty uncontested w.r.t. design
  alternatives (see the respective section for a discussion why), the design to be
  chosen for sort orders isn't that clear:
  * We could define a new `ColumnOrder` with well defined NaN ordering. This
    would be the cleanest fix, but also a "very big gun", as this would be the first
    non-default column order in existence.
  * We could define a `nans_first` field that tells whether NaNs are to be sorted
    before or after other values, akin to the already existing field `nulls_first`.
    This would be a less invasive change, but it would be less clean IMHO, as
    there is a tool for defining column ordering--the `ColumnOrder`--and not using
    that tool but working around it feels hacky.

Thus, sort ordering of NaNs wasn't tackled in this commit. It can be tackled by a
follow-up change if necessary.
JFinis added a commit to JFinis/parquet-format that referenced this issue Mar 22, 2023
This commit proposes an improvement for handling of NaN values in FLOAT
and DOUBLE type columns. The goal is to allow reading engines,
regardless of how they order NaN w.r.t. other values, to efficiently use
statistics for scan pruning while NaN values are present, which
currently is impossible in most cases. This is to be accomplished in a
fully backward compatible manner, so that existing readers and writers
do not need to be updated immediatly but can migrate over time to make
use of the improved semantics.

There was already [work on improving the handling of float and double
columns](apache#185) which laid
good ground work for backward compatible improvements, but it wasn't
sufficient to fix all the problems with NaN values, which are laid out
hereinafter.

Problem Description
===================

Currently, the way NaN values are to be handled in statistics inhibits
most scan pruning once NaN values are present in DOUBLE or FLOAT
columns.  Concretely the following problems exist:

Statistics don't tell whether NaNs are present
----------------------------------------------

As NaN values are not to be incorporated in min/max bounds, a reader
cannot know whether NaN values are present. This might seem to be not
too problematic, as most queries will not filter for NaNs. However, NaN
is ordered in most database systems. For example, Postgres, DB2, and
Oracle treat NaN as greater than any other value, while MSSQL and MySQL
treat it as less than any other value. An overview over what different
systems are doing can be found
[here](apache/arrow-rs#264 (comment)).
The gist of it is that different systems with different semantics exist
w.r.t.  NaNs and most of the systems do order NaNs; either less than or
greater than all other values.

For example, if the semantics of the reading query engine mandate that
NaN is to be treated greater than all other values, the predicate `x >
1.0` *should* include NaN values. If a page has `max = 0.0` now, the
engine would *not* be able to skip the page, as the page might contain
NaNs which would need to be included in the query result.

Likewise, the predicate `x < 1.0` should include NaN if NaN is treated
to be less than all other values by the reading engine. Again, a page
with `min = 2.0` couldn't be skipped in this case by the reader.

Thus, even if a user doesn't query for NaN explicitly, they might use
other predictes that need to filter or retain NaNs in the semantics of
the reading engine, so the fact that we currently can't know whether a
page or row group contains NaN is a bigger problem than it might seem on
first sight.

Currently, any predicate that needs to retain NaNs cannot use min and
max bounds in Parquet and therefore cannot be used for scan pruning at
all.  Conversely, it would be nice if Parquet would enable scan pruning
in these cases, regardless of whether the reader and writer agree upon
whether NaN is smaller, greater, or incomparible to all other values.

Note that the problem exist especially if the Parquet file *doesn't*
include any NaNs, so this is not only a problem in the case where NaNs
are present; it is a problem for the way more common case of NaNs not
being present.

Handling NaNs in a ColumnIndex
------------------------------

There is currently no well-defined way to write a spec-conforming
ColumnIndex once a page has only NaN (and possibly null) values.  NaN
values should not be included in min/max bounds, but if a page contains
only NaN values, then there is no other value to put into the min/max
bounds. However, bounds in a ColumnIndex are non-optional, so we *have
to* put something in here.  The spec does not describe what engines
should do in this case.  Parquet-mr takes the safe route and does not
write a column index once NaNs are present. But this is a huge
pessimization, as a single page containing NaNs will prevent the
emission for a column index for that column chunk, so even pages in that
chunk that don't contain NaNs will not be indexed.

It would be nice if there was a defined way of writing the `ColumnIndex`
when NaNs (and especially only-NaN pages) are present.

Handling only-NaN pages & column chunks
---------------------------------------

Note: Hereinafter, whenever the term *only-NaN* is used, it refers to a
page column chunk, whose only non-null values are NaNs. E.g., an
only-NaN page is allowed to have a mixture of null values and NaNs or
only NaNs, but no non-NaN non-null values.

The `Statistics` objects stored in page headers and in the file footer
have a similar, albeit smaller problem: `min_value` and `max_value` are
optional here, so it is easier to not include NaNs in the min/max in
case of an only-NaN page or column chunk: Simply omit these optional
fields. However, this brings a semantic ambiguity with it, as it is now
unclear whether the min/max value wasn't written because there were only
NaNs, or simply because the writing engine did decide to omit them for
whatever other reason, which is allowed by the spec as the field is
optional.

Consequently, a reader cannot know whether missing `min_value` and
`max_value` means "only NaNs, you can skip this page if you are looking
for only non-NaN values" or "no stats written, you have to read this
page as it is undefined what values it contains".

It would be nice if we could handle NaNs in a way that would allow scan
pruning for these only-NaN pages.

Proposed solution
=================

Adding NaN counts
-----------------

The proposed solution for handling NaNs in statistics is akin to what
[Apache Iceberg](https://iceberg.apache.org/spec/) does: add an
*optional* `nan_count` field to `Statistics` and an *optional*
`nan_counts` list to `ColumnIndex`.  This way, all places where
statistics are being retained can specify whether NaN values are
present. This already solves the first problem, as now queries wanting
to retain NaNs can check whether the count is > 0 to see whether a page
or column chunk contains NaNs.

Handling NaN-only pages & column chunks
---------------------------------------

Adding `nan_count`/`nan_counts` fields does not solve the problem of
only-NaN pages, yet. But since we have a new optional field in both the
`Statistics` object and the `ColumnIndex` object, we can tie a stricter
semantics to the existence of this field. I.e., we can mandate that
writers who write this optional field have to treat NaNs in a specific
way.

We basically have two options for treating only-NaN pages or column
chunks:
* Order the writer to write NaN as min/max in this case.
* Order the writer to write nothing, i.e.,
    * omit the `min_value` / `max_value` in `Statistics`
    * write byte[0] in the min_values/max_values entry of the
      `ColumnIndex`

I propose to go with the first option of writing NaN as min/max in case
of only-Nan pages & column chunks. A section depicting the decision
process for this follows below.

Thus, to solve the problem of only-NaN pages, the comments in the spec
are extended to mandate the following behavior:

* Once a writer writes the `nan_count`/`nan_counts` fields, they have
  to: a) never write NaN into min/max if there are non-NaN non-Null
  values and b) always write min=max=NaN if the only non-null values in
  a page are NaN
* A reader observing that `nan_count`/`nan_counts` field was written can
  then rely on that if min or max are Nan, then both have to be NaN and
  this means that the only non-NULL values are NaN.

Should we write NaN or nothing for only-NaN pages & column chunks?
------------------------------------------------------------------

Here are the cons of each approach and how to mitigate them:

CONs for writing NaN in this case:
* Writing NaN breaks with the "don't write NaN into min and max bounds"
  rule.
    * However, one could argue that breaking the rule in this edge case
      is okay, as since if NaN is the only value in a page, then it
      doesn't matter where to sort NaN w.r.t. other values, as there are
      no other values. It is the only value in the page, so it is the
      min and max of the page
    * Breaking this rule has no consequences on existing readers, as
      they should ignore NaN anyway, i.e., treat it as if it wasn't
      written, so legacy readers should treat both cases the same
      anyway.
* There might be existing writers that have written NaN for min & max
  for pages that do not only contain NaN but also other values, so a
  reader couldn't rely on min=max=NaN to mean that the only non-null
  value is NaN
    * However, as specified, we can enforce a stricter semantics once
      the `nan_count` field is written: We could define that once a
      writing engine writes this field, it has to a) never write NaN
      into min/max if there are non-NaN non-Null values and b) always
      write min=max=NaN if the only non-null values in a page are NaN.
      Then, readers could rely on the semantics once they observe that
      the `nan_count` field was written.
* NaNs take more space than not writing the field or writing byte[0] in
  the column index. This space overhead should usually be negligible.

In conclusion, there is no big con for writing NaN. It can be
implemented in a fully backward compatible way that still allows future
writers and readers to apply a more strict semantics.

CONs for writing nothing in this case:
* Writing byte[0] to a ColumnIndex might break older readers who expect
  the `min_values`/`max_values` field to be a value with correct length
  unless `null_pages` is true for the entry.
  * Although readers should be lenient enough and handle wrongly sized
    min/max values gracefully by ignoring them we cannot be sure this is
    the case for any reader. Thus, we might legacy spec-conforming
    readers to reject the new Parquet file, which is bad.
* Omit the `min_value` / `max_value` in `Statistics` is suboptimal, as
  it first looks as if the writing engine has decided to not write them
  for whatever reason. In this case, the page could never be pruned by a
  reader, as the reader couldn't know which values are in there. Yes, we
  could define that a writer may not omit min/max if they write
  `null_count` and must only omit them if a page has only NaNs, but this
  seems to be quite fishy semancially.

In conclusion, the cons for the NaN approach have mitigations and can be
handled in a backward compatible way, while the cons for writing nothing
might be non-backward-compatible. Therefore, I propose to write NaN as
min/max for only-nan pages & column chunks.

Considerations
==============

Backward compatibility
----------------------

The suggested change is fully backward compatible both in the read and
write direction:

Older readers not supporting `nan_count`/`nan_counts` yet can stay as
is.  As the fields are optional, readers not supporting them will simply
ignore them.

The spec already today mandates that if a reader sees `NaN` in min or
max fields they should ignore it. They can continue doing so.

No older reader will have regressed performance; any page that an older
reader would have skipped before can still be skipped.

Older writers can continue writing files without
`nan_count`/`nan_counts` and `nans_first`. Even if an old writer writes
min=max=NaN for a page that does contain non-NaN values, readers
supporting this new semantics will not misinterpret these bounds, as the
writer will not write `nan_count`/`nan_counts`, so the new more strict
semantics does not apply when reading.
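
To illustrate the reader side, here is a minimal sketch of page pruning
for the predicate `x > threshold` in an engine that treats NaN as
greater than all other values (names and signature are hypothetical):

/// Returns true if the page can safely be skipped for `x > threshold`.
/// `nan_count == None` means the writer did not implement this proposal.
fn can_skip_page(max: Option<f64>, nan_count: Option<i64>, threshold: f64) -> bool {
    match (max, nan_count) {
        // New-style stats, no NaNs: the max bound alone decides.
        (Some(max), Some(0)) => !max.is_nan() && max <= threshold,
        // New-style stats with NaNs: NaN > threshold in this engine's
        // semantics, so the page may contain matches.
        (_, Some(_)) => false,
        // Legacy stats: NaNs may be present but unaccounted for, so
        // this predicate can never skip the page.
        (_, None) => false,
    }
}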

As `nan_count`/`nan_counts` are local to the scopes where they apply
(column index, page, column chunk), even stitching together row groups
from a writer that didn't write them and a writer that does write them
works. This would result in a file where some pages / column indexes /
column chunks would have them set while others wouldn't.

Versioning
----------

This proposal definitely does not require a *major* version bump, as it
is fully backward compatible.

I do not fully understand Parquet's versioning policy, so I don't
know whether this change would require a minor version bump. One could
argue that it is not necessary as the mere existence of the
`nan_count`/`nan_counts` field would be the "feature flag" that would
indicate whether a writer supported this change or not. There wouldn't
be a version check necessary in a reader; they would just need to check
for the existence of the `nan_count`/`nan_counts` fields.

No Unnecessary Overhead
-----------------------

As Thrift encodes missing optional fields with zero bytes in the compact
protocol, non-FLOAT/DOUBLE columns will not incur any overhead for the
new optional fields.

Design alternatives and why they were not chosen
================================================

Ordering NaN before or after all other values
---------------------------------------------

We could simply define NaN to be smaller or greater than all other
values and then allow NaN in the respective bound.

This however has many drawbacks:
* NaN is the widest bound possible, so adding NaNs to min and max isn't
  very useful, as it makes pruning for non-NaN values almost impossible
  in the respective direction.
* As mentioned, not all systems agree on whether NaN is larger or
  smaller than all other values. If we decided for one, systems that
  choose the other semantics couldn't filter effectively.
* This contradicts the current spec of not writing NaN to min/max, so it
  would make older readers no longer skip pages they could skip before.

Adding a new ColumnOrder
------------------------

We could add a new ColumnOrder that specifies NaN to be smaller or
greater than all other values, or even one that supports -NaN and
+NaN, ordering them as smaller and larger than all other values,
respectively. For example, Iceberg mandates the following sort order:

-NaN < -Infinity < -value < -0 < 0 < value < Infinity < NaN

Once we defined such an order, we could allow NaN (and potentially
-NaN) in bounds again.
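
As a side note, the order above is exactly the IEEE 754 totalOrder
predicate, which Rust already exposes as `f64::total_cmp`. A quick
illustration (the output comment assumes the platform's default NaN is
positive, which is the common case):

fn main() {
    // A quiet NaN with the sign bit set, i.e. "-NaN".
    let neg_nan = f64::from_bits(0xFFF8_0000_0000_0000);
    let mut v = [neg_nan, f64::INFINITY, 1.0, f64::NAN, f64::NEG_INFINITY, -0.0, 0.0];
    // total_cmp implements IEEE 754 totalOrder:
    // -NaN < -Infinity < ... < -0 < +0 < ... < +Infinity < NaN
    v.sort_by(|a, b| a.total_cmp(b));
    println!("{v:?}"); // [NaN, -inf, -0.0, 0.0, 1.0, inf, NaN]
}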

This however has the following drawbacks:
* As with the previous alternative: NaN is the widest bound possible, so
  adding NaNs to min and max isn't very useful, as it makes pruning for
  non-NaN values almost impossible in the respective direction. If we
  even allow -NaN and +NaN, a page containing both would have no
  meaningful min and max and wouldn't allow any pruning at all.
* Most systems don't support -NaN, as mathematically speaking, it is
  nonsense. The only reason it exists is that the physical
  representation of floats has a sign bit that is also present in NaN
  representations.
* How useless NaNs are as min/max bounds is illustrated by the fact
  that even though Iceberg has such a well-defined sort order, it still
  does what is proposed in this proposal and does *not* include
  -NaN/NaN in min/max bounds, instead tracking them through
  `nan_counts`.

All in all, any alternative putting NaN into min/max bounds (except for
only-NaN pages) suffers from the problem that NaN bounds are too wide
and therefore not useful for pruning.

Writing a second `value_counts` list in the ColumnIndex
-------------------------------------------------------

The column index already allows `byte[0]` values, in case a page
contains only nulls. We could enable the same for only-NaN pages by
storing not only the `nan_counts` but also the `value_counts` of a
page. Then, one could check whether a page in the column index contains
only NaNs by checking `nan_count + null_count = value_count`. However,
this would mean yet another list in the column index, making the column
index bigger and more expensive to deserialize.  And while the
`nan_counts` list only exists for FLOAT/DOUBLE columns, the
`value_counts` list would exist for all columns and therefore take up
considerably more space.  Also, this would not be backward compatible,
as an older reader wouldn't know of the new lists, so it would see a
`byte[0]` and would need to treat it as invalid.
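
Under this rejected alternative, a reader would detect an only-NaN page
roughly as follows (a sketch using the counts described above):

/// True if every value in the page is either null or NaN, i.e., the
/// page's min/max entries would be byte[0] in the column index.
fn page_is_only_nan(nan_count: i64, null_count: i64, value_count: i64) -> bool {
    nan_count > 0 && nan_count + null_count == value_count
}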

All in all, the extra list doesn't seem to add enough value for its cost
and reduced backward compatibility.

Do nothing
----------

As long as we don't do anything, columns containing NaNs will be almost
useless for scan pruning. The problems outlined above will persist,
making FLOAT and DOUBLE columns almost unprunable for some predicates.
That is not satisfactory.

Why wasn't sort order tackled?
==============================

Even with this improvement, which fixes the semantics of NaN values in
statistics, NaN values are still a problem in some other places, as
there is still no defined order for them, so the `boundary_order` in a
column index and the `SortingColumn` would still have undefined
placement for `NaN`.

This mainly wasn't tackled for two reasons:
* It is an orthogonal issue. This improvement is about enabling NaNs in
  statistics, so after this change all statistics can handle NaN in a
  well-defined way.  Sort ordering of columns to leverage
  `boundary_order` or `SortingColumn` needs to be solved in a different
  way anyway, as the mere information about whether (only or some) NaNs
  are present isn't enough here; it also needs to be defined whether
  they come before or after all other values.
  * Even though we could fix both statistics and sort order by just
    defining NaN to be smaller or greater than all other values, doing
    so for statistics is *not* a good idea, as having NaN in bounds
    makes the bounds too wide to be helpful for filtering.
  * If sort ordering is fixed by a different commit one day, the
    design of this commit shouldn't have a (negative) influence on
    that future design, as NaN counts and not including NaNs in bounds
    are a good thing to do anyway.
* While fixing statistics with NaN counts is pretty uncontested w.r.t.
  design alternatives (see the respective section for a discussion why),
  the design to be chosen for sort orders isn't that clear:
  * We could define a new `ColumnOrder` with well defined NaN ordering.
    This would be the cleanest fix, but also a "very big gun", as this
    would be the first non-default column order in existence.
  * We could define a `nans_first` field which tells whether NaNs are
    to be sorted before or after all other values, akin to the already
    existing field `nulls_first` (see the sketch after this list).
    This would be a less invasive change, but it would be less clean
    IMHO, as there is a tool for defining column ordering--the
    `ColumnOrder`--and not using that tool but working around it feels
    hacky.
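
For illustration, a comparator sketch for such a hypothetical
`nans_first` flag (neither the flag nor the function exist in the
spec):

use std::cmp::Ordering;

/// NaNs sort before all other values if `nans_first` is set, after
/// them otherwise; NaNs compare equal to each other.
fn cmp_floats(a: f64, b: f64, nans_first: bool) -> Ordering {
    match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => if nans_first { Ordering::Less } else { Ordering::Greater },
        (false, true) => if nans_first { Ordering::Greater } else { Ordering::Less },
        // Neither value is NaN, so partial_cmp cannot fail.
        (false, false) => a.partial_cmp(&b).unwrap(),
    }
}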

Thus, sort ordering of NaNs wasn't tackled in this commit. It can be
tackled by a follow-up change if necessary.
JFinis added a commit to JFinis/parquet-format that referenced this issue Mar 22, 2023
@Jefffrey
Contributor

Jefffrey commented Nov 7, 2023

I think this should be good to close now?

PARQUET-1222 (apache/parquet-format#185) was merged a while back, and it looks like the parquet crate already supports the newly defined behaviour, e.g.

fn test_double_statistics_nan_start() {
    let stats = statistics_roundtrip::<DoubleType>(&[f64::NAN, 1.0, 2.0]);
    assert!(stats.has_min_max_set());
    assert!(stats.is_min_max_backwards_compatible());
    if let Statistics::Double(stats) = stats {
        assert_eq!(stats.min(), &1.0);
        assert_eq!(stats.max(), &2.0);
    } else {
        panic!("expecting Statistics::Double");
    }
}

There's PARQUET-2249, which seems related, but that might be better off as a separate issue if/when it gets merged.

cc @tustvold @alamb

@alamb
Contributor

alamb commented Nov 7, 2023

I am not sure of the current status of this one.

@crepererum
Contributor Author

The format now clearly states:

NaNs should not be written to min or max statistics fields.

So parquet-rs is currently standards-compliant and we should NOT diverge from the standard.

An extra counter field is currently proposed here: apache/parquet-format#196

When that is merged I think we should implement it, but in the meantime I think we can close this ticket to make it clear that we don't plan to take any action (apart from following the upstream standard, of course).

@tustvold
Contributor

tustvold commented Nov 7, 2023

Agreed, I intend to write up the proposal on apache/parquet-format#196 to define a TypeDefinedOrder using the totalOrder predicate, as I think that would be the cleanest way to bring an end to this entire category of problem

@tustvold tustvold closed this as not planned Nov 7, 2023