
[BUG] Parquet float/double statistic is wrong when float/double column contains NaN #13948

Closed
res-life opened this issue Aug 24, 2023 · 4 comments
Labels: bug (Something isn't working), cuIO (cuIO issue)

Comments

@res-life
Contributor

Describe the bug
Parquet float/double statistic is wrong when float/double column contains NaN.
For example, a double column contains 2 values [NaN, 1.0d].

The CPU writer (org.apache.parquet.format.Statistics) records min_value 1.0 and max_value NaN:

max_value = b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
min_value = b'\x00\x00\x00\x00\x00\x00\xf0?'

The GPU writer (org.apache.parquet.format.Statistics) records min_value 1.0 and max_value 1.0:

max_value = b'\x00\x00\x00\x00\x00\x00\xf0?'
min_value = b'\x00\x00\x00\x00\x00\x00\xf0?'
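The statistics bytes are little-endian IEEE-754 doubles, so the difference can be verified directly (a quick Python sanity check, not part of either writer):

```python
import math
import struct

# Statistics bytes from the two files above, decoded as little-endian doubles.
cpu_max = struct.unpack("<d", b"\x00\x00\x00\x00\x00\x00\xf8\x7f")[0]
gpu_max = struct.unpack("<d", b"\x00\x00\x00\x00\x00\x00\xf0\x3f")[0]

print(math.isnan(cpu_max))  # True -- the CPU writer stored NaN as max
print(gpu_max)              # 1.0  -- the GPU writer stored 1.0 as max
```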

Steps/Code to reproduce bug
CPU:

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;
import org.junit.Test;

import java.io.File;
import java.io.IOException;

public class TestDoubleStatistic {

  @Test
  public void testDouble() throws IOException {
    MessageType schema = MessageTypeParser.parseMessageType(
      "message schema {\n" +
        "    repeated double d;\n" +
        "}");

    File file = new File("/tmp/test-001.parquet");
    if (file.exists()) {
      file.delete();
    }
    Path fsPath = new Path(file.getAbsolutePath());
    SimpleGroupFactory factory = new SimpleGroupFactory(schema);

    ParquetWriter<Group> writer = ExampleParquetWriter.builder(fsPath)
      .withType(schema)
      .build();

    Group group = factory.newGroup()
      .append("d", Double.NaN)
      .append("d", 1.0d);
    writer.write(group);
    writer.close();

    System.out.println("ok");
  }
}

parquet-tools inspect --detail /tmp/test-001.parquet

max_value = b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
min_value = b'\x00\x00\x00\x00\x00\x00\xf0?'

GPU:

TEST_F(MyDebugTests, TestDoubleStatistic)
{
  auto nan = std::numeric_limits<double>::quiet_NaN();
  auto d = std::vector<double> {nan, 1.0};
  float64_col dc(d.begin(), d.end());
  cudf::table_view expected({dc});

  cudf::io::table_input_metadata expected_metadata(expected);
  expected_metadata.column_metadata[0].set_name("c1");

  auto filepath = "/tmp/TestDoubleStatistic.parquet";
  cudf::io::parquet_writer_options out_opts =
    cudf::io::parquet_writer_options::builder(cudf::io::sink_info{filepath}, expected)
      .metadata(expected_metadata);
  cudf::io::write_parquet(out_opts);
}

parquet-tools inspect --detail TestDoubleStatistic.parquet

max_value = b'\x00\x00\x00\x00\x00\x00\xf0?'
min_value = b'\x00\x00\x00\x00\x00\x00\xf0?'

Expected behavior
Make float/double stat consistent with CPU.

Environment overview (please complete the following information)

Environment details
cuDF: branch-23.10
parquet: apache-parquet-1.12.2

Additional context
parquet-tools link

Parquet convert from org.apache.parquet.format.Statistics to org.apache.parquet.column.statistics:

https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.2/parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java#L123-L126

        if (min.isNaN() || max.isNaN()) {
          stats.setMinMax(0.0, 0.0);
          ((Statistics<?>) stats).hasNonNullValue = false;
        }

If min or max is NaN, parquet-mr converts the org.apache.parquet.column.statistics min/max to 0.0/0.0 and marks them invalid:

hasNonNullValue = false

So the GPU should make sure min/max contain a NaN when the column does, so that readers applying this rule discard the statistics.
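For reference, the GPU's 1.0/1.0 result matches what a plain comparison-based reduction produces: every comparison with NaN is false, so NaN is silently dropped unless it happens to seed the fold (a minimal Python illustration, not the cuIO code):

```python
import math

# A naive fold keeps the current candidate whenever `v > best` is false.
# Because every comparison involving NaN is false, NaN never replaces a
# finite candidate -- whether it survives depends only on element order.
def naive_max(xs):
    best = xs[0]
    for v in xs[1:]:
        if v > best:
            best = v
    return best

print(naive_max([1.0, float("nan")]))               # 1.0 -- NaN is dropped
print(math.isnan(naive_max([float("nan"), 1.0])))   # True -- NaN seeds the fold
```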

@res-life added the bug, Needs Triage, and cuIO labels Aug 24, 2023
@etseidl
Contributor

etseidl commented Sep 4, 2023

Statistics behavior in the presence of NaN values is currently in flux. See the discussion in apache/parquet-format#196

From the current parquet thrift specification:

   *     When writing statistics the following rules should be followed:
   *     - NaNs should not be written to min or max statistics fields.
   *     - If the computed max value is zero (whether negative or positive),
   *       `+0.0` should be written into the max statistics field.
   *     - If the computed min value is zero (whether negative or positive),
   *       `-0.0` should be written into the min statistics field.
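The writing rules quoted above can be sketched as follows (a hypothetical Python helper for illustration; `spec_min_max` is a made-up name, not cuIO's or parquet-mr's implementation):

```python
import math

def spec_min_max(values):
    """Min/max per the parquet spec's writing rules: NaNs are never
    written, a zero max is normalized to +0.0, and a zero min to -0.0.
    Returns None when no non-NaN value exists."""
    non_nan = [v for v in values if not math.isnan(v)]
    if not non_nan:
        return None          # nothing valid to write
    lo, hi = min(non_nan), max(non_nan)
    if lo == 0.0:            # matches both +0.0 and -0.0
        lo = -0.0
    if hi == 0.0:
        hi = 0.0
    return lo, hi

print(spec_min_max([float("nan"), 1.0]))  # (1.0, 1.0)
```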

@res-life
Contributor Author

res-life commented Sep 13, 2023

@etseidl You are right, but that is the new-version behavior, and the new spec also adds an extra nan_count field to record the NaN count. Older versions of Parquet have no nan_count, so the problem is that the GPU follows neither the old behavior nor the new one. The CPU file in the reproduce steps was generated by the Parquet API, and I think the GPU should follow the same behavior.

Actually, Spark 3.3.0 on CPU returns a wrong result when reading the GPU-written file with spark.sql.parquet.aggregatePushdown enabled.

# enable agg push down
spark.conf.set("spark.sql.parquet.aggregatePushdown", "true") 

# use V2 datasource
spark.conf.set("spark.sql.sources.useV1SourceList", "") 

df = spark.read.parquet("/tmp/my-parquet")
df.createOrReplaceTempView("tab")

# agg push down reads max directly from the (wrong) statistic
spark.sql("select max(v) from tab").show()
+------+
|max(v)|
+------+
|   2.0|
+------+

spark.read.parquet("/tmp/my-parquet").show() # show all values
+---+
|  v|
+---+
|1.0|
|2.0|
|NaN|
+---+

# adding an additional filter disables the agg push down, so the correct result is returned
spark.sql("select max(v) from tab where v != 1.0").show()  
+------+
|max(v)|
+------+
|   NaN|
+------+

In Scala, min and max are both NaN for the values (1.0, 2.0, NaN):

scala> val array = Array(1.0d, 2.0d, Double.NaN)
array: Array[Double] = Array(1.0, 2.0, NaN)

scala> array.max
res2: Double = NaN

scala> array.min
res3: Double = NaN

The GPU generated file:
a.parquet.zip

parquet-cli meta a.parquet

Row group 0:  count: 3  14.00 B records  start: 4  total: 42 B
--------------------------------------------------------------------------------
   type      encodings count     avg size   nulls   min / max
v  DOUBLE    S RB_     3         14.00 B    0       "1.0" / "2.0"

@revans2

@etseidl
Contributor

etseidl commented Sep 13, 2023

@res-life I mentioned the new discussion of NaN handling to say things might change at some point in the future. But given the current parquet specification, here's what it has to say about floating point ordering:

   * (*) Because the sorting order is not specified properly for floating
   *     point values (relations vs. total ordering) the following
   *     compatibility rules should be applied when reading statistics:
   *     - If the min is a NaN, it should be ignored.
   *     - If the max is a NaN, it should be ignored.
   *     - If the min is +0, the row group may contain -0 values as well.
   *     - If the max is -0, the row group may contain +0 values as well.
   *     - When looking for NaN values, min and max should be ignored.
   * 
   *     When writing statistics the following rules should be followed:
   *     - NaNs should not be written to min or max statistics fields.
   *     - If the computed max value is zero (whether negative or positive),
   *       `+0.0` should be written into the max statistics field.
   *     - If the computed min value is zero (whether negative or positive),
   *       `-0.0` should be written into the min statistics field.

I'm no expert, but looking at issues such as https://issues.apache.org/jira/browse/PARQUET-1222 I'd contend that the current GPU behavior is not incorrect. The specification was ambiguous in the past, so different implementations had different (incompatible it seems) orderings and min/max behaviors. I believe the current GPU behavior of not writing NaN is consistent with the current spec. Being different from the parquet-mr implementation is not necessarily a bug.

@GregoryKimball
Contributor

Thank you all for the discussion. I'll close this for now.

@bdice removed the Needs Triage label Mar 4, 2024