Pre-compute parquet stats in arrow writer #512

nevi-me · 2021-06-30T07:12:32Z

Which issue does this PR close?

None, I'm opening this to bank some work that I did while investigating #385

Rationale for this change

The parquet writer computes row group stats record-by-record when writing. There's an alternative of providing computed stats to avoid this process.

This would allow us to also pass in the distinct count of records, as that seems to be desirable for IOx.

What changes are included in this PR?

Computes the stats using arrow::compute for some column types.
The PR is incomplete, as I want to solicit feedback first.

This is on top of #511, so should be reviewed after it.

Are there any user-facing changes?

No

There are no noticeable performance changes, per:

cargo bench -p parquet --bench arrow_writer

write_batch primitive/1024 values
                        time:   [1.5005 ms 1.5055 ms 1.5112 ms]
                        thrpt:  [66.781 MiB/s 67.035 MiB/s 67.261 MiB/s]
                 change:
                        time:   [-0.0027% +0.7862% +1.5464%] (p = 0.05 < 0.05)
                        thrpt:  [-1.5229% -0.7801% +0.0027%]
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
write_batch primitive/4096 values
                        time:   [5.2132 ms 5.2259 ms 5.2392 ms]
                        thrpt:  [75.460 MiB/s 75.653 MiB/s 75.838 MiB/s]
                 change:
                        time:   [-1.2864% -0.8463% -0.4253%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4272% +0.8535% +1.3031%]
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

- non-null primitive should have def = 0, was misinterpreting the spec - list increments 1 if not null, or 2 if null This fixes these issues, and updates the tests

nevi-me · 2021-06-30T07:15:51Z

CC @crepererum @alamb, relates to https://github.com/influxdata/influxdb_iox/issues/1712

May you please check if this would be useful. I've left the distinct count as None as we'd need an arrow::compute kernel that does a distinct count.

@Dandandan @jorgecarleitao I'd expect such to already exist in datafusion, so would simply porting it to arrow::compute work?

codecov-commenter · 2021-06-30T07:30:12Z

Codecov Report

Merging #512 (7907cc4) into master (f1a831f) will increase coverage by 0.04%.
The diff coverage is 99.00%.

❗ Current head 7907cc4 differs from pull request most recent head 453122d. Consider uploading reports for the commit 453122d to get more accurate results

@@            Coverage Diff             @@
##           master     #512      +/-   ##
==========================================
+ Coverage   82.74%   82.79%   +0.04%     
==========================================
  Files         165      165              
  Lines       45686    45749      +63     
==========================================
+ Hits        37805    37876      +71     
+ Misses       7881     7873       -8

Impacted Files	Coverage Δ
parquet/src/arrow/levels.rs	`83.08% <94.44%> (+0.29%)`	⬆️
parquet/src/arrow/arrow_writer.rs	`98.22% <100.00%> (+0.18%)`	⬆️
arrow/src/array/transform/boolean.rs	`76.92% <0.00%> (-7.70%)`	⬇️
parquet/src/column/writer.rs	`93.06% <0.00%> (-0.24%)`	⬇️
parquet/src/encodings/encoding.rs	`94.85% <0.00%> (-0.20%)`	⬇️
parquet/src/util/bit_packing.rs	`99.96% <0.00%> (-0.01%)`	⬇️
arrow/src/array/equal/variable_size.rs	`100.00% <0.00%> (ø)`
parquet/src/schema/types.rs	`88.18% <0.00%> (+0.11%)`	⬆️
parquet/src/encodings/rle.rs	`92.96% <0.00%> (+0.24%)`	⬆️
parquet/src/arrow/record_reader.rs	`93.73% <0.00%> (+0.28%)`	⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f1a831f...453122d. Read the comment docs.

alamb · 2021-06-30T20:49:12Z

May you please check if this would be useful. I've left the distinct count as None as we'd need an arrow::compute kernel that does a distinct count.

Thanks for this PR @nevi-me !

In IOx we often would already have the min, max, null_count (and sometimes distinct_count) values for data we are saving to parquet, so being able to supply them somehow to the writer would be great.

If using the arrow compute kernels to compute the statistics is faster than doing it row by that seems like a win too from my perspective.

@Dandandan @jorgecarleitao I'd expect such to already exist in datafusion, so would simply porting it to arrow::compute work?

DataFusion computes distinct counts using the code in https://github.com/apache/arrow-datafusion/blob/9cf32cf2cda8472b87130142c4eee1126d4d9cbe/datafusion/src/physical_plan/distinct_expressions.rs#L45 -- it would need some finagling to make into an arrow::compute::kernel I think but could be done

cc @crepererum

Dandandan · 2021-06-30T20:59:10Z

Distinct count AFAIK is often not included for parquet stats as calculating it is expensive.

The distinct count calculation in DataFusion is not really optimized yet (and quite high in memory usage), so not sure whether that's super useful for Arrow to use.

Also for DataFusion it would need to be over multiple arrays whether maybe in arrow it can be for one array? I think it would be great to have some kernel that can be used by DataFusion.

alamb · 2021-06-30T21:21:20Z

Distinct count AFAIK is often not included for parquet stats as calculating it is expensive.

This is true. One thing I have thought of recently is doing "best effort distinct count" -- namely because the distinct count is often used for detecting low cardinality columns, one could keep track of distinct count provided it consumed less than a fixed size memory budget. When that was exceeded then the distinct count would be abandoned.

This still costs CPU for sure, but it could cap the memory at some fixed size

crepererum · 2021-07-02T07:20:31Z

For the distinct count, but also in general for the stats: what's kinda unfortunate is that in IOx, we have most of the information available for the record batches prior to writing them to parquet. For the min/max values and null counts I think it's OK to recompute them, but for the distinct count it seems a bit of a waste.

So I would like through some future PR (which I can contribute) have the ability to pass through pre-calculated stats.

Furthermore, the "pass through pre-computed stats" might also be a good point to find some arrow-type-level representation of the stats, because if you wanna currently want consume the stats from parquet, you have to do the scalar physical=>logical type conversion yourself.

alamb · 2021-07-02T10:46:20Z

There is also a version of statistics in DataFusion here: https://github.com/apache/arrow-datafusion/blob/16a3db64cb50a5f6e27a032c270d9de40dd2d5a5/datafusion/src/datasource/datasource.rs#L31-L50

If we are going to bring the statistics into arrow-rs perhaps we can unify the statistics handling all around (so that we don't have separate statistics structs for arrow and datafusion)

jimexist · 2021-10-13T15:59:50Z

Would the hyperloglog file be helpful in this case? @alamb

jimexist · 2021-10-13T16:01:29Z

For the context we now have https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/hyperloglog/mod.rs

jimexist · 2021-10-13T16:04:10Z

This now takes 16KiB but if we tune up standard error to be say 2.4% instead of 0.81% then the size can be down to 2KiB.

crepererum · 2021-10-13T16:09:27Z

IIRC parquet itself doesn't specify a HLL but only a bloom filter (which in theory can also do a cardinality estimation but shouldn't really be used for that). We could of course embed a HLL in a custom key-value metadata field.

alamb · 2021-10-13T17:00:42Z

The spec doesn't seem to mention allowing "approximate" estimations of distinct value counts:

https://github.com/apache/parquet-format/blob/43c891a4494f85e2fe0e56f4ef408bcc60e8da48/src/main/thrift/parquet.thrift#L216-L217

As @crepererum says, it does offer a specific BloomFilter implementation:
https://github.com/apache/parquet-format/blob/43c891a4494f85e2fe0e56f4ef408bcc60e8da48/src/main/thrift/parquet.thrift#L216-L217

Fix parquet definition levels

4fecc32

- non-null primitive should have def = 0, was misinterpreting the spec - list increments 1 if not null, or 2 if null This fixes these issues, and updates the tests

github-actions bot added the parquet Changes to the parquet crate label Jun 30, 2021

nevi-me changed the title ~~Fix parquet definition levels~~ Pre-compute parquet stats in arrow writer Jun 30, 2021

Compute write statistics in arrow::compute

453122d

nevi-me mentioned this pull request Jun 30, 2021

Fix parquet definition levels #511

Merged

nevi-me closed this Oct 13, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pre-compute parquet stats in arrow writer #512

Pre-compute parquet stats in arrow writer #512

nevi-me commented Jun 30, 2021

nevi-me commented Jun 30, 2021

codecov-commenter commented Jun 30, 2021 •

edited

Loading

alamb commented Jun 30, 2021

Dandandan commented Jun 30, 2021

alamb commented Jun 30, 2021

crepererum commented Jul 2, 2021

alamb commented Jul 2, 2021

jimexist commented Oct 13, 2021

jimexist commented Oct 13, 2021

jimexist commented Oct 13, 2021

crepererum commented Oct 13, 2021

alamb commented Oct 13, 2021

Pre-compute parquet stats in arrow writer #512

Pre-compute parquet stats in arrow writer #512

Conversation

nevi-me commented Jun 30, 2021

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

nevi-me commented Jun 30, 2021

codecov-commenter commented Jun 30, 2021 • edited Loading

Codecov Report

alamb commented Jun 30, 2021

Dandandan commented Jun 30, 2021

alamb commented Jun 30, 2021

crepererum commented Jul 2, 2021

alamb commented Jul 2, 2021

jimexist commented Oct 13, 2021

jimexist commented Oct 13, 2021

jimexist commented Oct 13, 2021

crepererum commented Oct 13, 2021

alamb commented Oct 13, 2021

codecov-commenter commented Jun 30, 2021 •

edited

Loading