
Pre-compute parquet stats in arrow writer #512

Closed · wants to merge 2 commits

Conversation
Conversation

@nevi-me (Contributor) commented Jun 30, 2021

Which issue does this PR close?

None, I'm opening this to bank some work that I did while investigating #385

Rationale for this change

The parquet writer computes row group stats record by record while writing. An alternative is to supply pre-computed stats to the writer, avoiding that per-record work.

This would allow us to also pass in the distinct count of records, as that seems to be desirable for IOx.

What changes are included in this PR?

Computes the stats using arrow::compute for some column types.
The PR is incomplete, as I want to solicit feedback first.

This is on top of #511, so should be reviewed after it.

Are there any user-facing changes?

No


There are no noticeable performance changes, per:

cargo bench -p parquet --bench arrow_writer
write_batch primitive/1024 values
                        time:   [1.5005 ms 1.5055 ms 1.5112 ms]
                        thrpt:  [66.781 MiB/s 67.035 MiB/s 67.261 MiB/s]
                 change:
                        time:   [-0.0027% +0.7862% +1.5464%] (p = 0.05 < 0.05)
                        thrpt:  [-1.5229% -0.7801% +0.0027%]
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
write_batch primitive/4096 values
                        time:   [5.2132 ms 5.2259 ms 5.2392 ms]
                        thrpt:  [75.460 MiB/s 75.653 MiB/s 75.838 MiB/s]
                 change:
                        time:   [-1.2864% -0.8463% -0.4253%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4272% +0.8535% +1.3031%]
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

- a non-null primitive should have def = 0; I was misinterpreting the spec
- a list increments the definition level by 1 if not null, or by 2 if null

This fixes these issues and updates the tests
@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 30, 2021
@nevi-me nevi-me changed the title Fix parquet definition levels Pre-compute parquet stats in arrow writer Jun 30, 2021
@nevi-me (Contributor, Author) commented Jun 30, 2021

CC @crepererum @alamb, relates to https://github.com/influxdata/influxdb_iox/issues/1712

Could you please check whether this would be useful? I've left the distinct count as None, as we'd need an arrow::compute kernel that does a distinct count.

@Dandandan @jorgecarleitao I'd expect such a kernel to already exist in DataFusion; would simply porting it to arrow::compute work?

@codecov-commenter commented Jun 30, 2021

Codecov Report

Merging #512 (7907cc4) into master (f1a831f) will increase coverage by 0.04%.
The diff coverage is 99.00%.

❗ Current head 7907cc4 differs from the pull request's most recent head 453122d. Consider uploading reports for the commit 453122d to get more accurate results.

@@            Coverage Diff             @@
##           master     #512      +/-   ##
==========================================
+ Coverage   82.74%   82.79%   +0.04%     
==========================================
  Files         165      165              
  Lines       45686    45749      +63     
==========================================
+ Hits        37805    37876      +71     
+ Misses       7881     7873       -8     
Impacted Files Coverage Δ
parquet/src/arrow/levels.rs 83.08% <94.44%> (+0.29%) ⬆️
parquet/src/arrow/arrow_writer.rs 98.22% <100.00%> (+0.18%) ⬆️
arrow/src/array/transform/boolean.rs 76.92% <0.00%> (-7.70%) ⬇️
parquet/src/column/writer.rs 93.06% <0.00%> (-0.24%) ⬇️
parquet/src/encodings/encoding.rs 94.85% <0.00%> (-0.20%) ⬇️
parquet/src/util/bit_packing.rs 99.96% <0.00%> (-0.01%) ⬇️
arrow/src/array/equal/variable_size.rs 100.00% <0.00%> (ø)
parquet/src/schema/types.rs 88.18% <0.00%> (+0.11%) ⬆️
parquet/src/encodings/rle.rs 92.96% <0.00%> (+0.24%) ⬆️
parquet/src/arrow/record_reader.rs 93.73% <0.00%> (+0.28%) ⬆️
... and 1 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@alamb (Contributor) commented Jun 30, 2021

> Could you please check whether this would be useful? I've left the distinct count as None, as we'd need an arrow::compute kernel that does a distinct count.

Thanks for this PR @nevi-me !

In IOx we often would already have the min, max, null_count (and sometimes distinct_count) values for data we are saving to parquet, so being able to supply them somehow to the writer would be great.

If using the arrow compute kernels to compute the statistics is faster than doing it row by row, that seems like a win too from my perspective.

> @Dandandan @jorgecarleitao I'd expect such a kernel to already exist in DataFusion; would simply porting it to arrow::compute work?

DataFusion computes distinct counts using the code in https://github.com/apache/arrow-datafusion/blob/9cf32cf2cda8472b87130142c4eee1126d4d9cbe/datafusion/src/physical_plan/distinct_expressions.rs#L45 -- it would need some finagling to turn into an arrow::compute kernel, I think, but it could be done

cc @crepererum

@Dandandan (Contributor) commented:

Distinct count AFAIK is often not included for parquet stats as calculating it is expensive.

The distinct count calculation in DataFusion is not really optimized yet (and quite heavy on memory), so I'm not sure it's super useful for Arrow to adopt.

Also, for DataFusion it would need to work across multiple arrays, whereas in Arrow it could perhaps operate on a single array. I think it would be great to have a kernel that DataFusion can use.

@alamb (Contributor) commented Jun 30, 2021

> Distinct count AFAIK is often not included for parquet stats as calculating it is expensive.

This is true. One thing I have thought about recently is a "best-effort distinct count": because the distinct count is often used for detecting low-cardinality columns, one could track the distinct count for as long as it consumes less than a fixed memory budget, and abandon it once that budget is exceeded.

This still costs CPU for sure, but it would cap the memory at some fixed size.

@crepererum (Contributor) commented:

For the distinct count, but also for the stats in general: what's kinda unfortunate is that in IOx we already have most of this information for the record batches before writing them to parquet. For the min/max values and null counts I think it's OK to recompute them, but for the distinct count it seems a bit of a waste.

So in some future PR (which I can contribute) I would like to add the ability to pass through pre-calculated stats.

Furthermore, the "pass through pre-computed stats" work might also be a good point to find some arrow-type-level representation of the stats, because if you currently want to consume the stats from parquet, you have to do the physical => logical scalar type conversion yourself.

@alamb
Copy link
Contributor

alamb commented Jul 2, 2021

There is also a version of statistics in DataFusion here: https://github.com/apache/arrow-datafusion/blob/16a3db64cb50a5f6e27a032c270d9de40dd2d5a5/datafusion/src/datasource/datasource.rs#L31-L50

If we are going to bring statistics into arrow-rs, perhaps we can unify the statistics handling all around (so that we don't have separate statistics structs for arrow and datafusion).

@nevi-me nevi-me closed this Oct 13, 2021
@jimexist (Member) commented:

Would the hyperloglog file be helpful in this case? @alamb

@jimexist (Member) commented:

This now takes 16 KiB, but if we tune the standard error up to, say, 2.4% instead of 0.81%, the size can come down to 2 KiB.

@crepererum (Contributor) commented:

IIRC parquet itself doesn't specify an HLL, only a bloom filter (which in theory can also do cardinality estimation, but shouldn't really be used for that). We could of course embed an HLL in a custom key-value metadata field.

