Pre-compute parquet stats in arrow writer #512
- non-null primitive should have def = 0, was misinterpreting the spec
- list increments 1 if not null, or 2 if null

This fixes these issues, and updates the tests.
CC @crepererum @alamb, relates to https://github.com/influxdata/influxdb_iox/issues/1712. Could you please check whether this would be useful? I've left the distinct count as
@Dandandan @jorgecarleitao I'd expect such to already exist in datafusion, so would simply porting it to
Codecov Report
@@            Coverage Diff             @@
##           master     #512      +/-   ##
==========================================
+ Coverage   82.74%   82.79%   +0.04%
==========================================
  Files         165      165
  Lines       45686    45749      +63
==========================================
+ Hits        37805    37876      +71
+ Misses       7881     7873       -8
Thanks for this PR @nevi-me! In IOx we often would already have the
If using the arrow compute kernels to compute the statistics is faster than doing it row by row, that seems like a win too from my perspective.
DataFusion computes distinct counts using the code in https://github.com/apache/arrow-datafusion/blob/9cf32cf2cda8472b87130142c4eee1126d4d9cbe/datafusion/src/physical_plan/distinct_expressions.rs#L45. It would need some finagling to make into an arrow::compute::kernel, I think, but it could be done. cc @crepererum
AFAIK, distinct count is often not included in parquet stats because calculating it is expensive. The distinct count calculation in DataFusion is not really optimized yet (and is quite high in memory usage), so I'm not sure whether it would be super useful for Arrow to use. Also, for DataFusion it would need to work over multiple arrays, whereas in Arrow it could maybe work on a single array? I think it would be great to have some kernel that can be used by DataFusion.
This is true. One thing I have thought of recently is doing "best effort distinct count" -- namely because the distinct count is often used for detecting low cardinality columns, one could keep track of distinct count provided it consumed less than a fixed size memory budget. When that was exceeded then the distinct count would be abandoned. This still costs CPU for sure, but it could cap the memory at some fixed size |
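A minimal sketch of the "best effort distinct count" idea, using only the Rust standard library. The `BestEffortDistinct` type and its `max_entries` parameter are hypothetical names for illustration, not part of arrow or DataFusion, and the budget is expressed as a number of set entries as a rough proxy for a byte budget.

```rust
use std::collections::HashSet;

/// Tracks distinct values until a fixed budget is exceeded, then gives up.
/// Hypothetical helper: `max_entries` stands in for a real memory budget.
pub struct BestEffortDistinct {
    /// `None` once the budget has been exceeded and tracking was abandoned.
    seen: Option<HashSet<i64>>,
    max_entries: usize,
}

impl BestEffortDistinct {
    pub fn new(max_entries: usize) -> Self {
        Self {
            seen: Some(HashSet::new()),
            max_entries,
        }
    }

    /// Feed one batch of nullable values; nulls are not counted.
    pub fn update(&mut self, values: &[Option<i64>]) {
        let mut exceeded = false;
        if let Some(set) = self.seen.as_mut() {
            for v in values.iter().flatten() {
                set.insert(*v);
                if set.len() > self.max_entries {
                    exceeded = true;
                    break;
                }
            }
        }
        if exceeded {
            // Drop the set so the memory is released and later batches are skipped.
            self.seen = None;
        }
    }

    /// `Some(count)` while within budget, `None` once tracking was abandoned.
    pub fn distinct_count(&self) -> Option<u64> {
        self.seen.as_ref().map(|s| s.len() as u64)
    }
}
```

Returning `None` after the budget is blown keeps the result honest: a low-cardinality column still gets an exact count, while a high-cardinality one simply reports no distinct count.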
For the distinct count, but also for the stats in general: what's kind of unfortunate is that in IOx we have most of the information available for the record batches prior to writing them to parquet. For the min/max values and null counts I think it's OK to recompute them, but for the distinct count it seems a bit of a waste. So I would like, through some future PR (which I can contribute), to have the ability to pass through pre-calculated stats. Furthermore, the "pass through pre-computed stats" might also be a good point to find some arrow-type-level representation of the stats, because if you currently want to consume the stats from parquet, you have to do the scalar physical=>logical type conversion yourself.
There is also a version of statistics in DataFusion here: https://github.com/apache/arrow-datafusion/blob/16a3db64cb50a5f6e27a032c270d9de40dd2d5a5/datafusion/src/datasource/datasource.rs#L31-L50 If we are going to bring the statistics into
Would the hyperloglog file be helpful in this case? @alamb
For the context we now have https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/hyperloglog/mod.rs
This now takes 16KiB, but if we relax the standard error to, say, 2.4% instead of 0.81%, the size can go down to 2KiB.
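The size/error trade-off can be checked with the usual HyperLogLog estimate of roughly 1.04/sqrt(m) relative standard error for m registers. The sketch below assumes one byte per register, which matches the 16KiB figure for 2^14 registers but is an assumption about the implementation; the function names are hypothetical.

```rust
/// Approximate relative standard error of a HyperLogLog sketch
/// with 2^precision registers (the usual 1.04 / sqrt(m) estimate).
pub fn hll_standard_error(precision: u32) -> f64 {
    let registers = (1u64 << precision) as f64;
    1.04 / registers.sqrt()
}

/// Sketch size in bytes, ASSUMING one byte per register.
pub fn hll_size_bytes(precision: u32) -> u64 {
    1u64 << precision
}
```

With `precision = 14` this gives 16384 bytes and about 0.81% error; with `precision = 11` it gives 2048 bytes and about 2.3% error, close to the 2.4% quoted above.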
IIRC parquet itself doesn't specify a HLL but only a bloom filter (which in theory can also do a cardinality estimation but shouldn't really be used for that). We could of course embed a HLL in a custom key-value metadata field.
The spec doesn't seem to mention allowing "approximate" estimations of distinct value counts: As @crepererum says, it does offer a specific BloomFilter implementation:
Which issue does this PR close?
None, I'm opening this to bank some work that I did while investigating #385
Rationale for this change
The parquet writer computes row group stats record-by-record when writing. There's an alternative of providing computed stats to avoid this process.
This would allow us to also pass in the distinct count of records, as that seems to be desirable for IOx.
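As a rough illustration of what pre-computed stats could look like, here is a sketch that derives min, max and null count for a whole column in one pass per statistic rather than record-by-record inside the writer. It uses plain slices instead of Arrow arrays to stay self-contained; `PrecomputedStats` and `compute_stats` are hypothetical names, not the API added by this PR.

```rust
/// Hypothetical container for stats handed to the writer up front.
#[derive(Debug, PartialEq)]
pub struct PrecomputedStats {
    pub min: Option<i64>,
    pub max: Option<i64>,
    pub null_count: u64,
    /// Left as `None` unless the caller already knows it.
    pub distinct_count: Option<u64>,
}

/// Compute column-level stats over a nullable i64 column.
pub fn compute_stats(column: &[Option<i64>]) -> PrecomputedStats {
    let null_count = column.iter().filter(|v| v.is_none()).count() as u64;
    // `flatten` skips nulls, so min/max are over valid values only.
    let min = column.iter().flatten().copied().min();
    let max = column.iter().flatten().copied().max();
    PrecomputedStats {
        min,
        max,
        null_count,
        distinct_count: None,
    }
}
```

A caller such as IOx, which already knows the distinct count, could fill in `distinct_count` before passing the struct on.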
What changes are included in this PR?
Computes the stats using arrow::compute for some column types. The PR is incomplete, as I want to solicit feedback first.
This is on top of #511, so should be reviewed after it.
Are there any user-facing changes?
No
There are no noticeable performance changes, per: