Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support parquet bloom filter pruning for decimal128 #8930

Merged
merged 4 commits into from
Jan 24, 2024

Conversation

Ted-Jiang
Copy link
Member

Which issue does this PR close?

related #8880.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Jan 21, 2024
@@ -89,16 +89,19 @@ impl RunOpt {
pushdown_filters: false,
reorder_filters: false,
enable_page_index: false,
enable_bloom_filter: false,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will support soon

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// Maps column name to the parquet bloom filter
column_sbbf: HashMap<String, Sbbf>,
/// Maps column name to the parquet bloom filter and parquet physical type
column_sbbf: HashMap<String, (Sbbf, Type)>,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the sbbf check input is parquet type, so we need know the read parquet physical type when construct check value 🤔 , is there any another way

@@ -80,15 +81,15 @@ async fn single_file() {
// request_method = 'GET'
.with_filter(col("request_method").eq(lit("GET")))
.with_pushdown_expected(PushdownExpected::Some)
.with_expected_rows(688);
.with_expected_rows(745);
Copy link
Member Author

@Ted-Jiang Ted-Jiang Jan 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb https://github.com/apache/arrow-datafusion/blob/72fb87e27bfaf62580f30fb6e85aed2a22fbe744/test-utils/src/data_gen.rs#L172 I think here every running test should get diff result 🤔 , but from the fact only change the generate method this would change. Is there something i miss?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not quite sure what you are asking here @Ted-Jiang . It seems like most of the tests in this file have been updated as well

@alamb
Copy link
Contributor

alamb commented Jan 21, 2024

Thanks @Ted-Jiang -- I hope to review this more carefully tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Ted-Jiang -- I think this PR is really nicely done.

I didn't quite make it through all the changes in datafusion/core/tests/parquet/filter_pushdown.rs -- would it be possible to avoid changes in the existing tests?

Either by avoiding changes to the data created or maybe by using a different data generator?

ScalarValue::Int16(Some(v)) => sbbf.check(v),
ScalarValue::Int8(Some(v)) => sbbf.check(v),
ScalarValue::Decimal128(Some(v), p, s) => match parquet_type {
Type::INT32 => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the value of the i128 the same as the value of this i32 (aka are we sure the top 12 bytes are always 0) as this code is ignores those bytes?

The same basic question applies to the Int64 case too (are we sure its top 8 bytes are always zero)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think add precision check avoid this situation

Comment on lines 229 to 245
//https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/Encodings.md?plain=1#L35-L42
// All physical type are little-endian
let b = (*v as i32).to_le_bytes();
let decimal = Decimal::Int32 {
value: b,
precision: *p as i32,
scale: *s as i32,
};
sbbf.check(&decimal)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be clearer to use the constructors instead: https://docs.rs/parquet/latest/parquet/data_type/enum.Decimal.html#method.from_i32

Suggested change
//https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/Encodings.md?plain=1#L35-L42
// All physical type are little-endian
let b = (*v as i32).to_le_bytes();
let decimal = Decimal::Int32 {
value: b,
precision: *p as i32,
scale: *s as i32,
};
sbbf.check(&decimal)
let decimal = Decimal::from_i32(* v as i32, *p as i32, *s as i32)'
sbbf.check(&decimal)

(the same applies to the other types here as well)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file apache/arrow-rs#5325 to look at the bytes_order issue

datafusion/core/tests/parquet/row_group_pruning.rs Outdated Show resolved Hide resolved
@@ -66,7 +66,9 @@ enum Scenario {
Int32Range,
Float64,
Decimal,
DecimalBloomFilter,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are only two scenarios here but three encodings (Int32, Int64 and Bytes) handled by the code. Is there some way to over all three encodings?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add test

@@ -80,15 +81,15 @@ async fn single_file() {
// request_method = 'GET'
.with_filter(col("request_method").eq(lit("GET")))
.with_pushdown_expected(PushdownExpected::Some)
.with_expected_rows(688);
.with_expected_rows(745);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not quite sure what you are asking here @Ted-Jiang . It seems like most of the tests in this file have been updated as well

test-utils/src/data_gen.rs Outdated Show resolved Hide resolved
datafusion/core/tests/parquet/filter_pushdown.rs Outdated Show resolved Hide resolved
@Ted-Jiang Ted-Jiang force-pushed the decimal branch 2 times, most recently from 41a61f0 to f52f5e1 Compare January 23, 2024 03:49
@Ted-Jiang
Copy link
Member Author

Ted-Jiang commented Jan 23, 2024

@alamb I remove the change of the datafusion/core/tests/parquet/filter_pushdown.rs file #8958, and add decimal precision check

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @Ted-Jiang . I had some comment / test suggestions but I think they could be done as a follow on PR too

@Ted-Jiang Ted-Jiang merged commit 19ca7d2 into apache:main Jan 24, 2024
22 checks passed
@alamb
Copy link
Contributor

alamb commented Jan 24, 2024

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants