Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Find the correct fields when using page filter on struct fields in parquet #8848

Merged
merged 10 commits into from
Jan 24, 2024

Conversation

manoj-inukolunu
Copy link
Contributor

Which issue does this PR close?

Closes #8456

Rationale for this change

What changes are included in this PR?

The issue is in this line c.column_descr().name() == column.name() in find_column_index datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs . When struct fields have the same name as the column names that condition is returning the struct field instead of the column field.

Are these changes tested?

Yes I have added a test which is taken from the reproduction code.

Are there any user-facing changes?

No

@@ -280,7 +280,7 @@ fn find_column_index(
.columns()
.iter()
.enumerate()
.find(|(_idx, c)| c.column_descr().name() == column.name())
.find(|(_idx, c)| c.column_descr().name() == column.name() && c.column_descr().path().parts().len() == 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alamb , I will check it out.

Copy link
Contributor Author

@manoj-inukolunu manoj-inukolunu Jan 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update this PR @alamb . Please check. Thank you.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @manoj-inukolunu thank you for this contribution

There appears to be a CI failure as well
https://github.com/apache/arrow-datafusion/actions/runs/7510755702/job/20489197785?pr=8848

@alamb alamb changed the title Dont consider struct fields for filtering in parquet Find the correct fields when using page filter on struct fields in parquet Jan 22, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @manoj-inukolunu -- this looks great

I have some style suggestions in this proposal which targets your branch: manoj-inukolunu#1

However they aren't required -- if you would prefer I think we could just update the docstrings for find_column_name instead

Thanks again 🙏

Comment on lines 174 to 178
let mut parquet_col = None;
if name.is_some() {
parquet_col =
parquet_column(parquet_schema, arrow_schema, name.unwrap().as_str());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a more rust ideomatic way of doing this pattern that avoids the unwrap and mut is something like

Suggested change
let mut parquet_col = None;
if name.is_some() {
parquet_col =
parquet_column(parquet_schema, arrow_schema, name.unwrap().as_str());
}
let parquet_col = name
.and_then(|name| {
parquet_column(parquet_schema, arrow_schema, name.as_str());
});

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aah thank you , i was searching for something like this. I will add this to my toolbox.

@@ -2136,4 +2144,65 @@ mod tests {
let execution_props = ExecutionProps::new();
create_physical_expr(expr, &df_schema, &execution_props).unwrap()
}

#[tokio::test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified that this test fails without the code in this PR 👍

thread 'datasource::physical_plan::parquet::tests::test_struct_filter_parquet' panicked at datafusion/core/src/datasource/physical_plan/parquet/mod.rs:2156:9:
assertion `left == right` failed
  left: 0
 right: 1
stack backtrace:

@alamb
Copy link
Contributor

alamb commented Jan 24, 2024

I took the liberty of merging this branch with main and running cargo fmt which will hopefully get the CI to pass

@alamb alamb merged commit 7ad929a into apache:main Jan 24, 2024
22 checks passed
@alamb
Copy link
Contributor

alamb commented Jan 24, 2024

Thanks again @manoj-inukolunu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

query result empty when a struct field name and a regular field name is same
2 participants