Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Substrait insubquery #8363

Merged
merged 13 commits into from
Dec 20, 2023
Merged

Substrait insubquery #8363

merged 13 commits into from
Dec 20, 2023

Conversation

tgujar
Copy link
Contributor

@tgujar tgujar commented Nov 29, 2023

Which issue does this PR close?

Closes #8362.

Rationale for this change

Support for TPC-DS query set

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

No

@tgujar tgujar marked this pull request as ready for review December 12, 2023 22:36
let haystack_expr = &in_predicate.haystack;
if let Some(haystack_expr) = haystack_expr {
let haystack_expr =
from_substrait_rel(ctx, haystack_expr, extensions)
Copy link
Contributor Author

@tgujar tgujar Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to producer, I needed to add ctx here and the added argument for ctx in other functions is to support calling from_substrait_rel recursively

to_substrait_rex(ctx, expr, schema, col_ref_offset, extension_info)?;

let subquery_plan =
to_substrait_rel(subquery.subquery.as_ref(), ctx, extension_info)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added ctx here so that I could call to_substrait_rel recursively. All other functions in this file which have an added argument of ctx are to support this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason, on roundtrip there is an additional projection during TableScan which includes all column of the table, and I had to use assert_expected_plan here

TableScan: data2 projection=[a, b, c, d, e, f]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is ok -- maybe you could just add a comment to test with this information

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this projection is not pushdown-ed to table scan, here is the optimized logical plan I get:

Filter: data.f = Utf8("a") OR data.f = Utf8("b") OR data.f = Utf8("c") OR data.a IN (<subquery>)
  Subquery:
    Projection: data2.a
      Filter: data2.f IN ([Utf8("b"), Utf8("c"), Utf8("d")])
        TableScan: data2
  TableScan: data projection=[a, f], partial_filters=[data.f = Utf8("a") OR data.f = Utf8("b") OR data.f = Utf8("c") OR data.a IN (<subquery>)]
    Subquery:
      Projection: data2.a
        Filter: data2.f IN ([Utf8("b"), Utf8("c"), Utf8("d")])
          TableScan: data2

@alamb
Copy link
Contributor

alamb commented Dec 17, 2023

I am sorry @tallamjr -- this PR got lost somehow when I was reviewing other ones. I will try and review it tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @tgujar -- this looks great. I am sorry again for the delay in reviewing

This PR has accumulated some conflicts that need to be resolved, but then I think it will be ready to merge. I had some small suggestions, but I don't think they are needed

cc @waynexia

Comment on lines +1042 to +1045
Err(DataFusionError::Substrait(
"InPredicate Subquery type must have exactly one Needle expression"
.to_string(),
))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use https://docs.rs/datafusion/latest/datafusion/common/macro.substrait_err.html to make the code less verbose

Suggested change
Err(DataFusionError::Substrait(
"InPredicate Subquery type must have exactly one Needle expression"
.to_string(),
))
substrait_err!("InPredicate Subquery type must have exactly one Needle expression")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is ok -- maybe you could just add a comment to test with this information

Copy link
Member

@waynexia waynexia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Others look good to me 👍

Apologize I haven't noticed this PR, there are a few API changes that are in conflict 🥲

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this projection is not pushdown-ed to table scan, here is the optimized logical plan I get:

Filter: data.f = Utf8("a") OR data.f = Utf8("b") OR data.f = Utf8("c") OR data.a IN (<subquery>)
  Subquery:
    Projection: data2.a
      Filter: data2.f IN ([Utf8("b"), Utf8("c"), Utf8("d")])
        TableScan: data2
  TableScan: data projection=[a, f], partial_filters=[data.f = Utf8("a") OR data.f = Utf8("b") OR data.f = Utf8("c") OR data.a IN (<subquery>)]
    Subquery:
      Projection: data2.a
        Filter: data2.f IN ([Utf8("b"), Utf8("c"), Utf8("d")])
          TableScan: data2

@alamb alamb merged commit 0e9c189 into apache:main Dec 20, 2023
22 checks passed
@alamb
Copy link
Contributor

alamb commented Dec 20, 2023

Thanks again @tgujar and @waynexia

@tgujar
Copy link
Contributor Author

tgujar commented Dec 20, 2023

Thanks for merging my MR! Happy holidays!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Substrait support of InSubquery expressions
3 participants