Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify windows builtin functions return type #8920

Merged
merged 2 commits into from
Jan 22, 2024
Merged

Conversation

comphead
Copy link
Contributor

@comphead comphead commented Jan 20, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

Before the PR the Datafusion derives the output datatypes for builtin functions twice.

Initially set in BuiltInWindowFunction::return_type the output datatype redefined in other places:

  • in built in function implementations, like row_number, rank, etc
  • on physical planner phase like lag, lead

This PR is to unify output types to be defined in 1 single place and be transferred through the schema.

Another good benefit of it is other systems that embed DataFusion as an engine can now use its datatypes by constructing expected schema.

What changes are included in this PR?

Datatype definition unified, covered with new tests, some other minor optimizations

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jan 20, 2024
@comphead comphead requested review from viirya and alamb January 20, 2024 00:27
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is to unify output types to be defined in 1 single place and be transferred through the schema.

Thank you @comphead -- this change looks good to me.

Before we merge it, however, I think we should give @mustafasrepo / @ozankabak a chance to comment as I think they are familiar with this code

datafusion/physical-expr/src/window/rank.rs Show resolved Hide resolved
@@ -3906,3 +3906,69 @@ ProjectionExec: expr=[sn@0 as sn, ts@1 as ts, currency@2 as currency, amount@3 a
--BoundedWindowAggExec: wdw=[SUM(table_with_pk.amount) ORDER BY [table_with_pk.sn ASC NULLS LAST] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: Ok(Field { name: "SUM(table_with_pk.amount) ORDER BY [table_with_pk.sn ASC NULLS LAST] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(NULL)), end_bound: CurrentRow }], mode=[Sorted]
----SortExec: expr=[sn@0 ASC NULLS LAST]
------MemoryExec: partitions=1, partition_sizes=[1]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests all seem to pass for me on main as well (without the changes in this PR). Is that expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, tests checked the functionality is not regressed.

@@ -719,14 +720,16 @@ impl DefaultPhysicalPlanner {
}

let logical_input_schema = input.schema();
let physical_input_schema = input_exec.schema();
// Extend the schema to include window expression fields as builtin window functions derives its datatype from incoming schema
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't input.schema() reflect all the columns that the input produces?

Or does the WindowAggExec create new columns "internally" by evaluating the window expressions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

input schema is schema from previous plan node(no window expressions). afaik windows expression column being added separately

https://github.com/apache/arrow-datafusion/blob/0116e2a9b4a3ed4491802e19195769b96b7a971a/datafusion/expr/src/logical_plan/plan.rs#L2045

@ozankabak
Copy link
Contributor

@mustafasrepo PTAL

@mustafasrepo
Copy link
Contributor

mustafasrepo commented Jan 22, 2024

I think this PR add two functionality. Which are

  • Removing hard coded output types from builtin window functions so that all handled using return_type API.
  • Removing physical_input_schema from create_window_expr function.

I think first change is better. However, second change is misleading. With the changes in this PR, given logical_input_schema is no longer actual input schema, but really window schema.
However, I think we can merge this PR as is, Then I will file a PR to retract 2nd part. We can discuss pros and cons of it in that PR.

@mustafasrepo mustafasrepo merged commit 2b218be into apache:main Jan 22, 2024
22 checks passed
@mustafasrepo
Copy link
Contributor

Filed the PR8945 for retracting second change

Comment on lines +724 to +732
let mut window_fields = logical_input_schema.fields().clone();
window_fields.extend_from_slice(&exprlist_to_fields(window_expr.iter(), input)?);
let extended_schema = &DFSchema::new_with_metadata(window_fields, HashMap::new())?;
let window_expr = window_expr
.iter()
.map(|e| {
create_window_expr(
e,
logical_input_schema,
&physical_input_schema,
extended_schema,
Copy link
Member

@viirya viirya Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to recompute this extended_schema which looks like the same of window operator's schema? You can simply get the window's schema.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #8955

@@ -1598,21 +1602,14 @@ pub fn create_window_expr_with_name(
pub fn create_window_expr(
e: &Expr,
logical_input_schema: &DFSchema,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is misleading, it is actually window's schema, not input schema now.

@@ -1572,7 +1575,8 @@ pub fn create_window_expr_with_name(
create_physical_sort_expr(e, logical_input_schema, execution_props)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this looks incorrect as you use window's schema as input schema. Although window's schema is input schema + window functions output, it is why this change still makes thing work. But it is actually misleading for readers and probably cause of potential bugs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this looks incorrect as you use window's schema as input schema. Although window's schema is input schema + window functions output, it is why this change still makes thing work. But it is actually misleading for readers and probably cause of potential bugs.

This may be what @mustafasrepo has improved in #8920 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this looks incorrect as you use window's schema as input schema. Although window's schema is input schema + window functions output, it is why this change still makes thing work. But it is actually misleading for readers and probably cause of potential bugs.

This may be what @mustafasrepo has improved in #8920 (comment)

Exactly, I re-introduced the invariant of using only input schema

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants