Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: show statistics in explain verbose #8113

Merged
merged 7 commits into from
Nov 13, 2023

Conversation

NGA-TRAN
Copy link
Contributor

@NGA-TRAN NGA-TRAN commented Nov 9, 2023

Which issue does this PR close?

Closes ##8111

Rationale for this change

Always show statistics in explain verbose to get used by other DB such as IOx that do not want to use setting set datafusion.explain.show_statistics = true;

What changes are included in this PR?

Always show statistics in explain verbose

Are these changes tested?

Yes

Are there any user-facing changes?

Explain verbose now includes statistics

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Nov 9, 2023
@NGA-TRAN
Copy link
Contributor Author

NGA-TRAN commented Nov 9, 2023

@alamb : Instead of adding lines with stats, I just add stats to all physical plans which I think more reasonable. Let me know what you think

CC @berkaysynnada

@berkaysynnada
Copy link
Contributor

I think it makes sense, thank you.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NGA-TRAN and @berkaysynnada I think adding statistics to all explain plans is going to make verbose plans almost unreadable (they are already pretty hard to read), especially as we improve the statistics over time -- like imagine a plan with 100 columns after #8112 -- it will be almost impossible to read the verbose output

I would like to suggest rather than adding statistics to every plan in explain verbose, instead add only a single new line

Perhaps by adding a new entry in addition to

https://github.com/apache/arrow-datafusion/blob/fdf3f6c3304956cd56131d8783d7cb38a2242a9f/datafusion/core/src/physical_planner.rs#L1897-L1901

if verbose {
   // add initial physical plan and its statistics in verbose mode
   stringified_plans.push( 
     displayable(input.as_ref()) 
         .set_show_statistics(true) 
         .to_stringified(e.verbose, InitialPhysicalPlanWithStats), 
 ); 

@alamb
Copy link
Contributor

alamb commented Nov 10, 2023

Something like this perhaps: #8111 (comment)

@NGA-TRAN
Copy link
Contributor Author

Something like this perhaps: #8111 (comment)

Sounds good. I will add stats for initial and final physical plans only

// We do not want to add proto plan type for these two becasue they are just the same as
// InitialPhysicalPlan and FinalPhysicalPlan
datafusion_expr::PlanType::InitialPhysicalPlanWithStats
| datafusion_expr::PlanType::FinalPhysicalPlanWithStats => todo!(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb Cargo suggested me to do this. Do you want to go with this or create 2 new types in photo enum?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding two new types in the proto would be the most consistent with the existing code

However, as long as this doesn't panic I think it would be ok to merge -- for example, if it returned an internal error.

I don't really understand why there is special serialization for this to be honest 🤔 maybe we could simplify that too as a follow on PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb The todo! returns a panic of not yet implemented. Let me try to add new types to proto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have add 2 new types to proto instead

set datafusion.execution.collect_statistics = false;
# explain verbose with both collect & show statistics on
query TT
EXPLAIN VERBOSE SELECT * FROM alltypes_plain limit 10;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All plans have stats because show_stats is on. This works the same as before this PR


# explain verbose with collect on and & show statistics off: still has stats
query TT
EXPLAIN VERBOSE SELECT * FROM alltypes_plain limit 10;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 new lines added for this one because show_stats is off

@NGA-TRAN NGA-TRAN requested a review from alamb November 10, 2023 22:04
@NGA-TRAN
Copy link
Contributor Author

@alamb I have addressed your comments but I have a new question in the code

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NGA-TRAN -- this looks good to me except for the panic in proto serialization.

// We do not want to add proto plan type for these two becasue they are just the same as
// InitialPhysicalPlan and FinalPhysicalPlan
datafusion_expr::PlanType::InitialPhysicalPlanWithStats
| datafusion_expr::PlanType::FinalPhysicalPlanWithStats => todo!(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding two new types in the proto would be the most consistent with the existing code

However, as long as this doesn't panic I think it would be ok to merge -- for example, if it returned an internal error.

I don't really understand why there is special serialization for this to be honest 🤔 maybe we could simplify that too as a follow on PR

@@ -245,6 +245,7 @@ logical_plan after eliminate_projection SAME TEXT AS ABOVE
logical_plan after push_down_limit SAME TEXT AS ABOVE
logical_plan TableScan: simple_explain_test projection=[a, b, c]
initial_physical_plan CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/example.csv]]}, projection=[a, b, c], has_header=true
initial_physical_plan_with_stats CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/example.csv]]}, projection=[a, b, c], has_header=true, statistics=[Rows=Absent, Bytes=Absent]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@@ -1900,6 +1900,18 @@ impl DefaultPhysicalPlanner {
.to_stringified(e.verbose, InitialPhysicalPlan),
);

// Add plan with stats for verbose case even if show_statistics is false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment seems confusing as the next line checks for !config.show_statistics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made the comments clearer

@alamb
Copy link
Contributor

alamb commented Nov 13, 2023

Looks like the CI failures are due to a logical merge conflict with #8112 -- I'll push an update

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @NGA-TRAN !

@alamb alamb merged commit a38ac20 into apache:main Nov 13, 2023
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants