
Make a faster way to check column existence in optimizer (not is_err()) #5309

Closed
alamb opened this issue Feb 16, 2023 · 22 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@alamb
Contributor

alamb commented Feb 16, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Related to #5157

There are many places in the code that use fallible functions on DFSchema to check if a column exists:
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.index_of
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.index_of_column_by_name
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.field_from_column

For example, there is code that looks like this (calling is_ok() or is_err() and totally discarding the error, along with its String):

input_schema.field_from_column(col).is_ok()

This is problematic because these functions return a DataFusionError that not only holds an allocated String but has often also gone through a lot of effort to construct a nice error message. You can see them appearing in the trace on #5157.

As part of making the optimizer faster (#5157), we need to avoid these string allocations.

Thus I propose:

  1. Add new functions for checking column existence that return a bool rather than a Result
  2. Replace the uses of is_ok() / is_err() with calls to the new functions

For example,

impl DFSchema {
  // existing function that returns Result
  pub fn field_from_column(&self, column: &Column) -> Result<&DFField> {...}

  // new function that returns bool  <---- Add this new function
  pub fn has_column(&self, column: &Column) -> bool {...}
}

And then replace code that has the pattern

input_schema.field_from_column(col).is_ok()

With

input_schema.has_column(col)
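As a sketch of the difference (using simplified stand-in types here, not the real DataFusion definitions), the bool-returning path never constructs the error message:

```rust
// Simplified stand-ins for the DataFusion types, for illustration only.
pub struct Column {
    pub name: String,
}

pub struct DFField {
    pub name: String,
}

pub struct DFSchema {
    pub fields: Vec<DFField>,
}

impl DFSchema {
    // Existing style: a fallible lookup that allocates an error String on a miss.
    pub fn field_from_column(&self, column: &Column) -> Result<&DFField, String> {
        self.fields
            .iter()
            .find(|f| f.name == column.name)
            // This allocation is exactly what the issue wants to avoid.
            .ok_or_else(|| format!("No field named '{}'", column.name))
    }

    // Proposed style: an infallible check with no allocation on the miss path.
    pub fn has_column(&self, column: &Column) -> bool {
        self.fields.iter().any(|f| f.name == column.name)
    }
}
```

With this shape, `schema.has_column(col)` answers the existence question without building an error message just to throw it away.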

Describe the solution you'd like
Ideally someone would do this transition one function on DFSchema at a time (not one giant PR please 🙏 )

Describe alternatives you've considered
There are more involved proposals for larger changes to DFSchema, but simply avoiding these allocations might help a lot.

Additional context
I think this is a good first exercise, as the desire is well spelled out and it is a software engineering exercise rather than one that requires deep DataFusion expertise.

@suxiaogang223
Contributor

I'm happy to do this 🌝

@ygf11
Contributor

ygf11 commented Feb 17, 2023

Seems it is related to PR #5287.

@suxiaogang223
Contributor

> Seems it is related to PR #5287.

Yes, I think the has_column method should also distinguish between FieldNotFound and ambiguous-reference errors.
Maybe the new method should be

pub fn has_column(&self, column: &Column) -> Result<bool> {...}

@alamb
Contributor Author

alamb commented Feb 17, 2023

The key for performance is not to return a DataFusionError with an allocated string.

@suxiaogang223
Contributor

> The key for performance is not to return a DataFusionError with an allocated string.

Maybe we can use an assert to assume that the ambiguous-reference error should not happen in has_column?

@alamb
Contributor Author

alamb commented Feb 17, 2023

> Maybe we can use an assert to assume that the ambiguous-reference error should not happen in has_column?

How about returning an enum like

enum FoundColumn {
  Found,
  NotFound,
  Ambiguous
}


pub fn has_column(&self, column: &Column) -> FoundColumn {...}

?
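For illustration, a caller could branch on such an enum without any allocation on either path. This is a sketch with a simplified stand-in schema (`ExampleSchema` is hypothetical, not a DataFusion type):

```rust
// A three-way lookup result, as proposed above; no String is allocated
// for any of the variants.
#[derive(Debug, PartialEq)]
pub enum FoundColumn {
    Found,
    NotFound,
    Ambiguous,
}

// Hypothetical simplified schema: a list of (qualifier, column name) pairs.
pub struct ExampleSchema {
    pub fields: Vec<(Option<String>, String)>,
}

impl ExampleSchema {
    // An unqualified lookup: ambiguous when the same name appears under
    // more than one qualifier.
    pub fn has_column(&self, name: &str) -> FoundColumn {
        match self.fields.iter().filter(|(_, n)| n == name).count() {
            0 => FoundColumn::NotFound,
            1 => FoundColumn::Found,
            _ => FoundColumn::Ambiguous,
        }
    }
}
```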

@suxiaogang223
Contributor

suxiaogang223 commented Feb 17, 2023

I think returning an enum would be the same as returning a Result, because the caller still has to handle Ambiguous and return an Err.

Returning Result<bool> can also avoid the string allocation in field_from_name().is_err(). The code would look like this:

if schema.had_column(col)? {...}

Maybe the key question is: do we need to check for the ambiguous error every time has_column is called? We could instead check for ambiguity once, at the beginning.

I'm not sure if my thinking is correct; I need your advice @alamb 🤓
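A sketch of the "check ambiguity once" idea (`CheckedSchema` and its methods are hypothetical names, not the real API): duplicates are detected once at construction time, so the per-call check stays a plain bool:

```rust
use std::collections::HashSet;

// Hypothetical sketch: ambiguity is computed once when the schema is built,
// so has_column needs no per-call error handling.
pub struct CheckedSchema {
    names: Vec<String>,
    // Unqualified names that appear more than once.
    ambiguous: HashSet<String>,
}

impl CheckedSchema {
    pub fn new(names: Vec<String>) -> Self {
        let mut seen = HashSet::new();
        let mut ambiguous = HashSet::new();
        for n in &names {
            // insert() returns false when the name was already present.
            if !seen.insert(n.clone()) {
                ambiguous.insert(n.clone());
            }
        }
        CheckedSchema { names, ambiguous }
    }

    pub fn is_ambiguous(&self, name: &str) -> bool {
        self.ambiguous.contains(name)
    }

    pub fn has_column(&self, name: &str) -> bool {
        self.names.iter().any(|n| n == name)
    }
}
```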

@alamb
Contributor Author

alamb commented Feb 18, 2023

Not sure -- will try and check out #5328 shortly

@matthewmturner
Contributor

@alamb i can pick this up

@alamb
Contributor Author

alamb commented Dec 30, 2023

> @alamb i can pick this up

Thank you @matthewmturner -- I think this is a "tip of the iceberg" type bug: there are many places in the optimizer that use DFSchema that could be made faster.

Thus I suggest, if possible, taking some time to map out a plan to incrementally improve the situation over time.

@matthewmturner
Contributor

@alamb Sounds good will do that

@alamb
Contributor Author

alamb commented Dec 30, 2023

#8665 (comment) might be instructive

@matthewmturner
Contributor

@alamb aha, I had plans to profile that exact thing as a starting point.

@matthewmturner
Contributor

I tried reproducing your results with Instruments but wasn't able to get to the granularity you had that showed DFSchema as being heavy.

However, I put together a flamegraph and came to a similar conclusion. In the image below, the blocks in purple are hits for my search of DFSchema. Of those, a lot of time was in merge and field_with_qualified_name (which is often called by merge); this appears to be consistent with your profiling. It also looks like all uses of DFSchema are during the optimization pass, which is consistent with your observation.

Based on this, and on how field_with_name / field_with_qualified_name are used within merge, I think I may be able to simply replace them with has_column_with_unqualified_name / has_column_with_qualified_name, which return booleans.

I'm hoping, time permitting, to also do some memory/allocation profiling to make sure these types of changes have the desired effect.

[flamegraph screenshot]

@alamb
Contributor Author

alamb commented Dec 31, 2023

> simply replace them with has_column_with_unqualified_name / has_column_with_qualified_name which return booleans.

The other thing to do would be to look into making DFSchema cheaper to copy/create, for example using an Arc instead of OwnedTableReference (much as @tustvold did for Field in arrow-rs's Fields), so that copying a DFField doesn't require copying strings around.

https://github.com/apache/arrow-datafusion/blob/848f6c395afef790880112f809b1443949d4bb0b/datafusion/common/src/dfschema.rs#L810
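A sketch of the Arc idea with simplified stand-in types (the real change would involve DFField and OwnedTableReference): sharing the qualifier behind an Arc makes cloning a field a reference-count bump instead of a fresh string allocation:

```rust
use std::sync::Arc;

// Simplified stand-in for a field whose qualifier is shared, not owned.
#[derive(Clone)]
pub struct CheapField {
    pub qualifier: Option<Arc<str>>, // cloning bumps a refcount, no copy
    pub name: Arc<str>,
}

pub fn make_fields(qualifier: &str, names: &[&str]) -> Vec<CheapField> {
    // One allocation for the qualifier, shared by every field.
    let q: Arc<str> = Arc::from(qualifier);
    names
        .iter()
        .map(|n| CheapField {
            qualifier: Some(Arc::clone(&q)),
            name: Arc::from(*n),
        })
        .collect()
}
```

Cloning a whole Vec<CheapField> then copies pointers and refcounts rather than the qualifier strings themselves.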

@matthewmturner
Contributor

@alamb sorry for the delay here. I went down a rabbit hole of trying to get some good memory/allocation benchmarks, as I really wanted to be able to measure and compare the cause (allocations) instead of the symptom (time). I made good progress, but I don't want to hold this up any longer and can continue that work separately.

The low-hanging fruit, and what this issue was created for, seems to be updating those function calls, so I think I will start with that; separately we can look into updating how schemas are handled, if that's okay with you.

@matthewmturner
Contributor

matthewmturner commented Jan 5, 2024

Just from updating the merge function we already see considerable improvements

[benchmark comparison screenshot]

@alamb
Contributor Author

alamb commented Jan 5, 2024

> @alamb sorry for the delay here. I went down a rabbit hole of trying to get some good memory/allocation benchmarks, as I really wanted to be able to measure and compare the cause (allocations) instead of the symptom (time). I made good progress, but I don't want to hold this up any longer and can continue that work separately.
>
> The low-hanging fruit, and what this issue was created for, seems to be updating those function calls, so I think I will start with that; separately we can look into updating how schemas are handled, if that's okay with you.

Sounds like a great plan -- thank you!

@matthewmturner
Contributor

matthewmturner commented Jan 12, 2024

@alamb
I've been looking into this more for places where we can replace unused Results with booleans, but nothing stuck out (let me know if you know of any or your intuition says otherwise). I've also been using the great analysis from @zeodotr in #7698 (comment) to guide some of my review.

A couple things:

  • I looked at optimization 6 from @zeodotr's list, and I wasn't able to find columnize_expr as a hot spot in the context of creating the physical plan (I tried reproducing on a wide table with several aggregates, which I believe is the use case they had, though I didn't create 3000+ aggregates like they did). It shows up as ~3% of the CPU time of creating the unoptimized logical plan.
  • I profiled the benchmark for a simple query on a wide table (700 columns), and a significant amount of the CPU time (~87%) is now coming from has_column_with_qualified_name (first screenshot below): 87% in the case of creating the physical plan and 66% for creating the unoptimized logical plan (second screenshot).

Given this seems to be a hotspot for wide tables, do you think the best next step would be looking into improving lookup time by adding a btree (or similar), or should we improve the foundation and work on updating the schema first? From what I've seen, updating the schema may make adding the index easier, so that may be a good start.

[profiler screenshots]
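For reference, the index idea could look roughly like this (a hypothetical sketch, not the real DFSchema): a HashMap built once next to the field list turns the O(n) existence scan into an O(1) lookup, which matters on a 700-column table:

```rust
use std::collections::HashMap;

// Hypothetical sketch: keep an index from qualified name to field position
// so existence checks don't scan every field.
pub struct IndexedSchema {
    pub fields: Vec<String>,       // qualified names, e.g. "t1.a"
    index: HashMap<String, usize>, // built once at construction
}

impl IndexedSchema {
    pub fn new(fields: Vec<String>) -> Self {
        let index = fields
            .iter()
            .enumerate()
            .map(|(i, f)| (f.clone(), i))
            .collect();
        IndexedSchema { fields, index }
    }

    // O(1) instead of O(number of columns).
    pub fn has_column_with_qualified_name(&self, qualified: &str) -> bool {
        self.index.contains_key(qualified)
    }
}
```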

@alamb
Contributor Author

alamb commented Jan 12, 2024

> • I profiled the benchmark for a simple query on a wide table (700 columns), and a significant amount of the CPU time (~87%) is now coming from has_column_with_qualified_name (first screenshot below): 87% in the case of creating the physical plan and 66% for creating the unoptimized logical plan (second screenshot).
>
> Given this seems to be a hotspot for wide tables, do you think the best next step would be looking into improving lookup time by adding a btree (or similar), or should we improve the foundation and work on updating the schema first? From what I've seen, updating the schema may make adding the index easier, so that may be a good start.

Yes, I agree that getting DFSchema into better shape (e.g. not actually copying so many things) would likely make this task easier.

It also looks like has_column_with_qualified_name is always being called from DFSchema::merge; I wonder if we can figure out why that needs to be called so much. My bet is that most of the callsites don't actually add any new fields. Maybe we can quickly check whether the pass made any changes to the children; if not, there is no need to call DFSchema::merge.

Or maybe we can find some way to quickly compare if two schemas are the same 🤔
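One cheap way to "quickly compare if two schemas are the same" (a sketch with hypothetical stand-in types): if schemas are shared behind an Arc, pointer equality gives an O(1) fast path before falling back to a field-by-field comparison:

```rust
use std::sync::Arc;

// Hypothetical simplified schema for illustration.
#[derive(PartialEq)]
pub struct Schema {
    pub fields: Vec<String>,
}

// Fast path first: the same allocation is trivially the same schema;
// only fall back to the field-by-field comparison when pointers differ.
pub fn schemas_equal(a: &Arc<Schema>, b: &Arc<Schema>) -> bool {
    Arc::ptr_eq(a, b) || a == b
}
```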

@comphead
Contributor

comphead commented Jul 9, 2024

I think this can be closed, @alamb? The original issue was resolved, and has_column returns a bool instead of an error.

@comphead
Contributor

comphead commented Jul 9, 2024

@alamb I'm closing it; feel free to reopen it if needed.

comphead closed this as completed Jul 9, 2024