
Optimize BaseRelation.matches() #6844

Closed
wants to merge 1 commit

Conversation

peterallenwebb
Contributor

resolves #6842

Description

This PR optimizes the BaseRelation.matches() function to avoid costly string processing and comparison operations that are performed a very large number of times during dbt build runs on certain large projects. In the large scenario described in #6842, it took my local dbt build run time from 23m to 14m.

As written, we would lose the ApproximateMatchError exception, since determining whether a relation is an approximate match was a large part of the time spent. We'll need to determine whether that loss is justified by the savings, or whether there is a better way to avoid performing the check so many times.

At any rate, there is a lot of room for improvement in this bottleneck.
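Roughly, the direction is this (a sketch, not the exact diff): compare only the components that were passed in, return as soon as one fails an exact comparison via self._is_exactish_match(), and skip the approximate-match pass entirely.

def matches(self, database=None, schema=None, identifier=None):
    # Guard first: at least one component must be supplied to search on.
    if database is None and schema is None and identifier is None:
        raise dbt.exceptions.DbtRuntimeError(
            "Tried to match relation, but no search path was passed!"
        )
    # Bail out on the first mismatch instead of also computing an
    # approximate match for every component.
    for component, value in (
        (ComponentName.Database, database),
        (ComponentName.Schema, schema),
        (ComponentName.Identifier, identifier),
    ):
        if value is not None and not self._is_exactish_match(component, value):
            return False
    return True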

Checklist

@peterallenwebb peterallenwebb requested a review from a team as a code owner February 2, 2023 15:36
@cla-bot cla-bot bot added the cla:yes label Feb 2, 2023
@github-actions
Contributor

github-actions bot commented Feb 2, 2023

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@peterallenwebb
Contributor Author

@boxysean If you have a chance to kick the tires on this change, please do and let me know how it works for you. If it looks good to you, I'll discuss it with other stakeholders.

@jtcohen6
Contributor

jtcohen6 commented Feb 3, 2023

@peterallenwebb This is awesome!!

Perf bottlenecks in dbt

First, I want to offer some higher-level context for this change:

  • There are some "fixed costs" at the start of every run: parsing (files → manifest), graph generation (manifest → graph), cache population (run queries to build up relational cache on adapter).
  • All of those have to happen on the main thread, before we can do anything else, and they scale with the size of the overall project, rather than the number of nodes actually selected to run. That's particularly frustrating for developers in large projects who may just want to be running dbt build --select one_model. (There is an experimental config to modify cache population based on node selection.)
  • Once all those fixed costs are paid, we move into a queue. Now, each node can run, in parallel, on an independent thread—up to the number of --threads, depending on the shape of the DAG (interdependencies), and only the subset of nodes actually selected to run.
  • So: I would expect BaseRelation.get_relation() to run within those threads, and only for nodes that are actually selected to run. This does add up to a lot of time if running thousands of models, but I'd expect it to scale in proportion with that number of models, and to be parallelized across multiple threads. (Exception: In the case of deferral, we run adapter.get_relation as a fixed-cost step earlier on, to determine which upstream models do/don't exist in the target namespace.)
  • Finally, once we actually get into running models against the warehouse, dbt doesn't tend to be the performance bottleneck anymore. We're "I/O bound" by the time it takes to actually execute each query on the warehouse, which can be anywhere from <1s (create view) to many minutes (create table, merge). At this point, when users are thinking about performance, they're thinking about database performance, and dbt's role is mostly around making sure it's templating the right DDL/DML, with just the right keywords or magic incantations that make the data platform purr.
  • Of course, dbt can also try to avoid running duplicative queries. That's the motivation behind having the relational cache to begin with, populated at the beginning & accessed/updated as we go along. Those cache lookups should be much faster (overall) than running the same metadata queries, over and over, against the warehouse, while materializing every single model ("does this relation already exist?" "is it a table?").

This PR: proposed trade-off

As written, we would lose the ApproximateMatchError exception, since determining whether a relation is an approximate match was a large part of the time spent. We'll need to determine whether that loss is justified by the savings, or whether there is a better way to avoid performing the check so many times.

I could be open to making this change. The ApproximateMatchError is not particularly delightful for end users to see today. It is our attempt to surface an explicit error when dbt has missed matching up a user's defined relation to one in the relation cache, because of a very subtle discrepancy in casing/quoting. The alternative, unfortunately, is even more confusing behavior: whatever fallout results from missing the match. As it is, though, the error is already pretty hard to debug, and the UX loss of removing this extra detection does feel outweighed by the UX improvement of a significant speedup.

A little history:

A concrete example

Here's a simple case for reproducing when this exception would be helpful. On Snowflake, the default case is uppercase (both ANSI-compliant and annoying), and quoted identifiers are case-sensitive. This was the bane of my existence in 2018; since then, we've disabled quoting for relation identifiers by default, and it's much more pleasant.

So if I create a model like:

-- models/my_model.sql
select 1 as id

When I dbt run, it templates out a SQL statement like:

create or replace view analytics.dbt_jcohen.my_model as (select 1 as id);

In Snowflake, this relation has a much more boisterous name: ANALYTICS.DBT_JCOHEN.MY_MODEL. It's unquoted, ergo case-insensitive, ergo uppercase. What happens, though, if I turn quoting on, for all dbt-created relations in my project?

# dbt_project.yml
quoting:
  identifier: true

Now, dbt is going to try to template a SQL statement like:

create or replace view analytics.dbt_jcohen."my_model" as (select 1 as id);

But we don't even get there, because first dbt populates the adapter cache, then it tries to match up my model with an entry in the cache, and it sees there's an almost-but-not-quite matching entry. And we stop the whole thing in its tracks, because we want to avoid an ugly scenario.

$ dbt run
...
11:12:25  Compilation Error in model my_model (models/my_model.sql)
11:12:25    When searching for a relation, dbt found an approximate match. Instead of guessing
11:12:25    which relation to use, dbt will move on. Please delete "ANALYTICS"."DBT_JCOHEN"."MY_MODEL", or rename it to be less ambiguous.
11:12:25    Searched for: ANALYTICS.DBT_JCOHEN.my_model
11:12:25    Found: "ANALYTICS"."DBT_JCOHEN"."MY_MODEL"
11:12:25
11:12:25    > in macro create_or_replace_view (macros/materializations/models/view/create_or_replace_view.sql)
11:12:25    > called by macro materialization_view_snowflake (macros/materializations/view.sql)
11:12:25    > called by model my_model (models/my_model.sql)
...

If I try the same, having checked out your branch, I don't get that exception—the model succeeds!—because we didn't get a match, and we didn't check for an approximate match either. dbt successfully created a view named analytics.dbt_jcohen."my_model". Of course, depending on which of these queries I run in Snowflake, I will actually be querying a different view:

select * from analytics.dbt_jcohen.my_model;
select * from analytics.dbt_jcohen."my_model";

It's a gross situation, no question. But, in keeping with what I said above, I'm not convinced that the ApproximateMatchError exception does a whole lot to make the situation less gross—it just shoves the grossness in the user's face, earlier and a bit more explicitly.

@boxysean
Contributor

boxysean commented Feb 3, 2023

Thanks for the explanation @jtcohen6!

I'd be curious to see some real-world results, but I won't have time in the next 1-2 weeks to review due to travel. A similar analysis to what I did here would help us determine the impact of @peterallenwebb's proposed change. I will ask my client to see if they can support this.

I'd also be curious to see some unit tests on get_relation() 🙈

@peterallenwebb
Contributor Author

@jtcohen6 Yes, thanks very much for this clear explanation of where the practical performance concerns really are! I'll keep it in mind as I inevitably continue to tinker with performance. I definitely don't feel strongly about getting this change in if it is unlikely to make an impact under real world conditions.

@peterallenwebb peterallenwebb changed the title Optimize BaseRelaton.matches() Optimize BaseRelation.matches() Feb 3, 2023
@jtcohen6
Contributor

jtcohen6 commented Feb 3, 2023

@peterallenwebb If this proves to be a significant perf boost in an "in-the-wild" scenario, I'd be supportive of moving forward! Sounds like the next step here is testing with a real large project. We do have one of these for our own internal analytics :)

target = self.create(database=database, schema=schema, identifier=identifier)
raise ApproximateMatchError(target, self)
if database is None and schema is None and identifier is None:
raise dbt.exceptions.DbtRuntimeError(
Contributor

I would put this first since you don't need to run self._is_exactish_match(). I'm assuming that method is the expensive one. I'd also rephrase the if clause:

if not any((database, schema, identifier)):
    raise dbt.exceptions.DbtRuntimeError(...)

"Tried to match relation, but no search path was passed!"
)
if identifier is not None and not self._is_exactish_match(
ComponentName.Identifier, identifier
Contributor

I like the way it was originally written. I think the performance pickup comes from two places:

  1. not looking for the approximate match
  2. exiting the for loop once a match was found

I think it could look something like this:
if not any((identifier, schema, database)):
    raise dbt.exceptions.DbtRuntimeError(...)

search = filter_null_values(
    {
        ComponentName.Identifier: identifier,
        ComponentName.Schema: schema,
        ComponentName.Database: database
    }
)

return any(
    (
        self._is_exactish_match(existing_components, new_component)
        for existing_components, new_component in search.items()
    )
)

I'm pretty sure any() will lazily evaluate each element in the generator and then stop when it finds one, which is what you're trying to do.
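A quick standalone sketch of that short-circuit behavior (plain Python, nothing dbt-specific):

def noisy_match(value):
    print(f"checking {value}")
    return value == "b"

# any() pulls items from the generator one at a time and stops at the
# first truthy result, so "c" is never evaluated.
print(any(noisy_match(v) for v in ["a", "b", "c"]))
# checking a
# checking b
# True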

@peterallenwebb
Contributor Author

peterallenwebb commented Feb 13, 2023

@jtcohen6 @boxysean I'm probably going to close this particular PR for now, since there are risks to any effective improvement, and we have not demonstrated that it is needed in the field.

That said, I have some interesting parting observations...

Our current strategy for looking up relations in the cache is compute-intensive, since it has to account for all the ways a relation name might be quoted or cased. Our strategy also takes time linear in the number of tables in the cache. The entire list of tables in the cache is scanned for a match every time a relation is looked up. A lookup in a cache with 1000 entries will be ten times slower than one with 100 on average. With some development effort we could make this a constant-time lookup with much lower overhead.
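To sketch what a constant-time lookup could look like (a hypothetical helper, not code from this PR): normalize each name component once, at insertion time, and key the cache by the normalized (database, schema, identifier) tuple, so a lookup becomes a single dict access instead of a scan.

from typing import Dict, Optional, Tuple

CacheKey = Tuple[Optional[str], Optional[str], Optional[str]]

def _normalize(part: Optional[str], quote_char: str = '"') -> Optional[str]:
    # Strip quoting and fold case once, so lookups don't repeat the
    # string processing for every cached relation.
    if part is None:
        return None
    return part.strip(quote_char).lower()

class RelationIndex:
    # Hypothetical constant-time index over cached relations.
    def __init__(self) -> None:
        self._by_key: Dict[CacheKey, object] = {}

    def add(self, relation) -> None:
        key = (
            _normalize(relation.database),
            _normalize(relation.schema),
            _normalize(relation.identifier),
        )
        self._by_key[key] = relation

    def get(self, database, schema, identifier):
        # O(1) on average, regardless of how many relations are cached.
        key = (_normalize(database), _normalize(schema), _normalize(identifier))
        return self._by_key.get(key)

Note that collapsing everything to a normalized key sidesteps, rather than answers, the exact-vs-approximate distinction discussed above.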

It's not clear how important this bottleneck is in real-world production scenarios, but the anonymized client project which @boxysean provided me has the following runtimes for dbt build on my local machine:

  • Stock dbt: 23 minutes
  • With the optimization from this PR: 14 minutes
  • With a quick/dirty implementation of constant-time lookup: 7 minutes

As @jtcohen6 has pointed out to me, our multithreading model might blunt the impact of the compute savings in real-world scenarios. It's still interesting how much overhead is being spent on this operation, though.

@jtcohen6
Contributor

It's still interesting how much overhead is being spent on this operation, though.

Agree that it's very interesting. If you think the right next step is to close this specific PR for now, given some unknowns in the risks & benefits, I won't argue. I do think we should keep #6842 open as a promising lead to revisit for perf improvements in the future.

@jtcohen6 jtcohen6 mentioned this pull request Apr 10, 2023
@github-actions
Contributor

This PR has been marked as Stale because it has been open with no activity as of late. If you would like the PR to remain open, please comment on the PR or else it will be closed in 7 days.

@github-actions github-actions bot added the stale label Aug 13, 2023
@github-actions
Contributor

Although we are closing this PR as stale, it can still be reopened to continue development. Just add a comment to notify the maintainers.

Labels
cla:yes, stale
Development

Successfully merging this pull request may close these issues.

[CT-2018] Performance Bottleneck in BaseRelation.get_relation()
4 participants