Implement the relation cache for `dbt compile` and `dbt ls` #1705

drewbanin · 2019-08-27T18:56:37Z

Describe the feature

In the dbt run command, dbt builds a relation cache. dbt should also build this cache when the dbt compile and dbt ls commands are invoked.

Bonus: It would be even better if dbt lazily populated this cache. TBD if that takes place in this issue, or if we should make a separate issue for that.

Edited: Removed references to the rpc server

The text was updated successfully, but these errors were encountered:

drewbanin · 2019-09-12T17:25:08Z

Kicking this out of the LMA milestone. The better version of this will:

lazily populate the relation cache
lock around calls to populate the cache
- lock per schema?
- lock around the whole cache?

mcannamela · 2020-03-19T14:45:50Z

Are contributions welcome on this? We're experiencing punishing compile times on Snowflake that would be alleviated by fixing this.

drewbanin · 2020-03-19T20:34:55Z

hey @mcannamela - sure, we'd love a PR for this one! I think the big gain here is going to be lazily populating the relation cache.

We want to make these introspective queries run when they're needed, not at startup. Right now, dbt is going to scan every schema referenced in your project, even if you're just doing something like:

dbt compile -m my_model

In this case, we'd want to:

wait until my_model is actively compiled to query the information schema
only query the information schema for the relations in the relevant schema for my_model, not all of the schemas referenced in the project

The locking piece comes in if you have more than one model running concurrently that's hitting the same information schema:

dbt compile -m my_first_model my_second_model

If both of these models are materialized in the same information schema namespace[1], dbt should only run the information schema query once, then return the results to both models.

@beckjake there was some good reason why we didn't implement the original feature request (populating the relation cache) -- do you remember why that was?

[1] this varies by warehouse. On postgres/redshift/snowflake, the information schema is accessed at the database level. On BigQuery, the information schema is accessed at the dataset level.

beckjake · 2020-03-19T21:09:15Z

there was some good reason why we didn't implement the original feature request (populating the relation cache) -- do you remember why that was?

I think it's because populating the cache is expensive, and many compile/ls invocations don't require it. We didn't do lazy loading the cache because it seems hard and our time is finite (we wrote this issue instead). But that was a long time ago, my memory could be wrong.

mcannamela · 2020-03-22T20:00:53Z

Could you clarify which compile invocations would not require it? From my reading, it seems like it's going to be hit any time BaseAdapter.get_relation is called, so compiling any model with a ref would need to invoke it at least once. At least for us if you picked a model at random then more likely than not it would contain a ref.

I came to this issue via #1737 and that seemed like a win to me, since you could just pass the --bypass-cache flag and be no worse off than you are currently.

Is there some reason why populating the cache up front for all schemas is beneficial for the run command but would be detrimental for compile? Or would they ideally have the same lazy behavior?

Not trying to be contrarian here just want to snuff out my own ignorance here before I or my team undertakes this.

beckjake · 2020-03-23T13:15:32Z

The trick is that not everything has to call get_relation - ref doesn't actually require it.

BaseAdapter.get_relation is (generally!) only called during materialization execution, which compilation doesn't look at - compilation only generates the sql variable for materializations, and then at runtime those materializations are evaluated.

I think ideally they'd have the same lazy behavior.

mcannamela · 2020-03-30T13:19:07Z

Got it, thanks.

drewbanin · 2020-03-31T19:09:27Z

Adding this to the Octavius Catto milestone (for now, might not make it into that release). Folks can disable this if they're experiencing negative performance characteristics with --bypass-cache.

…elation-caches dbt compile and dbt ls relation caches (#1705)

drewbanin added enhancement New feature or request rpc Issues related to dbt's RPC server labels Aug 27, 2019

drewbanin added this to the Louisa May Alcott milestone Aug 27, 2019

beckjake self-assigned this Aug 30, 2019

drewbanin changed the title ~~Build the relation cache in the RPC Server~~ Implement the relation cache for dbt compile and dbt ls Aug 30, 2019

drewbanin removed the rpc Issues related to dbt's RPC server label Aug 30, 2019

beckjake mentioned this issue Sep 10, 2019

populate the cache on compile, too (#1705) #1737

Closed

cmcarthur assigned drewbanin and unassigned beckjake Sep 12, 2019

drewbanin removed this from the Louisa May Alcott milestone Sep 12, 2019

drewbanin added this to the Octavius Catto milestone Mar 31, 2020

beckjake mentioned this issue Apr 14, 2020

Feature/compile ls relation caches (#1705) #2319

Merged

4 tasks

beckjake closed this as completed in #2319 Apr 14, 2020

beckjake added a commit that referenced this issue Apr 14, 2020

Merge pull request #2319 from fishtown-analytics/feature/compile-ls-r…

79db880

…elation-caches dbt compile and dbt ls relation caches (#1705)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement the relation cache for `dbt compile` and `dbt ls` #1705

Implement the relation cache for `dbt compile` and `dbt ls` #1705

drewbanin commented Aug 27, 2019 •

edited

Loading

drewbanin commented Sep 12, 2019

mcannamela commented Mar 19, 2020

drewbanin commented Mar 19, 2020

beckjake commented Mar 19, 2020

mcannamela commented Mar 22, 2020

beckjake commented Mar 23, 2020

mcannamela commented Mar 30, 2020

drewbanin commented Mar 31, 2020

Implement the relation cache for dbt compile and dbt ls #1705

Implement the relation cache for dbt compile and dbt ls #1705

Comments

drewbanin commented Aug 27, 2019 • edited Loading

Describe the feature

drewbanin commented Sep 12, 2019

mcannamela commented Mar 19, 2020

drewbanin commented Mar 19, 2020

beckjake commented Mar 19, 2020

mcannamela commented Mar 22, 2020

beckjake commented Mar 23, 2020

mcannamela commented Mar 30, 2020

drewbanin commented Mar 31, 2020

Implement the relation cache for `dbt compile` and `dbt ls` #1705

Implement the relation cache for `dbt compile` and `dbt ls` #1705

drewbanin commented Aug 27, 2019 •

edited

Loading