Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement the relation cache for dbt compile and dbt ls #1705

Closed
drewbanin opened this issue Aug 27, 2019 · 8 comments · Fixed by #2319
Closed

Implement the relation cache for dbt compile and dbt ls #1705

drewbanin opened this issue Aug 27, 2019 · 8 comments · Fixed by #2319
Assignees
Labels
enhancement New feature or request

Comments

@drewbanin
Copy link
Contributor

drewbanin commented Aug 27, 2019

Describe the feature

In the dbt run command, dbt builds a relation cache. dbt should also build this cache when the dbt compile and dbt ls commands are invoked.

Bonus: It would be even better if dbt lazily populated this cache. TBD if that takes place in this issue, or if we should make a separate issue for that.

Edited: Removed references to the rpc server

@drewbanin drewbanin added enhancement New feature or request rpc Issues related to dbt's RPC server labels Aug 27, 2019
@drewbanin drewbanin added this to the Louisa May Alcott milestone Aug 27, 2019
@beckjake beckjake self-assigned this Aug 30, 2019
@drewbanin drewbanin changed the title Build the relation cache in the RPC Server Implement the relation cache for dbt compile and dbt ls Aug 30, 2019
@drewbanin drewbanin removed the rpc Issues related to dbt's RPC server label Aug 30, 2019
@cmcarthur cmcarthur assigned drewbanin and unassigned beckjake Sep 12, 2019
@drewbanin
Copy link
Contributor Author

Kicking this out of the LMA milestone. The better version of this will:

  • lazily populate the relation cache
  • lock around calls to populate the cache
    • lock per schema?
    • lock around the whole cache?

@drewbanin drewbanin removed this from the Louisa May Alcott milestone Sep 12, 2019
@mcannamela
Copy link

Are contributions welcome on this? We're experiencing punishing compile times on Snowflake that would be alleviated by fixing this.

@drewbanin
Copy link
Contributor Author

hey @mcannamela - sure, we'd love a PR for this one! I think the big gain here is going to be lazily populating the relation cache.

We want to make these introspective queries run when they're needed, not at startup. Right now, dbt is going to scan every schema referenced in your project, even if you're just doing something like:

dbt compile -m my_model

In this case, we'd want to:

  • wait until my_model is actively compiled to query the information schema
  • only query the information schema for the relations in the relevant schema for my_model, not all of the schemas referenced in the project

The locking piece comes in if you have more than one model running concurrently that's hitting the same information schema:

dbt compile -m my_first_model my_second_model

If both of these models are materialized in the same information schema namespace[1], dbt should only run the information schema query once, then return the results to both models.

@beckjake there was some good reason why we didn't implement the original feature request (populating the relation cache) -- do you remember why that was?

[1] this varies by warehouse. On postgres/redshift/snowflake, the information schema is accessed at the database level. On BigQuery, the information schema is accessed at the dataset level.

@beckjake
Copy link
Contributor

there was some good reason why we didn't implement the original feature request (populating the relation cache) -- do you remember why that was?

I think it's because populating the cache is expensive, and many compile/ls invocations don't require it. We didn't do lazy loading the cache because it seems hard and our time is finite (we wrote this issue instead). But that was a long time ago, my memory could be wrong.

@mcannamela
Copy link

Could you clarify which compile invocations would not require it? From my reading, it seems like it's going to be hit any time BaseAdapter.get_relation is called, so compiling any model with a ref would need to invoke it at least once. At least for us if you picked a model at random then more likely than not it would contain a ref.

I came to this issue via #1737 and that seemed like a win to me, since you could just pass the --bypass-cache flag and be no worse off than you are currently.

Is there some reason why populating the cache up front for all schemas is beneficial for the run command but would be detrimental for compile? Or would they ideally have the same lazy behavior?

Not trying to be contrarian here just want to snuff out my own ignorance here before I or my team undertakes this.

@beckjake
Copy link
Contributor

The trick is that not everything has to call get_relation - ref doesn't actually require it.

BaseAdapter.get_relation is (generally!) only called during materialization execution, which compilation doesn't look at - compilation only generates the sql variable for materializations, and then at runtime those materializations are evaluated.

I think ideally they'd have the same lazy behavior.

@mcannamela
Copy link

Got it, thanks.

@drewbanin drewbanin added this to the Octavius Catto milestone Mar 31, 2020
@drewbanin
Copy link
Contributor Author

Adding this to the Octavius Catto milestone (for now, might not make it into that release). Folks can disable this if they're experiencing negative performance characteristics with --bypass-cache.

beckjake added a commit that referenced this issue Apr 14, 2020
…elation-caches

dbt compile and dbt ls relation caches (#1705)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
3 participants