
Lazy load connections #1584

Closed
heisencoder opened this issue Jun 28, 2019 · 23 comments · Fixed by #1992
Labels
enhancement New feature or request

@heisencoder
Contributor

Issue

The dbt compile command requires gcloud for bigquery profiles.

This is a regression from version 0.13.1, which did not require gcloud for running the compile command.

Issue description

Command:

dbt compile
dbt --debug compile --profiles-dir . --vars "{target-path: my_path}";

Results

Error Summary:

Could not find command, ensure it is in the user's PATH: "gcloud"

dbt_compile_error.txt

System information

The output of dbt --version:
Running with dbt=0.14.0-a2

The operating system you're running on:
Debian-based Linux with kernel version 4.19.37

The python version you're using (probably the output of python --version)
2.7

Steps to reproduce

  • Run dbt compile with a project that uses bigquery and with gcloud missing from the path.
@heisencoder
Contributor Author

I think this was introduced in e2af871 with this change to the context/parser.py generate function:

def generate(model, runtime_config, manifest, source_config):
    # during parsing, we don't have a connection, but we might need one, so we
    # have to acquire it.
    # In the future, it would be nice to lazily open the connection, as in some
    # projects it would be possible to parse without connecting to the db
    with get_adapter(runtime_config).connection_named(model.get('name')):
        return dbt.context.common.generate(
            model, runtime_config, manifest, source_config, Provider()
        )

beckjake@, would it be possible to go back to making this lazily open the connection? I can't contact an actual bigquery client to run the compile command.

@heisencoder
Contributor Author

(Meant to put the @ in front: @beckjake)

@drewbanin
Contributor

Thanks for the report @heisencoder. Can you tell me a little bit more about your setup here? I'm interested in your note that you can't access a BQ client when running dbt compile -- that feels atypical given my experience with most dbt projects out there, so definitely curious to learn more about your use case.

Additionally, I'm unable to reproduce this! I've removed gcloud from my path, but can't seem to make dbt fail with this type of error. Do you know if there are other specifics about your project required to make dbt fail in this way?

@heisencoder
Contributor Author

I'm running dbt compile in a hermetic, stripped-down environment that does not allow external RPCs. There is an error message that suggests that gcloud needs to be in the PATH, but I suspect the real failure is due to gcloud trying to make an external RPC and failing.

I suspect that uninstalling gcloud could lead to an easy reproduction, although I don't know whether this will cause other unrelated failures. Maybe try unplugging the network cable and then running the dbt compile command?

In any case, I think the main issue is due to greedily creating a connection during the compile (in the generate function). Jake's comment suggests that this could be approached via a lazy connection, although this was not implemented.

@drewbanin
Contributor

drewbanin commented Jun 28, 2019

Ok, that makes sense. Fundamentally, this line of reasoning is a little divergent from how we think about model compilation. Right now, the dbt compile command just outputs the select statement that dbt will use during the model materialization in the dbt run step. Longer term though, I'd love to make dbt compile output the full DDL/DML it would run at dbt run-time. This involves inspecting the state of the database and recording the existence (and types) of the relations that dbt intends to operate on. It just so happens that BigQuery didn't historically provide SQL interfaces into its state, but going forward we're planning on using things like the information_schema to fetch information about the state of the database.

We could indeed lazily load connections, but I imagine that these future plans are necessarily going to involve running SQL against the warehouse in the dbt compile step, so that actually wouldn't be a super helpful change given your particular use case.

Can you just say a little bit more about why you're running dbt compile in a hermetic environment like this? Are you just validating that the project is compilable? Or are you grabbing the compiled output from the target/ directory? Or something else?

I hesitate to recommend this, but one approach may be to connect to a locally-running postgres instance in your hermetic environment. The resulting SQL might be postgres-specific, but depending on your needs, that might be acceptable.

Edit: made this make more sense :)

@heisencoder
Contributor Author

What I'm trying to do with the dbt compile step is to just create the manifest.json file and then use that to link together an Airflow DAG in a similar manner to how this reference implementation uses the networkx gpickle file: https://gist.github.com/adamhaney/916a21b0adcabef4038c38e3ac96a36f

I'm happy to have the SQL interpolation happen at run time, so in any case would want to defer actual network connects to the run phase.
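For reference, the manifest-to-DAG linking described above can be sketched roughly like this. This is a minimal sketch, not dbt or Airflow code: the `"nodes"`/`"depends_on"` keys are assumptions about the manifest.json layout, and `edges_from_manifest` is a name I made up.

```python
# Rough sketch (not dbt code): extract (parent, child) edges from a parsed
# manifest.json so they can be wired into an Airflow DAG. The "nodes" and
# "depends_on" keys are assumptions about the manifest layout.
import json

def edges_from_manifest(manifest):
    """Yield (parent_id, child_id) pairs for every model dependency."""
    for unique_id, node in manifest.get("nodes", {}).items():
        for parent in node.get("depends_on", {}).get("nodes", []):
            yield parent, unique_id

# Tiny hand-built manifest standing in for the real target/manifest.json.
manifest = json.loads("""
{
  "nodes": {
    "model.proj.a": {"depends_on": {"nodes": []}},
    "model.proj.b": {"depends_on": {"nodes": ["model.proj.a"]}}
  }
}
""")
print(sorted(edges_from_manifest(manifest)))
# → [('model.proj.a', 'model.proj.b')]
```

Each yielded pair would then become a task dependency (`parent >> child`) in the Airflow DAG.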

@drewbanin
Contributor

Ok - that totally makes sense. It also helps me understand what #1538 was all about too. Thanks for the additional info!

Maybe the right way to proceed here is to expose dbt's understanding of the graph after the parsing step? This seems like something we could support in a first-class way via the (eventual) Python API. dbt compiles projects in two steps:

  1. parsing out the shape of the graph by evaluating ref and config function calls
  2. "compiling" SQL with a fully-built out context and connections to the database

After parsing, dbt will have a complete understanding of the shape of the graph, but will not yet have "correct" SQL for each of the models. Sounds like that wouldn't be a problem given your use case.

@heisencoder
Contributor Author

In the short-term, I either need to revert back to version 0.13.1 or find a way to patch 0.14 to allow for lazily creating database connections. Longer term, I'd be happy if there was either a Python API or a command-line 'parse' option to create the manifest.json file as a hermetic step (I have a slight preference for a Python API, since I'm already creating one myself by wrapping the dbt binary with a Python API). Ultimately, I want to create a rich unit-testing environment that understands the dependencies of each SQL file, but doesn't require creating any real database connections. This testing environment needs to understand the DAG, which is why I need to build it hermetically.

Is the lazy evaluation option still on the table? I took out the database connection in the generate() function, but it failed on me later on.

Thanks!

@drewbanin
Contributor

I believe @beckjake implemented a new pattern for adapters in which each thread gets their own dedicated connection. I think the diff you're seeing here was a part of that change, though I'm not 100% certain.

Jake, this is a little outside of my wheelhouse: how involved do you think it is to make these adapters lazy again? Is it something that could conceivably happen for 0.14.1? Or are we better off supporting this use case via some other mechanism?

@beckjake
Contributor

beckjake commented Jul 1, 2019

I don't love that we eagerly acquire the connection, and I am into trying to make it lazy load for 0.14.1

My thoughts were that "acquiring" a connection in the with block would mark it as ready to be acquired, and then on exit we'd only close it if it was acquired at some point (so the handle was non-None/state left init, something like that)
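A minimal sketch of that idea (my own illustrative names, not dbt's actual adapter code): entering the context only registers the connection, the handle is opened on first use, and on exit it is closed only if it was ever actually opened.

```python
# Hypothetical sketch of lazy connection acquisition. LazyConnection,
# connection_named, and open_fn are illustrative names; a real adapter
# would dial the warehouse inside open_fn.
from contextlib import contextmanager

class LazyConnection:
    def __init__(self, name, open_fn):
        self.name = name
        self._open_fn = open_fn
        self.handle = None  # stays None until first use

    def acquire(self):
        if self.handle is None:
            self.handle = self._open_fn(self.name)
        return self.handle

@contextmanager
def connection_named(name, open_fn):
    conn = LazyConnection(name, open_fn)
    try:
        yield conn
    finally:
        if conn.handle is not None:  # only close if ever acquired
            conn.handle.close()

class FakeHandle:
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

opened = []
with connection_named("my_model", lambda name: opened.append(name) or FakeHandle()):
    pass  # the parsing path never touches the database
print(opened)  # → []
```

On the parse-only path no connection is ever opened, which is exactly the behavior the hermetic use case needs.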

@heisencoder
Contributor Author

Thanks Jake! Do you have a sense of when this might be ready? I'm trying to decide whether to revert back to 0.13.1 or look into moving forward with patches against 0.14.

@drewbanin
Contributor

I think you might be best off reverting to 0.13.1 - we're wrapping up the release candidate for 0.14.0 presently, and I don't think this is a change we're going to be able to incorporate this late into the release cycle. Super happy to turn this around in short order after 0.14.0 drop though! Apologies for the inconvenience, but definitely looking to implement this change!

@heisencoder
Contributor Author

Sounds good -- thanks!

@heisencoder
Contributor Author

@beckjake Is there a tracking feature that I can follow to see progress on creating lazy connections?

@beckjake
Contributor

beckjake commented Jul 1, 2019

@heisencoder I'll use this issue for that

@heisencoder
Contributor Author

Another thought on this same topic: would it make sense to address this issue by explicitly using a 'fake' version of the database? In the context of BigQuery, this could be accomplished by creating a new method called something like 'fake'. The 'fake' method would then have a fake implementation of the BigQuery interface (possibly using a MagicMock initially) that would not create any real connections, but would instead do something like log all the commands that would have been sent. This could also be viewed as a 'dry-run' mode that lets developers see what commands would be sent before making actual database changes. It could be expressed as a --dry-run flag or as a new profile target (like 'fake').

Either of these approaches would make compile work inside my hermetic build and test environment.

Thoughts?

@drewbanin
Contributor

drewbanin commented Aug 29, 2019

Hey @heisencoder - dbt requires a connection to the database in order to correctly compile SQL. If you have code like:

{{ config(materialized='incremental') }}

select * from my_big_table
{% if is_incremental() %}
  where event_time > (select max(event_time) from {{ this }})
{% endif %}

then the is_incremental() macro will require a query to the database in order to correctly compile this sql. When I think of --dry-run, I imagine that dbt would still make introspective queries to the database, but it would not run any sort of mutative/creative queries.

If you're so inclined, you can create a new "null" adapter for dbt which just implements the adapter interface and discards queries. I actually think that might be a kind of cool thing for dbt to ship with :)

Note: if you call dbt compile on your query with an adapter like that, the compiled SQL is going to be pretty inaccurate. If you actually want to compile SQL with dbt, you'll need to have an open connection to the database. If you care more about building the graph with dbt, this could be a good option.
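A minimal version of that "null" adapter idea might look like the following. This is a sketch only: `NullAdapter` and its `execute()` signature are illustrative names, not dbt's actual adapter protocol.

```python
class NullAdapter:
    """Hypothetical adapter that satisfies a tiny execute() interface but
    records SQL instead of sending it to any warehouse."""

    def __init__(self):
        self.log = []  # every statement that *would* have been run

    def execute(self, sql, fetch=False):
        self.log.append(sql)
        # pretend every statement succeeds and returns no rows
        return "OK", ([] if fetch else None)

adapter = NullAdapter()
status, rows = adapter.execute("select 1", fetch=True)
print(status, rows, adapter.log)  # → OK [] ['select 1']
```

Because every query "succeeds" with no rows, introspective calls resolve to empty results, which keeps compilation running without a database at the cost of accuracy in the compiled SQL.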

@heisencoder
Contributor Author

Hi @drewbanin

The main goal of compiling the SQL is to test the SQL in an isolated environment that doesn't have a database connection. This would just be for simple syntax and schema testing, things that could be verified in a unit-test context. It looks like the is_incremental macro defaults to returning false if there is no database connection, which is probably fine for our purposes.

I'm assuming that if we're not using certain dbt power features, the compiled SQL will exactly match the executed SQL. Are there macros other than is_incremental that we need to avoid?

With regard to a "null" adapter, I'll probably need to create one to allow internally upgrading to 0.14. I'll look into putting one together, but I may also need certain BigQuery features, so I'm also considering creating one that extends the existing BigQuery adapter and just replaces the connection with a fake object of some kind.

Thanks!
-Matt

@drewbanin
Contributor

The is_incremental() macro just proxies a call to adapter.get_relation() I believe. You may want to watch out for any other similar calls to adapter methods, but otherwise you should be in good shape.

@heisencoder
Contributor Author

Hi @drewbanin

With regard to implementing a 'null' adapter, I think for my purposes I'll need to extend the existing bigquery adapter, because I need the result of the dbt compile command to closely match that of the existing bigquery adapter. Is this something that you'd be interested in taking upstream eventually?

@drewbanin
Contributor

I think there could be merit to supporting a "null" adapter in dbt core, but probably not one which implements BigQuery-specific semantics. In general, we're aiming to have the output of dbt compile match exactly what dbt would actually execute in a dbt run, so I think some amount of querying/introspection is going to be required. Issue #588 does a good job of outlining the place we want to get to with dbt's approach to compilation if you're interested in reading more about this.

I don't have a firm use-case for a generic null adapter in mind (it certainly would not output correct SQL for any given database) -- maybe just as a debugging / testing sort of utility? So, feel free to do whatever fits your needs here!

@heisencoder
Contributor Author

Thank you Drew for your thoughts! I'll go ahead and create an internal fake BigQuery adapter then.

@drewbanin drewbanin changed the title 'dbt compile' requires gcloud with bigquery profiles Lazy load connections Nov 8, 2019
@drewbanin drewbanin added this to the 0.15.1 milestone Nov 8, 2019
@drewbanin drewbanin added the enhancement New feature or request label Nov 8, 2019
@drewbanin
Contributor

This came up again today. We're going to prioritize this for an 0.15.x release - currently looking like 0.15.1.

The impetus for this is that acquiring connections greedily can be really slow. This especially impacts real-time queries, like ones that can be submitted to a dbt server.

drewbanin added a commit that referenced this issue Dec 9, 2019
drewbanin added a commit that referenced this issue Dec 9, 2019
…lazy-load-connections

Revert "lazy-load connections (#1584)"