
Lazy load connections #1584

Closed
heisencoder opened this issue Jun 28, 2019 · 23 comments · Fixed by #1992
Labels
enhancement New feature or request

@heisencoder
Contributor

Issue

The dbt compile command requires gcloud for bigquery profiles.

This is a regression from version 0.13.1, which did not require gcloud for running the compile command.

Issue description

Command:

dbt compile
dbt --debug compile --profiles-dir . --vars "{target-path: my_path}";

Results

Error Summary:

Could not find command, ensure it is in the user's PATH: "gcloud"

dbt_compile_error.txt

System information

The output of dbt --version:
Running with dbt=0.14.0-a2

The operating system you're running on:
Debian-based Linux with kernel version 4.19.37

The python version you're using (probably the output of python --version)
2.7

Steps to reproduce

  • Run dbt compile with a project that uses bigquery and with gcloud missing from the path.
@heisencoder
Contributor Author

I think this was introduced in e2af871 with this change to the context/parser.py generate function:

def generate(model, runtime_config, manifest, source_config):
    # during parsing, we don't have a connection, but we might need one, so we
    # have to acquire it.
    # In the future, it would be nice to lazily open the connection, as in some
    # projects it would be possible to parse without connecting to the db
    with get_adapter(runtime_config).connection_named(model.get('name')):
        return dbt.context.common.generate(
            model, runtime_config, manifest, source_config, Provider()
        )

beckjake@, would it be possible to go back to making this lazily open the connection? I can't contact an actual bigquery client to run the compile command.

@heisencoder
Contributor Author

(Meant to put the @ in front: @beckjake)

@drewbanin
Contributor

Thanks for the report @heisencoder. Can you tell me a little bit more about your setup here? I'm interested in your note that you can't access a BQ client when running dbt compile -- that feels atypical given my experience with most dbt projects out there, so definitely curious to learn more about your use case.

Additionally, I'm unable to reproduce this! I've removed gcloud from my path, but can't seem to make dbt fail with this type of error. Do you know if there are other specifics about your project required to make dbt fail in this way?

@heisencoder
Contributor Author

I'm running dbt compile in a hermetic, stripped-down environment that does not allow external RPCs. There is an error message that suggests that gcloud needs to be in the PATH, but I suspect the real failure is due to gcloud trying to make an external RPC and failing.

I suspect that uninstalling gcloud could lead to an easy reproduction, although I don't know whether this will cause other unrelated failures. Maybe try unplugging the network cable and then running the dbt compile command?

In any case, I think the main issue is due to greedily creating a connection during the compile (in the generate function). Jake's comment suggests that this could be approached via a lazy connection, although this was not implemented.

@drewbanin
Contributor

drewbanin commented Jun 28, 2019

Ok, that makes sense. Fundamentally, this line of reasoning is a little divergent from how we think about model compilation. Right now, the dbt compile command just outputs the select statement that dbt will use during the model materialization in the dbt run step. Longer term though, I'd love to make dbt compile output the full DDL/DML it would run at dbt run-time. This involves inspecting the state of the database and recording the existence (and types) of the relations that dbt intends to operate on. It just so happens that BigQuery didn't historically provide SQL interfaces into its state, but going forward we're planning on using things like the information_schema to fetch information about the state of the database.

We could indeed lazily load connections, but I imagine that these future plans are necessarily going to involve running SQL against the warehouse in the dbt compile step, so that actually wouldn't be a super helpful change given your particular use case.

Can you just say a little bit more about why you're running dbt compile in a hermetic environment like this? Are you just validating that the project is compilable? Or are you grabbing the compiled output from the target/ directory? Or something else?

I hesitate to recommend this, but one approach may be to connect to a locally-running postgres instance in your hermetic environment. The resulting SQL might be postgres-specific, but depending on your needs, that might be acceptable.

Edit: made this make more sense :)

@heisencoder
Contributor Author

What I'm trying to do with the dbt compile step is to just create the manifest.json file and then use that to link together an Airflow DAG in a similar manner to how this reference implementation uses the networkx gpickle file: https://gist.github.com/adamhaney/916a21b0adcabef4038c38e3ac96a36f

I'm happy to have the SQL interpolation happen at run time, so in any case would want to defer actual network connects to the run phase.
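For reference, the manifest-to-DAG linking described above can be sketched roughly like this. This is a minimal sketch, not dbt or Airflow code: the `"nodes"`/`"depends_on"` keys are assumptions about the manifest.json layout, and `edges_from_manifest` is a name I made up.

```python
# Rough sketch (not dbt code): extract (parent, child) edges from a parsed
# manifest.json so they can be wired into an Airflow DAG. The "nodes" and
# "depends_on" keys are assumptions about the manifest layout.
import json

def edges_from_manifest(manifest):
    """Yield (parent_id, child_id) pairs for every model dependency."""
    for unique_id, node in manifest.get("nodes", {}).items():
        for parent in node.get("depends_on", {}).get("nodes", []):
            yield parent, unique_id

# Tiny hand-built manifest standing in for the real target/manifest.json.
manifest = json.loads("""
{
  "nodes": {
    "model.proj.a": {"depends_on": {"nodes": []}},
    "model.proj.b": {"depends_on": {"nodes": ["model.proj.a"]}}
  }
}
""")
print(sorted(edges_from_manifest(manifest)))
# → [('model.proj.a', 'model.proj.b')]
```

Each yielded pair would then become a task dependency (`parent >> child`) in the Airflow DAG.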

@drewbanin
Contributor

Ok - that totally makes sense. It also helps me understand what #1538 was all about too. Thanks for the additional info!

Maybe the right way to proceed here is to expose dbt's understanding of the graph after the parsing step? This seems like something we could support in a first-class way via the (eventual) Python API. dbt compiles projects in two steps:

  1. parsing out the shape of the graph by evaluating ref and config function calls
  2. "compiling" SQL with a fully-built out context and connections to the database

After parsing, dbt will have a complete understanding of the shape of the graph, but will not yet have "correct" SQL for each of the models. Sounds like that wouldn't be a problem given your use case.

@heisencoder
Contributor Author

In the short-term, I either need to revert back to version 0.13.1 or find a way to patch 0.14 to allow for lazily creating database connections. Longer term, I'd be happy if there was either a Python API or a command-line 'parse' option to create the manifest.json file as a hermetic step (I have a slight preference for a Python API, since I'm already creating one myself by wrapping the dbt binary with a Python API). Ultimately, I want to create a rich unit-testing environment that understands the dependencies of each SQL file, but doesn't require creating any real database connections. This testing environment needs to understand the DAG, which is why I need to build it hermetically.

Is the lazy evaluation option still on the table? I took out the database connection in the generate() function, but it failed on me later on.

Thanks!

@drewbanin
Contributor

I believe @beckjake implemented a new pattern for adapters in which each thread gets their own dedicated connection. I think the diff you're seeing here was a part of that change, though I'm not 100% certain.

Jake, this is a little outside of my wheelhouse: how involved do you think it is to make these adapters lazy again? Is it something that could conceivably happen for 0.14.1? Or are we better off supporting this use case via some other mechanism?

@beckjake
Contributor

beckjake commented Jul 1, 2019

I don't love that we eagerly acquire the connection, and I am into trying to make it lazy load for 0.14.1

My thoughts were that "acquiring" a connection in the with block would mark it as ready to be acquired, and then on exit we'd only close it if it was acquired at some point (so the handle was non-None/state left init, something like that)
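A minimal sketch of that idea (my own illustrative names, not dbt's actual adapter code): entering the context only registers the connection, the handle is opened on first use, and on exit it is closed only if it was ever actually opened.

```python
# Hypothetical sketch of lazy connection acquisition. LazyConnection,
# connection_named, and open_fn are illustrative names; a real adapter
# would dial the warehouse inside open_fn.
from contextlib import contextmanager

class LazyConnection:
    def __init__(self, name, open_fn):
        self.name = name
        self._open_fn = open_fn
        self.handle = None  # stays None until first use

    def acquire(self):
        if self.handle is None:
            self.handle = self._open_fn(self.name)
        return self.handle

@contextmanager
def connection_named(name, open_fn):
    conn = LazyConnection(name, open_fn)
    try:
        yield conn
    finally:
        if conn.handle is not None:  # only close if ever acquired
            conn.handle.close()

class FakeHandle:
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

opened = []
with connection_named("my_model", lambda name: opened.append(name) or FakeHandle()):
    pass  # the parsing path never touches the database
print(opened)  # → []
```

On the parse-only path no connection is ever opened, which is exactly the behavior the hermetic use case needs.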

@heisencoder
Contributor Author

Thanks Jake! Do you have a sense of when this might be ready? I'm trying to decide whether to revert back to 0.13.1 or look into moving forward with patches against 0.14.

@drewbanin
Contributor

I think you might be best off reverting to 0.13.1 - we're wrapping up the release candidate for 0.14.0 presently, and I don't think this is a change we're going to be able to incorporate this late into the release cycle. Super happy to turn this around in short order after 0.14.0 drop though! Apologies for the inconvenience, but definitely looking to implement this change!

@heisencoder
Contributor Author

Sounds good -- thanks!

@heisencoder
Contributor Author

@beckjake Is there a tracking feature that I can follow to see progress on creating lazy connections?

@beckjake
Contributor

beckjake commented Jul 1, 2019

@heisencoder I'll use this issue for that

@heisencoder
Contributor Author

Another thought on this same topic: would it make sense to address this issue by explicitly using a 'fake' version of the database? In the context of BigQuery, this could be accomplished by creating a new method called something like 'fake'. The 'fake' method would then have a fake implementation of the BigQuery interface (possibly using a MagicMock initially) that would not create any real connections, but would instead do something like log all the commands that would have been sent. This could also be viewed as a 'dry-run' mode that lets developers see what commands would be sent before making actual database changes. It could be expressed as a --dry-run flag or as a new profile target (like 'fake').

Either of these approaches would make compile work inside my hermetic build and test environment.

Thoughts?

@drewbanin
Contributor

drewbanin commented Aug 29, 2019

Hey @heisencoder - dbt requires a connection to the database in order to correctly compile SQL. If you have code like:

{{ config(materialized='incremental') }}

select * from my_big_table
{% if is_incremental() %}
  where event_time > (select max(event_time) from {{ this }})
{% endif %}

then the is_incremental() macro will require a query to the database in order to correctly compile this sql. When I think of --dry-run, I imagine that dbt would still make introspective queries to the database, but it would not run any sort of mutative/creative queries.

If you're so inclined, you can create a new "null" adapter for dbt which just implements the adapter interface and discards queries. I actually think that might be a kind of cool thing for dbt to ship with :)

Note: if you call dbt compile on your query with an adapter like that, the compiled SQL is going to be pretty inaccurate. If you actually want to compile SQL with dbt, you'll need to have an open connection to the database. If you care more about building the graph with dbt, this could be a good option.
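A minimal version of that "null" adapter idea might look like the following. This is a sketch only: `NullAdapter` and its `execute()` signature are illustrative names, not dbt's actual adapter protocol.

```python
class NullAdapter:
    """Hypothetical adapter that satisfies a tiny execute() interface but
    records SQL instead of sending it to any warehouse."""

    def __init__(self):
        self.log = []  # every statement that *would* have been run

    def execute(self, sql, fetch=False):
        self.log.append(sql)
        # pretend every statement succeeds and returns no rows
        return "OK", ([] if fetch else None)

adapter = NullAdapter()
status, rows = adapter.execute("select 1", fetch=True)
print(status, rows, adapter.log)  # → OK [] ['select 1']
```

Because every query "succeeds" with no rows, introspective calls resolve to empty results, which keeps compilation running without a database at the cost of accuracy in the compiled SQL.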

@heisencoder
Contributor Author

Hi @drewbanin

The main goal of compiling the SQL is to test the SQL in an isolated environment that doesn't have a database connection. This would just be for simple syntax and schema testing, things that could be verified in a unit-test context. It looks like the is_incremental macro defaults to returning false if there is no database connection, which is probably fine for our purposes.

I'm assuming that if we're not using certain dbt power features, the compiled SQL will exactly match the executed SQL. Are there macros other than is_incremental that we need to avoid?

With regard to a "null" adapter, I'll probably need to create one to allow internally upgrading to 0.14. I'll look into putting one together, but I may also need certain BigQuery features, so I'm also considering creating one that extends the existing BigQuery adapter and just replaces the connection with a fake object of some kind.

Thanks!
-Matt

@drewbanin
Contributor

The is_incremental() macro just proxies a call to adapter.get_relation() I believe. You may want to watch out for any other similar calls to adapter methods, but otherwise you should be in good shape.

@heisencoder
Contributor Author

Hi @drewbanin

With regard to implementing a 'null' adapter, I think for my purposes I'll need to extend the existing bigquery adapter, because I need the result of the dbt compile command to closely match that of the existing bigquery adapter. Is this something that you'd be interested in taking upstream eventually?

@drewbanin
Contributor

I think there could be merit to supporting a "null" adapter in dbt core, but probably not one which implements BigQuery-specific semantics. In general, we're aiming to have the output of dbt compile match exactly what dbt would actually execute in a dbt run, so I think some amount of querying/introspection is going to be required. Issue #588 does a good job of outlining the place we want to get to with dbt's approach to compilation if you're interested in reading more about this.

I don't have a firm use-case for a generic null adapter in mind (it certainly would not output correct SQL for any given database) -- maybe just as a debugging / testing sort of utility? So, feel free to do whatever fits your needs here!

@heisencoder
Contributor Author

Thank you Drew for your thoughts! I'll go ahead and create an internal fake BigQuery adapter then.

@drewbanin drewbanin changed the title 'dbt compile' requires gcloud with bigquery profiles Lazy load connections Nov 8, 2019
@drewbanin drewbanin added this to the 0.15.1 milestone Nov 8, 2019
@drewbanin drewbanin added the enhancement New feature or request label Nov 8, 2019
@drewbanin
Contributor

This came up again today. We're going to prioritize this for an 0.15.x release - currently looking like 0.15.1.

The impetus for this is that acquiring connections greedily can be really slow. This especially impacts real-time queries, like ones that can be submitted to a dbt server.

drewbanin added a commit that referenced this issue Dec 9, 2019
drewbanin added a commit that referenced this issue Dec 9, 2019
…lazy-load-connections

Revert "lazy-load connections (#1584)"