
Make it possible for DBT to generate models from a piece of JINJA code #3469

Closed
bashyroger opened this issue Jun 17, 2021 · 11 comments
Labels
discussion enhancement New feature or request

Comments

@bashyroger

bashyroger commented Jun 17, 2021

Describe the feature

While I like DBT a lot, it is currently quite limited in supporting true metadata-based 'data warehouse automation'. Sure, a lot of things are automated for you in DBT:

  • Materializations automate the patterns for creating data emitting artifacts
  • Tests as yml metadata automate various ways of testing data
  • The ref statement automates loading dependencies
  • etc

However, what is lacking is the possibility for DBT to:

  • generate multiple models from a pseudo-SQL template (a pseudo DBT model) plus a piece of JINJA code that loops over that template to spawn not one but many models
  • it should be possible to do this in both the DBT CLI and DBT Cloud

Describe alternatives you've considered

At my current client we store our raw data in a snowflake variant column, with the schema metadata of that column in a separate variant column.

We do this to keep loading data simple and to ensure the raw data zone is never affected by source schema changes.
To actually use this data however, we have written a piece of python code that parses the stored schema metadata to generate a series of DBT models (materialized as views) on top of the raw data tables.
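For illustration, a single generated view might look roughly like this (a sketch only; the raw_payload column, the column list, and the raw.orders source are hypothetical stand-ins for our actual setup):

-- models/raw_views/orders_v.sql (hypothetical output of the generation script)
-- each column expression below is derived from the schema metadata stored next to the payload

select
    raw_payload:order_id::number       as order_id,
    raw_payload:customer_id::number    as customer_id,
    raw_payload:order_date::date       as order_date,
    raw_payload:amount::numeric(18,2)  as amount

from {{ source('raw', 'orders') }}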

While this all works fine, this code is obviously disconnected from DBT which currently hinders development as it requires code / environment / context switching.

Additional context

Who will this benefit?

Everyone who wants to do more extensive data warehouse automation with DBT.

Are you interested in contributing this feature?

I am not a solid programmer, but I would definitely like to contribute to the requirements and use cases, and to beta test this kind of functionality.

@bashyroger bashyroger added enhancement New feature or request triage labels Jun 17, 2021
@jtcohen6 jtcohen6 added discussion and removed triage labels Jul 3, 2021
@jtcohen6
Contributor

jtcohen6 commented Jul 3, 2021

@bashyroger Thanks for opening! Apologies for the delay in my response. There's a lot here to chew on.

Today, one of the foundational assumptions underlying dbt is that 1 model file = 1 model node = 1 object (view or table) in the database. What's the problem here? Whenever we want a new dbt-managed database object, it requires a PR to add a new model file. That's hardly an automated approach to data warehousing.

I've seen three quite different approaches that try to solve this problem:

  1. Creating multiple database objects from a single model
  2. Defining multiple models (dbt manifest objects) within a single file
  3. Automated code generation to create new model files, as a process that could live inside or outside of dbt

At the risk of writing much too much here, I want to discuss each approach in some detail.

1. One model, many objects

See: #1637, #2551

In the most common use case, someone wants a large model result—same logic, same columns—to be split in the database across multiple tables, with perhaps a few parametrized differences across them. There may be a good reason (such as PII) that each subset needs to be isolated from other records, with the same column schema, transformed in the exact same manner. We see this sometimes with marketing or consulting agencies who offer modeled data as a service to their clients.

There are parts about this that are tricky for dbt today:

  • How to ref this multi-relational model? The model could create a thin view that unions the underlying tables together, or the ref may need to be suffixed with a particular shard identifier.
  • How to combine metadata on multiple objects, for the purposes of the adapter cache and catalog generation (documentation)?

In the meantime, while this functionality isn't natively supported, I've seen valid approaches here go in two different directions:

  • Tighter combination: Find a comfortable-enough way to keep all the data in the same tables until the very last step, at which point it's split up into one secure/authorized view per client. A database feature offering partition-level permissioning would be huge in addressing privacy concerns.
  • More separation: Turn the common modeling into a package, imported and parametrized within a separate project for each client.
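As a rough sketch of the "more separation" option (the package name client_models, its macro, and the client_id var are all made up for illustration):

-- models/net_revenue.sql, identical in every per-client project; the shared package
-- supplies the logic and each project's dbt_project.yml sets var('client_id')

{{ client_models.net_revenue_sql(client_id = var('client_id')) }}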

I wanted to bring up this approach because these have often been part of the same conversation. It's not exactly the issue you describe here, and I don't think it would apply nearly as well to unstructured data that doesn't share a common column schema.

2. One file, many models

This is an old, old issue: #184. There was a long time when we were convinced that model blocks were the ultimate way to go. They've fallen out of fashion recently, but this vein of thinking was especially popular around the time that we created snapshots (dbt v0.14.0, summer 2019).

-- models/many_models.sql

{% model stg_stripe_payments %}
    
    select * from {{ source('stripe', 'payments') }}
    where is_deleted = false

{% endmodel %}

{% model net_revenue %}
    
    {{ config(materialized = 'table') }}
    
    select
        orders.order_id,
        payments.amount - orders.cogs as net_revenue
    
    from {{ ref('stg_stripe_payments') }} as payments
    join {{ ref('stg_orders') }} as orders
        on payments.order_id = orders.order_id

{% endmodel %}

There are a few things that model blocks have going for them:

  • Flexibility in the hands of users, to split across multiple files as they see fit, similar to the flexibility offered by dbt .yml files today
  • Encouragement to use even-more-modular models, especially ephemeral models, by diminishing the development-time "cost" of switching between multiple files
  • Support for "local" macros (Inline Macro Support #3379) that are defined at the top of the file and scoped only to the model blocks defined therein

This may still be a direction we head in, ultimately. We won't get there before dbt v1.0, but who's to say it couldn't be an essential component of dbt v2.0?

I bring up model blocks here because the prospect of decoupling 1 model-per-file also suggests the possibility of taking this further, by treating the larger file as a single Jinja template that could produce many model blocks dynamically:

-- models/many_models.sql

{% for source in graph.sources.values() | selectattr('source_name', 'equalto', 'stripe') | list %}
    
    {% model stg_stripe_{{ source.name }} %} {# how would this work? #}
    
        select * from {{ source(source.source_name, source.name) }}
        where is_deleted = false
        
    {% endmodel %}

{% endfor %}

If you try to do this with snapshots today, you'll get an error message:

Got a block definition inside control flow at 3:4. All dbt block definitions must be at the top level

For good reason: This gets really complex to parse, and it makes us even more reliant on Jinja at a time when we're investing in static analysis to speed up parse-time manifest construction.

There's an alternative syntax toyed with in #1374 (comment). That issue was about dynamic columns, to avoid repeating the same yaml properties over and over. (That's ground we've since retrodden in #2995, among other places, and we're definitely interested in thinking about it more.) There, Drew mentioned the theoretical idea of a manifest namespace, exposed to the Jinja context, with methods like set_column_schema. Well, why not some manifest-altering method like add_model?

-- models/many_models.sql

{% for source in graph.sources.values() | selectattr('source_name', 'equalto', 'stripe') | list %}
    
    {% set model_sql %}
        select * from {{ source(source.source_name, source.name) }}
    {% endset %}
    
    {% set model_name = 'stg_' + source.source_name + '_' + source.name %}
    
    {% do manifest.add_model(name=model_name, sql=model_sql) %}

{% endfor %}

While it's easier to imagine writing code like this, it would still be fiendishly complex to handle at parse time.

This approach quickly runs into a big limitation in how dbt works today, given these two premises:

  • At parse time, dbt doesn't need to know the exact SQL for every model, but it does need a full accounting of all models that exist, their dependencies, and their configurations.
  • At parse time, dbt does not run any queries against the database.

So the SQL for each individual model could be dynamically generated, just as it can today, based on the latest content or metadata in the database. But it would not be possible to create more or fewer models on the basis of information stored in the database—exactly the sort of use case that your issue is getting at. Fuller capabilities would require big foundational changes to the way that dbt works.

3. Code generation

After all, maybe dbt cannot be—ought not be—a data warehouse automater itself, but merely the best-possible substrate for "true" DWH automation?

I feel this gets at the heart of the matter. In my early years using dbt, I thought the extensibility and flexibility of its tooling were its strongest features; with Jinja, all things are possible, certainly compared to SQL. Nowadays, I'm inclined to think that dbt's greatest strength is the rigidity of its opinionated framework, its demand for decisiveness and verbosity in the places that matter. The guardrails are there for a reason.

So, let's say we keep living in a world where one database object has to be one model has to be one file (or at least one Jinja block). An automated process could still be of tremendous help by generating those files, opening PRs, running CI checks, requiring a human reviewer (or not), merging, and reverting if necessary. In that world, data warehouse automation is a time-lapse photograph: a double-speed version of the current git-based dbt development workflow, with human intervention at only the appropriate moments. The key difference is that data warehouse automation is decoupled from data warehouse execution—their handoff point is dbt model code, explicitly defined.

What should that automated process look like? Should the premier executor of dbt code also try to be its own code generator? Or should there be a separate tool? There are fair arguments on either side:

  • dbt could do it all. dbt can run arbitrary SQL and template code; why not template dbt code? Because the processes are decoupled, we avoid entirely the parse-time vs. execute-time dilemma discussed above. The principal example here is the codegen package. There are some gaps in this workflow today, requiring more human facilitation or glue code than we'd ultimately want. In this category fall the desires to write codegen's output to files from within the Jinja context directly (Ability to write to files with Jinja #3428), or to at least have better tooling for piping it to files via shell commands ([CT-172] --quiet flag for run-operations #3451)

  • A workflow unto itself. An end-to-end code generator, written in a true scripting language (not Jinja), that can fully control a file system, and do anything it needs. Ideally, something like this is hooked up to an automation workflow that also plugs into the git provider. My sense is that this is the piece of python code your colleagues are using today, though as a more-manual process.
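For concreteness, the codegen workflow referenced in the first option above looks roughly like the command below (source and file names are just examples). The caveat is that the redirected output still includes dbt's log lines, which is exactly the gap that #3451 is about:

dbt run-operation generate_base_model --args '{"source_name": "stripe", "table_name": "payments"}' > models/staging/stg_stripe_payments.sql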

Why not both? dbt-helper, written in python, already imports dbt as a python module, using its database connection and catalog generation capabilities as scaffolding. We want dbt-core to be a more-stable, better-documented python library. I'd like to see open source dbt code-generation workflows for GitHub Actions and GitLab CI/CD. Perhaps this could even be a feature in dbt Cloud.

tl;dr

Eventually, I could see us doing pieces of all three of the things mentioned above:

  1. Better support for "template" or "sharded" models that map to multiple database objects
  2. Support for multiple model blocks in the same file (though this would be more of a cosmetic change than a functional one)
  3. Invest in dbt's own capabilities around code generation, and/or lend our support to promising code generation workflows developed by community members. I know that such workflows already exist, and I'd be excited to see more of them open sourced and mettle-tested.

Ultimately, I find the possibilities offered by the third approach most compelling. Having written 1500 words on dbt and model automation, I'm coming around to the idea that a native dbt task (#1082), or even just a way to plug together codegen + file system + GitHub Actions, could get us a lot of the way there.

@fabrice-etanchaud

A masterful play in three acts! Thank you Jeremy.

@erika-e

erika-e commented Nov 9, 2021

I'd like to see open source dbt code-generation workflows for GitHub Actions and GitLab CI/CD. Perhaps this could even be a feature in dbt Cloud.

An automated process could still be of tremendous help by generating those files, opening PRs, running CI checks, requiring a human reviewer (or not), merging, and reverting if necessary. In that world, data warehouse automation is a time-lapse photograph: a double-speed version of the current git-based dbt development workflow, with human intervention at only the appropriate moments. The key difference is that data warehouse automation is decoupled from data warehouse execution—their handoff point is dbt model code, explicitly defined.

We built workflows like this at Aula and we'll be talking about it at Coalesce 2021! Jeremy's description, above, is an elegant summary of what our automation does. Source schema files, snapshots, and the first staging layer are all auto-generated and auto-maintaining.

The automation relies on the following components:

  • A yaml file with configuration information for the resources that the automation generates
  • SQL or yaml templates for resources
  • A python script that takes arguments for the resources to generate, applies the configurations to the templates, and outputs the appropriate files to the configured locations
  • A GitHub action that opens a PR when changes are required
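For concreteness, one of those SQL templates might look something like the sketch below (not our actual implementation; the $placeholders are filled in by the python script, while the {{ source() }} call is left for dbt to resolve at run time):

-- templates/stg_model.sql.tpl (hypothetical)

select
    $column_list
from {{ source('$source_name', '$table_name') }}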

I didn't know about dbt-helper until just now, so I'm interested to take a look and see how importing dbt-core directly might help standardize the methods and make the code more usable for others. Assuming my coworker from Aula is ok with it, I'd also love to contribute it somewhere in an open-source way. I see somewhat frequent posts over in the dbt slack from people looking for similar functionality.

@rsazima

rsazima commented Jan 31, 2022

Great discussion.

If I get it right, approaches 1 and 2 above do not currently work, correct?

Even if we end up with just approach 3, a "template" or reference implementation would be handy.

@trentkgUjet

My team would also like this feature!

@Gwildor

Gwildor commented Apr 1, 2022

I'm curious whether Jinja2's ability to extend another template and replace predefined blocks of code (template inheritance) was considered for this feature? Then a user could define something like base_model.sql and then model_1.sql that extends from the base and replaces the blocks as needed.
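For reference, plain Jinja2 template inheritance looks roughly like this (a sketch only; dbt does not support extends/block across model files today):

-- models/base_model.sql (sketch)

select
    id,
    {% block extra_columns %}loaded_at{% endblock %}
from {{ source('raw', 'events') }}

-- models/model_1.sql (sketch)

{% extends "base_model.sql" %}
{% block extra_columns %}loaded_at, user_id, event_type{% endblock %}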

@trentkgUjet

We would still really love this feature!

@Brandon-Peebles-Zocdoc

would love this feature!

@fabrice-etanchaud

I'm curious whether Jinja2's ability to extend another template and replace predefined blocks of code (template inheritance) was considered for this feature? Then a user could define something like base_model.sql and then model_1.sql that extends from the base and replaces the blocks as needed.

Hi @Gwildor, there is an old feature request about jinja2 template inheritance: #1337

@mdlnr

mdlnr commented Apr 6, 2022

My use case would be for a multi-tenant platform where we create an access layer (view) for every tenant. I would like to configure a list of tenant identifiers and generate a view for each tenant using one generic jinja template.
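A sketch of what that generic template might look like, borrowing the hypothetical model-block syntax from Jeremy's comment above (not valid dbt today; the tenants var and model names are made up):

-- models/tenant_access_views.sql (not valid dbt today; a sketch only)

{% for tenant in var('tenants') %}

    {% model access_{{ tenant }} %}

        select * from {{ ref('fct_events') }}
        where tenant_id = '{{ tenant }}'

    {% endmodel %}

{% endfor %}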

@bashyroger
Author

Thanks for your extensive comment @jtcohen6, I have been giving your writing a bit more thought.
Regarding your comment in act 3:

After all, maybe dbt cannot be—ought not be—a data warehouse automater itself, but merely the best-possible substrate for "true" DWH automation?

I still think it can and should be, at least partially: by introducing 'template models', models containing a (Jinja) loop that can spawn the creation of multiple model files.
For us it feels disjointed that, just to get this done, we have to write our own custom python code, when ALL that is missing is this one meta model to n child models step. As in: we have now implemented a solution akin to what @erika-e mentions, but it feels over-engineered to us...

From what I can infer from your writing, this initial request would largely be fulfilled if #3428 were implemented.

Regarding your comment on the parse-time vs. execution-time problem: those template models would not have to run at execution time initially. As a developer, I would purely see them as a way to automate the creation of multiple model files from a template, something that the current codegen package indeed cannot do.

Practically, I would expect a command to exist like:
dbt generate -s my_meta_model.sql --args {args}
It would create / update 1-n children: REAL dbt models that, after creation, would obviously have to be compiled.

Then later, further along the road, I would see them be added to the execution context:

  • The dbt generate command would be allowed to be used in an execution context, requiring a 2nd compile pass when invoked.
  • The child models, if all valid, would be 'lazily' auto-committed to the same branch the code is running on.

@dbt-labs dbt-labs locked and limited conversation to collaborators Apr 19, 2022
@jtcohen6 jtcohen6 converted this issue into discussion #5101 Apr 19, 2022
