auto full-refresh when schema changes for incremental models #320

Closed
drewbanin opened this issue Mar 4, 2017 · 18 comments

@drewbanin
Contributor

No description provided.

@adamhaney

I'm interested in working on this. Conceptually, I assumed that after compilation the code would need to issue a query to the target database to get the list of existing columns, compare that to the sqlparse-compiled AST of the model, and then conditionally do something like

# pseudocode: compare the columns that exist in the warehouse table
# with the columns produced by the compiled model
if set(db_table_columns) - set(model_columns):
    delete_existing_table()
run()

It looks like there's a before_run in dbt.node_runners.ModelRunner; would that be the correct place to "hook into" for this problem?

If this was run directly in the ModelRunner, would that take care of both the normal dbt run and the single-model dbt run --models=$foo case? Any other edge cases I should be aware of?

@drewbanin
Contributor Author

@adamhaney cool! I think parsing the model SQL probably isn't the way to go here -- I can imagine things like select * being problematic.

We're in the process of making dbt less aware of the different materialization options. As such, the code for the incremental materialization is entirely defined here. So ideally, any incremental-specific code will also be defined inside of that incremental materialization definition.

If needed, we can totally make changes to dbt to provide useful context variables in the incremental materialization to support functionality like this. I've been thinking about MD5ing the model contents and comparing against the previous value stored in the graph.gpickle file. If the MD5 changed, then a full-refresh is in order. All of that logic can be encoded in jinja-land as long as the requisite context variables are available during compilation.
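
A minimal sketch of that comparison (illustrative only; the function name and where the previous hash lives are assumptions, not existing dbt code):

import hashlib

def needs_full_refresh(compiled_sql, previous_md5):
    # Hash the compiled model SQL and compare it against the value recorded
    # for the previous run (e.g. in graph.gpickle).
    current_md5 = hashlib.md5(compiled_sql.encode("utf-8")).hexdigest()
    return previous_md5 is None or current_md5 != previous_md5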

I'm not necessarily certain that's the best option here, but it's where my head is at. Do you buy all of that?

@DavidAmesPup

I like the idea of being able to automatically do a full refresh when a model changes. We may also want to do a full refresh of all downstream models too. E.g., a parent selects columns a, b, and c, and is then changed to select a, b, c, and d, while a child does a select * from the parent.

Having said that, I'd probably avoid doing select * in my models, so that might not be required.

The proposed MD5ing of the model content, with the fact that it has changed exposed via Jinja, sounds like a great approach.

@drewbanin
Contributor Author

thanks for your input @DavidAmesPup! I don't think this issue is on our near-term roadmap, but I think it would be super valuable nonetheless. Happy to point you (or anyone reading this comment) in the right direction if you're interested in contributing a PR!

@DavidAmesPup

Contributing a PR sounds like a great way for us to get some experience with the dbt codebase during our POC phase.

Sure, please point me in the right direction. We are not a Python shop but I'm sure we will manage :)

@DavidAmesPup

Hey @drewbanin - we are going to pick this item up in our next sprint or two - do you have any pointers or design requests?

@drewbanin
Contributor Author

hey @DavidAmesPup, awesome, glad to hear it!

To be honest, I've given this some thought and I'm not sure of the perfect answer. I'd love to hear a little more about what you're thinking too, but here are my updated thoughts:

I think the md5 option presented above is approximately correct, but I think there's a bit of an issue. The graph.gpickle file probably isn't the best way to preserve information across dbt runs. If you run dbt from Sinter, or in Docker, or in any env where the filesystem isn't persisted between runs, there will be no information about the state of the previous run to go off of.

Instead, we've been talking a lot about adding a persistence layer to dbt in the form of a database. That could either look like a SQLite db, or the data warehouse itself. This database would contain information about when models last ran, who ran them, a hash of the model SQL, etc. I think we'd be able to use this database to understand whether or not a full refresh is in order for incremental models.
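
As a very rough illustration, this is the kind of information such a store might hold (purely a sketch; no such table exists in dbt):

import sqlite3

conn = sqlite3.connect("dbt_run_state.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS model_runs (
        model_name   TEXT,  -- which model ran
        sql_md5      TEXT,  -- hash of the compiled model SQL
        last_run_at  TEXT,  -- when it last ran
        run_by       TEXT,  -- who ran it
        status       TEXT   -- success / error / skipped
    )
""")
conn.commit()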

Obviously adding a persistence layer is quite an undertaking! We're definitely going to work on this in the near-to-medium term, but it is probably more involved than a simple graph.gpickle-based implementation of this feature. You should definitely give this a shot, but I wanted to give you fair warning: I'm unsure if I'll be able to merge it. Regardless, I'd love to take a look at whatever you come up with!

Happy to discuss here, and keen to hear what you think

@DavidAmesPup

Hey @drewbanin, agreed that graph.gpickle isn't the best place. We are currently looking at using Docker for our execution environment, which wouldn't work at all. We will also execute the same model set against multiple tenants, so the persistence of file checksums needs to be on a per-target-DW basis.

Rather than trying to hack around inside of dbt to get this to work, we are thinking of writing a simple external util that can look at a folder of models and tell which ones have changed. It will persist a hash of the previous contents to its own persistence store.

In our orchestration logic, we can run this util; if it detects any model changes, we will execute a dbt run --full-refresh with the list of models that changed, and then do a normal dbt run to get the latest data for any incremental models not rebuilt.
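
To make that concrete, here's a rough sketch of such a util (illustrative only: it assumes a directory of .sql model files and a JSON file as the per-target persistence store, none of which is dbt functionality):

import hashlib
import json
from pathlib import Path

def detect_changed_models(models_dir, state_file):
    # Returns (changed_model_names, new_state). The caller should persist
    # new_state only after the corresponding dbt run succeeds, so that a
    # failed full refresh is retried on the next pass.
    state_path = Path(state_file)
    previous = json.loads(state_path.read_text()) if state_path.exists() else {}
    new_state, changed = {}, []
    for sql_file in sorted(Path(models_dir).rglob("*.sql")):
        digest = hashlib.md5(sql_file.read_bytes()).hexdigest()
        new_state[sql_file.stem] = digest
        if previous.get(sql_file.stem) != digest:
            changed.append(sql_file.stem)
    return changed, new_state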

The biggest challenge that I can see is knowing which models were built successfully and which failed. Obviously, if we did a full refresh on a model and the full refresh failed, then we would want to run the full refresh on that model the next time around.

I'd rather implement this without parsing the console output from dbt. I can see a couple of options:

  • The ability for us to pass in some sort of correlation_id (invocation_id?) into dbt run, and have that exposed via a post-hook. We could then write the correlation_id into the audit logs and our process can use that to identify the last models successfully run.
  • Invoke dbt as a Python module, and have the ability to query the API to get a list of failed models.

We would like some guidance as to which you think would be the most useful option for dbt; we can then raise an issue, submit a PR, etc. We are also happy to hear other options that you may have.

Cheers,
dave

@drewbanin
Contributor Author

drewbanin commented May 3, 2018

I think that sounds like a really great solution @DavidAmesPup! You're in luck on both counts:

  1. dbt provides an invocation_id in the model context. It's a UUID that's generated for each invocation of dbt. Check out the docs here
  2. we're working hard on dbt's API, but it's not in a place where it's ready for public consumption yet. In the meantime, you can hack it using this "unofficial" API. It's pretty heavy-handed, but I think it might work for your purpose:
>>> import dbt
>>> import dbt.main
>>> res, success = dbt.main.handle_and_check(['run'])
>>> res[0]
<dbt.node_runners.RunModelResult object at 0x11237b358>
>>> res[0].node
{'name': 'some_model', ...}
>>> res[0].node['config']['materialized']
'incremental'
>>> res[0].errored or res[0].failed or res[0].skipped
False
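
For example, the models that didn't build cleanly could be pulled out of that result list like so (a sketch built only on the attributes shown above):

import dbt.main

results, success = dbt.main.handle_and_check(['run'])
failed_models = [
    r.node['name']
    for r in results
    if r.errored or r.failed or r.skipped
]
print(failed_models)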

We don't support this API, and it may change in a future release, but this is the best way to get pass/fail info about a run at the moment. Hope this helps!

@sphinks

sphinks commented Aug 2, 2019

I'd like to refresh this conversation: I'm not sure that an auto full-refresh is a good idea. It should be directed by some flag, so the user gives explicit agreement to rebuild the whole incremental model (which can be quite a long process). To let the user know that a refresh is needed, I suggest introducing a warning indicating that the schema has changed compared to the existing schema in the DB. What do you say to that kind of first step?

@drewbanin
Contributor Author

@sphinks I think that's a great idea! This approach would work well with --warn-error.

@sphinks

sphinks commented Aug 2, 2019

@drewbanin could you guide me a little bit on where I should start looking in order to integrate this kind of warning? I'm not that good with Python, but I want to help.

@drewbanin
Contributor Author

I think this might be pretty challenging to implement today! We have some code already in the incremental materialization which expands column types in the destination table if required. This could happen if dbt wants to insert a varchar(32) column into a varchar(16) column. The linked method will resize the destination column to account for the new data.

I think if I were going to implement this, I'd make a new method, migrate_destination_table, which wraps the call to expand_target_column_types. This method could:

  1. collect the schemas for both the source and target relations
  2. warn if the two schemas are incompatible (the column names or types are different); a sketch of this check follows the list
  3. call expand_target_column_types to preserve the existing behavior
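
A rough sketch of that check in Python-style pseudocode (the real code would live in the incremental materialization's Jinja, and these names are made up for illustration):

def warn_on_schema_change(source_columns, target_columns):
    # source_columns / target_columns: {column_name: data_type} mappings
    # collected from the source query and the existing destination table.
    added = set(source_columns) - set(target_columns)
    removed = set(target_columns) - set(source_columns)
    type_changes = {
        name: (target_columns[name], source_columns[name])
        for name in set(source_columns) & set(target_columns)
        if source_columns[name] != target_columns[name]
    }
    if added or removed or type_changes:
        print("WARNING: schema change detected for incremental model: "
              "added={}, removed={}, type changes={}".format(
                  sorted(added), sorted(removed), type_changes))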

If we want to get more clever with our approach, like adding or deleting columns, then we could certainly do that in this method too. This won't extend well to the approach of full-refreshing automatically, which I think is ok. The more I think about this issue, the less inclined I am to make dbt automatically full-refresh when the relevant --full-refresh flag is not provided.

Somewhat relevant: dbt's Snapshots already add columns to the destination table when columns in the source query are changed. Long-term, I'd like to combine the logic of expand_target_column_types and create_columns into a method like migrate_destination_table, then use that across incremental models and snapshots.

I don't want to discourage you from trying this out, but I do want to say: there are a lot of gnarly edge cases in this part of the codebase. The column types dbt knows about vary by database (BigQuery has STRINGs and int64s, Snowflake has varchar() and text types, for instance). If you do end up giving this a shot, please don't hesitate to let me know how I can help! Feel free to also use the #development channel in dbt Slack for more general dbt development questions if you have them :)

@drewbanin
Contributor Author

closing in favor of #1132

The big idea:

on_schema_change: full-refresh | migrate | ignore | fail

@rrbarbosa

Since we have the "state" method for node selectors, it should be possible to find all modified incremental models (and their children) automatically. Or am I missing something?

Here is my idea. Run the command:

dbt run --state target_prod/ --selector modified_incremental --full-refresh

With the selector defined as:

selectors:
  - name: modified_incremental
    definition:
      intersection:
        # any modified model and downstream models
        - method: state
          value: modified
          children: true
        # any incremental model and downstream models
        - method: config.materialized
          value: incremental
          children: true

If there's no updated incremental model, the selector returns nothing; therefore, you can keep this in your CI pipeline.

@jessiehuang2022

Hi, this is what I want. In the dbt command above, what is target_prod/? Thanks

@rrbarbosa

rrbarbosa commented Aug 12, 2022

@jessiehuang2022 I haven't tried it myself, but this seems to be fully supported as of v0.21, with no need to hack around with these selectors. Edit: this does not work, see @justmytwospence's comment below.

Regarding target_prod/, you should check: https://docs.getdbt.com/guides/legacy/understanding-state

@justmytwospence

@rrbarbosa, that's not the case (just FYI): "Note: None of the on_schema_change behaviors backfill values in old records for newly added columns. If you need to populate those values, we recommend running manual updates, or triggering a --full-refresh."
