"Microbatch" incremental strategy #6194
|
@@ -42,6 +42,10 @@ exports.versions = [ | |||
* @property {string} lastVersion The last version the page is visible in the sidebar | |||
*/ | |||
exports.versionedPages = [ | |||
{ | |||
page: "docs/build/incremental-microbatch", | |||
lastVersion: "1.9", |
i think it's meant to show for 1.9 and higher, right? it wasn't showing when i was on versionless, but it was showing for 1.8 and lower
lastVersion: "1.9", | |
firstVersion: "1.9", |
|
||
Incremental models in dbt are a [materialization](/docs/build/materializations) designed to efficiently update your data warehouse tables by only transforming and loading _new or changed data_ since the last run. Instead of reprocessing an entire dataset every time, incremental models append, update, or replace rows in the existing table with the new data just processed. This can significantly reduce the time and resources required for your data transformations. | ||
|
||
Microbatch incremental models make it possible to process transformations on large time-series datasets with efficiency and resiliency. When dbt runs a microbatch model — whether for the first time, during incremental runs, or during manual backfills — it will split the processing into multiple queries (or "batches"), based on the `event_time` column you configure. |
Microbatch incremental models make it possible to process transformations on large time-series datasets with efficiency and resiliency. When dbt runs a microbatch model — whether for the first time, during incremental runs, or during manual backfills — it will split the processing into multiple queries (or "batches"), based on the `event_time` column you configure. | |
Microbatch incremental models make it possible to process transformations on large time-series datasets with efficiency and resiliency. When dbt runs a microbatch model (whether for the first time, during incremental runs, or during manual backfills), it will split the processing into multiple queries (or "batches"), based on the `event_time` column you configure, instead of processing all of your data at once. Since each batch covers a time period, like a single day, updating large datasets is much faster and more efficient, especially when you're working with data that changes over time (like new records being added daily). |
|
||
During standard incremental runs, dbt will process new batches and reprocess any additional batches according to the configured `lookback` (with one query per batch). | ||
|
||
<Lightbox src="/img/docs/building-a-dbt-project/microbatch/microbatch_lookback.png" title="Configure a lookback to reprocess additional batches during standard incremental runs"/> |
will fix img
Thank you so much - these are looking really solid! Just a few suggestions
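To make the `lookback` sentence in this hunk concrete, here is a minimal Python sketch of which daily batches a standard incremental run would cover. It is an illustration, not dbt's implementation: `batches_to_process` is a hypothetical helper, and it assumes the newest batch is the one containing the run date and that `lookback` adds that many earlier daily batches.

```python
from datetime import date, timedelta


def batches_to_process(run_date: date, lookback: int = 0) -> list[date]:
    """Return the start dates of the daily batches one incremental run would cover.

    Assumption (not confirmed by the docs under review): the newest batch is
    the one containing run_date, and lookback adds that many earlier batches.
    """
    if lookback < 0:
        raise ValueError("lookback must be non-negative")
    first = run_date - timedelta(days=lookback)
    # One entry per batch; each entry would correspond to one query.
    return [first + timedelta(days=i) for i in range(lookback + 1)]


for batch_start in batches_to_process(date(2024, 10, 1), lookback=3):
    print(batch_start.isoformat())
```

With `lookback=0` the list contains only the batch for the run date, which matches the "one query per batch" wording.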
|
||
# About microbatch incremental models <Lifecycle status="beta" /> | ||
|
||
:::info Microbatch |
@jtcohen6 Thoughts on calling out that this feature is in beta - as I anticipate there will be some changes ahead of our 1.9 final release based on community feedback
|
||
Incremental models in dbt are a [materialization](/docs/build/materializations) designed to efficiently update your data warehouse tables by only transforming and loading _new or changed data_ since the last run. Instead of reprocessing an entire dataset every time, incremental models append, update, or replace rows in the existing table with the new data just processed. This can significantly reduce the time and resources required for your data transformations. | ||
|
||
Microbatch incremental models make it possible to process transformations on large time-series datasets with efficiency and resiliency. When dbt runs a microbatch model — whether for the first time, during incremental runs, or during manual backfills — it will split the processing into multiple queries (or "batches"), based on the `event_time` column you configure. |
Microbatch incremental models make it possible to process transformations on large time-series datasets with efficiency and resiliency. When dbt runs a microbatch model — whether for the first time, during incremental runs, or during manual backfills — it will split the processing into multiple queries (or "batches"), based on the `event_time` column you configure. | |
Microbatch incremental models make it possible to process transformations on large time-series datasets with efficiency and resiliency. When dbt runs a microbatch model — whether for the first time, during incremental runs, or during manual backfills — it will split the processing into multiple queries (or "batches"), based on the `event_time` columns and `batch_size` you configure. |
|
||
<Lightbox src="/img/docs/building-a-dbt-project/microbatch/event_time.png" title="The event_time column configures the real-world time of this record"/> | ||
|
||
- `batch_size` (string, optional) - The granularity of your batches. The default is `day`, and currently that is the only granularity supported. |
Thoughts on making this a table instead of a bulleted list?
|
||
- `batch_size` (string, optional) - The granularity of your batches. The default is `day`, and currently that is the only granularity supported. | ||
- `lookback` (integer, optional) - Process X batches prior to the latest bookmark, in order to capture late-arriving records. The default value is `0`. | ||
- `begin` (date, optional) - The "beginning of time" for your data. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01'` will process 366 batches. (It's a leap year!) |
I don't think this is optional
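The 366 figure in the `begin` example can be checked with plain date arithmetic. This is a small sketch, not dbt code: `count_daily_batches` is a hypothetical helper that counts whole days from `begin` up to, but not including, the run date. Whether dbt also processes a partial batch for the run date itself is not stated in this hunk, so it is left out here.

```python
from datetime import date


def count_daily_batches(begin: date, run_date: date) -> int:
    """Count whole daily batches from begin up to (not including) run_date."""
    if run_date < begin:
        raise ValueError("run_date must not be earlier than begin")
    return (run_date - begin).days


# 2023-10-01 through 2024-09-30 spans 2024-02-29, hence 366 rather than 365.
print(count_daily_batches(date(2023, 10, 1), date(2024, 10, 1)))  # 366
```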
|
||
<Lightbox src="/img/docs/building-a-dbt-project/microbatch/microbatch_lookback.png" title="Configure a lookback to reprocess additional batches during standard incremental runs"/> | ||
|
||
If there’s an upstream model that configures `event_time`, but you *don’t* want the reference to it to be filtered, you can specify `ref('upstream_model').render()` to opt out of auto-filtering. |
Thoughts on nesting this under the sentence "dbt will automatically filter upstream inputs (`source` or `ref`) that define `event_time`, based on the `lookback` and `batch_size` configs for this model."
|
||
### Available configs | ||
|
||
- `event_time` - The column indicating "at what time did the row occur" (for both your microbatch model and its direct parents) |
maybe a callout that `event_time` and `begin` need to be in UTC? (also true of the CLI args `--event-time-start/end`)
or maybe having the "timezone"s section is better!
select * from {{ ref('stg_events') }} | ||
where my_time_field >= '2024-10-01 00:00:00' | ||
and my_time_field < '2024-10-02 00:00:00' | ||
) # this ref will be auto-filtered |
) # this ref will be auto-filtered | |
) |
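The filter in this hunk uses `>=` on the lower bound and `<` on the upper bound, so each daily batch is a half-open window and a row stamped exactly at midnight belongs to one batch only. A minimal Python sketch of that boundary rule follows; `batch_bounds` and `in_batch` are hypothetical names used for illustration and are not part of dbt.

```python
from datetime import datetime, timedelta


def batch_bounds(batch_start: datetime) -> tuple[datetime, datetime]:
    """Return the half-open [start, end) window for one daily batch."""
    return batch_start, batch_start + timedelta(days=1)


def in_batch(event_time: datetime, batch_start: datetime) -> bool:
    """Mirror the SQL filter: event_time >= start and event_time < end."""
    start, end = batch_bounds(batch_start)
    return start <= event_time < end


day = datetime(2024, 10, 1)
print(in_batch(datetime(2024, 10, 1, 23, 59, 59), day))  # True
print(in_batch(datetime(2024, 10, 2, 0, 0, 0), day))     # False, next batch
```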
|
||
### Supported incremental strategies by adapter |
do we need versioning here? (since microbatch is versionless + core 1.9+)
|
||
Microbatch incremental models make it possible to process transformations on large time-series datasets with efficiency and resiliency. When dbt runs a microbatch model — whether for the first time, during incremental runs, or during manual backfills — it will split the processing into multiple queries (or "batches"), based on the `event_time` column you configure. | ||
|
||
Where other incremental strategies operate only on "old" and "new" data, microbatch models treat each "batch" of data as a unit that can be built or replaced on its own. Each batch is independent and <Term id="idempotent" />. This is a powerful abstraction that makes it possible for dbt to run batches separately — in the future, concurrently — and to retry them independently. |
I know we have an example in the "## How does `microbatch` compare to other incremental strategies?" section - but maybe we should have one here as well?
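One way to picture "independent and idempotent" is a toy in-memory table where a batch is rebuilt by dropping every row for that day and inserting the freshly computed rows. This is only a sketch under that assumption: `replace_batch` is a hypothetical helper, and how a real adapter replaces a batch is not described in this hunk.

```python
from datetime import date

Row = tuple[date, str]  # (event_date, payload); a stand-in for a table row


def replace_batch(table: list[Row], batch_day: date, new_rows: list[Row]) -> list[Row]:
    """Rebuild one daily batch without touching rows from other days."""
    kept = [row for row in table if row[0] != batch_day]
    # Only rows that belong to this batch are written back.
    return kept + [row for row in new_rows if row[0] == batch_day]


table = [(date(2024, 9, 30), "a"), (date(2024, 10, 1), "stale")]
fresh = [(date(2024, 10, 1), "b"), (date(2024, 10, 1), "c")]

once = replace_batch(table, date(2024, 10, 1), fresh)
twice = replace_batch(once, date(2024, 10, 1), fresh)
print(once == twice)  # True: retrying the same batch changes nothing
```

Because a retry leaves the table unchanged and other days are never touched, a failed batch can be rerun on its own.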
|
||
dbt has changed the default materialization for incremental table merges from `temporary table` to `view`. For more information about this change and instructions for setting the configuration to a temp table, please read about [Snowflake temporary tables](/reference/resource-configs/snowflake-configs#temporary-tables). | ||
The `merge` strategy is available in dbt-postgres and dbt-redshift beginning in dbt v1.6. |
Do we need this? Or can we add it to the table somehow?
|
||
## How does `microbatch` compare to other incremental strategies? | ||
|
||
Most incremental models rely on the end user (you) to explicitly tell dbt what "new" means, in the context of each model, by writing a filter in an `is_incremental()` block. You are responsible for crafting this SQL in a way that queries `this` to check when the most recent record was last loaded, with an optional look-back window for late-arriving records. Other incremental strategies will control _how_ the data is being added into the table (whether append-only `insert`, `delete` + `insert`, `merge`, `insert overwrite`, etc.), but they all have this in common. |
"Other incremental strategies will control how the data is being added into the table"
maybe add a callout of how microbatch chooses which of these "how"s is best?
Currently
with more adapters to come! |
Currently, as microbatch is still in "beta", this functionality is still gated behind an env var (will be swapped to behavior change flag ahead of the final 1.9 release) - so you need to set |
Resolves #6136
What are you changing in this pull request and why?
- `microbatch` incremental strategy (v1.9+)
- `microbatch` to table of incremental strategies

Checklist
website/sidebars.js