Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calculate source data freshness #1240

Closed
drewbanin opened this issue Jan 15, 2019 · 1 comment
Closed

Calculate source data freshness #1240

drewbanin opened this issue Jan 15, 2019 · 1 comment
Labels
enhancement New feature or request

Comments

@drewbanin
Copy link
Contributor

drewbanin commented Jan 15, 2019

Feature

Calculating Data Freshness

Using the information present in sources, dbt can determine how "fresh" source data is at a given point in time. dbt should provide a command which is capable of snapshotting the data freshness (the max(loaded_at_field) for each table) at a given point in time. dbt should produce a json file which contains information about the freshness when this command is invoked.

Example usage

$ dbt source snapshot-freshness [--select [source_1, source_2, ...]] [-o target/sources.json]

Arguments

--select

This flag allows users to select specific sources to describe. It should accept multiple values, each of which is either:

  1. The name of a source (eg. snowplow, quickbooks, etc)
  2. The name of specific table in a source (eg. snowplow.event, quickbooks.accounts). This name is generating by concatenating the source and table with a dot.

If no sources are --selected, then dbt should calculate the freshness for all of the sources in a project.

-o

A path to a .json file (relative to the target/ directory?) to write the file to.

Calculating Freshness

with source_snapshot as (

	select
		max({{ loaded_at_field }}) as max_loaded_at
		
	from github_stars.stargazers

)

select
	max_loaded_at,
	getdate() as snapshotted_at,
	datediff(second, max_loaded_at, getdate()) as max_loaded_at_time_ago_in_s
	
from source_snapshot

This query will vary in all of the usual, unfortunate ways across databases:

  1. inconsistent mechanisms for getting the current timestamp (getdate(), now(), current_timestamp, etc)
  2. inconsistent data types (timestamp_tz vs. timestamp_ntz on Snowflake)
  3. incomplete support for datediff (namely on postgres

As such, this command should be implemented using the adapter macro paradigm. Moreover, it would be convenient to support a contract of fields in this query, then let users supply their own macro to calculate the time delta. This is a nice-to-have for the first cut of this feature, but if it's easy to do, we should do it!

Output file format

All times should be UTC

{
  "meta": {
    "generated_at": "2019-01-15T19:57:51.793643Z",
    "elapsed_time": 0.314208984375
  },
  "sources": {
    # map the source unique id onto data about the source
    "source.project.source_name.table": {
        "max_loaded_at": "2018-01-01 12:00:00.123",
        "snapshotted_at": "2018-01-01 12:02:12.456",
        "max_loaded_at_time_ago_in_s": "1234".
        "state": "warn" # one of {ok|warn|error}
        "criteria": {....} # copied from the schema.yml spec
    }
  }
}

Stdout

This command should work a lot like the dbt run command, outputting a parallelized list of resource invocations to the console.

17:05:22 | Concurrency: 8 threads (target='dev')
17:05:22 |
17:05:22 | 1 of 3 START freshness of source.table_1 .......... [RUN]
17:05:22 | 2 of 3 START freshness of source.table_2................. [RUN]
17:05:22 | 3 of 3 START freshness of source.table_3................. [RUN]
17:05:22 | 1 of 3 START freshness of source.table_1................. [OK in 1.2s]
17:05:22 | 2 of 3 START freshness of source.table_2......................... [WARN in 2.4s]
17:05:22 | 3 of 3 START freshness of source.table_3......................... [ERROR in 3.5s]
17:05:22 |
17:05:22 | Finished running 3 sources in 9.29s.

Completed with 1 error and 1 warning:

Freshness Error in source table_3 (models/sources.yml)
  The table source.table_3 is 812 days out of date. Error for condition count=800 period=day.

Freshness Warning in source table_2 (models/sources.yml)
  The table source.table_2 is 12 days out of date. Warning for condition count=10, period=day.
@drewbanin drewbanin added the enhancement New feature or request label Jan 15, 2019
@drewbanin drewbanin added this to the Stephen Girard milestone Jan 15, 2019
beckjake added a commit that referenced this issue Feb 13, 2019
@drewbanin
Copy link
Contributor Author

fixed in #1272

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant