
Define source tables #814

Closed
drewbanin opened this issue Jun 28, 2018 · 16 comments
Assignees
Labels
dbt-docs [dbt feature] documentation site, powered by metadata artifacts

Comments

@drewbanin
Contributor

drewbanin commented Jun 28, 2018

Related: #790

Presently, the link between source data and dbt models is pretty tenuous. dbt should provide a mechanism for defining source tables, and users should be able to select from these source tables with a mechanism similar to ref. Further, these source tables should be documented alongside other dbt resources.

Proposed Syntax

sources:
  - name: snowplow
    loader: snowplow # This is new, see below
    loaded_at_field: etl_tstamp # This is new, see below
    description: "A snowplow dataset"
    freshness: # This is new, see below
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
        - name: event
          description: An immutable log of events collected by Snowplow
          loaded_at_field: _etl_tstamp # Override value from top-level source
          sql_table_name: snowplow.event
          columns:
            - name: collector_tstamp
              description: Timestamp for the event recorded by the collector

            - name: domain_userid
              description: User ID set by Snowplow using 1st party cookie
              tests:
                  - unique

        - name: web_page
          description: "{{ docs('snowplow_web_page') }}"
          sql_table_name: "{{ var('snowplow:web_page') }}"
          columns:
            - name: event_id
              description: "{{ docs('snowplow_web_page.event_id') }}"

Notes

For more information on the description and columns fields, check out the discussion on the new schema.yml syntax.

There are two new constructs here:

sql_table_name

This field will be responsible for pointing to a table in the warehouse. This syntax works for:

BigQuery:

  • Syntax: project-name.dataset.tablename
  • can select across projects

Snowflake:

  • Syntax: database.schema.tablename
  • can select across databases

pg/Redshift:

  • Syntax: schema.tablename
  • cannot select across databases (dbt should not, however, enforce this restriction itself)

Rather than trying to construct the table name automatically, users will need to enter the exact identifier for the source table.
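
For illustration, here are hypothetical sql_table_name values for the same logical table on each warehouse (the project, database, and schema names below are made up):

    # BigQuery: project-name.dataset.tablename
    sql_table_name: gcp-analytics-project.snowplow.event

    # Snowflake: database.schema.tablename
    sql_table_name: raw.snowplow.event

    # Postgres / Redshift: schema.tablename
    sql_table_name: snowplow.event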

loader

This field will be used for documentation. The loader can be something like Airflow, Stitch, or Fivetran; the value is user supplied.

loaded_at_field

This field indicates which timestamp field in the source table should be used as the "loaded at" timestamp. This varies by ETL loader, but for loaders that do supply such a field, we can calculate how long ago a given source was loaded. This field can be specified at the source level or at the table level; if specified at the table level, it overrides any value set at the source level.

freshness

This field describes the acceptable latency for source tables. This field can be provided at the source or table level. If provided at the source level, it should apply to all of the tables in the source. If provided at the table level, it should override any freshness configuration at the source level. The freshness value for a table can be set to null to override the value specified in the source.

Acceptable formats:

    freshness:
      warn_after:
        count: 12
        period: hour
      error_after:
        count: 24
        period: hour

Acceptable periods are minute, hour, and day.
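
As an illustration of how this could be evaluated, here is a hedged sketch of the kind of query dbt might run against the snowplow.event table defined above (the exact query dbt generates is an implementation detail):

    select
        max(_etl_tstamp) as max_loaded_at,
        current_timestamp as snapshotted_at
    from snowplow.event

The difference between snapshotted_at and max_loaded_at would then be compared against the warn_after and error_after thresholds.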

Usage

To select from a source table, users can use the source function, analogous to ref. This function accepts two arguments: a source and a table. For example:

-- models/snowplow/snowplow_sessions_xf.sql

select * from {{ source('snowplow', 'event') }}

will render to:

select * from snowplow.event

When source() is used, an edge will be added to the graph. This will allow model selection like: "run everything that depends on Snowplow data". In the example above, snowplow_event is derived from the name: parameter of the first table definition.

Ex:

# run all models that touch Snowplow data
$ dbt run --models source:snowplow+

See #1014 for info about the selection syntax shown above.

See #1240 for info about model freshness.

@drewbanin drewbanin added the dbt-docs [dbt feature] documentation site, powered by metadata artifacts label Jun 28, 2018
@jthandy
Member

jthandy commented Jun 28, 2018

I like this a lot. I have a couple of questions that popped into my head while reading this, which we'll likely have to answer before completing this feature:

  • what does the CLI syntax look like for selecting everything that depends on snowplow source tables? I don't think there's a way to do that with --models today because our hierarchy there is different...

  • what is the name of the file that contains this yaml? sources.yml? can there be many of these files? can packages define their own sources which then get namespaced as packages.source_name.table_name?

@cmcarthur
Member

this looks great.

the only thing i'm not 100% sure on is sql_table_name. this is kind of like alias, right? but it also includes a schema, database, or other namespace. i wonder if it's better to define this like other relations, where you can specify an identifier, schema, and/or database. that way when we ref it we already have a relation in place.

so instead of sql_table_name, we'd have:

sources:
  - name: snowplow
    description: "A snowplow dataset"
    tables:
        - name: snowplow_event
          description: An immutable log of events collected by Snowplow
          schema: snowplow
          identifier: event
          columns:
            - name: collector_tstamp
              description: Timestamp for the event recorded by the collector

            - name: domain_userid
              description: User ID set by Snowplow using 1st party cookie
              tests:
                  - unique

        - name: web_page
          description: "{{ docs('snowplow_web_page') }}"
          # this used to be one var, which is unfortunate! hard to convert.
          # sql_table_name: "{{ var('snowplow:web_page') }}"
          schema: snowplow
          identifier: "{{ var('snowplow:web_page') }}"
          columns:
            - name: event_id
              description: "{{ docs('snowplow_web_page.event_id') }}"

@jthandy
Member

jthandy commented Jun 29, 2018

That was what the initial version looked like, but databases are different in terms of how they address namespacing: do you need to specify a database as well, or just a schema? Is it even possible to select from a different database? The verbiage is different, too.

Is there a benefit to actually defining database and schema for sources specifically instead of just having a string that defines where the table can be addressed? Since dbt doesn't create or maintain them, it only needs to know how to write the select statement that gets data out of them.

@cmcarthur
Member

cmcarthur commented Jun 29, 2018

@jthandy

do you need to specify a database as well or just a schema? is it even possible to select from a different database?

Depends on the warehouse type. For Redshift and Postgres, you can't select from a different database. For Snowflake, you can. And I think in BigQuery you have to specify the project.

verbiage is different also.

We made some decisions in the relations rollout to accommodate this, the big one being that we use schema everywhere even though a schema is actually a dataset in BigQuery. So the only real difference is that in BigQuery you have to specify a project, whereas in the other warehouses you can optionally specify a database.

Is there a benefit to actually defining database and schema for sources specifically instead of just having a string that defines where the table can be addressed? Since dbt doesn't create or maintain them, it only needs to know how to write the select statement that gets data out of them.

This is true today, but a lot of our internal API for inspecting relations depends on this information being structured a certain way. For example, if we want to inspect the schema of a source table in BigQuery, we actually make an API call with the project, schema, and table name as arguments instead of just running some arbitrary SELECT.

Of course we could always convert this string back to a relation; it's just messy, and there's room for bugs there.
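
For illustration, a hedged sketch (not dbt's actual implementation) of the easy direction, composing a qualified name from structured parts; the reverse, parsing an arbitrary sql_table_name string back into database/schema/identifier, is where the mess and the bugs would live:

    {# Hypothetical sketch: rendering a relation name from structured parts #}
    {% macro source_relation_name(identifier, schema, database=none) %}
        {%- if database -%}
            {{ database }}.{{ schema }}.{{ identifier }}
        {%- else -%}
            {{ schema }}.{{ identifier }}
        {%- endif -%}
    {% endmacro %}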

@jthandy
Member

jthandy commented Jun 29, 2018

All of that makes sense. I think we would have to require database on Snowflake and BigQuery for every relation. As long as we're comfortable making that kind of thing conditional, it makes sense.

@drewbanin
Contributor Author

I'm definitely leaning more towards constructing the SQL identifier from a database + schema + table name. Generally, I think there are some real questions to answer around implementation. We also need to figure out how these things fit into --models selection syntax, and we'll need to update the (prospective) docs site to account for them too.

I think we should kick this out of the 0.11.0 milestone, but we should definitely continue to shore up our thinking around implementation in this issue.

@drewbanin
Contributor Author

This is partially blocking on #550 for source selection. Otherwise, I think we may be in good shape to tackle this one soon.

Note that there is a significant set of corresponding changes in the documentation site that will be enabled by this feature.

@drewbanin drewbanin added this to the Guion Bluford milestone Sep 28, 2018
@drewbanin drewbanin removed this from the Guion Bluford milestone Oct 17, 2018
@drewbanin drewbanin added this to the Wilt Chamberlain milestone Nov 28, 2018
@drewbanin
Contributor Author

This is no longer blocking on #550 and can be prioritized when ready.

@beckjake beckjake self-assigned this Jan 17, 2019
@beckjake
Contributor

beckjake commented Jan 18, 2019

@drewbanin I've got a few questions! I've worked around most of them while working out the schema and basic parsing, but I think I should address them before I do much more:

  • What are the defaults for freshness? If a user overrides a source table's freshness to null, what does that mean?

    • For now I ignored this and the defaults are {'freshness': {'warn_after': None, 'error_after': None}}
    • Can users override just count/period, or must they do both?
  • For freshness, are only integers allowed, or any numeric value?

  • The issue text implies that the loaded_at_field is optional. Is that accurate? How does that interact with freshness?

  • How should the override mechanism work?

    • Should it be early binding (do the merge at parse time) or late binding (use the SourceConfig override mechanism)?
  • It looks like sources can have docrefs, and models can now have all of refs, docrefs (via schema.yml patches), and source refs. Is that right?

    • Should all those kinds of refs get combined into one somehow? It seems crazy that we'll have three mostly separate reference mechanisms!
    • Do sources get access to macros?
    • Are all fields in sources rendered?
    • What things are available in what contexts?
      • Should I be able to do things like description: "{{ docs(var('snowplow:webpage')) }}"?
  • Should the parsed results of the source tables be nodes in the manifest, or should they have their own namespace?

    • They're not ref-able, so maybe they shouldn't be in the default namespace? But we do want to access them from the graph.
    • can you define table-level tests for sources like you can for models, or only column tests?
    • model/column tests can ref other models. Can they source sources? Can source tests source sources? What about refing models?
    • Currently we name our model tests like ${model_name}_${column_name}_${variables_and_such}. How should we differentiate source tests from model tests with this method, given the fact that there's no namespace information in there as it is?
      • Should we change model tests' naming as well to match/put them under their own namespace? (source_tests?)
        • Will that break anything that's currently using manifest.json?

@drewbanin
Contributor Author

What are the defaults for freshness? If a user overrides a source table's freshness to null, what does that mean?

Yeah, you're correct to assume that there are no defaults. I think there are a ton of datasets where monitoring freshness is unnecessary (e.g. manual loads) or otherwise not possible (no loaded_at date provided by the ETL tool). I like your instincts here.

Users must override both the count and the period. In the schema.yml freshness contract, both the count and period should be required.

For freshness, are only integers allowed, or any numeric value?

I can't think of a good use case for decimal values. Instead of .5 hours, you could just specify 30 minutes. My instinct is that we'd probably want to start with ints, just insofar as it will make rendering any UIs a little easier. These should be integers ≥ 1.

I don't feel strongly about this though. If there's a good reason to support floats, that would be ok too.

The issue text implies that the loaded_at_field is optional. Is that accurate? How does that interact with freshness?

Great question! Freshness may only be supplied if the user also supplies a loaded_at_field. Not every dataset will have a loaded_at field, and there are merits to defining sources beyond just calculating data freshness. So, we should make loaded_at_field optional, but throw an error if freshness is supplied without a loaded_at_field.
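
To make that rule concrete, a quick sketch (field values are hypothetical): the first table below is fine, while the second should raise an error because freshness is supplied without any loaded_at_field:

    sources:
      - name: snowplow
        tables:
          # OK: freshness is paired with a loaded_at_field
          - name: event
            loaded_at_field: _etl_tstamp
            freshness:
              warn_after: {count: 12, period: hour}

          # Error: freshness supplied, but no loaded_at_field at the source or table level
          - name: manual_upload
            freshness:
              warn_after: {count: 12, period: hour}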

How should the override mechanism work? Should it be early binding (do the merge at parse time) or late binding (use the SourceConfig override mechanism)?

I'm not sure I understand the implications of one approach over the other. Can you elaborate?

It looks like sources can have docrefs, and models can now have all of refs, docrefs (via schema.yml patches), and source refs. Is that right?

Yes, 100% correct!

Should all those kinds of refs get combined into one somehow? It seems crazy that we'll have three mostly separate reference mechanisms!

Possibly? Since we don't have much in the way of namespaces, different function names indicate what kind of thing you're referencing!

In a typical base model today, you'd write:

    select * from source_schema.table_name

With sources, it would be:

    select * from {{ source('source', 'table_name') }}

I like the idea of preserving the fact that this is a reference to source data, and would hesitate to just use ref. I feel largely the same about the doc function. It's important to note that the arguments to source differ from that of ref, and so there's probably some merit to giving them different names.
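
For example, both can appear in the same model, and the function name plus the argument shape makes the kind of reference obvious (the joined model and column names below are hypothetical):

    select *
    from {{ source('snowplow', 'event') }} as events          -- two args: source name, table name
    left join {{ ref('snowplow_sessions') }} as sessions      -- one arg: model name
        on events.domain_sessionid = sessions.session_id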

Do sources get access to macros?

This is a really good question! If sources have access to macros, does that mean that macros can't refer to sources? I'd rather be able to specify {{ source(...) }} in a macro than allow macros to be called from sources.

These should be mostly static/unchanging configs, and so I'm ok with keeping them pretty simple. The fewer moving parts the better IMO.

Are all fields in sources rendered?

I think that all of the source-specific fields should be renderable. All of the configs that are shared with the models: syntax should be implemented with the currently existing behavior (i.e. not rendered, I think).

Here's the rationale: sources can be defined inside of packages, and the package owners might not know anything about how the source tables are being loaded into the warehouse. So, you have snowplow.event, and you know that there are 130 columns with precise names, types, and definitions, but you don't actually know:

  1. the name of the loader
  2. the name of the loaded_at_field (not the best example in this case, but true in general)
  3. the expected freshness
  4. the sql_table_name

It would be good to allow these to be configured with variables (see below) as we do in most of our packages, like snowplow.

Note: I don't love how we scope and supply variables currently, but that can be a future project.
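
As a sketch of what configuring these fields with variables might look like (the variable names and defaults here are hypothetical):

    sources:
      - name: snowplow
        loader: "{{ var('snowplow:loader', 'stitch') }}"
        loaded_at_field: "{{ var('snowplow:loaded_at_field', '_etl_tstamp') }}"
        tables:
          - name: event
            sql_table_name: "{{ var('snowplow:event') }}"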

What things are available in what contexts? Should I be able to do things like description: "{{ docs(var('snowplow:webpage')) }}"?

Really good question. I think the following context variables should be supplied:

  • all of the base context things (available in dbt_project.yml's context, like env_var, target, etc.)
  • var, scoped to the path of schema.yml
    • I think this will happen automatically, but maybe not. Happy to discuss this point in particular
  • doc function

Should the parsed results of the source tables be nodes in the manifest, or should they have their own namespace?

Let's make them nodes in the manifest. I don't feel strongly about this though, open to discussing.

They're not ref-able, so maybe they shouldn't be in the default namespace? But we do want to access them from the graph.

While you're correct that they would not be referenced with the ref function, they do indeed participate in the graph. We should only split these out into their own namespace if we plan to do the same for other node types, like tests (and potentially archives, once they too are referenceable). I think nodes are fine for now, but let me know if you feel differently!

can you define table-level tests for sources like you can for models, or only column tests?

Good question! Let's support both. I can imagine tests like "make sure this table isn't empty" or similar.
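
A sketch of what that might look like (not_empty here is a hypothetical custom schema test, not a built-in):

    sources:
      - name: snowplow
        tables:
          - name: event
            tests:
              - not_empty        # hypothetical table-level test: "this table isn't empty"
            columns:
              - name: event_id
                tests:
                  - unique       # column-level test, same as for models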

model/column tests can ref other models. Can they source sources? Can source tests source sources? What about refing models?

Ohhhhh boy. Are there complications in supporting both ref and source in both models and sources? I see now how this is kind of unpleasant.... My initial impression is that it's reasonable to want to do all of the things you noted here. Can you give me an idea if there are architectural changes required to do this, or if it's more of an implementation detail?

Currently we name our model tests like ${model_name}_${column_name}_${variables_and_such}. How should we differentiate source tests from model tests with this method, given the fact that there's no namespace information in there as it is?

Let's add a source_ prefix and punt on this question for now. I know it's not perfect and they can still collide if you name your models terribly. That's ok for now.

Should we change model tests' naming as well to match/put them under their own namespace? (source_tests?) Will that break anything that's currently using manifest.json?

We should tackle this at some point (the test names are unreservedly terrible). Changing this will impact the docs site, as we actually parse out referential integrity from these convoluted names :upside-down-smiley-face:. We'll need to give that some love in the future, but no need to poke the sleeping bear right this second.

@beckjake
Contributor

beckjake commented Jan 21, 2019

How should the override mechanism work? Should it be early binding (do the merge at parse time) or late binding (use the SourceConfig override mechanism)?

I'm not sure I understand the implications of one approach over the other. Can you elaborate?

Early binding is a lot easier; pulling in SourceConfig sounds pretty difficult! When I asked this, I thought it might make sense to have one file define a source table in another file's sources section, but given your answers elsewhere, that makes less sense.

if sources have access to macros, does that mean that macros can't refer to sources? I'd rather be able to specify {{ source(...) }} in a macro than allow macros to be called from sources.

Without some big changes, I'm pretty confident that it has to be one or the other for macros; that's why we build up the partial macro-only manifest during parsing.

Possibly? Since we don't have much in the way of namespaces, different function names indicate what kind of thing you're referencing!

I think the different function names are a fine way to expose it to users. My problem with this is a bit larger: I'm not sure that I like what comes out of the obvious way to implement this. We've already got refs and docrefs, and now it seems natural to add sourcerefs and do all the same things again with the slight tweak of sources, making source available everywhere ref is.

All of the configs that are shared with the models: syntax should be implemented with the currently existing behavior (ie. not rendered, I think).

In dbt_project.yml's models/seeds sections, the current behavior is that pre-hook, post-hook, and vars (or anything nested in a vars block) are evaluated at runtime. The remainder are evaluated at parse time. The parse-time renderer has access to env_var and var, and var only honors command-line vars.

It sounds like you want it to include the vars defined in dbt_project.yml. Is that accurate? Currently, by the way, target is not available; it's only available in fields that are rendered at runtime (hooks, vars). I would like a better specification of exactly what's allowed here.

Finally, the current parser for description fields in schema.yml only provides docs in its context. It sounds like you want more: should the node's description field also get var/env_var?

We should only split these out into their own namespace if we plan to do the same for other node types, like tests (and potentially archives, once they too are referenceable). I think nodes are fine for now, but let me know if you feel differently!

Sources just seem to exist in this weird half-in, half-out state. I guess the question here is: would archives be referenced via ref, or via their own function? Also, fitting sources into the unparsed/parsed nodes model is going to be pretty clunky, even more so than with seeds.

Are there complications in supporting both ref and source in both models and sources?

I don't think so. As long as you don't form cycles, I don't think it matters much. I do have to admit, I'm not clear on how exactly you'd use a ref from a source. In the description? We'd have to rethink that context then. In the source name? That seems to defeat the point if your source is secretly already in the graph because it's a ref! But maybe there's a use I don't understand.

@cmcarthur
Member

if sources have access to macros, does that mean that macros can't refer to sources? I'd rather be able to specify {{ source(...) }} in a macro than allow macros to be called from sources.

Without some big changes, I'm pretty confident that it has to be one or the other for macros, it's why we build up the partial macro-only manifest during parsing.

is this true? how is this different from refing in macros?

@beckjake
Contributor

@cmcarthur oh, good point! I guess sources will work the same way as refs since the macro level does no interpretation.

beckjake added a commit that referenced this issue Jan 28, 2019
@cosoare

cosoare commented Jun 15, 2020

@drewbanin It would be super useful if we could set the location of the source (dataset), similar to how we can define that in profiles.yml. This is particularly useful for BigQuery datasets which live in a single region; currently dbt queries only using the multi-region locations, i.e. EU or US.

Please see this Slack thread for a concrete example:
https://getdbt.slack.com/archives/C99SNSRTK/p1592241999380600

Any workarounds (apart from creating a multi-region dataset) would be appreciated.

@jtcohen6
Contributor

jtcohen6 commented Jun 15, 2020

Hey @cosoare, do you mind opening a new issue and including all the relevant details? We can continue the discussion there. I want to better understand the pros/cons of single-region vs. multi-region datasets.

@drewbanin
Contributor Author

Just pasting in the thread details from Slack since they disappear on us :)

Is there a way to specify location for BigQuery datasets, specifically for source datasets? For context, I'm facing an issue where I can't utilise a dataset I've created for sources since it's in a single region, i.e. europe-west2, and dbt is looking for EU.

You can specify database in your model configuration. But this assumes that your authenticated user account has the necessary permissions for both projects

Ah, sorry I’ve just re-read your question. Actual location? No I don’t think so.

hmm… so I need to set up cross-region supported datasets? Not ideal 😞

No, you can do cross-region as long as it's within the EU, AFAIK.

The problem with this is the assumption dbt makes that a BigQuery dataset will be cross-region (EU, US); hence, when trying to run dbt run against a dataset that is not EU but rather a single region like europe-west2, dbt fails.

You can indeed feel free to create a new issue for us to discuss this change, but I don't really think that dbt can help you here! BigQuery disallows querying datasets across regions. If you did provide a location for source datasets in BQ, there isn't really anything dbt could do with that information. The source data already lives in a specific region, and short of copying that data into a differently named dataset in a different region (something we would not want to do), I don't think there's anything dbt could do to make this work!

So, please pop open a new issue to discuss in further detail if you'd like. In particular, I'd like to know what you think dbt should be doing with these dataset locations if provided!
