Advanced node selection syntax #2172

Closed
jtcohen6 opened this issue Feb 28, 2020 · 5 comments · Fixed by #2640
Labels
enhancement New feature or request

Comments

@jtcohen6
Contributor

jtcohen6 commented Feb 28, 2020

We want to enable a mechanism of node selection that is:

  • More powerful, composable, extensible
  • In a structured data format
  • Possible to check into version control

We think that this is best implemented as YML. It should be similar to CLI --models and --select syntaxes, but it will also allow us to move beyond what's possible with CLI flags + arguments.

Selectors

  • resource name
  • resource type
  • model materialization type
  • tags
  • project/package
  • subdirectory
    • file path literals
  • Node dependencies. We can make these more verbose in YML than the current selectors on the CLI:
    • parents: +my_model
    • children: my_model+
    • children and all their parents: @my_model
    • proposed: models that depend on macros (and their children?): my_macro+
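
To make the three graph operators above concrete, here is a minimal Python sketch over a toy DAG. The node names, the PARENTS mapping, and the helper function names are all hypothetical and exist only for illustration; this is not how dbt implements selection internally.

```python
from collections import deque

# Hypothetical toy DAG: each key is a node, each value lists its direct parents.
PARENTS = {
    "raw_events": [],
    "raw_users": [],
    "stg_events": ["raw_events"],
    "stg_users": ["raw_users"],
    "sessions": ["stg_events", "stg_users"],
    "daily_report": ["sessions"],
}


def _walk(start, edges):
    """Return every node reachable from `start` by following `edges`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def _children_map(parents):
    """Invert a node -> parents map into a node -> children map."""
    children = {node: [] for node in parents}
    for node, ups in parents.items():
        for up in ups:
            children[up].append(node)
    return children


def select_parents(name, parents=PARENTS):
    """+name: the node plus all of its ancestors."""
    return {name} | _walk(name, parents)


def select_children(name, parents=PARENTS):
    """name+: the node plus all of its descendants."""
    return {name} | _walk(name, _children_map(parents))


def select_at(name, parents=PARENTS):
    """@name: the node, its descendants, and every ancestor of those nodes."""
    selected = select_children(name, parents)
    for node in list(selected):
        selected |= _walk(node, parents)
    return selected
```

In this toy graph, select_at("stg_events") pulls in stg_users and raw_users as well, because sessions (a child of stg_events) cannot be built without them.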

Set logic

  • unions (inclusive OR): current default behavior
  • exclusion: possible on the CLI with --exclude
  • intersections (AND): not yet possible, proposed in Model intersection syntax #2167
  • exclusive OR will be possible as a combination of the three above: union(A,B) --exclude intersect(A,B)
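
As a quick sanity check on the set logic above, here is a sketch using plain Python sets. The node names are made up; the point is only that exclusive OR falls out of union, intersection, and exclusion.

```python
# Hypothetical node sets, as if returned by two selectors A and B.
a = {"model_a", "model_b", "model_c"}
b = {"model_b", "model_c", "model_d"}

union = a | b          # inclusive OR: the current default behavior
intersection = a & b   # AND: nodes matched by both selectors
exclusion = a - b      # A with everything in B excluded

# Exclusive OR, built from the three operations above:
# union(A, B) with intersect(A, B) excluded.
xor = union - intersection
```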

Well defined "pseudo-selectors"

We can encode a dynamic selector that returns resources based on a set of conditions, which dbt uses to pick specific nodes at build time. I'm including a couple possibilities of varying complexity, mainly to spur the imagination:

  • this_package_only
    • Only execute models + tests that are defined in the current "home" project
    • Dynamic based on the project directory from which it's run.
  • build_if_missing
    • Exclude model nodes that already exist as relations in the target database + schema
  • build_if_changed
    • point to manifest.json from a different dbt build, and dbt can compare to infer changed resources
    • sensible pattern: select all nodes with changes + their children
  • build_if_updated
    • point to a manifest.json from a different dbt build, and the result of a more recent dbt source snapshot-freshness. dbt can determine whether

(Very) hypothetical spec

version: 2

selectors:

  - name: snowplow_marketing_nightly    # human-friendly name for this custom node grouping
    definition:
      - union:          # include nodes for which ANY of the selectors below is true
        - intersect:    # include nodes for which ALL of the selectors below are true
          - tag: nightly
          - tag: marketing
          - package: snowplow
          - materialized: incremental
        - union:
          - resource_name: snowplow_marketing_custom_events
          - file_path: "models/snowplow/marketing/custom_events.sql"
          - model_dir: "snowplow/marketing"
        - intersect:
          - resource_type: seed
          - package: snowplow
          - exclude:
              resource_name: country_codes
  
  - name: ci    # a different custom node grouping
    definition:
      - dynamic: build_if_changed
        parents: false
        children: true

dbt run --selector snowplow_marketing_nightly
dbt run --selector ci
dbt test --selector ci
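
To show how a nested definition like the one above could be resolved, here is a rough Python sketch of a recursive evaluator. Everything in it is an assumption made for illustration: the NODES catalog, the evaluate and matches helpers, and the choice to treat exclude as "everything except the nested selector". The definition is written as a Python dict mirroring the YAML, with the file_path and model_dir criteria left out for brevity.

```python
# Hypothetical node catalog: name -> attributes a selector can match on.
NODES = {
    "snowplow_sessions": {"resource_type": "model", "package": "snowplow",
                          "tags": {"nightly", "marketing"}, "materialized": "incremental"},
    "snowplow_page_views": {"resource_type": "model", "package": "snowplow",
                            "tags": {"nightly"}, "materialized": "incremental"},
    "snowplow_marketing_custom_events": {"resource_type": "model", "package": "my_project",
                                         "tags": set(), "materialized": "table"},
    "country_codes": {"resource_type": "seed", "package": "snowplow", "tags": set()},
    "referer_mapping": {"resource_type": "seed", "package": "snowplow", "tags": set()},
    "orders": {"resource_type": "model", "package": "my_project",
               "tags": {"nightly"}, "materialized": "table"},
}


def matches(name, node, key, value):
    """Check one leaf criterion (tag, resource_name, package, ...) against a node."""
    if key == "tag":
        return value in node.get("tags", set())
    if key == "resource_name":
        return name == value
    return node.get(key) == value  # package, materialized, resource_type


def evaluate(selector, nodes):
    """Resolve a single-key selector dict to a set of node names."""
    key, value = next(iter(selector.items()))
    if key == "union":
        result = set()
        for child in value:
            result |= evaluate(child, nodes)
        return result
    if key == "intersect":
        result = set(nodes)
        for child in value:
            result &= evaluate(child, nodes)
        return result
    if key == "exclude":
        # Assumption: exclude means "all nodes except those the nested selector matches".
        return set(nodes) - evaluate(value, nodes)
    return {name for name, node in nodes.items() if matches(name, node, key, value)}


# Dict form of the snowplow_marketing_nightly definition (file criteria omitted).
DEFINITION = {"union": [
    {"intersect": [{"tag": "nightly"}, {"tag": "marketing"},
                   {"package": "snowplow"}, {"materialized": "incremental"}]},
    {"union": [{"resource_name": "snowplow_marketing_custom_events"}]},
    {"intersect": [{"resource_type": "seed"}, {"package": "snowplow"},
                   {"exclude": {"resource_name": "country_codes"}}]},
]}

selected = evaluate(DEFINITION, NODES)
```

With this catalog, selected contains snowplow_sessions, snowplow_marketing_custom_events, and referer_mapping: one node from each of the three branches of the top-level union.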

Prior art

This carries on the legacy of several past issues (going back to #550, if not earlier). It's something we've been thinking about for some time.

Looking ahead, I believe that a good approach here will form the basis for features we're very interested in supporting:

  • complex workflows
  • smarter CI
  • dev/prod environments in larger/multi-package projects
@jtcohen6 jtcohen6 added the enhancement New feature or request label Feb 28, 2020
@drewbanin drewbanin added this to the Octavius Catto milestone Mar 24, 2020
@drewbanin drewbanin self-assigned this Mar 24, 2020
@alanmcruickshank
Contributor

@drewbanin @jtcohen6 - I'm very invested in this feature. I think it could meaningfully improve the incremental run times of our production DAG, especially the ability to skip any view-materialised models and just run a pruned DAG of incremental and table models. I'm really pleased to find such a well thought-through approach detailed here and in the linked issues.

It looks like this depends on #2203, so I'm assuming there's nothing I can do to help right now, but I'm very keen to help out if I can, even if that's just constructing a bank of potential test cases. Please let me know if I can help. 😁

@drewbanin drewbanin modified the milestones: Octavius Catto, dbt-next Apr 29, 2020
@drewbanin
Contributor

@beckjake to review and advise. Sounds like PowerShell and jq have good syntaxes for arbitrary selection over a list -- what do those look like, and can we be inspired by them?

@aaronsteers
Contributor

aaronsteers commented Jun 24, 2020

I'd like to propose a possible implementation for the "diff-only" (build_if_changed) feature which is based upon my own prior learnings with similar architectures. I'm not sure if this is already the plan but I wanted to document here in case it would be helpful.

  1. During dbt run, the source code of each node is hashed, and the result is hashed together with the hashes of all upstream models and the dbt version number. The output of this process is a unique hash that can be stored for each model, either in the database itself and/or in manifest.json, as a uniqueness key for that run.
  2. During subsequent executions of dbt run, the hashes are calculated again and compared. If --diff-only (or --skip-unchanged or similar) is specified, any object with an exactly matching hash is skipped.
  3. Objects which are missing would always fail the comparison and would therefore be built.
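
A minimal Python sketch of the three steps above, assuming a project can be represented as a mapping from node name to its source code and direct upstream nodes. The PROJECT data, the version string, and the function names are hypothetical; SHA-256 is just one reasonable choice of hash.

```python
import hashlib

DBT_VERSION = "0.17.0"  # placeholder version string for the sketch

# Hypothetical project: node name -> (source code, direct upstream nodes).
PROJECT = {
    "stg_orders": ("select * from raw.orders", []),
    "stg_users": ("select * from raw.users", []),
    "order_facts": ("select * from stg_orders join stg_users using (user_id)",
                    ["stg_orders", "stg_users"]),
}


def node_hash(name, project, version, cache):
    """Hash a node's source, then combine it with upstream hashes and the version."""
    if name in cache:
        return cache[name]
    source, upstream = project[name]
    source_hash = hashlib.sha256(source.encode()).hexdigest()
    combined = hashlib.sha256()
    combined.update(source_hash.encode())
    for parent in sorted(upstream):  # sorted so the result is order-independent
        combined.update(node_hash(parent, project, version, cache).encode())
    combined.update(version.encode())
    cache[name] = combined.hexdigest()
    return cache[name]


def all_hashes(project, version):
    """Compute the combined hash for every node in the project."""
    cache = {}
    for name in project:
        node_hash(name, project, version, cache)
    return cache


def nodes_to_build(project, version, previous_hashes):
    """Return nodes whose hash differs from, or is missing in, the previous run."""
    current = all_hashes(project, version)
    return {name for name, digest in current.items()
            if previous_hashes.get(name) != digest}
```

Because each hash folds in its parents' hashes, editing stg_orders changes the hash of order_facts too, so downstream models are rebuilt without any explicit graph traversal at comparison time.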

Importantly, this can be performed using static code analysis and is sensitive to upstream model changes. The use cases supported here are:

  1. I'm a developer (using dbt-cloud or similar incremental development process) and I don't want to wait for things to rebuild when we already know what their outputs will be.
    • Developers could (and probably would) leave this as the default setting and only disable/override it when changes are applied outside of the dbt environment (e.g. new data loaded or raw table schemas updated).
  2. I'm in production and I just released a bugfix to the main branch. Without rebuilding my entire environment, I want to automatically rebuild only objects whose source code definition has changed (along with their downstream models), without having to manually identify which objects those are.

Would this type of "smart rebuild" be feasible and is this similar perhaps to what is already being planned?

@ucg8j

ucg8j commented Jul 13, 2020

This could also improve the data lineage usability in dbt docs.

I don't think this is covered above. When working with massive DAGs, I don't want all children/parents recursively; I want to traverse the tree a level at a time, or specify the depth I want to traverse.

Much like the *nix command tree, which takes an argument to list X levels deep or recurses fully, this might look something like:

dbt model_name^1 # only immediate children
dbt model_name^2 # immediate children and grandchildren
dbt 1^model_name # immediate parents
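
One way the depth-limited traversal could work is a breadth-first walk that stops expanding once it reaches the requested depth. The sketch below is illustrative only: the CHILDREN graph and the descendants_within function are invented names, not part of dbt.

```python
from collections import deque

# Hypothetical toy graph: node -> direct children.
CHILDREN = {
    "model_name": ["child_a", "child_b"],
    "child_a": ["grandchild"],
    "child_b": [],
    "grandchild": ["great_grandchild"],
    "great_grandchild": [],
}


def descendants_within(start, children, max_depth=None):
    """Return nodes reachable from `start` in at most `max_depth` hops.

    max_depth=None means unlimited depth (fully recursive).
    """
    found = set()
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand nodes already at the depth limit
        for child in children.get(node, []):
            if child not in found:
                found.add(child)
                queue.append((child, depth + 1))
    return found
```

The same function would cover the "immediate parents" case by passing a node -> parents mapping instead of the children mapping.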

@Raalsky
Contributor

Raalsky commented Jul 13, 2020

@ucg8j Check out the direct child model selector syntax added in #2485. It should be released in the next feature release (maybe 0.18.0 or something).

6 participants