Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a generate_database_name() macro #1695

Closed
whisperstream opened this issue Aug 21, 2019 · 8 comments · Fixed by #2143
Closed

Add a generate_database_name() macro #1695

whisperstream opened this issue Aug 21, 2019 · 8 comments · Fixed by #2143
Labels
enhancement New feature or request

Comments

@whisperstream
Copy link

whisperstream commented Aug 21, 2019

Describe the feature

In the current version (v14.0), there exists a {{ generate_schema_name_for_env }} macro which works very well in dev mode, allowing a production run to write to the specified schema but writing all tables and views to a dev schema when in dev mode. In the same way we need a {{generate_database_name_for_env}} macro for when a database is configured in the dbt_project.yml file.

Currently:
If I have in my dbt_project.yml file a section in models that reads:

models:
   product:
      database: mart_db
      materialized: view
      schema: mart_schema

and in my profiles.yml file I have:

my_dbt:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: *******

      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"

      role: "ur_{{ env_var('DBT_USER') }}"
      database: dbt_dev

Then my models in dev mode will be written using mart_db instead of dev_db.

Describe alternatives you've considered

Right now to solve this I've created alternative ref macro called xref to override this behavior but it feels a bit clunky to do this and I will have to tell out dbt devs to all use {{ xref('some table') }} instead of the inbuilt ref function.

Additional context

Not database specific, it's a dbt issue.

Who will this benefit?

Anyone who wants to specify a set of production databases in their dbt_project.yml file in the same way that they might already do for their schemas using the existing {{generate_database_name_for_env}} macro but who also wants to have dbt write all tables and views into a single schema when in dev_mode.

@whisperstream whisperstream added enhancement New feature or request triage labels Aug 21, 2019
@drewbanin
Copy link
Contributor

Hey @whisperstream - I think this is a good idea!

We historically have recommended building all dbt models into a single database - can you tell me more about why you're interested in building models across databases? I'm all for building the functionality described in this feature request, but I also want to get a handle on what a typical use case looks like :)

@drewbanin drewbanin removed the triage label Aug 21, 2019
@whisperstream
Copy link
Author

Mostly this comes from an organizational perspective. My (current) plan is to have 3 databases, staging, bizready and marts. Within each of the first two databases I use one schema per source of data.

I'm working on a large enterprise project and I'm trying to enable multiple teams to contribute dbt scripts for the data sources they have access to. This is further complicated by having two tiers of dbt developers.

Tier 1 - dev_raw
Can contribute dbt scripts for a specific source of data, transforming that source's data from staging to business ready. Has access only to a single or subset of the sources, does not have access to build the entire staging -> bizready dbt project.

Tier 2 - dev analyst
Contributes dbt scripts that create tables and views from specific sources of the bizready data. Has no access to the raw/staging data.

I could use different dbt_project.yml for the different dev types but I want to keep the lineage data when generating docs.

It seems that for most companies, it's assumed the dbt developer has access to run the whole dbt_project.yml in one go, in my case, only the integration and production environments will have this capability. If there's a better way of structuring the project am open to that too, just haven't found/thought of another way and so worked around it by creating an xref() alternative that figures out what sort of dbt dev you are (tier1/tier2) and rewrites the fully qualified table name accordingly.

@drewbanin
Copy link
Contributor

Thanks for the context @whisperstream - this is really interesting! One approach that may be interesting: you can use multiple different dbt projects and they can state each other as dependencies. Check out the docs on packages here if you haven't seen them already.

In your case, the Tier 2 package could "import" the Tier 2 package. That would still let you ref across projects and the generated documentation would show the full lineage graph. If anything, this might make the docs better because it would separate the two sets of models into different projects :)

I think there's merit to implementing a generate_database_name macro regardless (at the very least for consistency's sake!) but also wanted to throw this alternative approach out there. Thanks for sharing your use case with me!

@whisperstream
Copy link
Author

@drewbanin thanks for that, let me try and arrange my current project into a multi package model and see if that works too.

@whisperstream
Copy link
Author

whisperstream commented Aug 21, 2019

Ok back again. I think packages help separate out the projects into logical parts, but I still need to do the rewriting.

For Tier1 devs - the ref() function works normally

But for Tier2 devs, even when models are separated into different packages I still need to have an xref() macro because since Tier2 has only has access to bizready tables from Tier1, I have to override the normal ref() function and return instead the final production table name. Does that make sense?

Another semi related item I wondered was if ref should be allowed to be overridden. Before creating xref I tried just creating a macro called ref but that caused a recursive loop I think or some other error. So I wondered if it would be better to rename the actual ref function and then for dbt to have a default macro which would be something like.

{%- macro ref(package_name, table_name=None) -%}
    {{ renamed_ref(package_name, table_name) }}
{% endmacro %}

That way if people have more complicated things to do with ref, they can do it in a more seamless way, that also keeps everything appearing as standard dbt syntax.

@drewbanin
Copy link
Contributor

Hey @whisperstream - check out this thread. It's a long one, but we set forth some approaches in this thread to make it possible to override ref! I don't think that would be an unreasonable thing for us to do at all.

Can you show me what your xref() macro looks like? I follow a lot of what you're saying at a high-level, but this part isn't super clear to me:

I have to override the normal ref() function and return instead the final production table name

What happens when a Tier 2 analyst runs a Tier 1 model? Do they just exclude those models from their dbt runs? Can you also share what you would want your generate_database_name_for_env() macro to look like?

I think the pattern you're describing is a really good one and I'm super interested in making sure that dbt does a good job of facilitating it! I appreciate you taking the time :)

@whisperstream
Copy link
Author

whisperstream commented Aug 22, 2019

Let me clean up my xref macro a bit and add some comments, in the mean time I'll try and explain this part I have to override the normal ref() function and return instead the final production table name a bit better:

So just to preface this, what I describe below (and the condition in xref()) only happens when target.name!=prod AND env_var('DEVELOPER_TYPE') == tier2, otherwise dbt runs as normal.

Normally in dbt you create a hierarchy of models, i.e.

[actual raw table] -> model_raw -> model_stage -> model_bizready -> model_mart

This assumes you have access to [actual raw table], but in my case tier2 developers only have access to [production bizready table] (created by a previous production run of dbt)

So when a tier2 developer executes:

dbt -run -m +model_mart

dbt will try and execute the whole DAG, which will fail because tier2 can't access the [actual raw table]. To work around this, my xref() macro figures out that xref(model_bizready) should not keep traversing the DAG as normal, but instead should rewrite the DAG to reference the production table that the tier2 dev does have access to. So for xref(model_bizready) the DAG becomes:

[production bizready table] -> model_mart

That way the tier2 developer doesn't need access to anything including or proceeding bizready because they don't have access rights to it anyway, but they're still able to contribute to and execute dbt for any mart models they wish to work on.

Conversely a tier1 developer or a production process executing the same dbt -run -m +model_mart will execute the whole DAG normally
i.e.
[actual raw table] -> model_raw -> model_stage -> model_bizready -> model_mart

@drewbanin drewbanin changed the title Need a {{generate_database_name_for_env}} macro generate_database_name_for_env}} macro Aug 27, 2019
@drewbanin drewbanin changed the title generate_database_name_for_env}} macro Add a generate_database_name() macro Aug 27, 2019
@whisperstream
Copy link
Author

whisperstream commented Sep 9, 2019

@drewbanin

Ok here's what my xref macro is looking like:

all bizready tables are prefixed with br_, so a sample table might be br_s1__some_dataset,

In prod mode or when DEVELOPER_TYPE is RAW_DEV then xref just acts like ref. but if developer mode is BIZREADY_DEV and a br_ table is being referenced then it rewrites the table reference to point to production. If target != prod I always do the mapping because I need xref to fail if anyone adds a new bizready schema. Anyway hopefully the use case makes sense. The database name override isn't the only piece I'd need to do this more elegantly, I'd also need to some how get the custom schema name from the config.

Originally I was getting the custom schema name if the developer mode was BIZREADY_DEV but dbt creates the schemas first instead of doing it only if something is written to that schema and this meant that if two devs were running dbt at the same time, one would error out because dbt wouldn't be able to create the custom schemas again (because the 1st dev would have created them)....that's why I have the clumsy lookup table.

Not sure if you have some better ways I might solve this use case, but being able to oevrride the ref function is probably the 1st step and then maybe having more ways look up information about a reference name, i.e. to see if it has a custom schema, custom database configured etc..?

{%- macro xref(package_name, table_name=None) -%}

    {%- if table_name is none -%}
        {%- set table_name = package_name -%}
        {%- set package_name = None -%}
    {%-  endif -%}

    {# -- only do this for bizready tables #}
    {%- if (package_name is none and target.name == 'prod') or not table_name.startswith('br_') -%}
        {{- return( ref(package_name, table_name) ) -}}
    {%- endif -%}

    {%- set orig_ref = ref(table_name) -%}
    {%- set final_table = orig_ref -%}

    {%- set prefix_mapping = {
            's1':  'schema1',
            's2':  'schema2',
            's3':  'schema3'
        }
    -%}

    {# -- lookup the mapping and get the schema name -- #}
    {%- set db_schema = prefix_mapping[table_name.split('__')[0][3:]] if (table_name|string).startswith('br_') else None -%}

    {%- if var('DEVELOPER_TYPE') == 'BIZREADY_DEV' -%}

        {%- set db = var('BIZREADY_PROD_DB') -%}

        {%- set final_table = db + '.' + db_schema + '.' + table_name -%}

        {{- log('[' ~ var('DEVELOPER_TYPE') ~ ' MODE] - Converted original table ref: ' ~ table_name
            ~ ' from expected conversion: ' ~ orig_ref ~ ' to: ' ~ final_table) -}}

    {%- else -%}

        {{- log('No changes made for table: ' ~ table_name ~ ' and ref: ' ~ orig_ref) -}}

    {%- endif -%}

    {{- log('Returning final table:' ~ final_table) -}}

    {{- final_table -}}

{%- endmacro -%}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants