Feature: Command to auto-generate schema.yml files #1082

Closed · mikekaminsky opened this issue Oct 22, 2018 · 19 comments
Labels: stale (Issues that have gone stale)

Comments

@mikekaminsky (Contributor)

Feature description

DBT should have a command to bootstrap a schema.yml file (or, if you ask me, files) for a given model.

dbt bootstrap_schema --model my_model

Should create one model_name.yml file for every model, with the appropriate boilerplate (model names and column names) filled in.

It should warn and skip if model_name.yml already exists.

Who will this benefit?

When creating a new dbt project from scratch, creating all of the testing files by hand is a PITA. This will help encourage people to use tests by alleviating the need for boilerplate copy/pasting.

How to achieve?

In Redshift, you can get the column names by creating a view in an unused namespace and then inspecting the relevant system tables. Not sure about BigQuery / Snowflake, but I'm hopeful there's an equivalent method.
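
For Redshift (or any Postgres-compatible warehouse), a minimal sketch of that approach might look like the following, assuming a throwaway schema and model name:

-- Sketch only: materialize the model's SQL as a view in an unused schema,
-- then read the column names back out of the system catalog.
create schema if not exists dbt_bootstrap;
create view dbt_bootstrap.my_model as
  <compiled SQL of my_model>;

select column_name
from information_schema.columns
where table_schema = 'dbt_bootstrap'
  and table_name = 'my_model'
order by ordinal_position;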

@drewbanin (Contributor)

Great idea. Shocked we don't have an issue for this already!

This becomes a much easier problem if dbt can just inspect existing relations built from your model SQL. That would require you to run your models first, but I imagine that's more in line with a typical development workflow anyway, right?
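
As a rough sketch of that idea (assuming the model has already been built so its relation exists in the warehouse, and using dbt's adapter.get_columns_in_relation), a run-operation macro along these lines could print the boilerplate:

{% macro print_schema_boilerplate(model_name) %}
    {# Sketch: look up the already-built relation and print schema.yml boilerplate.
       Invoke with: dbt run-operation print_schema_boilerplate --args '{model_name: my_model}' #}
    {% set relation = ref(model_name) %}
    {% set columns = adapter.get_columns_in_relation(relation) %}
    {% do log('  - name: ' ~ model_name, info=True) %}
    {% do log('    columns:', info=True) %}
    {% for column in columns %}
        {% do log('      - name: ' ~ (column.name | lower), info=True) %}
    {% endfor %}
{% endmacro %}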

@kevinsanz93

I think it would be cool if we could define the level at which the schema file is defined, maybe from the project.yml.

I can see cases where having one schema file for the entire project would be nice, but can definitely see an argument for having a {{model}}_schema.yml created in a folder structure that mirrors the model structure, where a schemas folder would be created at the root directory of a project.

@mikekaminsky (Contributor, Author)

@kevinsanz93 thanks for raising this. I have always used separate schema files, one for each model, and had been assuming it would be implemented this way, but clearly that's not a good assumption!

@yennanliu

Bootstrapping and populating a dbt schema.yml automatically is a good idea!

@thalesmello commented Feb 21, 2019

I wrote a schema.yml generator query using the information_schema of the generated tables in a postgres warehouse. In case it's useful to anyone:

with "columns" as (
	select '- name: "' || column_name || '"' as column_statement,
		table_name
	from information_schema."columns"
	where table_schema = 'schema_name'
)
select table_name,
'version: 2

models:
  - name: "' || table_name || '"
    columns:
' || string_agg('      ' || column_statement, E'\n') || E'\n' as yml_file
from "columns"
group by table_name
order by table_name

@drewbanin (Contributor)

@thalesmello this is super cool! Thanks for sharing :)

@adrienboutreauu

For Snowflake:


with "columns" as (
	select '- name: ' || column_name  || '\n       description: '|| lower(column_name) || ' (data type '|| lower(DATA_TYPE) || ')'
        
            as column_statement,
		table_name
	from information_schema.columns
	where table_schema = 'MY_SCHEMA'   -- schema to generate YAML for
  and table_name in ('')              -- list specific table names here, or drop this filter
  order by 1
),
tables as (
select table_name,
'
  - name: ' || table_name || '
    columns:
' || listagg('      ' || column_statement || '\n'|| '\n')  as table_desc
from "columns"
group by table_name
order by table_name
)

select '---Generated automatically, please update after generation
version: 2
sources:
  - name: '  || '
\n\nmodels:' || listagg(table_desc )
from tables;

@bashyroger

It still would be great if a DBT macro that generates SQL could generate YML files at the same time.

We are now using a DBT macro to generate a series of views in Snowflake. These views unpack JSON data into discrete columns using source-system information_schema metadata.

As an example, each generated view has a column {table_name}_BK, which should be unique. We would like to create an accompanying YML file for each view that validates this uniqueness.

In a similar fashion, we'd like to add more data quality tests to these YML files, in a metadata-driven way!
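
For context, the kind of property file we'd want generated per view might look like this (model and column names here are only placeholders):

version: 2

models:
  - name: customer_unpacked        # one generated view per source table
    columns:
      - name: customer_bk          # the {table_name}_BK business key
        tests:
          - unique
          - not_null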

@dubravcik

For SQL Server:

DECLARE @schema_yaml varchar(max)='';

WITH [columns] AS (
       SELECT [table] = '- name: ' + TABLE_NAME + char(13) + char(10)
                      + '  description: ' + char(13) + char(10)
                      + '  columns: ' + char(13) + char(10)

              ,[column] = '    - name: ' + COLUMN_NAME + char(13) + char(10) 
                        + '      description: ' + char(13) + char(10)
              ,[column_index]  = ORDINAL_POSITION  
              ,[table_schema] = '- '+ TABLE_SCHEMA + char(13) + char(10) + char(13) + char(10)          
         FROM INFORMATION_SCHEMA.COLUMNS 
)

,[tables] AS (
       SELECT [table_full] = [table] + STRING_AGG(CAST([column] as varchar(max)),'') WITHIN GROUP (ORDER BY [column_index])
              ,[table_schema]
         FROM [columns]
        GROUP BY [table], [table_schema]
)

,[schema] AS (
       SELECT [table_schema_full] = [table_schema] + STRING_AGG([table_full],'')  WITHIN GROUP (ORDER BY [table_full])
         FROM [tables]
         GROUP BY [table_schema]
)

,[final] AS (
       SELECT [text] = STRING_AGG([table_schema_full],'')
         FROM [schema]
)

SELECT @schema_yaml = [text] FROM final

PRINT @schema_yaml


---- if result is longer than 8000 characters, you have to print in iterations
--DECLARE @i int = 1
--WHILE EXISTS(SELECT(SUBSTRING(@schema_yaml,@i,4000))) and (@i < LEN(@schema_yaml))
--BEGIN
--     PRINT SUBSTRING(@schema_yaml,@i,4000)
--     SET @i = @i+4000
--END

@jranks123 commented Feb 21, 2020

For BigQuery:

WITH columns as (
    SELECT 
        '- name: "'
            || column_name
            || '"\n      description: '
            || lower(column_name)
            || '\n'
            || '      tests:\n' as column_statement,
	table_name
    FROM (
        SELECT *
        FROM `<project>.<data set>`.INFORMATION_SCHEMA.COLUMNS
        where table_name='<table name>'
    )
),
  
tables as (
    SELECT
        table_name,
        '- name: ' || table_name || '\n'
        || '  description:\n\n'
        || '  columns:\n' || string_agg('    ' || column_statement, '\n') as table_desc
    FROM columns
    GROUP BY table_name
    ORDER BY table_name
)


SELECT 
     '#Generated automatically, please update after generation\n'
    || 'version: 2\n\n'
    || 'models:\n'
    || string_agg('' || table_desc || '\n') 
FROM tables;

@bhavkap commented May 13, 2020

Variant of the above script that works on Redshift

select p.name,
'version: 2

models:
  - name: "' || p.name || '"
    columns:
' || listagg(distinct ('      ' || '- name: ' || attname || '\n')) WITHIN GROUP (ORDER BY 1) AS "columns"
FROM pg_attribute a, pg_namespace ns, pg_class c, pg_type t, stv_tbl_perm p, pg_database db
WHERE t.oid=a.atttypid AND a.attrelid=p.id AND ns.oid = c.relnamespace AND db.oid = p.db_id AND c.oid = a.attrelid
AND typname NOT IN ('oid','xid','tid','cid')
AND ns.nspname = 'schema_name'
group by p.name

@dubravcik

I think we should extend the scope: not only to initialize a new schema*.yml file, but also to sync an existing schema*.yml with the current model.
Currently I add or delete columns in schema.yml manually and don't enjoy it; too often the change remains undocumented in the yml.
Btw, I use one yml per model.

@robastel

A couple of coworkers and I built a tool to help automate this process.

The short explanation:

  • Behind the scenes it uses dbt ls for resource selection, and dbt run-operation combined with get_columns_in_query to retrieve column names.
  • It supports creating property (.yml) files for models, seeds, snapshots, and analyses.
    • It even works for ephemeral models!
  • It is designed to create a one-to-one relationship between resources and property files. So for example, if a resource is located at models/path/to/resource.sql, then a property file will be created at models/path/to/resource.yml.
  • If a property file already exists, its column list will be synced against results from the data warehouse at runtime. As long as a column still exists in the results from the data warehouse, its property values such as descriptions, tests, etc. will be preserved. One limitation is that formatting and comments will not be preserved.

Basic usage:

  1. Make sure you have previously run dbt run, dbt seed, or dbt snapshot for the relevant resources. This is important because the columns listed in property files will be based on results from the data warehouse.
  2. pip install dbt-invoke
  3. dbt-invoke properties
    • The first time you run this, you should be prompted to automatically add a macro called _log_columns_list (which wraps get_columns_in_query) to your project.
    • For control over which resources to create property files for, you can use most of the same options available in dbt ls (long flags only):
      • --resource-type
      • --models
      • --select
      • --selector
      • --exclude
      • --project-dir
      • --profiles-dir
      • --profile
      • --target
      • --vars
      • --bypass-cache
      • --state
    • For performance, there is an extra option --threads (be aware that increasing the number of threads may increase the load on your data warehouse)
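
As a quick usage sketch combining a couple of the options above (the selector path and thread count are only examples):

dbt-invoke properties --select models/marts --threads 4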

Notes:

  • When updating existing property files, formatting and comments are not preserved.
  • We've tested on Python 3.6, 3.7, 3.8 as well as dbt 0.18.2 and 0.19.1, but not across different types of data warehouses, although as far as I can tell the implementation doesn't seem to be dependent on warehouse type (famous last words).

Repo with full README:

I hope this is helpful and works for some of you!

@asheone commented Jul 8, 2021

The Snowflake script for generating schema.yml, with slightly improved indentation (from @adrienboutreauu):

with "columns" as (
	select '    - name: ' || column_name  || '\n            description: '|| lower(column_name) || ' (data type '|| lower(DATA_TYPE) || ')'

            as column_statement,
		table_name
	from information_schema.columns
	where table_schema = 'SCHEMA_NAME'   -- schema to generate YAML for
  order by 1
),
tables as (
select table_name,
'
    - name: ' || table_name || '
      columns:
'    || listagg('      ' || column_statement || '\n'|| '\n')  as table_desc
from "columns"
group by table_name
order by table_name
)

select '---Generated automatically, please update after generation
version: 2
sources:
  - name: '  || '
\n\nmodels:' || listagg(table_desc )
from tables;

@github-actions (bot) commented Jan 6, 2022

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

The github-actions bot added the stale (Issues that have gone stale) label on Jan 6, 2022.
@alexrosenfeld10 (Contributor) commented Feb 20, 2022

Here's my version of these generation helpers. The most notable differences are:

  • Handles models vs. sources via macros and a runner script
  • Ordering is consistently alphabetical (Snowflake's listagg function is unpredictable until tamed)
  • Formatting is a tiny bit cleaner, at least in my IDE 🤷

Models

Macro

gen_models.sql

{% macro gen_models(database_name, schema_name) %}
    {% set sql %}
with "columns" as (
	select '- name: ' || lower(column_name) || '\n        description: "'|| lower(column_name) || ' (snowflake data type: '|| lower(DATA_TYPE) || ')"'

            as column_statement,
		table_name,
        column_name
	from {{ database_name }}.information_schema.columns
	where table_schema = '{{ schema_name | upper }}'
),
tables as (
select table_name,
'
  - name: ' || lower(table_name) || '
    columns:
'    || listagg('      ' || column_statement || '\n') within group ( order by column_name ) as table_desc
from "columns"
group by table_name
)

select '# This file was generated automatically. Please place desired portions into the project manually.
version: 2

models:' || listagg(table_desc) within group ( order by table_name )
from tables;
{% endset %}


    {%- call statement('generator', fetch_result=True) -%}
    {{ sql }}
    {%- endcall -%}

    {%- set states=load_result('generator') -%}
    {%- set states_data=states['data'] -%}
    {%- set states_status=states['response'] -%}

{# we log as warning so it's easier for jq to find in the output #}
    {% do exceptions.warn(states_data[0][0]) %}
{% endmacro %}

Sources

Macro

gen_sources.sql

{% macro gen_sources(database_name, schema_name) %}
    {% set sql %}
with "columns" as (
	select '- name: ' || lower(column_name) || '\n            description: "'|| lower(column_name) || ' (snowflake data type: '|| lower(DATA_TYPE) || ')"'

            as column_statement,
		table_name,
	    column_name
	from {{ database_name }}.information_schema.columns
	where table_schema = '{{ schema_name | upper }}'
),
tables as (
select table_name,
'
      - name: ' || lower(table_name) || '
        columns:
'    || listagg('          ' || column_statement || '\n') within group ( order by column_name ) as table_desc
from "columns"
group by table_name
)

select '# Generated automatically, please update after generation
version: 2

sources:
  - name: {{ schema_name }}
    description: you fill this out
    database: you fill this out. It might need some jinja logic.
    schema: {{ schema_name }}
    loader: where does this source originate from?
    loaded_at_field: what field indicates the last time a row was updated from the source?

    tables:' || listagg(table_desc) within group ( order by table_name )
from tables;

{% endset %}


    {%- call statement('generator', fetch_result=True) -%}
    {{ sql }}
    {%- endcall -%}

    {%- set states=load_result('generator') -%}
    {%- set states_data=states['data'] -%}
    {%- set states_status=states['response'] -%}

{# we log as warning so it's easier for jq to find in the output #}
    {% do exceptions.warn(states_data[0][0]) %}
{% endmacro %}

Scripts

For usability, I have a script that does this:

# Sample usage:

# models:
# ./scripts/generate_yml.sh models your_db_name your_schema_name

# sources:
# ./scripts/generate_yml.sh sources your_db_name your_schema_name

echo "Generating your $1..."
if dbt_logs=$(dbt --log-format json run-operation gen_"$1" --args "{database_name: $2, schema_name: $3}"); then
  echo "Model generation succeeded. Parsing dbt logs to obtain query output."
  if generated_yml=$(jq -rs '. |  map(select(.level == "warn")) | first | .data.msg' <<<"$dbt_logs"); then
    echo "Found query output in logs. Writing to generated_$1.yml"
    echo "$generated_yml" >generated_"$1".yml
    exit 0
  else
    echo "Error occurred locating yml in dbt query output. Check the dbt log file for more info."
    exit 1
  fi
else
  echo "Error occurred running dbt or querying snowflake. Ensure your arguments are valid and try again."
  echo "If things still aren't working, your dbt project is probably mis-configured. Run 'dbt compile' to see what might be wrong."
  exit 1
fi

Obviously the generated_[models|sources].yml files are in my .gitignore.

@abhineetgupta

I wanted to mention the dbt-codegen package, which also accomplishes these objectives.
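
For reference, generating property files with dbt-codegen looks roughly like this (macro arguments may vary by package version, so check its README):

# Print schema.yml boilerplate for existing models
dbt run-operation generate_model_yaml --args '{"model_names": ["my_model"]}'

# Print source definitions (with columns) for a schema
dbt run-operation generate_source --args '{"schema_name": "raw_shop", "generate_columns": true}'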

@sizhitu commented Dec 27, 2022

For PostgreSQL:


WITH columns as (
    SELECT
        -- E'' escape strings are needed so '\n' renders as a real newline
        '- name: '
            || column_name
            || E'\n      description: "'
            || coalesce(lower(column_statement), '')
            || E'"\n' as column_statement,
	    table_name,
	    tabcomment
    FROM (
        SELECT c.relname as table_name,
               a.attname AS column_name,
               col_description(a.attrelid,a.attnum) as column_statement,
               obj_description(c.oid) as tabcomment
          FROM pg_class c,pg_attribute a,pg_type t
         WHERE c.relname = 'tablename'
           and a.attnum > 0 
           and a.attrelid = c.oid
           and a.atttypid = t.oid
         ORDER BY a.attnum
    ) tab
),
  
tables as (
    SELECT
        table_name,
        '- name: ' || table_name || E'\n'
        || '  description: "' || coalesce(tabcomment, '') || E'"\n'
        || E'  columns:\n' || string_agg('    ' || column_statement, E'\n') as table_desc
    FROM columns
    GROUP BY table_name,tabcomment
    ORDER BY table_name
)

SELECT
     E'# Generated automatically, please update after generation\n'
    || E'version: 2\n\n'
    || E'models:\n'
    || string_agg(table_desc, E'\n')
FROM tables;

@eng-rodrigocunha commented Mar 1, 2023

Adapting the schema.yml generator query shared earlier (which uses the information_schema of the generated tables), here is a version for BigQuery:

with columns as (
	select '- name: "' || column_name || '"' as column_statement,
		table_name
	from `<project>.<data set>`.INFORMATION_SCHEMA.COLUMNS

),
tables as (
select table_name,
"  - name: " || table_name || "\n" ||
"    columns:\n" ||
string_agg('      ' || column_statement, "\n") as yml_file
from columns
group by table_name
order by table_name
)
select
"version: 2\n" ||
"\n" ||
"models:\n" ||
string_agg(yml_file, "\n") as yml_file
from tables
