Feature: Command to auto-generate schema.yml files #1082

Closed · mikekaminsky opened this issue Oct 22, 2018 · 19 comments
Labels: stale (Issues that have gone stale)

Comments

@mikekaminsky (Contributor)

Feature description

DBT should have a command to bootstrap a schema.yml file (or, if you ask me, files) for a given model.

dbt bootstrap_schema --model my_model

Should create one model_name.yml file for every model, with the appropriate boilerplate (model names and column names) filled in.

It should warn and skip if model_name.yml already exists.

Who will this benefit?

When creating a new dbt project from scratch, creating all of the testing files by hand is a PITA. This will help encourage people to use tests by alleviating the need for boilerplate copy/pasting.

How to achieve?

In Redshift, you can get the column names by creating a view in an unused namespace and then inspecting the relevant system tables. Not sure about BigQuery / Snowflake, but I'm hopeful there's an equivalent method.
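
For Redshift (or any Postgres-compatible warehouse), a minimal sketch of that approach might look like the following, assuming a throwaway schema and model name:

-- Sketch only: materialize the model's SQL as a view in an unused schema,
-- then read the column names back out of the system catalog.
create schema if not exists dbt_bootstrap;
create view dbt_bootstrap.my_model as
  <compiled SQL of my_model>;

select column_name
from information_schema.columns
where table_schema = 'dbt_bootstrap'
  and table_name = 'my_model'
order by ordinal_position;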

@drewbanin (Contributor)

Great idea. Shocked we don't have an issue for this already!

This becomes a much easier problem if dbt can just inspect existing relations built from your model SQL. That would require you to run your models first, but I imagine that's more in line with a typical development workflow anyway, right?
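
As a rough sketch of that idea (assuming the model has already been built so its relation exists in the warehouse, and using dbt's adapter.get_columns_in_relation), a run-operation macro along these lines could print the boilerplate:

{% macro print_schema_boilerplate(model_name) %}
    {# Sketch: look up the already-built relation and print schema.yml boilerplate.
       Invoke with: dbt run-operation print_schema_boilerplate --args '{model_name: my_model}' #}
    {% set relation = ref(model_name) %}
    {% set columns = adapter.get_columns_in_relation(relation) %}
    {% do log('  - name: ' ~ model_name, info=True) %}
    {% do log('    columns:', info=True) %}
    {% for column in columns %}
        {% do log('      - name: ' ~ (column.name | lower), info=True) %}
    {% endfor %}
{% endmacro %}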

@kevinsanz93

I think it would be cool if we could define the level at which the schema file is defined, maybe from the project.yml.

I can see cases where having one schema file for the entire project would be nice, but can definitely see an argument for having a {{model}}_schema.yml created in a folder structure that mirrors the model structure, where a schemas folder would be created at the root directory of a project.

@mikekaminsky (Contributor, Author)

@kevinsanz93 thanks for raising this. I have always used separate schema files, one for each model, and had been assuming it would be implemented this way, but clearly that's not a good assumption!

@yennanliu

Bootstrapping and populating a dbt schema.yml automatically is a good idea!

@thalesmello commented Feb 21, 2019

I wrote a schema.yml generator query using the information_schema of the generated tables in a postgres warehouse. In case it's useful to anyone:

with "columns" as (
	select '- name: "' || column_name || '"' as column_statement,
		table_name
	from information_schema."columns"
	where table_schema = 'schema_name'
)
select table_name,
'version: 2

models:
  - name: "' || table_name || '"
    columns:
' || string_agg('      ' || column_statement, E'\n') || E'\n' as yml_file
from "columns"
group by table_name
order by table_name

@drewbanin (Contributor)

@thalesmello this is super cool! Thanks for sharing :)

@adrienboutreauu

For Snowflake:


with "columns" as (
	select '- name: ' || column_name  || '\n       description: '|| lower(column_name) || ' (data type '|| lower(DATA_TYPE) || ')'
        
            as column_statement,
		table_name
	from information_schema.columns
	where table_schema = 'MY_SCHEMA'   -- schema to generate YAML for
  and table_name in ('')              -- list specific table names here, or drop this filter
  order by 1
),
tables as (
select table_name,
'
  - name: ' || table_name || '
    columns:
' || listagg('      ' || column_statement || '\n'|| '\n')  as table_desc
from "columns"
group by table_name
order by table_name
)

select '---Generated automatically, please update after generation
version: 2
sources:
  - name: '  || '
\n\nmodels:' || listagg(table_desc )
from tables;

@bashyroger

It still would be great if a DBT macro that generates SQL could generate YML files at the same time.

We are now using a DBT macro to generate a series of views in Snowflake. These views unpack JSON data into discrete columns using source-system information_schema metadata.

As an example, each generated view has a column {table_name}_BK, which should be unique. We would like to create an accompanying YML file for each view that validates this uniqueness.

In a similar fashion, we'd like to add more data quality tests to these YML files, in a metadata-driven way!
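
For context, the kind of property file we'd want generated per view might look like this (model and column names here are only placeholders):

version: 2

models:
  - name: customer_unpacked        # one generated view per source table
    columns:
      - name: customer_bk          # the {table_name}_BK business key
        tests:
          - unique
          - not_null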

@dubravcik

For SQL Server:

DECLARE @schema_yaml varchar(max)='';

WITH [columns] AS (
       SELECT [table] = '- name: ' + TABLE_NAME + char(13) + char(10)
                      + '  description: ' + char(13) + char(10)
                      + '  columns: ' + char(13) + char(10)

              ,[column] = '    - name: ' + COLUMN_NAME + char(13) + char(10) 
                        + '      description: ' + char(13) + char(10)
              ,[column_index]  = ORDINAL_POSITION  
              ,[table_schema] = '- '+ TABLE_SCHEMA + char(13) + char(10) + char(13) + char(10)          
         FROM INFORMATION_SCHEMA.COLUMNS 
)

,[tables] AS (
       SELECT [table_full] = [table] + STRING_AGG(CAST([column] as varchar(max)),'') WITHIN GROUP (ORDER BY [column_index])
              ,[table_schema]
         FROM [columns]
        GROUP BY [table], [table_schema]
)

,[schema] AS (
       SELECT [table_schema_full] = [table_schema] + STRING_AGG([table_full],'')  WITHIN GROUP (ORDER BY [table_full])
         FROM [tables]
         GROUP BY [table_schema]
)

,[final] AS (
       SELECT [text] = STRING_AGG([table_schema_full],'')
         FROM [schema]
)

SELECT @schema_yaml = [text] FROM final

PRINT @schema_yaml


---- if result is longer than 8000 characters, you have to print in iterations
--DECLARE @i int = 1
--WHILE EXISTS(SELECT(SUBSTRING(@schema_yaml,@i,4000))) and (@i < LEN(@schema_yaml))
--BEGIN
--     PRINT SUBSTRING(@schema_yaml,@i,4000)
--     SET @i = @i+4000
--END

@jranks123 commented Feb 21, 2020

For BigQuery:

WITH columns as (
    SELECT 
        '- name: "'
            || column_name
            || '"\n      description: '
            || lower(column_name)
            || '\n'
            || '      tests:\n' as column_statement,
	table_name
    FROM (
        SELECT *
        FROM `<project>.<data set>`.INFORMATION_SCHEMA.COLUMNS
        where table_name='<table name>'
    )
),
  
tables as (
    SELECT
        table_name,
        '- name: ' || table_name || '\n'
        || '  description:\n\n'
        || '  columns:\n' || string_agg('    ' || column_statement, '\n') as table_desc
    FROM columns
    GROUP BY table_name
    ORDER BY table_name
)


SELECT 
     '#Generated automatically, please update after generation\n'
    || 'version: 2\n\n'
    || 'models:\n'
    || string_agg('' || table_desc || '\n') 
FROM tables;

@bhavkap commented May 13, 2020

Variant of the above script that works on Redshift

select p.name,
'version: 2

models:
  - name: "' || p.name || '"
    columns:
' || listagg(distinct ('      ' || '- name: ' || attname || '\n')) WITHIN GROUP (ORDER BY 1) AS "columns"
FROM pg_attribute a, pg_namespace ns, pg_class c, pg_type t, stv_tbl_perm p, pg_database db
WHERE t.oid=a.atttypid AND a.attrelid=p.id AND ns.oid = c.relnamespace AND db.oid = p.db_id AND c.oid = a.attrelid
AND typname NOT IN ('oid','xid','tid','cid')
AND ns.nspname = 'schema_name'
group by p.name

@dubravcik

I think we should extend the scope: not only to initialize a new schema*.yml file, but also to sync an existing schema*.yml with the current model.
Currently I add or delete columns in schema.yml manually and don't enjoy it; too often the change remains undocumented in the yml.
Btw, I use one yml per model.

@robastel

A couple of coworkers and I built a tool to help automate this process.

The short explanation:

  • Behind the scenes it uses dbt ls for resource selection, and dbt run-operation combined with get_columns_in_query to retrieve column names.
  • It supports creating property (.yml) files for models, seeds, snapshots, and analyses.
    • It even works for ephemeral models!
  • It is designed to create a one-to-one relationship between resources and property files. So for example, if a resource is located at models/path/to/resource.sql, then a property file will be created at models/path/to/resource.yml.
  • If a property file already exists, its column list will be synced against results from the data warehouse at runtime. As long as a column still exists in the results from the data warehouse, its property values such as descriptions, tests, etc. will be preserved. One limitation is that formatting and comments will not be preserved.

Basic usage:

  1. Make sure you have previously run dbt run, dbt seed, or dbt snapshot for the relevant resources. This is important because the columns listed in property files will be based on results from the data warehouse.
  2. pip install dbt-invoke
  3. dbt-invoke properties
    • The first time you run this, you should be prompted to automatically add a macro called _log_columns_list (which wraps get_columns_in_query) to your project.
    • For control over which resources to create property files for, you can use most of the same options available in dbt ls (long flags only):
      • --resource-type
      • --models
      • --select
      • --selector
      • --exclude
      • --project-dir
      • --profiles-dir
      • --profile
      • --target
      • --vars
      • --bypass-cache
      • --state
    • For performance, there is an extra option --threads (be aware that increasing the number of threads may increase the load on your data warehouse)
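
As a quick usage sketch combining a couple of the options above (the selector path and thread count are only examples):

dbt-invoke properties --select models/marts --threads 4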

Notes:

  • When updating existing property files, formatting and comments are not preserved.
  • We've tested on Python 3.6, 3.7, 3.8 as well as dbt 0.18.2 and 0.19.1, but not across different types of data warehouses, although as far as I can tell the implementation doesn't seem to be dependent on warehouse type (famous last words).

Repo with full README:

I hope this is helpful and works for some of you!

@asheone commented Jul 8, 2021

The Snowflake script for generating schema.yml, with slightly improved indentation (from @adrienboutreauu):

with "columns" as (
	select '    - name: ' || column_name  || '\n            description: '|| lower(column_name) || ' (data type '|| lower(DATA_TYPE) || ')'

            as column_statement,
		table_name
	from information_schema.columns
	where table_schema = 'SCHEMA_NAME'   -- schema to generate YAML for
  order by 1
),
tables as (
select table_name,
'
    - name: ' || table_name || '
      columns:
'    || listagg('      ' || column_statement || '\n'|| '\n')  as table_desc
from "columns"
group by table_name
order by table_name
)

select '---Generated automatically, please update after generation
version: 2
sources:
  - name: '  || '
\n\nmodels:' || listagg(table_desc )
from tables;

@github-actions (bot) commented Jan 6, 2022

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

The github-actions bot added the stale (Issues that have gone stale) label on Jan 6, 2022.
@alexrosenfeld10 (Contributor) commented Feb 20, 2022

Here's my version of these generation helpers. The most notable differences are:

  • Handles models vs. sources via macros and a runner script
  • Ordering is consistently alphabetical (Snowflake's listagg function is unpredictable until tamed)
  • Formatting is a tiny bit cleaner, at least in my IDE 🤷

Models

Macro

gen_models.sql

{% macro gen_models(database_name, schema_name) %}
    {% set sql %}
with "columns" as (
	select '- name: ' || lower(column_name) || '\n        description: "'|| lower(column_name) || ' (snowflake data type: '|| lower(DATA_TYPE) || ')"'

            as column_statement,
		table_name,
        column_name
	from {{ database_name }}.information_schema.columns
	where table_schema = '{{ schema_name | upper }}'
),
tables as (
select table_name,
'
  - name: ' || lower(table_name) || '
    columns:
'    || listagg('      ' || column_statement || '\n') within group ( order by column_name ) as table_desc
from "columns"
group by table_name
)

select '# This file was generated automatically. Please place desired portions into the project manually.
version: 2

models:' || listagg(table_desc) within group ( order by table_name )
from tables;
{% endset %}


    {%- call statement('generator', fetch_result=True) -%}
    {{ sql }}
    {%- endcall -%}

    {%- set states=load_result('generator') -%}
    {%- set states_data=states['data'] -%}
    {%- set states_status=states['response'] -%}

{# we log as warning so it's easier for jq to find in the output #}
    {% do exceptions.warn(states_data[0][0]) %}
{% endmacro %}

Sources

Macro

gen_sources.sql

{% macro gen_sources(database_name, schema_name) %}
    {% set sql %}
with "columns" as (
	select '- name: ' || lower(column_name) || '\n            description: "'|| lower(column_name) || ' (snowflake data type: '|| lower(DATA_TYPE) || ')"'

            as column_statement,
		table_name,
	    column_name
	from {{ database_name }}.information_schema.columns
	where table_schema = '{{ schema_name | upper }}'
),
tables as (
select table_name,
'
      - name: ' || lower(table_name) || '
        columns:
'    || listagg('          ' || column_statement || '\n') within group ( order by column_name ) as table_desc
from "columns"
group by table_name
)

select '# Generated automatically, please update after generation
version: 2

sources:
  - name: {{ schema_name }}
    description: you fill this out
    database: you fill this out. It might need some jinja logic.
    schema: {{ schema_name }}
    loader: where does this source originate from?
    loaded_at_field: what field indicates the last time a row was updated from the source?

    tables:' || listagg(table_desc) within group ( order by table_name )
from tables;

{% endset %}


    {%- call statement('generator', fetch_result=True) -%}
    {{ sql }}
    {%- endcall -%}

    {%- set states=load_result('generator') -%}
    {%- set states_data=states['data'] -%}
    {%- set states_status=states['response'] -%}

{# we log as warning so it's easier for jq to find in the output #}
    {% do exceptions.warn(states_data[0][0]) %}
{% endmacro %}

Scripts

For usability, I have a script that does this:

# Sample usage:

# models:
# ./scripts/generate_yml.sh models your_db_name your_schema_name

# sources:
# ./scripts/generate_yml.sh sources your_db_name your_schema_name

echo "Generating your $1..."
if dbt_logs=$(dbt --log-format json run-operation gen_"$1" --args "{database_name: $2, schema_name: $3}"); then
  echo "Model generation succeeded. Parsing dbt logs to obtain query output."
  if generated_yml=$(jq -rs '. |  map(select(.level == "warn")) | first | .data.msg' <<<"$dbt_logs"); then
    echo "Found query output in logs. Writing to generated_$1.yml"
    echo "$generated_yml" >generated_"$1".yml
    exit 0
  else
    echo "Error occurred locating yml in dbt query output. Check the dbt log file for more info."
    exit 1
  fi
else
  echo "Error occurred running dbt or querying snowflake. Ensure your arguments are valid and try again."
  echo "If things still aren't working, your dbt project is probably mis-configured. Run 'dbt compile' to see what might be wrong."
  exit 1
fi

Obviously the generated_[models|sources].yml files are in my .gitignore.

@abhineetgupta

I wanted to mention the dbt-codegen package, which also accomplishes these objectives.
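
For reference, generating property files with dbt-codegen looks roughly like this (macro arguments may vary by package version, so check its README):

# Print schema.yml boilerplate for existing models
dbt run-operation generate_model_yaml --args '{"model_names": ["my_model"]}'

# Print source definitions (with columns) for a schema
dbt run-operation generate_source --args '{"schema_name": "raw_shop", "generate_columns": true}'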

@sizhitu commented Dec 27, 2022

For PostgreSQL:


WITH columns as (
    SELECT
        -- E'' escape strings are needed so '\n' renders as a real newline
        '- name: '
            || column_name
            || E'\n      description: "'
            || coalesce(lower(column_statement), '')
            || E'"\n' as column_statement,
	    table_name,
	    tabcomment
    FROM (
        SELECT c.relname as table_name,
               a.attname AS column_name,
               col_description(a.attrelid,a.attnum) as column_statement,
               obj_description(c.oid) as tabcomment
          FROM pg_class c,pg_attribute a,pg_type t
         WHERE c.relname = 'tablename'
           and a.attnum > 0 
           and a.attrelid = c.oid
           and a.atttypid = t.oid
         ORDER BY a.attnum
    ) tab
),
  
tables as (
    SELECT
        table_name,
        '- name: ' || table_name || E'\n'
        || '  description: "' || coalesce(tabcomment, '') || E'"\n'
        || E'  columns:\n' || string_agg('    ' || column_statement, E'\n') as table_desc
    FROM columns
    GROUP BY table_name,tabcomment
    ORDER BY table_name
)

SELECT
     E'# Generated automatically, please update after generation\n'
    || E'version: 2\n\n'
    || E'models:\n'
    || string_agg(table_desc, E'\n')
FROM tables;

@eng-rodrigocunha commented Mar 1, 2023

Adapting the schema.yml generator query shared earlier (which uses the information_schema of the generated tables), here is a version for BigQuery:

with columns as (
	select '- name: "' || column_name || '"' as column_statement,
		table_name
	from `<project>.<data set>`.INFORMATION_SCHEMA.COLUMNS

),
tables as (
select table_name,
"  - name: " || table_name || "\n" ||
"    columns:\n" ||
string_agg('      ' || column_statement, "\n") as yml_file
from columns
group by table_name
order by table_name
)
select
"version: 2\n" ||
"\n" ||
"models:\n" ||
string_agg(yml_file, "\n") as yml_file
from tables
