Feature: Command to auto-generate schema.yml files #1082
Comments
Great idea. Shocked we don't have an issue for this already! This becomes a much easier problem if dbt can just inspect existing relations built from your model SQL. That would require you to run your models first, but I imagine that's more in line with a typical development workflow anyway, right? |
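As a rough sketch of that relation-inspection approach (the macro name and YML layout here are hypothetical, not dbt's, and it assumes the model has already been built in the warehouse):

```sql
{# Hypothetical helper: read columns off an existing relation and print a schema.yml stub #}
{% macro gen_model_yml(database_name, schema_name, table_name) %}
    {% set rel = adapter.get_relation(database=database_name, schema=schema_name, identifier=table_name) %}
    {% set lines = ['version: 2', 'models:', '- name: ' ~ table_name | lower, '  columns:'] %}
    {% for col in adapter.get_columns_in_relation(rel) %}
        {% do lines.append('    - name: ' ~ col.name | lower) %}
    {% endfor %}
    {% do log(lines | join('\n'), info=True) %}
{% endmacro %}
```

This could be invoked with something like `dbt run-operation gen_model_yml --args '{database_name: analytics, schema_name: dbt_dev, table_name: my_model}'`.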
I think it would be cool if we could configure the level at which the schema file is generated, maybe from the project.yml. I can see cases where having one schema file for the entire project would be nice, but I can definitely see an argument for having a {{model}}_schema.yml created in a folder structure that mirrors the model structure, with a schemas folder created at the root directory of the project. |
@kevinsanz93 thanks for raising this. I have always used separate schema files, one for each model, and had been assuming it would be implemented this way, but clearly that's not a good assumption! |
A bootstrap command to populate a dbt schema.yml is a good idea! |
I wrote a Postgres query that generates the file contents, one row per table:

```sql
with "columns" as (
    select '- name: "' || column_name || '"' as column_statement,
           table_name
    from information_schema."columns"
    where table_schema = 'schema_name'
)
select table_name,
       'version: 2
models:
- name: "' || table_name || '"
  columns:
' || string_agg('    ' || column_statement, E'\n') || E'\n' as yml_file
from "columns"
group by table_name
order by table_name;
```
|
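For a table orders with columns order_id and ordered_at (illustrative names), the query's yml_file value comes out roughly like this:

```yaml
version: 2
models:
- name: "orders"
  columns:
    - name: "order_id"
    - name: "ordered_at"
```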
@thalesmello this is super cool! Thanks for sharing :) |
For Snowflake:
|
It would still be great if a dbt macro that generates SQL could also generate YML files at the same time. We are now using a dbt macro to generate a series of views in Snowflake. These views unpack JSON data into discrete columns using source-system information_schema metadata; as an example, each generated view has a column derived from that metadata. In a similar fashion, we'd like to add more data quality tests to these YML files in a metadata-driven way! |
For SQL Server:

```sql
DECLARE @schema_yaml varchar(max) = '';

WITH [columns] AS (
    SELECT [table] = '- name: ' + TABLE_NAME + char(13) + char(10)
                   + '  description: ' + char(13) + char(10)
                   + '  columns: ' + char(13) + char(10)
          ,[column] = '  - name: ' + COLUMN_NAME + char(13) + char(10)
                    + '    description: ' + char(13) + char(10)
          ,[column_index] = ORDINAL_POSITION
          ,[table_schema] = '- ' + TABLE_SCHEMA + char(13) + char(10) + char(13) + char(10)
    FROM INFORMATION_SCHEMA.COLUMNS
)
,[tables] AS (
    SELECT [table_full] = [table] + STRING_AGG(CAST([column] as varchar(max)), '') WITHIN GROUP (ORDER BY [column_index])
          ,[table_schema]
    FROM [columns]
    GROUP BY [table], [table_schema]
)
,[schema] AS (
    SELECT [table_schema_full] = [table_schema] + STRING_AGG([table_full], '') WITHIN GROUP (ORDER BY [table_full])
    FROM [tables]
    GROUP BY [table_schema]
)
,[final] AS (
    SELECT [text] = STRING_AGG([table_schema_full], '')
    FROM [schema]
)
SELECT @schema_yaml = [text] FROM [final]
PRINT @schema_yaml

---- if result is longer than 8000 characters, you have to print in iterations
--DECLARE @i int = 1
--WHILE EXISTS(SELECT(SUBSTRING(@schema_yaml,@i,4000))) and (@i < LEN(@schema_yaml))
--BEGIN
--    PRINT SUBSTRING(@schema_yaml,@i,4000)
--    SET @i = @i + 4000
--END
```
|
For BigQuery:

```sql
WITH columns AS (
    SELECT
        '- name: "'
        || column_name
        || '"\n      description: '
        || lower(column_name)
        || '\n'
        || '      tests:\n' AS column_statement,
        table_name
    FROM (
        SELECT *
        FROM `<project>.<data set>`.INFORMATION_SCHEMA.COLUMNS
        WHERE table_name = '<table name>'
    )
),
tables AS (
    SELECT
        table_name,
        '- name: ' || table_name || '\n'
        || '  description:\n\n'
        || '  columns:\n' || string_agg('    ' || column_statement, '\n') AS table_desc
    FROM columns
    GROUP BY table_name
    ORDER BY table_name
)
SELECT
    '# Generated automatically, please update after generation\n'
    || 'version: 2\n\n'
    || 'models:\n'
    || string_agg(table_desc, '\n')
FROM tables;
```
|
Variant of the above script that works on Redshift
|
I think we should extend the scope: not only initialize a new schema*.yml file, but also sync existing schema*.yml files with the current models. |
A couple of coworkers and I built a tool to help automate this process. The repo's README covers the short explanation, basic usage, and notes. I hope this is helpful and works for some of you! |
The snowflake script for generating
|
Here's my version of these generation helpers.

Models macro:

```sql
{% macro gen_models(database_name, schema_name) %}
{% set sql %}
with "columns" as (
select '- name: ' || lower(column_name) || '\n      description: "' || lower(column_name) || ' (snowflake data type: ' || lower(DATA_TYPE) || ')"'
as column_statement,
table_name,
column_name
from {{ database_name }}.information_schema.columns
where table_schema = '{{ schema_name | upper }}'
),
tables as (
select table_name,
       '
- name: ' || lower(table_name) || '
  columns:
' || listagg('    ' || column_statement || '\n') within group ( order by column_name ) as table_desc
from "columns"
group by table_name
)
select '# This file was generated automatically. Please place desired portions into the project manually.
version: 2
models:' || listagg(table_desc) within group ( order by table_name )
from tables;
{% endset %}
{%- call statement('generator', fetch_result=True) -%}
{{ sql }}
{%- endcall -%}
{%- set states=load_result('generator') -%}
{%- set states_data=states['data'] -%}
{%- set states_status=states['response'] -%}
{# we log as warning so it's easier for jq to find in the output #}
{% do exceptions.warn(states_data[0][0]) %}
{% endmacro %}
```

Sources macro:

```sql
{% macro gen_sources(database_name, schema_name) %}
{% set sql %}
with "columns" as (
select '- name: ' || lower(column_name) || '\n            description: "' || lower(column_name) || ' (snowflake data type: ' || lower(DATA_TYPE) || ')"'
as column_statement,
table_name,
column_name
from {{ database_name }}.information_schema.columns
where table_schema = '{{ schema_name | upper }}'
),
tables as (
select table_name,
       '
      - name: ' || lower(table_name) || '
        columns:
' || listagg('          ' || column_statement || '\n') within group ( order by column_name ) as table_desc
from "columns"
group by table_name
)
    select '# Generated automatically, please update after generation
version: 2
sources:
  - name: {{ schema_name }}
    description: you fill this out
    database: you fill this out. It might need some jinja logic.
    schema: {{ schema_name }}
    loader: where does this source originate from?
    loaded_at_field: what field indicates the last time a row was updated from the source?
    tables:' || listagg(table_desc) within group ( order by table_name )
from tables;
{% endset %}
{%- call statement('generator', fetch_result=True) -%}
{{ sql }}
{%- endcall -%}
{%- set states=load_result('generator') -%}
{%- set states_data=states['data'] -%}
{%- set states_status=states['response'] -%}
{# we log as warning so it's easier for jq to find in the output #}
{% do exceptions.warn(states_data[0][0]) %}
{% endmacro %}
```

Script:

For usability, I have a script that does this:

```sh
# Sample usage:
# models:
# ./scripts/generate_yml.sh models your_db_name your_schema_name
# sources:
# ./scripts/generate_yml.sh sources your_db_name your_schema_name
echo "Generating your $1..."
if dbt_logs=$(dbt --log-format json run-operation gen_"$1" --args "{database_name: $2, schema_name: $3}"); then
echo "Model generation succeeded. Parsing dbt logs to obtain query output."
if generated_yml=$(jq -rs '. | map(select(.level == "warn")) | first | .data.msg' <<<"$dbt_logs"); then
echo "Found query output in logs. Writing to generated_$1.yml"
echo "$generated_yml" >generated_"$1".yml
exit 0
else
echo "Error occurred locating yml in dbt query output. Check the dbt log file for more info."
exit 1
fi
else
echo "Error occurred running dbt or querying snowflake. Ensure your arguments are valid and try again."
echo "If things still aren't working, your dbt project is probably mis-configured. Run 'dbt compile' to see what might be wrong."
exit 1
fi
```
|
I wanted to mention the dbt-codegen package, which also accomplishes these objectives. |
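For anyone landing here: dbt-codegen ships these generators as run-operations. A typical invocation (the model and schema names are illustrative) looks like:

```sh
# print boilerplate YAML for one or more existing models
dbt run-operation generate_model_yaml --args '{"model_names": ["my_model"]}'

# print a source definition for an existing schema
dbt run-operation generate_source --args '{"schema_name": "raw_jaffle_shop"}'
```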
For PostgreSQL:
|
For BigQuery:
|
Feature description
dbt should have a command to bootstrap a `schema.yml` file (or, if you ask me, files) for a given model: `dbt bootstrap_schema --model my_model`
This should create one `model_name.yml` file for every model, with the appropriate boilerplate (model names and column names) filled in. It should warn and skip if `model_name.yml` already exists.
Who will this benefit?
When creating a new dbt project from scratch, creating all of the testing files by hand is a PITA. This will help encourage people to use tests by alleviating the need for boilerplate copy/pasting.
How to achieve?
In Redshift, you can get the column names by creating a view in an unused namespace and then inspecting the relevant system tables. Not sure about BQ / SF, but I'm hopeful there's an equivalent method.
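A minimal sketch of that Redshift flow (the schema and column names are hypothetical, and the select stands in for the compiled model SQL):

```sql
-- build the model as a view in a scratch schema...
create schema if not exists dbt_bootstrap_tmp;

create or replace view dbt_bootstrap_tmp.my_model as
select 1 as order_id, current_date as ordered_at;  -- stand-in for the compiled model SQL

-- ...then read its column names back out of the catalog
select column_name, data_type
from information_schema.columns
where table_schema = 'dbt_bootstrap_tmp'
  and table_name = 'my_model'
order by ordinal_position;
```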