Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-1544] [Bug] Should get_catalog() respect an overriden source() builtin? #6308

Closed
2 tasks done
jeremyyeo opened this issue Nov 23, 2022 · 2 comments
Closed
2 tasks done
Labels
bug Something isn't working wontfix Not a bug or out of scope for dbt-core

Comments

@jeremyyeo
Copy link
Contributor

jeremyyeo commented Nov 23, 2022

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

If we have an overriden source() function - should the get_catalog() macros (Snowflake example) respect it? Currently it does not.

Expected Behavior

Catalog introspection should respect an overriden source() function.

Steps To Reproduce

Setup our project:

# ~/.dbt/profiles.yml
snowflake:
  target: dev
  outputs:
    dev:
      database: development_jyeo
      ...

# dbt_project.yml
name: my_dbt_project
profile: snowflake
config-version: 2
version: 1.0

# models/sources.yml
version: 2
sources:
  - name: my_sources
    database: jyeo  # Note that this doesn't exist - we will fix it in our source() override.
    tables:
      - name: source_bar
-- models/foo.sql
select * from {{ source('my_sources', 'source_bar') }}

-- macros/override.sql
{% macro source(source_name, table_name) %}
    {% set rel = builtins.source(source_name, table_name) %}
    {% if not rel.database.startswith("raw_") %}
        {% set new_db = "raw_" ~ rel.database %}
        {% set rel = rel.replace_path(database = new_db) %}
    {% endif %}
    {% do return(rel) %}
{% endmacro %}

Run and inspect generated queries:

$ dbt --debug run
...
07:40:17.218401 [debug] [Thread-1  ]: On model.my_dbt_project.foo: /* {"app": "dbt", "dbt_version": "1.3.0", "profile_name": "snowflake", "target_name": "dev", "node_id": "model.my_dbt_project.foo"} */
create or replace  view development_jyeo.dbt_jyeo.foo
  
   as (
    select * from raw_jyeo.my_sources.source_bar
  );
07:40:17.219246 [debug] [Thread-1  ]: Opening a new connection, currently in state closed
07:40:19.362508 [debug] [Thread-1  ]: SQL status: SUCCESS 1 in 2.14 seconds
...

Now let's try a docs generate:

$ dbt --debug docs generate
...
07:41:21.484776 [debug] [ThreadPool]: On jyeo.information_schema: /* {"app": "dbt", "dbt_version": "1.3.0", "profile_name": "snowflake", "target_name": "dev", "connection_name": "jyeo.information_schema"} */
with tables as (

          select
              table_catalog as "table_database",
              table_schema as "table_schema",
              table_name as "table_name",
              table_type as "table_type",
              comment as "table_comment",

              -- note: this is the _role_ that owns the table
              table_owner as "table_owner",

              'Clustering Key' as "stats:clustering_key:label",
              clustering_key as "stats:clustering_key:value",
              'The key used to cluster this table' as "stats:clustering_key:description",
              (clustering_key is not null) as "stats:clustering_key:include",

              'Row Count' as "stats:row_count:label",
              row_count as "stats:row_count:value",
              'An approximate count of rows in this table' as "stats:row_count:description",
              (row_count is not null) as "stats:row_count:include",

              'Approximate Size' as "stats:bytes:label",
              bytes as "stats:bytes:value",
              'Approximate size of the table as reported by Snowflake' as "stats:bytes:description",
              (bytes is not null) as "stats:bytes:include",

              'Last Modified' as "stats:last_modified:label",
              to_varchar(convert_timezone('UTC', last_altered), 'yyyy-mm-dd HH24:MI'||'UTC') as "stats:last_modified:value",
              'The timestamp for last update/change' as "stats:last_modified:description",
              (last_altered is not null and table_type='BASE TABLE') as "stats:last_modified:include"

          from jyeo.INFORMATION_SCHEMA.tables

      ),

      columns as (

          select
              table_catalog as "table_database",
              table_schema as "table_schema",
              table_name as "table_name",

              column_name as "column_name",
              ordinal_position as "column_index",
              data_type as "column_type",
              comment as "column_comment"

          from jyeo.INFORMATION_SCHEMA.columns
      )

      select *
      from tables
      join columns using ("table_database", "table_schema", "table_name")
      where (upper("table_schema") = upper('my_sources'))
      order by "column_index"
07:41:21.487834 [debug] [ThreadPool]: Opening a new connection, currently in state init
07:41:23.980505 [debug] [ThreadPool]: Snowflake adapter: Snowflake query id: 01a87f2c-0502-3448-000d-37830a9859a2
07:41:23.981238 [debug] [ThreadPool]: Snowflake adapter: Snowflake error: 002003 (02000): SQL compilation error:
Database 'JYEO' does not exist or not authorized.
...

Docs generate tries to look up the jyeo database instead of our business logic specified in the overriden source() macro. Since the jyeo database does not exists - we error.

Relevant log output

No response

Environment

- OS: macOS
- Python: 3.8.7
- dbt: 1.3

Which database adapter are you using with dbt?

snowflake

Additional Context

I did a full walkthrough in this loom https://www.loom.com/share/c87391ba33f14304b23ff012efb71f0c on this issue but I think some users are convinced that the catalog queries should respect the business logic we put into the source() macro.

@jeremyyeo jeremyyeo added bug Something isn't working triage labels Nov 23, 2022
@github-actions github-actions bot changed the title [Bug] Should get_catalog() respect an overriden source() builtin? [CT-1544] [Bug] Should get_catalog() respect an overriden source() builtin? Nov 23, 2022
@tphillip33
Copy link

If the docs are going to call information_schema to get database stats, it needs to respect the overridden source() function in order to access the correct database.

@jtcohen6
Copy link
Contributor

@jeremyyeo Thanks for opening, and providing the thorough write-up / easy repro case!

I do get where this expectation is coming from; it also seems like it would be a tricky thing for us to implement.

The {{ source() }} macro doesn't actually change the relational information (database.schema.identifier) configured for the underlying source — it just changes the code rendered in downstream models trying to access that source.

When dbt goes to generate docs, it looks at the sources' configs in the manifest to determine which schemas need cataloguing (in _get_catalog_schemas. The {{ source() }} macro isn't involved in the process at all entirely.

I'm guessing this use of the custom source() macro is a workaround for the lack of programmatic configurability of source configs, e.g. to set a source property based on environment, a la generate_schema_name for models. We've had other issues like that in the past; #4753 has the most thorough discussion about the potential paths we could take, and links to several other issues.

I'm going to close this issue for now, but happy to continue discussing it further in this thread.


Much longer-term thought: I could see this working differently in a future where catalog information is gathered differently. Rather than being pulled in big batches all at once from information_schema tables during a standalone docs generate, each table's cataloguing would happen right after its model is materialized, during the run and within the same thread. (Some on that here: #5325 (comment).)

It would be more feasible to respect that custom ref/source... though still not guaranteed, since other relation identifiers, such as {{ this }}, would still just be templating relations based on the node's actual config. I think that discrepancy is going to run into rough edges whichever way we slice it.

Also, in that future, when would sources be catalogued? I'm not totally sure! It's a ways off yet.

@jtcohen6 jtcohen6 closed this as not planned Won't fix, can't repro, duplicate, stale Nov 23, 2022
@jtcohen6 jtcohen6 added wontfix Not a bug or out of scope for dbt-core and removed triage labels Nov 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working wontfix Not a bug or out of scope for dbt-core
Projects
None yet
Development

No branches or pull requests

3 participants