Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAP-508] [Feature] Honour cluster_by config for Python models #585

Closed
2 tasks done
Tracked by #7561
carlescere opened this issue May 3, 2023 · 6 comments
Closed
2 tasks done
Tracked by #7561

[ADAP-508] [Feature] Honour cluster_by config for Python models #585

carlescere opened this issue May 3, 2023 · 6 comments
Labels
enhancement New feature or request Stale

Comments

@carlescere
Copy link

Is this a new bug in dbt-snowflake?

  • I believe this is a new bug in dbt-snowflake
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When creating an incremental model with dbt.config(cluster_by=['key']) the dbt run will create (the initial run) a CREATE OR REPLACE TABLE ... AS ... query. This query is not followed by an ALTER TABLE ... CLUSTER BY ... query as it does in its SQL counterpart. Additionally, the SQL model will attach ORDER BY <unique key> on the creation of the table to make the clusterisaltion of data faster; this does not appear in the Python table creation.

Expected Behavior

  1. Materialised Python models should trigger an ALTER TABLE ... CLUSTER BY ... query after table creation if there is config for cluster by.
  2. Materialised Python models should ORDER BY the unique key on the cluster by key.

Steps To Reproduce

  1. In a dbt project with a snowflake destination (dbt 1.5.0, dbt-snowflake 1.5.0).
  2. Create a simple python model:
def model(dbt, session):
    dbt.config(materialized='incremental')
    dbt.config(unique_key=['id'])
    dbt.config(incremental_strategy='merge')
    dbt.config(cluster_by=['id'])

    return dbt.source('data', 'table')
  1. Run the model: dbt run --select python_test_model.
  2. Observe the table creation query does not contain an ORDER BY clause.
CREATE  OR  REPLACE    TABLE  DATABASE_NAME.DATA_TRANSFORMATIONS.python_test_model AS  SELECT  *  FROM ( SELECT  *  FROM (DATABASE_NAME.PUBLIC.TABLE))
  1. Observe there is no ALTER TABLE ... CLUSTER BY ... query.

Relevant log output

12:44:09  Running with dbt=1.5.0
12:44:09  Unable to do partial parsing because config vars, config profile, or config target have changed
12:44:10  Found 167 models, 54 tests, 0 snapshots, 0 analyses, 322 macros, 0 operations, 1 seed file, 6 sources, 0 exposures, 0 metrics, 0 groups
12:44:10
12:44:12  Concurrency: 8 threads (target='dev')
12:44:12
12:44:12  1 of 1 START python incremental model DATA_TRANSFORMATIONS.python_test_model  [RUN]
12:44:36  1 of 1 OK created python incremental model DATA_TRANSFORMATIONS.python_test_model  [SUCCESS 1 in 24.74s]
12:44:36
12:44:36  Finished running 1 incremental model in 0 hours 0 minutes and 26.21 seconds (26.21s).
12:44:36
12:44:36  Completed successfully
12:44:36
12:44:36  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Environment

- OS: OSX 13.3.1
- Python: 3.9.13
- dbt-core: 1.5.0
- dbt-snowflake: 1.5.0

Additional Context

No response

@carlescere carlescere added bug Something isn't working triage labels May 3, 2023
@github-actions github-actions bot changed the title [Bug] Cluster by config not honoured by Python models. [ADAP-508] [Bug] Cluster by config not honoured by Python models. May 3, 2023
@dbeatty10
Copy link
Contributor

Thanks for noticing this and reaching out @carlescere !

Adding the relevant cluster_by logic (as it does in its SQL counterpart) makes sense.

Could you share more details on the order by portion? I've seen similar recommendations like this, but it would be helpful to see more justification. Nothing stood out to me when I did a quick scan through the Snowflake docs for clustering keys.

@carlescere
Copy link
Author

I am not an expert on snowflake and I don't think there's many public documentation about the auto clustering internals so, if any snowflake employee or someone more knowledgeable wants to confirm or deny the explanation please do.

My understanding is that micropartitions are created on data arrival (read ordering) and are immutable. In the background, the AUTOMATIC_CLUSTERING warehouse repartitions the data (creating new micropartitions and updating the metadata). So, my thinking is that by ordering the data prior to the creation of the table the micropartitions created (more specifically their metadata) will be closer to the final state hence less auto clustering will be needed and, with that, less cost on the AUTOMATIC_CLUSTERING warehouse.

This reason is based in conversations in the snowflake community forums like the one you shared and limited experimentation on my side so, sadly, I don't have hard proof on this.

Ultimately my point was based on the fact that the SQL counterpart does, in fact, ORDER BY the clustering key.

@dbeatty10
Copy link
Contributor

Thanks for that info @carlescere 👍

I'm going to re-label this as a feature since we treated this type of configuration as out-of-scope during our initial implementation of dbt python models for Snowflake.

Acceptance criteria

  • cluster_by config is applied to the resulting Snowflake table for dbt python models

Optional

Depending on if it makes sense to include or not:

Out of scope

Considerations during implementation

Whether to include order by or not is left as an implementation decision. For context, discussion surrounding the original clustering implementation for dbt SQL models is here and here.

The current approach of table creation for Python models is to use the overwrite mode. There are two implementation options:

  1. Use an ALTER TABLE ... CLUSTER BY ... query after the dataframe is written to the database table
  2. Create an empty table ahead of time in SQL (with the clustering specified), then use the append mode when writing the dataframe.

The former is likely vastly easier to implement. The latter may be more efficient for large tables.

@dbeatty10 dbeatty10 added enhancement New feature or request and removed bug Something isn't working triage labels May 4, 2023
@dbeatty10 dbeatty10 changed the title [ADAP-508] [Bug] Cluster by config not honoured by Python models. [ADAP-508] [Feature] Honour cluster_by config for Python models May 4, 2023
@carlescere
Copy link
Author

Thanks @dbeatty10! 🎉

Copy link
Contributor

github-actions bot commented Nov 1, 2023

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Nov 1, 2023
Copy link
Contributor

github-actions bot commented Nov 9, 2023

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Stale
Projects
None yet
Development

No branches or pull requests

2 participants