[ADAP-508] [Feature] Honour `cluster_by` config for Python models #585

carlescere · 2023-05-03T13:07:48Z

Is this a new bug in dbt-snowflake?

I believe this is a new bug in dbt-snowflake
I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When creating an incremental model with dbt.config(cluster_by=['key']), the initial dbt run issues a CREATE OR REPLACE TABLE ... AS ... query. Unlike its SQL counterpart, this query is not followed by an ALTER TABLE ... CLUSTER BY ... query. Additionally, the SQL model attaches ORDER BY <clustering key> when creating the table to make clustering the data faster; this does not appear in the Python table creation.

Expected Behavior

Materialised Python models should trigger an ALTER TABLE ... CLUSTER BY ... query after table creation if cluster_by is configured.
Materialised Python models should ORDER BY the clustering key when the table is created.

Steps To Reproduce

In a dbt project with a Snowflake destination (dbt 1.5.0, dbt-snowflake 1.5.0), create a simple Python model:

def model(dbt, session):
    dbt.config(materialized='incremental')
    dbt.config(unique_key=['id'])
    dbt.config(incremental_strategy='merge')
    dbt.config(cluster_by=['id'])

    return dbt.source('data', 'table')

Run the model: dbt run --select python_test_model.
Observe the table creation query does not contain an ORDER BY clause.

CREATE  OR  REPLACE    TABLE  DATABASE_NAME.DATA_TRANSFORMATIONS.python_test_model AS  SELECT  *  FROM ( SELECT  *  FROM (DATABASE_NAME.PUBLIC.TABLE))

Observe there is no ALTER TABLE ... CLUSTER BY ... query.
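
Both observations can be checked mechanically against the statements captured for a run, for example from the debug log or from Snowflake's query history. The following is a minimal sketch: the helper name `clustering_statements_present` and its regular expressions are illustrative assumptions, not part of dbt or dbt-snowflake.

```python
import re

# Hypothetical helper: scan the SQL statements issued for one model run.
def clustering_statements_present(statements):
    """Return (has_order_by, has_cluster_by) for a list of SQL strings."""
    flags = re.IGNORECASE | re.DOTALL
    create_table = re.compile(r"\bcreate\b.*\btable\b", flags)
    order_by = re.compile(r"\border\s+by\b", flags)
    cluster_by = re.compile(r"\balter\s+table\b.*\bcluster\s+by\b", flags)

    # ORDER BY only counts when it appears in a table-creation statement.
    has_order_by = any(
        order_by.search(s) is not None
        for s in statements
        if create_table.search(s)
    )
    has_cluster_by = any(cluster_by.search(s) is not None for s in statements)
    return has_order_by, has_cluster_by

# The only statement observed for the Python model in this report.
observed = [
    "CREATE  OR  REPLACE    TABLE  DATABASE_NAME.DATA_TRANSFORMATIONS.python_test_model"
    " AS  SELECT  *  FROM ( SELECT  *  FROM (DATABASE_NAME.PUBLIC.TABLE))"
]
print(clustering_statements_present(observed))  # (False, False)
```

A run that honoured cluster_by would be expected to return (True, True) under this check.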

Relevant log output

12:44:09  Running with dbt=1.5.0
12:44:09  Unable to do partial parsing because config vars, config profile, or config target have changed
12:44:10  Found 167 models, 54 tests, 0 snapshots, 0 analyses, 322 macros, 0 operations, 1 seed file, 6 sources, 0 exposures, 0 metrics, 0 groups
12:44:10
12:44:12  Concurrency: 8 threads (target='dev')
12:44:12
12:44:12  1 of 1 START python incremental model DATA_TRANSFORMATIONS.python_test_model  [RUN]
12:44:36  1 of 1 OK created python incremental model DATA_TRANSFORMATIONS.python_test_model  [SUCCESS 1 in 24.74s]
12:44:36
12:44:36  Finished running 1 incremental model in 0 hours 0 minutes and 26.21 seconds (26.21s).
12:44:36
12:44:36  Completed successfully
12:44:36
12:44:36  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Environment

- OS: OSX 13.3.1
- Python: 3.9.13
- dbt-core: 1.5.0
- dbt-snowflake: 1.5.0

Additional Context

No response

dbeatty10 · 2023-05-03T19:28:01Z

Thanks for noticing this and reaching out @carlescere !

Adding the relevant cluster_by logic (as it does in its SQL counterpart) makes sense.

Could you share more details on the order by portion? I've seen similar recommendations like this, but it would be helpful to see more justification. Nothing stood out to me when I did a quick scan through the Snowflake docs for clustering keys.

carlescere · 2023-05-04T09:55:45Z

I am not an expert on Snowflake, and I don't think there is much public documentation about the automatic clustering internals, so if any Snowflake employee or someone more knowledgeable wants to confirm or deny this explanation, please do.

My understanding is that micro-partitions are created as data arrives (that is, in arrival order) and are immutable. In the background, the AUTOMATIC_CLUSTERING warehouse repartitions the data, creating new micro-partitions and updating the metadata. So my thinking is that if the data is ordered before the table is created, the micro-partitions created (more specifically, their metadata) will be closer to the final state. Less automatic clustering will then be needed and, with that, the AUTOMATIC_CLUSTERING warehouse will cost less.
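
The intuition above can be illustrated with a toy model. This is not Snowflake's actual algorithm: it simply cuts rows into fixed-size partitions in arrival order and records the minimum and maximum key per partition. The fewer partitions whose range covers a given key, the closer the layout already is to a clustered state. The partition size, row count, and helper names are made up for the illustration.

```python
import random

def partition_ranges(keys, partition_size):
    """Cut keys into partitions in arrival order; return (min, max) per partition."""
    chunks = (
        keys[i:i + partition_size] for i in range(0, len(keys), partition_size)
    )
    return [(min(chunk), max(chunk)) for chunk in chunks]

def partitions_covering(ranges, key):
    """Count partitions whose min/max metadata cannot rule out `key`."""
    return sum(1 for low, high in ranges if low <= key <= high)

random.seed(0)
keys = list(range(10_000))
random.shuffle(keys)  # rows arrive in no particular order

unsorted_ranges = partition_ranges(keys, 1_000)
sorted_ranges = partition_ranges(sorted(keys), 1_000)  # ORDER BY before writing

probe = 5_000
print(partitions_covering(unsorted_ranges, probe))  # many overlapping partitions
print(partitions_covering(sorted_ranges, probe))    # 1
```

In the sorted case each partition holds a disjoint key range, so only one partition can contain the probe key. In the shuffled case nearly every partition spans almost the whole key range, which is the overlap a background re-clustering process would have to remove.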

This reasoning is based on conversations in the Snowflake community forums, like the one you shared, and on limited experimentation on my side, so, sadly, I don't have hard proof of this.

Ultimately my point was based on the fact that the SQL counterpart does, in fact, ORDER BY the clustering key.

dbeatty10 · 2023-05-04T13:45:59Z

Thanks for that info @carlescere 👍

I'm going to re-label this as a feature since we treated this type of configuration as out-of-scope during our initial implementation of dbt python models for Snowflake.

Acceptance criteria

cluster_by config is applied to the resulting Snowflake table for dbt python models

Optional

Depending on whether it makes sense to include it:

automatic_clustering

Out of scope

Other configs such as transient and copy_grants

Considerations during implementation

Whether to include order by or not is left as an implementation decision. For context, discussion surrounding the original clustering implementation for dbt SQL models is here and here.

The current approach of table creation for Python models is to use the overwrite mode. There are two implementation options:

Use an ALTER TABLE ... CLUSTER BY ... query after the dataframe is written to the database table
Create an empty table ahead of time in SQL (with the clustering specified), then use the append mode when writing the dataframe.

The former is likely vastly easier to implement. The latter may be more efficient for large tables.
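
The two options differ mainly in the order of steps. The sketch below only builds the SQL text and names the dataframe write mode for each option; it is not dbt-snowflake's implementation. The function names, the step tuples, and the `column_ddl` argument are hypothetical, and identifiers are interpolated without quoting to keep the example short.

```python
def alter_after_write(relation, cluster_by):
    """Option 1: overwrite the table from the dataframe, then add clustering."""
    keys = ", ".join(cluster_by)
    return [
        ("write_dataframe", "overwrite"),
        ("sql", f"ALTER TABLE {relation} CLUSTER BY ({keys})"),
    ]

def create_then_append(relation, column_ddl, cluster_by):
    """Option 2: create an empty clustered table, then append the dataframe."""
    keys = ", ".join(cluster_by)
    return [
        ("sql", f"CREATE OR REPLACE TABLE {relation} ({column_ddl}) CLUSTER BY ({keys})"),
        ("write_dataframe", "append"),
    ]

relation = "DATABASE_NAME.DATA_TRANSFORMATIONS.python_test_model"

for step in alter_after_write(relation, ["id"]):
    print(step)

# The column type is an assumption; option 2 needs the schema up front.
for step in create_then_append(relation, "id NUMBER", ["id"]):
    print(step)
```

Option 2 has to know the column definitions before the dataframe is written, which is one reason it would likely take more work to implement than option 1.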

carlescere · 2023-05-04T15:39:24Z

Thanks @dbeatty10! 🎉

github-actions · 2023-11-01T01:45:59Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions · 2023-11-09T01:45:18Z

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

carlescere added the bug and triage labels May 3, 2023

github-actions bot changed the title ~~[Bug] Cluster by config not honoured by Python models.~~ [ADAP-508] [Bug] Cluster by config not honoured by Python models. May 3, 2023

dbeatty10 added awaiting_response and removed triage labels May 3, 2023

github-actions bot added triage and removed awaiting_response labels May 4, 2023

dbeatty10 added the enhancement label and removed the bug and triage labels May 4, 2023

dbeatty10 changed the title ~~[ADAP-508] [Bug] Cluster by config not honoured by Python models.~~ [ADAP-508] [Feature] Honour cluster_by config for Python models May 4, 2023

This was referenced May 9, 2023

[CT-2549] Enable cluster_by for dbt Python models dbt-labs/dbt-core#7561

Closed

[ADAP-548] [Feature] Flag to opt out of order by when cluster keys are specified #606

Open

github-actions bot added the Stale label Nov 1, 2023

github-actions bot closed this as not planned Nov 9, 2023
