Support "cluster by" on Snowflake #634

Closed

drewbanin opened this issue Jan 5, 2018 · 13 comments

Labels: enhancement, good_first_issue, snowflake

@drewbanin (Contributor)

Docs: https://docs.snowflake.net/manuals/sql-reference/sql/create-table.html

The cluster by keyword also requires a column definition list in the create table as statement. Example:

create table "dbt_dbanin".test_cluster
  (id int, name string)
  cluster by (id)
as (
  select 1 as id,
         2 as name
);

This can be implemented with a custom table materialization override for Snowflake.

@drewbanin drewbanin added enhancement New feature or request snowflake labels Jan 5, 2018
@drewbanin drewbanin added this to the Adapter Parity milestone Mar 6, 2018
@drewbanin drewbanin added the good_first_issue Straightforward + self-contained changes, good for new contributors! label Mar 12, 2018
@drewbanin (Contributor, Author)

This implementation would mirror the sort and dist options for Redshift. An example of overriding create_table_as for a particular database can be found here.

jon-rtr added a commit to jon-rtr/dbt that referenced this issue Apr 7, 2018
Issue dbt-labs#634 add cluster_by support for snowflake data warehouses.
@drewbanin (Contributor, Author)

I'm kicking this out of the 0.10.2 milestone. The code to add cluster by (field_name) is pretty straightforward, but we also need to account for the column definition aspect of this statement. As noted in the description above, cluster by requires a column list when used with create table as.

So, this is not valid in Snowflake:

create table "dbt_dbanin".test_cluster
  cluster by(id)
as (
  select 1 as id
);

Whereas this is valid:

create table "dbt_dbanin".test_cluster (
  id int
) cluster by(id)
as (
  select 1 as id
);

dbt doesn't have a good mechanism for adding a column list to the create table as query, so I'm unsure exactly how we can implement this. One approach might be to use this syntax from the Snowflake docs:

ALTER TABLE <name> CLUSTER BY ( <expr1> [ , <expr2> ... ] )

dbt could run this as a sort of post-hook after the table is created. Conceivably, dbt could hide this implementation detail behind the cluster_by model config at some point in the future. I think we'll be better suited to tackle this after some upgrades to hooks roll out, so deprioritizing for now.
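
For illustration, such a post-hook can already be expressed in a model's config block (a sketch only; the model body and clustered column are made up):

-- sketch: set a clustering key right after dbt builds the table
{{ config(
    materialized='table',
    post_hook='alter table {{ this }} cluster by (id)'
) }}

select 1 as id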

Keen to discuss if anyone has any ideas or comments!

@gkushida
It's been a while since the last comment, but this would be a very nice feature to have.

Using ALTER TABLE sort of works, but it's just a metadata update - you'd be relying on automatic reclustering (still in preview, I think) to actually recluster the stored data. Or on an explicit RECLUSTER, which is apparently deprecated.

It's hard to do with a CTAS, but I think we can get there via a view and a CREATE TABLE ... LIKE instead:

CREATE VIEW tmp_view AS (SELECT 1 AS id);
CREATE TABLE clustered LIKE tmp_view CLUSTER BY (id);
INSERT INTO clustered (SELECT * FROM tmp_view);
DROP VIEW tmp_view;

Does this seem workable as a Snowflake-specific extension to the table materialization?

Supporting changes to clustering keys for incremental or other non-destructive tables could be tricky - maybe that's a better fit for automatic reclustering, or clustering keys could simply be set only when the table is first created.

@drewbanin (Contributor, Author)

@gkushida thanks for the note -- I did see the Snowflake update about alter table ... recluster being deprecated -- that is disappointing. In our experience, automatic clustering was pretty expensive!!

Your solution is pretty clever! I do think that would work, but I don't love the idea of rearchitecting dbt's approach to model building on Snowflake solely to support clustering.

My hope would be that Snowflake could drop the constraint that a column list is provided to add clustering to a table, since your example makes it pretty clear that that's not strictly necessary. I'd say: If you desperately want to include clustering in your dbt models, you can do that by overriding dbt's implementation of create_table_as for Snowflake.

You could do this by copying the snowflake__create_table_as macro into your macros/ directory and adding the logic from your comment into the macro SQL. You can read model configs inside the macro with config.get().

There are some drawbacks to this approach - we sometimes need to make changes to these macros in dbt-core, and they may not be compatible with your own version of the macro. It is an option that's available to you if you want to use it! Let me know what you end up doing, or if you have other ideas here :)

@gkushida
I suppose it depends on the size of your data - if your tables are small, then it doesn't really matter if you optimize clustering keys. But for large tables - especially time-series tables where users generally query a subset - we have found appropriate clustering keys to have a dramatic improvement in partitions-scanned and response time.

IMO the improvements to response-time and cost-efficiency outweigh the benefits of adapter consistency. Not having clustering keys in Snowflake is kind of like not having distribution and sort keys in Redshift. Yes, Snowflake should do some auto-magic optimization, but I think relying on that violates the dbt spirit of trivially re-creatable models.

Also, the BQ table materialization is already significantly different, and Snowflake is subtly different enough from the base table materialization - so dbt has already incurred some overhead for changes to table materializations.

I also ran into a similar issue with TRANSIENT schemas/tables - we would really like to make all dbt-generated objects TRANSIENT to save on fail-safe costs (fail-safe isn't necessary, since dbt models shouldn't hold unique state). Here there's no workaround, since objects can't be made transient after creation.

I was going to go down the 'forked-macro' route but that seemed like a last resort (better resort than waiting for Snowflake, though...).

@drewbanin (Contributor, Author)

Hey @gkushida - thanks for following up! I buy the argument that clustering keys are important and compelling on large Snowflake tables. I've given this some more thought, and I think there is a simple way to accomplish effective table clustering on Snowflake. Curious to hear your thoughts about this:

  1. Snowflake tables can be effectively clustered by just adding an order by clause to the create table ... as (...) statement. Creating a table with an order by clause is equivalent to creating an empty table with a clustering key, then inserting into that table. [1][2]
  2. We can update the snowflake__create_table_as macro to run an alter table ... cluster by (....) query after the table is created to add clustering keys for incremental models.

When 1 & 2 are combined, I think this accomplishes effective clustering for Snowflake tables for both dbt's table and incremental materializations. To see why, consider:

  • For table models, the order by clause arranges data in storage just as it would be if a clustering key had been set before the data was inserted
  • For incremental models, the first run of the model will result in an appropriately clustered table (see 1 above). For subsequent runs, the insert/delete (in the future, merge) DML will trigger automatic clustering for the table, ensuring that the data is clustered appropriately.

This also necessitates that we add a config to enable automatic clustering for Snowflake tables, which I think would be a good idea. Given that alter table ... recluster is deprecated and unavailable in new Snowflake accounts, I don't think there's any way around leveraging automatic clustering here. Let me know if I missed some aspect of your proposed strategy.

Note: by including an order by clause in both the initial table creation DDL and the subsequent incremental DML, the amount of data that automatic clustering needs to process is hopefully minimized.

[1] I unfortunately can't find a source for this, but I learned it from a Snowflake sales engineer on a call, and it empirically appears to be true!
[2] We can either require that Snowflake users add their own order by clauses, or we can wrap their SQL with something like select * from ( {{ model_sql }} ) order by {{ clustering_keys_csv }} (sketched below)
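
To make option [2] concrete, here's a rough sketch of an overridden snowflake__create_table_as (the cluster_by config name and the list handling are assumptions, not a settled design):

{% macro snowflake__create_table_as(temporary, relation, sql) -%}
  {#- assumed config: cluster_by, a list of column names -#}
  {%- set cluster_by_keys = config.get('cluster_by') -%}
  create or replace {% if temporary %}temporary{% endif %} table {{ relation }} as (
    {%- if cluster_by_keys is not none %}
    select * from ( {{ sql }} ) order by {{ cluster_by_keys | join(', ') }}
    {%- else %}
    {{ sql }}
    {%- endif %}
  );
{%- endmacro %}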

Last: our next release of dbt will actually include the ability to mark Snowflake tables as transient (docs, code).
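
In a model file that will look something like this (a sketch, assuming the config key is named transient per those docs):

-- sketch: opt a model out of fail-safe storage via the transient config
{{ config(materialized='table', transient=true) }}

select 1 as id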

Just threw a lot at you here -- this is a super neat discussion! Excited to hear what you think

@gkushida
@drewbanin Thank you for the thorough description! This definitely seems like a workable approach - we have heard similar things from our SA and had observed similar results in casual testing.

Re: automatic clustering - my understanding is that this is enabled by default for clustered tables if you're in the preview, so presumably it will also be enabled by default whenever it goes GA. If you want a flag to disable automatic reclustering, I think it would have to call ALTER TABLE ... SUSPEND RECLUSTER after the (clustered) table is first created.

Off the top of my head that flag doesn't seem terribly useful, though - if you know the table isn't going to change, then you don't really need to suspend reclustering. If it is going to change, then you probably want it to be fully-clustered. Maybe you would want to suspend in a data-migration or large-update scenario but that kind of work seems to be outside the usual dbt scope.

Another plus to your proposal is that we can do this today with an explicit ORDER BY and a post-hook. Once we have a first-class cluster-by (or whatever) config option for Snowflake materialization, we can clean up the model/config and avoid redundant column definitions.
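
Concretely, the ad-hoc version might look like this (a sketch; the model and column names are invented):

-- ad-hoc clustering today: explicit order by plus a clustering post-hook
{{ config(
    materialized='table',
    post_hook='alter table {{ this }} cluster by (event_date)'
) }}

select event_date, user_id, payload
from {{ ref('raw_events') }}
order by event_date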

I would suggest putting the set of cluster keys in the materialization config block rather than repeating them in an ORDER BY -- less opportunity for error, and better DRY-compliance.

I'll try and give the ad-hoc approach a shot soon -- also, apologies for the delays in responding, things are pretty busy here (as I assume they are over there)!

@drewbanin (Contributor, Author)

Hey @gkushida - all of this sounds really good to me. If you're indeed able to try this, I'd love to hear about how it goes! I think this is super doable in dbt, and I'd agree with you that generating the order by plus the cluster by from a config is the way to do this long-term.

If this is something you'd be interested in implementing eventually, I'd be super happy to help out however I can. Otherwise, let me know about your findings of the ad-hoc approach and I can prioritize this from there.

Thanks again for moving this along!

@tayloramurphy
In lieu of a Snowflake-specific clustering config, I think we're going to update our best practices to include an ORDER BY statement on all of our base models. Seems like an easy way to implement this, but +1 for having this as a configurable attribute!

@gkushida
Sounds good! Might want to add the bit about ALTER TABLE ... SUSPEND RECLUSTER in a post-hook as well. Until we did this, we were actually spending a decent amount of credits on unnecessary reclustering.
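
Something like this, for the record (a sketch; hooks run in order, and the column name is invented):

-- sketch: suspend automatic reclustering after each rebuild
{{ config(
    materialized='table',
    post_hook=[
        'alter table {{ this }} cluster by (event_date)',
        'alter table {{ this }} suspend recluster'
    ]
) }}

select current_date as event_date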

@bastienboutonnet (Contributor)

I think I might want to have a stab at the solution you proposed, @drewbanin: wrapping the query with an order by read from the config. Seems fairly straightforward.

@drewbanin (Contributor, Author)

Cool @bastienboutonnet - I think that sounds great! Definitely feel free to tag me in a PR (or follow up here) if there's anything I can help out with.

@drewbanin (Contributor, Author)

This was fixed in #1689
