Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postgres: ability to create indexes #3106

Merged
merged 9 commits into from
Apr 27, 2021

Conversation

arzavj
Copy link
Contributor

@arzavj arzavj commented Feb 16, 2021

resolves #804

Description

Rough first pass at addressing the issue. Would love to get some thoughts on whether I'm on the right track. Would also love to get some help on adding a unit and integration test so that I can make sure that this code works.

I'm imaging that usage will look like:

{{ config(
    indexes=[
      { 'columns': ['column_a', 'column_b'], 'unique': false },
      { 'columns': ['column_a', 'column_c'], 'unique': true },
    ]
) }}

This PR gives developers the ability to:

  • create multiple indexes inside the same transaction as the create_table_as
  • create a mix of unique and non-unique indexes
  • create indexes on any number of columns
  • address the issue where indexes aren't recreated (as discussed in this discourse) by prepending the index name with a timestamp; I choose to prepend instead of append because if the column names are long or there are a number of columns in the index then postgres might truncate the index name and drop the timestamp
  • address the issue where sometimes only one index out of a few indexes is created because the indexes share the same first few columns -- for example, ['column_a', 'column_b'] and ['column_a', 'column_c']; what happens is that postgres truncates the final index name to fit a certain max length; so in this example postgres might drop the second column from the index name causing both index names to be the same (and hence only one of them is created); of course certain conditions need to be met in order to run into this problem (i.e. shared initial columns and possibly long column names)

Note that this does not support gin indexes and indexes on expressions of columns like lower(column_a)
Also note that I'm assuming snakecased column and table names.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot
Copy link

cla-bot bot commented Feb 16, 2021

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Arzav Jain.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@arzavj arzavj marked this pull request as draft February 16, 2021 05:54
Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice start here @arzavj, and thanks for taking so much into consideration for the first go!

Let's figure out the right handoff points between create_indexes and the materialization code. I don't think it makes sense to pass back a semicolon-delimited string and append it to post_hooks. I'm drawing inspiration from persist_docs, which feels like clear, compelling, and quite similar functionality.

Once we're settled on the syntax, we'll definitely need to add tests for the various permutations.

Also note that I'm assuming snakecased column and table names.

I think you're good! The model name, if quoted and weirdly cased, may "just work" because of how dbt prints {{ relation }}. If the columns are weirdly quoted/cased, the user will have to account for that when defining the indexes config.

@cla-bot
Copy link

cla-bot bot commented Mar 1, 2021

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Arzav Jain.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

1 similar comment
@cla-bot
Copy link

cla-bot bot commented Mar 1, 2021

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Arzav Jain.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@arzavj
Copy link
Contributor Author

arzavj commented Mar 1, 2021

@jtcohen6 please take a look at the updated PR based on the comments above. Some open questions I have:

  1. You mentioned "avoid any operation unless the model both has an index defined and is running on an adapter with an implementation of create_indexes". Is the approach I've take correct? How does one check to see whether an adapter has implemented get_create_index_sql?
  2. You mentioned updating the incremental materialization as well. Do we also need to take care of the seed and snapshot materializations?
  3. If I do the previous point, I could add validation to ensure that type is one of [btree, hash, gist, gin]. Do you recommend that or is it just better to let the query run and possibly fail with an invalid type?

Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @arzavj, this is really coming along!

  1. You mentioned "avoid any operation unless the model both has an index defined and is running on an adapter with an implementation of create_indexes". Is the approach I've take correct? How does one check to see whether an adapter has implemented get_create_index_sql?

If an adapter has implemented get_create_index_sql, dbt will use its version; otherwise, dbt will use default__get_create_index_sql.

As a feature, indexes are unique to Postgres among the "core four" databases, but they're pretty common out in the world of databases. That's why I think we should implement create_indexes in such a way that:

  • By default, it does nothing
  • At the same time, any database-agnostic plumbing is reusable (= included in the default implementation)
  • Any PostgreSQL-specific components are wrapped in the postgres__ implementation

The best way to accomplish this, to my mind, was two dispatch macros:

  • create_indexes: Called by materialization. Loops over indexes, calls get_create_index_sql for each. If SQL is returned, call run_query() for it; if nothing is returned, do nothing. All of this can be part of default__create_indexes; it's unlikely someone would need to change/override this, but you never know.
  • get_create_index_sql: Called by create_indexes, one index at a time. Returns the database-specific SQL for creating that index. default__get_create_index_sql should return None (most databases don't have indexes), instead of raising an exception (what you have now). postgres__get_create_index_sql should return the appropriate PostgreSQL (exactly as you have it now). An Oracle, SQLServer, or Materialize user out in the world could reimplement just get_create_index_sql with the right SQL, and still benefit from all of the default plumbing in create_indexes. (They'd also need to define a custom IndexConfig in the python portion of the adapter, but that's a different story.)
  1. You mentioned updating the incremental materialization as well. Do we also need to take care of the seed and snapshot materializations?

Yes!! Really good point, thank you for catching that. The seed materialization logic will look a lot like the incremental (I think {% if full_refresh_mode or not exists_as_table %}). I think the snapshot materialization logic can be even simpler ({% if not target_relation_exists %}), since there's no such thing as full-refreshing.

  1. If I do the previous point, I could add validation to ensure that type is one of [btree, hash, gist, gin]. Do you recommend that or is it just better to let the query run and possibly fail with an invalid type?

See comment below. I'll fully admit I'm not a Postgres pro; depending on how many index types there are, how customizable they are, and how frequently they change, it may make sense to validate in python (set in stone), in Jinja (standardized but can be overridden), or not at all.

@cla-bot
Copy link

cla-bot bot commented Mar 12, 2021

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Arzav Jain.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

1 similar comment
@cla-bot
Copy link

cla-bot bot commented Mar 12, 2021

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Arzav Jain.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@arzavj
Copy link
Contributor Author

arzavj commented Mar 12, 2021

@jtcohen6 I've responded to all the comments and made sure that the existing tests still pass! I think it's time to add tests for the code added in this PR.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Mar 16, 2021

Brilliant, the code is looking really good!

I think a new integration test is appropriate, 065_postgres_tests/test_indexes.py, that features:

  • a table model with indexes (run once)
  • an incremental model with indexes (run, run again, run --full-refresh)
  • a seed with indexes (seed, seed again, seed --full-refresh)
  • a snapshot with indexes (snapshot twice)

Each of those can be something silly, select 1 as fun or similar, as long as the defined indexes are valid.

And maybe also some failure cases:

  • Invalid index config (assert it raises the exceptions you've defined)
  • Index with a non-existing type (assert it fails / returns a database error)

When you get a chance, could you also sign the CLA, so I can make sure we'll be able to merge once ready?

@arzavj
Copy link
Contributor Author

arzavj commented Mar 25, 2021

Apologies for the delay on this @jtcohen6; I'm on vacation and will be back on the 29th. Will get to this then!

@cla-bot
Copy link

cla-bot bot commented Apr 6, 2021

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Arzav Jain.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

3 similar comments
@cla-bot
Copy link

cla-bot bot commented Apr 11, 2021

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Arzav Jain.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@cla-bot
Copy link

cla-bot bot commented Apr 11, 2021

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Arzav Jain.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@cla-bot
Copy link

cla-bot bot commented Apr 11, 2021

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Arzav Jain.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@arzavj
Copy link
Contributor Author

arzavj commented Apr 12, 2021

@jtcohen6 Please take a look at the tests I added. I also updated the changelog. I haven't added any documentation or comments. Please let me know if I should.

@arzavj arzavj marked this pull request as ready for review April 12, 2021 00:06
@arzavj
Copy link
Contributor Author

arzavj commented Apr 12, 2021

A couple more things:

  1. The unit tests are failing however when I run the same command python -m pytest -v test/unit locally they pass (with 1 skip). Weirdly though when I run make test I get ERROR: InterpreterNotFound: python3.8; not sure why it's trying to use python 3.8 when I'm on python 3.7.10.
  2. I'm struggling to figure out why the cla-bot is unable to recognize me. My git config has the correct email address as output by git config --list | grep email. This matches my primary Github email address.

@jtcohen6
Copy link
Contributor

@arzavj The tests are looking great! I haven't had a chance to do a deep dive, but I'm really happy to see everything you've done here.

Quick things:

  1. That Windows unit test has been failing on all PRs, and it's on us to fix it (Windows unit test consistently (or frequently?) failing #3233). In the meantime, we just merged a PR that disabled it for now, so if you pull from develop it should be skipped on the next one.
  2. I'm not positive, but think the issue is that one of your commits (2e5ff40) is co-authored by your GitHub account, and by... also your GitHub account? I'm not sure:

Screen Shot 2021-04-12 at 11 05 27 AM

@arzavj
Copy link
Contributor Author

arzavj commented Apr 12, 2021

You were right! Fixed the author of that culprit commit and also rebased on the latest develop.

@arzavj
Copy link
Contributor Author

arzavj commented Apr 21, 2021

hey @jtcohen6! Bumping this up on your radar :)

Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arzavj Sorry for the delay!

I love how this feature "just works," from the unique/type properties down to the guaranteed-unique index name. And the tests look amazing. Thank you so much for this contribution, and for all your thorough work here over the past few months!

This will be released in v0.20.0 :)

@jtcohen6 jtcohen6 merged commit a260d4e into dbt-labs:develop Apr 27, 2021
@arzavj
Copy link
Contributor Author

arzavj commented Apr 27, 2021

Thank you for the kind words @jtcohen6! Really appreciate all your help on this PR! Excited to update my code to use this new feature once v0.20.0 is released :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for postgres index creation?
3 participants