Move index creation inside CREATE TABLE for massive database creation speedup #273

Status: Open (1 commit, base: master)

davidshepherd7
Hi!

We're in the process of migrating to cockroachdb. We found that naively creating dev databases using sqlalchemy and cockroachdb was dramatically slower than postgres (~20 minutes vs ~30 seconds). We narrowed most of this time down to CREATE INDEX statements.

@data-matt suggested creating the indexes inside CREATE TABLE rather than as separate statements. This makes a huge difference, bringing our overall db initialisation time down to ~2m30.
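To make the two DDL shapes concrete, here is a minimal stdlib-only sketch that renders both. The table, column, and index names are hypothetical and the helper functions are illustrative only; they are not part of sqlalchemy or of this PR.

```python
def create_table_then_indexes(table, columns, indexes):
    """Default shape: one CREATE TABLE, then one CREATE INDEX per index."""
    cols_sql = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    statements = [f"CREATE TABLE {table} ({cols_sql})"]
    for name, cols in indexes:
        statements.append(f"CREATE INDEX {name} ON {table} ({', '.join(cols)})")
    return statements


def create_table_with_inline_indexes(table, columns, indexes):
    """Proposed shape: a single CREATE TABLE that also defines the indexes."""
    parts = [f"{name} {sql_type}" for name, sql_type in columns]
    parts += [f"INDEX {name} ({', '.join(cols)})" for name, cols in indexes]
    return f"CREATE TABLE {table} (\n    " + ",\n    ".join(parts) + "\n)"


# Hypothetical schema, for illustration only.
COLUMNS = [("id", "INT PRIMARY KEY"), ("email", "STRING"), ("created_at", "TIMESTAMP")]
INDEXES = [("ix_users_email", ["email"]), ("ix_users_created_at", ["created_at"])]

separate = create_table_then_indexes("users", COLUMNS, INDEXES)
inline = create_table_with_inline_indexes("users", COLUMNS, INDEXES)

print(f"{len(separate)} statements in the default shape, 1 in the inline shape")
print(inline)
```

The inline form sends one statement per table instead of one per table plus one per index, which is where we'd expect the saving to come from.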

So I've experimented with getting sqlalchemy to do it this way. The only approach I could find is to have visit_create_table do the index creation and visit_create_index be a no-op. I've got some prototype code here which works for how we use sqlalchemy at Wave.

I don't know sqlalchemy's internals very well so I'm kind of uncertain about my approach, in particular:

  • Are there important cases where sqlalchemy could emit DDL for the index without emitting it for the table, e.g. does it have any native migration generation which does this?
  • I can't see any places in built-in sqlalchemy DDL Compilers where they return a no-op from a visit function. Is this just a completely insane idea?

Do you have any ideas/thoughts?

Then my other question: if this works, how do we release it? Presumably this is a breaking change, so do we need to put it behind some kind of config flag?

@davidshepherd7 changed the title from "Moving index creation inside CREATE TABLE for massive database creation speedup" to "Move index creation inside CREATE TABLE for massive database creation speedup" on Jul 18, 2025
index = element.target
assert isinstance(index, Index)
was_created = index.info.get("_cockroachdb_index_created_by_create_table", False)
assert was_created

@davidshepherd7 davidshepherd7 Jul 18, 2025

If we do need to handle emitting CREATE INDEX in cases where we aren't also creating the corresponding table, then we might be able to do that here with something like:

if not was_created:
    return compiler.visit_create_index(...)

(Assuming that sqlalchemy always does index creations after the corresponding table creation.)
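A toy model of that fallback, as a sketch only: `ToyIndex` and the two functions below are hypothetical stand-ins written in plain Python, not sqlalchemy's compiler API. Only the `info` key name is taken from the prototype. It assumes, as above, that the table is always visited before its indexes.

```python
INFO_KEY = "_cockroachdb_index_created_by_create_table"  # key name from the prototype


class ToyIndex:
    """Hypothetical stand-in for an index object carrying an `info` dict."""

    def __init__(self, name, table, columns):
        self.name = name
        self.table = table
        self.columns = columns
        self.info = {}


def visit_create_table(table, column_specs, indexes):
    """Emit CREATE TABLE with inline indexes and mark each index as created."""
    parts = list(column_specs)
    for index in indexes:
        parts.append(f"INDEX {index.name} ({', '.join(index.columns)})")
        index.info[INFO_KEY] = True
    return f"CREATE TABLE {table} ({', '.join(parts)})"


def visit_create_index(index):
    """No-op for indexes already emitted inline, standalone DDL otherwise."""
    if index.info.get(INFO_KEY, False):
        return ""
    return f"CREATE INDEX {index.name} ON {index.table} ({', '.join(index.columns)})"


# An index created together with its table, and one added to an existing table.
inline_ix = ToyIndex("ix_users_email", "users", ["email"])
later_ix = ToyIndex("ix_users_name", "users", ["name"])

table_sql = visit_create_table(
    "users", ["id INT PRIMARY KEY", "email STRING"], [inline_ix]
)
print(table_sql)
print(repr(visit_create_index(inline_ix)))
print(visit_create_index(later_ix))
```

One thing this sketch makes visible: the flag lives on the index object, so reusing the same objects to create a second database would skip the index unless the flag is reset. A real implementation would probably need to account for that.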

IDX_USING = re.compile(r"^(?:btree|hash|gist|gin|[\w_]+)$", re.I)


# Heavily based on DDLCompiler.visit_create_index

@davidshepherd7 davidshepherd7 Jul 18, 2025

This is almost PostgresqlDDLCompiler.visit_create_index. Differences are:

  • Removed the CREATE and ON {table_name} bits
  • Replaced USING with INVERTED (seems to be needed for crdb?)
  • Removed or commented out some features that I don't think crdb supports.

In the final version I would clean this up a lot more.

I don't think we should attempt to reuse PostgresqlDDLCompiler.visit_create_index: the string munging required for that would be quite brittle.

@data-matt
@rafiss

2 participants