adapter.get_columns_in_relation has extremely poor performance as the number of glue databases and tables increase #2854

Closed
1 of 5 tasks
brangisom opened this issue Oct 27, 2020 · 1 comment · Fixed by #2855
Labels
bug Something isn't working good_first_issue Straightforward + self-contained changes, good for new contributors! redshift

Comments

@brangisom
Contributor

Describe the bug

Background

The Redshift svv_external_* views that describe Redshift Spectrum are essentially wrappers over API calls to the AWS Glue Catalog. The get-tables endpoint requires a database-name and optionally takes an expression parameter, a regex pattern to match table names. The search-tables endpoint takes a JMESPath filter param. In either case, a query with no filter on schemaname requires Redshift to query for and retrieve every Glue table in every schema available to the user querying the svv_external_columns view.
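To make the fan-out concrete, here is a minimal sketch in Python. It does not talk to AWS: `FakeGlueCatalog` and `external_tables` are hypothetical names invented for illustration, and the model assumes exactly one catalog call per database (the real API is paginated, so the real cost would likely be higher).

```python
from dataclasses import dataclass


@dataclass
class FakeGlueCatalog:
    """Hypothetical in-memory stand-in for the Glue Catalog that counts calls."""

    databases: dict
    calls: int = 0

    def get_tables(self, database_name):
        # Models one catalog round trip for one database (assumption: no paging).
        self.calls += 1
        return self.databases[database_name]


def external_tables(catalog, schemaname=None):
    """Return (schema, table) pairs, optionally restricted to a single schema."""
    names = [schemaname] if schemaname is not None else list(catalog.databases)
    return [(db, table) for db in names for table in catalog.get_tables(db)]


# Ten catalog databases with 1000 tables each, as in the reproduction steps.
catalog = FakeGlueCatalog(
    {f"db_{i}": [f"table_{j}" for j in range(1000)] for i in range(10)}
)

# Filter applied late: fetch everything, then filter on the schema name.
all_rows = external_tables(catalog)
late = [row for row in all_rows if row[0] == "db_3"]
calls_without_pushdown = catalog.calls

# Filter applied early: only the requested schema is fetched.
catalog.calls = 0
early = external_tables(catalog, schemaname="db_3")
calls_with_pushdown = catalog.calls

print(calls_without_pushdown, len(all_rows))
print(calls_with_pushdown, len(early))
```

Both approaches return the same rows for `db_3`. The difference is how many catalog round trips and how many rows it takes to get there.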

Issue

While testing an upgrade to 0.18.1 for its better external table support, I noticed significant performance issues in our dbt project. I traced them to the external columns support added to the adapter.get_columns_in_relation macro. The `table_schema = '{{ adapter }}'` filter is not pushed down any further than the union CTE. As a result, Redshift has to make API calls across all of the Glue catalog databases and bring the results over, and only then unions them and filters on the schema name. For anyone with a large number of external tables, this is a heavy performance cost.

Steps To Reproduce

  1. Start with an empty svv_external_schemas
  2. Create an external schema and catalog with a handful of tables
  3. Build a simple model over one of these tables to benchmark
  4. Create 10 or so other Glue catalog databases, and fill them with 1000 tables each
  5. Rerun model in step 3 to benchmark

Expected behavior

System table queries should be performant and should not add significant overhead to a dbt run. More specifically, when querying for columns in svv_external_columns, the where clause should filter against the svv_external_columns.schemaname column so that the number of API calls is minimized.

Screenshots and log output

Explain plan of the query generated by the adapter.get_columns_in_relation macro. Notice that the filter is only applied AFTER the union. The svv_external_columns view is a wrapper over Glue API calls, so the filter needs to be put inside the CTE.
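The rewrite described here can be sketched with Python's built-in sqlite3 module. This is not the macro's actual SQL, and SQLite says nothing about how Redshift plans the query. The tables `internal_columns` and `external_columns` are stand-ins invented for illustration, and the sketch only shows that moving the filter into each branch of the union returns the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stand-in for the regular catalog columns (hypothetical table).
    create table internal_columns (
        table_schema text, table_name text, column_name text, ordinal_position integer
    );
    -- Stand-in for svv_external_columns (hypothetical table).
    create table external_columns (
        schemaname text, tablename text, columnname text, columnnum integer
    );
    insert into internal_columns values
        ('analytics', 'orders', 'id', 1),
        ('analytics', 'orders', 'amount', 2),
        ('staging', 'orders', 'id', 1);
    insert into external_columns values
        ('spectrum', 'events', 'event_id', 1),
        ('spectrum', 'events', 'ts', 2),
        ('other_glue_db', 'events', 'event_id', 1);
    """
)

# Shape reported in the issue: the filter sits outside the union.
FILTER_AFTER_UNION = """
with unioned as (
    select table_schema, table_name, column_name, ordinal_position
    from internal_columns
    union all
    select schemaname, tablename, columnname, columnnum
    from external_columns
)
select * from unioned
where table_schema = ? and table_name = ?
order by ordinal_position
"""

# Proposed shape: each branch filters on its own schema and table columns.
FILTER_INSIDE_BRANCHES = """
with unioned as (
    select table_schema, table_name, column_name, ordinal_position
    from internal_columns
    where table_schema = ? and table_name = ?
    union all
    select schemaname, tablename, columnname, columnnum
    from external_columns
    where schemaname = ? and tablename = ?
)
select * from unioned
order by ordinal_position
"""


def columns_late(schema, table):
    return conn.execute(FILTER_AFTER_UNION, (schema, table)).fetchall()


def columns_early(schema, table):
    return conn.execute(FILTER_INSIDE_BRANCHES, (schema, table, schema, table)).fetchall()


print(columns_late("spectrum", "events"))
print(columns_early("spectrum", "events"))
```

The two queries are logically equivalent. On Redshift, the second form puts the schemaname predicate directly on the external columns branch, which is what the issue asks for.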

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

(.venv) ➜  invision-dbt git:(for-dbt-pr) ✗ dbt --version
installed version: 0.18.1
   latest version: 0.18.1

Up to date!

Plugins:
  - bigquery: 0.18.1
  - snowflake: 0.18.1
  - redshift: 0.18.1
  - postgres: 0.18.1

The operating system you're using:

  • 10.15.6
  • Whatever is in python:3.8-slim

The output of python --version:

➜  invision-dbt git:(for-dbt-pr) ✗ python --version
Python 3.8.6

Additional context

Explain Plan

Notice the table_schema='some' filter. It is applied only to the unioned subquery and is never pushed down into the svv_external_columns scan.

XN Merge  (cost=999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=176)
  Merge Key: ordinal_position
  ->  XN Network  (cost=999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=176)
        Send to leader
        ->  XN Sort  (cost=999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=176)
              Sort Key: ordinal_position
              ->  XN Subquery Scan unioned  (cost=145262411.58..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=176)
                    Filter: (table_schema = 'some'::name)
                    ->  XN Append  (cost=145262411.58..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=119 width=634)
                          ->  XN Subquery Scan "*SELECT* 2"  (cost=0.00..15.31 rows=5 width=292)
                          ->  XN Subquery Scan "*SELECT* 1"  (cost=145262411.58..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=109 width=599)
                          ->  XN Network  (cost=1000000000020.18..1000000000020.58 rows=5 width=634)
                                Distribute Round Robin
                                ->  XN Subquery Scan "*SELECT* 3"  (cost=1000000000020.18..1000000000020.58 rows=5 width=634)
                                ->  XN Hash Join DS_BCAST_INNER  (cost=145262411.58..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=109 width=599)
                                ->  XN Function Scan on pg_get_late_binding_view_cols cols  (cost=0.00..15.26 rows=5 width=292)
                                      Join Filter: (("inner".usename = ("current_user"())::name) OR (has_table_privilege("outer".oid, 'SELECT'::text) = true) OR (has_table_privilege("outer".oid, 'INSERT'::text) = true) OR (has_table_privilege("outer".oid, 'UPDATE'::text) = true) OR (has_table_privilege("outer".oid, 'REFERENCES'::text) = true))
                                      Hash Cond: ("outer".relowner = "inner".usesysid)
                                      Filter: (view_name = 'table'::name)
                                      ->  XN Subquery Scan svv_external_columns  (cost=1000000000020.18..1000000000020.53 rows=5 width=634)
                                      ->  XN Hash Join DS_BCAST_INNER  (cost=145262410.56..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=109 width=607)
                                      ->  XN Hash  (cost=1.01..1.01 rows=1 width=132)
                                            Hash Cond: ("outer".relnamespace = "inner".oid)
                                            ->  XN Merge  (cost=1000000000020.18..1000000000020.20 rows=5 width=168)
                                            ->  XN Hash Join DS_DIST_INNER  (cost=145262409.49..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=109 width=483)
                                            ->  XN Hash  (cost=1.06..1.06 rows=6 width=132)
                                            ->  LD Seq Scan on pg_shadow  (cost=0.00..1.01 rows=1 width=132)
                                                  Merge Key: ((btrim((ext_cols.schemaname)::text))::character varying)::character varying(128), ((btrim((ext_cols.tablename)::text))::character varying)::character varying(128), ext_cols.columnnum
                                                  Inner Dist Key: c.oid
                                                  Hash Cond: ("outer".attrelid = "inner".oid)
                                                  ->  XN Network  (cost=1000000000020.18..1000000000020.20 rows=5 width=168)
                                                  ->  XN Hash Join DS_BCAST_INNER  (cost=145262383.02..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=25597 width=475)
                                                  ->  XN Hash  (cost=26.46..26.46 rows=2 width=12)
                                                  ->  LD Seq Scan on pg_namespace nc  (cost=0.00..1.06 rows=6 width=132)
                                                        Send to leader
                                                        Hash Cond: ("outer".atttypid = "inner".oid)
                                                        ->  XN Sort  (cost=1000000000020.18..1000000000020.20 rows=5 width=168)
                                                        ->  XN Hash Left Join DS_DIST_BOTH  (cost=3839.55..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=25597 width=170)
                                                        ->  XN Hash  (cost=145258542.22..145258542.22 rows=503 width=309)
                                                        ->  LD Seq Scan on pg_class c  (cost=0.00..26.46 rows=2 width=12)
                                                              Sort Key: ((btrim((ext_cols.schemaname)::text))::character varying)::character varying(128), ((btrim((ext_cols.tablename)::text))::character varying)::character varying(128), ext_cols.columnnum
                                                              Outer Dist Key: a.attrelid
                                                              Inner Dist Key: ad.adrelid
                                                              Hash Cond: (("outer".attrelid = "inner".adrelid) AND ("outer".attnum = "inner".adnum))
                                                              Filter: ((((relname)::information_schema.sql_identifier)::text = 'table'::text) AND ((relkind = 'r'::"char") OR (relkind = 'v'::"char")))
                                                              ->  XN Hash Left Join DS_DIST_BOTH  (cost=8400035.75..145258542.22 rows=503 width=309)
                                                              ->  XN Hash  (cost=2559.70..2559.70 rows=255970 width=6)
                                                              ->  XN Function Scan on pg_get_external_columns ext_cols  (cost=0.00..20.12 rows=5 width=168)
                                                              ->  LD Seq Scan on pg_attribute a  (cost=0.00..236.21 rows=3580 width=170)
                                                                    Outer Dist Key: "outer".typbasetype
                                                                    Join Filter: ("outer".typtype = 'd'::"char")
                                                                    Inner Dist Key: bt.oid
                                                                    Hash Cond: ("outer".typbasetype = "inner".oid)
                                                                    Filter: ((attnum > 0) AND (attisdropped <> true))
                                                                    Filter: ((((btrim((tablename)::text))::character varying)::character varying(128))::text = 'table'::text)
                                                                    ->  XN Hash Join DS_BCAST_INNER  (cost=1.07..8400033.42 rows=503 width=175)
                                                                    ->  XN Hash  (cost=8400033.42..8400033.42 rows=503 width=138)
                                                                    ->  LD Seq Scan on pg_attrdef ad  (cost=0.00..2559.70 rows=255970 width=6)
                                                                          Hash Cond: ("outer".typnamespace = "inner".oid)
                                                                          ->  XN Hash Join DS_BCAST_INNER  (cost=1.07..8400033.42 rows=503 width=138)
                                                                          ->  XN Hash  (cost=1.06..1.06 rows=6 width=132)
                                                                          ->  LD Seq Scan on pg_type t  (cost=0.00..21.03 rows=503 width=51)
                                                                                Hash Cond: ("outer".typnamespace = "inner".oid)
                                                                                ->  XN Hash  (cost=1.06..1.06 rows=6 width=132)
                                                                                ->  LD Seq Scan on pg_type bt  (cost=0.00..21.03 rows=503 width=14)
                                                                                ->  LD Seq Scan on pg_namespace nt  (cost=0.00..1.06 rows=6 width=132)
                                                                                      ->  LD Seq Scan on pg_namespace nbt  (cost=0.00..1.06 rows=6 width=132)
@brangisom brangisom added bug Something isn't working triage labels Oct 27, 2020
@jtcohen6 jtcohen6 added good_first_issue Straightforward + self-contained changes, good for new contributors! redshift and removed triage labels Oct 27, 2020
@jtcohen6
Contributor

@brangisom Thanks for doing all this research, and for opening the associated PR!
