(#1576) use the information schema on BigQuery #1795

Merged: 4 commits merged into dev/louisa-may-alcott from fix/bq-catalog-query on Oct 15, 2019

Conversation

@drewbanin (Contributor) commented Sep 29, 2019:

Fixes #1576

Work in progress. Use the BigQuery INFORMATION_SCHEMA to fetch the catalog. I actually cheat here and use __TABLES__ (not INFORMATION_SCHEMA.TABLES), because the information in __TABLES__ is a superset of the data in TABLES (i.e. it also includes row_count and size_bytes).
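For reference, a sketch of the kind of query this enables (project and dataset names are placeholders; the column aliases mirror the catalog query in this PR):

    select
        project_id as table_database,
        dataset_id as table_schema,
        table_id as table_name,
        row_count,
        size_bytes
    from `my-project`.`my_dataset`.__TABLES__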

A couple of things to verify here:

  • do we need to worry about quoting/casing here? I just want to preserve the existing behavior
  • it's hard to find good docs on __TABLES__ -- is it appropriate for us to use?
  • make sure this fix actually addresses #1576 (Generating docs takes a long time) -- how does this hold up for datasets with many date-sharded tables?

To document:

Date-sharded tables can be addressed using dbt sources by replacing the date shard suffix with a * in the source specification. When this source is referenced from a model, dbt will expand the wildcard to match all date shards of the table. Additionally, the auto-generated dbt documentation website will correctly collect statistics about all of the date shards in the table.

sources:
    - name: source_schema
      tables:
          - name: events_*
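For example, a model referencing this source might look like the following (the model name and the compiled form are illustrative, not taken from this PR):

    -- models/all_events.sql
    select * from {{ source('source_schema', 'events_*') }}

    -- which compiles, roughly, to a BigQuery wildcard table reference:
    -- select * from `my-project`.`source_schema`.`events_*`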

@mr2dark commented Sep 29, 2019:

@drewbanin, I've run the doc generation, but it fails because we have some uppercase letters in our dataset names. The SQL query that is marked as failing (a sample of which is shown below) contains dataset names (in all CTEs) that are both lowercased and quoted.
This results in an error like 404 Not found: Dataset project:lower_upper was not found in location US when the original dataset name is lower_UPPER.
I tried to switch dataset quoting off via dbt_project.yml, but it looks like those quoting settings don't affect that SQL.

    (
      with tables as (
          select
              project_id as table_database,
    ...
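In other words, the dataset name gets lowercased and then quoted, so BigQuery looks for an exact, case-sensitive match (identifiers below are illustrative):

    -- actual dataset: lower_UPPER, but the generated query addresses:
    select * from `project`.`lower_upper`.__TABLES__
    -- => 404 Not found: Dataset project:lower_upper was not found in location US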

@drewbanin (Contributor, Author) replied:

Really good point @mr2dark! These quoting/capitalization configs can be tricky for us to reconcile in dbt. I just pushed a quick fix that will always quote the project and dataset name, but I don't think that's exactly correct here either. This is certainly tractable for us, but I'll need to spend a little more time here to get it exactly right.
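The shape of the always-quoted reference is something like this (identifiers below are placeholders):

    -- project and dataset are quoted verbatim, preserving their original casing:
    select * from `my-project`.`lower_UPPER`.__TABLES__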

If you're so inclined, feel free to grab the latest commit and let me know if it happens to work for you :)

@mr2dark commented Sep 29, 2019:

@drewbanin
For the record, the previous time I ran doc generation, I tried the quoting configuration for datasets:

  1. Set explicitly On
  2. Set explicitly Off
  3. Left out (Default)

None of these affected the resulting catalog query that time.

It works for me now with the latest commit; doc generation takes about 15 seconds.
Looking forward to this fix being released.
Thank you very much!

@@ -1157,6 +1157,13 @@ def __eq__(self, other):
return isinstance(other, float)


class AnyString:
"""Any string. Use this in assertEqual() calls to assert that it is a float.
@beckjake (Contributor) commented on this diff:

... to assert that it is a str!

@beckjake (Contributor) left a comment:

This is awesome! I outlined some structural changes I'd recommend to the adapter to bring it in line with the others and let us use a common get_catalog everywhere. But just moving the catalog into SQL is so nice!

'database': True,
'schema': True
}
))
@beckjake (Contributor) commented:

The existing get_catalog's call to _get_cache_schemas should do this, and if it doesn't we should fix it for BigQuery! You might have to override the BigQueryRelation.information_schema method in some way.

@drewbanin (Contributor, Author) replied:

The big difference from the other get_catalog implementations here is that the BigQuery information schema is (usually) addressed with:

`project-id`.`dataset`.INFORMATION_SCHEMA.COLUMNS

i.e. the information_schema is affixed to a dataset (schema), not a project (database).

The exception is SCHEMATA which is addressed at the project level:

`project-id`.INFORMATION_SCHEMA.SCHEMATA

Do you still think we should override _get_cache_schemas? I think I might also need to override SchemaSearchMap to return information schema Relations that BigQuery is happy with.
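Concretely, the two addressing modes look something like this (project and dataset names are placeholders):

    -- COLUMNS is scoped to a dataset (schema):
    select table_name, column_name, data_type
    from `my-project`.`my_dataset`.INFORMATION_SCHEMA.COLUMNS

    -- SCHEMATA is scoped to the project (database):
    select schema_name, location
    from `my-project`.INFORMATION_SCHEMA.SCHEMATA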

@beckjake (Contributor) replied:

Hmm, ok. In that case, for now, how about just implementing _get_cache_schemas as:

def _get_cache_schemas(self, manifest):
    for database, schema in manifest.get_used_schemas():
        yield self.Relation.create(
            database=database,
            schema=schema,
            quote_policy={
                'database': True,
                'schema': True
            }
        )

In the long run, I think we should probably extend bigquery's Relation subclass to fully account for this quirky interpretation of information_schema, but we can do that at a later date.

@drewbanin (Contributor, Author) replied:

just a couple of great minds, thinking alike.

kwargs = {'information_schemas': information_schemas}
table = self.execute_macro(GET_CATALOG_MACRO_NAME,
                           kwargs=kwargs,
                           release=True)
@beckjake (Contributor) commented:

The base adapter already does this in get_catalog.

col.name: col.name.replace('__', ':') for col in table.columns
})

return self._catalog_filter_table(table, manifest)
@beckjake (Contributor) suggested:

def _catalog_filter_table(self, table, manifest):
    # BigQuery doesn't allow ":" chars in column names -- remap them here.
    table = table.rename(column_names={
        col.name: col.name.replace('__', ':') for col in table.columns
    })

    return super()._catalog_filter_table(table, manifest)

relation_type == 'table',
)
return zip(column_names, column_values)

def get_catalog(self, manifest):
@beckjake (Contributor) commented Oct 15, 2019:

Instead of overriding BaseAdapter.get_catalog, we should use its existing behavior and make BigQuery behave more like other adapters do! I've outlined my thoughts on that below (above, in GitHub's rendering).

@beckjake (Contributor) left a comment:

Looks great. Is it feasible to use the information schema for listing relations, too? (A separate PR, for sure!)

@drewbanin (Contributor, Author) replied:

yep! #1275 :D

@drewbanin merged commit 00a22e1 into dev/louisa-may-alcott on Oct 15, 2019
@drewbanin deleted the fix/bq-catalog-query branch on October 15, 2019 18:44