(#1576) use the information schema on BigQuery #1795

Merged: 4 commits merged into dev/louisa-may-alcott from fix/bq-catalog-query on Oct 15, 2019

Conversation

@drewbanin (Contributor) commented Sep 29, 2019:

Fixes #1576

Work in progress. Use the BigQuery INFORMATION_SCHEMA to fetch the catalog. I actually cheat here and use __TABLES__ (not INFORMATION_SCHEMA.TABLES), because the information in __TABLES__ is a superset of the data in TABLES (i.e. it also includes row_count and size_bytes).
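For reference, a sketch of the kind of query this enables (project and dataset names are placeholders; the column aliases mirror the catalog query in this PR):

    select
        project_id as table_database,
        dataset_id as table_schema,
        table_id as table_name,
        row_count,
        size_bytes
    from `my-project`.`my_dataset`.__TABLES__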

A couple of things to verify here:

  • do we need to worry about quoting/casing here? I just want to preserve the existing behavior
  • it's hard to find good docs on __TABLES__ -- is it appropriate for us to use?
  • make sure this fix actually addresses #1576 (Generating docs takes a long time) -- how does this hold up for datasets with many date-sharded tables?

To document:

Date-sharded tables can be addressed using dbt sources by replacing the date shard suffix with a * in the source specification. When this source is referenced from a model, dbt will expand the wildcard to match all date shards of the table. Additionally, the auto-generated dbt documentation website will correctly collect statistics about all of the date shards in the table.

sources:
    - name: source_schema
      tables:
          - name: events_*
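For example, a model referencing this source might look like the following (the model name and the compiled form are illustrative, not taken from this PR):

    -- models/all_events.sql
    select * from {{ source('source_schema', 'events_*') }}

    -- which compiles, roughly, to a BigQuery wildcard table reference:
    -- select * from `my-project`.`source_schema`.`events_*`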

@mr2dark commented Sep 29, 2019:

@drewbanin, I've run the doc generation, but it fails because we have some uppercase letters in our dataset names. The SQL query that is marked as failing (a sample of which is shown below) contains dataset names (in all CTEs) that are both lowercased and quoted.
This results in an error like 404 Not found: Dataset project:lower_upper was not found in location US when the original dataset name is lower_UPPER.
I tried to switch dataset quoting off via dbt_project.yml, but it looks like those quoting settings don't affect that SQL.

    (
      with tables as (
          select
              project_id as table_database,
    ...
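In other words, the dataset name gets lowercased and then quoted, so BigQuery looks for an exact, case-sensitive match (identifiers below are illustrative):

    -- actual dataset: lower_UPPER, but the generated query addresses:
    select * from `project`.`lower_upper`.__TABLES__
    -- => 404 Not found: Dataset project:lower_upper was not found in location US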

@drewbanin (Contributor, Author) replied:

Really good point @mr2dark! These quoting/capitalization configs can be tricky for us to reconcile in dbt. I just pushed a quick fix that will always quote the project and dataset name, but I don't think that's exactly correct here either. This is certainly tractable for us, but I'll need to spend a little more time here to get it exactly right.
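The shape of the always-quoted reference is something like this (identifiers below are placeholders):

    -- project and dataset are quoted verbatim, preserving their original casing:
    select * from `my-project`.`lower_UPPER`.__TABLES__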

If you're so inclined, feel free to grab the latest commit and let me know if it happens to work for you :)

@mr2dark commented Sep 29, 2019:

@drewbanin
For the record, the previous time I ran doc generation, I tried the quoting configuration for datasets:

  1. Set explicitly On
  2. Set explicitly Off
  3. Left out (Default)

None of these affected the resulting catalog query that time.

It works for me now with the latest commit; doc generation takes about 15 seconds.
Looking forward to this fix being released.
Thank you very much!

@@ -1157,6 +1157,13 @@ def __eq__(self, other):
return isinstance(other, float)


class AnyString:
"""Any string. Use this in assertEqual() calls to assert that it is a float.
@beckjake (Contributor) commented on this diff:

... to assert that it is a str!

@beckjake (Contributor) left a comment:

This is awesome! I outlined some structural changes I'd recommend to the adapter to bring it in line with the others and let us use a common get_catalog everywhere. But just moving the catalog into SQL is so nice!

'database': True,
'schema': True
}
))
@beckjake (Contributor) commented:

The existing get_catalog's call to _get_cache_schemas should do this, and if it doesn't we should fix it for BigQuery! You might have to override the BigQueryRelation.information_schema method in some way.

@drewbanin (Contributor, Author) replied:

The big difference from the other get_catalog implementations here is that the BigQuery information schema is (usually) addressed with:

`project-id`.`dataset`.INFORMATION_SCHEMA.COLUMNS

i.e. the information_schema is affixed to a dataset (schema), not a project (database).

The exception is SCHEMATA which is addressed at the project level:

`project-id`.INFORMATION_SCHEMA.SCHEMATA

Do you still think we should override _get_cache_schemas? I think I might also need to override SchemaSearchMap to return information schema Relations that BigQuery is happy with.
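Concretely, the two addressing modes look something like this (project and dataset names are placeholders):

    -- COLUMNS is scoped to a dataset (schema):
    select table_name, column_name, data_type
    from `my-project`.`my_dataset`.INFORMATION_SCHEMA.COLUMNS

    -- SCHEMATA is scoped to the project (database):
    select schema_name, location
    from `my-project`.INFORMATION_SCHEMA.SCHEMATA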

@beckjake (Contributor) replied:

Hmm, ok. In that case, for now, how about just implementing _get_cache_schemas as:

def _get_cache_schemas(self, manifest):
    for database, schema in manifest.get_used_schemas():
        yield self.Relation.create(
            database=database,
            schema=schema,
            quote_policy={
                'database': True,
                'schema': True
            }
        )

In the long run, I think we should probably extend bigquery's Relation subclass to fully account for this quirky interpretation of information_schema, but we can do that at a later date.

@drewbanin (Contributor, Author) replied:

just a couple of great minds, thinking alike.

kwargs = {'information_schemas': information_schemas}
table = self.execute_macro(GET_CATALOG_MACRO_NAME,
                           kwargs=kwargs,
                           release=True)
@beckjake (Contributor) commented:

The base adapter already does this in get_catalog.

col.name: col.name.replace('__', ':') for col in table.columns
})

return self._catalog_filter_table(table, manifest)
@beckjake (Contributor) suggested:

def _catalog_filter_table(self, table, manifest):
    # BigQuery doesn't allow ":" chars in column names -- remap them here.
    table = table.rename(column_names={
        col.name: col.name.replace('__', ':') for col in table.columns
    })

    return super()._catalog_filter_table(table, manifest)

relation_type == 'table',
)
return zip(column_names, column_values)

def get_catalog(self, manifest):
@beckjake (Contributor) commented Oct 15, 2019:

Instead of overriding BaseAdapter.get_catalog, we should use its existing behavior and make BigQuery behave more like other adapters do! I've outlined my thoughts on that below (above, in GitHub's rendering).

@beckjake (Contributor) left a comment:

Looks great. Is it feasible to use the information schema for listing relations, too? (A separate PR, for sure!)

@drewbanin (Contributor, Author) replied:

yep! #1275 :D

@drewbanin merged commit 00a22e1 into dev/louisa-may-alcott on Oct 15, 2019
@drewbanin deleted the fix/bq-catalog-query branch on October 15, 2019 18:44