
Fix/snowflake source quoting #1326

Closed · wants to merge 10 commits

Conversation

drewbanin (Contributor):

Information Schema queries for Sources

This PR revamps how dbt handles identifier quoting for sources. It addresses #1317, and also fixes another 0.13.0 bug that we don't currently have an issue for. This second bug was way gnarlier than #1317, and the majority of this PR is intended to fix it.

Source quoting

In 0.13.0, dbt was modified to better support custom databases. As part of this change, an information_schema_name macro was added which, given a database literal, returns the database and schema for the relevant information_schema.

This macro uses the project-wide quoting config to determine if databases should be quoted. This is the difference between:

```sql
select * from "raw".information_schema.tables
```

and

```sql
select * from raw.information_schema.tables
```

There is no difference between these two queries on Postgres/Redshift (and we don't use BigQuery's information schema), so the effects of this macro are localized to Snowflake.

The key problem is that dbt uses the project-level quoting config to decide whether to quote source databases. While these two uses seem similar, they're actually quite different. Because dbt uses the quoting config both to build and to query the relations it creates, it is internally consistent for its own relations. For source databases, however, dbt's quoting config has no bearing at all on whether the source database should be quoted! The existing information_schema_name logic effectively made it impossible to address a lowercase, quoted source database on Snowflake if database quoting was turned off for the project. This PR divorces the logic of generating information_schema relations for sources from the quoting config specified in dbt_project.yml.

Two approaches

There are two approaches we can take here:

  1. quote the database/schema/identifier exactly as supplied in schema.yml files
  2. do not quote sources automatically, and instead force the user to quote their identifiers twice in schema.yml files (the one implemented here)

The first approach is implemented in the first commit for this PR, and it worked! It also resulted in a pretty unpleasant experience for Snowflake users. This required users to write code like:

```yaml
version: 2
sources:
  - name: snowplow
    database: RAW
    schema: SNOWPLOW
    tables:
      - name: event
        identifier: EVENT
      - name: web_page
        identifier: WEB_PAGE
```

Effectively every table name would need to be duplicated in order to provide a sane (lowercase) name alongside the correct upper-cased identifier. The 90% case for Snowflake sources is that they are uppercase and unquoted, so it didn't feel right to impose this burden on dbt users.

The second approach results in a much nicer UX at the expense of some not-ideal logic in dbt. It makes the users with troublesome source identifiers feel the pain, which is probably how it should be :). In the approach implemented in this PR, dbt inspects the identifiers specified for sources. If an identifier includes quotes, dbt strips those quotes and records that the identifier should be quoted in information_schema queries. In this version, the above schema.yml can be written as:

```yaml
version: 2
sources:
  - name: snowplow
    database: raw
    tables:
      - name: event
      - name: web_page
```

Much nicer! If source tables need to be quoted (because they include illegal characters, or are lowercase on Snowflake), users can do something like:

```yaml
version: 2
sources:
  - name: snowplow
    database: '"raw-database"' # example database that needs to be quoted
    tables:
      - name: event
```

When dbt tries to query the raw-database information schema, it will render SQL like:

```sql
select * from "raw-database".information_schema.tables
```
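The quote-detection behavior described above can be sketched in Python. This is an illustration only, under stated assumptions: `split_quoted_identifier` and `render_information_schema` are hypothetical names, not dbt's actual implementation.

```python
def split_quoted_identifier(identifier, quote_character='"'):
    """If the identifier is wrapped in quote characters, strip them and
    report that the identifier must be re-quoted when rendered in SQL."""
    identifier = identifier.strip()
    if (len(identifier) > 1
            and identifier.startswith(quote_character)
            and identifier.endswith(quote_character)):
        # quoted in schema.yml: strip the quotes, but remember to
        # re-quote when building information_schema queries
        return True, identifier[1:-1]
    return False, identifier


def render_information_schema(database, quote_character='"'):
    """Render an information_schema query for a source database,
    honoring quoting supplied in schema.yml rather than project config."""
    quoted, name = split_quoted_identifier(database, quote_character)
    rendered = quote_character + name + quote_character if quoted else name
    return 'select * from {}.information_schema.tables'.format(rendered)
```

With this, `database: '"raw-database"'` renders a quoted query, while `database: raw` renders an unquoted one, independent of the project-wide quoting config.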

So, I think this checks all the boxes:

  • mostly transparent to users
  • databases that must be quoted are now addressable by sources
  • the adapter contract hasn't really changed, which is a nice plus

There's some cleanup work to do here, but I wanted to get some initial feedback, as I've been chasing my tail on this for a couple of days now.

TODO:

  • lots of tests
  • update the presto catalog query to remove this line of code
  • improve error messages

@cmcarthur (Member) left a comment:

The structure looks good here. You're on the right track, but this isn't quite ready to go yet.

```python
    identifier='__phony__',
).include(identifier=False).quote(database=quote_db, schema=False)

information_schema[relation.database.lower()] = information_schema_rel
```
Member:

do we need a duplicate check here?

```python
key = database.lower()
database_counter[key] += 1
if database_counter[key] > 1:
    raise RuntimeError("not unique db name TODO TODO")
```
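A complete version of the duplicate check suggested here might look like the following sketch. `check_unique_databases` and the ValueError message are placeholders for illustration, not code from this PR.

```python
from collections import Counter


def check_unique_databases(databases):
    """Raise if two source databases collide case-insensitively, which
    would make a lowercase-keyed information_schema mapping ambiguous."""
    counts = Counter(db.lower() for db in databases)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(
            'source database names are not unique (case-insensitive): {}'
            .format(', '.join(duplicates))
        )
```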
Member:

☝️

Member:

this check should be moved to the get_information_schema_mapping, no?

```diff
-'identifier': True,
+'database': source.quoting['database'],
+'schema': source.quoting['schema'],
+'identifier': source.quoting['identifier'],
```
Member:

This new version is more correct than the old one, right? It guarantees that the rendered relation will exactly match what was supplied in the source spec. I like it.

```diff
@@ -781,14 +782,39 @@ def get_catalog(self, manifest):
         Returns an agate.Table of catalog information.
         """
         # make it a list so macros can index into it.
-        context = {'databases': list(manifest.get_used_databases())}
+        context = {
+            'databases': list(manifest.get_used_databases()),
```
Member:

seems like this should be removed

drewbanin (author), Feb 28, 2019:

I don't think we can do that! All of the get_catalog macros in dbt loop over this value. The alternative would be to make the catalog queries loop over the information_schema dict values... unsure if that's better

```python
if identifier.startswith(self.quote_character) and \
        identifier.endswith(self.quote_character):
    return True, identifier[1:-1]
else:
    return False, identifier
```
Member:

If I have any doubt about this branch, it's this function. The idea of it is fine, but I can see it breaking easily...

Do we trim whitespace on identifiers specified in schema.yml?

drewbanin (author):

Really good question! I think there's some amount of caveat emptor that needs to happen here. I'm super happy to trim whitespace (this is a good idea), but I imagine there will always be intractable edge cases with an approach like this. If your database name starts and ends with quotation marks, then you have bigger problems than dbt's sources not working, I think.

I'd agree with you that this is the part of the PR I feel least comfortable with. Happy to explore other ways of doing this, or possibly talk about how to avoid string hackery altogether. I think this functionality results in the nicest UX for users, but I could equivalently imagine:

```yaml
sources:
  - name: snowplow
    database: raw
    quoting:
      database: true
    ....
```

I think that's doable, but I imagine it's unpleasant both for dbt devs and for users. Can think about this a little more, for sure.

@beckjake (Contributor) left a comment:

Some comments!

```python
        table = self.execute_macro(GET_CATALOG_MACRO_NAME,
                                   context_override=context,
                                   release=True)

        results = self._catalog_filter_table(table, manifest)
        return results

    @available
```
Contributor:

I don't think you really want this available in macros/materializations, given that you're already injecting its result into the context.

```diff
@@ -8,6 +8,7 @@
 from dbt.node_types import NodeType
 from dbt.logger import GLOBAL_LOGGER as logger
 from dbt import tracking
+from collections import defaultdict
```
Contributor:

I think you don't need this

```diff
@@ -676,12 +676,17 @@ class Hook(APIObject):
     # the manifest search stuff really requires this, sadly
     'resource_type': {
         'enum': [NodeType.Source],
     },
+    'quoting': {
```
Contributor:

these members should probably also be required and exist under properties as a 'type': 'object', right? Something like this:

```python
'quoting': {
    'type': 'object',
    'additionalProperties': False,
    'properties': {
        'database': {'type': 'boolean'},
        'schema': {'type': 'boolean'},
        'identifier': {'type': 'boolean'},
    },
    'required': ['database', 'schema', 'identifier'],
}
```
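The constraint that schema expresses (exactly the three keys, all boolean, no extras) can be checked in plain Python. This is a minimal sketch for illustration; `validate_quoting` is a hypothetical helper, not dbt's actual schema validator.

```python
REQUIRED_QUOTING_KEYS = {'database', 'schema', 'identifier'}


def validate_quoting(quoting):
    """Mimic the proposed JSON schema for a source's quoting config:
    all three keys required, no extra keys, all values boolean."""
    if set(quoting) != REQUIRED_QUOTING_KEYS:
        raise ValueError('quoting must define exactly: {}'.format(
            sorted(REQUIRED_QUOTING_KEYS)))
    for key, value in quoting.items():
        if not isinstance(value, bool):
            raise ValueError('quoting.{} must be a boolean'.format(key))
    return quoting
```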

```diff
@@ -257,6 +258,7 @@ def build_test_node(self, test_target, package_name, test, root_dir, path,

     source_package = self.all_projects.get(package_name)
     if source_package is None:
+        # TODO : model_name is not defined here...
```
Contributor:

I think s/model_name/test_target fixes this? I'd have to check the types to be 100% on that.

drewbanin (author):

When I run my pep8 linter, it (surprisingly frequently) shows that vars aren't defined. I'll make this quick fix here, but I wanted to flag it in the initial commit so I didn't forget about it.

Contributor:

Are you using pylint instead of pep8 itself to get those? It's great for this stuff but dbt makes it cry so much I can't stand using it.

drewbanin (author):

I think I'm using flake8. I can try to dig up where my configs are defined -- pretty sure I'm not using the stock settings, but I can't seem to find them right now.

Contributor:

Cool, I think that's enough to go on. Maybe we should add it to our CI process instead of pep8 itself? I'm pretty sure flake8 is just pep8 + a McCabe complexity check + pyflakes.

drewbanin (author):

Sure, I'm into that. It's pretty chatty, but all good things. E.g.:

[screenshot: flake8 output, 2019-03-04]

Contributor:

Ok. I'll open an issue about moving it - it seems like something we should do unless we find an unbearable false positive rate. In the meantime I've installed it in my editor so I'll fix stuff as I work on other things.

@beckjake (Contributor) commented Mar 6, 2019:

The behavior in this PR is implemented by #1338; closing this one accordingly.

@beckjake beckjake closed this Mar 6, 2019
@drewbanin drewbanin deleted the fix/snowflake-source-quoting branch September 16, 2019 15:55