
Fix/snowflake source quoting #1326

Closed · wants to merge 10 commits

Conversation

drewbanin (Contributor):

Information Schema queries for Sources

This PR revamps how dbt handles identifier quoting for sources. It addresses #1317, and also fixes another 0.13.0 bug that we don't currently have an issue for. This second bug was way gnarlier than #1317, and the majority of this PR is intended to fix it.

Source quoting

In 0.13.0, dbt was modified to better support custom databases. As part of this change, an information_schema_name macro was added which, given a database literal, returns the database and schema for the relevant information_schema.

This macro uses the project-wide quoting config to determine if databases should be quoted. This is the difference between:

```sql
select * from "raw".information_schema.tables
```

and

```sql
select * from raw.information_schema.tables
```

There is no difference between these two queries on Postgres/Redshift (and we don't use BigQuery's information schema), so the effects of this macro are localized to Snowflake.

The key problem is that dbt uses the project-level quoting config to decide whether to quote source databases. While these two uses seem similar, they're actually quite different. Because dbt uses the quoting config both to build and to query the relations it creates, it is internally consistent for its own relations. For source databases, however, dbt's quoting config has no bearing at all on whether the source database should be quoted! The existing information_schema_name logic effectively made it impossible to address a lowercase, quoted source database on Snowflake if database quoting was turned off for the project. This PR divorces the logic of generating information_schema relations for sources from the quoting config specified in dbt_project.yml.

Two approaches

There are two approaches we can take here:

  1. quote the database/schema/identifier exactly as supplied in schema.yml files
  2. do not quote sources automatically, and instead force the user to quote their identifiers twice in schema.yml files (the one implemented here)

The first approach is implemented in the first commit for this PR, and it worked! It also resulted in a pretty unpleasant experience for Snowflake users. This required users to write code like:

```yaml
version: 2
sources:
  - name: snowplow
    database: RAW
    schema: SNOWPLOW
    tables:
      - name: event
        identifier: EVENT
      - name: web_page
        identifier: WEB_PAGE
```

Effectively every table name would need to be duplicated in order to provide a sane (lowercase) name alongside the correct upper-cased identifier. The 90% case for Snowflake sources is that they are uppercase and unquoted, so it didn't feel right to impose this burden on dbt users.

The second approach results in a much nicer UX at the expense of some not-ideal logic in dbt. It makes the users with troublesome source identifiers feel the pain, which is probably how it should be :). In the approach implemented in this PR, dbt inspects the identifiers specified for sources. If an identifier includes quotes, dbt strips those quotes and records that the identifier should be quoted in information_schema queries. In this version, the above schema.yml can be written as:

```yaml
version: 2
sources:
  - name: snowplow
    database: raw
    tables:
      - name: event
      - name: web_page
```

Much nicer! If source tables need to be quoted (because they include illegal characters, or are lowercase on Snowflake), users can do something like:

```yaml
version: 2
sources:
  - name: snowplow
    database: '"raw-database"' # example database that needs to be quoted
    tables:
      - name: event
```

When dbt tries to query the raw-database information schema, it will render SQL like:

```sql
select * from "raw-database".information_schema.tables
```
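The quote-detection behavior described above can be sketched in Python. This is an illustration only, under stated assumptions: `split_quoted_identifier` and `render_information_schema` are hypothetical names, not dbt's actual implementation.

```python
def split_quoted_identifier(identifier, quote_character='"'):
    """If the identifier is wrapped in quote characters, strip them and
    report that the identifier must be re-quoted when rendered in SQL."""
    identifier = identifier.strip()
    if (len(identifier) > 1
            and identifier.startswith(quote_character)
            and identifier.endswith(quote_character)):
        # quoted in schema.yml: strip the quotes, but remember to
        # re-quote when building information_schema queries
        return True, identifier[1:-1]
    return False, identifier


def render_information_schema(database, quote_character='"'):
    """Render an information_schema query for a source database,
    honoring quoting supplied in schema.yml rather than project config."""
    quoted, name = split_quoted_identifier(database, quote_character)
    rendered = quote_character + name + quote_character if quoted else name
    return 'select * from {}.information_schema.tables'.format(rendered)
```

With this, `database: '"raw-database"'` renders a quoted query, while `database: raw` renders an unquoted one, independent of the project-wide quoting config.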

So, I think this checks all the boxes:

  • mostly transparent to users
  • databases that must be quoted are now addressable by sources
  • the adapter contract hasn't really changed, which is a nice plus

There's some cleanup work to do here, but I wanted to get some initial feedback, as I've been chasing my tail on this for a couple of days now.

TODO:

  • lots of tests
  • update the presto catalog query to remove this line of code
  • improve error messages

@cmcarthur (Member) left a comment:

The structure looks good here. You're on the right track, but this isn't quite ready to go yet.

```python
    identifier='__phony__',
).include(identifier=False).quote(database=quote_db, schema=False)

information_schema[relation.database.lower()] = information_schema_rel
```
Member:

do we need a duplicate check here?

```python
key = database.lower()
database_counter[key] += 1
if database_counter[key] > 1:
    raise RuntimeError("not unique db name TODO TODO")
```
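A complete version of the duplicate check suggested here might look like the following sketch. `check_unique_databases` and the ValueError message are placeholders for illustration, not code from this PR.

```python
from collections import Counter


def check_unique_databases(databases):
    """Raise if two source databases collide case-insensitively, which
    would make a lowercase-keyed information_schema mapping ambiguous."""
    counts = Counter(db.lower() for db in databases)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(
            'source database names are not unique (case-insensitive): {}'
            .format(', '.join(duplicates))
        )
```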
Member:

☝️

Member:

this check should be moved to the get_information_schema_mapping, no?

```diff
-'identifier': True,
+'database': source.quoting['database'],
+'schema': source.quoting['schema'],
+'identifier': source.quoting['identifier'],
```
Member:

This new version is more correct than the old one, right? It guarantees that the rendered relation will exactly match what was supplied in the source spec. I like it.

```diff
@@ -781,14 +782,39 @@ def get_catalog(self, manifest):
         Returns an agate.Table of catalog information.
         """
         # make it a list so macros can index into it.
-        context = {'databases': list(manifest.get_used_databases())}
+        context = {
+            'databases': list(manifest.get_used_databases()),
```
Member:

seems like this should be removed

drewbanin (author), Feb 28, 2019:

I don't think we can do that! All of the get_catalog macros in dbt loop over this value. The alternative would be to make the catalog queries loop over the information_schema dict values... unsure if that's better

```python
if identifier.startswith(self.quote_character) and \
        identifier.endswith(self.quote_character):
    return True, identifier[1:-1]
else:
    return False, identifier
```
Member:

If I have any doubt about this branch, it's this function. The idea of it is fine, but I can see it breaking easily...

Do we trim whitespace on identifiers specified in schema.yml?

drewbanin (author):

Really good question! I think there's some amount of caveat emptor that needs to happen here. I'm super happy to trim whitespace (this is a good idea), but I imagine there will always be intractable edge cases with an approach like this. If your database name starts and ends with quotation marks, then you have bigger problems than dbt's sources not working, I think.

I'd agree with you that this is the part of the PR I feel least comfortable with. Happy to explore other ways of doing this, or possibly talk about how to avoid string hackery altogether. I think this functionality results in the nicest UX for users, but I could equivalently imagine:

```yaml
sources:
  - name: snowplow
    database: raw
    quoting:
      database: true
    ....
```

I think that's doable, but I imagine it's unpleasant both for dbt devs and for users. Can think about this a little more, for sure.

@beckjake (Contributor) left a comment:

Some comments!

```python
        table = self.execute_macro(GET_CATALOG_MACRO_NAME,
                                   context_override=context,
                                   release=True)

        results = self._catalog_filter_table(table, manifest)
        return results

    @available
```
Contributor:

I don't think you really want this available in macros/materializations, given that you're already injecting its result into the context.

```diff
@@ -8,6 +8,7 @@
 from dbt.node_types import NodeType
 from dbt.logger import GLOBAL_LOGGER as logger
 from dbt import tracking
+from collections import defaultdict
```
Contributor:

I think you don't need this

```diff
@@ -676,12 +676,17 @@ class Hook(APIObject):
     # the manifest search stuff really requires this, sadly
     'resource_type': {
         'enum': [NodeType.Source],
     },
+    'quoting': {
```
Contributor:

these members should probably also be required and exist under properties as a 'type': 'object', right? Something like this:

```python
'quoting': {
    'type': 'object',
    'additionalProperties': False,
    'properties': {
        'database': {'type': 'boolean'},
        'schema': {'type': 'boolean'},
        'identifier': {'type': 'boolean'},
    },
    'required': ['database', 'schema', 'identifier'],
}
```
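The constraint that schema expresses (exactly the three keys, all boolean, no extras) can be checked in plain Python. This is a minimal sketch for illustration; `validate_quoting` is a hypothetical helper, not dbt's actual schema validator.

```python
REQUIRED_QUOTING_KEYS = {'database', 'schema', 'identifier'}


def validate_quoting(quoting):
    """Mimic the proposed JSON schema for a source's quoting config:
    all three keys required, no extra keys, all values boolean."""
    if set(quoting) != REQUIRED_QUOTING_KEYS:
        raise ValueError('quoting must define exactly: {}'.format(
            sorted(REQUIRED_QUOTING_KEYS)))
    for key, value in quoting.items():
        if not isinstance(value, bool):
            raise ValueError('quoting.{} must be a boolean'.format(key))
    return quoting
```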

```diff
@@ -257,6 +258,7 @@ def build_test_node(self, test_target, package_name, test, root_dir, path,

     source_package = self.all_projects.get(package_name)
     if source_package is None:
+        # TODO : model_name is not defined here...
```
Contributor:

I think s/model_name/test_target fixes this? I'd have to check the types to be 100% on that.

drewbanin (author):

When I run my pep8 linter, it (surprisingly frequently) shows that vars aren't defined. I'll make this quick fix here, but I wanted to flag it in the initial commit so I didn't forget about it.

Contributor:

Are you using pylint instead of pep8 itself to get those? It's great for this stuff but dbt makes it cry so much I can't stand using it.

drewbanin (author):

I think I'm using flake8. I can try to dig up where my configs are defined -- pretty sure I'm not using the stock settings, but I can't seem to find them right now.

Contributor:

Cool, I think that's enough to go on. Maybe we should add it to our CI process instead of pep8 itself? I'm pretty sure flake8 is just pep8 + a McCabe complexity check + pyflakes.

drewbanin (author):

Sure, I'm into that. It's pretty chatty, but all good things. E.g.:

[screenshot: flake8 output, 2019-03-04]

Contributor:

Ok. I'll open an issue about moving it - it seems like something we should do unless we find an unbearable false positive rate. In the meantime I've installed it in my editor so I'll fix stuff as I work on other things.

@beckjake (Contributor) commented Mar 6, 2019:

The behavior in this PR is implemented by #1338; closing this one accordingly.

@beckjake beckjake closed this Mar 6, 2019
@drewbanin drewbanin deleted the fix/snowflake-source-quoting branch September 16, 2019 15:55