
[CT-711] [Draft] Add Structured Logs to Materializations #5325

Closed
2 tasks
nathaniel-may opened this issue Jun 1, 2022 · 6 comments
Labels
enhancement New feature or request logging stale Issues that have gone stale

Comments

@nathaniel-may
Contributor

nathaniel-may commented Jun 1, 2022

Motivation

Today, run_results.json is a static artifact written at the end of a run. It contains a summary of the objects that dbt run created or updated, along with an incomplete view of the SQL from each materialization. This information should be more granular and visible during a run. Structured logging is a formal interface for programmatically consuming metadata about a dbt run, and it is where this more granular information will be streamed to outside consumers.

Context

The materialization macro and the adapter functions it calls are natural places to put these log lines. However, many adapters override materializations directly (e.g. snowflake, bigquery, spark), and others override key adapter functions such as adapter.execute (e.g. bigquery), which would require these and future adapter maintainers to remember to copy our log lines exactly. Requiring adapter maintainers to take direct action to maintain the reliability of the structured logging interface is something we'd like to avoid, which is why this ticket puts these log lines before and after the call to execute in statement.sql (source). Since statement.sql is not a Python file, structured logging must be accessible to jinja, and today it is not.

Implementation

  • Create a new function in the jinja context, log_event, modeled on log (source), so that instead of taking a string msg, it takes a string event_name and a list of values of type List[Any]. This will fire the event with the same name as the string passed, constructed from the parameters in the list. (Using Event.__subclasses__() and the splat operator for lists (i.e. *[foo, bar]) may help here.) Mypy will not be able to catch any type mismatches here, since the event constructors aren't passed concrete values until runtime. Since these log lines will run with every materialization, tests should catch any issues with them.
  • Update SQLQueryStatus to take the full adapter_response object and the node name. Today we stringify the adapter response first, making it much harder to consume. Special attention will have to be paid to serialization, since each warehouse will put different values into its own responses. Using examples from today's warehouses as inputs to serialization tests is a good idea. The name of the currently running node is accessible via the "jinja god context" (TODO: link example of how to access the jinja god context) (TODO: adapter.execute is called in other places too. We should add log lines there too. Link exact lines here.)
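A minimal Python sketch of the name-based dispatch described in the first bullet, using Event.__subclasses__() and list splatting. The Event and SQLQueryStatus classes here are illustrative stand-ins, not dbt-core's real event hierarchy:

```python
# Hypothetical sketch of a `log_event` jinja-context function. All class
# and field names are illustrative, not dbt-core's real ones.
from typing import Any, List


class Event:
    """Minimal stand-in for a structured-logging event base class."""


class SQLQueryStatus(Event):
    def __init__(self, status: str, elapsed: float) -> None:
        self.status = status
        self.elapsed = elapsed


def log_event(event_name: str, params: List[Any]) -> Event:
    # Look up the event class by name among Event's subclasses, then
    # construct it by splatting the parameter list (*params). Mypy cannot
    # check these constructor arguments, since they only exist at runtime.
    for cls in Event.__subclasses__():
        if cls.__name__ == event_name:
            return cls(*params)
    raise ValueError(f"unknown event: {event_name}")


event = log_event("SQLQueryStatus", ["SUCCESS 1", 0.42])
```

As the bullet notes, a typo in the event name or a wrong parameter list only fails at runtime, which is why test coverage over every materialization path matters here.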
@nathaniel-may nathaniel-may added enhancement New feature or request triage labels Jun 1, 2022
@github-actions github-actions bot changed the title [Draft] Add Structured Logs to Materializations [CT-711] [Draft] Add Structured Logs to Materializations Jun 1, 2022
@jtcohen6
Contributor

jtcohen6 commented Jun 1, 2022

Requiring adapter maintainers to take direct action to maintain the reliability of the structured logging interface is something we'd like to avoid

Agree in principle. I'm on the fence about sticking this in the statement macro, vs. just expanding the SQLQuery event fired in add_query. This would indeed require some work in dbt-bigquery's custom execute, which doesn't feel great, but... almost all adapters out in the world inherit from the "SQL" adapter. [Update: I remembered that dbt-snowflake also overrides add_query, though it does ultimately call super().add_query as well.]

Definitely curious to hear Gerda's thoughts on how to factor this code. Also tagging @dataders for visibility / food for thought: To what extent must we shy away from high-impact logging, or other important metadata, living in the adapter?

TODO: adapter.execute is called in other places too. We should add log lines there too. Link exact lines here.

The important ones I can find:

We use it in our functional test utilities as well, here and here

Logging relation info as soon as it's built

This was another thing we discussed earlier today, related to this effort but possibly its own ticket

Let's add structured logging about the relations produced by the materialization, as soon as it finishes building them (here). Materializations already return the set of relations they create/update, for the purposes of updating dbt's cache. Why not share the wealth with programmatic consumers of dbt metadata?

(Will serializing Relation objects be an absolute nightmare? Relation objects can be reimplemented by adapter, of course, though they all inherit from BaseRelation, which should be serializable. Even so, we may not want all the object attributes included in the logging event—probably just a subset.)
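A sketch of the "just a subset" idea above: serialize only the fields log consumers need, rather than every attribute on the relation object. FakeRelation here is a hypothetical stand-in for a BaseRelation-like object; field names are illustrative:

```python
# Serialize a subset of relation attributes for a logging event,
# assuming a BaseRelation-like object. Names are illustrative.
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeRelation:
    database: Optional[str]
    schema: str
    identifier: str
    type: str  # "view", "table", etc.


def relation_event_payload(rel: FakeRelation) -> dict:
    # Emit only the fields useful to log consumers, not every attribute
    # an adapter-specific Relation subclass might carry.
    return {
        "database": rel.database,
        "schema": rel.schema,
        "identifier": rel.identifier,
        "relation_type": rel.type,
    }


payload = relation_event_payload(
    FakeRelation("analytics", "dbt", "my_model", "table")
)
```

Keeping the payload to an explicit allowlist of fields also sidesteps the serialization nightmare for adapter-specific Relation subclasses.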

For now, the only really valuable information included on the relation object is database location (database.schema.identifier) and relation type (view, table, etc). However, I could see doing two things to make this very valuable, for which this logging lays necessary groundwork:

  • Additional fields like columns (with data types) and table statistics (maybe even column-level stats, too) — a.k.a. the same basic contract as CatalogInfo
  • Steps at the end of each materialization that describe the just-built relation, to populate those fields, which will then be logged out once the materialization completes

Put it all together, and we'll be able to provide realer-time access to catalog info, rather than trying to grab it all in one big memory-intensive batch during docs generate.
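One possible shape for that richer relation event, layering CatalogInfo-style columns and statistics on top of the basic location/type fields. Every name here is hypothetical, meant only to show how the event could stay serializable:

```python
# Hypothetical richer "relation built" event with catalog-style fields.
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class ColumnInfo:
    name: str
    data_type: str


@dataclass
class RelationBuilt:
    database: Optional[str]
    schema: str
    identifier: str
    relation_type: str
    # Populated by describe steps at the end of the materialization.
    columns: Dict[str, ColumnInfo] = field(default_factory=dict)
    stats: Dict[str, float] = field(default_factory=dict)


evt = RelationBuilt(
    "analytics", "dbt", "orders", "table",
    columns={"id": ColumnInfo("id", "integer")},
    stats={"row_count": 1000.0},
)
serialized = asdict(evt)  # nested dataclasses become plain dicts
```

Because dataclasses.asdict recurses into nested dataclasses, the whole event reduces to plain dicts ready for JSON logging.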

@jtcohen6
Contributor

I've come around on this, given the way that we've implemented Python on some adapters. The statement macro has become the "gateway" for both SQL + Python code, and it will call either adapter.execute or adapter.submit_python_job (on some adapters). Rather than require logging in both of those methods, trusting adapter maintainers to implement it for us, we should have that logging in one place — statement.

We'd still want to handle cases where:

  • We hook into adapter.execute directly today (linked above). It's theoretically possible to call adapter.execute directly from the Jinja context, bypassing the statement / run_query situation — though it's not something we document.
  • It's theoretically possible for an end user to override the statement / run_query macro... though I think a LOT of things would break :)

@ChenyuLInx
Contributor

In Python models, we currently add a decorator to log things.
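The decorator approach mentioned above can be sketched as wrapping the model function so structured start/finish records are emitted around it. The record format and names here are illustrative, not dbt-core's actual implementation:

```python
# Sketch of a logging decorator for Python model execution.
import functools
from typing import Any, Callable, List

emitted: List[dict] = []  # stand-in for the structured log stream


def log_calls(fn: Callable[..., Any]) -> Callable[..., Any]:
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        emitted.append({"event": "start", "name": fn.__name__})
        try:
            return fn(*args, **kwargs)
        finally:
            # try/finally guarantees a "finish" record even on failure.
            emitted.append({"event": "finish", "name": fn.__name__})
    return wrapper


@log_calls
def run_model() -> str:
    return "ok"


result = run_model()
```

This keeps the logging in one place regardless of what the wrapped function does, which is the same single-choke-point argument made for the statement macro.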

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Jan 25, 2023
@jtcohen6 jtcohen6 self-assigned this Jan 25, 2023
@jtcohen6
Contributor

jtcohen6 commented Jan 25, 2023

Assigning myself to reopen this as 2-3 separate issues, then close this one.

@jtcohen6
Contributor

Closing in favor of #6731, #6732, #6733

@jtcohen6 jtcohen6 closed this as not planned Jan 25, 2023
@jtcohen6 jtcohen6 removed their assignment Jan 25, 2023