
[CT-711] [Draft] Add Structured Logs to Materializations #5325

Closed
2 tasks
nathaniel-may opened this issue Jun 1, 2022 · 6 comments
Labels
enhancement New feature or request logging stale Issues that have gone stale

Comments

@nathaniel-may
Contributor

nathaniel-may commented Jun 1, 2022

Motivation

Today, run_results.json is a static artifact written at the end of a run. It contains a summary of the objects that dbt run created or updated, along with an incomplete view of the SQL from each materialization. This information should be more granular and visible during a run. Structured logging is a formal interface for programmatically consuming metadata about a dbt run, and it is where this more granular information will be streamed to outside consumers.

Context

The materialization macro and the adapter functions it calls are natural places to put these log lines. However, many adapters override materializations directly (e.g. snowflake, bigquery, spark), and others override key adapter functions such as adapter.execute (e.g. bigquery), which would require these and future adapter maintainers to remember to copy our log lines exactly. Requiring adapter maintainers to take direct action to maintain the reliability of the structured logging interface is something we'd like to avoid, which is why this ticket puts these log lines before and after the call to execute in statement.sql (source). Since statement.sql is not a Python file, structured logging must be accessible to jinja, and today it is not.

Implementation

  • Create a new function in the jinja context, log_event, modeled on log (source), so that instead of taking a string msg, it takes a string event_name and a list of values of type List[Any]. This will fire the event with the same name as the string passed, constructed from the parameters in the list. (Using Event.__subclasses__() and the splat operator for lists (i.e. *[foo, bar]) may help here.) Mypy will not be able to catch any type mismatches here, since the event constructors aren't passed concrete values until runtime. Since these log lines will run with every materialization, tests should catch any issues with them.
  • Update SQLQueryStatus to take the full adapter_response object and the node name. Today we stringify the adapter response first, making it much harder to consume. Special attention will have to be paid to serialization, since each warehouse will put different values into its own responses. Using examples from today's warehouses as inputs to serialization tests is a good idea. The name of the currently running node is accessible via the "jinja god context" (TODO: link example of how to access the jinja god context) (TODO: adapter.execute is called in other places too. We should add log lines there too. Link exact lines here.)
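A minimal Python sketch of the name-based dispatch described in the first bullet, using Event.__subclasses__() and list splatting. The Event and SQLQueryStatus classes here are illustrative stand-ins, not dbt-core's real event hierarchy:

```python
# Hypothetical sketch of a `log_event` jinja-context function. All class
# and field names are illustrative, not dbt-core's real ones.
from typing import Any, List


class Event:
    """Minimal stand-in for a structured-logging event base class."""


class SQLQueryStatus(Event):
    def __init__(self, status: str, elapsed: float) -> None:
        self.status = status
        self.elapsed = elapsed


def log_event(event_name: str, params: List[Any]) -> Event:
    # Look up the event class by name among Event's subclasses, then
    # construct it by splatting the parameter list (*params). Mypy cannot
    # check these constructor arguments, since they only exist at runtime.
    for cls in Event.__subclasses__():
        if cls.__name__ == event_name:
            return cls(*params)
    raise ValueError(f"unknown event: {event_name}")


event = log_event("SQLQueryStatus", ["SUCCESS 1", 0.42])
```

As the bullet notes, a typo in the event name or a wrong parameter list only fails at runtime, which is why test coverage over every materialization path matters here.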
@nathaniel-may nathaniel-may added enhancement New feature or request triage labels Jun 1, 2022
@github-actions github-actions bot changed the title [Draft] Add Structured Logs to Materializations [CT-711] [Draft] Add Structured Logs to Materializations Jun 1, 2022
@jtcohen6
Contributor

jtcohen6 commented Jun 1, 2022

Requiring adapter maintainers to take direct action to maintain the reliability of the structured logging interface is something we'd like to avoid

Agree in principle. I'm on the fence about sticking this in the statement macro, vs. just expanding the SQLQuery event fired in add_query. This would indeed require some work in dbt-bigquery's custom execute, which doesn't feel great, but... almost all adapters out in the world inherit from the "SQL" adapter. [Update: I remembered that dbt-snowflake also overrides add_query, though it does ultimately call super().add_query as well.]

Definitely curious to hear Gerda's thoughts on how to factor this code. Also tagging @dataders for visibility / food for thought: To what extent must we shy away from high-impact logging, or other important metadata, living in the adapter?

TODO: adapter.execute is called in other places too. We should add log lines there too. Link exact lines here.

The important ones I can find:

We use it in our functional test utilities as well, here and here

Logging relation info as soon as it's built

This was another thing we discussed earlier today, related to this effort but possibly its own ticket

Let's add structured logging about the relations produced by the materialization, as soon as it finishes building them (here). Materializations already return the set of relations they create/update, for the purposes of updating dbt's cache. Why not share the wealth with programmatic consumers of dbt metadata?

(Will serializing Relation objects be an absolute nightmare? Relation objects can be reimplemented by adapter, of course, though they all inherit from BaseRelation, which should be serializable. Even so, we may not want all the object attributes included in the logging event—probably just a subset.)
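A sketch of the "just a subset" idea above: serialize only the fields log consumers need, rather than every attribute on the relation object. FakeRelation here is a hypothetical stand-in for a BaseRelation-like object; field names are illustrative:

```python
# Serialize a subset of relation attributes for a logging event,
# assuming a BaseRelation-like object. Names are illustrative.
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeRelation:
    database: Optional[str]
    schema: str
    identifier: str
    type: str  # "view", "table", etc.


def relation_event_payload(rel: FakeRelation) -> dict:
    # Emit only the fields useful to log consumers, not every attribute
    # an adapter-specific Relation subclass might carry.
    return {
        "database": rel.database,
        "schema": rel.schema,
        "identifier": rel.identifier,
        "relation_type": rel.type,
    }


payload = relation_event_payload(
    FakeRelation("analytics", "dbt", "my_model", "table")
)
```

Keeping the payload to an explicit allowlist of fields also sidesteps the serialization nightmare for adapter-specific Relation subclasses.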

For now, the only really valuable information included on the relation object is database location (database.schema.identifier) and relation type (view, table, etc). However, I could see doing two things to make this very valuable, for which this logging lays necessary groundwork:

  • Additional fields like columns (with data types) and table statistics (maybe even column-level stats, too) — a.k.a. the same basic contract as CatalogInfo
  • Steps at the end of each materialization that describe the just-built relation, to populate those fields, which will then be logged out once the materialization completes

Put it all together, and we'll be able to provide realer-time access to catalog info, rather than trying to grab it all in one big memory-intensive batch during docs generate.
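One possible shape for that richer relation event, layering CatalogInfo-style columns and statistics on top of the basic location/type fields. Every name here is hypothetical, meant only to show how the event could stay serializable:

```python
# Hypothetical richer "relation built" event with catalog-style fields.
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class ColumnInfo:
    name: str
    data_type: str


@dataclass
class RelationBuilt:
    database: Optional[str]
    schema: str
    identifier: str
    relation_type: str
    # Populated by describe steps at the end of the materialization.
    columns: Dict[str, ColumnInfo] = field(default_factory=dict)
    stats: Dict[str, float] = field(default_factory=dict)


evt = RelationBuilt(
    "analytics", "dbt", "orders", "table",
    columns={"id": ColumnInfo("id", "integer")},
    stats={"row_count": 1000.0},
)
serialized = asdict(evt)  # nested dataclasses become plain dicts
```

Because dataclasses.asdict recurses into nested dataclasses, the whole event reduces to plain dicts ready for JSON logging.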

@jtcohen6
Contributor

I've come around on this, given the way that we've implemented Python on some adapters. The statement macro has become the "gateway" for both SQL + Python code, and it will call either adapter.execute or adapter.submit_python_job (on some adapters). Rather than require logging in both of those methods, trusting adapter maintainers to implement it for us, we should have that logging in one place — statement.

We'd still want to handle cases where:

  • We hook into adapter.execute directly today (linked above). It's theoretically possible to call adapter.execute directly from the Jinja context, bypassing the statement / run_query situation — though it's not something we document.
  • It's theoretically possible for an end user to override the statement / run_query macro... though I think a LOT of things would break :)

@ChenyuLInx
Contributor

In Python models, we currently add a decorator to log things.
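The decorator approach mentioned above can be sketched as wrapping the model function so structured start/finish records are emitted around it. The record format and names here are illustrative, not dbt-core's actual implementation:

```python
# Sketch of a logging decorator for Python model execution.
import functools
from typing import Any, Callable, List

emitted: List[dict] = []  # stand-in for the structured log stream


def log_calls(fn: Callable[..., Any]) -> Callable[..., Any]:
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        emitted.append({"event": "start", "name": fn.__name__})
        try:
            return fn(*args, **kwargs)
        finally:
            # try/finally guarantees a "finish" record even on failure.
            emitted.append({"event": "finish", "name": fn.__name__})
    return wrapper


@log_calls
def run_model() -> str:
    return "ok"


result = run_model()
```

This keeps the logging in one place regardless of what the wrapped function does, which is the same single-choke-point argument made for the statement macro.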

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Jan 25, 2023
@jtcohen6 jtcohen6 self-assigned this Jan 25, 2023
@jtcohen6
Contributor

jtcohen6 commented Jan 25, 2023

Assigning myself to reopen this as 2-3 separate issues, then close this one.

@jtcohen6
Contributor

Closing in favor of #6731, #6732, #6733

@jtcohen6 jtcohen6 closed this as not planned Jan 25, 2023
@jtcohen6 jtcohen6 removed their assignment Jan 25, 2023