Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
What changes are included in this PR?
Only commit 2a02e55 is relevant - I branched off #482 which is unmerged and so the commit for that PR shows up in here too for now.
Implements an initial set of traces and metrics on the table scan plan phase.
Approach guided by prior discussions on #482 and on monthly Iceberg-rust community sync discussion a few months ago.
Are these changes tested?
An integration test has been added that also functions as a mini example. Jaeger has been added to the integration test docker compose stack, and the test exports traces to it which can be viewed by browsing to Jaeger. If you're running the integration tests, setting the env var
ICEBERG_INTEG_TEST_PERSISTENT_DOCKER_STACK=1
will ensure that the docker stack is kept up between test runs so that you can view traces on the Jaeger container by browsing to http://localhost:16686Configuration and Example Output
info
level tracesHere's an example of how the traces look with
RUST_LOG
specifyingiceberg=info
. The aim at this level is for the trace to only contain spans that represent the calls made to the public API.We see just a single span for
plan_files
. There are attributes on the span for the scan predicate, the field ids, case sensitivity, and the scan's snapshot id.debug
level tracesWith
RUST_LOG
settingiceberg=debug
, the trace is more verbose, but not overly so. This is intended as a good default for when you want to see how long the major sub-tasks of a file plan take, as well as some info on file pruning. Here's what the trace at this level looks like when expanded just one level:Expanding one more level, we can see the individual data file manifest spans - useful for seeing how many manifests there are, how many are processed in parallel, and what kind of variance there is between the retrieval and processing times of each.
Expanding the detail for one of those manifests, we see attributes detailing the file path of the manifest in question and the number of entries within it.
Inspecting the
object_cache.get_manifest
span shows us that this was a cache miss (kind of obvious given the span duration, but useful when analysing traces in aggregate).Finally, looking at the deepest levels in the
debug
level trace shows us more detail for a given manifest. We see the time taken to retrieve that manifest, plus individual spans representing each manifest entry / data file. The expanded details for each of these contain the data file path, plus info on the filtering applied. We see here for this file that it was filtered out because its partition did not match the predicate.trace
level tracesThe most detailed level of trace is enabled by specifying
iceberg=trace
inRUST_LOG
. This may generate traces that are too large for general use but could be useful when activated selectively for troubleshooting specific issues.At this level we get spans for the
eval
calls on the expression evaluator and metrics evaluator, showing the time taken to evaluate the predicate. Not usually useful but I've found it occasionally useful when I've received pathologically large queries on very fragmented tables.