GetMetrics use second pass #2765

mdisibio · 2023-08-04T14:41:47Z

What this PR does:
There is a subtle behavior issue in the metrics summary api (traceqlmetrics/GetMetrics) that needs to be fixed, but doing so isn't tenable without also addressing the memory inefficiencies in the second pass layer in the parquet encodings.

**Main goal**
The metrics summary API currently puts the groupBy conditions in the first pass, so grouping doesn't quite work as expected. If you group by span.foo (which doesn't exist), it should return a series with nil values (all the spans matching the query that don't have this attribute), but instead it returns nothing. You can think of it as the difference between these two queries:

{ span.foo }

vs:

{ } | select(span.foo)

We need it to operate like the latter. So this updates traceqlmetrics.GetMetrics to move the groupBy to the second pass (the same as select()).
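To make the difference concrete, here is a minimal sketch in Go. It is not Tempo's code: `Span`, `groupFirstPass`, `groupSecondPass`, and the `<nil>` label are hypothetical names used only for illustration. The first function drops spans that lack the attribute before grouping (like `{ span.foo }`), while the second keeps every matching span and files the ones without the attribute under a nil group (like `{ } | select(span.foo)`).

```go
package main

import "fmt"

// Span is a hypothetical, simplified span: a name plus string attributes.
type Span struct {
	Name  string
	Attrs map[string]string
}

// nilGroup is the label this sketch uses for spans that lack the attribute.
const nilGroup = "<nil>"

// groupFirstPass mimics a groupBy condition in the first pass: spans
// without the attribute are filtered out before any grouping happens.
func groupFirstPass(spans []Span, attr string) map[string]int {
	counts := map[string]int{}
	for _, s := range spans {
		v, ok := s.Attrs[attr]
		if !ok {
			continue // dropped, as with { span.foo }
		}
		counts[v]++
	}
	return counts
}

// groupSecondPass mimics a groupBy in the second pass: every matching span
// is kept, and spans without the attribute land in the nil group.
func groupSecondPass(spans []Span, attr string) map[string]int {
	counts := map[string]int{}
	for _, s := range spans {
		v, ok := s.Attrs[attr]
		if !ok {
			v = nilGroup // kept, as with { } | select(span.foo)
		}
		counts[v]++
	}
	return counts
}

func main() {
	spans := []Span{
		{Name: "a", Attrs: map[string]string{"bar": "1"}},
		{Name: "b", Attrs: map[string]string{"bar": "2"}},
	}
	fmt.Println(len(groupFirstPass(spans, "foo")))       // prints 0: no series at all
	fmt.Println(groupSecondPass(spans, "foo")[nilGroup]) // prints 2: one nil series covering both spans
}
```

With no span carrying `foo`, the first-pass version returns an empty result, and the second-pass version returns a single nil series, which is the behavior this PR is after.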

**Memory impact**
Doing that started hitting a lot of memory inefficiencies in the second pass layer. This isn't normally an issue because searches typically hit a smallish number of spansets (1000 or less), but the metrics summary API will hit millions. There is a ~10x increase in memory in BenchmarkBackendBlockGetMetrics without these changes.

There are two main changes:

1. Export the pooling methods in parquetquery to allow external iterators to get things from the pool and put them back. The rebatchIterator and bridgeIterator need to do this. They were creating a lot of IteratorResults outside the pool.
2. Add a Release method to traceql.Spanset so the final user of the data (traceqlmetrics.GetMetrics) can return the spansets to the pools. I tried a couple of variations of this and I'm happy with how this ended up. It's optional, and things fall back to GC if it is never used (in fact, I didn't update any other traceql areas to use it).
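The two changes above follow a common Go pooling pattern, sketched below under loud assumptions: `Result`, `GetResult`, `PutResult`, `Spanset`, `releaseFn`, and `newSpanset` are hypothetical stand-ins, not the actual parquetquery or traceql API. The sketch shows exported get/put helpers around a `sync.Pool`, plus an optional `Release` method that the final consumer may call; if it never does, the objects are simply garbage collected.

```go
package main

import (
	"fmt"
	"sync"
)

// Result is a hypothetical stand-in for a pooled iterator result.
type Result struct {
	Values []int
}

var resultPool = sync.Pool{
	New: func() any { return &Result{} },
}

// GetResult and PutResult are exported so that iterators living in other
// packages can reuse pooled objects instead of allocating new ones.
func GetResult() *Result {
	return resultPool.Get().(*Result)
}

func PutResult(r *Result) {
	r.Values = r.Values[:0] // reset length, keep capacity for reuse
	resultPool.Put(r)
}

// Spanset is a hypothetical, simplified spanset backed by a pooled Result.
type Spanset struct {
	res       *Result
	releaseFn func(*Spanset) // optional hook set by whoever owns the pool
}

// Release returns pooled resources. Calling it is optional: a spanset that
// is never released is reclaimed by the garbage collector as usual.
func (s *Spanset) Release() {
	if s.releaseFn != nil {
		s.releaseFn(s)
	}
}

func newSpanset(vals ...int) *Spanset {
	r := GetResult()
	r.Values = append(r.Values, vals...)
	return &Spanset{
		res: r,
		releaseFn: func(ss *Spanset) {
			PutResult(ss.res)
			ss.res = nil
			ss.releaseFn = nil // makes a second Release a no-op
		},
	}
}

func sum(ss *Spanset) int {
	total := 0
	for _, v := range ss.res.Values {
		total += v
	}
	return total
}

func main() {
	total := 0
	for i := 0; i < 3; i++ {
		ss := newSpanset(i, i+1)
		total += sum(ss)
		ss.Release() // final consumer hands the spanset back to the pool
	}
	fmt.Println(total) // prints 9
}
```

Putting the hook on the spanset rather than on a wider interface means callers that don't care about pooling need no changes, which matches the "optional, falls back to GC" design described above.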

Which issue(s) this PR fixes:
Fixes n/a

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

joe-elliott

ci is mad at you, but lgtm.

this should improve memory usage on traceql as well?

mdisibio · 2023-08-09T11:14:55Z

> this should improve memory usage on traceql as well?

Right, since it is all the same pipeline, any query using select() will also benefit, but to a lesser degree: as mentioned, searches typically don't inspect as many spansets as metrics calls.

* [DOC] Update screenshots tracing (#2767)
  * Change screenshots to use GCP
  * Remove conflict images
* Update grafana/dskit dependency (#2773)
  * Update grafana/dskit dependency
  * go mod tidy all the go.mod's
  * Update dskit again to catch patch fix
  * Check in all of vendor/
* GetMetrics use second pass (#2765)
  * traceqlmetris use second pass for more correct output. However super messy memory pooling required to keep things from exploding
  * Expose parquetquery methods better
  * comment
  * Move Release() to the spanset instead of the Fetcher interface, much less churn
  * cleanup, simplification
  * cleanup
  * Small cleanup
  * Apply same pooling changes to vparquet3. Update benchmark to verify expected test block version
  * fix test
  * changelog
* Refactor user-configurable overrides API and client; add detailed logging (#2755)
  * Refactor user-configurable overrides API and client
  * Refactor user-configurable overrides API and client
  * Bug smash
  * Replace weaveworks imports
  * 🤦
  * Address why GCS is faling e2e test
  * Add patch to e2e test
  * Sprinkle some more println in tests
* Update various OTel dependencies (#2778)
  * Update various OTel dependencies
  * Breaking changes, fix compilation issues

---------

Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>
Co-authored-by: Martin Disibio <martin.disibio@grafana.com>

mdisibio added 8 commits August 3, 2023 16:50

traceqlmetris use second pass for more correct output. However super messy memory pooling required to keep things from exploding

b14ee7a

Expose parquetquery methods better

b43d413

comment

a026af3

Move Release() to the spanset instead of the Fetcher interface, much less churn

f5a0b74

cleanup, simplification

379a7b0

cleanup

d9a8013

Small cleanup

c3bcade

Apply same pooling changes to vparquet3. Update benchmark to verify expected test block version

14cbdcd

mdisibio requested review from joe-elliott, annanay25, mapno, kvrhdn, zalegrala, electron0zero, ie-pham and stoewer as code owners August 4, 2023 14:41

fix test

85e306a

joe-elliott approved these changes Aug 7, 2023

Merge branch 'main' into metrics-mem-pooling

a809f9d

changelog

e52e65e

mdisibio merged commit abd742a into grafana:main Aug 9, 2023
14 checks passed

knylander-grafana mentioned this pull request Aug 18, 2023

[DOC] Tempo 2.2.1 release notes #2811

Closed
