Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GetMetrics use second pass #2765

Merged
merged 11 commits into from
Aug 9, 2023
Merged

Conversation

mdisibio
Copy link
Contributor

@mdisibio mdisibio commented Aug 4, 2023

What this PR does:
There is a subtle behavior issue in the metrics summary api (traceqlmetrics/GetMetrics) that needs to be fixed, but doing so isn't tenable without also addressing the memory inefficiencies in the second pass layer in the parquet encodings.

** Main goal **
The metrics summary api is currently putting the groupBy conditions in the first pass, so this isn't quite working as expected. If you group by span.foo (which doesn't exist), it should return a series with nil values (all the spans matching the query that don't have this attribute), but instead it returns nothing. You can think of it as the difference between these two queries:

{ span.foo }

vs:

{ } | select(span.foo)

We need it to operate like the latter. So this updates traceqlmetrics.GetMetrics to move the groupBy to the second pass (the same as select()).

** Memory impact **
Doing that starting hitting a lot of memory inefficiencies in the second pass layer. This isn't normally an issue because searches typically hit a smallish number of spansets (1000 or less) but the metrics summary api will hit millions. There is ~10x increase in memory in BenchmarkBackendBlockGetMetrics without these changes.

There are two main changes:

  1. Export the pooling methods in parquetquery to allow external iterators to get and put things back to the pool. The rebatchIterator and bridgeIterator need to do this. They were creating a lot of IteratorResults outside the pool.
  2. Add a Release method to traceql.Spanset so the final user of the data (traceqlmetrics.GetMetrics) can return the spansets back to the pools. I tried a couple of variations of this and I'm happy with how this ended up. It's optional and things fallback to GC if this is never used (and in fact I didn't update any other traceql areas to use it.)

Which issue(s) this PR fixes:
Fixes n/a

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Copy link
Member

@joe-elliott joe-elliott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ci is mad at you, but lgtm.

this should improve memory usage on traceql as well?

@mdisibio
Copy link
Contributor Author

mdisibio commented Aug 9, 2023

this should improve memory usage on traceql as well?

Right, since it is all the same pipeline any query using select() will also benefit. But to a lesser degree as mentioned because searches typically don't inspect as many spansets as metrics calls.

@mdisibio mdisibio merged commit abd742a into grafana:main Aug 9, 2023
14 checks passed
kvrhdn added a commit that referenced this pull request Aug 10, 2023
* [DOC] Update screenshots tracing (#2767)

* Change screenshots to use GCP

* Remove conflict images

* Update grafana/dskit dependency (#2773)

* Update grafana/dskit dependency

* go mod tidy all the go.mod's

* Update dskit again to catch patch fix

* Check in all of vendor/

* GetMetrics use second pass (#2765)

* traceqlmetris use second pass for more correct output.  However super messy memory pooling required to keep things from exploding

* Expose parquetquery methods better

* comment

* Move Release() to the spanset instead of the Fetcher interface, much less churn

* cleanup, simplification

* cleanup

* Small cleanup

* Apply same pooling changes to vparquet3.  Update benchmark to verify expected test block version

* fix test

* changelog

* Refactor user-configurable overrides API and client; add detailed logging (#2755)

* Refactor user-configurable overrides API and client

* Refactor user-configurable overrides API and client

* Bug smash

* Replace weaveworks imports

* 🤦

* Address why GCS is faling e2e test

* Add patch to e2e test

* Sprinkle some more println in tests

* Update various OTel dependencies (#2778)

* Update various OTel dependencies

* Breaking changes, fix compilation issues

---------

Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>
Co-authored-by: Martin Disibio <martin.disibio@grafana.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants