Metadata caching for parquet blocks #1564
we're kind of "forking" the read path here. We're sending special queries off to the backend reader and other queries to the in-memory buffered reader. Is there a cleaner implementation here?
It feels like all calls should go through the same layers:
Parquet File -> In memory cache -> Off process cache (for specific segments of the file) -> Object storage
Is there a clean way to achieve this?
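The layered read path described above can be sketched as a chain of io.ReaderAt wrappers, each falling through to the next layer on a cache miss. This is a minimal illustration, not Tempo's actual implementation; the type and field names (backendReaderAt, cachingReaderAt) are hypothetical, and a real off-process layer would talk to memcached or Redis rather than an in-process map:

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// backendReaderAt stands in for the object-storage reader (the layer
// Tempo's BackendReaderAt occupies). It counts calls so we can observe
// how often a read actually reaches the backend.
type backendReaderAt struct {
	data  string
	calls int
}

func (b *backendReaderAt) ReadAt(p []byte, off int64) (int, error) {
	b.calls++
	return strings.NewReader(b.data).ReadAt(p, off)
}

// cachingReaderAt wraps another io.ReaderAt and memoizes ranges it has
// already fetched, so reads only fall through on a miss.
type cachingReaderAt struct {
	next  io.ReaderAt
	cache map[int64][]byte
}

func newCachingReaderAt(next io.ReaderAt) *cachingReaderAt {
	return &cachingReaderAt{next: next, cache: map[int64][]byte{}}
}

func (c *cachingReaderAt) ReadAt(p []byte, off int64) (int, error) {
	if buf, ok := c.cache[off]; ok && len(buf) >= len(p) {
		return copy(p, buf), nil
	}
	n, err := c.next.ReadAt(p, off)
	if err == nil {
		c.cache[off] = append([]byte(nil), p[:n]...)
	}
	return n, err
}

func main() {
	backend := &backendReaderAt{data: "parquet-file-bytes"}
	offProcess := newCachingReaderAt(backend)  // off-process cache layer
	inMemory := newCachingReaderAt(offProcess) // per-process buffer

	buf := make([]byte, 7)
	inMemory.ReadAt(buf, 0) // misses both cache layers, hits the backend
	inMemory.ReadAt(buf, 0) // served from the in-memory layer
	fmt.Println(string(buf), backend.calls) // backend was hit only once
}
```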
All calls are going through the same layers, i.e. Tempo Querier -> Cached Reader -> Backend reader -> Object storage. The same underlying BackendReaderAt is called each time. If you look at this piece of logic here - https://github.com/grafana/tempo/pull/1564/files#diff-6c42681fadfdec225eb9164d0d49a2fe2033bb8ff0e820ecbd5010bb0bb629d8R37-R51 - the difference is that just one of the readers is buffered. We are just setting shouldCache to true in some special cases and leaving it as false for most other calls. Not sure if I'm missing something; we can also chat offline to discuss.
Yeah, I can't say I have great ideas here. I think what I expect in some perfect world is
However, I think any attempt to do this while also caching the special ranges would be difficult and would make other sections of code worse. My largest concern is that by adding this code we will be increasing calls to the backend especially in cases where off process cache is not configured. Here is a read listing on searching the cluster column:
The first read is the footer. Then it attempts to start reading the column in one of the row groups. You can see there is a buffered cache miss (hit = false) every time it jumps to a new row group. If that first call is grabbing some kind of section metadata, would it increase the number of calls to the backend because it wouldn't be hitting the buffered reader?
Maybe all I'm wanting here is a way to configure this on/off to assess impact.
Discussed offline, and went ahead with making cache settings configurable. Let me know if this works!