Parquet page caching #3196

Merged (16 commits) on Dec 15, 2023
Conversation

@joe-elliott (Member) commented on Nov 30, 2023:

What this PR does:
Builds on #3166 to add parquet "page" level caching. Since reads from the parquet library are opaque, we don't attempt to determine whether a given read is actually a page. Instead we simply use the read buffer size, which controls how much data is pulled at once from object storage.
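To make that concrete, here is a minimal sketch of the idea, not the actual implementation: the Cache interface, cachingReaderAt type, and package name are assumptions, and the only behavior taken from this PR is that reads whose length matches the configured read buffer size are the ones that get cached.

package cachesketch // hypothetical package; everything below is illustrative

import (
	"fmt"
	"io"
)

// Cache is an assumed minimal cache interface for this sketch, not Tempo's
// actual cache API.
type Cache interface {
	Fetch(key string) ([]byte, bool)
	Store(key string, val []byte)
}

// cachingReaderAt wraps a reader backed by object storage. Reads from the
// parquet library are opaque, so instead of detecting pages it caches any
// read whose length equals the configured read buffer size.
type cachingReaderAt struct {
	name     string      // object name, used in the cache key
	inner    io.ReaderAt // reader backed by object storage
	cache    Cache
	readSize int // the configured read buffer size
}

func (r *cachingReaderAt) ReadAt(p []byte, off int64) (int, error) {
	// Reads of any other size bypass the cache entirely.
	if len(p) != r.readSize {
		return r.inner.ReadAt(p, off)
	}

	key := fmt.Sprintf("%s:%d:%d", r.name, off, len(p))
	if b, ok := r.cache.Fetch(key); ok && len(b) == len(p) {
		return copy(p, b), nil
	}

	n, err := r.inner.ReadAt(p, off)
	if err == nil {
		// Copy before storing; see "Points of discussion" below.
		buf := make([]byte, n)
		copy(buf, p[:n])
		r.cache.Store(key, buf)
	}
	return n, err
}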

Other fixes/changes:

  • Added a "none" cache role to allow reads to signal no caching (a hypothetical sketch follows this list).
  • Correctly add defaults for the background cache to the cache provider. This was missed in the last PR and was causing a panic when caching was enabled.
  • Fixed a bug that was preventing footer caching from actually doing anything.
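For the "none" role, a hypothetical sketch of how a read path might opt out of caching. It continues the illustrative Cache type from the sketch above; Role, RoleNone, and cacheProvider are made-up names, not Tempo's identifiers.

type Role string

const RoleNone Role = "none"

type cacheProvider struct {
	caches map[Role]Cache
}

// CacheFor returns the cache registered for a role. RoleNone always returns
// nil, which callers treat as "do not cache".
func (p *cacheProvider) CacheFor(role Role) Cache {
	if role == RoleNone {
		return nil
	}
	return p.caches[role]
}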

Points of discussion:

  • This change blurs the line between "caching policy" and "role". It's not clear to me whether the buffer-size check that determines if a read should be cached belongs here or in the cache layer.
  • Removed the buffered reader from the search path.
  • Pages must be copied before being placed into the cache queue. The write queue is async, and it's possible for parquet to reuse the buffer before it is written, resulting in corrupted data in the cache (see the sketch below).
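The sketch below illustrates the last point; cacheWrite and asyncCache are made-up names from the same illustrative package, not Tempo's types. The only behavior carried over from the PR is that the value must be copied before it is handed to an asynchronous write queue.

type cacheWrite struct {
	key string
	val []byte
}

type asyncCache struct {
	writes chan cacheWrite // drained by a background goroutine
}

func (c *asyncCache) Store(key string, val []byte) {
	// If val were enqueued directly, parquet could reuse the backing buffer
	// before the background writer runs, leaving corrupted data in the cache.
	// Copying first makes the queued value stable.
	buf := make([]byte, len(val))
	copy(buf, val)

	select {
	case c.writes <- cacheWrite{key: key, val: buf}:
	default:
		// Queue full: drop the write rather than block the read path.
	}
}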

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@knylander-grafana (Contributor) commented:

@joe-elliott Will we need docs for this? I can create a doc issue if so.

@joe-elliott (Member, Author) commented:

@knylander-grafana Good call. In this PR I'll document the existence of the new settings option.

For the next release (or before) we should consider the following docs regarding the caching changes:

  • Update the search performance tuning page to include the new cache options and how they can be used.
  • Migration docs from the old cache settings to the new ones.

@zalegrala (Contributor) left a comment:
This looks pretty good to me. Nice work.

@stoewer (Collaborator) left a comment:
LGTM

@@ -89,7 +89,16 @@ func (p *provider) starting(_ context.Context) error {
}

func (p *provider) stopping(_ error) error {
// we can only stop a cache once (or they panic). use this map
// to track which caches we've stopped.
stopped := map[cache.Cache]struct{}{}
@stoewer (Collaborator) commented on Dec 15, 2023:

I'm not sure I understand this correctly: is this necessary because p.caches contains the same cache multiple times?

@joe-elliott (Member, Author) replied:
Yup. If a cache has multiple roles we add it to the map multiple times here:

for _, role := range cacheCfg.Role {
p.caches[role] = c
}

So this is just a bit of bookkeeping to prevent stopping one twice.
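To make that concrete, a sketch of how stopping() would use the map. Only the stopped map and the per-role registration above come from the diff; the loop body itself is assumed.

func (p *provider) stopping(_ error) error {
	// we can only stop a cache once (or they panic). use this map
	// to track which caches we've stopped.
	stopped := map[cache.Cache]struct{}{}

	for _, c := range p.caches {
		// The same cache appears once per role it serves, so skip any
		// cache we've already stopped.
		if _, ok := stopped[c]; ok {
			continue
		}
		stopped[c] = struct{}{}
		c.Stop()
	}

	return nil
}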

@joe-elliott merged commit efb11ea into grafana:main on Dec 15, 2023
15 checks passed