cloud_storage: limit reader concurrency to avoid bad_allocs under random read loads #7042
Conversation
These will no longer be internal to remote_partition, once we start managing the population of materialized_segment_state objects (and their associated readers) from a central place to improve resource management.
…state This is necessary to be able to evict the segment and/or its readers from outside of remote_partition: we need to know which remote_partition::_segments to update.
This object lives inside `remote` which is passed around to all the right places as the "api" object: we may later refactor remote into a true api wrapper and move the put/get methods down to some sub-object that is identifiably a remote storage client.
This was previously storage::adjustable_allowance, used in storage_resources for concurrency limits that could be adjusted at runtime via configuration bindings. Renaming for clarity (as the noun 'allowance' didn't really say anything that 'semaphore' doesn't say more clearly). This will also be used in cloud_storage for similar purposes.
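For illustration, here is a minimal standalone sketch of what such an adjustable concurrency limit can look like. The names (`adjustable_semaphore`, `set_capacity`) are invented for this sketch and are not the real type; the actual implementation lives in the Redpanda tree and is presumably wired to a config binding's change callback.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Illustrative only: a semaphore whose capacity can be adjusted at runtime,
// mirroring the idea of a concurrency limit driven by a config binding.
class adjustable_semaphore {
public:
    explicit adjustable_semaphore(int64_t capacity)
      : _capacity(capacity), _available(capacity) {}

    // Called when the configured limit changes (e.g. from a binding's watch callback).
    void set_capacity(int64_t new_capacity) {
        std::lock_guard lk(_mu);
        _available += new_capacity - _capacity;
        _capacity = new_capacity;
        _cv.notify_all();
    }

    void acquire(int64_t units = 1) {
        std::unique_lock lk(_mu);
        _cv.wait(lk, [&] { return _available >= units; });
        _available -= units;
    }

    void release(int64_t units = 1) {
        std::lock_guard lk(_mu);
        _available += units;
        _cv.notify_all();
    }

private:
    std::mutex _mu;
    std::condition_variable _cv;
    int64_t _capacity;
    int64_t _available; // may go negative after shrinking the capacity
};
```

Shrinking the capacity lets the available count go negative, so in-flight work drains naturally before new acquisitions succeed.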
The default reader limit is rather generous and can easily overwhelm the RAM on a GB-per-core test configuration (fewer than 1000 readers is enough to exhaust it).
…_busy Now that we limit the reader concurrency, this should no longer throw bad_alloc.
Previously this was being instantiated and stopped, but never started. That resulted in segment/reader eviction never happening, now that `remote` contains the `materialized_segments` state.
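To illustrate the failure mode this commit fixes, here is a hedged, thread-based sketch (names invented; the real code runs as a Seastar background fiber): an eviction loop whose periodic trimming only runs once `start()` is called, so constructing and later `stop()`-ing it without ever starting it means idle readers and segments are never evicted.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Illustrative only: a background eviction loop that must be start()ed.
// If it is only constructed and stop()ed, the eviction callback never runs.
class eviction_loop {
public:
    explicit eviction_loop(std::function<void()> evict_idle)
      : _evict_idle(std::move(evict_idle)) {}

    void start() {
        _worker = std::jthread([this](std::stop_token st) {
            while (!st.stop_requested()) {
                _evict_idle(); // trim idle readers / segments
                std::this_thread::sleep_for(std::chrono::seconds(5));
            }
        });
    }

    void stop() {
        // Safe to call even if start() was never invoked; the bug described
        // above is exactly this: stop() without a preceding start().
        if (_worker.joinable()) {
            _worker.request_stop();
            _worker.join();
        }
    }

private:
    std::function<void()> _evict_idle;
    std::jthread _worker;
};
```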
Push: updated the unit-test commit to cover the tests added in the rebase.
LGTM once it's green/rebased
auto deadline = st.atime + max_idle;
if (now >= deadline && !st.segment->download_in_progress()) {
    if (st.segment.owned()) {
        vlog(
the log line should have an ntp because many partitions could use the same GC loop
Right, I'll add it in the follow-up PR.
// be disposed before the remote_partition it points to.
vassert(
  false,
  "materialized_segment_state outlived remote_partition");
pls add ntp
In this path we unfortunately can't, because the parent pointer is needed to fetch the ntp
Cover letter
Previously, there was no limit on how many remote_segment_batch_reader objects might be instantiated at the same time. These are comparatively heavyweight objects.
Readers are re-used when a client consumes a contiguous range (i.e. next fetch picks up the reader from the previous fetch and continues it), but clients can also do random reads, picking offsets and issuing a single fetch there. In the random read case, the population of readers could grow to the limit of one per offset. Collection of stale readers only happens when a segment is hydrated (which is never if all the segments are already hydrated), or after a segment's TTL is exceeded (which is also never if there are continuous fetch requests that keep touching the segment atime).
For a more robust bound on the number of readers that exist at any one time, the tracking of `materialized_segment_state` is moved from each individual `remote_partition` into a new `materialized_segments` object that is responsible for managing all the temporary read state for partitions. Readers themselves now carry semaphore units that belong to a central semaphore whose initial size is set by a new config property, `cloud_storage_max_readers_per_shard`.

Fixes #6111
Fixes #6023
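A rough sketch of the units-per-reader pattern described above, with invented names (`reader_units`, `max_readers`): the actual code uses Seastar semaphore units on a semaphore sized by the `cloud_storage_max_readers_per_shard` binding, but the RAII shape is the same.

```cpp
#include <semaphore>
#include <utility>

// Illustrative only: each live reader holds one unit from a shard-wide
// semaphore, so the reader population can never exceed the configured limit.
inline std::counting_semaphore<> max_readers{1000}; // stand-in for the config value

class reader_units {
public:
    reader_units() { max_readers.acquire(); } // waits if the shard is at its limit
    ~reader_units() {
        if (_held) {
            max_readers.release();
        }
    }
    reader_units(reader_units&& other) noexcept
      : _held(std::exchange(other._held, false)) {}
    reader_units(const reader_units&) = delete;
    reader_units& operator=(const reader_units&) = delete;
    reader_units& operator=(reader_units&&) = delete;

private:
    bool _held = true;
};

struct remote_segment_batch_reader_sketch {
    // Constructing a reader implicitly waits for capacity; destroying
    // (evicting) it returns that capacity to the shard-wide pool.
    reader_units units;
    // ... the reader's heavyweight state would live here ...
};
```

Because the units are held for the reader's whole lifetime, evicting a reader is what returns capacity to the pool, which is why centrally managed eviction in `materialized_segments` matters.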
Backport Required
If it becomes necessary for someone hitting this in the field, we might backport to 22.2, but otherwise we shouldn't: it's a rather big change for a backport.
UX changes
None
Release notes
Improvements
`cloud_storage_max_readers_per_shard` is added, which controls the maximum number of cloud storage reader objects that may exist per CPU core; this may be tuned downward to reduce memory consumption at the possible cost of read throughput. The default setting allows one reader per partition (i.e. the value of `topic_partitions_per_shard` is used).