
cloud_storage: limit reader concurrency to avoid bad_allocs under random read loads #7042

Merged: 10 commits merged into redpanda-data:dev from the issue-6111-global-materialized-state branch on Nov 4, 2022

Conversation

@jcsp jcsp (Contributor) commented Nov 1, 2022

Cover letter

Previously, there was no limit on how many remote_segment_batch_reader objects might be instantiated at the same time. These are comparatively heavyweight objects (see the sketch after the list below), containing:

  • A buffer of batches being read for the user (up to the size of the requested bytes in a fetch)
  • A batch parser, with its contained file input stream (this has a userspace buffer and does readahead)
  • All the fields of the struct itself.
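
To make the memory cost concrete, here is a rough sketch of the kind of state each reader pins; the struct and member names are hypothetical and do not reproduce the actual remote_segment_batch_reader definition:

    #include <seastar/core/circular_buffer.hh>
    #include <seastar/core/iostream.hh>
    #include <seastar/core/temporary_buffer.hh>
    #include <cstdint>
    namespace ss = seastar;

    // Hypothetical sketch only: each member can pin a non-trivial amount
    // of memory for as long as the reader stays materialized.
    struct reader_state_sketch {
        // Batches buffered for the consumer, up to the fetch's requested bytes.
        ss::circular_buffer<ss::temporary_buffer<char>> buffered_batches;
        // Stream over the hydrated segment file; it keeps its own userspace
        // buffer and performs readahead.
        ss::input_stream<char> segment_stream;
        // Plus the bookkeeping fields of the struct itself: offsets,
        // timestamps, reader configuration, and so on.
        int64_t next_offset{0};
    };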

Readers are re-used when a client consumes a contiguous range (i.e. next fetch picks up the reader from the previous fetch and continues it), but clients can also do random reads, picking offsets and issuing a single fetch there. In the random read case, the population of readers could grow to the limit of one per offset. Collection of stale readers only happens when a segment is hydrated (which is never if all the segments are already hydrated), or after a segment's TTL is exceeded (which is also never if there are continuous fetch requests that keep touching the segment atime).

For a more robust bound on the number of readers that exist at any one time, the tracking of materialized_segment_state is moved from each individual remote_partition into a new materialized_segments object that is responsible for managing all the temporary read state for partitions. Readers themselves now carry semaphore units that belong to a central semaphore which has an initial size set by a new config property, cloud_storage_max_readers_per_shard.
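
The mechanism is the standard Seastar semaphore-units pattern; a minimal sketch follows (the names and the hard-coded size are illustrative, not the actual Redpanda code):

    #include <seastar/core/future.hh>
    #include <seastar/core/semaphore.hh>
    #include <seastar/core/coroutine.hh>
    namespace ss = seastar;

    // One semaphore per shard; in the real system its size would come from
    // the cloud_storage_max_readers_per_shard property.
    ss::semaphore reader_limit{1000};

    ss::future<> make_reader_with_slot() {
        // Wait for a unit before constructing the heavyweight reader state,
        // so the number of live readers can never exceed the semaphore size.
        auto units = co_await ss::get_units(reader_limit, 1);
        // ... construct the reader and move `units` into it; the unit is
        // returned automatically when the reader is destroyed, e.g. by the
        // eviction/trim path ...
        co_return;
    }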

Fixes #6111
Fixes #6023

Backport Required

If it becomes necessary for someone hitting this in the field, we might backport to 22.2, but otherwise we shouldn't: it's a rather big change for a backport.

  • not a bug fix
  • issue does not exist in previous branches
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

None

Release notes

Improvements

  • Improved stability under random read workloads on tiered storage topics.
  • A new cluster configuration property cloud_storage_max_readers_per_shard is added, which controls the maximum number of cloud storage reader objects that may exist per CPU core: this may be tuned downward to reduce memory consumption at the possible cost of read throughput (see the example below). The default is one reader per partition (i.e. the value of topic_partitions_per_shard is used).
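
For example, the limit can be lowered through the usual cluster-config workflow (the value shown is purely illustrative, not a recommendation):

    rpk cluster config set cloud_storage_max_readers_per_shard 200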

@jcsp jcsp changed the title from "cloud_storage:" to "cloud_storage: limit reader concurrency" Nov 1, 2022
@jcsp jcsp changed the title from "cloud_storage: limit reader concurrency" to "cloud_storage: limit reader concurrency to avoid bad_allocs under random write loads" Nov 1, 2022
@jcsp jcsp changed the title from "cloud_storage: limit reader concurrency to avoid bad_allocs under random write loads" to "cloud_storage: limit reader concurrency to avoid bad_allocs under random read loads" Nov 1, 2022
@jcsp jcsp force-pushed the issue-6111-global-materialized-state branch from 6dc48ac to 58cd09c on November 1, 2022 14:00
@jcsp jcsp force-pushed the issue-6111-global-materialized-state branch from 58cd09c to 91d85df on November 1, 2022 20:17
@mmedenjak mmedenjak added the kind/bug (Something isn't working) and area/cloud-storage (Shadow indexing subsystem) labels and removed the area/redpanda label Nov 2, 2022
@jcsp jcsp force-pushed the issue-6111-global-materialized-state branch from 91d85df to 249e8da on November 2, 2022 11:32
@jcsp jcsp force-pushed the issue-6111-global-materialized-state branch 2 times, most recently from 394042c to c16fb12 on November 2, 2022 13:47
@jcsp jcsp marked this pull request as ready for review November 2, 2022 14:20
These will no longer be internal to remote_partition, once
we start managing the population of materialized_segment_state
objects (and their associated readers) from a central place
to improve resource management.
…state

This is necessary to be able to evict the segment and/or its readers
from outside of the remote partition: we need to know which
remote_partition::_segments to update.
@jcsp jcsp force-pushed the issue-6111-global-materialized-state branch from 0c21989 to eeedc3e on November 4, 2022 09:24
This object lives inside `remote` which is passed around
to all the right places as the "api" object: we may later
refactor remote into a true api wrapper and move the put/get
methods down to some sub-object that is identifiably a
remote storage client.
This was previously storage::adjustable_allowance, used
in storage_resources for concurrency limits that could
be adjusted at runtime via configuration bindings.

Renaming for clarity (as the noun 'allowance' didn't
really say anything that 'semaphore' doesn't say more
clearly).

This will also be used in cloud_storage for similar
purposes.
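
As a rough illustration of the runtime-adjustable semaphore idea (a sketch under assumed semantics; the names and layout are hypothetical and do not match the renamed class):

    #include <seastar/core/semaphore.hh>
    namespace ss = seastar;

    // Sketch of a semaphore whose capacity can be raised or lowered at
    // runtime, e.g. from a configuration-binding change callback.
    class adjustable_semaphore_sketch {
    public:
        explicit adjustable_semaphore_sketch(size_t capacity)
          : _capacity(capacity)
          , _sem(capacity) {}

        void set_capacity(size_t new_capacity) {
            if (new_capacity > _capacity) {
                // Growing: hand out the extra units immediately.
                _sem.signal(new_capacity - _capacity);
            } else if (new_capacity < _capacity) {
                // Shrinking: take units back; the count may go negative,
                // so existing holders drain before new waiters proceed.
                _sem.consume(_capacity - new_capacity);
            }
            _capacity = new_capacity;
        }

        ss::semaphore& sem() { return _sem; }

    private:
        size_t _capacity;
        ss::semaphore _sem;
    };
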
The default reader limit is rather generous, and can
easily overwhelm the RAM on a GB-per-core test
configuration (fewer than 1000 readers is enough to do so).
…_busy

Now that we limit the reader concurrency, this should not bad_alloc any
more.
Previously this was being instantiated and stopped, but
never started.  That results in segment/reader eviction
not happening, now that `remote` contains the `materialized_segments`
state.
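
A minimal sketch of why the missing start() call matters, assuming the usual Seastar start/stop service pattern (the class and member names are illustrative, not the actual materialized_segments code):

    #include <chrono>
    #include <seastar/core/future.hh>
    #include <seastar/core/timer.hh>
    #include <seastar/core/lowres_clock.hh>
    namespace ss = seastar;

    // If start() is never called, the timer is never armed and stale
    // segments/readers are never trimmed, no matter how often stop() runs.
    class materialized_segments_sketch {
    public:
        ss::future<> start() {
            _timer.set_callback([this] { trim(); });
            _timer.arm_periodic(std::chrono::seconds(5));
            return ss::make_ready_future<>();
        }
        ss::future<> stop() {
            _timer.cancel();
            return ss::make_ready_future<>();
        }

    private:
        void trim() { /* evict stale readers and segments here */ }
        ss::timer<ss::lowres_clock> _timer;
    };
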
@jcsp jcsp force-pushed the issue-6111-global-materialized-state branch from eeedc3e to 96f9d5c on November 4, 2022 09:34
@jcsp jcsp (Contributor Author) commented Nov 4, 2022

Push: updated the unit-test commit to cover the unit tests added in the rebase.

@Lazin Lazin (Contributor) left a comment

LGTM once it's green/rebased

auto deadline = st.atime + max_idle;
if (now >= deadline && !st.segment->download_in_progress()) {
    if (st.segment.owned()) {
        vlog(
Contributor

the log line should have an ntp because many partitions could use the same GC loop

Contributor Author

Right, I'll add it in the follow-up PR.

// be disposed before the remote_partition it points to.
vassert(
  false,
  "materialized_segment_state outlived remote_partition");
Contributor

pls add ntp

Contributor Author

In this path we unfortunately can't, because the parent pointer is needed to fetch the ntp

@jcsp jcsp merged commit e8f0853 into redpanda-data:dev Nov 4, 2022
@jcsp jcsp deleted the issue-6111-global-materialized-state branch November 4, 2022 12:03