
cloud_storage: limit reader concurrency to avoid bad_allocs under random read loads #7042

Merged: 10 commits merged into redpanda-data:dev from the issue-6111-global-materialized-state branch on Nov 4, 2022

Conversation

@jcsp jcsp (Contributor) commented Nov 1, 2022

Cover letter

Previously, there was no limit on how many remote_segment_batch_reader objects might be instantiated at the same time. These are comparatively heavyweight objects (see the sketch after the list below), containing:

  • A buffer of batches being read for the user (up to the size of the requested bytes in a fetch)
  • A batch parser, with its contained file input stream (this has a userspace buffer and does readahead)
  • All the fields of the struct itself.
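
To make the memory cost concrete, here is a rough sketch of the kind of state each reader pins; the struct and member names are hypothetical and do not reproduce the actual remote_segment_batch_reader definition:

    #include <seastar/core/circular_buffer.hh>
    #include <seastar/core/iostream.hh>
    #include <seastar/core/temporary_buffer.hh>
    #include <cstdint>
    namespace ss = seastar;

    // Hypothetical sketch only: each member can pin a non-trivial amount
    // of memory for as long as the reader stays materialized.
    struct reader_state_sketch {
        // Batches buffered for the consumer, up to the fetch's requested bytes.
        ss::circular_buffer<ss::temporary_buffer<char>> buffered_batches;
        // Stream over the hydrated segment file; it keeps its own userspace
        // buffer and performs readahead.
        ss::input_stream<char> segment_stream;
        // Plus the bookkeeping fields of the struct itself: offsets,
        // timestamps, reader configuration, and so on.
        int64_t next_offset{0};
    };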

Readers are re-used when a client consumes a contiguous range (i.e. next fetch picks up the reader from the previous fetch and continues it), but clients can also do random reads, picking offsets and issuing a single fetch there. In the random read case, the population of readers could grow to the limit of one per offset. Collection of stale readers only happens when a segment is hydrated (which is never if all the segments are already hydrated), or after a segment's TTL is exceeded (which is also never if there are continuous fetch requests that keep touching the segment atime).

For a more robust bound on the number of readers that exist at any one time, the tracking of materialized_segment_state is moved from each individual remote_partition into a new materialized_segments object that is responsible for managing all the temporary read state for partitions. Readers themselves now carry semaphore units that belong to a central semaphore which has an initial size set by a new config property, cloud_storage_max_readers_per_shard.
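
The mechanism is the standard Seastar semaphore-units pattern; a minimal sketch follows (the names and the hard-coded size are illustrative, not the actual Redpanda code):

    #include <seastar/core/future.hh>
    #include <seastar/core/semaphore.hh>
    #include <seastar/core/coroutine.hh>
    namespace ss = seastar;

    // One semaphore per shard; in the real system its size would come from
    // the cloud_storage_max_readers_per_shard property.
    ss::semaphore reader_limit{1000};

    ss::future<> make_reader_with_slot() {
        // Wait for a unit before constructing the heavyweight reader state,
        // so the number of live readers can never exceed the semaphore size.
        auto units = co_await ss::get_units(reader_limit, 1);
        // ... construct the reader and move `units` into it; the unit is
        // returned automatically when the reader is destroyed, e.g. by the
        // eviction/trim path ...
        co_return;
    }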

Fixes #6111
Fixes #6023

Backport Required

If it becomes necessary for someone hitting this in the field, we might backport to 22.2, but otherwise we shouldn't: it's a rather big change for a backport.

  • not a bug fix
  • issue does not exist in previous branches
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

None

Release notes

Improvements

  • Improved stability under random read workloads on tiered storage topics.
  • A new cluster configuration property cloud_storage_max_readers_per_shard is added, which controls the maximum number of cloud storage reader objects that may exist per CPU core: this may be tuned downward to reduce memory consumption at the possible cost of read throughput (see the example below). The default is one reader per partition (i.e. the value of topic_partitions_per_shard is used).
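
For example, the limit can be lowered through the usual cluster-config workflow (the value shown is purely illustrative, not a recommendation):

    rpk cluster config set cloud_storage_max_readers_per_shard 200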

@jcsp jcsp changed the title from "cloud_storage:" to "cloud_storage: limit reader concurrency" Nov 1, 2022
@jcsp jcsp changed the title from "cloud_storage: limit reader concurrency" to "cloud_storage: limit reader concurrency to avoid bad_allocs under random write loads" Nov 1, 2022
@jcsp jcsp changed the title from "cloud_storage: limit reader concurrency to avoid bad_allocs under random write loads" to "cloud_storage: limit reader concurrency to avoid bad_allocs under random read loads" Nov 1, 2022
@jcsp jcsp force-pushed the issue-6111-global-materialized-state branch from 6dc48ac to 58cd09c on November 1, 2022 14:00
@jcsp jcsp force-pushed the issue-6111-global-materialized-state branch from 58cd09c to 91d85df on November 1, 2022 20:17
@mmedenjak mmedenjak added the kind/bug (Something isn't working) and area/cloud-storage (Shadow indexing subsystem) labels and removed the area/redpanda label Nov 2, 2022
@jcsp jcsp force-pushed the issue-6111-global-materialized-state branch from 91d85df to 249e8da on November 2, 2022 11:32
@jcsp jcsp force-pushed the issue-6111-global-materialized-state branch 2 times, most recently from 394042c to c16fb12 on November 2, 2022 13:47
@jcsp jcsp marked this pull request as ready for review November 2, 2022 14:20
These will no longer be internal to remote_partition, once
we start managing the population of materialized_segment_state
objects (and their associated readers) from a central place
to improve resource management.
…state

This is necessary to be able to evict the segment and/or its readers
from outside of the remote partition: we need to know which
remote_partition::_segments to update.
@jcsp jcsp force-pushed the issue-6111-global-materialized-state branch from 0c21989 to eeedc3e on November 4, 2022 09:24
This object lives inside `remote` which is passed around
to all the right places as the "api" object: we may later
refactor remote into a true api wrapper and move the put/get
methods down to some sub-object that is identifiably a
remote storage client.
This was previously storage::adjustable_allowance, used
in storage_resources for concurrency limits that could
be adjusted at runtime via configuration bindings.

Renaming for clarity (as the noun 'allowance' didn't
really say anything that 'semaphore' doesn't say more
clearly).

This will also be used in cloud_storage for similar
purposes.
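
As a rough illustration of the runtime-adjustable semaphore idea (a sketch under assumed semantics; the names and layout are hypothetical and do not match the renamed class):

    #include <seastar/core/semaphore.hh>
    namespace ss = seastar;

    // Sketch of a semaphore whose capacity can be raised or lowered at
    // runtime, e.g. from a configuration-binding change callback.
    class adjustable_semaphore_sketch {
    public:
        explicit adjustable_semaphore_sketch(size_t capacity)
          : _capacity(capacity)
          , _sem(capacity) {}

        void set_capacity(size_t new_capacity) {
            if (new_capacity > _capacity) {
                // Growing: hand out the extra units immediately.
                _sem.signal(new_capacity - _capacity);
            } else if (new_capacity < _capacity) {
                // Shrinking: take units back; the count may go negative,
                // so existing holders drain before new waiters proceed.
                _sem.consume(_capacity - new_capacity);
            }
            _capacity = new_capacity;
        }

        ss::semaphore& sem() { return _sem; }

    private:
        size_t _capacity;
        ss::semaphore _sem;
    };
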
The default reader limit is rather generous, and can
easily overwhelm the RAM on a GB-per-core test
configuration (fewer than 1000 readers is enough to do so).
…_busy

Now that we limit the reader concurrency, this should not bad_alloc any
more.
Previously this was being instantiated and stopped, but
never started.  That results in segment/reader eviction
not happening, now that `remote` contains the `materialized_segments`
state.
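
A minimal sketch of why the missing start() call matters, assuming the usual Seastar start/stop service pattern (the class and member names are illustrative, not the actual materialized_segments code):

    #include <chrono>
    #include <seastar/core/future.hh>
    #include <seastar/core/timer.hh>
    #include <seastar/core/lowres_clock.hh>
    namespace ss = seastar;

    // If start() is never called, the timer is never armed and stale
    // segments/readers are never trimmed, no matter how often stop() runs.
    class materialized_segments_sketch {
    public:
        ss::future<> start() {
            _timer.set_callback([this] { trim(); });
            _timer.arm_periodic(std::chrono::seconds(5));
            return ss::make_ready_future<>();
        }
        ss::future<> stop() {
            _timer.cancel();
            return ss::make_ready_future<>();
        }

    private:
        void trim() { /* evict stale readers and segments here */ }
        ss::timer<ss::lowres_clock> _timer;
    };
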
@jcsp jcsp force-pushed the issue-6111-global-materialized-state branch from eeedc3e to 96f9d5c on November 4, 2022 09:34
@jcsp jcsp (Contributor Author) commented Nov 4, 2022

Push: updated the unit-test commit to cover the unit tests added in the rebase.

@Lazin Lazin (Contributor) left a comment

LGTM once it's green/rebased

auto deadline = st.atime + max_idle;
if (now >= deadline && !st.segment->download_in_progress()) {
    if (st.segment.owned()) {
        vlog(
Contributor

the log line should have an ntp because many partitions could use the same GC loop

Contributor Author

Right, I'll add it in the follow-up PR.

// be disposed before the remote_partition it points to.
vassert(
  false,
  "materialized_segment_state outlived remote_partition");
Contributor

pls add ntp

Contributor Author

In this path we unfortunately can't, because the parent pointer is needed to fetch the ntp

@jcsp jcsp merged commit e8f0853 into redpanda-data:dev Nov 4, 2022
@jcsp jcsp deleted the issue-6111-global-materialized-state branch November 4, 2022 12:03