
Limit memory used while fetching many partitions #10905

Merged: 12 commits merged into redpanda-data:dev on Jun 7, 2023

Conversation

@dlex (Contributor) commented May 20, 2023

A semaphore is added to kafka::server to control the memory used for the buffers in the fetch path. The semaphore is set to a configurable fraction of kafka shard memory (default 0.5). As the fetch work is delegated to different shards, the first partition requested from a shard is always served; other partitions may be skipped if the memory semaphore on that shard does not have enough units available. Memory is reserved based on an estimate of 1 MiB per batch (the default, configurable), and the reservation is adjusted as soon as the real size is known. The semaphore is local to the partition home shard.
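
For illustration, a minimal sketch of that per-shard decision, assuming a Seastar-style semaphore; the type and function names here (`shard_fetch_memory`, `should_read_partition`) are placeholders rather than the actual symbols in this PR:

```cpp
#include <seastar/core/semaphore.hh>

namespace ss = seastar;

// Hypothetical per-shard state; in the PR the semaphore lives in kafka::server.
struct shard_fetch_memory {
    // Placeholder size: in the PR this is memory_share_for_fetch * kafka
    // shard memory (default share 0.5).
    ss::semaphore fetch_mem{512 << 20};
    // Default per-batch estimate used until the real batch size is known.
    static constexpr size_t batch_size_estimate = 1 << 20; // 1 MiB
};

// Decide whether the next partition in this shard's part of the fetch plan
// should be read: the first partition is always served so the request can
// make progress; later ones are skipped when the shard's fetch memory
// semaphore does not have enough units left for the estimated batch.
bool should_read_partition(shard_fetch_memory& mem, bool first_on_shard) {
    if (first_on_shard) {
        return true;
    }
    return mem.fetch_mem.available_units()
           >= static_cast<ssize_t>(shard_fetch_memory::batch_size_estimate);
}
```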

The overall kafka server memory semaphore is used in the same way, so fetch response sizes are also limited from the overall kafka server perspective. Here too it is the partition home shard's semaphore, not the one associated with the connection that received the fetch request, because the memory for the buffers is acquired on the partition home shard.

The fetch request memory estimator is not changed, because it only works with the memory semaphore associated with the connection shard, and at estimation time we do not yet know which shards will be affected.

Neither semaphore is waited upon; units are only consumed from them, so both can go negative as a result. When overused:

  • the fetch memory semaphore does not delay anything, but read_from_ntps will then only perform the minimal read (one batch for the shard's part of the plan)
  • the kafka server memory semaphore will do the same, and additionally it will delay incoming requests served by the shard.

In fact, the fetch memory semaphore is never waited upon and could have been a plain integer; the only reason it is a semaphore is the convenience of semaphore_units (see the sketch below).
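
A sketch of that consume-and-adjust pattern, assuming Seastar's `ss::consume_units` and `semaphore_units::return_units`; the wrapper functions are illustrative only:

```cpp
#include <seastar/core/semaphore.hh>

namespace ss = seastar;

// Reserve the estimated amount without blocking; the semaphore count may
// go negative, which only influences later "can we read more?" checks.
ss::semaphore_units<> reserve_estimate(ss::semaphore& fetch_mem, size_t estimate) {
    return ss::consume_units(fetch_mem, estimate);
}

// Once the real batch size is known, shrink the reservation if the estimate
// was too high; an under-estimate is simply left as already-consumed units.
void adjust_to_actual(ss::semaphore_units<>& units, size_t actual_bytes) {
    if (actual_bytes < units.count()) {
        units.return_units(units.count() - actual_bytes);
    }
}
```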

As a side effect, this PR adds support for floating point bounded properties.
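
As a rough illustration of the idea (a hypothetical stand-in, not the actual config::bounded_property API), a bounded floating-point property could clamp values into its range:

```cpp
#include <algorithm>

// Hypothetical stand-in for a bounded floating-point property: values
// outside [min, max] are clamped into the range instead of being
// accepted as-is.
struct bounded_double {
    double min;
    double max;
    double value;

    void set(double v) { value = std::clamp(v, min, max); }
};

// e.g. a share-of-memory setting limited to [0.0, 0.95]:
// bounded_double memory_share_for_fetch{0.0, 0.95, 0.5};
```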

The differences from the reverted #10371 are:

  • In 10371, both semaphores decremented before reading from a partition were the ones associated with the connection shard; now they are both the ones associated with the partition home shard. Besides being a logic error, in 10371 this also caused the kafka server memory semaphore to be used both in the connection shard (for request memory estimation) and in other (partition) shards. Allegedly because promises in the semaphore waitlist were created and signalled on different shards, there was memory corruption that manifested in an unrelated location.
  • In 10371 an attempt was made to estimate request memory with a custom fetch_memory_estimator. That is not done in this PR, because the memory estimator only affects the connection shard's memory semaphore, while the idea of this PR is to account memory against the memory semaphores of the shards where the memory is actually allocated.

Fixes #3409.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

Improvements

  • Memory control is improved in the Fetch request handler, covering cases when a client tries to fetch too many partitions led by the same shard. Some of the requested partitions will not be fetched if the broker does not have enough memory, or if fetching them would violate the constraints set by kafka_max_bytes_per_fetch and fetch_max_bytes.

  • If kafka_max_bytes_per_fetch is not configured properly, redpanda is now more robust against huge Fetch requests by controlling memory consumption dynamically.

@dlex dlex force-pushed the 3409/oom-fetching-many-partitions-3 branch 2 times, most recently from 1d26c57 to 484db77 Compare May 23, 2023 04:51
@dlex dlex marked this pull request as ready for review May 23, 2023 17:37
@dlex (Contributor, Author) commented May 23, 2023

/ci-repeat 3

@travisdowns (Member) commented

@dlex - I couldn't quite determine how this works from the description in the PR. Can you lengthen it a bit?

What isn't clear to me is exactly the relationship between the new and existing (kafka RPC) semaphores on all shards involved in a request: the "connection" shard where the planner is running, and then the shards targeted in the fetch. Which semaphores are decremented, when, and by how much?

Maybe a small worked example would help me if you're willing to provide it.

@dlex dlex self-assigned this May 25, 2023
@dlex dlex force-pushed the 3409/oom-fetching-many-partitions-3 branch from 484db77 to acb23eb Compare May 25, 2023 19:15
@dlex (Contributor, Author) commented May 25, 2023

Force push: rebased upstream

@dlex dlex force-pushed the 3409/oom-fetching-many-partitions-3 branch from acb23eb to 6fbfe0a Compare May 26, 2023 04:49
@dlex (Contributor, Author) commented May 26, 2023

Force push: linter errors

@jcsp (Contributor) commented May 26, 2023

@dlex could you summarize the changes between the previous reverted change and this one? I'm curious what the issue was.

@dlex (Contributor, Author) commented May 26, 2023

> @dlex could you summarize the changes between the previous reverted change and this one? I'm curious what the issue was.

@jcsp I've added the summary to the PR description.

@dlex dlex force-pushed the 3409/oom-fetching-many-partitions-3 branch from 6fbfe0a to 4021c51 Compare May 26, 2023 21:54
@dlex (Contributor, Author) commented May 26, 2023

Force push: rebased upstream

@dlex dlex force-pushed the 3409/oom-fetching-many-partitions-3 branch from 4021c51 to bd02089 Compare May 26, 2023 21:58
@dlex (Contributor, Author) commented May 26, 2023

Force push: some comments revisited

@dlex dlex force-pushed the 3409/oom-fetching-many-partitions-3 branch from bd02089 to 93af53b Compare May 28, 2023 18:59
@dlex (Contributor, Author) commented May 28, 2023

Force push: test improvements

  • raised the upper limit for memory_share_for_fetch to 0.95; it likely failed for values above 0.8 before because of the logic error
  • unit test: now memory units returned from both memory semaphores are the same

@dlex dlex force-pushed the 3409/oom-fetching-many-partitions-3 branch from 93af53b to 9be0d3b Compare May 28, 2023 21:18
dlex added 3 commits June 1, 2023 18:07
Kafka server now stores a per-shard memory semaphore that limits
memory usage by the fetch request handler. The semaphore count is
configured based on the "kafka_memory_share_for_fetch" property and
the kafka rpc service memory size.

The metric `vectorized_kafka_rpc_fetch_avail_mem_bytes` is added to
monitor the semaphore level.

There is a sharded `server` accessor in `request_context` to reach
the local shard instance of the new semaphore, as well as the local
instance of `net::memory` semaphore.
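
A sketch of how such a per-shard limit and gauge could be wired together with Seastar's semaphore and metrics APIs; the struct and parameter names are illustrative, not the code in this PR:

```cpp
#include <seastar/core/metrics.hh>
#include <seastar/core/semaphore.hh>

namespace ss = seastar;
namespace sm = seastar::metrics;

// Illustrative per-shard wiring: the semaphore is sized as a fraction of
// the kafka rpc service memory, and a gauge exposes its current level.
struct fetch_memory_limits {
    ss::semaphore fetch_mem;
    sm::metric_groups metrics;

    fetch_memory_limits(size_t kafka_service_memory, double memory_share_for_fetch)
      : fetch_mem(static_cast<size_t>(kafka_service_memory * memory_share_for_fetch)) {
        metrics.add_group(
          "kafka_rpc",
          {sm::make_gauge(
            "fetch_avail_mem_bytes",
            [this] { return fetch_mem.available_units(); },
            sm::description("Memory available for fetch on this shard"))});
    }
};
```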
Access `replica_selector` via the newly exposed `sharded<server>`
to reach the local shard instance of `kafka::server` and its
replica_selector. This prevents cross-shard access to `metadata_cache`
and to other objects that may be added as replica selectors evolve.
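
The pattern, roughly (only `ss::sharded<T>::local()` is a real Seastar call here; the accessor is hypothetical):

```cpp
#include <seastar/core/sharded.hh>

namespace ss = seastar;

// Illustrative only: with a sharded service, `.local()` picks the
// instance that lives on the calling shard. Reading members of another
// shard's instance (e.g. via a reference captured on the connection
// shard) would be a cross-shard access of the kind this commit removes.
template<typename Service>
Service& local_instance(ss::sharded<Service>& service) {
    return service.local();
}
```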
@dlex dlex force-pushed the 3409/oom-fetching-many-partitions-3 branch from d81dad6 to d17be67 Compare June 1, 2023 22:28
@dlex (Contributor, Author) commented Jun 1, 2023

Force push:

  • fixed: the semaphores were still taken from the connection shard
  • fixed: replica_selector instance used in fetch_from_ntp was from the connection shard
  • fixed: server_configuration was only initialized in one shard in the test fixture

@dlex dlex requested a review from dotnwat June 1, 2023 22:32
dlex added 4 commits June 2, 2023 00:17
Consult the memory semaphores on whether there is enough memory available
to perform the fetch while concurrently fetching from ntps. Both the
general kafka memory semaphore and the new kafka fetch memory semaphore
are used. For the former, the amount already consumed from it by the
request memory estimator is taken into account.

Since the batch size is not known ahead of time, it is estimated at
1 MiB. The first partition in the list is fetched regardless of the
semaphore values, to satisfy the requirement that at least a single
partition from the fetch request must advance.

The amount of units held is adjusted to the actual size used as soon as
it is known.

The acquired units of the memory semaphores are held with `read_result`
until it is destroyed at the end of fetch request processing. When
`read_result` is destroyed on the connection shard, the semaphore units
are returned on the shard where they were acquired.

If the request's max_size is more bytes than either semaphore holds,
max_size is reduced to the memory actually available, also considering
the minimum batch size.
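
A sketch of returning units on their originating shard, assuming Seastar's `ss::smp::submit_to` and `semaphore_units::return_all`; the holder type is illustrative, not the actual `read_result`:

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/smp.hh>
#include <utility>

namespace ss = seastar;

// Illustrative holder: units acquired on a partition's home shard travel
// to the connection shard inside the fetch results, and are handed back
// to the home shard's semaphore when the results are released.
struct foreign_memory_units {
    ss::shard_id home_shard;
    ss::semaphore_units<> units;

    ss::future<> release() {
        return ss::smp::submit_to(home_shard, [u = std::move(units)]() mutable {
            // Signal the semaphore on the shard that owns it.
            u.return_all();
        });
    }
};
```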
In kafka_server_rpfixture, an extra `kafka::server` is created using
a barely initialized `server_configuration` instance. Garbage in
`max_service_memory_per_core` now causes issues because of the new
arithmetic done with it in the kafka::server ctor.
Test the algorithm that decides whether a fetch request can proceed
for an ntp based on the resources available.

Move the existing testing-only symbols into the `testing` ns.
RAM increased to 512M because redpanda was failing on 256M for unrelated
reasons.

Test with different values for "kafka_memory_share_for_fetch".
@dlex dlex force-pushed the 3409/oom-fetching-many-partitions-3 branch from d17be67 to 06a38b9 Compare June 2, 2023 04:20
@dlex (Contributor, Author) commented Jun 2, 2023

Force push: release semaphore units in the shard where they have been acquired.

@dlex dlex requested a review from dotnwat June 2, 2023 04:21
@dlex dlex added the doc-needed label Jun 2, 2023
@dlex (Contributor, Author) commented Jun 2, 2023

/ci-repeat 10
skip-unit

@dotnwat dotnwat merged commit dd85155 into redpanda-data:dev Jun 7, 2023
@vbotbuildovich (Collaborator) commented

/backport v23.1.x

@vbotbuildovich (Collaborator) commented

/backport v22.3.x

@vbotbuildovich (Collaborator) commented

Failed to run cherry-pick command. I executed the commands below:

git checkout -b backport-pr-10905-v22.3.x-170 remotes/upstream/v22.3.x
git cherry-pick -x 1db2edcd91fd7e692d069b1c059a03c353a8c689 ee5e069e3623dd655688df2d2db75c59a11b63f5 eb8a915a9497e6594ba712b0de58109d176c68bd d891efe7fc3dbc0ef27b17a53946336e17759296 58a9de0f0e32537f879404f3a4633fd5e94ed879 7b3860153a544aedd97a94a4d221ae034cf0439d d3d9276237a4c6f4ff3246c85ad3af40fc1e522e c1d77cd216d78e935586fac16274671329b70d97 2474ef67ed1e078cab1011e8835a47e71cede509 623e6136b35c4331a0a39aed1d9508edb8a7f5e0 950abc7d8ddaff656cfd035a061a999479b0a09e 06a38b97cfde550dcde20f6989ebc150750923e7

Workflow run logs.

@vbotbuildovich (Collaborator) commented

Failed to run cherry-pick command. I executed the commands below:

git checkout -b backport-pr-10905-v23.1.x-279 remotes/upstream/v23.1.x
git cherry-pick -x 1db2edcd91fd7e692d069b1c059a03c353a8c689 ee5e069e3623dd655688df2d2db75c59a11b63f5 eb8a915a9497e6594ba712b0de58109d176c68bd d891efe7fc3dbc0ef27b17a53946336e17759296 58a9de0f0e32537f879404f3a4633fd5e94ed879 7b3860153a544aedd97a94a4d221ae034cf0439d d3d9276237a4c6f4ff3246c85ad3af40fc1e522e c1d77cd216d78e935586fac16274671329b70d97 2474ef67ed1e078cab1011e8835a47e71cede509 623e6136b35c4331a0a39aed1d9508edb8a7f5e0 950abc7d8ddaff656cfd035a061a999479b0a09e 06a38b97cfde550dcde20f6989ebc150750923e7

Workflow run logs.

Successfully merging this pull request may close these issues:

  • kafka: fetch requests across large partition counts can use unbounded memory (fetch.max.bytes)