
Limit memory used while fetching many partitions #10905

Merged: 12 commits merged into redpanda-data:dev on Jun 7, 2023

Conversation

@dlex (Contributor) commented May 20, 2023

A semaphore is added to kafka::server to control the memory used for the buffers in the fetch path. The semaphore is set to a configurable fraction of kafka shard memory (default 0.5). As the fetch work is delegated to different shards, the first partition requested from a shard is always served; other partitions may be skipped if the memory semaphore on that shard does not have enough units available. Memory is reserved based on an estimate of 1 MiB per batch (the default, configurable), and the reservation is adjusted as soon as the real size is known. The semaphore is local to the partition home shard.
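
For illustration, a minimal sketch of that per-shard decision, assuming a Seastar-style semaphore; the type and function names here (`shard_fetch_memory`, `should_read_partition`) are placeholders rather than the actual symbols in this PR:

```cpp
#include <seastar/core/semaphore.hh>

namespace ss = seastar;

// Hypothetical per-shard state; in the PR the semaphore lives in kafka::server.
struct shard_fetch_memory {
    // Placeholder size: in the PR this is memory_share_for_fetch * kafka
    // shard memory (default share 0.5).
    ss::semaphore fetch_mem{512 << 20};
    // Default per-batch estimate used until the real batch size is known.
    static constexpr size_t batch_size_estimate = 1 << 20; // 1 MiB
};

// Decide whether the next partition in this shard's part of the fetch plan
// should be read: the first partition is always served so the request can
// make progress; later ones are skipped when the shard's fetch memory
// semaphore does not have enough units left for the estimated batch.
bool should_read_partition(shard_fetch_memory& mem, bool first_on_shard) {
    if (first_on_shard) {
        return true;
    }
    return mem.fetch_mem.available_units()
           >= static_cast<ssize_t>(shard_fetch_memory::batch_size_estimate);
}
```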

The overall kafka server memory semaphore is used in the same way, so fetch response sizes are also limited from the overall kafka server perspective. Here too it is the partition home shard's semaphore, not the one associated with the connection that received the fetch request, because the memory for the buffers is acquired on the partition home shard.

The fetch request memory estimator is not changed, because it only works with the memory semaphore associated with the connection shard, and at estimation time we do not yet know which shards will be affected.

Neither semaphore is waited upon; units are only consumed from them, so both can go negative as a result. When overused:

  • the fetch memory semaphore does not delay anything, but read_from_ntps will then only perform the minimal read (one batch for the shard's part of the plan)
  • the kafka server memory semaphore will do the same, and additionally it will delay incoming requests served by the shard.

In fact, the fetch memory semaphore is never waited upon and could have been a plain integer; the only reason it is a semaphore is the convenience of semaphore_units (see the sketch below).
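
A sketch of that consume-and-adjust pattern, assuming Seastar's `ss::consume_units` and `semaphore_units::return_units`; the wrapper functions are illustrative only:

```cpp
#include <seastar/core/semaphore.hh>

namespace ss = seastar;

// Reserve the estimated amount without blocking; the semaphore count may
// go negative, which only influences later "can we read more?" checks.
ss::semaphore_units<> reserve_estimate(ss::semaphore& fetch_mem, size_t estimate) {
    return ss::consume_units(fetch_mem, estimate);
}

// Once the real batch size is known, shrink the reservation if the estimate
// was too high; an under-estimate is simply left as already-consumed units.
void adjust_to_actual(ss::semaphore_units<>& units, size_t actual_bytes) {
    if (actual_bytes < units.count()) {
        units.return_units(units.count() - actual_bytes);
    }
}
```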

As a side effect, this PR adds support for floating point bounded properties.
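
As a rough illustration of the idea (a hypothetical stand-in, not the actual config::bounded_property API), a bounded floating-point property could clamp values into its range:

```cpp
#include <algorithm>

// Hypothetical stand-in for a bounded floating-point property: values
// outside [min, max] are clamped into the range instead of being
// accepted as-is.
struct bounded_double {
    double min;
    double max;
    double value;

    void set(double v) { value = std::clamp(v, min, max); }
};

// e.g. a share-of-memory setting limited to [0.0, 0.95]:
// bounded_double memory_share_for_fetch{0.0, 0.95, 0.5};
```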

The differences from the reverted #10371 are:

  • In 10371, both semaphores decremented before reading from a partition were the ones associated with the connection shard; now they are both the ones associated with the partition home shard. Besides being a logic error, in 10371 this also caused the kafka server memory semaphore to be used both in the connection shard (for request memory estimation) and in other (partition) shards. Allegedly because promises in the semaphore waitlist were created and signalled on different shards, there was memory corruption that manifested in an unrelated location.
  • In 10371 an attempt was made to estimate request memory with a custom fetch_memory_estimator. That is not done in this PR, because the memory estimator only affects the connection shard's memory semaphore, while the idea of this PR is to account memory against the memory semaphores of the shards where the memory is actually allocated.

Fixes #3409.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

Improvements

  • Memory control is improved in the Fetch request handler, covering cases when a client tries to fetch too many partitions led by the same shard. Some of the requested partitions will not be fetched if the broker does not have enough memory, or if fetching them would violate the constraints set by kafka_max_bytes_per_fetch and fetch_max_bytes.

  • If kafka_max_bytes_per_fetch is not configured properly, redpanda is now more robust against huge Fetch requests by controlling memory consumption dynamically.

@dlex dlex force-pushed the 3409/oom-fetching-many-partitions-3 branch 2 times, most recently from 1d26c57 to 484db77 Compare May 23, 2023 04:51
@dlex dlex marked this pull request as ready for review May 23, 2023 17:37
@dlex (Contributor, Author) commented May 23, 2023

/ci-repeat 3

@travisdowns (Member) commented

@dlex - I couldn't quite determine how this works from the description in the PR. Can you lengthen it a bit?

What isn't clear to me is exactly the relationship between the new and existing (kafka RPC) semaphores on all shards involved in a request: the "connection" shard where the planner is running, and then the shards targeted in the fetch. Which semaphores are decremented, when, and by how much?

Maybe a small worked example would help me if you're willing to provide it.

@dlex dlex self-assigned this May 25, 2023
@dlex dlex force-pushed the 3409/oom-fetching-many-partitions-3 branch from 484db77 to acb23eb Compare May 25, 2023 19:15
@dlex (Contributor, Author) commented May 25, 2023

Force push: rebased upstream

@dlex dlex force-pushed the 3409/oom-fetching-many-partitions-3 branch from acb23eb to 6fbfe0a Compare May 26, 2023 04:49
@dlex (Contributor, Author) commented May 26, 2023

Force push: linter errors

@jcsp (Contributor) commented May 26, 2023

@dlex could you summarize the changes between the previous reverted change and this one? I'm curious what the issue was.

@dlex (Contributor, Author) commented May 26, 2023

> @dlex could you summarize the changes between the previous reverted change and this one? I'm curious what the issue was.

@jcsp I've added the summary to the PR description.

@dlex dlex force-pushed the 3409/oom-fetching-many-partitions-3 branch from 6fbfe0a to 4021c51 Compare May 26, 2023 21:54
@dlex (Contributor, Author) commented May 26, 2023

Force push: rebased upstream

@dlex dlex force-pushed the 3409/oom-fetching-many-partitions-3 branch from 4021c51 to bd02089 Compare May 26, 2023 21:58
@dlex (Contributor, Author) commented May 26, 2023

Force push: some comments revisited

@dlex dlex force-pushed the 3409/oom-fetching-many-partitions-3 branch from bd02089 to 93af53b Compare May 28, 2023 18:59
@dlex (Contributor, Author) commented May 28, 2023

Force push: test improvements

  • raised the upper limit for memory_share_for_fetch to 0.95; it likely failed for values above 0.8 before because of the logic error
  • unit test: now memory units returned from both memory semaphores are the same

@dlex dlex force-pushed the 3409/oom-fetching-many-partitions-3 branch from 93af53b to 9be0d3b Compare May 28, 2023 21:18
dlex added 3 commits June 1, 2023 18:07
Kafka server now stores a per-shard memory semaphore that limits
memory usage by the fetch request handler. The semaphore count is
configured based on the "kafka_memory_share_for_fetch" property and
the kafka rpc service memory size.

The metric `vectorized_kafka_rpc_fetch_avail_mem_bytes` is added to
monitor the semaphore level.

There is a sharded `server` accessor in `request_context` to reach
the local shard instance of the new semaphore, as well as the local
instance of `net::memory` semaphore.
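
A sketch of how such a per-shard limit and gauge could be wired together with Seastar's semaphore and metrics APIs; the struct and parameter names are illustrative, not the code in this PR:

```cpp
#include <seastar/core/metrics.hh>
#include <seastar/core/semaphore.hh>

namespace ss = seastar;
namespace sm = seastar::metrics;

// Illustrative per-shard wiring: the semaphore is sized as a fraction of
// the kafka rpc service memory, and a gauge exposes its current level.
struct fetch_memory_limits {
    ss::semaphore fetch_mem;
    sm::metric_groups metrics;

    fetch_memory_limits(size_t kafka_service_memory, double memory_share_for_fetch)
      : fetch_mem(static_cast<size_t>(kafka_service_memory * memory_share_for_fetch)) {
        metrics.add_group(
          "kafka_rpc",
          {sm::make_gauge(
            "fetch_avail_mem_bytes",
            [this] { return fetch_mem.available_units(); },
            sm::description("Memory available for fetch on this shard"))});
    }
};
```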
Access `replica_selector` via the newly exposed `sharded<server>`
to reach the local shard instance of `kafka::server` and its
replica_selector. This prevents cross-shard access to `metadata_cache`
and to other objects that may be added as replica selectors evolve.
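
The pattern, roughly (only `ss::sharded<T>::local()` is a real Seastar call here; the accessor is hypothetical):

```cpp
#include <seastar/core/sharded.hh>

namespace ss = seastar;

// Illustrative only: with a sharded service, `.local()` picks the
// instance that lives on the calling shard. Reading members of another
// shard's instance (e.g. via a reference captured on the connection
// shard) would be a cross-shard access of the kind this commit removes.
template<typename Service>
Service& local_instance(ss::sharded<Service>& service) {
    return service.local();
}
```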
@dlex dlex force-pushed the 3409/oom-fetching-many-partitions-3 branch from d81dad6 to d17be67 Compare June 1, 2023 22:28
@dlex (Contributor, Author) commented Jun 1, 2023

Force push:

  • fixed: the semaphores were still taken from the connection shard
  • fixed: replica_selector instance used in fetch_from_ntp was from the connection shard
  • fixed: server_configuration was only initialized in one shard in the test fixture

@dlex dlex requested a review from dotnwat June 1, 2023 22:32
dlex added 4 commits June 2, 2023 00:17
Consult the memory semaphores on whether there is enough memory available
to perform the fetch while concurrently fetching from ntps. Both the
general kafka memory semaphore and the new kafka fetch memory semaphore
are used. For the former, the amount already consumed from it by the
request memory estimator is taken into account.

Since the batch size is not known ahead of time, it is estimated at
1 MiB. The first partition in the list is fetched regardless of the
semaphore values, to satisfy the requirement that at least a single
partition from the fetch request must advance.

The amount of units held is adjusted to the actual size used as soon as
it is known.

The acquired units of the memory semaphores are held with `read_result`
until it is destroyed at the end of fetch request processing. When
`read_result` is destroyed on the connection shard, the semaphore units
are returned on the shard where they were acquired.

If the request's max_size is more bytes than either semaphore holds,
max_size is reduced to the memory actually available, also considering
the minimum batch size.
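
A sketch of returning units on their originating shard, assuming Seastar's `ss::smp::submit_to` and `semaphore_units::return_all`; the holder type is illustrative, not the actual `read_result`:

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/smp.hh>
#include <utility>

namespace ss = seastar;

// Illustrative holder: units acquired on a partition's home shard travel
// to the connection shard inside the fetch results, and are handed back
// to the home shard's semaphore when the results are released.
struct foreign_memory_units {
    ss::shard_id home_shard;
    ss::semaphore_units<> units;

    ss::future<> release() {
        return ss::smp::submit_to(home_shard, [u = std::move(units)]() mutable {
            // Signal the semaphore on the shard that owns it.
            u.return_all();
        });
    }
};
```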
In kafka_server_rpfixture, an extra `kafka::server` is created using
a barely initialized `server_configuration` instance. Garbage in
`max_service_memory_per_core` now causes issues because of the new
arithmetic done with it in the kafka::server ctor.
Test the algorithm that decides whether a fetch request can proceed
for an ntp based on the resources available.

Move the existing testing-only symbols into the `testing` ns.
RAM increased to 512M because redpanda was failing on 256M for unrelated
reasons.

Test with different values for "kafka_memory_share_for_fetch".
@dlex dlex force-pushed the 3409/oom-fetching-many-partitions-3 branch from d17be67 to 06a38b9 Compare June 2, 2023 04:20
@dlex (Contributor, Author) commented Jun 2, 2023

Force push: release semaphore units in the shard where they have been acquired.

@dlex dlex requested a review from dotnwat June 2, 2023 04:21
@dlex dlex added the doc-needed label Jun 2, 2023
@dlex (Contributor, Author) commented Jun 2, 2023

/ci-repeat 10
skip-unit

@dotnwat dotnwat merged commit dd85155 into redpanda-data:dev Jun 7, 2023
@vbotbuildovich (Collaborator) commented

/backport v23.1.x

@vbotbuildovich (Collaborator) commented

/backport v22.3.x

@vbotbuildovich (Collaborator) commented

Failed to run cherry-pick command. I executed the commands below:

git checkout -b backport-pr-10905-v22.3.x-170 remotes/upstream/v22.3.x
git cherry-pick -x 1db2edcd91fd7e692d069b1c059a03c353a8c689 ee5e069e3623dd655688df2d2db75c59a11b63f5 eb8a915a9497e6594ba712b0de58109d176c68bd d891efe7fc3dbc0ef27b17a53946336e17759296 58a9de0f0e32537f879404f3a4633fd5e94ed879 7b3860153a544aedd97a94a4d221ae034cf0439d d3d9276237a4c6f4ff3246c85ad3af40fc1e522e c1d77cd216d78e935586fac16274671329b70d97 2474ef67ed1e078cab1011e8835a47e71cede509 623e6136b35c4331a0a39aed1d9508edb8a7f5e0 950abc7d8ddaff656cfd035a061a999479b0a09e 06a38b97cfde550dcde20f6989ebc150750923e7

Workflow run logs.

@vbotbuildovich (Collaborator) commented

Failed to run cherry-pick command. I executed the commands below:

git checkout -b backport-pr-10905-v23.1.x-279 remotes/upstream/v23.1.x
git cherry-pick -x 1db2edcd91fd7e692d069b1c059a03c353a8c689 ee5e069e3623dd655688df2d2db75c59a11b63f5 eb8a915a9497e6594ba712b0de58109d176c68bd d891efe7fc3dbc0ef27b17a53946336e17759296 58a9de0f0e32537f879404f3a4633fd5e94ed879 7b3860153a544aedd97a94a4d221ae034cf0439d d3d9276237a4c6f4ff3246c85ad3af40fc1e522e c1d77cd216d78e935586fac16274671329b70d97 2474ef67ed1e078cab1011e8835a47e71cede509 623e6136b35c4331a0a39aed1d9508edb8a7f5e0 950abc7d8ddaff656cfd035a061a999479b0a09e 06a38b97cfde550dcde20f6989ebc150750923e7

Workflow run logs.

Successfully merging this pull request may close these issues:

  • kafka: fetch requests across large partition counts can use unbounded memory (fetch.max.bytes)