CI Failure (NodeCrash: OOM in continuous_batch_parser) in ShadowIndexingManyPartitionsTest.test_many_partitions_shutdown #9375
An apparently similar OOM occurred on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/24716#0186c758-39eb-4b1c-80c3-bc06927ba36d.
This is the one we discussed here: https://redpandadata.slack.com/archives/C04ASNM2ZLK/p1678306064561969
I spent some time trying to add memory tracking to the segment readers to confirm where the consumed memory is coming from, but I've run into a wall with the plumbing. Instead, some analysis from the logs:
Multiple OOMs have been reported in this test, each reporting a large number of allocated 128K spans. Judging from the failed allocations, the cause appears to be failures to allocate buffers on the read path, whose sizes are tuned by storage configs. This commit updates those configs to reduce the amount of allocation in this test, given that the point of the test is to stress the number of segment readers, not memory consumption or performance under load. Fixes redpanda-data#9375
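A sketch of the kind of override that commit describes, assuming Redpanda's `storage_read_buffer_size` and `storage_read_readahead_count` cluster properties are the relevant knobs; the values here are illustrative, not the ones from the actual commit:

```python
# Shrink the per-reader read buffers so many concurrent segment readers
# fit in memory. Property names assume Redpanda's storage read configs;
# values are illustrative, not taken from the commit.
extra_rp_conf = {
    # per-reader read buffer; the default is 128 KiB, matching the
    # 128K spans seen in the OOM reports
    "storage_read_buffer_size": 32 * 1024,
    # how many of those buffers each reader keeps in flight
    "storage_read_readahead_count": 1,
}
# In a ducktape test this would typically be passed to the RedpandaService
# via its extra_rp_conf argument.
```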
Looking at an instance of this in https://buildkite.com/redpanda/redpanda/builds/26497.
This will crash on demand with the allocation failure if you run a producer at the same time as the consumer.
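The exact commands used aren't shown above; a minimal sketch of that kind of concurrent load, with a placeholder broker address and topic name, using confluent-kafka:

```python
# Run a producer and a consumer against the same topic concurrently.
# BROKERS and TOPIC are placeholders, not values from the test.
import threading
from confluent_kafka import Producer, Consumer

BROKERS = "localhost:9092"  # placeholder
TOPIC = "test-topic"        # placeholder

def produce_forever():
    p = Producer({"bootstrap.servers": BROKERS})
    payload = b"x" * 1024
    while True:
        p.produce(TOPIC, payload)
        p.poll(0)  # serve delivery callbacks

def consume_forever():
    c = Consumer({
        "bootstrap.servers": BROKERS,
        "group.id": "oom-repro",
        "auto.offset.reset": "earliest",
    })
    c.subscribe([TOPIC])
    while True:
        c.poll(1.0)  # fetches drive reads through the storage layer

threading.Thread(target=produce_forever, daemon=True).start()
consume_forever()
```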
This saves up to 24 KiB of memory per index when a segment is closed with very few entries in its index. Related: redpanda-data#9375
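A hypothetical illustration of the idea behind that change: reserve a fixed index buffer up front, then shrink it to the entries actually written when the segment closes. Names and sizes are illustrative, not Redpanda's actual code:

```python
# Toy model of an index that preallocates a fixed buffer and trims it
# on close, so a 1-2 entry index no longer pins 24 KiB.
class SegmentIndex:
    ENTRY_SIZE = 16         # illustrative entry width
    PREALLOC = 24 * 1024    # up to 24 KiB reserved up front

    def __init__(self):
        self.buf = bytearray(self.PREALLOC)
        self.used = 0

    def append(self, entry: bytes):
        self.buf[self.used:self.used + self.ENTRY_SIZE] = entry
        self.used += self.ENTRY_SIZE

    def close(self):
        # release the unused tail of the preallocated buffer
        del self.buf[self.used:]

idx = SegmentIndex()
idx.append(b"\x00" * SegmentIndex.ENTRY_SIZE)
idx.close()
assert len(idx.buf) == SegmentIndex.ENTRY_SIZE  # 16 bytes, not 24 KiB
```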
One thing very apparent in a heap profile of this was excess memory consumption from segment indices: this test creates a huge number of tiny segments, which were all allocating up to 24 KiB for their indices in spite of only having 1-2 batches. I've opened #9904 to fix that, but I can still make the reproducer OOM pretty easily. There is also a weird part of the profile that implicates a circular buffer.

Experimenting with stricter enforcement of reader limits, I notice that when the underlying tiered storage reads are throttled, the client ends up sending overlapping fetch requests: this might be part of what triggers resource exhaustion. This is tangentially related to #3409, although in this instance the kafka fetch code itself doesn't appear to be misbehaving; it's more that the underlying resource cost of servicing a fixed number of bytes is much higher when that happens to touch a lot of micro segments.
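A back-of-envelope on why those tiny indices matter, with a hypothetical segment count:

```python
# Rough estimate of index overhead; the segment count is hypothetical,
# not measured from the test.
per_index = 24 * 1024      # up to 24 KiB preallocated per segment index
tiny_segments = 50_000     # hypothetical number of 1-2 batch segments
print(f"{per_index * tiny_segments / 2**30:.2f} GiB")  # ~1.14 GiB of waste
```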
This saves up to 24 KiB of memory per index when a segment is closed with very few entries in its index. Related: redpanda-data#9375 (cherry picked from commit f5e6b57)
Closing old issues that have not occurred in 2 months.
Also seen on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/24700#0186c665-8820-43de-be97-c81f81d91d56 on node docker-rp-9.