Limit number of remote segment readers allocated (bad_allocs in ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy with 10 readers, 1 writer, 1GB ram per core) #6111
jcsp added the kind/bug (Something isn't working) and area/cloud-storage (Shadow indexing subsystem) labels on Aug 19, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue on Aug 19, 2022:

Previously, we only evicted stale segments, not readers. So if the segments remained materialized, they could accumulate ever-larger numbers of readers, resulting in out-of-memory conditions. After this change, materialized segments are only allowed to have one reader in their `readers` list after a call into borrow_reader(); the net result is that a segment can have up to two readers cached. Fixes redpanda-data#6111
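A minimal sketch of that invariant, assuming a hypothetical `materialized_segment` type with a `readers` cache (the real remote_segment code differs, but the idea is the same: after `borrow_reader()` at most one cached reader remains, so a segment is associated with at most two readers at a time):

```cpp
// Hypothetical sketch, not the actual Redpanda code.
#include <list>
#include <memory>

struct remote_segment_reader {}; // stand-in for the real reader type

struct materialized_segment {
    std::list<std::unique_ptr<remote_segment_reader>> readers;

    std::unique_ptr<remote_segment_reader> borrow_reader() {
        std::unique_ptr<remote_segment_reader> borrowed;
        if (!readers.empty()) {
            // Reuse a cached reader if one is available.
            borrowed = std::move(readers.front());
            readers.pop_front();
        } else {
            // Otherwise create a fresh one.
            borrowed = std::make_unique<remote_segment_reader>();
        }
        // Evict excess cached readers: at most one may remain after a borrow.
        while (readers.size() > 1) {
            readers.pop_back();
        }
        return borrowed;
    }

    void return_reader(std::unique_ptr<remote_segment_reader> r) {
        readers.push_back(std::move(r));
    }
};
```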
jcsp added a commit to jcsp/redpanda that referenced this issue on Aug 19, 2022:

This test will sometimes hit bad_alloc in docker when using the original parallelism. This is a redpanda bug, as the parallelism wasn't terribly high. It will be fixed separately, but this commit stabilizes the test in the meantime. Related: redpanda-data#6111
jcsp added a commit to jcsp/redpanda that referenced this issue on Aug 22, 2022:

This test will sometimes hit bad_alloc in docker when using the original parallelism. This is a redpanda bug, as the parallelism wasn't terribly high. It will be fixed separately, but this commit stabilizes the test in the meantime. Related: redpanda-data#6111
pvsune pushed a commit that referenced this issue on Aug 24, 2022:

This test will sometimes hit bad_alloc in docker when using the original parallelism. This is a redpanda bug, as the parallelism wasn't terribly high. It will be fixed separately, but this commit stabilizes the test in the meantime. Related: #6111
jcsp added a commit to jcsp/redpanda that referenced this issue on Sep 13, 2022:

Previously, if we were instantiating many readers on many materialized segments, we were vulnerable to instantiating an unbounded number of readers:
- Excess readers on materialized segments were only GC'd when we hydrated another segment. If readers were hitting already-hydrated segments, we would never trim the per-segment cache of readers.
- In-use readers (i.e. those not stashed in a segment's `readers` list) were not tracked anywhere, and there was no limit on how many might be created.

This change does not apply any backpressure, but it triggers proactive dropping of readers when a partition's reader count exceeds the capacity of a semaphore. Fixes redpanda-data#6111
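A rough sketch of the bounding idea, using a hypothetical `reader_budget` class rather than the actual semaphore in the partition code: reader creation is counted against a fixed capacity, and exceeding it signals the owner to proactively drop cached readers, without blocking new ones.

```cpp
// Hypothetical sketch, not the actual Redpanda code: track live readers per
// partition against a fixed budget; no backpressure, only proactive eviction.
#include <cstddef>

class reader_budget {
public:
    explicit reader_budget(size_t capacity) : _capacity(capacity) {}

    // Called whenever a reader is created. Returns true if the caller should
    // trim its cached readers because the live count now exceeds capacity.
    bool on_reader_created() {
        ++_live_readers;
        return _live_readers > _capacity;
    }

    void on_reader_destroyed() {
        if (_live_readers > 0) {
            --_live_readers;
        }
    }

private:
    size_t _capacity;
    size_t _live_readers{0};
};
```

In this sketch, a partition-level owner would call `on_reader_created()` from its reader factory and, when it returns true, walk its materialized segments evicting cached readers, oldest first.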
jcsp added a commit to jcsp/redpanda that referenced this issue on Sep 14, 2022 (same commit message as above).
mmaslankaprv pushed a commit to mmaslankaprv/redpanda that referenced this issue on Sep 19, 2022:

This test will sometimes hit bad_alloc in docker when using the original parallelism. This is a redpanda bug, as the parallelism wasn't terribly high. It will be fixed separately, but this commit stabilizes the test in the meantime. Related: redpanda-data#6111 (cherry picked from commit 5a1273c)
jcsp added a commit to jcsp/redpanda that referenced this issue on Sep 23, 2022 (same commit message as the Sep 13 commit).
jcsp added a commit to jcsp/redpanda that referenced this issue on Oct 18, 2022:

test_write_with_node_failures was disabled for a ticket that was already fixed. The disablement was unnecessary, because the test body had already been tweaked to work around redpanda-data#6111 by using a smaller reader count until we fix the code to limit concurrent readers. Related: redpanda-data#6111
jcsp changed the title from "bad_allocs in ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy with 10 readers, 1 writer, 1GB ram per core" to "Limit number of remote segment readers allocated (bad_allocs in ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy with 10 readers, 1 writer, 1GB ram per core)" on Oct 31, 2022
This issue was seen while updating kgo-verifier. The new version was a bit more efficient in how it looped readers, which probably explains why it was hitting redpanda slightly harder: this destabilized the ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy test, which was writing 24GB of data via a single producer while concurrently reading it via 10 random-access readers.

This was hitting bad_allocs in docker, where redpanda runs with 2 threads and 2GB RAM. It is a low-resource environment, but 1GB of RAM per core really should be enough to service 10 readers.

The allocator dump shows 850MB of memory in 128kB extents.

I think it may be caused by the lack of a bound on the number of readers on materialized segments.
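For scale, a quick back-of-envelope on the allocator dump figures (illustrative only; it assumes each 128kB extent is one reader-owned buffer):

```cpp
// Hypothetical illustration of the allocator-dump arithmetic above.
#include <cstdio>

int main() {
    constexpr double total_bytes = 850.0 * 1024 * 1024; // ~850MB reported
    constexpr double extent_bytes = 128.0 * 1024;       // 128kB extents
    double extents = total_bytes / extent_bytes;         // ~6800 extents
    std::printf("~%.0f extents of 128kB\n", extents);
    // Thousands of such buffers is far more than 10 client readers would need
    // if readers were bounded, which is consistent with the hypothesis that
    // readers accumulate on materialized segments.
    return 0;
}
```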