
Limit number of remote segment readers allocated (bad_allocs in ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy with 10 readers, 1 writer, 1GB ram per core) #6111

Closed
jcsp opened this issue Aug 19, 2022 · 0 comments · Fixed by #7042
Labels: area/cloud-storage (Shadow indexing subsystem), kind/bug (Something isn't working)

Comments

jcsp (Contributor) commented Aug 19, 2022

This issue was seen while updating kgo-verifier. The new version was a bit more efficient in how it looped readers, which probably explains why it was hitting redpanda slightly harder: this destabilized the ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy test, which was writing 24GB of data via a single producer while concurrently reading it via 10 random-access readers.

This was hitting bad_allocs in docker, where redpanda runs with 2 threads and 2GB RAM. It's a low-resource environment, but 1GB of RAM really should be enough to service 10 readers.

The allocator dump shows 850MB of memory in 128KiB extents.

I think it may be caused by the lack of any bound on the number of readers held by materialized segments.
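
For scale: 850MB split into 128KiB extents is roughly 6,800 live buffers, far more than 10 concurrent readers should ever need, which points at readers being retained rather than a one-off spike. Below is a minimal C++ sketch of the suspected failure mode; the types (`materialized_segment`, `remote_segment_reader`) and the offset-matching reuse logic are hypothetical stand-ins, not the actual redpanda code:

```cpp
// Hypothetical sketch of the suspected failure mode (not the redpanda sources):
// a per-segment reader cache that only reuses a stashed reader when the
// requested offset matches, and never evicts, accumulates readers without
// bound under random-access reads of already-hydrated segments.
#include <cstdint>
#include <list>
#include <memory>
#include <vector>

struct remote_segment_reader {
    explicit remote_segment_reader(int64_t o)
      : offset(o) {}
    int64_t offset;
    // Stand-in for the reader's read-ahead buffers (~128KiB apiece).
    std::vector<char> buffer = std::vector<char>(128 * 1024);
};

struct materialized_segment {
    // Returned readers are stashed here for reuse; nothing ever trims it.
    std::list<std::unique_ptr<remote_segment_reader>> readers;

    std::unique_ptr<remote_segment_reader> borrow_reader(int64_t offset) {
        // Reuse a stashed reader only if its offset happens to match.
        for (auto it = readers.begin(); it != readers.end(); ++it) {
            if ((*it)->offset == offset) {
                auto r = std::move(*it);
                readers.erase(it);
                return r;
            }
        }
        // Random-access reads rarely match, so this path dominates and
        // every stale reader stashed above stays resident.
        return std::make_unique<remote_segment_reader>(offset);
    }

    void return_reader(std::unique_ptr<remote_segment_reader> r) {
        readers.push_back(std::move(r)); // unbounded growth: no eviction
    }
};
```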

@jcsp jcsp added the kind/bug (Something isn't working) and area/cloud-storage (Shadow indexing subsystem) labels Aug 19, 2022
jcsp added a commit to jcsp/redpanda that referenced this issue Aug 19, 2022
Previously, we only evicted stale segments, not readers.

So if the segments remained materialized, they could accumulate
ever-larger numbers of readers, resulting in out-of-memory
conditions.

After this change, materialized segments are only allowed
to have one reader in their `readers` list after a call
into borrow_reader(); the net result is that a segment
can have up to two readers cached.

Fixes redpanda-data#6111
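
A minimal sketch of that change, reusing the hypothetical `materialized_segment` type from the sketch above (so, again, not the actual redpanda code): after handing a reader out, the cached `readers` list is eagerly trimmed to at most one entry, capping the per-segment footprint at one cached reader plus the one currently borrowed.

```cpp
// Sketch of the bounded borrow path, built on the hypothetical
// materialized_segment / remote_segment_reader types from the earlier sketch.
#include <cstdint>
#include <memory>

std::unique_ptr<remote_segment_reader>
borrow_reader_bounded(materialized_segment& seg, int64_t offset) {
    auto r = seg.borrow_reader(offset); // reuse-or-create, as before
    // Eagerly evict stale cached readers so at most one survives the call.
    while (seg.readers.size() > 1) {
        seg.readers.pop_front();
    }
    return r;
}
```

Trimming inside the borrow path keeps the bound enforced on the hot path, without relying on a separate sweep that only runs when another segment is hydrated.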
jcsp added a commit to jcsp/redpanda that referenced this issue Aug 19, 2022
This test will bad_alloc sometimes in docker if using
the original parallelism.  This is a redpanda bug, as
the parallelism wasn't terribly high.  It will be fixed
separately, but this commit stabilizes the test in the
meantime.

Related: redpanda-data#6111
pvsune pushed a commit that referenced this issue Aug 24, 2022
This test will bad_alloc sometimes in docker if using
the original parallelism.  This is a redpanda bug, as
the parallelism wasn't terribly high.  It will be fixed
separately, but this commit stabilizes the test in the
meantime.

Related: #6111
jcsp added a commit to jcsp/redpanda that referenced this issue Sep 13, 2022
Previously, when instantiating many readers on many materialized
segments, we were vulnerable to an unbounded number of readers
accumulating:
- excess readers on materialized segments were only GC'd
  when we hydrated another segment.  If readers were hitting
  already-hydrated segments then we would never trim
  the per-segment cache of readers
- in-use readers (i.e. those not stashed in a segment's `readers`
  list) were not tracked anywhere, and there was no limit on how
  many might be created.

This change does not apply any backpressure, but it triggers
proactive dropping of readers when a partition's reader count
exceeds the capacity of a semaphore.

Fixes redpanda-data#6111
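
A minimal sketch of this approach, again using the hypothetical types from the first sketch rather than the actual implementation (the commit describes the bound in terms of a semaphore's capacity; a plain counter models the same budget here): a per-partition budget counts live readers, and once the count reaches the budget, cached readers are dropped partition-wide before a new reader is created, but creation itself is never blocked.

```cpp
// Sketch of a per-partition reader budget with proactive trimming and no
// backpressure (hypothetical types, not the redpanda implementation).
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

struct partition_reader_budget {
    std::size_t capacity = 10;     // plays the role of the semaphore's unit count
    std::size_t live_readers = 0;  // readers alive right now, in use or cached
};

// Drop cached readers partition-wide until we are back under budget, or
// until there is nothing left to drop.
void trim_readers(partition_reader_budget& budget,
                  std::vector<materialized_segment*>& segments) {
    for (auto* seg : segments) {
        while (budget.live_readers >= budget.capacity
               && !seg->readers.empty()) {
            seg->readers.pop_front();
            --budget.live_readers;
        }
    }
}

// No backpressure: if every live reader is genuinely in use we still create
// one more; we only make sure stale cached readers are not hoarding memory.
std::unique_ptr<remote_segment_reader>
make_reader(partition_reader_budget& budget,
            std::vector<materialized_segment*>& segments,
            int64_t offset) {
    if (budget.live_readers >= budget.capacity) {
        trim_readers(budget, segments);
    }
    ++budget.live_readers; // cache reuse elided for brevity
    return std::make_unique<remote_segment_reader>(offset);
}
```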
mmaslankaprv pushed a commit to mmaslankaprv/redpanda that referenced this issue Sep 19, 2022
This test will bad_alloc sometimes in docker if using
the original parallelism.  This is a redpanda bug, as
the parallelism wasn't terribly high.  It will be fixed
separately, but this commit stabilizes the test in the
meantime.

Related: redpanda-data#6111
(cherry picked from commit 5a1273c)
jcsp added a commit to jcsp/redpanda that referenced this issue Oct 18, 2022
test_write_with_node_failures was disabled for a ticket that
was fixed already.

test_write_with_node_failures was disabled unnecessarily, because
the test body had already been tweaked to work around redpanda-data#6111 by using
a smaller reader count until we fix the code to limit concurrent
readers.

Related: redpanda-data#6111
@jcsp jcsp changed the title bad_allocs in ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy with 10 readers, 1 writer, 1GB ram per core Limit number of remote segment readers allocated (bad_allocs in ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy with 10 readers, 1 writer, 1GB ram per core) Oct 31, 2022
@jcsp jcsp closed this as completed in #7042 Nov 4, 2022