CI Failure (NodeCrash: OOM in continuous_batch_parser) in ShadowIndexingManyPartitionsTest.test_many_partitions_shutdown #9375
An apparently similar OOM occurred on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/24716#0186c758-39eb-4b1c-80c3-bc06927ba36d.
This is the one we discussed here: https://redpandadata.slack.com/archives/C04ASNM2ZLK/p1678306064561969
I spent some time trying to add memory tracking to the segment readers to confirm where the consumed memory is coming from, but I've run into a wall with the plumbing. Instead, some analysis from the logs:
Multiple OOMs have been reported in this test, each reporting a large number of allocated 128K spans. Judging from the failed allocations, the cause appears to be failures to allocate buffers on the read path, whose sizes are tuned by storage configs. This commit updates those configs to reduce the amount of allocation in this test, given that the point of the test is to stress the number of segment readers, not memory consumption or performance under load. Fixes redpanda-data#9375
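A sketch of the kind of override that commit describes, assuming Redpanda's `storage_read_buffer_size` and `storage_read_readahead_count` cluster properties are the relevant knobs; the values here are illustrative, not the ones from the actual commit:

```python
# Shrink the per-reader read buffers so many concurrent segment readers
# fit in memory. Property names assume Redpanda's storage read configs;
# values are illustrative, not taken from the commit.
extra_rp_conf = {
    # per-reader read buffer; the default is 128 KiB, matching the
    # 128K spans seen in the OOM reports
    "storage_read_buffer_size": 32 * 1024,
    # how many of those buffers each reader keeps in flight
    "storage_read_readahead_count": 1,
}
# In a ducktape test this would typically be passed to the RedpandaService
# via its extra_rp_conf argument.
```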
Looking at an instance of this in https://buildkite.com/redpanda/redpanda/builds/26497.
This will crash on demand with the allocation failure if you run a producer at the same time as the consumer.
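The exact commands used aren't shown above; a minimal sketch of that kind of concurrent load, with a placeholder broker address and topic name, using confluent-kafka:

```python
# Run a producer and a consumer against the same topic concurrently.
# BROKERS and TOPIC are placeholders, not values from the test.
import threading
from confluent_kafka import Producer, Consumer

BROKERS = "localhost:9092"  # placeholder
TOPIC = "test-topic"        # placeholder

def produce_forever():
    p = Producer({"bootstrap.servers": BROKERS})
    payload = b"x" * 1024
    while True:
        p.produce(TOPIC, payload)
        p.poll(0)  # serve delivery callbacks

def consume_forever():
    c = Consumer({
        "bootstrap.servers": BROKERS,
        "group.id": "oom-repro",
        "auto.offset.reset": "earliest",
    })
    c.subscribe([TOPIC])
    while True:
        c.poll(1.0)  # fetches drive reads through the storage layer

threading.Thread(target=produce_forever, daemon=True).start()
consume_forever()
```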
This saves up to 24 KiB of memory per index when a segment is closed with very few entries in its index. Related: redpanda-data#9375
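A hypothetical illustration of the idea behind that change: reserve a fixed index buffer up front, then shrink it to the entries actually written when the segment closes. Names and sizes are illustrative, not Redpanda's actual code:

```python
# Toy model of an index that preallocates a fixed buffer and trims it
# on close, so a 1-2 entry index no longer pins 24 KiB.
class SegmentIndex:
    ENTRY_SIZE = 16         # illustrative entry width
    PREALLOC = 24 * 1024    # up to 24 KiB reserved up front

    def __init__(self):
        self.buf = bytearray(self.PREALLOC)
        self.used = 0

    def append(self, entry: bytes):
        self.buf[self.used:self.used + self.ENTRY_SIZE] = entry
        self.used += self.ENTRY_SIZE

    def close(self):
        # release the unused tail of the preallocated buffer
        del self.buf[self.used:]

idx = SegmentIndex()
idx.append(b"\x00" * SegmentIndex.ENTRY_SIZE)
idx.close()
assert len(idx.buf) == SegmentIndex.ENTRY_SIZE  # 16 bytes, not 24 KiB
```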
One thing very apparent in a heap profile of this was excess memory consumption from segment indices: this test creates a huge number of tiny segments, which were all allocating up to 24 KiB for their indices in spite of only having 1-2 batches. I've opened #9904 to fix that, but I can still make the reproducer OOM pretty easily. There is also a weird part of the profile that implicates a circular buffer.

Experimenting with stricter enforcement of reader limits, I notice that when the underlying tiered storage reads are throttled, the client ends up sending overlapping fetch requests: this might be part of what triggers resource exhaustion. This is tangentially related to #3409, although in this instance the kafka fetch code itself doesn't appear to be misbehaving; it's more that the underlying resource cost of servicing a fixed number of bytes is much higher when that happens to touch a lot of micro segments.
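A back-of-envelope on why those tiny indices matter, with a hypothetical segment count:

```python
# Rough estimate of index overhead; the segment count is hypothetical,
# not measured from the test.
per_index = 24 * 1024      # up to 24 KiB preallocated per segment index
tiny_segments = 50_000     # hypothetical number of 1-2 batch segments
print(f"{per_index * tiny_segments / 2**30:.2f} GiB")  # ~1.14 GiB of waste
```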
This saves up to 24 KiB of memory per index when a segment is closed with very few entries in its index. Related: redpanda-data#9375 (cherry picked from commit f5e6b57)
Closing old issues that have not occurred in 2 months.
Also seen on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/24700#0186c665-8820-43de-be97-c81f81d91d56 on node docker-rp-9.