CI Failure (NodeCrash: OOM in continuous_batch_parser) in ShadowIndexingManyPartitionsTest.test_many_partitions_shutdown #9375

Closed
dlex opened this issue Mar 10, 2023 · 9 comments · May be fixed by #9419
Labels: area/cloud-storage (Shadow indexing subsystem), ci-failure, kind/bug (Something isn't working), sev/medium (Bugs that do not meet criteria for high or critical, but are more severe than low)

dlex commented Mar 10, 2023

on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/24700#0186c665-8820-43de-be97-c81f81d91d56

Module: rptest.tests.e2e_shadow_indexing_test
Class:  ShadowIndexingManyPartitionsTest
Method: test_many_partitions_shutdown
test_id:    rptest.tests.e2e_shadow_indexing_test.ShadowIndexingManyPartitionsTest.test_many_partitions_shutdown
status:     FAIL
run time:   2 minutes 50.439 seconds


    <NodeCrash docker-rp-9: Redpanda process unexpectedly stopped>
Traceback (most recent call last):
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/e2e_shadow_indexing_test.py", line 432, in test_many_partitions_shutdown
    seq_consumer.wait()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/services/service.py", line 261, in wait
    if not self.wait_node(node, end - now):
  File "/root/tests/rptest/services/kgo_verifier_services.py", line 181, in wait_node
    self._redpanda.wait_until(lambda: self._status.active is False or self.
  File "/root/tests/rptest/services/redpanda.py", line 1278, in wait_until
    wait_until(wrapped,
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 53, in wait_until
    raise e
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 44, in wait_until
    if condition():
  File "/root/tests/rptest/services/redpanda.py", line 1275, in wrapped
    assert self.all_up()
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/root/tests/rptest/services/cluster.py", line 66, in wrapped
    redpanda.raise_on_crash()
  File "/root/tests/rptest/services/redpanda.py", line 1732, in raise_on_crash
    raise NodeCrash(crashes)
rptest.services.utils.NodeCrash: <NodeCrash docker-rp-9: Redpanda process unexpectedly stopped>

docker-rp-9:

DEBUG 2023-03-09 12:57:11,971 [shard 1] seastar_memory - Failed to allocate 125952 bytes at 0x5f262d4 0x5c18f3d 0x5c0ea9e 0x5bd96dc 0x5bd903e 0x5bda9a8 0x5cc2def 0x5cc6ac7 0x5d09e15 0x5c64bbf /opt/redpanda_installs/ci/lib/libc.so.6+0x91016 /opt/redpanda_installs/ci/lib/libc.so.6+0x1166cf
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<seastar::temporary_buffer<unsigned char> >, seastar::posix_file_impl::do_dma_read_bulk(unsigned long, unsigned long, seastar::io_priority_class const&, seastar::io_intent*)::$_8, seastar::future<seastar::temporary_buffer<unsigned char> > seastar::future<unsigned long>::then_impl_nrvo<seastar::posix_file_impl::do_dma_read_bulk(unsigned long, unsigned long, seastar::io_priority_class const&, seastar::io_intent*)::$_8, seastar::future<seastar::temporary_buffer<unsigned char> > >(seastar::posix_file_impl::do_dma_read_bulk(unsigned long, unsigned long, seastar::io_priority_class const&, seastar::io_intent*)::$_8&&)::'lambda'(seastar::internal::promise_base_with_type<seastar::temporary_buffer<unsigned char> >&&, seastar::posix_file_impl::do_dma_read_bulk(unsigned long, unsigned long, seastar::io_priority_class const&, seastar::io_intent*)::$_8&, seastar::future_state<unsigned long>&&), unsigned long>
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<seastar::temporary_buffer<char> >, seastar::future<seastar::temporary_buffer<char> > seastar::file::dma_read_bulk<char>(unsigned long, unsigned long, seastar::io_priority_class const&, seastar::io_intent*)::'lambda'(seastar::temporary_buffer<unsigned char>), seastar::future<seastar::temporary_buffer<char> > seastar::future<seastar::temporary_buffer<unsigned char> >::then_impl_nrvo<seastar::future<seastar::temporary_buffer<char> > seastar::file::dma_read_bulk<char>(unsigned long, unsigned long, seastar::io_priority_class const&, seastar::io_intent*)::'lambda'(seastar::temporary_buffer<unsigned char>), seastar::future<seastar::temporary_buffer<char> > >(char&&)::'lambda'(seastar::internal::promise_base_with_type<seastar::temporary_buffer<char> >&&, seastar::future<seastar::temporary_buffer<char> > seastar::file::dma_read_bulk<char>(unsigned long, unsigned long, seastar::io_priority_class const&, seastar::io_intent*)::'lambda'(seastar::temporary_buffer<unsigned char>)&, seastar::future_state<seastar::temporary_buffer<unsigned char> >&&), seastar::temporary_buffer<unsigned char> >
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<seastar::temporary_buffer<char> >, seastar::file_data_source_impl::issue_read_aheads(unsigned int)::'lambda'(seastar::future<seastar::temporary_buffer<char> >), seastar::futurize<seastar::future<seastar::temporary_buffer<char> > >::type seastar::future<seastar::temporary_buffer<char> >::then_wrapped_nrvo<seastar::future<seastar::temporary_buffer<char> >, seastar::file_data_source_impl::issue_read_aheads(unsigned int)::'lambda'(seastar::future<seastar::temporary_buffer<char> >)>(seastar::file_data_source_impl::issue_read_aheads(unsigned int)::'lambda'(seastar::future<seastar::temporary_buffer<char> >)&&)::'lambda'(seastar::internal::promise_base_with_type<seastar::temporary_buffer<char> >&&, seastar::file_data_source_impl::issue_read_aheads(unsigned int)::'lambda'(seastar::future<seastar::temporary_buffer<char> >)&, seastar::future_state<seastar::temporary_buffer<char> >&&), seastar::temporary_buffer<char> >
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<seastar::temporary_buffer<char> >, seastar::input_stream<char>::read_up_to(unsigned long)::'lambda'(seastar::temporary_buffer<char>), seastar::future<seastar::temporary_buffer<char> > seastar::future<seastar::temporary_buffer<char> >::then_impl_nrvo<seastar::input_stream<char>::read_up_to(unsigned long)::'lambda'(seastar::temporary_buffer<char>), seastar::future<seastar::temporary_buffer<char> > >(seastar::input_stream<char>::read_up_to(unsigned long)::'lambda'(seastar::temporary_buffer<char>)&&)::'lambda'(seastar::internal::promise_base_with_type<seastar::temporary_buffer<char> >&&, seastar::input_stream<char>::read_up_to(unsigned long)::'lambda'(seastar::temporary_buffer<char>)&, seastar::future_state<seastar::temporary_buffer<char> >&&), seastar::temporary_buffer<char> >
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_3::operator()(iobuf&, unsigned long&) const::'lambda0'()::operator()() const::'lambda'(seastar::temporary_buffer<char>), seastar::future<void> seastar::future<seastar::temporary_buffer<char> >::then_impl_nrvo<read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_3::operator()(iobuf&, unsigned long&) const::'lambda0'()::operator()() const::'lambda'(seastar::temporary_buffer<char>), seastar::future<void> >(read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_3::operator()(iobuf&, unsigned long&) const::'lambda0'()::operator()() const::'lambda'(seastar::temporary_buffer<char>)&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_3::operator()(iobuf&, unsigned long&) const::'lambda0'()::operator()() const::'lambda'(seastar::temporary_buffer<char>)&, seastar::future_state<seastar::temporary_buffer<char> >&&), seastar::temporary_buffer<char> >
   --------
   seastar::internal::do_until_state<read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_3::operator()(iobuf&, unsigned long&) const::'lambda'(), read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_3::operator()(iobuf&, unsigned long&) const::'lambda0'()>
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<iobuf>, read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_3::operator()(iobuf&, unsigned long&) const::'lambda1'(), seastar::future<iobuf> seastar::future<void>::then_impl_nrvo<read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_3::operator()(iobuf&, unsigned long&) const::'lambda1'(), seastar::future<iobuf> >(read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_3::operator()(iobuf&, unsigned long&) const::'lambda1'()&&)::'lambda'(seastar::internal::promise_base_with_type<iobuf>&&, read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_3::operator()(iobuf&, unsigned long&) const::'lambda1'()&, seastar::future_state<seastar::internal::monostate>&&), void>
   --------
   seastar::internal::do_with_state<std::__1::tuple<iobuf, unsigned long>, seastar::future<iobuf> >
   --------
   seastar::internal::coroutine_traits_base<boost::outcome_v2::basic_result<model::record_batch_header, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<model::record_batch_header, std::__1::error_code, void> > >::promise_type
   --------
   seastar::internal::coroutine_traits_base<boost::outcome_v2::basic_result<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, void> > >::promise_type
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<boost::outcome_v2::basic_result<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, void> > >, storage::continuous_batch_parser::consume_one()::$_0, seastar::future<boost::outcome_v2::basic_result<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, void> > > seastar::future<boost::outcome_v2::basic_result<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, void> > >::then_impl_nrvo<storage::continuous_batch_parser::consume_one()::$_0, seastar::future<boost::outcome_v2::basic_result<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, void> > > >(storage::continuous_batch_parser::consume_one()::$_0&&)::'lambda'(seastar::internal::promise_base_with_type<boost::outcome_v2::basic_result<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, void> > >&&, storage::continuous_batch_parser::consume_one()::$_0&, seastar::future_state<boost::outcome_v2::basic_result<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, void> > >&&), boost::outcome_v2::basic_result<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, void> > >
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >, storage::continuous_batch_parser::consume()::$_2::operator()() const::'lambda'(boost::outcome_v2::basic_result<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, void> >), seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > seastar::future<boost::outcome_v2::basic_result<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, void> > >::then_impl_nrvo<storage::continuous_batch_parser::consume()::$_2::operator()() const::'lambda'(boost::outcome_v2::basic_result<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, void> >), seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > >(storage::continuous_batch_parser::consume()::$_2::operator()() const::'lambda'(boost::outcome_v2::basic_result<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, void> >)&&)::'lambda'(seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >&&, storage::continuous_batch_parser::consume()::$_2::operator()() const::'lambda'(boost::outcome_v2::basic_result<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, void> >)&, seastar::future_state<boost::outcome_v2::basic_result<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, void> > >&&), boost::outcome_v2::basic_result<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<seastar::bool_class<storage::stop_parser_tag>, std::__1::error_code, void> > >
   --------
   seastar::internal::repeater<storage::continuous_batch_parser::consume()::$_2>
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<boost::outcome_v2::basic_result<unsigned long, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<unsigned long, std::__1::error_code, void> > >, storage::continuous_batch_parser::consume()::$_3, seastar::future<boost::outcome_v2::basic_result<unsigned long, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<unsigned long, std::__1::error_code, void> > > seastar::future<void>::then_impl_nrvo<storage::continuous_batch_parser::consume()::$_3, seastar::future<boost::outcome_v2::basic_result<unsigned long, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<unsigned long, std::__1::error_code, void> > > >(storage::continuous_batch_parser::consume()::$_3&&)::'lambda'(seastar::internal::promise_base_with_type<boost::outcome_v2::basic_result<unsigned long, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<unsigned long, std::__1::error_code, void> > >&&, storage::continuous_batch_parser::consume()::$_3&, seastar::future_state<seastar::internal::monostate>&&), void>
   --------
   seastar::internal::coroutine_traits_base<boost::outcome_v2::basic_result<seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> >, std::__1::error_code, boost::outcome_v2::policy::error_code_throw_as_system_error<seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> >, std::__1::error_code, void> > >::promise_type
   --------
   seastar::internal::coroutine_traits_base<std::__1::variant<seastar::circular_buffer<model::record_batch, std::__1::allocator<model::record_batch> >, model::record_batch_reader::foreign_data_t> >::promise_type
ERROR 2023-03-09 12:57:11,972 [shard 1] seastar_memory - Dumping seastar memory diagnostics
Used memory:  1023M
Free memory:  768K
Total memory: 1G

Small pools:
objsz	spansz	usedobj	memory	unused	wst%
8	4K	177	20K	19K	93
10	4K	1	8K	8K	99
12	4K	1	8K	8K	99
14	4K	2	8K	8K	99
16	4K	1k	60K	42K	70
32	4K	10k	320K	18K	5
32	4K	4k	348K	232K	66
32	4K	2k	72K	8K	10
32	4K	3k	360K	262K	72
48	4K	3k	164K	9K	5
48	4K	10k	464K	13K	2
64	4K	7k	624K	170K	27
64	4K	68k	4M	96K	2
80	4K	8k	760K	116K	15
96	4K	4k	504K	171K	33
112	4K	704	200K	123K	61
128	4K	137	92K	75K	81
160	4K	5k	952K	130K	13
192	4K	326	424K	363K	85
224	4K	960	260K	50K	19
256	4K	882	1M	964K	81
320	8K	591	384K	199K	51
384	8K	644	464K	222K	47
448	4K	3k	1M	233K	15
512	4K	848	576K	152K	26
640	16K	63	416K	376K	90
768	16K	2k	2M	6K	0
896	8K	297	576K	316K	54
1024	4K	75	152K	77K	50
1280	32K	2k	2M	69K	2
1536	32K	6	128K	119K	92
1792	16K	7	144K	131K	91
2048	8K	59	152K	34K	22
2560	64K	143	576K	218K	37
3072	64K	1	256K	252K	98
3584	32K	3	288K	277K	96
4096	16K	81	880K	556K	63
5120	128K	2	384K	370K	96
6144	128K	25	512K	360K	70
7168	64K	18	576K	448K	77
8192	32K	210	4M	2M	53
10240	64K	205	2M	440K	17
12288	64K	59	960K	252K	26
14336	128K	391	6M	280K	4
16384	64K	7k	108M	960K	0
Page spans:
index	size	free	used	spans
0	4K	344K	13M	3k
1	8K	184K	2M	220
2	16K	208K	3M	200
3	32K	32K	88M	3k
4	64K	0B	150M	2k
5	128K	0B	695M	6k
6	256K	0B	768K	3
7	512K	0B	0B	0
8	1M	0B	0B	0
9	2M	0B	0B	0
10	4M	0B	8M	2
11	8M	0B	0B	0
12	16M	0B	64M	4
13	32M	0B	0B	0
14	64M	0B	0B	0
15	128M	0B	0B	0
16	256M	0B	0B	0
17	512M	0B	0B	0
18	1G	0B	0B	0
19	2G	0B	0B	0
20	4G	0B	0B	0
21	8G	0B	0B	0
22	16G	0B	0B	0
23	32G	0B	0B	0
24	64G	0B	0B	0
25	128G	0B	0B	0
26	256G	0B	0B	0
27	512G	0B	0B	0
28	1T	0B	0B	0
29	2T	0B	0B	0
30	4T	0B	0B	0
31	8T	0B	0B	0

ERROR 2023-03-09 12:57:11,972 [shard 1] seastar - Failed to allocate 125952 bytes
Aborting on shard 1.
Backtrace:
  0x5ca4cb6
  0x5d086b2
  /opt/redpanda_installs/ci/lib/libc.so.6+0x42abf
  /opt/redpanda_installs/ci/lib/libc.so.6+0x92e3b
  /opt/redpanda_installs/ci/lib/libc.so.6+0x42a15
  /opt/redpanda_installs/ci/lib/libc.so.6+0x2c82e
  0x5c190ba
  0x5c0ea9e
  0x5bd96dc
  0x5bd903e
  0x5bda9a8
  0x5cc2def
  0x5cc6ac7
  0x5d09e15
  0x5c64bbf
  /opt/redpanda_installs/ci/lib/libc.so.6+0x91016
  /opt/redpanda_installs/ci/lib/libc.so.6+0x1166cf
dlex added the kind/bug and ci-failure labels Mar 10, 2023
dlex commented Mar 10, 2023

An apparently similar OOM occurred on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/24716#0186c758-39eb-4b1c-80c3-bc06927ba36d:

DEBUG 2023-03-09 17:21:31,364 [shard 1] seastar_memory - Failed to allocate 131072 bytes at 0x5f262d4 0x5c18f3d 0x5c0ea9e 0x5bd96dc 0x5bd903e 0x5bda9a8 0x5cc2def 0x5cc6ac7 0x5d09e15 0x5c64bbf /opt/redpanda_installs/ci/lib/libc.so.6+0x91016 /opt/redpanda_installs/ci/lib/libc.so.6+0x1166cf
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<seastar::temporary_buffer<unsigned char> >, seastar::posix_file_impl::do_dma_read_bulk(unsigned long, unsigned long, seastar::io_priority_class const&, seastar::io_intent*)::$_8, seastar::future<seastar::temporary_buffer<unsigned char> > seastar::future<unsigned long>::then_impl_nrvo<seastar::posix_file_impl::do_dma_read_bulk(unsigned long, unsigned long, seastar::io_priority_class const&, seastar::io_intent*)::$_8, seastar::future<seastar::temporary_buffer<unsigned char> > >(seastar::posix_file_impl::do_dma_read_bulk(unsigned long, unsigned long, seastar::io_priority_class const&, seastar::io_intent*)::$_8&&)::'lambda'(seastar::internal::promise_base_with_type<seastar::temporary_buffer<unsigned char> >&&, seastar::posix_file_impl::do_dma_read_bulk(unsigned long, unsigned long, seastar::io_priority_class const&, seastar::io_intent*)::$_8&, seastar::future_state<unsigned long>&&), unsigned long>
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<seastar::temporary_buffer<char> >, seastar::future<seastar::temporary_buffer<char> > seastar::file::dma_read_bulk<char>(unsigned long, unsigned long, seastar::io_priority_class const&, seastar::io_intent*)::'lambda'(seastar::temporary_buffer<unsigned char>), seastar::future<seastar::temporary_buffer<char> > seastar::future<seastar::temporary_buffer<unsigned char> >::then_impl_nrvo<seastar::future<seastar::temporary_buffer<char> > seastar::file::dma_read_bulk<char>(unsigned long, unsigned long, seastar::io_priority_class const&, seastar::io_intent*)::'lambda'(seastar::temporary_buffer<unsigned char>), seastar::future<seastar::temporary_buffer<char> > >(char&&)::'lambda'(seastar::internal::promise_base_with_type<seastar::temporary_buffer<char> >&&, seastar::future<seastar::temporary_buffer<char> > seastar::file::dma_read_bulk<char>(unsigned long, unsigned long, seastar::io_priority_class const&, seastar::io_intent*)::'lambda'(seastar::temporary_buffer<unsigned char>)&, seastar::future_state<seastar::temporary_buffer<unsigned char> >&&), seastar::temporary_buffer<unsigned char> >
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<seastar::temporary_buffer<char> >, seastar::file_data_source_impl::issue_read_aheads(unsigned int)::'lambda'(seastar::future<seastar::temporary_buffer<char> >), seastar::futurize<seastar::future<seastar::temporary_buffer<char> > >::type seastar::future<seastar::temporary_buffer<char> >::then_wrapped_nrvo<seastar::future<seastar::temporary_buffer<char> >, seastar::file_data_source_impl::issue_read_aheads(unsigned int)::'lambda'(seastar::future<seastar::temporary_buffer<char> >)>(seastar::file_data_source_impl::issue_read_aheads(unsigned int)::'lambda'(seastar::future<seastar::temporary_buffer<char> >)&&)::'lambda'(seastar::internal::promise_base_with_type<seastar::temporary_buffer<char> >&&, seastar::file_data_source_impl::issue_read_aheads(unsigned int)::'lambda'(seastar::future<seastar::temporary_buffer<char> >)&, seastar::future_state<seastar::temporary_buffer<char> >&&), seastar::temporary_buffer<char> >
ERROR 2023-03-09 17:21:31,364 [shard 1] seastar_memory - Dumping seastar memory diagnostics
Used memory:  1023M
Free memory:  884K
Total memory: 1G

Small pools:
objsz	spansz	usedobj	memory	unused	wst%
8	4K	186	20K	19K	92
10	4K	1	8K	8K	99
12	4K	1	8K	8K	99
14	4K	2	8K	8K	99
16	4K	1k	64K	46K	71
32	4K	10k	320K	18K	5
32	4K	4k	448K	333K	74
32	4K	2k	72K	8K	10
32	4K	3k	368K	273K	74
48	4K	3k	172K	13K	7
48	4K	10k	460K	6K	1
64	4K	7k	628K	160K	25
64	4K	68k	4M	172K	3
80	4K	9k	804K	133K	16
96	4K	4k	532K	178K	33
112	4K	735	280K	200K	71
128	4K	173	80K	58K	72
160	4K	5k	976K	139K	14
192	4K	338	476K	413K	86
224	4K	966	260K	49K	18
256	4K	2k	1M	1021K	71
320	8K	683	392K	178K	45
384	8K	711	472K	205K	43
448	4K	3k	2M	205K	12
512	4K	911	588K	133K	22
640	16K	150	432K	338K	78
768	16K	2k	2M	44K	2
896	8K	373	696K	369K	53
1024	4K	74	152K	78K	51
1280	32K	2k	3M	418K	15
1536	32K	13	128K	108K	84
1792	16K	7	144K	131K	91
2048	8K	60	152K	32K	21
2560	64K	167	576K	158K	27
3072	64K	1	256K	252K	98
3584	32K	9	288K	256K	88
4096	16K	78	912K	600K	65
5120	128K	14	384K	310K	80
6144	128K	15	512K	420K	82
7168	64K	13	576K	483K	83
8192	32K	217	5M	3M	64
10240	64K	199	2M	500K	20
12288	64K	41	960K	468K	48
14336	128K	420	6M	644K	9
16384	64K	7k	108M	960K	0
Page spans:
index	size	free	used	spans
0	4K	220K	14M	4k
1	8K	88K	2M	225
2	16K	288K	3M	210
3	32K	96K	90M	3k
4	64K	192K	150M	2k
5	128K	0B	692M	6k
6	256K	0B	768K	3
7	512K	0B	0B	0
8	1M	0B	0B	0
9	2M	0B	0B	0
10	4M	0B	8M	2
11	8M	0B	0B	0
12	16M	0B	64M	4
13	32M	0B	0B	0
14	64M	0B	0B	0
15	128M	0B	0B	0
16	256M	0B	0B	0
17	512M	0B	0B	0
18	1G	0B	0B	0
19	2G	0B	0B	0
20	4G	0B	0B	0
21	8G	0B	0B	0
22	16G	0B	0B	0
23	32G	0B	0B	0
24	64G	0B	0B	0
25	128G	0B	0B	0
26	256G	0B	0B	0
27	512G	0B	0B	0
28	1T	0B	0B	0
29	2T	0B	0B	0
30	4T	0B	0B	0
31	8T	0B	0B	0

ERROR 2023-03-09 17:21:31,365 [shard 1] seastar - Failed to allocate 131072 bytes
Aborting on shard 1.
Backtrace:
  0x5ca4cb6
  0x5d086b2
  /opt/redpanda_installs/ci/lib/libc.so.6+0x42abf
  /opt/redpanda_installs/ci/lib/libc.so.6+0x92e3b
  /opt/redpanda_installs/ci/lib/libc.so.6+0x42a15
  /opt/redpanda_installs/ci/lib/libc.so.6+0x2c82e
  0x5c190ba
  0x5c0ea9e
  0x5bd96dc
  0x5bd903e
  0x5bda9a8
  0x5cc2def
  0x5cc6ac7
  0x5d09e15
  0x5c64bbf
  /opt/redpanda_installs/ci/lib/libc.so.6+0x91016
  /opt/redpanda_installs/ci/lib/libc.so.6+0x1166cf

jcsp commented Mar 10, 2023

This is the one we discussed here https://redpandadata.slack.com/archives/C04ASNM2ZLK/p1678306064561969

andrwng commented Mar 14, 2023

I spent some time trying to add memory tracking to the segment readers to confirm where the consumed memory is coming from, but I ran into a wall plumbing tracking_allocator into the underlying iobufs (memory tracking in that class seems challenging because it allows sharing of underlying buffers, and the temporary buffers appear to originate as deep down as Seastar's fstream).

Instead, some analysis from logs:

  • There are ~6k spans of ~128K each, accounting for a good chunk (692M) of the 1GB total memory, despite the topics-per-shard limit (and thus the readers-per-shard limit) having been dropped to 1000.
  • The failed allocation is ~128K, so for the sake of digging in, let's assume it has the same culprit as those 128K allocations.
  • From the trace, the failed allocation comes from read-ahead buffers, which by default are configured for remote storage as 128KiB buffers with 10 read-aheads.
  • It isn't obvious to me that the memory consumed is all from read-ahead, but reducing it does seem like a quick win for cutting memory consumption here (a rough back-of-envelope check follows below).
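
As a rough back-of-envelope check of the bullets above (my own estimate, not something taken from the logs): if each of the ~1000 readers on a shard held a full set of 128KiB read-ahead buffers at once, that alone would exceed the ~1GB of shard memory shown in the diagnostics.

# Hedged worst-case estimate only: it assumes every reader simultaneously holds
# (read_ahead + 1) buffers of buffer_size bytes, which need not hold in practice.
readers_per_shard = 1000      # readers-per-shard limit mentioned above
buffer_size = 128 * 1024      # default remote-storage read buffer (128KiB)
read_ahead = 10               # default read-ahead count

worst_case_bytes = readers_per_shard * buffer_size * (read_ahead + 1)
print(f"{worst_case_bytes / 2**30:.2f} GiB")  # ~1.34 GiB vs ~1 GiB per shard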

andrwng added a commit to andrwng/redpanda that referenced this issue Mar 14, 2023
Multiple OOMs have been reported in the test that report a large number
of 128K spans having been allocated. It appears, from failed
allocations, to be caused by the failure to allocate buffers on the read
path, which are tuned by storage configs.

This commit updates these configs to reduce the amount of allocation in
this test, given the point of the test is to stress the number of
segment readers, not memory consumed or performance under load.

Fixes redpanda-data#9375
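
For illustration only, a minimal sketch of the kind of test-side tuning such a commit could apply: shrinking the storage read-path buffers through the cluster config overrides that a ducktape test passes to Redpanda. The property names and values below are my assumptions for the sketch, not the contents of the actual commit.

# Hypothetical overrides a test could hand to its Redpanda cluster via
# extra_rp_conf; names and values here are illustrative assumptions only.
small_read_path_conf = {
    "storage_read_buffer_size": 32 * 1024,    # per-reader read buffer, down from ~128KiB
    "storage_read_readahead_count": 4,        # read-aheads per reader, down from ~10
}

Passing a dict like this as extra_rp_conf when constructing the test's cluster keeps the change scoped to this test, since its point is to stress the number of segment readers rather than read throughput.
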
jcsp added the area/cloud-storage label Mar 16, 2023
andrwng added a commit to andrwng/redpanda that referenced this issue Mar 31, 2023
Multiple OOMs have been reported in the test that report a large number
of 128K spans having been allocated. It appears, from failed
allocations, to be caused by the failure to allocate buffers on the read
path, which are tuned by storage configs.

This commit updates these configs to reduce the amount of allocation in
this test, given the point of the test is to stress the number of
segment readers, not memory consumed or performance under load.

Fixes redpanda-data#9375
jcsp commented Apr 5, 2023

Looking at an instance of this in https://buildkite.com/redpanda/redpanda/builds/26497:

  • Most of the memory is in 32kiB buffers, suggesting it might come from somewhere other than our usual 128kiB bulk I/Os, although the actual allocation failure is on a 128kiB buffer.
  • Partition manifests aren't particularly big: about 100 segments each in about 128 manifests.
  • The consumer is just one sequential consumer: it would be surprising if this were using much memory, unless it's leaving readers behind somehow.
  • During consume, uploads to remote storage are still going on.
  • This is a miniature test node with 2GiB of memory and 2 cores.
  • The fetch that hits an allocation error is the first fetch the consumer does. It is reading from 65 partitions in parallel and ends up building a 67MiB response: big, but not really enough on its own to explain fully running out of memory (rough arithmetic below).
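
A quick sanity check on that last bullet (my arithmetic, assuming the common ~1MiB per-partition fetch cap, which may not match the consumer's actual settings):

partitions_in_fetch = 65
per_partition_max_bytes = 1 * 1024 * 1024   # assumed per-partition fetch limit
response_estimate = partitions_in_fetch * per_partition_max_bytes
print(f"{response_estimate / 2**20:.0f} MiB")  # ~65 MiB, close to the observed 67MiB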

jcsp commented Apr 5, 2023

This will crash on demand with the allocation failure if you run a producer at the same time as the consumer:

diff --git a/tests/rptest/tests/e2e_shadow_indexing_test.py b/tests/rptest/tests/e2e_shadow_indexing_test.py
index f447bc0d51..432d9666ef 100644
--- a/tests/rptest/tests/e2e_shadow_indexing_test.py
+++ b/tests/rptest/tests/e2e_shadow_indexing_test.py
@@ -422,8 +422,24 @@ class ShadowIndexingManyPartitionsTest(PreallocNodesTest):
                                               self.topic,
                                               0,
                                               nodes=self.preallocated_nodes)
-        seq_consumer.start(clean=False)
-        seq_consumer.wait()
+
+        # Intentionally overlap producer and consumer so that we have segment uploads
+        # going on at the same time as wide consumer reads -> memory stress
+        producer = KgoVerifierProducer(self.test_context,
+                                       self.redpanda,
+                                       self.topic,
+                                       msg_size=1024,
+                                       msg_count=1000 * 1000,
+                                       custom_node=self.preallocated_nodes)
+        producer.start()
+        try:
+            seq_consumer.start(clean=False)
+            seq_consumer.wait()
+
+        finally:
+            producer.stop()
+            producer.wait()

jcsp added the sev/medium label Apr 5, 2023
jcsp commented Apr 5, 2023

sev/medium because it's a crash, but it's happening on deliberately underpowered nodes.

jcsp added a commit to jcsp/redpanda that referenced this issue Apr 6, 2023
This saves up to 24kib of memory per index, where
a segment is closed with very few entries in its index.

Related: redpanda-data#9375
jcsp commented Apr 6, 2023

One thing very apparent in a heap profile of this was excess memory consumption from segment indices: this test creates a huge number of tiny segments, all of which were allocating up to 24KiB for their indices despite holding only 1-2 batches. I've opened #9904 to fix that, but I can still make the reproducer OOM pretty easily.
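
Some rough arithmetic (mine, reusing the segment counts from the Apr 5 comment purely as an illustration) on why that index overhead matters at this scale:

# Illustrative only: assumes all of these segments' indices are resident at once
# and each reserves the full ~24KiB prior to the #9904 fix.
manifests = 128
segments_per_manifest = 100
index_overhead_bytes = 24 * 1024

total = manifests * segments_per_manifest * index_overhead_bytes
print(f"{total / 2**20:.0f} MiB")   # ~300 MiB of slack for segments holding 1-2 batches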

There is a weird part of the profile that claims the circular buffer in storage::range is occupying lots of memory, but I can't validate that: I tried adding some hacky counters for how many of those objects exist, and it's <100 at any given moment (the readers for segment uploads, plus the readers for raft stm apply() calls). If you believe the profile, we have some massive leak of reader objects' leases somewhere, but I'm suspicious of the telemetry.

Experimenting with stricter enforcement of reader limits, I notice that when the underlying tiered storage reads are throttled, the client ends up sending overlapping fetch requests: this might be part of what triggers resource exhaustion.

This is tangentially related to #3409, although in this instance the kafka fetch code itself doesn't appear to be misbehaving; rather, the underlying resource cost of servicing a fixed number of bytes is much higher when those bytes happen to touch a lot of micro-segments.

jcsp added a commit to jcsp/redpanda that referenced this issue Apr 13, 2023
This saves up to 24kib of memory per index, where
a segment is closed with very few entries in its index.

Related: redpanda-data#9375
(cherry picked from commit f5e6b57)
jcsp added a commit to jcsp/redpanda that referenced this issue Apr 13, 2023
This saves up to 24kib of memory per index, where
a segment is closed with very few entries in its index.

Related: redpanda-data#9375
(cherry picked from commit f5e6b57)
ballard26 pushed a commit to ballard26/redpanda that referenced this issue May 9, 2023
This saves up to 24kib of memory per index, where
a segment is closed with very few entries in its index.

Related: redpanda-data#9375
NyaliaLui (Contributor) commented:

FAIL test: ShadowIndexingManyPartitionsTest.test_many_partitions_shutdown (2/30 runs)
  failure at 2023-05-26T07:16:43.314Z: <NodeCrash docker-rp-17: Redpanda process unexpectedly stopped>
      on (arm64, container) in job https://buildkite.com/redpanda/redpanda/builds/29954#018856b6-6364-4f73-bd27-dea22ecf2c8c

piyushredpanda (Contributor) commented:

Closing old issues that have not recurred in 2 months.
