CI Failure: decommissioning stopped making progress in NodesDecommissioningTest.test_flipping_decommission_recommission #8621

Closed
ballard26 opened this issue Feb 3, 2023 · 7 comments
Labels
area/replication ci-failure kind/bug Something isn't working sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages

Comments

@ballard26
Contributor

ballard26 commented Feb 3, 2023

https://buildkite.com/redpanda/redpanda/builds/22277#01860dcb-5c6c-45d2-9481-503b7fbbd528

Module: rptest.tests.nodes_decommissioning_test
Class:  NodesDecommissioningTest
Method: test_flipping_decommission_recommission
Arguments:
{
  "node_is_alive": true
}
====================================================================================================
test_id:    rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_flipping_decommission_recommission.node_is_alive=True
status:     FAIL
run time:   1 minute 59.275 seconds


    AssertionError('Node 1 decommissioning stopped making progress')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/nodes_decommissioning_test.py", line 566, in test_flipping_decommission_recommission
    self._wait_for_node_removed(node_id)
  File "/root/tests/rptest/tests/nodes_decommissioning_test.py", line 152, in _wait_for_node_removed
    waiter.wait_for_removal()
  File "/root/tests/rptest/utils/node_operations.py", line 158, in wait_for_removal
    assert self._made_progress(
AssertionError: Node 1 decommissioning stopped making progress

The reason decommissioning for node 1 stopped making progress is that raft_learner_recovery_rate is still set to 1. The test sets it to 104857600 one last time before we start checking for progress. However, if we look at the last couple of config upserts, we can see that the config version returned when the test set it to 104857600 is older than the config version returned when it previously set it to 1:

[DEBUG - 2023-02-01 17:05:02,857 - admin - _request - lineno:305]: Dispatching PUT http://docker-rp-14:9644/v1/cluster_config
[DEBUG - 2023-02-01 17:05:02,912 - admin - _request - lineno:328]: Response OK, JSON: {'config_version': 16}
[DEBUG - 2023-02-01 17:05:02,912 - nodes_decommissioning_test - _set_recovery_rate - lineno:118]: setting recovery rate to 1 result: {'config_version': 16}
...
[DEBUG - 2023-02-01 17:05:03,095 - admin - _request - lineno:305]: Dispatching PUT http://docker-rp-5:9644/v1/cluster_config
[DEBUG - 2023-02-01 17:05:03,145 - admin - _request - lineno:328]: Response OK, JSON: {'config_version': 8}
[DEBUG - 2023-02-01 17:05:03,145 - nodes_decommissioning_test - _set_recovery_rate - lineno:118]: setting recovery rate to 104857600 result: {'config_version': 8}
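
One way the test could defend against this ordering race is to treat the config_version returned by the admin API as a fence: re-issue the PUT until the returned version is strictly newer than the last version the test observed, so the final 104857600 upsert cannot land behind the earlier upsert that set the rate to 1. The sketch below is not the actual rptest _set_recovery_rate helper; the PUT /v1/cluster_config path and the {'config_version': N} response are taken from the logs above, while the {"upsert": {...}} request body and the helper name are assumptions.

# Sketch only: a version-fenced variant of the recovery-rate setter.
# Assumes the admin API accepts {"upsert": {...}} as the PUT body; the
# endpoint and the {'config_version': N} response are visible in the logs.
import time
import requests


def set_recovery_rate_fenced(admin_url, rate, last_seen_version, timeout_s=30):
    """Upsert raft_learner_recovery_rate and retry until the write is
    assigned a config version newer than the last one the test observed."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.put(
            f"{admin_url}/v1/cluster_config",
            json={"upsert": {"raft_learner_recovery_rate": rate}},
        )
        resp.raise_for_status()
        version = resp.json()["config_version"]
        if version > last_seen_version:
            # Newer version: this upsert supersedes the earlier one that set
            # the rate to 1, so it cannot be silently discarded.
            return version
        # Stale version, like 104857600 landing with version 8 after 1
        # landed with version 16 above: retry until a fresh version is won.
        time.sleep(1)
    raise TimeoutError("cluster config PUT kept returning a stale config_version")

In the failure above, the 104857600 write that came back with version 8 would simply be retried until it was assigned a version greater than 16.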

It's clear from the RP node logs that config version 16, with raft_learner_recovery_rate = 1, is the one that is eventually replicated to every node; i.e., the last log message I see on every node from the config_frontend and config_manager is similar to the following:

TRACE 2023-02-01 17:05:04,894 [shard 0] cluster - config_manager.cc:601 - apply: upsert raft_learner_recovery_rate=1
TRACE 2023-02-01 17:05:04,896 [shard 1] cluster - config_manager.cc:601 - apply: upsert raft_learner_recovery_rate=1
INFO  2023-02-01 17:05:04,896 [shard 1] raft - recovery_throttle.h:64 - Updating recovery throttle with new rate of 0
TRACE 2023-02-01 17:05:04,904 [shard 0] cluster - config_frontend.cc:140 - set_next_version: 17
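
A complementary guard, again only a hedged sketch rather than existing test code, would be to wait until every node reports having applied at least the config version of the final rate change before the test starts waiting for node removal. This assumes the admin API exposes a per-node status endpoint at GET /v1/cluster_config/status whose entries carry a config_version field; if the real path or response shape differs, treat the snippet as pseudocode.

# Sketch only: block until every node has caught up to a target config
# version. The /v1/cluster_config/status endpoint and its per-node
# response shape are assumptions, not confirmed by the logs above.
import time
import requests


def wait_for_config_version(admin_url, target_version, timeout_s=30):
    """Poll per-node config status until all nodes report target_version."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        statuses = requests.get(f"{admin_url}/v1/cluster_config/status").json()
        if statuses and all(s["config_version"] >= target_version for s in statuses):
            return
        time.sleep(1)
    raise TimeoutError(
        f"nodes did not reach config version {target_version} within {timeout_s}s")

Combined with the version fence above, a propagation problem would then surface as a clear setup timeout instead of the misleading "decommissioning stopped making progress" assertion.
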
@mmaslankaprv
Member

Marking as sev/low as this is a test setup issue.

@mmaslankaprv mmaslankaprv added the sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages label Feb 7, 2023
@ZeDRoman ZeDRoman self-assigned this Feb 8, 2023
@VadimPlh
Contributor

VadimPlh commented Feb 9, 2023

@ZeDRoman
Contributor

@dlex
Contributor

dlex commented Mar 9, 2023

@michael-redpanda
Contributor

@michael-redpanda
Contributor

@ZeDRoman ZeDRoman removed their assignment Jun 16, 2023
@piyushredpanda
Contributor

Not seen in 30 days, closing.
