CI Failure: decommissioning stopped making progress in NodesDecommissioningTest.test_flipping_decommission_recommission #8621

Closed
ballard26 opened this issue Feb 3, 2023 · 7 comments
Labels
area/replication ci-failure kind/bug Something isn't working sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages

Comments

@ballard26
Contributor

ballard26 commented Feb 3, 2023

https://buildkite.com/redpanda/redpanda/builds/22277#01860dcb-5c6c-45d2-9481-503b7fbbd528

Module: rptest.tests.nodes_decommissioning_test
Class:  NodesDecommissioningTest
Method: test_flipping_decommission_recommission
Arguments:
{
  "node_is_alive": true
}
====================================================================================================
test_id:    rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_flipping_decommission_recommission.node_is_alive=True
status:     FAIL
run time:   1 minute 59.275 seconds


    AssertionError('Node 1 decommissioning stopped making progress')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/nodes_decommissioning_test.py", line 566, in test_flipping_decommission_recommission
    self._wait_for_node_removed(node_id)
  File "/root/tests/rptest/tests/nodes_decommissioning_test.py", line 152, in _wait_for_node_removed
    waiter.wait_for_removal()
  File "/root/tests/rptest/utils/node_operations.py", line 158, in wait_for_removal
    assert self._made_progress(
AssertionError: Node 1 decommissioning stopped making progress

The reason decommissioning for node 1 stopped making progress is that raft_learner_recovery_rate is still set to 1. The test sets it to 104857600 one last time before we start checking for progress. However, if we look at the last couple of config upserts, we can see that the config version returned when the test set it to 104857600 is older than the config version returned when it previously set it to 1:

[DEBUG - 2023-02-01 17:05:02,857 - admin - _request - lineno:305]: Dispatching PUT http://docker-rp-14:9644/v1/cluster_config
[DEBUG - 2023-02-01 17:05:02,912 - admin - _request - lineno:328]: Response OK, JSON: {'config_version': 16}
[DEBUG - 2023-02-01 17:05:02,912 - nodes_decommissioning_test - _set_recovery_rate - lineno:118]: setting recovery rate to 1 result: {'config_version': 16}
...
[DEBUG - 2023-02-01 17:05:03,095 - admin - _request - lineno:305]: Dispatching PUT http://docker-rp-5:9644/v1/cluster_config
[DEBUG - 2023-02-01 17:05:03,145 - admin - _request - lineno:328]: Response OK, JSON: {'config_version': 8}
[DEBUG - 2023-02-01 17:05:03,145 - nodes_decommissioning_test - _set_recovery_rate - lineno:118]: setting recovery rate to 104857600 result: {'config_version': 8}
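
One way the test could defend against this ordering race is to treat the config_version returned by the admin API as a fence: re-issue the PUT until the returned version is strictly newer than the last version the test observed, so the final 104857600 upsert cannot land behind the earlier upsert that set the rate to 1. The sketch below is not the actual rptest _set_recovery_rate helper; the PUT /v1/cluster_config path and the {'config_version': N} response are taken from the logs above, while the {"upsert": {...}} request body and the helper name are assumptions.

# Sketch only: a version-fenced variant of the recovery-rate setter.
# Assumes the admin API accepts {"upsert": {...}} as the PUT body; the
# endpoint and the {'config_version': N} response are visible in the logs.
import time
import requests


def set_recovery_rate_fenced(admin_url, rate, last_seen_version, timeout_s=30):
    """Upsert raft_learner_recovery_rate and retry until the write is
    assigned a config version newer than the last one the test observed."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.put(
            f"{admin_url}/v1/cluster_config",
            json={"upsert": {"raft_learner_recovery_rate": rate}},
        )
        resp.raise_for_status()
        version = resp.json()["config_version"]
        if version > last_seen_version:
            # Newer version: this upsert supersedes the earlier one that set
            # the rate to 1, so it cannot be silently discarded.
            return version
        # Stale version, like 104857600 landing with version 8 after 1
        # landed with version 16 above: retry until a fresh version is won.
        time.sleep(1)
    raise TimeoutError("cluster config PUT kept returning a stale config_version")

In the failure above, the 104857600 write that came back with version 8 would simply be retried until it was assigned a version greater than 16.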

It's clear from the RP node logs that config version 16, with raft_learner_recovery_rate = 1, is the one that is eventually replicated to every node; i.e., the last log message I see on every node from the config_frontend and config_manager is similar to the following:

TRACE 2023-02-01 17:05:04,894 [shard 0] cluster - config_manager.cc:601 - apply: upsert raft_learner_recovery_rate=1
TRACE 2023-02-01 17:05:04,896 [shard 1] cluster - config_manager.cc:601 - apply: upsert raft_learner_recovery_rate=1
INFO  2023-02-01 17:05:04,896 [shard 1] raft - recovery_throttle.h:64 - Updating recovery throttle with new rate of 0
TRACE 2023-02-01 17:05:04,904 [shard 0] cluster - config_frontend.cc:140 - set_next_version: 17
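
A complementary guard, again only a hedged sketch rather than existing test code, would be to wait until every node reports having applied at least the config version of the final rate change before the test starts waiting for node removal. This assumes the admin API exposes a per-node status endpoint at GET /v1/cluster_config/status whose entries carry a config_version field; if the real path or response shape differs, treat the snippet as pseudocode.

# Sketch only: block until every node has caught up to a target config
# version. The /v1/cluster_config/status endpoint and its per-node
# response shape are assumptions, not confirmed by the logs above.
import time
import requests


def wait_for_config_version(admin_url, target_version, timeout_s=30):
    """Poll per-node config status until all nodes report target_version."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        statuses = requests.get(f"{admin_url}/v1/cluster_config/status").json()
        if statuses and all(s["config_version"] >= target_version for s in statuses):
            return
        time.sleep(1)
    raise TimeoutError(
        f"nodes did not reach config version {target_version} within {timeout_s}s")

Combined with the version fence above, a propagation problem would then surface as a clear setup timeout instead of the misleading "decommissioning stopped making progress" assertion.
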
@mmaslankaprv
Member

Marking as sev/low as this is a test setup issue.

@mmaslankaprv mmaslankaprv added the sev/low Bugs which are non-functional paper cuts, e.g. typos, issues in log messages label Feb 7, 2023
@ZeDRoman ZeDRoman self-assigned this Feb 8, 2023
@VadimPlh
Contributor

VadimPlh commented Feb 9, 2023

@ZeDRoman
Contributor

@dlex
Contributor

dlex commented Mar 9, 2023

@michael-redpanda
Contributor

@michael-redpanda
Contributor

@ZeDRoman ZeDRoman removed their assignment Jun 16, 2023
@piyushredpanda
Contributor

Not seen in 30 days, closing.
