Failure in ScalingUpTest.test_adding_nodes_to_cluster #4371

Closed
jcsp opened this issue Apr 21, 2022 · 8 comments · Fixed by #9870

jcsp commented Apr 21, 2022

Possible partition movement bug?

This is on a PR run, but certainly not related to the change in the PR.
https://buildkite.com/redpanda/redpanda/builds/9258#98222f10-573d-4414-b635-c62451a3428b

The test is timing out waiting for partitions to move to the third node added in the test.

In the controller logs I can see that the rebalance on node add is happening, but apparently partition movement is taking longer than the test allows. This is surprising, because the test is giving 30 seconds for a few single-replica partitions to move.

The last check before timeout is:

[INFO  - 2022-04-21 11:58:22,225 - scaling_up_test - partitions_rebalanced - lineno:69]: replicas per node: {2: 6, 1: 5}

The controller leader doesn't get around to kicking off any reallocations for the node_id=3 add until just a few seconds before the timeout expires:

INFO  2022-04-21 11:58:26,869 [shard 0] cluster - members_backend.cc:380 - [update: {node_id: 3, type: added}] calculated reallocations: {{ntp: {kafka/topic2/0}, ...

Up until that time, it is still outputting "calculated reallocations" lines relating to the node 2 addition. So either one of the partition moves to node 2 is going far too slowly, or something is wrong in the controller housekeeping that prevents it from realizing in time that the moves are complete.

The test also has an issue: the wait for node 2's moves returns too early, because it only checks that at least one partition has moved to node 2 before proceeding to start node 3. So node 3's 30-second wait period really has to cover all of the node 2 movement plus the node 3 movement. A stricter gate is sketched below.
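
A possible tightening of that wait, as a sketch only: require the newly added node to hold its full expected share of replicas before the next node is started. The replicas_per_node helper below is hypothetical; the real test gathers equivalent per-node counts inside partitions_rebalanced, and only ducktape's wait_until is used as-is.

from ducktape.utils.util import wait_until

def replicas_per_node():
    # Hypothetical helper: returns replica counts keyed by node id,
    # e.g. {1: 4, 2: 4}. In the real test the equivalent data is what
    # partitions_rebalanced logs as "replicas per node: {...}".
    raise NotImplementedError

def wait_for_node_fully_populated(node_id, expected_replicas, timeout_sec=30):
    # Wait until node_id holds its full expected share of replicas,
    # not merely at least one replica.
    def node_populated():
        counts = replicas_per_node()
        return counts.get(node_id, 0) >= expected_replicas

    wait_until(node_populated,
               timeout_sec=timeout_sec,
               backoff_sec=1,
               err_msg=f"node {node_id} never reached {expected_replicas} replicas")

With a gate like this after adding node 2, node 3's 30-second budget would only have to cover node 3's own partition movements.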

@jcsp jcsp added kind/bug Something isn't working area/raft ci-failure labels Apr 21, 2022
jcsp commented Apr 21, 2022

This is one of those cases where we could just bump the timeout, but I want to know why moving just a few partitions is taking so long.

@ZeDRoman

The last failure was on 2022-04-18.
We think it happens because of the scale tests that run near this one.
If this test fails again, please reopen the issue.

andijcr commented Apr 3, 2023

Same issue?
https://buildkite.com/redpanda/redpanda/builds/26233#018739b0-75b2-4177-bb79-a296b2315e70

FAIL test: ScalingUpTest.test_adding_nodes_to_cluster.partition_count=1 (1/51 runs)
  failure at 2023-03-31T23:03:13.151Z: TimeoutError('')
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/26233#018739b0-75b2-4177-bb79-a296b2315e70

test_id:    rptest.tests.scaling_up_test.ScalingUpTest.test_adding_nodes_to_cluster.partition_count=1
status:     FAIL
run time:   2 minutes 40.901 seconds


    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/scaling_up_test.py", line 133, in test_adding_nodes_to_cluster
    self.wait_for_partitions_rebalanced(total_replicas=total_replicas,
  File "/root/tests/rptest/tests/scaling_up_test.py", line 84, in wait_for_partitions_rebalanced
    wait_until(partitions_rebalanced,
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

also: https://buildkite.com/redpanda/redpanda/builds/26246#01873b7d-a3cc-45f6-9d28-9470f254ed84

FAIL test: ScalingUpTest.test_on_demand_rebalancing.partition_count=1 (1/51 runs)
  failure at 2023-04-01T07:56:09.740Z: TimeoutError('')
      on (arm64, container) in job https://buildkite.com/redpanda/redpanda/builds/26246#01873b7d-a3cc-45f6-9d28-9470f254ed84


test_id:    rptest.tests.scaling_up_test.ScalingUpTest.test_on_demand_rebalancing.partition_count=1
status:     FAIL
run time:   3 minutes 54.952 seconds


    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/scaling_up_test.py", line 211, in test_on_demand_rebalancing
    self.wait_for_partitions_rebalanced(total_replicas=total_replicas,
  File "/root/tests/rptest/tests/scaling_up_test.py", line 84, in wait_for_partitions_rebalanced
    wait_until(partitions_rebalanced,
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

andijcr commented Apr 5, 2023

https://buildkite.com/redpanda/redpanda/builds/26446#01874dca-6355-4a63-8e98-e1798ed55995

FAIL test: ScalingUpTest.test_adding_nodes_to_cluster.partition_count=1 (1/15 runs)
  failure at 2023-04-04T21:00:23.366Z: TimeoutError('')
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/26446#01874dca-6355-4a63-8e98-e1798ed55995

ZeDRoman commented Apr 5, 2023

The new failures are caused by PR #9622.

A size_t value has gone below zero (unsigned underflow):

TRACE 2023-04-04 20:34:59,503 [shard 0] cluster - members_backend.cc:220 - node 3 has 1 replicas allocated in domain 0, requested replicas per node 6148914691236517204, difference: 6.148914691236517e+18
TRACE 2023-04-04 20:34:59,503 [shard 0] cluster - members_backend.cc:220 - node 1 has 18446744073709551615 replicas allocated in domain 0, requested replicas per node 6148914691236517204, difference: -1.2297829382473036e+19
TRACE 2023-04-04 20:34:59,503 [shard 0] cluster - members_backend.cc:220 - node 2 has 18446744073709551614 replicas allocated in domain 0, requested replicas per node 6148914691236517204, difference: -1.2297829382473036e+19
INFO  2023-04-04 20:34:59,503 [shard 0] cluster - members_backend.cc:488 - [update: {{node_id: 3, type: added, offset: 42, update_raft0: true, decom_upd_revision: {nullopt}}}] unevenness error: 5.000000000000001, previous error: 1, improvement: -4.000000000000001
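
For illustration only: the huge values in the log are consistent with a small negative count wrapping around in a 64-bit unsigned type. A minimal Python sketch of that arithmetic (an assumption about the bookkeeping, not the actual members_backend.cc code) reproduces the exact numbers:

# Sketch of the suspected arithmetic, assuming the per-node replica
# counters are 64-bit unsigned (size_t).
U64 = 2**64

def as_u64(x: int) -> int:
    # Interpret a possibly negative Python int as a wrapped size_t.
    return x % U64

# If bookkeeping drives the counts for nodes 1 and 2 slightly below zero:
node_counts = {1: as_u64(-1), 2: as_u64(-2), 3: 1}
print(node_counts[1])  # 18446744073709551615 -- matches the node 1 log line
print(node_counts[2])  # 18446744073709551614 -- matches the node 2 log line

# "requested replicas per node" then becomes the wrapped total divided by 3:
total = as_u64(sum(node_counts.values()))
print(total // 3)      # 6148914691236517204 -- matches the log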

@piyushredpanda

@vshtokman: Looks like an RP bug we should fix...

ZeDRoman commented Apr 6, 2023

> @vshtokman: Looks like an RP bug we should fix...

Yes, it is an RP bug.
I am working on a fix.

dlex commented May 18, 2023
