Failure in ScalingUpTest.test_adding_nodes_to_cluster #4371

Closed
jcsp opened this issue Apr 21, 2022 · 8 comments · Fixed by #9870

jcsp commented Apr 21, 2022

Possible partition movement bug?

This is on a PR run, but certainly not related to the change in the PR.
https://buildkite.com/redpanda/redpanda/builds/9258#98222f10-573d-4414-b635-c62451a3428b

The test is timing out waiting for partitions to move to the third node added in the test.

In the controller logs I can see that the rebalance on node add is happening, but apparently partition movement is taking longer than the test allows. This is surprising, because the test is giving 30 seconds for a few single-replica partitions to move.

The last check before timeout is:

[INFO  - 2022-04-21 11:58:22,225 - scaling_up_test - partitions_rebalanced - lineno:69]: replicas per node: {2: 6, 1: 5}

The controller leader doesn't get around to kicking off any reallocations for the node_id=3 add until just a few seconds before the timeout expires:

INFO  2022-04-21 11:58:26,869 [shard 0] cluster - members_backend.cc:380 - [update: {node_id: 3, type: added}] calculated reallocations: {{ntp: {kafka/topic2/0}, ...

Up until that time, it is still outputting "calculated reallocations" lines relating to the node 2 addition. So either one of the partition moves to node 2 is going far too slowly, or something is wrong in the controller housekeeping that prevents it from realizing in time that the moves are complete.

The test also has an issue: the wait for node 2's moves returns too early, because it only checks that at least one partition has moved to node 2 before proceeding to start node 3. So node 3's 30-second wait period really has to cover all of the node 2 movement plus the node 3 movement. A stricter gate is sketched below.
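
A possible tightening of that wait, as a sketch only: require the newly added node to hold its full expected share of replicas before the next node is started. The replicas_per_node helper below is hypothetical; the real test gathers equivalent per-node counts inside partitions_rebalanced, and only ducktape's wait_until is used as-is.

from ducktape.utils.util import wait_until

def replicas_per_node():
    # Hypothetical helper: returns replica counts keyed by node id,
    # e.g. {1: 4, 2: 4}. In the real test the equivalent data is what
    # partitions_rebalanced logs as "replicas per node: {...}".
    raise NotImplementedError

def wait_for_node_fully_populated(node_id, expected_replicas, timeout_sec=30):
    # Wait until node_id holds its full expected share of replicas,
    # not merely at least one replica.
    def node_populated():
        counts = replicas_per_node()
        return counts.get(node_id, 0) >= expected_replicas

    wait_until(node_populated,
               timeout_sec=timeout_sec,
               backoff_sec=1,
               err_msg=f"node {node_id} never reached {expected_replicas} replicas")

With a gate like this after adding node 2, node 3's 30-second budget would only have to cover node 3's own partition movements.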

@jcsp jcsp added kind/bug Something isn't working area/raft ci-failure labels Apr 21, 2022
jcsp commented Apr 21, 2022

This is one of those cases where we could just bump the timeout, but I want to know why moving just a few partitions is taking so long.

@ZeDRoman

The last failure was on 2022-04-18.
We think it happens because of the scale tests that run near this one.
If this test fails again, please reopen the issue.

andijcr commented Apr 3, 2023

Same issue?
https://buildkite.com/redpanda/redpanda/builds/26233#018739b0-75b2-4177-bb79-a296b2315e70

FAIL test: ScalingUpTest.test_adding_nodes_to_cluster.partition_count=1 (1/51 runs)
  failure at 2023-03-31T23:03:13.151Z: TimeoutError('')
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/26233#018739b0-75b2-4177-bb79-a296b2315e70

test_id:    rptest.tests.scaling_up_test.ScalingUpTest.test_adding_nodes_to_cluster.partition_count=1
status:     FAIL
run time:   2 minutes 40.901 seconds


    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/scaling_up_test.py", line 133, in test_adding_nodes_to_cluster
    self.wait_for_partitions_rebalanced(total_replicas=total_replicas,
  File "/root/tests/rptest/tests/scaling_up_test.py", line 84, in wait_for_partitions_rebalanced
    wait_until(partitions_rebalanced,
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

also: https://buildkite.com/redpanda/redpanda/builds/26246#01873b7d-a3cc-45f6-9d28-9470f254ed84

FAIL test: ScalingUpTest.test_on_demand_rebalancing.partition_count=1 (1/51 runs)
  failure at 2023-04-01T07:56:09.740Z: TimeoutError('')
      on (arm64, container) in job https://buildkite.com/redpanda/redpanda/builds/26246#01873b7d-a3cc-45f6-9d28-9470f254ed84


test_id:    rptest.tests.scaling_up_test.ScalingUpTest.test_on_demand_rebalancing.partition_count=1
status:     FAIL
run time:   3 minutes 54.952 seconds


    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/scaling_up_test.py", line 211, in test_on_demand_rebalancing
    self.wait_for_partitions_rebalanced(total_replicas=total_replicas,
  File "/root/tests/rptest/tests/scaling_up_test.py", line 84, in wait_for_partitions_rebalanced
    wait_until(partitions_rebalanced,
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

andijcr commented Apr 5, 2023

https://buildkite.com/redpanda/redpanda/builds/26446#01874dca-6355-4a63-8e98-e1798ed55995

FAIL test: ScalingUpTest.test_adding_nodes_to_cluster.partition_count=1 (1/15 runs)
  failure at 2023-04-04T21:00:23.366Z: TimeoutError('')
      on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/26446#01874dca-6355-4a63-8e98-e1798ed55995

ZeDRoman commented Apr 5, 2023

The new failures are caused by PR #9622.

A size_t value has gone below zero (unsigned underflow):

TRACE 2023-04-04 20:34:59,503 [shard 0] cluster - members_backend.cc:220 - node 3 has 1 replicas allocated in domain 0, requested replicas per node 6148914691236517204, difference: 6.148914691236517e+18
TRACE 2023-04-04 20:34:59,503 [shard 0] cluster - members_backend.cc:220 - node 1 has 18446744073709551615 replicas allocated in domain 0, requested replicas per node 6148914691236517204, difference: -1.2297829382473036e+19
TRACE 2023-04-04 20:34:59,503 [shard 0] cluster - members_backend.cc:220 - node 2 has 18446744073709551614 replicas allocated in domain 0, requested replicas per node 6148914691236517204, difference: -1.2297829382473036e+19
INFO  2023-04-04 20:34:59,503 [shard 0] cluster - members_backend.cc:488 - [update: {{node_id: 3, type: added, offset: 42, update_raft0: true, decom_upd_revision: {nullopt}}}] unevenness error: 5.000000000000001, previous error: 1, improvement: -4.000000000000001
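
For illustration only: the huge values in the log are consistent with a small negative count wrapping around in a 64-bit unsigned type. A minimal Python sketch of that arithmetic (an assumption about the bookkeeping, not the actual members_backend.cc code) reproduces the exact numbers:

# Sketch of the suspected arithmetic, assuming the per-node replica
# counters are 64-bit unsigned (size_t).
U64 = 2**64

def as_u64(x: int) -> int:
    # Interpret a possibly negative Python int as a wrapped size_t.
    return x % U64

# If bookkeeping drives the counts for nodes 1 and 2 slightly below zero:
node_counts = {1: as_u64(-1), 2: as_u64(-2), 3: 1}
print(node_counts[1])  # 18446744073709551615 -- matches the node 1 log line
print(node_counts[2])  # 18446744073709551614 -- matches the node 2 log line

# "requested replicas per node" then becomes the wrapped total divided by 3:
total = as_u64(sum(node_counts.values()))
print(total // 3)      # 6148914691236517204 -- matches the log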

@piyushredpanda

@vshtokman: Looks like an RP bug we should fix...

ZeDRoman commented Apr 6, 2023

> @vshtokman: Looks like an RP bug we should fix...

Yes, it is an RP bug.
I am working on a fix.

dlex commented May 18, 2023
