Failure in ScalingUpTest.test_adding_nodes_to_cluster #4371
This is one of those cases where we could just bump the timeout, but I want to know why moving just a few partitions is taking so long. |
Last failure was on 2022-04-18 |
same?
also: https://buildkite.com/redpanda/redpanda/builds/26246#01873b7d-a3cc-45f6-9d28-9470f254ed84
|
https://buildkite.com/redpanda/redpanda/builds/26446#01874dca-6355-4a63-8e98-e1798ed55995
|
The new failures are caused by PR #9622: a size_t value went below zero (unsigned underflow).
|
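To illustrate the class of bug being described: a C++ size_t is unsigned, so decrementing it below zero does not produce a negative number but wraps around to a huge value. A minimal sketch of the same modular arithmetic in Python (the mask assumes a 64-bit size_t; the counter name is hypothetical, not Redpanda's actual code):

```python
# Simulate C++ size_t (unsigned 64-bit) arithmetic in Python.
SIZE_T_MASK = 2**64 - 1  # assumes a 64-bit platform

def size_t_sub(a: int, b: int) -> int:
    """Subtract like a C++ size_t: wraps modulo 2**64 instead of going negative."""
    return (a - b) & SIZE_T_MASK

# A hypothetical "partitions left to move" counter decremented one time too many:
remaining = size_t_sub(0, 1)
print(remaining)  # 18446744073709551615, not -1
```

A wrapped counter like this can make "work remaining" look enormous, which would explain movement appearing to never finish.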
@vshtokman: Looks like an RP bug we should fix... |
Yes, it is an RP bug |
|
Possible partition movement bug?
This is on a PR run, but certainly not related to the change in the PR.
https://buildkite.com/redpanda/redpanda/builds/9258#98222f10-573d-4414-b635-c62451a3428b
The test is timing out waiting for partitions to move to the third node added in the test.
In the controller logs I can see that the rebalance on node add is happening, but apparently partition movement is taking longer than
the test allows. This is surprising, because the test is giving 30 seconds for a few single-replica partitions to move.
The last check before timeout is:
The controller leader doesn't get around to kicking off any reallocations for the node_id=3 add until just a few seconds before the timeout expires
Up until that time, it is still outputting "calculated reallocations" lines relating to the node 2 addition. So either one of the partition moves to node 2 is going far too slowly, or something is wrong in the controller housekeeping that prevents it from realizing in time that the moves are complete.
The test also has an issue whereby the wait for node 2's moves to complete returns too early: it only checks that at least one partition has moved to node 2 before proceeding to start node 3. So node 3's 30-second wait period effectively has to cover all of the node 2 movement plus the node 3 movement.
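A sketch of how that wait could be tightened. `wait_until` here is a minimal stand-in for the test framework's polling helper, and `partitions_on` / `EXPECTED_MOVES` are hypothetical placeholders for however the real test queries cluster metadata:

```python
import time

def wait_until(predicate, timeout_sec: float, backoff_sec: float = 0.5):
    """Minimal stand-in for a ducktape-style wait: poll until predicate() is true."""
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        if predicate():
            return
        time.sleep(backoff_sec)
    raise TimeoutError("condition not met within timeout")

def partitions_on(node_id: int) -> int:
    """Hypothetical stand-in for querying how many partitions a node hosts."""
    return 3

EXPECTED_MOVES = 3  # hypothetical: partitions the rebalance should place on node 2

# Weak check (what the test does today): proceed once *any* partition lands.
wait_until(lambda: partitions_on(2) >= 1, timeout_sec=30)

# Stricter check: wait until the rebalance onto node 2 has fully completed,
# so node 3's 30-second budget covers only node 3's own movement.
wait_until(lambda: partitions_on(2) >= EXPECTED_MOVES, timeout_sec=30)
```

With the stricter predicate, the node 3 timeout measures only node 3's rebalance, which is what the 30-second budget was presumably sized for.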