
CI Failure (partitions_rebalanced times out) in ScalingUpTest.test_adding_nodes_to_cluster #7418

Closed
dlex opened this issue Nov 22, 2022 · 19 comments · Fixed by #7698 or #9947
Assignees
Labels
area/tests ci-failure ci-ignore Automatic ci analysis tools ignore this issue do-not-reopen kind/bug Something isn't working

Comments

@dlex
Contributor

dlex commented Nov 22, 2022

https://buildkite.com/redpanda/redpanda/builds/18876#01849c1f-24b8-447e-956d-2b6f080f625b

Module: rptest.tests.scaling_up_test
Class:  ScalingUpTest
Method: test_adding_nodes_to_cluster
Arguments:
{
  "partition_count": 1
}
test_id:    rptest.tests.scaling_up_test.ScalingUpTest.test_adding_nodes_to_cluster.partition_count=1
status:     FAIL
run time:   1 minute 2.276 seconds


    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/scaling_up_test.py", line 119, in test_adding_nodes_to_cluster
    self.wait_for_partitions_rebalanced(total_replicas=total_replicas,
  File "/root/tests/rptest/tests/scaling_up_test.py", line 70, in wait_for_partitions_rebalanced
    wait_until(partitions_rebalanced,
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

The test is trying to balance 20 partition replicas: 16 of __consumer_offsets and 4 of a regular topic. The success criterion is number_of_replicas_per_node ∊ number_of_replicas / number_of_nodes ± 20%. This translates to an expected range of [5.333333333333334, 8.0] (which is actually an exclusive range, despite the square brackets in the log).

The distribution of replicas per node the cluster settles with is

replicas per node: {1: 6, 2: 6, 3: 8}

This is entirely expected since #5460: there are two topics being balanced in two distinct domains. Domain 0 settles with the distribution [1, 1, 2], and domain -1 also allocates its remainder of replicas on node 3.

The test needs to be adjusted either to check balancing of partitions that belong to the same domain, or to allow for corner cases like this.
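The failing check can be sketched as follows. This is a hypothetical reimplementation of the criterion described above, not the actual ducktape test code; it shows why the settled distribution fails when the bounds behave as exclusive.

```python
# Sketch of the rebalance criterion described above (hypothetical
# reimplementation, not the actual test code).
total_replicas = 20   # 16 __consumer_offsets replicas + 4 from a regular topic
node_count = 3

expected = total_replicas / node_count        # ~6.667
lo, hi = expected * 0.8, expected * 1.2       # ~5.333 .. 8.0

def partitions_rebalanced(replicas_per_node, lo, hi):
    # The logged "[5.33, 8.0]" range is effectively exclusive.
    return all(lo < n < hi for n in replicas_per_node.values())

# The distribution the cluster settled with: node 3 holds exactly 8
# replicas, which fails the exclusive upper bound of 8.0.
print(partitions_rebalanced({1: 6, 2: 6, 3: 8}, lo, hi))  # False
```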

@dlex dlex added kind/bug Something isn't working area/tests ci-failure labels Nov 22, 2022
@dlex dlex self-assigned this Nov 22, 2022
@jcsp
Contributor

jcsp commented Nov 22, 2022

@dlex
Contributor Author

dlex commented Nov 23, 2022

@mmaslankaprv I think f0f683b is unrelated to this issue; this one is not about timeouts: the cluster reaches a stable replica balance in ~5 s of the 30 s timeout, and the replica distribution never improves after that.

@dlex
Contributor Author

dlex commented Nov 23, 2022

@dlex dlex reopened this Nov 23, 2022
@jcsp
Contributor

jcsp commented Nov 23, 2022

Two more today:

FAIL test: ScalingUpTest.test_adding_nodes_to_cluster.partition_count=1 (2/47 runs)
failure at 2022-11-22T12:51:49.347Z: TimeoutError('')
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/18914#01849ed1-f732-4647-9c49-d36c17a01e7a
failure at 2022-11-22T15:29:01.731Z: TimeoutError('')
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/18919#01849f4c-57cb-4516-b76b-37128ac44ad9

@jcsp
Contributor

jcsp commented Dec 1, 2022

andijcr pushed a commit to andijcr/redpanda that referenced this issue Dec 3, 2022
The scaling up test was recently parametrized with partition count. For large partition counts it is not enough to wait 30 seconds for the partitions to be rebalanced, especially on slow debug builds.

Fixes: redpanda-data#7418

Signed-off-by: Michal Maslanka <michal@redpanda.com>
@rystsov
Contributor

rystsov commented Dec 11, 2022

mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Dec 12, 2022
Changed the condition that validates whether partition replicas are rebalanced to include the range boundaries.

Fixes: redpanda-data#7418

Signed-off-by: Michal Maslanka <michal@redpanda.com>
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Dec 13, 2022
(Same commit message as above; cherry picked from commit ab6d23d.)
dlex added a commit to dlex/redpanda that referenced this issue Dec 14, 2022
The assessment of whether partitions are balanced was done by comparing
the number of partitions on each node with the average target ±20%. This
approach broke with the introduction of partition balancing domains.
This commit changes the criterion to make sure that the number of
partitions across nodes is levelled within the scope of each domain
separately. Levelled means that the min and max numbers of replicas
differ by at most 1.
Re redpanda-data#7418
dlex added a commit to dlex/redpanda that referenced this issue Dec 14, 2022
(Same commit message as above.)
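The per-domain "levelled" criterion from the commit message above can be sketched like this (hypothetical code, not the test's actual helper):

```python
# Hedged sketch of the per-domain "levelled" criterion: within each
# allocation domain, the min and max replica counts per node may differ
# by at most 1. (Hypothetical reimplementation, not the real test code.)
def domain_levelled(replicas_per_node):
    counts = replicas_per_node.values()
    return max(counts) - min(counts) <= 1

# Per-domain distribution reported later in this issue:
per_domain = {-1: {1: 5, 2: 5, 3: 6}, 0: {1: 1, 2: 1, 3: 2}}
print(all(domain_levelled(nodes) for nodes in per_domain.values()))  # True
```

Under this criterion the [1, 1, 2] corner case passes, since the counts differ by exactly 1.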
@NyaliaLui NyaliaLui reopened this Mar 17, 2023
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Mar 31, 2023
(Same commit message as above; cherry picked from commit f0f683b.)
@Lazin
Contributor

Lazin commented Apr 6, 2023

dlex added a commit to dlex/redpanda that referenced this issue Apr 11, 2023
Partitions in each allocation domain are balanced separately since redpanda-data#5460.
This change evaluates whether the partitions are balanced well enough
within each of the allocation domains.

How topics are assigned to allocation domains is currently hardcoded:
__consumer_offsets belong to -1, all the rest belong to 0. If that
becomes more complicated, there should be a better way to determine
allocation domain association than this.

The ±20% range rule is preserved for each domain, but is somewhat relaxed
by rounding the boundary values outwards. This is required to handle small
partition counts, e.g. 3 partitions for 2 nodes would otherwise give range
of [1.2, 1.8] which no integer value will satisfy; the rounding makes
that range [1, 2] instead.

Fixes redpanda-data#7418.
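The outward rounding described in the commit message above can be sketched as follows (hypothetical code, not the actual test implementation):

```python
import math

# Sketch of rounding the ±20% bounds outwards (hypothetical
# reimplementation of the commit's idea, not the real test code).
def expected_range(total_replicas, node_count):
    avg = total_replicas / node_count
    return math.floor(avg * 0.8), math.ceil(avg * 1.2)

# 3 partitions on 2 nodes: the raw range [1.2, 1.8] contains no
# integer; rounding outwards yields [1, 2].
print(expected_range(3, 2))  # (1, 2)
```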
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Apr 11, 2023
(Same commit message as above; cherry picked from commit b0d13af.)
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Apr 12, 2023
(Same commit message as above; cherry picked from commit b0d13af.)
@dlex
Contributor Author

dlex commented Apr 13, 2023

There are corner cases that still produce almost the same failure; these are handled by #10024.

dlex added a commit to dlex/redpanda that referenced this issue Apr 19, 2023
(Same commit message as above; cherry picked from commit b0d13af.)
dlex added a commit to dlex/redpanda that referenced this issue Apr 19, 2023
(Same commit message as above; cherry picked from commit b0d13af.)
ballard26 pushed a commit to ballard26/redpanda that referenced this issue May 9, 2023
(Same commit message as above.)
@dlex
Contributor Author

dlex commented May 25, 2023

https://buildkite.com/redpanda/redpanda/builds/29872#018851b3-5ba0-4acf-ac99-63298506a279

This is the [1, 1, 2] distribution case:

replicas per domain per node: {-1: {1: 5, 2: 5, 3: 6}, 0: {1: 1, 2: 1, 3: 2}}

Moving over to #10024.

@rystsov rystsov reopened this Sep 3, 2023
@rystsov
Contributor

rystsov commented Sep 3, 2023

Please do not reopen this issue; if needed, create a new issue and link this one as a reference.

@rystsov rystsov closed this as completed Sep 3, 2023
@rystsov rystsov added do-not-reopen ci-ignore Automatic ci analysis tools ignore this issue labels Sep 3, 2023