CI Failure (partitions_rebalanced times out) in ScalingUpTest.test_adding_nodes_to_cluster #7418
@mmaslankaprv I think f0f683b is unrelated to this issue; this one is not about timeouts: the cluster reaches a stable replica balance in ~5 s of the 30 s timeout, and the replica distribution never improves after that.
Two more today: FAIL test: ScalingUpTest.test_adding_nodes_to_cluster.partition_count=1 (2/47 runs)
This is also still failing on 22.3.x
The scaling-up test was recently parametrized with a partition count. For large partition counts it is not enough to wait 30 seconds for the partitions to be rebalanced, especially on slow debug builds. Fixes: redpanda-data#7418 Signed-off-by: Michal Maslanka <michal@redpanda.com>
Changed the condition validating if partition replicas are rebalanced to include the range boundaries. Fixes: redpanda-data#7418 Signed-off-by: Michal Maslanka <michal@redpanda.com>
Changed the condition validating if partition replicas are rebalanced to include the range boundaries. Fixes: redpanda-data#7418 Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit ab6d23d)
The assessment of whether partitions are balanced was done by comparing the number of partitions on each node with the average target ±20%. This approach broke with the introduction of partition balancing domains. This commit changes the criterion to ensure that the number of partitions across nodes is levelled within the scope of each domain separately. Levelled means that the min and max number of replicas differ by 1 at most. Re redpanda-data#7418
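The per-domain "levelled" criterion described above can be sketched as a small check (a hypothetical helper for illustration, not the actual Redpanda test code; node IDs and counts are made up):

```python
def is_levelled(replicas_per_node_by_domain):
    """replicas_per_node_by_domain: {domain: {node_id: replica_count}}

    A domain counts as levelled when the max and min per-node replica
    counts differ by at most 1; all domains must be levelled.
    """
    for counts in replicas_per_node_by_domain.values():
        values = counts.values()
        if max(values) - min(values) > 1:
            return False
    return True

# Domain 0 distributed as [1, 1, 2] is levelled (max - min == 1).
print(is_levelled({0: {1: 1, 2: 1, 3: 2}}))  # True

# Adding a domain -1 skewed as [8, 4, 4] fails the check (max - min == 4).
print(is_levelled({0: {1: 1, 2: 1, 3: 2},
                   -1: {1: 8, 2: 4, 3: 4}}))  # False
```

Note that checking each domain separately is stricter than a single cluster-wide average: a node can hold extra replicas overall yet still be levelled within every domain.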
The scaling-up test was recently parametrized with a partition count. For large partition counts it is not enough to wait 30 seconds for the partitions to be rebalanced, especially on slow debug builds. Fixes: redpanda-data#7418 Signed-off-by: Michal Maslanka <michal@redpanda.com> (cherry picked from commit f0f683b)
Partitions in each allocation domain are balanced separately since redpanda-data#5460. This change evaluates whether the partitions are balanced well enough within each of the allocation domains. How topics are assigned to allocation domains is currently hardcoded: __consumer_offsets belongs to -1, all the rest belong to 0. If that becomes more complicated, there should be a better way to determine allocation domain association than this. The ±20% range rule is preserved for each domain, but is somewhat relaxed by rounding the boundary values outwards. This is required to handle small partition counts, e.g. 3 partitions for 2 nodes would otherwise give a range of [1.2, 1.8], which no integer value will satisfy; the rounding makes that range [1, 2] instead. Fixes redpanda-data#7418.
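The outward rounding of the ±20% band can be sketched as follows (a minimal illustration of the rule described in the commit message, not the actual test code; the function name and tolerance parameter are assumptions):

```python
import math

def replica_count_range(total_replicas, num_nodes, tolerance=0.2):
    """±tolerance band around the per-node average, with the lower
    bound rounded down and the upper bound rounded up so that small
    partition counts still admit at least one integer value."""
    avg = total_replicas / num_nodes
    return (math.floor(avg * (1 - tolerance)),
            math.ceil(avg * (1 + tolerance)))

# 3 partitions on 2 nodes: the raw band is [1.2, 1.8], which no
# integer satisfies; rounding outwards gives [1, 2].
print(replica_count_range(3, 2))  # (1, 2)
```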
Partitions in each allocation domain are balanced separately since redpanda-data#5460. This change evaluates whether the partitions are balanced well enough within each of the allocation domains. How topics are assigned to allocation domains is currently hardcoded: __consumer_offsets belongs to -1, all the rest belong to 0. If that becomes more complicated, there should be a better way to determine allocation domain association than this. The ±20% range rule is preserved for each domain, but is somewhat relaxed by rounding the boundary values outwards. This is required to handle small partition counts, e.g. 3 partitions for 2 nodes would otherwise give a range of [1.2, 1.8], which no integer value will satisfy; the rounding makes that range [1, 2] instead. Fixes redpanda-data#7418. (cherry picked from commit b0d13af)
There are corner cases that still give almost the same failure, handled by #10024
This is the case of
moving over to #10024
Please do not reopen this issue; if you feel the need, create a new issue and link this one as a reference.
https://buildkite.com/redpanda/redpanda/builds/18876#01849c1f-24b8-447e-956d-2b6f080f625b
The test is trying to balance 20 partition replicas, 16 of them of __consumer_offsets and 4 of a regular topic. The success criterion is number_of_replicas_per_node ∊ number_of_replicas / number_of_nodes ± 20% (here). This translates to expected range: [5.333333333333334,8.0] (which is actually an exclusive range despite the square brackets logged). The distribution of replicas per node the cluster settles with is
This is totally normal since #5460, because there are two different topics in two distinct domains being balanced: domain 0 settles with the distribution [1,1,2] and domain -1 also allocates the remainder of replicas on node 3. The test needs to be adjusted either to test balancing of partitions that belong to the same domain, or to allow for corner cases like this.
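The arithmetic in this comment can be reconstructed as a sketch. The 3-node cluster size and the domain -1 distribution [5,5,6] (16 replicas levelled over 3 nodes, remainder on node 3) are inferred from the comment, since the actual logged per-node distribution is elided:

```python
# Cluster-wide ±20% band the test checks (exclusive on the upper end).
total_replicas, num_nodes = 20, 3
avg = total_replicas / num_nodes
lo, hi = avg * 0.8, avg * 1.2
print(lo, hi)  # roughly 5.33 and 8.0

# Per-domain balancing can still push one node onto that exclusive
# upper bound: domain 0 levelled as [1, 1, 2] plus domain -1 levelled
# as [5, 5, 6] (remainder on node 3, inferred) sum per node to:
domain_0 = [1, 1, 2]
domain_minus_1 = [5, 5, 6]
combined = [a + b for a, b in zip(domain_0, domain_minus_1)]
print(combined)  # [6, 6, 8] -- node 3 holds 8, outside the exclusive range
```

So every domain is perfectly levelled, yet the cluster-wide check fails, which is exactly the corner case the comment describes.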