
Partition balancer: rack awareness constraint repair #6845

Merged
ztlpn merged 10 commits into redpanda-data:dev on Oct 28, 2022

Conversation

Contributor

@ztlpn ztlpn commented Oct 20, 2022

Cover letter

Add rack awareness repair to the partition balancer.

Add the partition_balancer_state class that captures the controller state needed by the balancer. This class maintains the set of ntps whose rack awareness constraint is violated (i.e. that have more than one replica in the same rack). The balancer goes over this set and, if there are suitable racks, tries to schedule repairing moves.
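
For illustration, here is a minimal self-contained sketch of that idea, using simplified stand-in types rather than the actual Redpanda classes: on every replica-set update, check whether any rack holds more than one replica of the partition and keep the set of violating ntps up to date.

// A minimal, self-contained sketch (simplified stand-in types, not the
// actual Redpanda classes) of maintaining the set of ntps that violate
// the rack awareness constraint.
#include <map>
#include <optional>
#include <set>
#include <string>
#include <vector>

using ntp = std::string;     // stand-in for model::ntp
using node_id = int;         // stand-in for model::node_id
using rack_id = std::string; // stand-in for model::rack_id

struct balancer_state {
    // node -> rack mapping, normally taken from the cluster members table
    std::map<node_id, std::optional<rack_id>> node_racks;
    // ntps that currently have more than one replica in the same rack
    std::set<ntp> ntps_with_broken_rack_constraint;

    // Called whenever the replica set of partition `p` changes.
    void on_replicas_updated(const ntp& p, const std::vector<node_id>& replicas) {
        std::set<rack_id> seen_racks;
        bool violated = false;
        for (node_id n : replicas) {
            auto it = node_racks.find(n);
            if (it == node_racks.end() || !it->second) {
                continue; // a node without a configured rack can't violate
            }
            if (!seen_racks.insert(*it->second).second) {
                violated = true; // second replica landed in the same rack
                break;
            }
        }
        if (violated) {
            ntps_with_broken_rack_constraint.insert(p);
        } else {
            ntps_with_broken_rack_constraint.erase(p);
        }
    }
};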

Fixes #6355

TODO: add a "number of ntps with violated constraint" metric

Backport Required

  • not a bug fix

UX changes

Rack awareness constraint repair is added to partition balancing in the continuous mode. For a given partition, the balancer will try to move excess replicas from racks that have more than one replica to racks that have none.

Release notes

Features

  • Added rack awareness constraint repair in the continuous partition balancing mode.

This is a class that stores the state needed for the partition balancer to function. This commit also wires it up to topic_updates_dispatcher and adds code maintaining the set of ntps whose rack awareness constraint is violated.

The replication factor is now calculated from the number of replicas of partition 0 anyway, so we don't need the metadata object if we have the set of replicas.
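
As a toy illustration of that last point (hypothetical helper, simplified types, not the Redpanda API), the replication factor falls out of the size of partition 0's replica set, with no separate topic metadata needed:

// Illustrative sketch only: derive a topic's replication factor from the
// replica set of its partition 0.
#include <cstddef>
#include <vector>

using node_id = int; // stand-in for model::node_id

std::size_t replication_factor(const std::vector<node_id>& partition0_replicas) {
    // Every partition of a topic is expected to have the same number of
    // replicas, so partition 0's replica count is the replication factor.
    return partition0_replicas.size();
}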
Contributor

@ZeDRoman ZeDRoman left a comment


Looks great!
Have some minor questions

/// the ntp is replicated to, we try to schedule a move. For each rack we
/// arbitrarily choose the first appearing replica to remain there (note: this
/// is probably not optimal choice).
void partition_balancer_planner::get_rack_constraint_repair_reassignments(
Contributor


Maybe we should add rack_constraint violations into partition_balancer_violations?

Contributor Author


I thought about it but decided against it: a full list of partitions in the violations doesn't make sense (there could be thousands of them), but for the number of violations it makes more sense to have it as a metric, which is easier to observe and alert on.
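
For context, this is the kind of metric being discussed, as a hypothetical sketch (not the actual Redpanda metric) using the Seastar metrics API that Redpanda builds on: a gauge exposing the current number of violating ntps.

// Hypothetical sketch of a gauge counting rack-constraint violations,
// built on the Seastar metrics API.
#include <seastar/core/metrics.hh>
#include <seastar/core/metrics_registration.hh>
#include <set>
#include <string>

namespace sm = seastar::metrics;

class rack_constraint_metrics {
public:
    explicit rack_constraint_metrics(const std::set<std::string>& violating_ntps)
      : _violating_ntps(violating_ntps) {
        _metrics.add_group(
          "partition_balancer",
          {sm::make_gauge(
            "rack_constraint_violations",
            [this] { return _violating_ntps.size(); },
            sm::description(
              "Number of partitions violating the rack awareness constraint"))});
    }

private:
    // Reference to the set maintained by the balancer state.
    const std::set<std::string>& _violating_ntps;
    sm::metric_groups _metrics;
};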

ns.make_unavailable(node)
self.wait_until_ready(expected_unavailable_node=node)

self.redpanda.start_node(self.redpanda.nodes[4])
Contributor


Why not just fix the failed node?

Contributor Author


Because it is harder for the balancer :) (the movements introduced by adding a node interfere a bit).

/// is probably not optimal choice).
void partition_balancer_planner::get_rack_constraint_repair_reassignments(
plan_data& result, reallocation_request_state& rrs) {
if (_state.ntps_with_broken_rack_constraint().empty()) {
Member


nit: this condition is already checked by the caller

Contributor Author


I don't think this is true - we can end up here e.g. if there are some unavailable nodes but no violating ntps. Although it might make sense to make it so! It would be easier to read.

Contributor Author


Reshuffled the main function a bit. Not sure if this is much cleaner, but the idea is that a planner pass should decide for itself whether it needs to run, while we also want to avoid loading ntp sizes in the happy case (this is why we need an early exit when there are no violations).
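
A condensed, self-contained sketch of that early-exit shape (stand-in types, not the actual planner code): the rack-repair pass bails out before any size loading when the violation set is empty.

// Simplified sketch of the control flow discussed above.
#include <iostream>
#include <set>
#include <string>

struct balancer_state {
    std::set<std::string> ntps_with_broken_rack_constraint;
};

struct plan_data {
    int scheduled_moves = 0;
};

void rack_constraint_repair_pass(const balancer_state& state, plan_data& result) {
    // Happy case: nothing violates the constraint, so skip the pass entirely
    // and never request partition sizes.
    if (state.ntps_with_broken_rack_constraint.empty()) {
        return;
    }

    // Only now would we load ntp sizes and try to schedule repairing moves.
    for (const auto& ntp : state.ntps_with_broken_rack_constraint) {
        std::cout << "would schedule a repairing move for " << ntp << "\n";
        ++result.scheduled_moves;
    }
}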

@@ -118,6 +118,7 @@ v_cc_library(
remote_topic_configuration_source.cc
partition_balancer_planner.cc
partition_balancer_backend.cc
partition_balancer_state.cc
Member


this seems like a nice cleanup: consolidating state.

Contributor Author

@ztlpn ztlpn Oct 27, 2022


Yeah, that was the idea. Although there is not much consolidation right now, we can use this class to store some balancing-specific indexes (e.g. a node -> ntp map). This will be helpful when we eventually need to get rid of those "iterate over all ntps" loops.
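
As an illustration of the kind of index mentioned here (hypothetical, simplified types): a node -> ntps map kept up to date on replica-set changes would let the balancer look up the partitions hosted on a node directly instead of scanning all ntps.

// Hypothetical sketch of a balancing-specific index: node -> ntps.
#include <map>
#include <set>
#include <string>
#include <vector>

using ntp = std::string; // stand-in for model::ntp
using node_id = int;     // stand-in for model::node_id

class node_ntp_index {
public:
    // Apply a replica set change of partition `p` to the index.
    void update(const ntp& p,
                const std::vector<node_id>& previous_replicas,
                const std::vector<node_id>& current_replicas) {
        for (node_id n : previous_replicas) {
            _node2ntps[n].erase(p);
        }
        for (node_id n : current_replicas) {
            _node2ntps[n].insert(p);
        }
    }

    // Partitions currently hosted on node `n`, without a full ntp scan.
    const std::set<ntp>& ntps_on_node(node_id n) const {
        static const std::set<ntp> empty;
        auto it = _node2ntps.find(n);
        return it == _node2ntps.end() ? empty : it->second;
    }

private:
    std::map<node_id, std::set<ntp>> _node2ntps;
};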

Member


get rid of those "iterate over all ntps" loops.
😍

Member

@mmaslankaprv mmaslankaprv left a comment


lgtm

Contributor

@ZeDRoman ZeDRoman left a comment


lgtm

Contributor Author

ztlpn commented Oct 28, 2022

Unrelated test failure: #6991

restarted

@ztlpn ztlpn merged commit b6721de into redpanda-data:dev Oct 28, 2022
@ztlpn ztlpn deleted the rack-awareness-repair branch November 27, 2023 13:24
Labels
area/redpanda, kind/enhance (New feature or request)
Development

Successfully merging this pull request may close these issues:

  • Continual rack-awareness rebalancing for multi-az deployments