Partition balancer: rack awareness constraint repair #6845
Conversation
Force-pushed from 00cc743 to 3b20cf1.
This is a class that stores the state needed for the partition balancer to function. This commit also wires it up to topic_updates_dispatcher and adds code maintaining a set of ntps that violate the rack awareness constraint.
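For illustration, the violation-tracking part of such a state class might look like the sketch below. The types and names here (replica, handle_ntp_update) are simplified stand-ins, not redpanda's actual model types:

#include <set>
#include <string>
#include <vector>

// Simplified stand-in for a partition replica placed on a node in a rack.
struct replica {
    int node_id;
    std::string rack;
};

class partition_balancer_state {
public:
    // Called from the topic updates dispatcher whenever a partition's
    // replica set changes.
    void handle_ntp_update(
      const std::string& ntp, const std::vector<replica>& replicas) {
        if (violates_rack_constraint(replicas)) {
            _ntps_with_broken_rack_constraint.insert(ntp);
        } else {
            _ntps_with_broken_rack_constraint.erase(ntp);
        }
    }

    const std::set<std::string>& ntps_with_broken_rack_constraint() const {
        return _ntps_with_broken_rack_constraint;
    }

private:
    // Per the cover letter, the constraint is violated when more than one
    // replica of the partition lands in the same rack.
    static bool violates_rack_constraint(const std::vector<replica>& rs) {
        std::set<std::string> racks;
        for (const auto& r : rs) {
            if (!racks.insert(r.rack).second) {
                return true;
            }
        }
        return false;
    }

    std::set<std::string> _ntps_with_broken_rack_constraint;
};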
The replication factor is now calculated from the number of replicas of partition 0 anyway, so we don't need the metadata object if we have the set of replicas.
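For illustration, deriving the replication factor that way might look like this sketch (stand-in types, not the actual topic table code):

#include <cstddef>
#include <vector>

// Stand-in for a partition's assignment: the nodes hosting its replicas.
struct partition_assignment {
    std::vector<int> replica_nodes;
};

// Every partition of a topic has the same replication factor, so the
// replica set of partition 0 is representative of the whole topic.
std::size_t replication_factor(
  const std::vector<partition_assignment>& assignments) {
    return assignments.empty() ? 0 : assignments[0].replica_nodes.size();
}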
Force-pushed from 3b20cf1 to f8e3e41.
Looks great! I have some minor questions.
/// the ntp is replicated to, we try to schedule a move. For each rack we
/// arbitrarily choose the first appearing replica to remain there (note: this
/// is probably not optimal choice).
void partition_balancer_planner::get_rack_constraint_repair_reassignments(
Maybe we should add rack_constraint violations into partition_balancer_violations?
I thought about it but decided against it - a full list of partitions in the violations doesn't make sense (could be thousands), but for the number of violations it makes more sense to have it as a metric - easier to observe and alert on.
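Such a gauge could plausibly be exposed through seastar's metrics API, roughly as in the sketch below. The group name, metric name, and setup_metrics hook are assumptions (the metric itself is still a TODO in this PR), and partition_balancer_state is as sketched earlier:

#include <seastar/core/metrics.hh>
#include <seastar/core/metrics_registration.hh>

namespace sm = seastar::metrics;

class partition_balancer_backend {
public:
    explicit partition_balancer_backend(partition_balancer_state& state)
      : _state(state) {}

    // Sketch only: metric and group names are assumptions, not code
    // from this PR.
    void setup_metrics() {
        _metrics.add_group(
          "cluster:partition_balancer",
          {sm::make_gauge(
            "rack_constraint_violations",
            [this] {
                return _state.ntps_with_broken_rack_constraint().size();
            },
            sm::description(
              "Number of ntps violating the rack awareness constraint"))});
    }

private:
    partition_balancer_state& _state;
    sm::metric_groups _metrics;
};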
ns.make_unavailable(node)
self.wait_until_ready(expected_unavailable_node=node)

self.redpanda.start_node(self.redpanda.nodes[4])
Why not just fix the failed node?
Because it is harder for the balancer :) (the movements introduced by the node add interfere a bit).
/// is probably not optimal choice).
void partition_balancer_planner::get_rack_constraint_repair_reassignments(
  plan_data& result, reallocation_request_state& rrs) {
    if (_state.ntps_with_broken_rack_constraint().empty()) {
nit: this condition is already checked by the caller
I don't think this is true - we can end up here, e.g., if there are some unavailable nodes but no violating ntps. Although it might make sense to make it so! It would be easier to read.
Reshuffled the main function a bit; not sure if this is much cleaner, but the idea is that a planner pass should decide for itself whether it needs to run, while we also avoid loading ntp sizes in the happy case (this is why we need an early exit when there are no violations).
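The resulting shape of the pass is roughly the sketch below; load_ntp_sizes is a hypothetical helper standing in for however the planner actually loads sizes:

void partition_balancer_planner::get_rack_constraint_repair_reassignments(
  plan_data& result, reallocation_request_state& rrs) {
    // Happy case: no recorded violations, so exit before the
    // comparatively expensive ntp size loading below.
    if (_state.ntps_with_broken_rack_constraint().empty()) {
        return;
    }
    load_ntp_sizes(rrs); // hypothetical helper
    // ... iterate over the violating ntps and schedule repairing moves ...
}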
@@ -118,6 +118,7 @@ v_cc_library(
    remote_topic_configuration_source.cc
    partition_balancer_planner.cc
    partition_balancer_backend.cc
    partition_balancer_state.cc
This seems like a nice cleanup: consolidating state.
Yeah, that was the idea. Although there is not much consolidation right now, we can use this class to store some balancing-specific indexes (e.g. a node -> ntp map). That will be helpful when we eventually need to get rid of those "iterate over all ntps" loops.
get rid of those "iterate over all ntps" loops.
😍
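Such an index might eventually look something like the sketch below (purely illustrative; no such structure exists in this PR, and the key/value types are simplified):

#include <set>
#include <string>
#include <unordered_map>

// Stand-in index from node id to the ntps with a replica on that node,
// letting the balancer avoid full "iterate over all ntps" scans.
class node_ntp_index {
public:
    void add_replica(int node_id, const std::string& ntp) {
        _node_to_ntps[node_id].insert(ntp);
    }

    void remove_replica(int node_id, const std::string& ntp) {
        auto it = _node_to_ntps.find(node_id);
        if (it != _node_to_ntps.end()) {
            it->second.erase(ntp);
        }
    }

    // All ntps with a replica on the given node, without a full scan.
    const std::set<std::string>& ntps_on_node(int node_id) const {
        static const std::set<std::string> empty;
        auto it = _node_to_ntps.find(node_id);
        return it == _node_to_ntps.end() ? empty : it->second;
    }

private:
    std::unordered_map<int, std::set<std::string>> _node_to_ntps;
};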
1) Don't log the happy path of submitting reassignments/cancellations. 2) Move planned reassignment/cancellation logging to the planner and add more context: previous replicas and the reason.
Force-pushed from 67f0d60 to 04e041e.
lgtm
lgtm
Unrelated test failure: #6991, restarted.
Cover letter
Add rack awareness repair to the partition balancer.
Add the partition_balancer_state class that captures the controller state needed for the balancer. This class maintains a set of ntps that violate the rack awareness constraint (i.e. have more than one replica in a rack). In the balancer we go over this set and (if there are suitable racks) try to schedule repairing moves.

Fixes #6355
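The repair pass can be illustrated with the sketch below (simplified stand-in types, not the actual planner implementation): for each rack, keep the first-appearing replica and propose moving the excess to racks that currently host none.

#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct replica {
    int node_id;
    std::string rack;
};

struct move_proposal {
    int from_node;
    std::string to_rack; // the allocator would pick a concrete node here
};

std::vector<move_proposal> plan_rack_repair(
  const std::vector<replica>& replicas,
  const std::set<std::string>& all_racks) {
    std::set<std::string> used_racks;
    // Replicas beyond the first-appearing one in their rack. As the doc
    // comment in the planner notes, keeping the first one is arbitrary
    // and probably not the optimal choice.
    std::vector<int> excess;
    for (const auto& r : replicas) {
        if (!used_racks.insert(r.rack).second) {
            excess.push_back(r.node_id);
        }
    }
    // Racks with no replica of this partition are candidate targets.
    std::vector<std::string> free_racks;
    for (const auto& rack : all_racks) {
        if (used_racks.count(rack) == 0) {
            free_racks.push_back(rack);
        }
    }
    std::vector<move_proposal> moves;
    for (std::size_t i = 0; i < excess.size() && i < free_racks.size();
         ++i) {
        moves.push_back({excess[i], free_racks[i]});
    }
    return moves;
}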
TODO: add a "number of ntps with violated constraint" metric
Backport Required
UX changes
Rack awareness constraint repair is added to partition balancing in the continuous mode. For a given partition, the balancer will try to move excess replicas from racks that have more than one replica to racks where there are none.

Release notes
Features

Rack awareness constraint repair in the continuous partition balancing mode.