Balance group coordinator nodes more evenly #5460

Closed
travisdowns opened this issue Jul 13, 2022 · 12 comments · Fixed by #6251
Labels
area/controller DW good first issue Good for newcomers kind/enhance New feature or request

Comments

@travisdowns
Member

travisdowns commented Jul 13, 2022

Who is this for and what problem do they have today?

This is for users who make use of more than one consumer group. The problem is that even with many nodes and few groups, the group coordinators may be chosen unevenly: e.g., with 8 groups and 32 nodes, it is possible (and common enough in practice) that one node is the coordinator for 2 or 3 groups. Ideally, no node would coordinate more than one group.

What are the success criteria?

In "unusual operating conditions" group coordinators should be "evenly spread" across nodes, in the same way that partitions and leaders are evenly spread.

Why is solving this problem impactful?

The coordinator shard requires additional resources to complete its coordination tasks: it needs to receive offset commits and heartbeats from all consumers in the group. A system may be sized such that the load on any coordinator stays low enough to avoid saturation as long as a shard coordinates only one group (or, more generally, its proportional "share" of coordination jobs + 1), but if it ends up coordinating more groups, responsiveness will suffer. If the balancer can ensure the groups are evenly spread, this should not occur.

Additional notes

At a high level, the coordinator for a given group is the shard which is the leader for the partition associated with the group in the __consumer_offsets topic.

So there are two steps to the distribution: the group name -> partition mapping (to determine the "partition associated with the group in the __consumer_offsets topic" part) and the partition -> node/shard mapping part.

The first is a pure function of the group name: it takes a group name and the number of partitions in the topic as input and produces a partition number based on a consistent hash function (see coordinator_ntp_mapper::ntp_for). If this distribution has collisions, there will be coordinator collisions, period: we can't fix this in the second step, so we rely on the distribution of the hash function and a bit of luck. The good news is that since this is a deterministic function of its inputs, the end user can choose their group names strategically to avoid collisions in this part.
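A minimal sketch of this first step, assuming a generic standard-library hash in place of the real coordinator_ntp_mapper::ntp_for implementation (names and hash choice are illustrative only):

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical sketch of step one: hash the group name and take it modulo
// the number of __consumer_offsets partitions. The real code uses its own
// hash function; only the shape of the mapping is shown here.
int32_t partition_for_group(const std::string& group, int32_t partition_count) {
    const std::size_t h = std::hash<std::string>{}(group);
    return static_cast<int32_t>(h % static_cast<std::size_t>(partition_count));
}
```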

The second step to determine the coordinator is then finding which shard is the leader for the partition determined in the first step. This uses the "usual" partition allocation & rebalancing (if any) just like for any other topic. So the allocation depends on other factors such as which nodes are part of the cluster when the group is created, and decisions made by the leader and partition balancers.

The second step is what we want to change here. The main problem is that although __consumer_offsets leadership (if the leader balancer is enabled, which it is by default) and partition allocation and rebalancing (if the rebalancer is enabled, which it is not by default) do consider the partitions in __consumer_offsets, those partitions are lumped together with partitions from all the user-created topics. If there is an initial poor allocation of partitions, then balance may be achieved by moving user-created partitions instead of any partitions from __consumer_offsets.

Similarly, if some nodes have fewer partitions than others (e.g., because some nodes joined more recently) when the __consumer_offsets group is originally created, the partition allocation may be highly imbalanced since most or all partitions may be assigned to the underrepresented nodes.

Similar reasoning applies to leadership balancing (note that leadership balancing is on by default, but partition balancing is not) and perhaps to initial leadership allocation (though this is less clear, as I am not sure whether initial leadership allocation is a directed process).

To solve these problems, one possibility is to consider this topic a separate "domain" for balancing: we may balance in the same way, but only among the partitions in the __consumer_offsets topic, and similarly for leadership balancing decisions. Other possibilities exist.
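A rough sketch of what a separate balancing "domain" could look like, assuming a hypothetical tag that the allocator and balancers consult so that __consumer_offsets partitions are only ever compared against each other (the enum and function are illustrative, not the actual API):

```cpp
#include <string_view>

// Hypothetical allocation-domain tag: balancing decisions compare a
// partition only against other partitions in the same domain.
enum class allocation_domain { common, consumer_offsets };

allocation_domain domain_for_topic(std::string_view topic) {
    // __consumer_offsets gets its own domain; every other topic shares
    // the common one.
    return topic == "__consumer_offsets" ? allocation_domain::consumer_offsets
                                         : allocation_domain::common;
}
```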

@jcsp
Contributor

jcsp commented Aug 11, 2022

If this distribution has collisions, there will be coordinator collisions, period: we can't fix this in the second part, so we rely on the distribution of the hash function and a bit of luck.

It may be worth us doing something more dramatic and doing explicit placement (i.e. store a map of consumer group name to partition index) for these. Do we know if the client relies on the server doing hashed placement at all, or is it something we can own+change on the server side?

The principle here would be that consistent hashing is good for placing lots of small things that will average out, but not good for placing smaller numbers of larger things (like consumer groups with lots of members).
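To make the explicit-placement idea concrete, here is a sketch under the assumption of a persisted group-to-partition map with hashed placement as a fallback (the struct and its fields are hypothetical, not an existing API):

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical explicit placement: a persisted map from group name to
// partition index; groups without an entry fall back to hashed placement.
struct group_placement {
    std::unordered_map<std::string, int32_t> assigned;

    int32_t partition_for(const std::string& group, int32_t partition_count) const {
        if (auto it = assigned.find(group); it != assigned.end()) {
            return it->second;
        }
        // Fallback: hashed placement, as today.
        return static_cast<int32_t>(
          std::hash<std::string>{}(group) % static_cast<std::size_t>(partition_count));
    }
};
```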

one possibility is to consider this topic a separate "domain" for balancing: we may balance in the same way, but only among the partitions in the __consumer_offsets topic, and similarly for leadership balancing decisions.

I think we should do something along these lines.

An extreme would be to also bias consumer offset partitions to shard 0, and dial down the amount of user work we put on shard 0 (topic_partitions_reserve_shard0 is the new hammer in 22.2 for pushing user workloads off shard 0 to make way for any more expensive housekeeping).

@dlex
Contributor

dlex commented Aug 11, 2022

An extreme would be to also bias consumer offset partitions to shard 0, and dial down the amount of user work we put on shard 0

I'm not so sure about the positive impact of this, unless we push away more user work from shard 0 than what we use for consumer offsets, thus making shard 0 less loaded than the other ones.

However, there is a downside to this approach: consumer offset processing will only scale across nodes but not across shards. So if a cluster consists of 40-core nodes, the overall consumer group performance will be limited to 1/40th of the overall performance capacity of the cluster.

@dlex
Contributor

dlex commented Aug 11, 2022

It may be worth us doing something more dramatic and doing explicit placement (i.e. store a map of consumer group name to partition index) for these. Do we know if the client relies on the server doing hashed placement at all, or is it something we can own+change on the server side?

AFAIK, clients can't do that with the Kafka API and have to rely on the server. And yes, I agree with your idea of mapping groups to __consumer_offsets partitions explicitly; that would also solve the issue of changing the number of __consumer_offsets partitions without losing the offsets.

@jcsp
Contributor

jcsp commented Aug 11, 2022

I'm not so sure about the positive impact of this, unless we push away more user work from shard 0 than what we use for consumer offsets, thus making shard 0 less loaded than the other ones.
However, there is a downside to this approach: consumer offset processing will only scale across nodes but not across shards. So if a cluster consists of 40-core nodes, the overall consumer group performance will be limited to 1/40th of the overall performance capacity of the cluster.

Yep, it's definitely not the general-case solution. It would come up on systems where there are very many user data partitions, the users are principally writing a lot of data through those, only a small fraction of their writes update consumer groups, and the number of consumer groups is small (e.g., on the order of the number of nodes). This might line up quite well with some of the "100K partition" use cases where they run a few mega-consumer-groups.

The other extreme is where users are doing about the same number of consumer group writes as they do data writes (produce a message, consume it, sync the offset), and may have a large number of independent consumer groups, in which case our current model of treating consumer group partitions just like data partitions makes sense.

@dlex
Contributor

dlex commented Aug 12, 2022

@jcsp I see what you mean, and there are definitely real-world scenarios that may benefit from that approach. However, it seems to me like there are too many strings attached to follow it right now. For example, should __consumer_offsets partitions become a special kind of partition in the sense that allocation_node::allocate() won't be called for them anymore? If yes, then how do we balance between the general housekeeping reserve on shard 0 and the consumer groups workload? If not, then how do we deal with the situation where the topic_partitions_reserve_shard0 value approaches the topic_partitions_per_shard value? So I suppose I will leave it as a possible idea for future tuning.

@dlex
Contributor

dlex commented Aug 12, 2022

Does even distribution only make sense across nodes, or should we consider shards within nodes as well?

John has mentioned an idea where __consumer_offsets partitions are allocated to shard 0, which is at the same time deprived of user data partitions. I'm thinking about the opposite scenario: will it be beneficial if __consumer_offsets partitions are evenly allocated across shards within each broker in addition to even allocation just across brokers?

@travisdowns, @jcsp, what do you think?

@jcsp
Contributor

jcsp commented Aug 15, 2022

will it be beneficial if __consumer_offsets partitions are evenly allocated across shards within each broker in addition to even allocation just across brokers?

I think yes: this helps us if CPU is tight. The scenario we're seeing on cloud instances is that disk tends to be the bottleneck, so shard doesn't matter, but on non-cloud hardware we will likely see faster drives and therefore more interest in getting better CPU balance.

@travisdowns
Member Author

It may be worth us doing something more dramatic and doing explicit placement (i.e. store a map of consumer group name to partition index) for these. Do we know if the client relies on the server doing hashed placement at all, or is it something we can own+change on the server side?

It's the server only: the client queries the server for the coordinator of the group and the server replies, so, for example, we don't need to use the same hashing function as Kafka. Changing the partition location should be OK, but changing the group name -> partition mapping (what you were asking about) is more difficult, since you would need to migrate the existing offsets from the old partition to the new one, or else the offsets would be lost at upgrade time. This also makes increasing the number of partitions for this topic problematic: you have to get it right when you form your cluster.

@travisdowns
Member Author

is the new hammer in 22.2 for pushing user workloads off shard 0 to make way for any more expensive housekeeping

We definitely needed a new hammer here, since just treating shard 0's housekeeping as equivalent to 2 partitions didn't make much sense given that partition counts can vary wildly: 2 partitions round to zero if there are 10,000 data partitions to spread around.

@travisdowns
Member Author

travisdowns commented Aug 17, 2022

Does even distribution only make sense across nodes, or should we consider shards within nodes as well?

John has mentioned an idea where __consumer_offsets partitions are allocated to shard 0, which is at the same time deprived of user data partitions. I'm thinking about the opposite scenario: will it be beneficial if __consumer_offsets partitions are evenly allocated across shards within each broker in addition to even allocation just across brokers?

Shard is just as important as node (maybe more so): we definitely need even balancing between shards. Problems arise principally if a given shard has an uneven allocation of consumer groups: that shard can be overloaded CPU-wise. Having an uneven balance node-wise is probably less important as long as groups go to distinct shards on that node (though this is not true in the limit, since there are shared resources like network and disk).

Of course, "node-wise" balance is good enough if the number of groups < number of nodes, since in that case node-wise balance guarantees shard-wise balance too, but groups > number of nodes is going to be a common case so this isn't too useful.

@travisdowns
Member Author

travisdowns commented Aug 17, 2022

@jcsp wrote:

The principle here would be that consistent hashing is good for placing lots of small things that will average out, but not good for placing smaller numbers of larger things (like consumer groups with lots of members).

Yes, exactly. This is true for plain "hashing" as well, not only "consistent hashing", though consistent hashing is worse in the sense that you can't as easily resolve a collision by changing the number of slots.

A typical case is that the number of groups is less than the number of shards, so you would like each shard to have exactly 0 or 1 consumer groups: having 2 is, loosely speaking, a "100% over-allocation", but the birthday paradox makes a collision fairly likely even with modest numbers of groups (and I think it's exacerbated by our two-stage mapping process: that's two "chances" for a collision to happen, and once a collision happens there is a guaranteed collision at the end, though I did not run the numbers on this).
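To put a rough number on the birthday-paradox point, a quick back-of-the-envelope calculation assuming a single uniform hashing stage (so it understates the two-stage effect mentioned above):

```cpp
#include <cstdio>

// Probability that at least two of k groups hash to the same one of n slots,
// assuming a uniform hash: 1 - prod_{i=0}^{k-1} (n - i) / n.
double collision_probability(int k, int n) {
    double p_none = 1.0;
    for (int i = 0; i < k; ++i) {
        p_none *= static_cast<double>(n - i) / n;
    }
    return 1.0 - p_none;
}

int main() {
    // e.g. 8 groups over 32 slots already collide roughly 61% of the time.
    std::printf("%.2f\n", collision_probability(8, 32));
    return 0;
}
```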

@dlex
Contributor

dlex commented Aug 29, 2022

Implementation update. PR #6251 has been created to solve the partition distribution part of the problem. The other part, even partition leader balancing, remains out of scope for now. This PR is a precondition for the other part, and it also solves the problem on its own for a subset of cluster configurations, namely when the following condition is true:

consumer_offsets_partition_no × internal_topic_replication_factor ≤ brokers_no

This condition makes it impossible for more than one __consumer_offsets partition leader to land on the same broker, even without the even leader distribution feature implemented.
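As a sanity check, the condition can be read as "every replica of __consumer_offsets can sit on its own broker"; a trivial predicate, with variable names taken from the comment above rather than from any actual configuration:

```cpp
#include <cstdint>

// True when the __consumer_offsets replicas can all be placed on distinct
// brokers, so no broker can end up leading more than one of its partitions.
bool leaders_cannot_collide(
  int32_t consumer_offsets_partition_no,
  int32_t internal_topic_replication_factor,
  int32_t brokers_no) {
    return consumer_offsets_partition_no * internal_topic_replication_factor
           <= brokers_no;
}
```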

dlex added a commit to dlex/redpanda that referenced this issue Apr 11, 2023
Partitions in each allocation domain are balanced separately since redpanda-data#5460.
This change evaluates whether the partitions are balanced well enough
within each of the allocation domains.

How topics are assigned to allocation domains is currently hardcoded:
__consumer_offsets belong to -1, all the rest belong to 0. If that
becomes more complicated, there should be a better way to determine
allocation domain association than this.

The ±20% range rule is preserved for each domain, but is somewhat relaxed
by rounding the boundary values outwards. This is required to handle small
partition counts, e.g. 3 partitions for 2 nodes would otherwise give range
of [1.2, 1.8] which no integer value will satisfy; the rounding makes
that range [1, 2] instead.

Fixes redpanda-data#7418.
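A sketch of the outward-rounded ±20% check described in the commit message above (function and parameter names are illustrative, not the actual implementation):

```cpp
#include <cmath>

// A node's replica count is considered balanced if it lies within ±20% of
// the per-node average, with the bounds rounded outwards so that small
// counts remain satisfiable (e.g. 3 partitions over 2 nodes: [1.2, 1.8]
// becomes [1, 2]).
bool within_relaxed_range(int count, int total_partitions, int node_count) {
    const double avg = static_cast<double>(total_partitions) / node_count;
    const int lo = static_cast<int>(std::floor(avg * 0.8));
    const int hi = static_cast<int>(std::ceil(avg * 1.2));
    return count >= lo && count <= hi;
}
```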