Balance group coordinator nodes more evenly #5460
It may be worth us doing something more dramatic and doing explicit placement (i.e. store a map of consumer group name to partition index) for these. Do we know if the client relies on the server doing hashed placement at all, or is it something we can own+change on the server side? The principle here would be that consistent hashing is good for placing lots of small things that will average out, but not good for placing smaller numbers of larger things (like consumer groups with lots of members).
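The "lots of small things vs. a few large things" point can be illustrated with a toy placement model (not Redpanda code; names and weights are invented): hashed placement averages out over many light items but not over a handful of heavy ones.

```python
import hashlib

def place(name: str, n_slots: int) -> int:
    # Stand-in for consistent-hash placement: stable hash -> slot index.
    return int(hashlib.sha256(name.encode()).hexdigest(), 16) % n_slots

def load_per_slot(items, n_slots):
    """items: list of (name, weight). Returns total weight per slot."""
    loads = [0] * n_slots
    for name, weight in items:
        loads[place(name, n_slots)] += weight
    return loads

# 10,000 equally light data partitions over 32 nodes: loads land close to the mean.
small_loads = load_per_slot([(f"data-{i}", 1) for i in range(10_000)], 32)

# 8 heavy consumer groups over the same 32 nodes: a single hash collision
# doubles one node's coordination load, and with so few items nothing
# averages it away.
group_loads = load_per_slot([(f"group-{i}", 1_000) for i in range(8)], 32)
```

This is the case for explicit placement of the groups: with few, heavy items there is no law-of-large-numbers smoothing to rely on.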
I think we should do something along these lines. An extreme would be to also bias consumer offset partitions to shard 0, and dial down the amount of user work we put on shard 0.
I'm not so sure about the positive impact of this, unless we push more user work away from shard 0 than what we use for consumer offsets, thus making shard 0 less loaded than the other ones. However, there is a downside to this approach: consumer offset processing will only scale across nodes, not across shards. So if a cluster consists of 40-core nodes, the overall consumer group performance will be limited to 1/40th of the overall performance capacity of the cluster.
AFAIK, clients can't do that with the Kafka API and have to rely on the server. And yes, I agree with your idea of mapping groups to
Yep, it's definitely not the general-case solution. It would come up on systems where there are very many user data partitions, they are principally writing a lot of data through those, only a small fraction of their writes update consumer groups, and the number of consumer groups is small (e.g., on the order of the number of nodes). This might line up quite well with some of the "100K partition" use cases where they run a few mega-consumer-groups. The other extreme is where users are doing about the same number of consumer group writes as data writes (produce a message, consume it, sync the offset), and may have a large number of independent consumer groups, in which case our current model of treating consumer group partitions just like data partitions makes sense.
@jcsp I see what you mean, and there are definitely real-world scenarios that may benefit from that approach. However, it seems to me like there are too many strings attached to it to follow it right now. For example, should
Does even distribution only make sense across nodes, or should we consider shards within nodes as well? John has mentioned a related idea above. @travisdowns, @jcsp, what do you think?
I think yes: this helps us if CPU is tight. The scenario we're seeing on cloud instances is that disk tends to be the bottleneck, so shard doesn't matter, but on non-cloud hardware we will likely see faster drives and therefore more interest in getting better CPU balance.
It's the server only: the client queries the server for the coordinator for the group and the server replies, so, e.g., we don't need to use the same hashing function as Kafka. Changing the partition location should be OK, but changing the group name -> partition mapping (what you were asking about) is more difficult, since you would need to migrate the existing offsets from the old partition to the new one, or else the offsets would be lost at upgrade time. This also makes increasing the number of partitions for this topic problematic: you have to get it right when you form your cluster.
We definitely needed a new hammer here, since just using it-is-equivalent-to-2-partitions didn't make much sense given that partition count can vary wildly: 2 partitions rounds to zero if there are 10,000 data partitions to spread around.
Shard is just as (more?) important as node: we definitely need even balancing between shards. Problems arise principally if a given shard has an uneven allocation of consumer groups: this shard can be overloaded CPU-wise. Having an uneven balance node-wise is probably less important as long as groups go to distinct shards on that node (though this is not true in the limit since there are shared resources like network, disk). Of course, "node-wise" balance is good enough if the number of groups < number of nodes, since in that case node-wise balance guarantees shard-wise balance too, but groups > number of nodes is going to be a common case so this isn't too useful.
@jcsp wrote:
Yes, exactly. This is true also just for "hashing", not only "consistent hashing", though consistent hashing is worse in the sense that you can't as easily resolve a collision by changing the number of slots. A typical case is that the number of groups is less than the number of shards, so you would like each shard to have exactly 0 or 1 consumer groups: having 2 is, loosely speaking, a "100% over-allocation", but the birthday paradox makes a collision fairly likely even with modest numbers of groups (and I think it's exacerbated by our two-stage mapping process: that's two "chances" for a collision to happen, and once a collision happens there is a guaranteed collision at the end, though I did not run the numbers on this).
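A quick Monte Carlo sketch of the two-stage concern above (all the counts here are made-up illustration parameters, not actual Redpanda configuration): stage 1 hashes groups to partitions, stage 2 places partitions on shards, and a stage-1 collision can never be undone by stage 2.

```python
import random

def final_collisions(groups: int, partitions: int, shards: int,
                     trials: int, seed: int = 0) -> float:
    """Fraction of trials where two groups end up on the same shard."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Stage 1: each group independently hashed to a partition.
        parts = [rng.randrange(partitions) for _ in range(groups)]
        # Stage 2: each *partition* placed on a shard; a stage-1 collision
        # (two groups on one partition) is carried through unchanged.
        placement = {p: rng.randrange(shards) for p in set(parts)}
        final = [placement[p] for p in parts]
        if len(set(final)) < groups:
            hits += 1
    return hits / trials

# With a huge partition count, stage 1 is effectively collision-free, so this
# approximates a single-stage group -> shard hash.
one_stage = final_collisions(groups=8, partitions=10**6, shards=32, trials=20_000)
# With few partitions, stage 1 adds its own collisions on top of stage 2's.
two_stage = final_collisions(groups=8, partitions=16, shards=32, trials=20_000)
```

Under these assumed numbers, the two-stage collision rate comes out noticeably higher than the single-stage one, matching the intuition that the two mappings give two chances to collide.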
Implementation update. PR 6251 has been created to solve the partition distribution part of the problem. The other part, even partition leader balancing, has been left out of scope for now. This PR is a precondition for the other part, and it also solves the problem in a subset of cluster configurations, namely when the following condition is true:
This condition makes it impossible for more than one
Partitions in each allocation domain are balanced separately since redpanda-data#5460. This change evaluates whether the partitions are balanced well enough within each of the allocation domains. How topics are assigned to allocation domains is currently hardcoded: __consumer_offsets belong to -1, all the rest belong to 0. If that becomes more complicated, there should be a better way to determine allocation domain association than this. The ±20% range rule is preserved for each domain, but is somewhat relaxed by rounding the boundary values outwards. This is required to handle small partition counts, e.g. 3 partitions for 2 nodes would otherwise give range of [1.2, 1.8] which no integer value will satisfy; the rounding makes that range [1, 2] instead. Fixes redpanda-data#7418.
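The outward-rounding rule described in the commit message above can be sketched as follows (the helper name is made up; only the arithmetic comes from the text):

```python
import math

def balanced_band(partitions: int, nodes: int) -> tuple[int, int]:
    """Integer per-node partition-count band: +/-20% around the mean,
    with the boundaries rounded outwards so that small counts always
    admit at least one integer value."""
    mean = partitions / nodes
    return math.floor(mean * 0.8), math.ceil(mean * 1.2)

# 3 partitions over 2 nodes: the raw band [1.2, 1.8] contains no integer,
# but rounding outwards yields [1, 2].
```

For larger counts the rounding barely changes the band, e.g. 100 partitions over 4 nodes still gives [20, 30].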
Who is this for and what problem do they have today?
This is for users who make use of more than one consumer group. A problem that arises is that even with many nodes and few groups, the group coordinators may be chosen unevenly, e.g., with 8 groups and 32 nodes, it is possible (and in practice common enough) that one node is the coordinator for 2 or 3 groups. Ideally, no node would coordinate more than 1 group.
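The "common enough" claim above can be checked directly: under uniform random (hashed) placement, the birthday-paradox probability that some node coordinates two or more groups is

```python
def prob_some_node_gets_two(groups: int, nodes: int) -> float:
    """Probability, under independent uniform placement, that at least one
    node ends up coordinator for two or more of the groups."""
    p_all_distinct = 1.0
    for i in range(groups):
        p_all_distinct *= (nodes - i) / nodes
    return 1.0 - p_all_distinct

# For 8 groups and 32 nodes this is roughly 0.6, i.e. a doubled-up node is
# the common case rather than the exception.
```

So even with 4x more nodes than groups, random placement fails the "no node coordinates more than 1 group" goal most of the time.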
What are the success criteria?
In "usual operating conditions" group coordinators should be "evenly spread" across nodes, in the same way that partitions and leaders are evenly spread.
Why is solving this problem impactful?
The coordinator shard requires additional resources to complete its coordination tasks: it needs to receive offset commits and heartbeats from all consumers in the group. A system may be sized such that the load on any coordinator is still low enough to avoid saturation as long as a shard is only a coordinator for one group (or, more generally, its proportional "share" of coordination jobs + 1), but if it ends up coordinating more groups, responsiveness will suffer. If the balancer can ensure the groups are evenly spread, this should not occur.
Additional notes
At a high level, the coordinator for a given group is the shard which is the leader for the partition associated with the group in the `__consumer_offsets` topic. So there are two steps to the distribution: the group name -> partition mapping (to determine the "partition associated with the group in the `__consumer_offsets` topic" part) and the partition -> node/shard mapping part.

The first is a pure function of the group name: it takes a group name and the number of partitions in the topic as input and spits out a partition number based on a consistent hash function (see `coordinator_ntp_mapper::ntp_for`). If this distribution has collisions, there will be coordinator collisions, period: we can't fix this in the second part, so we rely on the distribution of the hash function and a bit of luck. The good news is that since this is a deterministic function of the inputs, the end-user can choose their group names strategically to avoid collisions in this part.

The second step to determine the coordinator is then finding which shard is the leader for the partition determined in the first step. This uses the "usual" partition allocation & rebalancing (if any), just like for any other topic. So the allocation depends on other factors such as which nodes are part of the cluster when the group is created, and decisions made by the leader and partition balancers.
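The two steps can be modeled as a toy sketch (the hash choice, partition count, and lookup table here are assumptions for illustration, not the actual `coordinator_ntp_mapper::ntp_for` implementation):

```python
import hashlib

N_PARTITIONS = 16  # __consumer_offsets partition count, fixed at cluster creation (assumed)

def ntp_for(group: str) -> int:
    # Step 1: pure, deterministic function of the group name -> partition index.
    return int(hashlib.sha256(group.encode()).hexdigest(), 16) % N_PARTITIONS

# Step 2: partition -> leader node/shard, decided by allocation and the
# balancers; modeled here as a plain lookup table.
partition_leader = {
    p: ("node-%d" % (p % 4), "shard-%d" % (p % 2)) for p in range(N_PARTITIONS)
}

def coordinator_for(group: str):
    return partition_leader[ntp_for(group)]
```

Because step 1 is deterministic, a user could in principle probe candidate group names up front and pick ones that land on distinct partitions, exactly as noted above.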
The second step is what we want to change here: the main problem is that although `__consumer_offsets` leadership (if the leader balancer is enabled, which it is by default) and partition allocation and rebalancing (if the rebalancer is enabled, which it is not by default) do consider the partitions in `__consumer_offsets`, they are lumped together with partitions from all the user-created topics too. If there is an initial poor allocation of partitions, then balance may be achieved by moving user-created partitions instead of any partitions from `__consumer_offsets`.

Similarly, if some nodes have fewer partitions than others (e.g., because some nodes joined more recently) when the `__consumer_offsets` topic is originally created, the partition allocation may be highly imbalanced, since most or all partitions may be assigned to the underrepresented nodes. Similar reasoning applies to leadership balancing (note that leadership balancing is on by default, but partition balancing is not) and perhaps initial leadership allocation (though this is less clear, as I am not sure whether initial leadership allocation is a directed process).
To solve these problems, one possibility is to consider this topic a separate "domain" for balancing: we may balance in the same way, but only among the partitions in the `__consumer_offsets` topic, and similarly for leadership balancing decisions. Other possibilities exist.
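A minimal sketch of the per-domain balance check (function names are made up; the domain ids mirror the hardcoded mapping from the commit messages above, -1 for `__consumer_offsets` and 0 for everything else, and the band uses the rounded +/-20% rule):

```python
import math
from collections import Counter

def domain_of(topic: str) -> int:
    # Hardcoded domain assignment, as described in the commit message.
    return -1 if topic == "__consumer_offsets" else 0

def is_balanced(assignment, nodes) -> bool:
    """assignment: list of (topic, node) pairs, one per partition replica.
    Each domain's per-node counts must independently sit within the rounded
    +/-20% band around that domain's own mean."""
    by_domain: dict[int, Counter] = {}
    for topic, node in assignment:
        by_domain.setdefault(domain_of(topic), Counter())[node] += 1
    for counts in by_domain.values():
        mean = sum(counts.values()) / len(nodes)
        lo, hi = math.floor(mean * 0.8), math.ceil(mean * 1.2)
        if any(not (lo <= counts.get(n, 0) <= hi) for n in nodes):
            return False
    return True
```

The point of splitting domains is visible in a small example: an assignment that is perfectly even when all partitions are pooled can still put every `__consumer_offsets` partition on one node, and the per-domain check rejects exactly that case.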