LoadBalancer keyed on slot instead of primary node, not reset on NodesManager.initialize() #3683
base: master
Conversation
…reset when NodesManager resets
Pull Request Overview
This PR refactors the `LoadBalancer` to use cluster slots as keys instead of primary node names, so that slot-based indices persist across `NodesManager` reinitializations. Related tests are updated to assert on slots.

- Rename the internal mapping from `primary_to_idx` to `slot_to_idx` and adapt methods accordingly
- Change calls to `get_server_index` to pass slot IDs and update `get_node_from_slot`
- Remove the automatic load balancer reset in `NodesManager.reset` to preserve slot indices
- Update sync and async cluster tests to use slot keys in load balancer assertions
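As a rough illustration of the change summarized above, here is a minimal sketch of a slot-keyed rotation. The class and attribute names follow the PR summary; the exact signatures and call sites in `redis/cluster.py` may differ.

```python
# Minimal sketch of the slot-keyed rotation summarized above; illustrative
# only, not the exact redis-py implementation.
class LoadBalancer:
    def __init__(self, start_index: int = 0) -> None:
        # Keyed by cluster slot (0-16383) rather than the primary's "host:port".
        self.slot_to_idx = {}
        self.start_index = start_index

    def get_server_index(self, slot: int, list_size: int) -> int:
        # Return the current index for this slot, then advance it so the
        # next call rotates to the following node serving the same slot.
        idx = self.slot_to_idx.get(slot, self.start_index)
        self.slot_to_idx[slot] = (idx + 1) % list_size
        return idx

    def reset(self) -> None:
        self.slot_to_idx.clear()


# Hypothetical call site, mirroring "pass slot IDs to get_server_index":
# in NodesManager.get_node_from_slot the slot, rather than the primary's
# name, would be handed to the balancer, e.g.
#   idx = self.read_load_balancer.get_server_index(slot, len(self.slots_cache[slot]))
#   return self.slots_cache[slot][idx]
```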
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
File | Description
---|---
tests/test_cluster.py | Switch load balancer tests to use the slot argument instead of the primary name
tests/test_asyncio/test_cluster.py | Same slot-based updates for the async cluster tests
redis/cluster.py | Core LoadBalancer refactor: keying by slot, updated signatures, and removed reset in NodesManager
Comments suppressed due to low confidence (5)
redis/cluster.py:1409
- Update the method docstring (or add a comment) for `get_server_index` to clearly state that the first parameter is now `slot: int` instead of the previous primary node name, and describe the expected behavior.
def get_server_index(
redis/cluster.py:1406
- [nitpick] Consider renaming `slot_to_idx` to a more descriptive identifier such as `slot_index_map` or `slot_to_index` to make it clearer that this is a mapping from slot IDs to rotation indices.
self.slot_to_idx = {}
redis/cluster.py:1435
- The inline comment here could be updated to reflect the slot-based logic, e.g. "skip index 0 (primary) when `replicas_only` is true", to avoid confusion about nodes vs. slot indices.
# skip the primary node index
redis/cluster.py:1836
- Add a test to verify that calling `NodesManager.reset()` no longer clears the slot-based load balancer state, ensuring that slot indices persist across reinitializations (a sketch of such a test appears after this list).
def reset(self):
redis/cluster.py:1405
- Add tests for non-default `start_index` values to ensure the `LoadBalancer` correctly starts rotations from the specified offset.
def __init__(self, start_index: int = 0) -> None:
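Following up on the suggestion for `redis/cluster.py:1836` above, a sketch of such a test. `get_mocked_nodes_manager` is a hypothetical helper standing in for whatever fixture `tests/test_cluster.py` actually provides, and `read_load_balancer` is assumed to be the `NodesManager`'s balancer attribute.

```python
def test_nodes_manager_reset_preserves_load_balancer_state():
    # Illustrative sketch: get_mocked_nodes_manager is a hypothetical helper.
    nm = get_mocked_nodes_manager()
    lb = nm.read_load_balancer
    slot, list_size = 12182, 3  # 1 primary + 2 replicas serving this slot

    first = lb.get_server_index(slot, list_size)
    nm.reset()  # with this PR, reset() no longer calls lb.reset()
    second = lb.get_server_index(slot, list_size)

    # The rotation continues from where it left off instead of restarting
    # at the primary (index 0).
    assert second == (first + 1) % list_size
```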
Hi @drewfustin,

That said, I have some concerns about the proposed approach. With your change, the replica indexes would be rotated per slot, which could lead to repeatedly hitting the same cluster node over an extended period, instead of distributing requests across replicas or across both replicas and primaries. In a Redis Cluster with 16,384 slots, an application that doesn't deliberately target specific slots might end up consistently hitting a single node. This behavior doesn't align with the intended behavior of the round-robin algorithm, which is designed to spread requests across both primaries and replicas, or just through the replicas, to ensure better load distribution.
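A toy simulation of the concern raised above, assuming per-slot counters like the sketch earlier. It only demonstrates that, when requests are spread over many slots, most of them resolve to the same starting index because each slot's counter is rarely advanced; it is not taken from the PR itself.

```python
import random
from collections import Counter

# Toy model: per-slot round robin over a node list of size 3
# (1 primary + 2 replicas), with requests spread uniformly over all slots.
NUM_SLOTS, LIST_SIZE, REQUESTS = 16384, 3, 10_000
slot_to_idx = {}
index_hits = Counter()

for _ in range(REQUESTS):
    slot = random.randrange(NUM_SLOTS)
    idx = slot_to_idx.get(slot, 0)
    slot_to_idx[slot] = (idx + 1) % LIST_SIZE
    index_hits[idx] += 1

# Most slots are touched 0 or 1 times, so index 0 dominates: the rotation
# rarely advances, and reads for a shard keep landing on the same node.
print(index_hits)
```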
adcc529 to 28b14a5
Pull Request check-list
Description of change
- `LoadBalancer` now uses `slot_to_idx` instead of `primary_to_idx`, using the slot as the key instead of the primary node name.
- When `NodesManager` resets in `initialize`, it no longer runs `LoadBalancer.reset()`, which would clear the `slot_to_idx` dictionary.
- `TestNodesManager.test_load_balancer` updated accordingly.

As noted in #3681, resetting the load balancer on `NodesManager.initialize()` causes the index associated with the primary node to reset to 0. If a `ConnectionError` or `TimeoutError` is raised by an attempt to connect to a primary node, `NodesManager.initialize()` is called, and the load balancer's index for that node will reset to 0. Therefore, the next attempt in the retry loop will not move on from the primary node to a replica node (with index > 0) as expected, but will instead retry the primary node again (and presumably raise the same error).
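To make that failure mode concrete, a condensed trace of the retry behavior using the sketch from earlier (illustrative only; it assumes one primary at index 0 and two replicas for the slot).

```python
lb = LoadBalancer()
slot, list_size = 12182, 3

# First attempt resolves to the primary (index 0) and, in the scenario
# above, raises ConnectionError/TimeoutError.
assert lb.get_server_index(slot, list_size) == 0

# Previously, NodesManager.initialize() ran here and reset the balancer,
# so the retry resolved to index 0 (the same unreachable primary) again.
# With the slot-keyed balancer left untouched, the retry advances:
assert lb.get_server_index(slot, list_size) == 1  # a replica
```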
Since calling `NodesManager.initialize()` on `ConnectionError` or `TimeoutError` is the valid strategy, and since the primary node's host will often be replaced in tandem with the events that cause these errors (e.g. when a primary node is deleted and then recreated in Kubernetes), keying the `LoadBalancer` dictionary on the primary node's name (`host:port`) doesn't feel appropriate. Keying the dictionary on the Redis Cluster slot seems to be a better strategy: the `server_index` corresponding to a `slot` key doesn't need to be reset to 0 on `NodesManager.initialize()`, because the slot isn't expected to change; only the `host:port` would require such a reset. Instead, the slot can maintain its "state" even when the `NodesManager` is reinitialized, thus resolving #3681.

With the fix in this PR implemented, the output of the loop from #3681 becomes what is expected when the primary node goes down (the load balancer continues to the next node on a `TimeoutError`):