
LoadBalancer keyed on slot instead of primary node, not reset on NodesManager.initialize() #3683


Open

drewfustin wants to merge 4 commits into master from slot-based-load-balancer

Conversation

@drewfustin drewfustin commented Jun 19, 2025

Pull Request check-list

  • Do tests and lints pass with this change?
  • Do the CI tests pass with this change (enable it first in your forked repo and wait for the github action build to finish)? link
  • Is the new or changed code fully tested?
  • Is a documentation update included (if this change modifies existing APIs, or introduces new ones)?
  • Is there an example added to the examples folder (if applicable)?

Description of change

As noted in #3681, resetting the load balancer on NodesManager.initialize() causes the index associated with the primary node to reset to 0. If a ConnectionError or TimeoutError is raised by an attempt to connect to a primary node, NodesManager.initialize() is called and the load balancer's index for that node resets to 0. As a result, the next attempt in the retry loop will not move on from the primary node to a replica node (with index > 0) as expected, but will instead retry the primary node again (and presumably raise the same error).

Since calling NodesManager.initialize() on ConnectionError or TimeoutError is the intended strategy, and since the primary node's host is often replaced in tandem with the events that cause these errors (e.g. when a primary node is deleted and then recreated in Kubernetes), keying the LoadBalancer dictionary on the primary node's name (host:port) doesn't feel appropriate. Keying the dictionary on the Redis Cluster slot is a better fit: the server_index associated with a slot doesn't need to be reset to 0 on NodesManager.initialize(), because the slot itself isn't expected to change; only a host:port key would require such a reset. The slot can therefore maintain its state even when the NodesManager is reinitialized, which resolves #3681.
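
For reference, a minimal sketch of a slot-keyed LoadBalancer along these lines (simplified: the identifiers slot_to_idx, get_server_index, and start_index appear in the PR, but the body below is an assumption, not the actual implementation, which also accounts for the configured load-balancing strategy):

```python
# Minimal sketch (not the full implementation): the rotation index is keyed
# by slot, so it survives NodesManager.initialize() / reset().
class LoadBalancer:
    def __init__(self, start_index: int = 0) -> None:
        self.slot_to_idx = {}          # slot -> next index into the slot's node list
        self.start_index = start_index

    def get_server_index(self, slot: int, list_size: int) -> int:
        # Return the current index for this slot, then advance it round-robin.
        server_index = self.slot_to_idx.setdefault(slot, self.start_index)
        self.slot_to_idx[slot] = (server_index + 1) % list_size
        return server_index

    def reset(self) -> None:
        self.slot_to_idx.clear()
```

Because the key is the slot rather than the primary's name, a topology refresh that swaps out the primary's host:port leaves the rotation state untouched.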

With the fix in this PR implemented, the output of the loop from #3681 becomes what is expected when the primary node goes down (the load balancer continues to the next node on a TimeoutError):

Attempt 1
idx: 2 | node: 100.66.151.143:6379 | type: replica
'bar'

Attempt 2
idx: 0 | node: 100.66.122.229:6379 | type: primary
Exception: Timeout connecting to server

Attempt 3
idx: 1 | node: 100.66.151.143:6379 | type: replica
'bar'

Attempt 4
idx: 2 | node: 100.66.106.241:6379 | type: replica
'bar'

Attempt 5
idx: 0 | node: 100.66.122.229:6379 | type: primary
Exception: Error 113 connecting to 100.66.122.229:6379. No route to host.

@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

This PR refactors the LoadBalancer to use cluster slots as keys instead of primary node names, so that slot-based indices persist across NodesManager reinitializations. Related tests are updated to assert on slots.

  • Rename internal mapping from primary_to_idx to slot_to_idx and adapt methods accordingly
  • Change calls to get_server_index to pass slot IDs and update get_node_from_slot
  • Remove automatic load balancer reset in NodesManager.reset to preserve slot indices
  • Update sync and async cluster tests to use slot keys in load balancer assertions
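
The second bullet amounts to changing what the call site passes to get_server_index. Roughly (an illustrative sketch, not the actual NodesManager code; pick_read_node is a hypothetical helper):

```python
# Illustrative only: the real logic lives in NodesManager.get_node_from_slot.
# Before the PR, the first argument to get_server_index was the primary's
# name (host:port); after the PR, it is the slot itself.
def pick_read_node(slots_cache, load_balancer, slot):
    nodes = slots_cache[slot]        # [primary, replica, replica, ...]
    idx = load_balancer.get_server_index(slot, len(nodes))
    return nodes[idx]
```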

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File                               | Description
tests/test_cluster.py              | Switch load balancer tests to use the slot argument instead of the primary name
tests/test_asyncio/test_cluster.py | Same slot-based updates for the async cluster tests
redis/cluster.py                   | Core LoadBalancer refactor: keying by slot, updated signatures, and removed reset in NodesManager

Comments suppressed due to low confidence (5)

redis/cluster.py:1409

  • Update the method docstring (or add a comment) for get_server_index to clearly state that the first parameter is now slot: int instead of the previous primary node name, and describe the expected behavior.
    def get_server_index(

redis/cluster.py:1406

  • [nitpick] Consider renaming slot_to_idx to a more descriptive identifier such as slot_index_map or slot_to_index to make it clearer that this is a mapping from slot IDs to rotation indices.
        self.slot_to_idx = {}

redis/cluster.py:1435

  • The inline comment here could be updated to reflect the slot-based logic: e.g. "skip index 0 (primary) when replicas_only is true" to avoid confusion about nodes vs. slot indices.
            # skip the primary node index

redis/cluster.py:1836

  • Add a test to verify that calling NodesManager.reset() no longer clears the slot-based load balancer state, ensuring that slot indices persist across reinitializations.
    def reset(self):

redis/cluster.py:1405

  • Add tests for non-default start_index values to ensure the LoadBalancer correctly starts rotations from the specified offset.
    def __init__(self, start_index: int = 0) -> None:
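
The last two suggestions could be covered by tests roughly like the following (a hedged sketch written against the simplified LoadBalancer above; the nodes_manager fixture and test names are illustrative, not from the PR):

```python
def test_load_balancer_respects_start_index():
    # Rotation should begin at the configured offset, then wrap around.
    lb = LoadBalancer(start_index=1)
    assert lb.get_server_index(0, 3) == 1
    assert lb.get_server_index(0, 3) == 2
    assert lb.get_server_index(0, 3) == 0


def test_reset_preserves_slot_indices(nodes_manager):
    # After this PR, NodesManager.reset() should leave the slot-keyed
    # rotation state in place instead of clearing it.
    lb = nodes_manager.read_load_balancer
    assert lb.get_server_index(0, 3) == 0   # advances slot 0's counter to 1
    nodes_manager.reset()
    assert lb.get_server_index(0, 3) == 1   # state survived the reset
```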

@petyaslavova (Collaborator) left a comment

Hi @drewfustin,
First of all, thank you for the time and effort you've put into this PR — it's much appreciated!

That said, I have some concerns about the proposed approach. With your change, the replica indexes would be rotated per slot, which could lead to repeatedly hitting the same cluster node over an extended period, instead of distributing requests across replicas or across both replicas and primaries.

In a Redis Cluster with 16,384 slots, an application that doesn't deliberately target specific slots might end up consistently hitting a single node. That doesn't align with the intent of the round-robin algorithm, which is designed to spread requests across both primaries and replicas, or across replicas only, to ensure better load distribution.
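
To make this concern concrete, a tiny illustration (hypothetical numbers and helper, not redis-py code): one shard with a primary and two replicas owns many slots, and each slot is read only once.

```python
def rotate(counters, key, list_size):
    # Return the current index for `key`, then advance it round-robin.
    idx = counters.get(key, 0)
    counters[key] = (idx + 1) % list_size
    return idx

list_size = 3          # one primary + two replicas in a single shard
slots = range(100)     # 100 different slots, all owned by that shard

per_primary = {}       # pre-PR behavior: one counter per primary name
per_slot = {}          # post-PR behavior: one counter per slot

primary_picks = [rotate(per_primary, "primary-A", list_size) for _ in slots]
slot_picks = [rotate(per_slot, slot, list_size) for slot in slots]

print(set(primary_picks))  # {0, 1, 2} -> reads rotate across all three nodes
print(set(slot_picks))     # {0}       -> every read lands on the same position
```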

@drewfustin drewfustin force-pushed the slot-based-load-balancer branch from adcc529 to 28b14a5 on June 30, 2025, 20:19

Successfully merging this pull request may close these issues.

Round robin load balancing isn't working as expected if primary node goes down for Redis cluster mode