Cluster: Fail to reconnect to node #587
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.
Keep alive. This issue needs to be resolved.
Hey @ccs018, as far as I know, ioredis only refreshes the cluster nodes list when there is a slot change. Can you try to change one of the slots and see if ioredis reconnects to that slave? You are correct that we should add a refresh interval for the cluster nodes list; I'll see if I can hook something up.
@shaharmor, thanks for the response. I could try a slot change, but I believe this also is not immediately detected, only when an operation results in a MOVED response. Note that I've raised a couple of other related issues regarding how ioredis manages cluster connections. In this particular issue, there is no real change in the cluster topology. Rather, one of the nodes failed and ioredis decided to never attempt to reconnect to it. If you query any node in the cluster, you can see that the temporarily failed node was never removed from the cluster, yet ioredis decides that it is gone forever. In this scenario, ioredis should not unilaterally remove the node from its view of the cluster, but should continue retrying the connection to that node. As I noted above, the error may be related to the following debug output:
Can you share how you are calling the
I could, but it's gotten rather complex as I've had to work around the various issues I've logged, and I no longer have the original code. Between this issue, the unhandled exception when trying to connect to the cluster before the cluster is actually formed (a startup race condition where the ioredis client hangs), and the lack of proactive handling of actual changes to the cluster topology, I've written a lot of logic to detect and recover from these conditions. I'll try to find some time to write a sample client, but it should be rather easy to reproduce. Just set up a cluster, connect a client, and then stop one of the nodes. Wait a minute and then restart the node. You'll see that ioredis does not automatically reconnect to that node. startupNodes = [{host: '127.0.0.1', port: 6400}, {host: '127.0.0.1', port: 6410}, {host: '127.0.0.1', port: 6420}]
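The reproduction steps above can be sketched roughly as follows. The three startupNodes entries come from the comment; everything else (the function name, the polling loop) is an assumption about how one would observe the behavior against a locally running cluster:

```javascript
// Hedged reproduction sketch for the failure described above.
// Call reproduce() manually against a running 3-master cluster on the
// ports from the comment; it is not invoked automatically here.
const startupNodes = [
  { host: '127.0.0.1', port: 6400 },
  { host: '127.0.0.1', port: 6410 },
  { host: '127.0.0.1', port: 6420 },
];

function reproduce() {
  const Redis = require('ioredis'); // assumes ioredis is installed
  const cluster = new Redis.Cluster(startupNodes);
  // Print how many nodes ioredis currently knows about. Stop one node,
  // wait a minute, restart it: per the report, this count never recovers
  // until the client itself is destroyed and re-created.
  setInterval(() => {
    console.log('known nodes:', cluster.nodes().length);
  }, 10000);
}
```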
I don't recall if I was initially using the builtin for
@ccs018 Hi, sorry for the really late response. According to the official documentation https://redis.io/topics/cluster-spec#clients-first-connection-and-handling-of-redirections, ioredis will refresh the slots only when a MOVED redirection is received. I'm wondering whether this should be a problem, since once a command is sent that belongs to this node, ioredis will receive a MOVED redirection and refresh the slots accordingly.
MOVED is only applicable if slots were moved from one shard to another. There are many scenarios where the current implementation has an issue but a MOVED error is never sent. Several scenarios come to mind:
So I think the discussion has gotten off topic from the original issue. I did a bit more digging and found that, in cluster mode, if a node disconnects (e.g., the redis server restarts), there is explicitly no attempt to reconnect to that node, even if the client specified a retryStrategy() in
I believe this is wrong. The connection could be lost due to a simple node restart (machine restart, redis server upgrade, etc.). In these scenarios, there would never be a MOVED error. Further, the code overrides the client-specified options and forces offline queuing to be enabled.
It's not clear why this needs to be overridden. In my use case, because I am using redis as a true cache, I have special handling for when operations to update the cache can't complete. I may have other clients also writing to the cache, and if those commands start to get queued up and then later dequeued and executed, they could be executed out of order, leaving the cache with the wrong data. The fact that these two options are being overridden is not documented. I'd like to see these overrides removed.
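For reference, a minimal sketch of the two options under discussion. The option names follow the ioredis README; the backoff values and the attempt cap are my own assumptions, not anything from the thread:

```javascript
// retryStrategy is a pure function: given the number of attempts so far,
// it returns the delay in ms before the next attempt, or null to stop.
function retryStrategy(times) {
  if (times > 20) return null;             // give up after 20 attempts
  return Math.min(50 * 2 ** times, 3000);  // exponential backoff, 3s cap
}

// Options as a client would like them honored; per the comments above,
// ioredis cluster mode reportedly forces offline queuing back on and
// ignores the per-node retry strategy for dropped cluster nodes.
const clusterOptions = {
  enableOfflineQueue: false,
  redisOptions: { retryStrategy },
};
```

The complaint in the thread is precisely that settings like these are silently overridden in cluster mode.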
@ccs018 The use cases you described won't be affected by setting
The comment about
Has there been a suitable workaround to this issue, or is a solution in progress? I am currently running into this issue, which has caused our custom
Has any solution been found for a connected server becoming unreachable? I am facing the same issue with Redis Sentinels. When the connected master becomes unreachable, the client never retries the connection. Is PR #658 a solution for this scenario?
We're seeing this issue when a node in the cluster is restarted without any changes to the slot distribution. Does anyone have a working solution?
Same issue here when a replica node restarts.
We have major problems with redis not reconnecting when Redis Labs does maintenance on our cluster; I believe this is the underlying issue. Why hasn't any progress been made in 4 years?
You may have to start a new issue @casret |
This issue seems to be directly related to one I have recently filed: #1732
I have discovered a hack that could mitigate issues related to the current flaws in redis cluster topology updates: |
I'm running a simple test with 3 masters + 1 slave per master. If I take down one of the slaves and restart it, ioredis fails to reconnect to that node.
Note, however, that retryStrategy is a function. I've repeated this failure with 1) no options supplied to new Cluster, 2) with options, but without setting options.redisOptions.retryFunction, and 3) with options, setting options.redisOptions.retryFunction to function() { return 3000; }.
Even long after the node has been restarted and has rejoined the cluster (verified with redis-cli cluster slots and cluster nodes), ioredis continues to report cluster.nodes().length = 5.
So, the only way to recover from this very basic failure mode is for the application to monitor the '-node' event and then destroy and re-create the client. This is a serious defect.
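That destroy-and-re-create workaround can be sketched roughly as follows. The '-node' event name is from the report; the factory/callback shape is my own assumption, written so the recovery logic can be exercised without a live cluster:

```javascript
// Attach a handler that tears down the cluster client when ioredis drops
// a node from its view, then hands a freshly created client to the caller.
// createCluster would wrap `new Redis.Cluster(startupNodes)`; it is
// injected here so the logic stays testable.
function attachNodeLossRecovery(cluster, createCluster, onReplace) {
  cluster.on('-node', () => {
    cluster.disconnect();        // abandon the stale topology view
    onReplace(createCluster());  // rebuild so all nodes are rediscovered
  });
  return cluster;
}

// Usage (hedged sketch, assuming ioredis):
// let cluster = attachNodeLossRecovery(
//   new Redis.Cluster(startupNodes),
//   () => new Redis.Cluster(startupNodes),
//   (fresh) => { cluster = fresh; }  // re-attach recovery on the fresh client too
// );
```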
I would have hoped that ioredis periodically monitored the cluster configuration by performing a cluster slots command; without that, it is clearly impossible for ioredis to detect whether new slaves are ever added to the cluster in an expansion scenario.