
Cluster: Fail to reconnect to node #587

Open
jcstanaway opened this issue Feb 16, 2018 · 25 comments
@jcstanaway

I'm running a simple test with 3 masters + 1 slave per master. If I take down one of the slaves and restart it, ioredis fails to reconnect to that node.

  ioredis:redis status[127.0.0.1:6421]: ready -> close +7s
  ioredis:connection skip reconnecting because `retryStrategy` is not a function +0ms
  ioredis:redis status[127.0.0.1:6421]: close -> end +0ms
Received -node: "127.0.0.1:6421"

Note, however, that retryStrategy is a function. I've reproduced this failure 1) with no options supplied to new Cluster(), 2) with options but without setting options.redisOptions.retryStrategy, and 3) with options and options.redisOptions.retryStrategy set to function() { return 3000; }.

Even long after the node has been restarted and rejoined the cluster (verified with redis-cli cluster slots and cluster nodes), ioredis continues to report the cluster.nodes().length = 5.

So, the only way to recover from this very basic failure mode is for the application to monitor the '-node' event and destroy and re-create the client. This is a bad defect.

I would have hoped that ioredis periodically monitored the cluster configuration by issuing a CLUSTER SLOTS command. Without that, it is clearly impossible for ioredis to detect whether new slaves are ever added to the cluster in an expansion scenario.

@stale

stale bot commented Mar 18, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot added the wontfix label Mar 18, 2018
@jcstanaway
Author

Keep alive. This issue needs to be resolved.

@stale stale bot removed the wontfix label Mar 19, 2018
@stale

stale bot commented Apr 18, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot added the wontfix label Apr 18, 2018
@jcstanaway
Author

Keep alive. This issue needs to be resolved.

@stale stale bot removed the wontfix label Apr 18, 2018
@stale

stale bot commented May 18, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot added the wontfix label May 18, 2018
@jcstanaway
Author

Keep alive. This issue still needs to be resolved.

@stale stale bot removed the wontfix label May 18, 2018
@stale

stale bot commented Jun 17, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot added the wontfix label Jun 17, 2018
@jcstanaway
Author

freshbot

@stale stale bot removed the wontfix label Jun 18, 2018
@shaharmor
Collaborator

Hey @ccs018, as far as I know, ioredis only refreshes the cluster nodes list when there is a slot change.

Can you try to change one of the slots and see if ioredis reconnects to that slave?

You are correct that we should add a refresh interval for the cluster nodes list; I'll see if I can hook something up.

@jcstanaway
Author

@shaharmor, thanks for the response. I could try a slot change, but I believe that, too, is not detected immediately, only when an operation results in a MOVED response.

Note that I've raised a couple of other related issues regarding how ioredis manages cluster connections. In this particular issue there is no real change in the cluster topology; rather, one of the nodes failed and ioredis decided never to attempt to reconnect to it. If you query any node in the cluster, you can see that the temporarily failed node was never removed from the cluster, yet ioredis decides that it is gone forever. In this scenario, ioredis should not unilaterally remove the node from its view of the cluster, but should keep retrying the connection. As I noted above, the error may be related to the following debug output:
ioredis:connection skip reconnecting because `retryStrategy` is not a function +0ms
yet in my configuration of ioredis, retryStrategy IS a function. I suspect there is an error in handling the options passed to Cluster().

@shaharmor
Collaborator

Can you share how you are calling the Redis.Cluster constructor?

@jcstanaway
Author

jcstanaway commented Jun 18, 2018

I could, but it's gotten rather complex: I've had to work around the various issues I've logged and no longer have the original code. Between this issue, the unhandled startup race condition (connecting before the cluster is actually formed makes the ioredis client hang), and the lack of proactive handling of actual changes to the cluster topology, I've written a lot of logic to detect and recover from these conditions.

I'll try to find some time to write a sample client, but it should be rather easy to reproduce. Just set up a cluster, connect a client and then stop one of the nodes. Wait a minute and then restart the node. You'll see that ioredis does not automatically reconnect to that node.

startupNodes = [{host: '127.0.0.1', port: 6400}, {host: '127.0.0.1', port: 6410}, {host: '127.0.0.1', port: 6420}]
Other nodes in the cluster are 127.0.0.1 ports 6401, 6411 and 6421, discovered dynamically.
Non-default options:

enableOfflineQueue: false
scaleReads: true
redisOptions.connectTimeout: 2000
redisOptions.autoResendUnfulfilledCommands: false
redisOptions.reconnectOnError = function(err) {
    let targetError = 'READONLY';
    if (err.message.slice(0, targetError.length) === targetError) {
        // Only reconnect when the error starts with "READONLY".
        return 2;
    }
};
redisOptions.readOnly: true

I don't recall whether I was initially using the built-in clusterRetryStrategy and/or redisOptions.retryStrategy, but you can look at #586 for a related issue.
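For reference, the configuration listed above can be reassembled into a single options object. This is a sketch built from the values in this thread, not the reporter's original code; the fixed 3000 ms retryStrategy is one of the variants the reporter says was tried, and per the debug output earlier in the thread, ioredis still logged "retryStrategy is not a function" with a configuration like this.

```javascript
// Startup nodes as described in this thread's test setup.
const startupNodes = [
  { host: '127.0.0.1', port: 6400 },
  { host: '127.0.0.1', port: 6410 },
  { host: '127.0.0.1', port: 6420 },
];

// Non-default options, reassembled from the listing above.
const options = {
  enableOfflineQueue: false,
  scaleReads: true, // as reported in the thread
  redisOptions: {
    connectTimeout: 2000,
    autoResendUnfulfilledCommands: false,
    readOnly: true,
    // The reporter verified this IS a function, despite the
    // "retryStrategy is not a function" debug output.
    retryStrategy: function () { return 3000; },
    reconnectOnError: function (err) {
      const targetError = 'READONLY';
      if (err.message.slice(0, targetError.length) === targetError) {
        // Only reconnect when the error starts with "READONLY".
        return 2;
      }
    },
  },
};

// Constructing the client requires ioredis and a live cluster:
// const Redis = require('ioredis');
// const cluster = new Redis.Cluster(startupNodes, options);
```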

@luin
Collaborator

luin commented Jun 24, 2018

@ccs018 Hi, sorry for the really late response.

According to the official documentation (https://redis.io/topics/cluster-spec#clients-first-connection-and-handling-of-redirections), ioredis refreshes the slots only when a MOVED error is received. That means even when a node is down temporarily and not removed from the cluster, ioredis will disconnect from it and won't reconnect.

I'm wondering whether this should be a problem, since once a command that belongs to this node is sent, ioredis will receive a MOVED error, at which point it refreshes the slots information and reconnects to the lost node.

@jcstanaway
Author

jcstanaway commented Jun 25, 2018

MOVED is only applicable if slots were moved from one shard to another. There are many scenarios where there's an issue with the current implementation where there is never a MOVED error.

Several scenarios come to mind:

  1. Read-only slaves. If I want to scale my read performance and enable read operations on slave nodes, then when a slave node restarts (or, more likely, the machine it runs on restarts), the node is removed by ioredis and never reconnected, hurting my read performance.

  2. High availability. A slave node restarts and ioredis removes it from its cluster list. Some time later the master goes down and the previously restarted slave is elected master. The cluster slot configuration hasn't changed, so no MOVED error will ever be received, and the client will never be able to access that shard.

  3. Rolling upgrades. A slave is taken down to be upgraded and ioredis removes it from the list. The slave is restarted; next, the master is taken down to be upgraded and the previously upgraded slave is elected master. This is essentially the same as item 2 above.

@luin luin added the pinned label Jun 26, 2018
@jcstanaway
Author

So I think the discussion has gotten off the topic of the original issue. I did a bit more digging and found that in cluster mode, if a node disconnects (e.g., the redis server restarts), there is explicitly no attempt to reconnect to that node, even if the client specified a retryStrategy in options.redisOptions. Specifically, in connection_pool.js:

    redis = new Redis(_.defaults({
      // Never try to reconnect when a node is lost,
      // instead, waiting for a `MOVED` error and
      // fetch the slots again.
      retryStrategy: null,

I believe this is wrong. The connection could be lost due to a simple node restart (machine reboot, redis server upgrade, etc.). In these scenarios there would never be a MOVED error, so ioredis never reconnects to the node.

Further, the code overrides the client specified options and forces offline queuing to be enabled.

      // Offline queue should be enabled so that
      // we don't need to wait for the `ready` event
      // before sending commands to the node.
      enableOfflineQueue: true,

It's not clear why this needs to be overridden. In my use case, because I am using redis as a true cache, I have special handling for when operations to update the cache can't complete. Other clients may also be writing to the cache; if commands are queued up and later dequeued and executed, they could execute out of order, leaving the cache with wrong data.

The fact that these two options are being overridden is not documented. I'd like to see these overrides removed.
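To make the override concrete: lodash's `_.defaults` keeps keys that are already present on the first object (and `null` counts as present), so the hard-coded `retryStrategy: null` shadows whatever the user supplied. The `defaults` helper below is a minimal stand-in that mimics lodash's documented semantics, written here for illustration; it is not ioredis code.

```javascript
// Minimal stand-in for lodash's _.defaults: only keys that are
// undefined on the target get filled in from the sources, so a key
// explicitly set to null (like retryStrategy below) is kept as null.
function defaults(target, ...sources) {
  for (const source of sources) {
    for (const key of Object.keys(source)) {
      if (target[key] === undefined) {
        target[key] = source[key];
      }
    }
  }
  return target;
}

// Options the client passed in, as in this thread's configuration.
const userOptions = {
  retryStrategy: () => 3000,
  enableOfflineQueue: false,
};

// What connection_pool.js effectively computes, per the snippet above:
// the hard-coded overrides win regardless of what the user supplied.
const effective = defaults(
  {
    retryStrategy: null,      // reconnection disabled, user's function ignored
    enableOfflineQueue: true, // offline queue forced on
  },
  userOptions
);
```

With this merge order, `effective.retryStrategy` ends up `null` and `effective.enableOfflineQueue` ends up `true` no matter what `userOptions` contains, which is exactly the override behavior complained about above.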

@luin
Collaborator

luin commented Jul 7, 2018

@ccs018 The use cases you described won't be affected by setting enableOfflineQueue to true, since the offline queue only works when the connection is live or a retry attempt is in progress. When a node is lost, ioredis closes the connection immediately because retryStrategy is null, so no more commands will be added to the offline queue.

@jcstanaway
Author

The comment about enableOfflineQueue is more of a secondary issue. The primary problem is that retryStrategy is being set to null so that ioredis will not attempt to reconnect to that node when that node simply restarts. That needs to change. Assuming that there will eventually be a MOVED error is an invalid assumption.

@ischer

ischer commented Jan 8, 2020

Has there been a suitable workaround to this issue, or is a solution in progress? I am currently running into this issue, which has caused our custom scaleReads handler to drop nodes and only return the master. Since all of our reads must go to read replicas, this is not workable.

@ajinkyarajput

Is there any solution for a connected server becoming unreachable? I am facing the same issue with Redis Sentinels: when the connected master becomes unreachable, the client never retries to reconnect. Is PR #658 the solution for this scenario?

@headlessme
Contributor

We're seeing this issue when a node in the cluster is restarted without any changes to the slot distribution. Does anyone have a working solution?

@menocomp

menocomp commented Sep 1, 2022

Same issue here when a replica node restarts.
redis-cli shows the node as disconnected and then, a few seconds later, as connected; however, ioredis does not try to fetch the nodes again, only the slots. That looks like a bug to me.

@casret

casret commented Sep 27, 2022

We have major problems with redis not reconnecting when Redis Labs does maintenance on our cluster; I believe this is the underlying issue. Why hasn't any progress been made in 4 years?

@xtianus79

You may have to start a new issue @casret

@roma-glushko

This issue seems to be directly related to one I have recently filed: #1732

@roma-glushko

I have discovered a hack that could mitigate issues related to the current flaws in redis cluster topology updates:
#1732 (comment)
Hope it will save you some time.

No branches or pull requests

10 participants