
Cluster: Fail to reconnect to node #587

Open
jcstanaway opened this issue Feb 16, 2018 · 25 comments
@jcstanaway

I'm running a simple test with 3 masters + 1 slave per master. If I take down one of the slaves and restart it, ioredis fails to reconnect to that node.

  ioredis:redis status[127.0.0.1:6421]: ready -> close +7s
  ioredis:connection skip reconnecting because `retryStrategy` is not a function +0ms
  ioredis:redis status[127.0.0.1:6421]: close -> end +0ms
Received -node: "127.0.0.1:6421"

Note, however, that retryStrategy is a function. I've reproduced this failure 1) with no options supplied to new Cluster(), 2) with options but without setting options.redisOptions.retryStrategy, and 3) with options and options.redisOptions.retryStrategy set to function() { return 3000; }.

Even long after the node has been restarted and rejoined the cluster (verified with redis-cli cluster slots and cluster nodes), ioredis continues to report the cluster.nodes().length = 5.

So, the only way to recover from this very basic failure mode is for the application to monitor the '-node' event and destroy and re-create the client. This is a bad defect.

I would have hoped that ioredis periodically monitored the cluster configuration by issuing a CLUSTER SLOTS command. Without that, it is clearly impossible for ioredis to detect whether new slaves are ever added to the cluster in an expansion scenario.

@stale

stale bot commented Mar 18, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot added the wontfix label Mar 18, 2018
@jcstanaway
Author

Keep alive. This issue needs to be resolved.

@stale stale bot removed the wontfix label Mar 19, 2018
@stale

stale bot commented Apr 18, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot added the wontfix label Apr 18, 2018
@jcstanaway
Author

Keep alive. This issue needs to be resolved.

@stale stale bot removed the wontfix label Apr 18, 2018
@stale

stale bot commented May 18, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot added the wontfix label May 18, 2018
@jcstanaway
Author

Keep alive. This issue still needs to be resolved.

@stale stale bot removed the wontfix label May 18, 2018
@stale

stale bot commented Jun 17, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot added the wontfix label Jun 17, 2018
@jcstanaway
Author

freshbot

@stale stale bot removed the wontfix label Jun 18, 2018
@shaharmor
Collaborator

Hey @ccs018, as far as I know, ioredis only refreshes the cluster nodes list when there is a slot change.

Can you try to change one of the slots and see if ioredis reconnects to that slave?

You are correct that we should add a refresh interval for the cluster nodes list; I'll see if I can hook something up.

@jcstanaway
Author

@shaharmor, thanks for the response. I could try a slot change, but I believe that, too, is not detected immediately, only when an operation results in a MOVED response.

Note that I've raised a couple of other related issues regarding how ioredis manages cluster connections. In this particular issue there is no real change in the cluster topology; rather, one of the nodes failed and ioredis decided never to attempt to reconnect to it. If you query any node in the cluster, you can see that the temporarily failed node was never removed from the cluster, yet ioredis decides that it is gone forever. In this scenario, ioredis should not unilaterally remove the node from its view of the cluster, but should keep retrying the connection. As I noted above, the error may be related to the following debug output:
ioredis:connection skip reconnecting because `retryStrategy` is not a function +0ms
yet in my configuration of ioredis, retryStrategy IS a function. I suspect there is an error in handling the options passed to Cluster().

@shaharmor
Collaborator

Can you share how you are calling the Redis.Cluster constructor?

@jcstanaway
Author

jcstanaway commented Jun 18, 2018

I could, but it's gotten rather complex: I've had to work around the various issues I've logged and no longer have the original code. Between this issue, the unhandled startup race condition (connecting before the cluster is actually formed makes the ioredis client hang), and the lack of proactive handling of actual changes to the cluster topology, I've written a lot of logic to detect and recover from these conditions.

I'll try to find some time to write a sample client, but it should be rather easy to reproduce. Just set up a cluster, connect a client and then stop one of the nodes. Wait a minute and then restart the node. You'll see that ioredis does not automatically reconnect to that node.

startupNodes = [{host: '127.0.0.1', port: 6400}, {host: '127.0.0.1', port: 6410}, {host: '127.0.0.1', port: 6420}]
Other nodes in the cluster are 127.0.0.1 ports 6401, 6411 and 6421, discovered dynamically.
Non-default options:

enableOfflineQueue: false
scaleReads: true
redisOptions.connectTimeout: 2000
redisOptions.autoResendUnfulfilledCommands: false
redisOptions.reconnectOnError = function(err) {
    let targetError = 'READONLY';
    if (err.message.slice(0, targetError.length) === targetError) {
        // Only reconnect when the error starts with "READONLY".
        return 2;
    }
};
redisOptions.readOnly: true

I don't recall whether I was initially using the built-in clusterRetryStrategy and/or redisOptions.retryStrategy, but you can look at #586 for a related issue.
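For reference, the configuration listed above can be reassembled into a single options object. This is a sketch built from the values in this thread, not the reporter's original code; the fixed 3000 ms retryStrategy is one of the variants the reporter says was tried, and per the debug output earlier in the thread, ioredis still logged "retryStrategy is not a function" with a configuration like this.

```javascript
// Startup nodes as described in this thread's test setup.
const startupNodes = [
  { host: '127.0.0.1', port: 6400 },
  { host: '127.0.0.1', port: 6410 },
  { host: '127.0.0.1', port: 6420 },
];

// Non-default options, reassembled from the listing above.
const options = {
  enableOfflineQueue: false,
  scaleReads: true, // as reported in the thread
  redisOptions: {
    connectTimeout: 2000,
    autoResendUnfulfilledCommands: false,
    readOnly: true,
    // The reporter verified this IS a function, despite the
    // "retryStrategy is not a function" debug output.
    retryStrategy: function () { return 3000; },
    reconnectOnError: function (err) {
      const targetError = 'READONLY';
      if (err.message.slice(0, targetError.length) === targetError) {
        // Only reconnect when the error starts with "READONLY".
        return 2;
      }
    },
  },
};

// Constructing the client requires ioredis and a live cluster:
// const Redis = require('ioredis');
// const cluster = new Redis.Cluster(startupNodes, options);
```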

@luin
Collaborator

luin commented Jun 24, 2018

@ccs018 Hi, sorry for the really late response.

According to the official documentation (https://redis.io/topics/cluster-spec#clients-first-connection-and-handling-of-redirections), ioredis refreshes the slots only when a MOVED error is received. That means even when a node is down temporarily and not removed from the cluster, ioredis will disconnect from it and won't reconnect.

I'm wondering whether this should be a problem, since once a command that belongs to this node is sent, ioredis will receive a MOVED error, at which point it refreshes the slots information and reconnects to the lost node.

@jcstanaway
Author

jcstanaway commented Jun 25, 2018

MOVED is only applicable if slots were moved from one shard to another. There are many scenarios where there's an issue with the current implementation where there is never a MOVED error.

Several scenarios come to mind:

  1. Read-only slaves. If I want to scale my read performance and enable read operations on slave nodes, then when a slave node restarts (or, more likely, the machine it runs on restarts), the node is removed by ioredis and never reconnected, hurting my read performance.

  2. High availability. A slave node restarts and ioredis removes it from its cluster list. Some time later the master goes down and the previously restarted slave is elected master. The cluster slot configuration hasn't changed, so no MOVED error will ever be received, and the client will never be able to access that shard.

  3. Rolling upgrades. A slave is taken down to be upgraded and ioredis removes it from the list. The slave is restarted; next, the master is taken down to be upgraded and the previously upgraded slave is elected master. This is essentially the same as item 2 above.

@luin luin added the pinned label Jun 26, 2018
@jcstanaway
Author

So I think the discussion has gotten off the topic of the original issue. I did a bit more digging and found that in cluster mode, if a node disconnects (e.g., the redis server restarts), there is explicitly no attempt to reconnect to that node, even if the client specified a retryStrategy in options.redisOptions. Specifically, in connection_pool.js:

    redis = new Redis(_.defaults({
      // Never try to reconnect when a node is lost,
      // instead, waiting for a `MOVED` error and
      // fetch the slots again.
      retryStrategy: null,

I believe this is wrong. The connection could be lost due to a simple node restart (machine reboot, redis server upgrade, etc.). In these scenarios there would never be a MOVED error, so ioredis never reconnects to the node.

Further, the code overrides the client specified options and forces offline queuing to be enabled.

      // Offline queue should be enabled so that
      // we don't need to wait for the `ready` event
      // before sending commands to the node.
      enableOfflineQueue: true,

It's not clear why this needs to be overridden. In my use case, because I am using redis as a true cache, I have special handling for when operations to update the cache can't complete. Other clients may also be writing to the cache; if commands are queued up and later dequeued and executed, they could execute out of order, leaving the cache with wrong data.

The fact that these two options are being overridden is not documented. I'd like to see these overrides removed.
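To make the override concrete: lodash's `_.defaults` keeps keys that are already present on the first object (and `null` counts as present), so the hard-coded `retryStrategy: null` shadows whatever the user supplied. The `defaults` helper below is a minimal stand-in that mimics lodash's documented semantics, written here for illustration; it is not ioredis code.

```javascript
// Minimal stand-in for lodash's _.defaults: only keys that are
// undefined on the target get filled in from the sources, so a key
// explicitly set to null (like retryStrategy below) is kept as null.
function defaults(target, ...sources) {
  for (const source of sources) {
    for (const key of Object.keys(source)) {
      if (target[key] === undefined) {
        target[key] = source[key];
      }
    }
  }
  return target;
}

// Options the client passed in, as in this thread's configuration.
const userOptions = {
  retryStrategy: () => 3000,
  enableOfflineQueue: false,
};

// What connection_pool.js effectively computes, per the snippet above:
// the hard-coded overrides win regardless of what the user supplied.
const effective = defaults(
  {
    retryStrategy: null,      // reconnection disabled, user's function ignored
    enableOfflineQueue: true, // offline queue forced on
  },
  userOptions
);
```

With this merge order, `effective.retryStrategy` ends up `null` and `effective.enableOfflineQueue` ends up `true` no matter what `userOptions` contains, which is exactly the override behavior complained about above.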

@luin
Collaborator

luin commented Jul 7, 2018

@ccs018 The use cases you described won't be affected by setting enableOfflineQueue to true, since the offline queue only works when the connection is live or a retry attempt is in progress. When a node is lost, ioredis closes the connection immediately because retryStrategy is null, so no more commands will be added to the offline queue.

@jcstanaway
Author

The comment about enableOfflineQueue is more of a secondary issue. The primary problem is that retryStrategy is being set to null so that ioredis will not attempt to reconnect to that node when that node simply restarts. That needs to change. Assuming that there will eventually be a MOVED error is an invalid assumption.

@ischer

ischer commented Jan 8, 2020

Has there been a suitable workaround to this issue, or is a solution in progress? I am currently running into this issue, which has caused our custom scaleReads handler to drop nodes and only return the master. Since all of our reads must go to read replicas, this is not workable.

@ajinkyarajput

Is there any solution for a connected server becoming unreachable? I am facing the same issue with Redis Sentinels: when the connected master becomes unreachable, the client never retries to reconnect. Is PR #658 the solution for this scenario?

@headlessme
Contributor

We're seeing this issue when a node in the cluster is restarted without any changes to the slot distribution. Does anyone have a working solution?

@menocomp

menocomp commented Sep 1, 2022

Same issue here when a replica node restarts.
redis-cli shows the node as disconnected and then, a few seconds later, as connected; however, ioredis does not try to fetch the nodes again, only the slots. That looks like a bug to me.

@casret

casret commented Sep 27, 2022

We have major problems with redis not reconnecting when Redis Labs does maintenance on our cluster; I believe this is the underlying issue. Why hasn't any progress been made in 4 years?

@xtianus79

You may have to start a new issue @casret

@roma-glushko

This issue seems to be directly related to one I have recently filed: #1732

@roma-glushko

I have discovered a hack that could mitigate issues related to the current flaws in redis cluster topology updates:
#1732 (comment)
Hope it will save you some time.

No branches or pull requests

10 participants