
Redis Cluster: ioredis doesn't update cluster topology on redis instance restarts #1732

Open
roma-glushko opened this issue Mar 13, 2023 · 2 comments


roma-glushko commented Mar 13, 2023

When using ioredis to receive messages from a Redis Cluster deployment (deployed with the Bitnami Redis Cluster chart, for example), I have run into a situation where ioredis gets stuck with stale Redis instance IPs and cannot self-heal at all, effectively keeping the microservice that relies on it broken until a restart.

Redis instances in the redis cluster setup may get restarted for various reasons like:

  • Kubernetes does some bin-packing work and moves pods onto a new set of nodes
  • The cluster autoscaler has provisioned more nodes, which triggers some pod rescheduling in the cluster

After that, the Redis instances end up with completely different IPs than the ones they had when ioredis was initialized. ioredis doesn't seem to update the Redis Cluster topology, and in severe cases, when all Redis instances have been gradually replaced, ioredis can fall into an infinite retry loop trying to establish connections to IPs that have already been removed from the system, logging the following kinds of messages:

2023-03-13T12:38:06.338Z ioredis:cluster:subscriber subscriber has disconnected, selecting a new one...
2023-03-13T12:38:06.339Z ioredis:cluster:subscriber selected a subscriber 10.42.1.215:6379
2023-03-13T12:38:06.339Z ioredis:redis status[10.42.1.215:6379 (ioredis-cluster(subscriber))]: wait -> wait
2023-03-13T12:38:06.339Z ioredis:cluster:subscriber subscribe 9 channels
2023-03-13T12:38:06.339Z ioredis:redis status[10.42.1.215:6379 (ioredis-cluster(subscriber))]: wait -> connecting
2023-03-13T12:38:06.339Z ioredis:redis queue command[10.42.1.215:6379 (ioredis-cluster(subscriber))]: 0 -> subscribe('somechannels... <REDACTED full-length="209">')
2023-03-13T12:38:06.339Z ioredis:cluster:subscriber psubscribe 3 channels
2023-03-13T12:38:06.339Z ioredis:redis queue command[10.42.1.215:6379 (ioredis-cluster(subscriber))]: 0 -> psubscribe([ 'somechannels*' ])
2023-03-13T12:38:06.340Z ioredis:cluster:subscriber failed to subscribe 9 channels
2023-03-13T12:38:06.340Z ioredis:cluster:subscriber failed to psubscribe 3 channels
2023-03-13T12:38:08.340Z ioredis:AbstractConnector stream 10.42.0.177:6379 still open, destroying it
2023-03-13T12:38:08.340Z ioredis:AbstractConnector stream undefined:undefined still open, destroying it
2023-03-13T12:38:09.410Z ioredis:connection error: Error: connect EHOSTUNREACH 10.42.1.215:6379
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1278:16) {
  errno: -113,
  code: 'EHOSTUNREACH',
  syscall: 'connect',
  address: '10.42.1.215',
  port: 6379
}
2023-03-13T12:38:09.410Z ioredis:redis status[10.42.1.215:6379 (ioredis-cluster(subscriber))]: connecting -> close
2023-03-13T12:38:09.410Z ioredis:connection skip reconnecting because `retryStrategy` is not a function
2023-03-13T12:38:09.410Z ioredis:redis status[10.42.1.215:6379 (ioredis-cluster(subscriber))]: close -> end

The IP 10.42.1.215 was one of the IPs that ioredis reported as part of the cluster during initialization:

ioredis:cluster:connectionPool Reset with [
  { host: '10.42.1.184', port: 6379, readOnly: false },
  { host: '10.42.1.214', port: 6379, readOnly: true },
  { host: '10.42.0.178', port: 6379, readOnly: false },
  { host: '10.42.0.183', port: 6379, readOnly: true },
  { host: '10.42.0.177', port: 6379, readOnly: false },
  { host: '10.42.1.215', port: 6379, readOnly: true }
]

After the pod restarted, that IP became stale.

Steps to Reproduce

The behavior can be reproduced with these steps:

  • Make sure the microservice is up and running (so ioredis has been initialized)
  • Restart all Redis instances in the cluster except the one the microservice is using at that moment
  • Restart the Redis instance that ioredis was using
  • Observe that ioredis tries to find a live node across the previously known topology without luck, repeating the messages above

Client Config

Here is how I have my ioredis cluster client configured:

new Cluster(
    [{
      host: "redis",  // "redis" is a hostname provided by k8s ClusterIP service that points to all instances in the redis cluster 
      port: 6379
    }],
    {
      redisOptions: {
        host: "redis",  // "redis" is a hostname provided by k8s ClusterIP service that points to all instances in the redis cluster 
        port: 6379,
        username: "",
        password: "somepassw0rd",
        db: 0,
        maxRetriesPerRequest: 3,
        maxLoadingRetryTime: 5,  // secs
        retryStrategy,
        tls: {},
      },
      retryDelayOnMoved: 100,
      dnsLookup: (address, callback) => callback(null, address),
    }
);
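
The retryStrategy referenced above is a regular ioredis per-node retry callback defined elsewhere in my code; a minimal sketch (the linear backoff here is just an illustration, not necessarily the exact implementation):

const retryStrategy = (times) => {
  // ioredis calls this with the number of retry attempts so far and expects a
  // delay in milliseconds, or null/undefined to stop reconnecting to that node.
  return Math.min(times * 200, 2000);
};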

Is there a way to force ioredis to update its topology in such a situation?

@roma-glushko (Author)

Hey @luin 👋 do you happen to know if this is a well-known problem? What are the current assumptions ioredis makes around cluster topology updates?


roma-glushko commented Mar 20, 2023

Today, through trial and error, I was able to discover a hack that mitigates this issue.

So I got my listener into the endless retry loop described in the issue description:

2023-03-20T13:48:47.482Z ioredis:redis status[10.42.1.186:6379 (ioredis-cluster(subscriber))]: connecting -> close
2023-03-20T13:48:47.482Z ioredis:connection skip reconnecting because `retryStrategy` is not a function
2023-03-20T13:48:47.482Z ioredis:redis status[10.42.1.186:6379 (ioredis-cluster(subscriber))]: close -> end
2023-03-20T13:48:47.483Z ioredis:cluster:subscriber subscriber has disconnected, selecting a new one...
2023-03-20T13:48:47.483Z ioredis:cluster:subscriber selected a subscriber 10.42.0.149:6379
2023-03-20T13:48:47.483Z ioredis:redis status[10.42.0.149:6379 (ioredis-cluster(subscriber))]: wait -> wait
2023-03-20T13:48:47.484Z ioredis:cluster:subscriber subscribe 9 channels
2023-03-20T13:48:47.484Z ioredis:redis status[10.42.0.149:6379 (ioredis-cluster(subscriber))]: wait -> connecting
2023-03-20T13:48:47.484Z ioredis:redis queue command[10.42.0.149:6379 (ioredis-cluster(subscriber))]: 0 -> subscribe('<REDACTED> <REDACTED full-length="209">')
2023-03-20T13:48:47.484Z ioredis:cluster:subscriber psubscribe 3 channels
2023-03-20T13:48:47.484Z ioredis:redis queue command[10.42.0.149:6379 (ioredis-cluster(subscriber))]: 0 -> psubscribe([ '<REDACTED>' ])
2023-03-20T13:48:47.485Z ioredis:cluster:subscriber failed to subscribe 9 channels
2023-03-20T13:48:47.485Z ioredis:cluster:subscriber failed to psubscribe 3 channels
2023-03-20T13:48:49.484Z ioredis:AbstractConnector stream 10.42.0.148:6379 still open, destroying it
2023-03-20T13:48:49.484Z ioredis:AbstractConnector stream undefined:undefined still open, destroying it
2023-03-20T13:48:50.581Z ioredis:connection error: Error: connect EHOSTUNREACH 10.42.0.149:6379
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1278:16) {
  errno: -113,
  code: 'EHOSTUNREACH',
  syscall: 'connect',
  address: '10.42.0.149',
  port: 6379
}
2023-03-20T13:48:50.581Z ioredis:redis status[10.42.0.149:6379 (ioredis-cluster(subscriber))]: connecting -> close
2023-03-20T13:48:50.602Z ioredis:connection skip reconnecting because `retryStrategy` is not a function
2023-03-20T13:48:50.602Z ioredis:redis status[10.42.0.149:6379 (ioredis-cluster(subscriber))]: close -> end
2023-03-20T13:48:50.603Z ioredis:cluster:subscriber subscriber has disconnected, selecting a new one...
2023-03-20T13:48:50.603Z ioredis:cluster:subscriber selected a subscriber 10.42.0.150:6379
2023-03-20T13:48:50.603Z ioredis:redis status[10.42.0.150:6379 (ioredis-cluster(subscriber))]: wait -> wait
2023-03-20T13:48:50.603Z ioredis:cluster:subscriber subscribe 9 channels
2023-03-20T13:48:50.604Z ioredis:redis status[10.42.0.150:6379 (ioredis-cluster(subscriber))]: wait -> connecting
2023-03-20T13:48:50.604Z ioredis:redis queue command[10.42.0.150:6379 (ioredis-cluster(subscriber))]: 0 -> subscribe('<REDACTED>... <REDACTED full-length="209">')
2023-03-20T13:48:50.604Z ioredis:cluster:subscriber psubscribe 3 channels
2023-03-20T13:48:50.604Z ioredis:redis queue command[10.42.0.150:6379 (ioredis-cluster(subscriber))]: 0 -> psubscribe([ '<REDACTED>' ])
2023-03-20T13:48:50.605Z ioredis:cluster:subscriber failed to subscribe 9 channels
2023-03-20T13:48:50.605Z ioredis:cluster:subscriber failed to psubscribe 3 channels
2023-03-20T13:48:52.605Z ioredis:AbstractConnector stream 10.42.0.148:6379 still open, destroying it
2023-03-20T13:48:52.605Z ioredis:AbstractConnector stream undefined:undefined still open, destroying it
2023-03-20T13:48:53.717Z ioredis:connection error: Error: connect EHOSTUNREACH 10.42.0.150:6379
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1278:16) {
  errno: -113,
  code: 'EHOSTUNREACH',
  syscall: 'connect',
  address: '10.42.0.150',
  port: 6379
}

ioredis won't recover from this state by itself, but if you happen to send a ping() command, the situation becomes interesting. At some point, the ioredis-cluster(refresher) connection chimes in and starts refreshing the cluster topology:

2023-03-20T13:56:27.697Z ioredis:redis status[10.42.1.185:6379]: connecting -> close
2023-03-20T13:56:27.697Z ioredis:connection skip reconnecting because `retryStrategy` is not a function
2023-03-20T13:56:27.697Z ioredis:redis status[10.42.1.185:6379]: close -> end
2023-03-20T13:56:27.697Z ioredis:cluster:connectionPool Remove 10.42.1.185:6379 from the pool
2023-03-20T13:56:27.797Z ioredis:cluster getting slot cache from 10.42.1.186:6379
2023-03-20T13:56:27.798Z ioredis:redis status[10.42.1.186:6379 (ioredis-cluster(refresher))]: wait -> wait
2023-03-20T13:56:27.798Z ioredis:redis status[10.42.1.186:6379 (ioredis-cluster(refresher))]: wait -> connecting
2023-03-20T13:56:27.798Z ioredis:redis queue command[10.42.1.186:6379 (ioredis-cluster(refresher))]: 0 -> cluster([ 'SLOTS' ])
2023-03-20T13:56:28.697Z ioredis:delayqueue send 1 commands in failover queue
2023-03-20T13:56:28.697Z ioredis:redis status[10.42.1.186:6379]: wait -> connecting
2023-03-20T13:56:28.698Z ioredis:redis queue command[10.42.1.186:6379]: 0 -> ping([])
2023-03-20T13:56:28.719Z ioredis:AbstractConnector stream undefined:undefined still open, destroying it
2023-03-20T13:56:28.719Z ioredis:redis status[10.42.1.186:6379 (ioredis-cluster(refresher))]: connecting -> close
2023-03-20T13:56:28.719Z ioredis:connection skip reconnecting since the connection is manually closed.
2023-03-20T13:56:28.719Z ioredis:redis status[10.42.1.186:6379 (ioredis-cluster(refresher))]: close -> end
2023-03-20T13:56:28.799Z ioredis:delayqueue send 1 commands in failover queue
2023-03-20T13:56:28.799Z ioredis:redis status[10.42.1.186:6379]: wait -> connecting
2023-03-20T13:56:28.800Z ioredis:redis queue command[10.42.1.186:6379]: 0 -> ping([])
2023-03-20T13:56:28.956Z ioredis:AbstractConnector stream undefined:undefined still open, destroying it
2023-03-20T13:56:28.956Z ioredis:redis status[10.42.1.186:6379 (ioredis-cluster(refresher))]: connecting -> close
2023-03-20T13:56:28.957Z ioredis:connection skip reconnecting since the connection is manually closed.
2023-03-20T13:56:28.957Z ioredis:redis status[10.42.1.186:6379 (ioredis-cluster(refresher))]: close -> end
2023-03-20T13:56:29.691Z ioredis:AbstractConnector stream 10.42.0.148:6379 still open, destroying it
2023-03-20T13:56:29.691Z ioredis:AbstractConnector stream undefined:undefined still open, destroying it
2023-03-20T13:56:29.694Z ioredis:AbstractConnector stream undefined:undefined still open, destroying it
2023-03-20T13:56:30.696Z ioredis:AbstractConnector stream undefined:undefined still open, destroying it
2023-03-20T13:56:30.696Z ioredis:redis status[10.42.1.185:6379 (ioredis-cluster(refresher))]: connecting -> close
2023-03-20T13:56:30.696Z ioredis:connection skip reconnecting since the connection is manually closed.
2023-03-20T13:56:30.696Z ioredis:redis status[10.42.1.185:6379 (ioredis-cluster(refresher))]: close -> end
2023-03-20T13:56:30.800Z ioredis:AbstractConnector stream undefined:undefined still open, destroying it
2023-03-20T13:56:30.800Z ioredis:redis status[10.42.1.186:6379 (ioredis-cluster(refresher))]: connecting -> close
2023-03-20T13:56:30.800Z ioredis:connection skip reconnecting since the connection is manually closed.
2023-03-20T13:56:30.800Z ioredis:redis status[10.42.1.186:6379 (ioredis-cluster(refresher))]: close -> end
2023-03-20T13:56:31.912Z ioredis:connection error: Error: connect EHOSTUNREACH 10.42.1.186:6379
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1278:16) {
  errno: -113,
  code: 'EHOSTUNREACH',
  syscall: 'connect',
  address: '10.42.1.186',
  port: 6379
}
2023-03-20T13:56:31.912Z ioredis:redis status[10.42.1.186:6379]: connecting -> close
2023-03-20T13:56:31.912Z ioredis:connection skip reconnecting because `retryStrategy` is not a function
2023-03-20T13:56:31.912Z ioredis:redis status[10.42.1.186:6379]: close -> end
2023-03-20T13:56:31.912Z ioredis:cluster:connectionPool Remove 10.42.1.186:6379 from the pool
2023-03-20T13:56:31.912Z ioredis:cluster:subscriber subscriber has left, selecting a new one...
2023-03-20T13:56:31.912Z ioredis:redis status[10.42.1.186:6379 (ioredis-cluster(subscriber))]: wait -> close
2023-03-20T13:56:31.912Z ioredis:connection skip reconnecting since the connection is manually closed.
2023-03-20T13:56:31.913Z ioredis:redis status[10.42.1.186:6379 (ioredis-cluster(subscriber))]: close -> end
2023-03-20T13:56:31.913Z ioredis:cluster:subscriber selecting subscriber failed since there is no node discovered in the cluster yet
2023-03-20T13:56:31.913Z ioredis:cluster status: ready -> close
2023-03-20T13:56:31.914Z ioredis:cluster status: close -> reconnecting
2023-03-20T13:56:32.012Z ioredis:delayqueue send 1 commands in failover queue
2023-03-20T13:56:33.916Z ioredis:cluster Cluster is disconnected. Retrying after 2000ms
2023-03-20T13:56:33.916Z ioredis:cluster status: reconnecting -> connecting
2023-03-20T13:56:33.920Z ioredis:cluster resolved hostname redis.notebooks-management to IP 10.43.125.49
2023-03-20T13:56:33.922Z ioredis:cluster:connectionPool Reset with [ { host: '10.43.125.49', port: 6379 } ]
2023-03-20T13:56:33.922Z ioredis:cluster:connectionPool Connecting to 10.43.125.49:6379 as master
2023-03-20T13:56:33.922Z ioredis:redis status[10.43.125.49:6379]: wait -> wait
2023-03-20T13:56:33.923Z ioredis:cluster:subscriber a new node is discovered and there is no subscriber, selecting a new one...
2023-03-20T13:56:33.923Z ioredis:cluster:subscriber selected a subscriber 10.43.125.49:6379
2023-03-20T13:56:33.923Z ioredis:redis status[10.43.125.49:6379 (ioredis-cluster(subscriber))]: wait -> wait
2023-03-20T13:56:33.923Z ioredis:cluster getting slot cache from 10.43.125.49:6379
2023-03-20T13:56:33.923Z ioredis:redis status[10.43.125.49:6379 (ioredis-cluster(refresher))]: wait -> wait
2023-03-20T13:56:33.924Z ioredis:redis status[10.43.125.49:6379 (ioredis-cluster(refresher))]: wait -> connecting
2023-03-20T13:56:33.924Z ioredis:redis queue command[10.43.125.49:6379 (ioredis-cluster(refresher))]: 0 -> cluster([ 'SLOTS' ])
2023-03-20T13:56:33.924Z ioredis:redis status[10.43.125.49:6379 (ioredis-cluster(subscriber))]: wait -> close
2023-03-20T13:56:33.924Z ioredis:connection skip reconnecting since the connection is manually closed.
2023-03-20T13:56:33.924Z ioredis:redis status[10.43.125.49:6379 (ioredis-cluster(subscriber))]: close -> end
2023-03-20T13:56:33.924Z ioredis:cluster:subscriber selected a subscriber 10.43.125.49:6379
2023-03-20T13:56:33.924Z ioredis:redis status[10.43.125.49:6379 (ioredis-cluster(subscriber))]: wait -> wait
2023-03-20T13:56:33.924Z ioredis:cluster:subscriber started
2023-03-20T13:56:33.925Z ioredis:redis status[10.43.125.49:6379 (ioredis-cluster(refresher))]: connecting -> connect
2023-03-20T13:56:33.925Z ioredis:redis write command[10.43.125.49:6379 (ioredis-cluster(refresher))]: 0 -> auth([ '<REDACTED>' ])
2023-03-20T13:56:33.926Z ioredis:redis status[10.43.125.49:6379 (ioredis-cluster(refresher))]: connect -> ready
2023-03-20T13:56:33.926Z ioredis:connection set the connection name [ioredis-cluster(refresher)]
2023-03-20T13:56:33.926Z ioredis:redis write command[10.43.125.49:6379 (ioredis-cluster(refresher))]: 0 -> client([ 'setname', 'ioredis-cluster(refresher)' ])
2023-03-20T13:56:33.926Z ioredis:connection send 1 commands in offline queue
2023-03-20T13:56:33.926Z ioredis:redis write command[10.43.125.49:6379 (ioredis-cluster(refresher))]: 0 -> cluster([ 'SLOTS' ])
2023-03-20T13:56:33.928Z ioredis:cluster cluster slots result count: 3
2023-03-20T13:56:33.928Z ioredis:cluster cluster slots result [0]: slots 0~5460 served by [ '10.42.1.197:6379', '10.42.1.195:6379' ]
2023-03-20T13:56:33.928Z ioredis:cluster cluster slots result [1]: slots 5461~10922 served by [ '10.42.0.162:6379', '10.42.0.160:6379' ]
2023-03-20T13:56:33.929Z ioredis:cluster cluster slots result [2]: slots 10923~16383 served by [ '10.42.0.161:6379', '10.42.1.196:6379' ]
2023-03-20T13:56:33.949Z ioredis:cluster:connectionPool Reset with [
  { host: '10.42.1.197', port: 6379, readOnly: false },
  { host: '10.42.1.195', port: 6379, readOnly: true },
  { host: '10.42.0.162', port: 6379, readOnly: false },
  { host: '10.42.0.160', port: 6379, readOnly: true },
  { host: '10.42.0.161', port: 6379, readOnly: false },
  { host: '10.42.1.196', port: 6379, readOnly: true }
]
2023-03-20T13:56:33.950Z ioredis:cluster:connectionPool Disconnect 10.43.125.49:6379 because the node does not hold any slot
2023-03-20T13:56:33.950Z ioredis:redis status[10.43.125.49:6379]: wait -> close
2023-03-20T13:56:33.950Z ioredis:connection skip reconnecting since the connection is manually closed.
2023-03-20T13:56:33.950Z ioredis:redis status[10.43.125.49:6379]: close -> end
2023-03-20T13:56:33.950Z ioredis:cluster:connectionPool Remove 10.43.125.49:6379 from the pool
2023-03-20T13:56:33.950Z ioredis:cluster:connectionPool Connecting to 10.42.1.197:6379 as master
2023-03-20T13:56:33.950Z ioredis:redis status[10.42.1.197:6379]: wait -> wait
2023-03-20T13:56:33.950Z ioredis:cluster:connectionPool Connecting to 10.42.1.195:6379 as slave
2023-03-20T13:56:33.951Z ioredis:redis status[10.42.1.195:6379]: wait -> wait
2023-03-20T13:56:33.951Z ioredis:cluster:connectionPool Connecting to 10.42.0.162:6379 as master
2023-03-20T13:56:33.951Z ioredis:redis status[10.42.0.162:6379]: wait -> wait
2023-03-20T13:56:33.951Z ioredis:cluster:connectionPool Connecting to 10.42.0.160:6379 as slave
2023-03-20T13:56:33.951Z ioredis:redis status[10.42.0.160:6379]: wait -> wait
2023-03-20T13:56:33.951Z ioredis:cluster:connectionPool Connecting to 10.42.0.161:6379 as master
2023-03-20T13:56:33.951Z ioredis:redis status[10.42.0.161:6379]: wait -> wait
2023-03-20T13:56:33.952Z ioredis:cluster:connectionPool Connecting to 10.42.1.196:6379 as slave
2023-03-20T13:56:33.952Z ioredis:redis status[10.42.1.196:6379]: wait -> wait
2023-03-20T13:56:33.952Z ioredis:cluster status: connecting -> connect
2023-03-20T13:56:33.952Z ioredis:redis status[10.42.0.161:6379]: wait -> connecting
2023-03-20T13:56:33.952Z ioredis:redis queue command[10.42.0.161:6379]: 0 -> cluster([ 'INFO' ])
2023-03-20T13:56:33.953Z ioredis:cluster:subscriber subscriber has left, selecting a new one...
2023-03-20T13:56:33.953Z ioredis:redis status[10.43.125.49:6379 (ioredis-cluster(subscriber))]: wait -> close
2023-03-20T13:56:33.953Z ioredis:connection skip reconnecting since the connection is manually closed.
2023-03-20T13:56:33.953Z ioredis:redis status[10.43.125.49:6379 (ioredis-cluster(subscriber))]: close -> end
2023-03-20T13:56:33.953Z ioredis:cluster:subscriber selected a subscriber 10.42.1.196:6379
2023-03-20T13:56:33.953Z ioredis:redis status[10.42.1.196:6379 (ioredis-cluster(subscriber))]: wait -> wait
2023-03-20T13:56:33.955Z ioredis:redis status[10.42.0.161:6379]: connecting -> connect
2023-03-20T13:56:33.956Z ioredis:redis write command[10.42.0.161:6379]: 0 -> auth([ '<REDACTED>' ])
2023-03-20T13:56:33.956Z ioredis:redis write command[10.42.0.161:6379]: 0 -> info([])
2023-03-20T13:56:33.957Z ioredis:redis status[10.43.125.49:6379 (ioredis-cluster(refresher))]: ready -> close
2023-03-20T13:56:33.957Z ioredis:connection skip reconnecting since the connection is manually closed.
2023-03-20T13:56:33.957Z ioredis:redis status[10.43.125.49:6379 (ioredis-cluster(refresher))]: close -> end
2023-03-20T13:56:33.959Z ioredis:redis status[10.42.0.161:6379]: connect -> ready
2023-03-20T13:56:33.959Z ioredis:connection send 1 commands in offline queue
2023-03-20T13:56:33.960Z ioredis:redis write command[10.42.0.161:6379]: 0 -> cluster([ 'INFO' ])
2023-03-20T13:56:33.961Z ioredis:cluster status: connect -> ready
2023-03-20T13:56:33.961Z ioredis:cluster send 1 commands in offline queue
2023-03-20T13:56:33.961Z ioredis:redis status[10.42.1.197:6379]: wait -> connecting
2023-03-20T13:56:33.961Z ioredis:redis queue command[10.42.1.197:6379]: 0 -> ping([])
2023-03-20T13:56:33.962Z ioredis:redis status[10.42.1.197:6379]: connecting -> connect
2023-03-20T13:56:33.962Z ioredis:redis write command[10.42.1.197:6379]: 0 -> auth([ '<REDACTED>' ])
2023-03-20T13:56:33.963Z ioredis:redis write command[10.42.1.197:6379]: 0 -> info([])
2023-03-20T13:56:33.964Z ioredis:redis status[10.42.1.197:6379]: connect -> ready
2023-03-20T13:56:33.964Z ioredis:connection send 1 commands in offline queue
2023-03-20T13:56:33.964Z ioredis:redis write command[10.42.1.197:6379]: 0 -> ping([])

With the updated topology, ioredis is able to work with the Redis Cluster again. I have put this ping command into my health probe like this:

try {
  await redisClient.ping();
  res.status(200).send({ healthy: true });
} catch {
  res.status(503).send({ healthy: false, message: 'Redis cluster is dead' });
}

Given that Kubernetes calls health probes periodically, this can actually mitigate the problem to a good extent.
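
For setups without an HTTP health endpoint, the same nudge can presumably be achieved with a periodic ping; a minimal sketch (I have not tested this variant, and the interval is an arbitrary choice):

setInterval(() => {
  // Running any command gives ioredis a chance to hit a dead node and trigger
  // its topology refresh, the same way the ping() in the health probe does.
  redisClient.ping().catch((err) => {
    console.warn('Redis Cluster ping failed; ioredis should refresh its topology', err);
  });
}, 30000);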

I'm surprised this hack was not suggested in the numerous related tickets I came across while looking into this problem.

Hope this will help you save some time.
