
"gocql: no hosts available in the pool" cannot recover without restart #915

Closed
zhixinwen opened this issue May 23, 2017 · 49 comments

@zhixinwen
Contributor

We had a cluster degrade due to increased traffic load, and on the client side we saw the "gocql: no hosts available in the pool" error for ~80% of requests.

The error rate was very consistent, and the C* cluster was healthy apart from the load. We tried to reduce the load on the cluster, but the errors persisted for two hours.

We restarted the client, and the errors went away immediately.

One interesting detail: after we restarted the gocql client, we also saw an immediate drop in coordinator latency and traffic on the C* cluster (yes, the client restarted first, then the latency and traffic on C* dropped, not the other way around). It may be because upstream users stopped retrying once we restarted the server and no longer saw "gocql: no hosts available in the pool", but we are not sure about the root cause yet.

@zhixinwen
Contributor Author

After we close a connection due to too many timeouts, when will the connection be re-established?

I do see gocql waiting for an event from the control connection before reconnecting. But if the node is never perceived as down by the cluster and it never emits a 'node up' event, will gocql ever try to recreate the connections?

@zhixinwen
Contributor Author

@Zariel gocql closes a connection when it takes too long to respond; when will that connection be re-established?

I may be missing some of the code, but it looks like it will only reconnect when it receives a "Node Up" event?

@Zariel
Contributor

Zariel commented Jun 27, 2017

It should periodically try to dial nodes that are down; if you can reproduce and build with gocql_debug you should see more info about the pool state.

Also if all the hosts are down, then the host list should be reset to the initial hosts and they will be attempted.

@zhixinwen
Contributor Author

zhixinwen commented Aug 9, 2017

I think the root cause of the problem here is:

  1. When all connections to a host are closed, the host is not marked as down, so ReconnectInterval does not help.
  2. HandleError in hostConnPool tries to refill the connection pool when a connection is closed, but during an outage that refill is very likely to fail, and it does not retry afterward. As a result, once the cluster recovers after all the connections are closed, gocql never attempts to rebuild the connection pool.

To fix this, we need to mark the host as DOWN when we close all the connections to that host.

@balta2ar

I simulated network connectivity issues using iptables:

sudo iptables -I OUTPUT -p tcp -d 127.0.0.1 --dport 9042 -j DROP

After that, queries start to return this error message: gocql: no response received from cassandra within timeout period
and soon after, this one: gocql: no hosts available in the pool

After deleting the rule (sudo iptables -D OUTPUT 1), gocql indeed does not recover automatically.

@Zariel
Contributor

Zariel commented Aug 28, 2017

Running a single node of Cassandra (on a Mac, so excuse the lack of iptables), I can query the host as expected while it's running. After kill -9 on the process, gocql receives no host-down notification, but it discovers the control connection is dead and marks the host down. After restarting Cassandra the driver reconnects and queries continue. Repeated with STOP/CONT with the same results.

Can you rerun your test with the gocql_debug build tag?
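
(For reference, that means compiling the client with the gocql_debug build tag, e.g. go build -tags gocql_debug, which turns on gocql's debug logging of pool and host state.)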

@zhixinwen
Contributor Author

I think kill -9 is not an accurate simulation; @balta2ar's simulation seems closer to the issue I see.

@Zariel
Contributor

Zariel commented Feb 14, 2018

Do you know of a way to replicate this on OSX, or reliably for regression tests?

@robdefeo

When this error occurs I have had to cycle the container, as there is no good way to reconnect after this event. There must be a better way; this error has caused so many problems.

@engin

engin commented Jun 11, 2018

It seems that setting ClusterConfig.ReconnectInterval > 0 has no effect on this.

@shettyh

shettyh commented Feb 21, 2019

+1

Any fixes for this issue? In a single-node Cassandra setup this issue is consistently reproducible.

@Zariel

@steebchen

Constantly seeing this issue on a single-node Cassandra cluster. cqlsh works fine 100% of the time, while gocql seems to take a long time connecting to the cluster, often resulting in the "gocql: no hosts available in the pool" error.

@rlk833
Contributor

rlk833 commented Jun 5, 2019

We have this problem too, and have had it since at least October 2018. I had to put in special code to detect it and cause the application to be recycled. Our apps call gocql under very heavy load (millions of requests an hour). For a long time the recycle occurred once a month or so, but the load has increased recently and it now happens every day or even more often. We have 4 instances running, and it is not always all 4 instances at the same time. Out of nowhere, with no preceding errors logged, we suddenly get this in a running instance. The only fix is to restart the instance. Occasionally other instances will shortly run into the problem too and need restarting. After a restart there is no problem at all connecting to the nodes.

The main problem for us is that during the detection and restart period there is a significant slowdown in processing, as the instance is not available.

erebe pushed a commit to criteo-forks/rend that referenced this issue Apr 30, 2020
  + Remove the global singleton by passing around context when possible and by using a closure for the Rend Cassandra handler creation. The purpose is not to recreate a new connection toward Cassandra each time this handler is invoked.

  + Move away from the flaky unlogged batch for SETs. The problem is that managing the size of the batch is tricky, as it depends not on the number of elements but on the byte size of the batch.

  + Instead of relying on a single buffer for the batched SETs, use a fan-out approach where goroutines are responsible for sending single SET commands toward Cassandra. If performance decreases we can re-use the unlogged batch, but with a much smaller SET buffer, thus avoiding the problem above.

  + Made shutdown of Memendra thread safe.

  + Add a custom ConvictionPolicy to avoid apache/cassandra-gocql-driver#915.

  + Prepare statements in the Cassandra context to avoid allocating a new string on every request with fmt.format().

  + Replace Bind(...) calls with Query in the new gocql version (simplifies the code).
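
The custom ConvictionPolicy mentioned in that commit isn't shown in this thread; a minimal sketch of the general idea (a policy that never convicts a host, so gocql keeps the host in the pool and keeps retrying it) could look like the following. The type name is made up, and this is a workaround sketch under that assumption, not the driver's own fix.

import (
	"log"

	"github.com/gocql/gocql"
)

// neverConvict is a hypothetical ConvictionPolicy that never marks a host as
// DOWN on failures, so the pool keeps the host and keeps retrying it.
type neverConvict struct{}

// AddFailure is called by gocql on host errors; returning false means
// "do not convict (mark down) this host".
func (neverConvict) AddFailure(err error, host *gocql.HostInfo) bool {
	log.Printf("host %s failed: %v (keeping it in the pool)", host.ConnectAddress(), err)
	return false
}

// Reset clears any conviction state; nothing to track here.
func (neverConvict) Reset(host *gocql.HostInfo) {}

// Usage:
//   cluster := gocql.NewCluster("127.0.0.1")
//   cluster.ConvictionPolicy = neverConvict{}

Never convicting hosts trades one failure mode for another (queries keep being routed to a host that may really be gone), which is presumably why it is used here only as a workaround.
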
@clarakosi

I ran into a similar issue a couple of days ago and followed @balta2ar's instructions to try to recreate the situation.

Surprisingly, gocql reconnected after removing the rule (sudo iptables -D OUTPUT 1).

Here are the logs:

{"msg":"Error reading from storage (gocql: no response received from cassandra within timeout period)","appname":"kask","time":"2020-05-21T18:05:40-04:00","level"
:"ERROR","request_id":"00000000-0000-0000-0000-000000000000"}
{"msg":"gocql: Session.handleNodeDown: 127.0.0.1:9042","appname":"kask","time":"2020-05-21T18:05:47-04:00","level":"WARNING"}
{"msg":"Error reading from storage (gocql: no hosts available in the pool)","appname":"kask","time":"2020-05-21T18:05:47-04:00","level":"ERROR","request_id":"0000
0000-0000-0000-0000-000000000000"}
{"msg":"Error reading from storage (gocql: no hosts available in the pool)","appname":"kask","time":"2020-05-21T18:05:50-04:00","level":"ERROR","request_id":"0000
0000-0000-0000-0000-000000000000"}
{"msg":"Error reading from storage (gocql: no hosts available in the pool)","appname":"kask","time":"2020-05-21T18:05:51-04:00","level":"ERROR","request_id":"0000
0000-0000-0000-0000-000000000000"}
{"msg":"gocql: Session.handleNodeDown: 127.0.0.1:9042","appname":"kask","time":"2020-05-21T18:05:58-04:00","level":"WARNING"}
{"msg":"gocql: Session.handleNodeDown: 127.0.0.1:9042","appname":"kask","time":"2020-05-21T18:06:09-04:00","level":"WARNING"}
{"msg":"gocql: Session.handleNodeDown: 127.0.0.1:9042","appname":"kask","time":"2020-05-21T18:06:20-04:00","level":"WARNING"}
{"msg":"gocql: Session.handleNodeUp: 127.0.0.1:9042","appname":"kask","time":"2020-05-21T18:06:23-04:00","level":"WARNING"}

Any other ideas on how to replicate this issue?

@SubNader

SubNader commented Oct 1, 2020

Running into this while using gocql to write to AWS Keyspaces.
Any updates on this?

@mattmassicotte

I'm also experiencing this issue extremely frequently with AWS Keyspaces. It appears to be entirely due to the state of the process: other processes can access Keyspaces without issue, and restarting an affected process immediately resolves the issue. I'm up for helping to debug however I can.

@Draal

Draal commented Apr 9, 2021

It seems like it happens because of
https://github.com/gocql/gocql/blob/390af0fb2915aecc9d18df7afa11501fc2d4c9c5/connectionpool.go#L248
and then we add the host back only at
https://github.com/gocql/gocql/blob/390af0fb2915aecc9d18df7afa11501fc2d4c9c5/events.go#L187-L187
which, if I understand correctly, happens only when a new node is starting, but not when a node temporarily lost network connectivity or rejected requests because of overload.

@martin-sucha
Contributor

There is pool.addHost in reconnectDownedHosts, which periodically re-adds nodes that are marked as down. However, we set the host state to down only when we get such an event from the database, which does not occur if only gocql lost its connections (or closed them due to errors).
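
For context, that periodic loop is driven by ClusterConfig.ReconnectInterval; a minimal configuration sketch (addresses are placeholders):

cluster := gocql.NewCluster("10.0.0.1", "10.0.0.2", "10.0.0.3")
// reconnectDownedHosts wakes up on this interval, but it only retries hosts
// whose state is DOWN; hosts that merely lost all their connections without
// being marked down are not picked up by it.
cluster.ReconnectInterval = 10 * time.Second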

@mrwonko

mrwonko commented Apr 22, 2021

I'm trying to reproduce this using docker-compose on Mac. I'm using the alpine image for my client, installed iptables via apk --update add iptables, then added

    cap_add:
      - NET_ADMIN

to my docker-compose.yaml so I can use iptables inside the container. I have a small Go program that does SELECT requests in a loop.
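
The program itself isn't included above; a minimal sketch of such a SELECT loop (the contact point, keyspace, table, and column names are placeholders) could look like this:

package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("cassandra") // docker-compose service name (assumption)
	cluster.Keyspace = "test"                // placeholder keyspace
	cluster.Timeout = 2 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("create session: %v", err)
	}
	defer session.Close()

	for {
		var value string
		// Placeholder table/column; any cheap SELECT works for this repro.
		if err := session.Query(`SELECT data FROM demo LIMIT 1`).Scan(&value); err != nil {
			log.Printf("SELECT failed: %v", err)
		} else {
			log.Printf("SELECTed: %q", value)
		}
		time.Sleep(time.Second)
	}
}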

When I execute iptables -I OUTPUT -p tcp --dport 9042 -j DROP, this happens:

client_1     | 2021/04/22 15:08:22 SELECTed: "data"
client_1     | 2021/04/22 15:08:23 SELECTed: "data"
client_1     | 2021/04/22 15:08:25 SELECT failed: gocql: no response received from cassandra within timeout period
client_1     | 2021/04/22 15:08:26 gocql: Session.handleNodeDown: 172.22.0.2:9042
client_1     | 2021/04/22 15:08:26 SELECT failed: gocql: connection closed waiting for response
client_1     | 2021/04/22 15:08:26 SELECT failed: gocql: no hosts available in the pool
client_1     | 2021/04/22 15:08:27 SELECT failed: gocql: no hosts available in the pool
client_1     | 2021/04/22 15:08:28 gocql: Session.handleNodeDown: 172.22.0.2:9042
client_1     | 2021/04/22 15:08:28 SELECT failed: gocql: no hosts available in the pool
client_1     | 2021/04/22 15:08:29 SELECT failed: gocql: no hosts available in the pool

So far, so good. Occasionally, an automatic reconnect is apparently attempted:

client_1     | 2021/04/22 15:09:20 connection failed "172.22.0.2": dial tcp 172.22.0.2:9042: i/o timeout, reconnecting with *gocql.ConstantReconnectionPolicy

But when I then remove the rule via iptables -D OUTPUT 1, gocql recovers immediately:

client_1     | 2021/04/22 15:10:28 SELECT failed: gocql: no hosts available in the pool
client_1     | 2021/04/22 15:10:29 gocql: Session.handleNodeDown: 172.22.0.2:9042
client_1     | 2021/04/22 15:10:29 gocql: Session.handleNodeUp: 172.22.0.2:9042
client_1     | 2021/04/22 15:10:29 SELECTed: "data"
client_1     | 2021/04/22 15:10:30 SELECTed: "data"

However, I know the issue is not entirely fixed, because we recently had an outage due to it. I just don't know how to reproduce it.

@justinfx

justinfx commented Nov 7, 2022

@player1-github my last reply was 10 months ago, after the proposed fix was merged. I had tested my reported issue using the latest commits. It would be great if someone could confirm my report on the Yugabyte issue.

@PierreF

PierreF commented Feb 8, 2023

This issue looks like the same one I had: it happens if the initial set of Cassandra IP addresses changes.

This happens easily on Kubernetes: with a rolling upgrade of Cassandra, each Cassandra node is restarted with a new IP address. Once the last node is restarted, all the initial IPs have changed and gocql fails with "no hosts available in the pool".

I've created a minimal case to reproduce this issue in the following repository using Docker compose: https://github.com/PierreF/gocql-cluster-issue

This repository has:

  • A Cassandra cluster of 3 nodes.
  • It restarts one Cassandra node after another, so at least two nodes are always up.
  • A Python client that has no issue, so this rolling restart should be valid.

I've found some other issues that look like duplicates of this one.

@PierreF

PierreF commented Feb 8, 2023

Inspired by the following comment, I've tested a workaround which fixes my issue: PierreF/gocql-cluster-issue@a7a6db2

@justinfx

justinfx commented Feb 8, 2023

@PierreF that workaround looks like a simplified version of what I commented a year ago, earlier in this thread: yugabyte/yugabyte-db#10182 (comment)

It is pretty ugly and needs to be a bit more careful about guarding concurrent access to the session from queries that are in flight. I wouldn't really consider it a great workaround; it's more like a band-aid.
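
Neither workaround is reproduced in this thread; a rough sketch of the band-aid idea (rebuild the session when the pool reports no hosts, with a lock guarding concurrent access) might look like the following. All names are made up, and it assumes gocql exposes the "no hosts available in the pool" error as gocql.ErrNoConnections.

import (
	"errors"
	"sync"

	"github.com/gocql/gocql"
)

// reconnectingSession is a hypothetical wrapper that throws away and rebuilds
// the gocql session when the connection pool reports that no hosts are left.
type reconnectingSession struct {
	mu      sync.RWMutex
	cluster *gocql.ClusterConfig
	session *gocql.Session
}

func (r *reconnectingSession) exec(stmt string, values ...interface{}) error {
	r.mu.RLock()
	s := r.session
	r.mu.RUnlock()

	err := s.Query(stmt, values...).Exec()
	if !errors.Is(err, gocql.ErrNoConnections) {
		return err
	}

	// The pool is empty: rebuild the whole session once and retry the query.
	r.mu.Lock()
	if r.session == s { // another goroutine may already have rebuilt it
		if ns, cerr := r.cluster.CreateSession(); cerr == nil {
			s.Close()
			r.session = ns
		}
	}
	s = r.session
	r.mu.Unlock()

	return s.Query(stmt, values...).Exec()
}

As noted above, this shifts the problem to correctly handling queries still in flight on the old session, so it is best treated as a last resort rather than a fix.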

@sseidman
Contributor

sseidman commented Feb 9, 2023

@PierreF I'd be curious if using both #1670 and #1669 would solve your issue. I've had positive results using both changes as a workaround

@mfamador

mfamador commented Feb 11, 2023

@PierreF I'd be curious if using both #1670 and #1669 would solve your issue. I've had positive results using both changes as a workaround

I've used both changes in those PRs and they didn't solve my issue, at least the way I'm testing it.
We've been hitting this issue often on Azure CosmosDB with the Cassandra interface, and it's hard to reproduce.
The way I'm testing it is by running a local Cassandra container and removing it while my consumer is working (if I just stop and start it there's no issue, it reconnects immediately). When recreating the container it won't reconnect unless I use the "band-aid" from @PierreF, which is working for me.

@PierreF

PierreF commented May 2, 2023

If the test case I've done in #915 (comment) (https://github.com/PierreF/gocql-cluster-issue) is the issue @zhixinwen reported, then #1682 fixes this issue.

At least, after upgrading to gocql v1.4.0 I no longer have the issue.

@josuebrunel

josuebrunel commented Jun 20, 2023

Still having the same issue with Azure Cosmos DB Cassandra Interface (private endpoint), even after upgrading to v1.5.2

@amcrn

amcrn commented Sep 7, 2023

For those who are still experiencing this issue, can you confirm whether you're using DisableInitialHostLookup or not? See #1721

@josuebrunel

I do use DisableInitialHostLookup=true

@dustinblue

I'm using cluster.DisableInitialHostLookup = true also. I was getting the error on 1.5.x and 1.6.x using AWS Keyspaces. I went back to github.com/gocql/gocql v0.0.0-20200410100145-b454769479c6 and haven't had any issues since.

@martin-sucha
Contributor

We are considering deprecating DisableInitialHostLookup=true (see the comments in #1721). Is using SingleHostReadyPolicy + DisableInitialHostLookup=false good enough for those who are using DisableInitialHostLookup now? It achieves a shorter session startup time too. Are there any other use cases besides the lower session startup time?

@jack-at-circle

Just updated from v0.0.0-20220224095938-0eacd3183625 to v1.6.0, and we're hitting this issue every few days, losing the connection to AWS Keyspaces. Restarting the client re-establishes the connections, but that's clearly not an ideal solution.

@josuebrunel

Hello @jack-at-circle
I had the same issue with Azure and the following cluster config helped me:

cluster.DisableInitialHostLookup = false
cluster.PoolConfig = gocql.PoolConfig{
	HostSelectionPolicy: gocql.SingleHostReadyPolicy(gocql.RoundRobinHostPolicy()),
}

Hope it helps

@jack-at-circle

We're currently using that config @josuebrunel but are still losing connections after a day or two : (

@josuebrunel

@jack-at-circle Oh I see. I'm using github.com/gocql/gocql v1.5.2 and I also have a reconnection policy (can't really tell if it has an effect or not). Something like:

import (
	"log/slog"
	"time"
)

// ReconnectionPolicy implements gocql's ReconnectionPolicy interface with a
// constant delay between attempts.
type ReconnectionPolicy struct {
	maxRetries int
	delay      time.Duration
}

// GetInterval returns the delay to wait before the given reconnection attempt.
func (rp ReconnectionPolicy) GetInterval(currentRetry int) time.Duration {
	slog.Info("Trying to reconnect to db instance", "currentRetry", currentRetry)
	return rp.delay
}

// GetMaxRetries returns how many reconnection attempts will be made.
func (rp ReconnectionPolicy) GetMaxRetries() int {
	return rp.maxRetries
}

cluster.ReconnectionPolicy = ReconnectionPolicy{maxRetries: 4, delay: 400 * time.Millisecond}

@Rikkuru

Rikkuru commented May 23, 2024

I had this problem with v1.6.0 (actually with replace github.com/gocql/gocql => github.com/scylladb/gocql v1.12.0).

It was a configuration issue (thankfully only in the dev environment).

My nodes have 2 IP addresses: 100.* (extip) and 10.* (ip). The config used the extip addresses, but system.local returns the ip addresses, and the firewall only allowed the extip addresses.
The service was able to start since the initial address was taken from the config, but all the other nodes used the IPs provided by Scylla and were down.
If that one node went down too, we ended up without any connections to Scylla.
There is a fallback to the config addresses in the control connection, but when it eventually succeeded, setupConn still used this connection to request system.local and added the received IPs to the ring.

Log

2024/05/23 14:38:20.717959 Session.ring:[100.*:UP][10.*:DOWN]
// after 100.* went down 
2024/05/23 15:16:20.717901 Session.ring:[10.*:DOWN][10.*:DOWN] 
// 100.* returned; there are attempts to fall back to the config but the ring is not changed (all errors on dial to 10.* addrs)
2024/05/23 15:22:16.845645 gocql: control falling back to initial contact points.

@djschaap
Contributor

This issue is seven years old as of yesterday 🎂 - quite a few changes have been made to the connection failure recovery logic since it was created.

Might I suggest closing this issue and documenting each remaining failure case in its own (new) issue?
