
"gocql: no hosts available in the pool" cannot recover without restart #915

Closed
zhixinwen opened this issue May 23, 2017 · 49 comments

@zhixinwen
Contributor

We had a cluster degrade due to increased traffic load, and on the client side we saw the "gocql: no hosts available in the pool" error for ~80% of requests.

The error rate was very consistent, and the C* cluster was healthy apart from the load. We tried to reduce the load on the cluster, but the errors persisted for two hours.

We restarted the client, and the errors went away immediately.

One interesting detail: after we restarted the gocql client, we also saw an immediate drop in coordinator latency and traffic on the C* cluster (yes, the client restarted first, then the latency and traffic on C* dropped, not the other way around). It may be because upstream users stopped retrying once we restarted the server and no longer saw "gocql: no hosts available in the pool", but we are not sure about the root cause yet.

@zhixinwen
Contributor Author

After we close a connection due to too many timeouts, when will the connection be re-established?

I do see gocql waiting for an event from the control connection before reconnecting. But if the node is never perceived as down by the cluster and it never emits a 'node up' event, will gocql ever try to recreate the connections?

@zhixinwen
Contributor Author

@Zariel gocql closes a connection when it takes too long to respond; when will that connection be re-established?

I may be missing some of the code, but it looks like it will only reconnect when it receives a "Node Up" event?

@Zariel
Contributor

Zariel commented Jun 27, 2017

It should periodically try to dial nodes that are down; if you can reproduce and build with gocql_debug you should see more info about the pool state.

Also if all the hosts are down, then the host list should be reset to the initial hosts and they will be attempted.

@zhixinwen
Contributor Author

zhixinwen commented Aug 9, 2017

I think the root cause of the problem here is:

  1. When all connections to a host are closed, the host is not marked as down, so ReconnectInterval does not help.
  2. HandleError in hostConnPool tries to refill the connection pool when a connection is closed, but during an outage that refill is very likely to fail, and it does not retry afterward. As a result, once the cluster recovers after all the connections are closed, gocql never attempts to rebuild the connection pool.

To fix this, we need to mark the host as DOWN when we close all the connections to that host.

@balta2ar

I simulated network connectivity issues using iptables:

sudo iptables -I OUTPUT -p tcp -d 127.0.0.1 --dport 9042 -j DROP

After that, queries start to return this error message: gocql: no response received from cassandra within timeout period
and soon after, this one: gocql: no hosts available in the pool

After deleting the rule (sudo iptables -D OUTPUT 1), gocql indeed does not recover automatically.

@Zariel
Contributor

Zariel commented Aug 28, 2017

Running a single node of Cassandra (on a Mac, so excuse the lack of iptables), I can query the host as expected while it's running. After kill -9 on the process, gocql receives no host-down notification, but it discovers the control connection is dead and marks the host down. After restarting Cassandra the driver reconnects and queries continue. Repeated with STOP/CONT with the same results.

Can you rerun your test with the gocql_debug build tag?
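
(For reference, that means compiling the client with the gocql_debug build tag, e.g. go build -tags gocql_debug, which turns on gocql's debug logging of pool and host state.)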

@zhixinwen
Contributor Author

I think kill -9 is not an accurate simulation; @balta2ar's simulation seems closer to the issue I see.

@Zariel
Contributor

Zariel commented Feb 14, 2018

Do you know of a way to replicate this on OSX, or reliably for regression tests?

@robdefeo

When this error occurs I have had to cycle the container, as there is no good way to reconnect after this event. There must be a better way; this error has caused so many problems.

@engin

engin commented Jun 11, 2018

It seems that setting ClusterConfig.ReconnectInterval > 0 has no effect on this.

@shettyh

shettyh commented Feb 21, 2019

+1

Any fixes for this issue? In a single-node Cassandra setup this issue is consistently reproducible.

@Zariel

@steebchen

Constantly seeing this issue on a single-node Cassandra cluster. cqlsh works fine 100% of the time, while gocql seems to take a long time connecting to the cluster, often resulting in the "gocql: no hosts available in the pool" error.

@rlk833
Contributor

rlk833 commented Jun 5, 2019

We have this problem too, and have had it since at least October 2018. I had to put in special code to detect it and cause the application to be recycled. Our apps call gocql under very heavy load (millions of requests an hour). For a long time the recycle occurred once a month or so, but the load has increased recently and it now happens every day or even more often. We have 4 instances running, and it is not always all 4 instances at the same time. Out of nowhere, with no preceding errors logged, we suddenly get this in a running instance. The only fix is to restart the instance. Occasionally other instances will shortly run into the problem too and need restarting. After a restart there is no problem at all connecting to the nodes.

The main problem for us is that during the detection and restart period there is a significant slowdown in processing, as the instance is not available.

erebe pushed a commit to criteo-forks/rend that referenced this issue Apr 30, 2020
  + Remove the global singleton by passing around context when possible and by using a closure for the Rend Cassandra handler creation. The purpose is not to recreate a new connection toward Cassandra each time this handler is invoked.

  + Move away from the flaky unlogged batch for SETs. The problem is that managing the size of the batch is tricky, as it depends not on the number of elements but on the byte size of the batch.

  + Instead of relying on a single buffer for the batched SETs, use a fan-out approach where goroutines are responsible for sending single SET commands toward Cassandra. If performance decreases we can re-use the unlogged batch, but with a much smaller SET buffer, thus avoiding the problem above.

  + Made shutdown of Memendra thread safe.

  + Add a custom ConvictionPolicy to avoid apache/cassandra-gocql-driver#915.

  + Prepare statements in the Cassandra context to avoid allocating a new string on every request with fmt.format().

  + Replace Bind(...) calls with Query in the new gocql version (simplifies the code).
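
The custom ConvictionPolicy mentioned in that commit isn't shown in this thread; a minimal sketch of the general idea (a policy that never convicts a host, so gocql keeps the host in the pool and keeps retrying it) could look like the following. The type name is made up, and this is a workaround sketch under that assumption, not the driver's own fix.

import (
	"log"

	"github.com/gocql/gocql"
)

// neverConvict is a hypothetical ConvictionPolicy that never marks a host as
// DOWN on failures, so the pool keeps the host and keeps retrying it.
type neverConvict struct{}

// AddFailure is called by gocql on host errors; returning false means
// "do not convict (mark down) this host".
func (neverConvict) AddFailure(err error, host *gocql.HostInfo) bool {
	log.Printf("host %s failed: %v (keeping it in the pool)", host.ConnectAddress(), err)
	return false
}

// Reset clears any conviction state; nothing to track here.
func (neverConvict) Reset(host *gocql.HostInfo) {}

// Usage:
//   cluster := gocql.NewCluster("127.0.0.1")
//   cluster.ConvictionPolicy = neverConvict{}

Never convicting hosts trades one failure mode for another (queries keep being routed to a host that may really be gone), which is presumably why it is used here only as a workaround.
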
@clarakosi

I ran into a similar issue a couple of days ago and followed @balta2ar's instructions to try to recreate the situation.

Surprisingly, gocql reconnected after removing the rule (sudo iptables -D OUTPUT 1).

Here are the logs:

{"msg":"Error reading from storage (gocql: no response received from cassandra within timeout period)","appname":"kask","time":"2020-05-21T18:05:40-04:00","level"
:"ERROR","request_id":"00000000-0000-0000-0000-000000000000"}
{"msg":"gocql: Session.handleNodeDown: 127.0.0.1:9042","appname":"kask","time":"2020-05-21T18:05:47-04:00","level":"WARNING"}
{"msg":"Error reading from storage (gocql: no hosts available in the pool)","appname":"kask","time":"2020-05-21T18:05:47-04:00","level":"ERROR","request_id":"0000
0000-0000-0000-0000-000000000000"}
{"msg":"Error reading from storage (gocql: no hosts available in the pool)","appname":"kask","time":"2020-05-21T18:05:50-04:00","level":"ERROR","request_id":"0000
0000-0000-0000-0000-000000000000"}
{"msg":"Error reading from storage (gocql: no hosts available in the pool)","appname":"kask","time":"2020-05-21T18:05:51-04:00","level":"ERROR","request_id":"0000
0000-0000-0000-0000-000000000000"}
{"msg":"gocql: Session.handleNodeDown: 127.0.0.1:9042","appname":"kask","time":"2020-05-21T18:05:58-04:00","level":"WARNING"}
{"msg":"gocql: Session.handleNodeDown: 127.0.0.1:9042","appname":"kask","time":"2020-05-21T18:06:09-04:00","level":"WARNING"}
{"msg":"gocql: Session.handleNodeDown: 127.0.0.1:9042","appname":"kask","time":"2020-05-21T18:06:20-04:00","level":"WARNING"}
{"msg":"gocql: Session.handleNodeUp: 127.0.0.1:9042","appname":"kask","time":"2020-05-21T18:06:23-04:00","level":"WARNING"}

Any other ideas on how to replicate this issue?

@SubNader

SubNader commented Oct 1, 2020

Running into this while using gocql to write to AWS Keyspaces.
Any updates on this?

@mattmassicotte

I'm also experiencing this issue extremely frequently with AWS Keyspaces. It appears to be entirely due to the state of the process: other processes can access Keyspaces without issue, and restarting an affected process immediately resolves the issue. I'm up for helping to debug however I can.

@Draal

Draal commented Apr 9, 2021

It seems like it happens because of
https://github.com/gocql/gocql/blob/390af0fb2915aecc9d18df7afa11501fc2d4c9c5/connectionpool.go#L248
and then we add the host back only at
https://github.com/gocql/gocql/blob/390af0fb2915aecc9d18df7afa11501fc2d4c9c5/events.go#L187-L187
which, if I understand correctly, happens only when a new node is starting, but not when a node temporarily lost network connectivity or rejected requests because of overload.

@martin-sucha
Contributor

There is pool.addHost in reconnectDownedHosts, which periodically re-adds nodes that are marked as down. However, we set the host state to down only when we get such an event from the database, which does not occur if only gocql lost its connections (or closed them due to errors).
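
For context, that periodic loop is driven by ClusterConfig.ReconnectInterval; a minimal configuration sketch (addresses are placeholders):

cluster := gocql.NewCluster("10.0.0.1", "10.0.0.2", "10.0.0.3")
// reconnectDownedHosts wakes up on this interval, but it only retries hosts
// whose state is DOWN; hosts that merely lost all their connections without
// being marked down are not picked up by it.
cluster.ReconnectInterval = 10 * time.Second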

@mrwonko

mrwonko commented Apr 22, 2021

I'm trying to reproduce this using docker-compose on Mac. I'm using the alpine image for my client, installed iptables via apk --update add iptables, then added

    cap_add:
      - NET_ADMIN

to my docker-compose.yaml so I can use iptables inside the container. I have a small Go program that does SELECT requests in a loop.
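
The program itself isn't included above; a minimal sketch of such a SELECT loop (the contact point, keyspace, table, and column names are placeholders) could look like this:

package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("cassandra") // docker-compose service name (assumption)
	cluster.Keyspace = "test"                // placeholder keyspace
	cluster.Timeout = 2 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("create session: %v", err)
	}
	defer session.Close()

	for {
		var value string
		// Placeholder table/column; any cheap SELECT works for this repro.
		if err := session.Query(`SELECT data FROM demo LIMIT 1`).Scan(&value); err != nil {
			log.Printf("SELECT failed: %v", err)
		} else {
			log.Printf("SELECTed: %q", value)
		}
		time.Sleep(time.Second)
	}
}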

When I execute iptables -I OUTPUT -p tcp --dport 9042 -j DROP, this happens:

client_1     | 2021/04/22 15:08:22 SELECTed: "data"
client_1     | 2021/04/22 15:08:23 SELECTed: "data"
client_1     | 2021/04/22 15:08:25 SELECT failed: gocql: no response received from cassandra within timeout period
client_1     | 2021/04/22 15:08:26 gocql: Session.handleNodeDown: 172.22.0.2:9042
client_1     | 2021/04/22 15:08:26 SELECT failed: gocql: connection closed waiting for response
client_1     | 2021/04/22 15:08:26 SELECT failed: gocql: no hosts available in the pool
client_1     | 2021/04/22 15:08:27 SELECT failed: gocql: no hosts available in the pool
client_1     | 2021/04/22 15:08:28 gocql: Session.handleNodeDown: 172.22.0.2:9042
client_1     | 2021/04/22 15:08:28 SELECT failed: gocql: no hosts available in the pool
client_1     | 2021/04/22 15:08:29 SELECT failed: gocql: no hosts available in the pool

So far, so good. Occasionally, an automatic reconnect is apparently attempted:

client_1     | 2021/04/22 15:09:20 connection failed "172.22.0.2": dial tcp 172.22.0.2:9042: i/o timeout, reconnecting with *gocql.ConstantReconnectionPolicy

But when I then remove the rule via iptables -D OUTPUT 1, gocql recovers immediately:

client_1     | 2021/04/22 15:10:28 SELECT failed: gocql: no hosts available in the pool
client_1     | 2021/04/22 15:10:29 gocql: Session.handleNodeDown: 172.22.0.2:9042
client_1     | 2021/04/22 15:10:29 gocql: Session.handleNodeUp: 172.22.0.2:9042
client_1     | 2021/04/22 15:10:29 SELECTed: "data"
client_1     | 2021/04/22 15:10:30 SELECTed: "data"

However, I know the issue is not entirely fixed, because we recently had an outage due to it. I just don't know how to reproduce it.

@justinfx

justinfx commented Nov 7, 2022

@player1-github my last reply was 10 months ago, after the proposed fix was merged. I had tested my reported issue using the latest commits. It would be great if someone could confirm my report on the Yugabyte issue.

@PierreF

PierreF commented Feb 8, 2023

This issue looks like the same one I had: it happens if the initial set of Cassandra IP addresses changes.

This happens easily on Kubernetes: with a rolling upgrade of Cassandra, each Cassandra node is restarted with a new IP address. Once the last node is restarted, all the initial IPs have changed and gocql fails with "no hosts available in the pool".

I've created a minimal case to reproduce this issue in the following repository using Docker compose: https://github.com/PierreF/gocql-cluster-issue

This repository has:

  • A Cassandra cluster of 3 nodes.
  • It restarts one Cassandra node after another, so at least two nodes are always up.
  • A Python client that has no issue, so this rolling restart should be valid.

I've found some other issues that look like duplicates of this one.

@PierreF

PierreF commented Feb 8, 2023

Inspired by the following comment, I've tested a workaround which fixes my issue: PierreF/gocql-cluster-issue@a7a6db2

@justinfx

justinfx commented Feb 8, 2023

@PierreF that workaround looks like a simplified version of what I commented a year ago, earlier in this thread: yugabyte/yugabyte-db#10182 (comment)

It is pretty ugly and needs to be a bit more careful about guarding concurrent access to the session from queries that are in flight. I wouldn't really consider it a great workaround; it's more like a band-aid.
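
Neither workaround is reproduced in this thread; a rough sketch of the band-aid idea (rebuild the session when the pool reports no hosts, with a lock guarding concurrent access) might look like the following. All names are made up, and it assumes gocql exposes the "no hosts available in the pool" error as gocql.ErrNoConnections.

import (
	"errors"
	"sync"

	"github.com/gocql/gocql"
)

// reconnectingSession is a hypothetical wrapper that throws away and rebuilds
// the gocql session when the connection pool reports that no hosts are left.
type reconnectingSession struct {
	mu      sync.RWMutex
	cluster *gocql.ClusterConfig
	session *gocql.Session
}

func (r *reconnectingSession) exec(stmt string, values ...interface{}) error {
	r.mu.RLock()
	s := r.session
	r.mu.RUnlock()

	err := s.Query(stmt, values...).Exec()
	if !errors.Is(err, gocql.ErrNoConnections) {
		return err
	}

	// The pool is empty: rebuild the whole session once and retry the query.
	r.mu.Lock()
	if r.session == s { // another goroutine may already have rebuilt it
		if ns, cerr := r.cluster.CreateSession(); cerr == nil {
			s.Close()
			r.session = ns
		}
	}
	s = r.session
	r.mu.Unlock()

	return s.Query(stmt, values...).Exec()
}

As noted above, this shifts the problem to correctly handling queries still in flight on the old session, so it is best treated as a last resort rather than a fix.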

@sseidman
Contributor

sseidman commented Feb 9, 2023

@PierreF I'd be curious if using both #1670 and #1669 would solve your issue. I've had positive results using both changes as a workaround

@mfamador

mfamador commented Feb 11, 2023

@PierreF I'd be curious if using both #1670 and #1669 would solve your issue. I've had positive results using both changes as a workaround

I've used both changes in those PRs and they didn't solve my issue, at least the way I'm testing it.
We've been hitting this issue often on Azure CosmosDB with the Cassandra interface, and it's hard to reproduce.
The way I'm testing it is by running a local Cassandra container and removing it while my consumer is working (if I just stop and start it there's no issue, it reconnects immediately). When recreating the container it won't reconnect unless I use the "band-aid" from @PierreF, which is working for me.

@PierreF

PierreF commented May 2, 2023

If the test case I've done in #915 (comment) (https://github.com/PierreF/gocql-cluster-issue) is the issue @zhixinwen reported, then #1682 fixes this issue.

At least, after upgrading to gocql v1.4.0 I no longer have the issue.

@josuebrunel

josuebrunel commented Jun 20, 2023

Still having the same issue with Azure Cosmos DB Cassandra Interface (private endpoint), even after upgrading to v1.5.2

@amcrn

amcrn commented Sep 7, 2023

For those who are still experiencing this issue, can you confirm whether you're using DisableInitialHostLookup or not? See #1721

@josuebrunel

I do use DisableInitialHostLookup=true

@dustinblue

I'm using cluster.DisableInitialHostLookup = true also. I was getting the error on 1.5.x and 1.6.x using AWS Keyspaces. I went back to github.com/gocql/gocql v0.0.0-20200410100145-b454769479c6 and haven't had any issues since.

@martin-sucha
Contributor

We are considering deprecating DisableInitialHostLookup=true (see the comments in #1721). Is using SingleHostReadyPolicy + DisableInitialHostLookup=false good enough for those who are using DisableInitialHostLookup now? It achieves a shorter session startup time too. Are there any other use cases besides the lower session startup time?

@jack-at-circle

Just updated from v0.0.0-20220224095938-0eacd3183625 to v1.6.0, and we're hitting this issue every few days, losing the connection to AWS Keyspaces. Restarting the client re-establishes the connections, but that's clearly not an ideal solution.

@josuebrunel

Hello @jack-at-circle
I had the same issue with Azure and the following cluster config helped me:

cluster.DisableInitialHostLookup = false
cluster.PoolConfig = gocql.PoolConfig{
	HostSelectionPolicy: gocql.SingleHostReadyPolicy(gocql.RoundRobinHostPolicy()),
}

Hope it helps

@jack-at-circle

We're currently using that config @josuebrunel but are still losing connections after a day or two : (

@josuebrunel

@jack-at-circle Oh I see. I'm using github.com/gocql/gocql v1.5.2 and I also have a reconnection policy (can't really tell if it has an effect or not). Something like:

import (
	"log/slog"
	"time"
)

// ReconnectionPolicy implements gocql's ReconnectionPolicy interface with a
// constant delay between attempts.
type ReconnectionPolicy struct {
	maxRetries int
	delay      time.Duration
}

// GetInterval returns the delay to wait before the given reconnection attempt.
func (rp ReconnectionPolicy) GetInterval(currentRetry int) time.Duration {
	slog.Info("Trying to reconnect to db instance", "currentRetry", currentRetry)
	return rp.delay
}

// GetMaxRetries returns how many reconnection attempts will be made.
func (rp ReconnectionPolicy) GetMaxRetries() int {
	return rp.maxRetries
}

cluster.ReconnectionPolicy = ReconnectionPolicy{maxRetries: 4, delay: 400 * time.Millisecond}

@Rikkuru

Rikkuru commented May 23, 2024

I had this problem with v1.6.0 (actually with replace github.com/gocql/gocql => github.com/scylladb/gocql v1.12.0).

It was a configuration issue (thankfully only in the dev environment).

My nodes have 2 IP addresses: 100.* (extip) and 10.* (ip). The config used the extip addresses, but system.local returns the ip addresses, and the firewall only allowed the extip addresses.
The service was able to start since the initial address was taken from the config, but all the other nodes used the IPs provided by Scylla and were down.
If that one node went down too, we ended up without any connections to Scylla.
There is a fallback to the config addresses in the control connection, but when it eventually succeeded, setupConn still used this connection to request system.local and added the received IPs to the ring.

Log

2024/05/23 14:38:20.717959 Session.ring:[100.*:UP][10.*:DOWN]
// after 100.* went down 
2024/05/23 15:16:20.717901 Session.ring:[10.*:DOWN][10.*:DOWN] 
// 100.* returned; there are attempts to fall back to the config but the ring is not changed (all errors on dial to 10.* addrs)
2024/05/23 15:22:16.845645 gocql: control falling back to initial contact points.

@djschaap
Contributor

This issue is seven years old as of yesterday 🎂 - quite a few changes have been made to the connection failure recovery logic since it was created.

Might I suggest closing this issue and documenting each remaining failure case in its own (new) issue?
