
gocql does not re-resolve DNS names #831

Closed
chummydog opened this issue Nov 15, 2016 · 31 comments

@chummydog

Using the gocql library from about two weeks ago, I noticed the following issue (we had been using a gocql version from last April and see the same problem, even though the code has changed in this area - you seemed to fix a bug in ring.go). Our application runs in a cloud environment where Cassandra instances can move from node to node (so their IP addresses change), and we use DNS to manage this. In this case we have a single-node Cassandra cluster, and at startup we pass its DNS name to our application (which passes it to the gocql Session abstraction). All works well until the Cassandra node is restarted and comes back bound to a new IP. The control connection in gocql fails because it notices the connection to the old IP has been closed, and at that point the only remedy is to restart our application, because gocql has no way to learn the new IP of the Cassandra node. The issue seems to be that gocql loses the DNS name we passed in. I'm new to gocql, but I can't find a way (via some config setting) to address this in our application. Any help would be appreciated.
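
For reference, here is roughly how we create the session today. This is only a simplified sketch; the hostname and keyspace are placeholders, not our real names:

package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// "cassandra.internal.example" stands in for the DNS name that follows
	// our single Cassandra node around as it moves between hosts.
	cluster := gocql.NewCluster("cassandra.internal.example")
	cluster.Keyspace = "myks" // placeholder keyspace

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// This works until the node restarts on a new IP; at that point the
	// control connection fails and the driver never re-resolves the name.
}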

@kenng

kenng commented Jan 12, 2017

Any follow-up on this issue? We are having a similar problem when running in Kubernetes services. The IP of the Cassandra pod changes slightly (from 10.48.0.54 to 10.48.0.56) after the pod image is updated to a new version, and when this happens the error is thrown.

@laz2

laz2 commented Jan 30, 2017

Same problem with a deployment in Kubernetes.

@jdness

jdness commented Jan 30, 2017

We were frequently getting this error, but for us it ended up being Cassandra-related, not gocql-related. Our single-instance Cassandra cluster was frequently hitting stop-the-world garbage collection for hundreds of milliseconds at a time. We optimized our application's database operations, tuned the Cassandra JVM settings, changed the GC mechanism (CMS -> G1GC), lengthened the gocql timeouts (600ms -> 5000ms), and gave Cassandra more RAM. GC pauses are now much less frequent and much shorter, and we no longer see this gocql error. Not sure if that helps any of you hitting the same error, but you might want to check your Cassandra system.log and/or gc.log to see whether no hosts are available because Cassandra is not responding to events.
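
For anyone curious, the gocql side of that change was just the ClusterConfig timeouts. A minimal sketch (the contact point is a placeholder):

package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("cassandra.internal.example") // placeholder contact point
	cluster.Timeout = 5 * time.Second        // per-query timeout, raised from the 600ms default
	cluster.ConnectTimeout = 5 * time.Second // dial timeout for new connections

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}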

@robusto
Contributor

robusto commented Feb 4, 2017

TL;DR: In K8s (or equivalent), you should be using Pet Sets (or equivalent) with a stable hostname and linked volumes for your stateful services, like Cassandra. Basically this: http://blog.kubernetes.io/2016/07/thousand-instances-of-cassandra-using-kubernetes-pet-set.html


Shifting hosts under DNS is sort of an anti-pattern in C* because Cassandra relies on concrete, addressable targets in order to gossip about and maintain cluster state. Cassandra is a stateful service and is in constant communication with its peers about their (and their neighbors') individual states. It's unreasonable to expect the state of your local DNS and TTLs to propagate in perfect sync with the state of an arbitrary number of nodes in a distributed system.

Additionally, C* clients attempt to establish connections with all (or many) nodes in the cluster, not just the one(s) you provide in ClusterConfig. This allows clients to make intelligent decisions about query load balancing and cluster availability. Recall that Cassandra has no master nodes, so all nodes are equally available to serve queries. The mantra is "no single point of failure" and, as you've discovered, DNS can be a single point of failure.

This problem is not really specific to gocql. I believe you'd find the same behavior in any of the other stable Cassandra drivers because of how Cassandra and any client driver are (and must be) designed.

Regarding GC and high load: running a single node of Cassandra doesn't really make sense, but I understand if you're evaluating it from a development standpoint. (Even so, a small 3-node cluster will help you familiarize yourself with consistency levels and replication.)

Long GC STW pauses are a strong sign that your "cluster" is overloaded. Tuning Cassandra and the JVM (especially the heap) in proportion to your container's allocated CPU and memory is usually necessary in any case.

Some reading on GC and C* tuning:

@idealhack

@robusto Hi, thanks for your explanation.

We use Cassandra in a K8s StatefulSet (following the official example), and the client uses a pod DNS name to connect to the servers. But when some pods are deleted (e.g. because of node migration) while the data volumes stay the same, the client still receives errors about the old IPs.

Is there any way to solve this problem?

@Zariel
Contributor

Zariel commented Aug 3, 2017

Can you build with the gocql_debug tag and provide the logs of the hosts being discovered? Cassandra should notify the driver that the node went down and then that another one came up.

@idealhack

@Zariel Thank you.

Output from nodetool status, which is the latest and correct state:

Datacenter: dev
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens       Owns (effective)  Host ID                               Rack
DN  10.244.3.66  273.71 KiB  32           100.0%            57e4117c-db4a-4eeb-b51a-f24edf8da8a4  test
UN  10.244.2.79  2.29 GiB   32           100.0%            3f70d9f9-4803-46a7-b8af-41dff9b5a527  test
UN  10.244.4.47  174.86 KiB  32           100.0%            1768dbdd-addd-442d-ab8c-fb52b126307d  test

Output from gocql:

2017/08/03 21:31:48 gocql: Session.handleNodeUp: 10.244.2.79:9042
2017/08/03 21:31:50 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/03 21:31:50 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/03 21:31:51 unable to dial "10.244.6.10": dial tcp 10.244.6.10:9042: i/o timeout
2017/08/03 21:31:52 gocql: Session.handleNodeDown: 10.244.6.10:9042
2017/08/03 21:31:52 unable to dial "10.244.3.57": dial tcp 10.244.3.57:9042: getsockopt: no route to host
2017/08/03 21:31:52 gocql: Session.handleNodeDown: 10.244.3.57:9042
2017/08/03 21:31:54 unable to dial "10.244.4.34": dial tcp 10.244.4.34:9042: i/o timeout
2017/08/03 21:31:54 gocql: Session.handleNodeDown: 10.244.4.34:9042
2017/08/03 21:31:56 unable to dial "10.244.4.36": dial tcp 10.244.4.36:9042: i/o timeout
2017/08/03 21:31:56 gocql: Session.handleNodeDown: 10.244.4.36:9042
2017/08/03 21:31:58 unable to dial "10.244.3.66": dial tcp 10.244.3.66:9042: i/o timeout
2017/08/03 21:31:58 gocql: Session.handleNodeDown: 10.244.3.66:9042
2017/08/03 21:31:58 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/03 21:31:59 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/03 21:32:00 gocql: Session.handleNodeUp: 10.244.6.10:9042
2017/08/03 21:32:00 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/03 21:32:01 unable to dial "10.244.6.10": dial tcp 10.244.6.10:9042: i/o timeout
2017/08/03 21:32:01 gocql: Session.handleNodeUp: 10.244.3.57:9042
2017/08/03 21:32:01 gocql: Session.handleNodeDown: 10.244.6.10:9042
2017/08/03 21:32:03 unable to dial "10.244.3.57": dial tcp 10.244.3.57:9042: i/o timeout
2017/08/03 21:32:03 gocql: Session.handleNodeUp: 10.244.4.34:9042
2017/08/03 21:32:03 gocql: Session.handleNodeDown: 10.244.3.57:9042
2017/08/03 21:32:05 unable to dial "10.244.4.34": dial tcp 10.244.4.34:9042: i/o timeout
2017/08/03 21:32:05 gocql: Session.handleNodeDown: 10.244.4.34:9042
2017/08/03 21:32:05 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/03 21:32:05 gocql: Session.handleNodeUp: 10.244.4.36:9042
2017/08/03 21:32:07 unable to dial "10.244.4.36": dial tcp 10.244.4.36:9042: i/o timeout
2017/08/03 21:32:07 gocql: Session.handleNodeUp: 10.244.3.66:9042
2017/08/03 21:32:07 gocql: Session.handleNodeDown: 10.244.4.36:9042
2017/08/03 21:32:07 unable to dial "10.244.3.66": dial tcp 10.244.3.66:9042: getsockopt: no route to host
2017/08/03 21:32:07 gocql: Session.handleNodeUp: 10.244.2.79:9042

And output from the C++ client in the same cluster, in case it helps:

1501767219.595 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.3.57 because of the following error: Connect error 'host is unreachable'
1501767219.597 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.3.57 because of the following error: Connect error 'host is unreachable'
1501767219.600 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.6.10 because of the following error: Connection timeout
1501767219.602 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.6.10 because of the following error: Connection timeout
1501767219.967 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.5.5 because of the following error: Connection timeout
1501767219.970 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.5.5 because of the following error: Connection timeout
1501767220.105 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.3.66 because of the following error: Connect error 'host is unreachable'
1501767220.105 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.3.66 because of the following error: Connect error 'host is unreachable'
1501767220.120 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.4.36 because of the following error: Connect error 'host is unreachable'
1501767220.120 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.4.36 because of the following error: Connect error 'host is unreachable'
1501767220.567 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.4.34 because of the following error: Connect error 'host is unreachable'
1501767249.586 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.4.34 because of the following error: Connect error 'host is unreachable'
1501767249.588 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.4.34 because of the following error: Connect error 'host is unreachable'
1501767249.653 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.3.57 because of the following error: Connect error 'host is unreachable'
1501767249.654 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.3.57 because of the following error: Connect error 'host is unreachable'
1501767250.165 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.3.66 because of the following error: Connect error 'host is unreachable'
1501767250.165 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.3.66 because of the following error: Connect error 'host is unreachable'

I reproduced this by following the steps I described in my last comment.

@thrawn01
Contributor

thrawn01 commented Aug 3, 2017

I'm assuming node 10.244.5.5 goes down for node migration here:

2017/08/03 21:31:50 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/03 21:31:50 gocql: Session.handleNodeDown: 10.244.5.5:9042

Then the node comes back with a new Kubernetes-assigned IP address here:

2017/08/03 21:31:58 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/03 21:31:59 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout

The question is: does C* send the UP node event with the original IP address or the new one?

This log suggests C* sends the original IP address in the UP node event (which kind of makes sense). If it's the original IP address, gocql uses the HostInfo from the existing ring entry when attempting to dial the node; hence the dial errors.

Perhaps gocql receives a MOVED_NODE event during this time but doesn't handle it. If we did, we could refresh the ring and connect to the updated IP address.

@idealhack are you able to reproduce with the gocql_debug compile flag enabled? Enabling that flag should log all the events received during the Kubernetes node migration, which will give us more information.

If you are using K8s StatefulSets, then when the pod returns it should have the same IP address as before, and gocql should have no issue reconnecting to the node. Can you confirm whether the C* pod returns with the same IP address?

@idealhack

idealhack commented Aug 4, 2017

@thrawn01 Thank you.

I'm convinced that a pod's IP should not change when it crashes and restarts, but it will change when the pod is deleted and another one comes up, which is the situation I reproduced, with the gocql_debug flag enabled.

The logs above were written after the deletion and recreation; moreover, the clients were also restarted, so I thought the old IPs should never appear in the client logs.

As I said, this leads me to one explanation: the old IPs were stored and written to disk (and were not removed after the pod was deleted). If so, I wonder if there is a way to avoid this?

I will try to get more logs covering the moment the pods are deleted.

@idealhack

I deleted all pods again (without deleting the data on most nodes) and found that the first pod still has those old IPs (I guess it reads them from the previous disk data). Then I continued adding new pods.

Finally, nodetool status on the first pod reports:

Datacenter: dev
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens       Owns (effective)  Host ID                               Rack
DN  10.244.4.36  ?          32           35.1%             c1f27d75-489a-4755-93d4-a50398a76233  test
UN  10.244.9.5   72.74 KiB  32           33.2%             fd0c49a4-29d1-4d8b-8641-580bc7673ce5  test
DN  10.244.5.5   ?          32           29.1%             742cf6aa-c046-416e-8bcc-2bc5c13e423a  test
UN  10.244.3.71  180.74 KiB  32           32.7%             a45e7118-ebbe-4c98-90d0-55aeca302baa  test
UN  10.244.2.81  2.35 GiB   32           26.7%             3f70d9f9-4803-46a7-b8af-41dff9b5a527  test
DN  10.244.3.66  ?          32           30.1%             57e4117c-db4a-4eeb-b51a-f24edf8da8a4  test
DN  10.244.4.34  ?          32           34.6%             d0abe073-38b5-46ca-9065-5f82ed0e5372  test
UN  10.244.4.51  70.28 MiB  32           24.5%             1768dbdd-addd-442d-ab8c-fb52b126307d  test
DN  10.244.3.57  ?          32           24.2%             04f42c55-a415-47bf-b99a-546fa9749497  test
DN  10.244.6.10  ?          32           29.7%             425f46be-0a65-4d4b-870c-1a1417f11cce  test

In the meantime, gocql reports (some redundant parts omitted):

2017/08/04 14:39:23 Session.ring:[10.244.3.66:DOWN][10.244.2.81:UP][10.244.5.5:DOWN][10.244.6.10:DOWN][10.244.3.57:DOWN][10.244.4.34:DOWN][10.244.4.47:DOWN][10.244.4.36:DOWN]
2017/08/04 14:39:23 gocql: Session.handleNodeUp: 10.244.3.66:9042
2017/08/04 14:39:25 unable to dial "10.244.3.66": dial tcp 10.244.3.66:9042: i/o timeout
2017/08/04 14:39:25 gocql: Session.handleNodeDown: 10.244.3.66:9042
2017/08/04 14:39:25 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/04 14:39:27 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/04 14:39:27 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/04 14:39:27 gocql: Session.handleNodeUp: 10.244.6.10:9042
2017/08/04 14:39:28 unable to dial "10.244.6.10": dial tcp 10.244.6.10:9042: i/o timeout
2017/08/04 14:39:28 gocql: Session.handleNodeUp: 10.244.3.57:9042
2017/08/04 14:39:28 gocql: Session.handleNodeDown: 10.244.6.10:9042
2017/08/04 14:39:30 unable to dial "10.244.3.57": dial tcp 10.244.3.57:9042: i/o timeout
2017/08/04 14:39:30 gocql: Session.handleNodeUp: 10.244.4.34:9042
2017/08/04 14:39:30 gocql: Session.handleNodeDown: 10.244.3.57:9042
2017/08/04 14:39:32 unable to dial "10.244.4.34": dial tcp 10.244.4.34:9042: i/o timeout
2017/08/04 14:39:32 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 14:39:32 gocql: Session.handleNodeDown: 10.244.4.34:9042
2017/08/04 14:39:34 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 14:39:34 gocql: Session.handleNodeUp: 10.244.4.36:9042
2017/08/04 14:39:34 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 14:39:36 unable to dial "10.244.4.36": dial tcp 10.244.4.36:9042: i/o timeout
2017/08/04 14:39:36 gocql: Session.handleNodeDown: 10.244.4.36:9042
2017/08/04 14:40:23 Session.ring:[10.244.4.34:DOWN][10.244.4.47:DOWN][10.244.4.36:DOWN][10.244.3.66:UP][10.244.2.81:UP][10.244.5.5:DOWN][10.244.6.10:DOWN][10.244.3.57:DOWN]

...

2017/08/04 14:44:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 14:44:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 14:44:25 gocql: Session.handleNodeUp: 10.244.4.36:9042
2017/08/04 14:44:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 14:44:27 unable to dial "10.244.4.36": dial tcp 10.244.4.36:9042: i/o timeout
2017/08/04 14:44:27 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/04 14:44:27 gocql: Session.handleNodeDown: 10.244.4.36:9042
2017/08/04 14:44:28 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/04 14:44:29 gocql: Session.handleNodeUp: 10.244.6.10:9042
2017/08/04 14:44:29 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/04 14:44:30 unable to dial "10.244.6.10": dial tcp 10.244.6.10:9042: i/o timeout
2017/08/04 14:44:30 gocql: Session.handleNodeUp: 10.244.3.57:9042
2017/08/04 14:44:30 gocql: Session.handleNodeDown: 10.244.6.10:9042
2017/08/04 14:44:32 unable to dial "10.244.3.57": dial tcp 10.244.3.57:9042: i/o timeout
2017/08/04 14:44:32 gocql: Session.handleNodeUp: 10.244.4.34:9042
2017/08/04 14:44:32 gocql: Session.handleNodeDown: 10.244.3.57:9042
2017/08/04 14:44:34 unable to dial "10.244.4.34": dial tcp 10.244.4.34:9042: i/o timeout
2017/08/04 14:44:34 gocql: Session.handleNodeDown: 10.244.4.34:9042
2017/08/04 14:45:23 Session.ring:[10.244.3.66:UP][10.244.2.81:UP][10.244.5.5:DOWN][10.244.6.10:DOWN][10.244.3.57:DOWN][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:DOWN]

...

2017/08/04 14:52:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 14:52:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 14:52:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 14:52:25 gocql: Session.handleNodeUp: 10.244.4.36:9042
2017/08/04 14:52:26 unable to dial "10.244.4.36": dial tcp 10.244.4.36:9042: i/o timeout
2017/08/04 14:52:27 gocql: Session.handleNodeDown: 10.244.4.36:9042
2017/08/04 14:52:27 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/04 14:52:28 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/04 14:52:28 gocql: Session.handleNodeUp: 10.244.6.10:9042
2017/08/04 14:52:28 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/04 14:52:30 unable to dial "10.244.6.10": dial tcp 10.244.6.10:9042: i/o timeout
2017/08/04 14:52:30 gocql: Session.handleNodeUp: 10.244.3.57:9042
2017/08/04 14:52:30 gocql: Session.handleNodeDown: 10.244.6.10:9042
2017/08/04 14:52:32 unable to dial "10.244.3.57": dial tcp 10.244.3.57:9042: i/o timeout
2017/08/04 14:52:32 gocql: Session.handleNodeDown: 10.244.3.57:9042
2017/08/04 14:53:23 Session.ring:[10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.3.66:UP][10.244.2.81:UP][10.244.5.5:DOWN][10.244.6.10:DOWN][10.244.3.57:UP]

...

2017/08/04 15:00:23 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/04 15:00:25 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/04 15:00:25 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/04 15:00:25 gocql: Session.handleNodeUp: 10.244.6.10:9042
2017/08/04 15:00:27 unable to dial "10.244.6.10": dial tcp 10.244.6.10:9042: i/o timeout
2017/08/04 15:00:27 gocql: Session.handleNodeDown: 10.244.6.10:9042
2017/08/04 15:00:27 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:00:28 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:00:28 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:01:23 Session.ring:[10.244.2.81:UP][10.244.5.5:DOWN][10.244.6.10:UP][10.244.3.57:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.3.66:UP]

...

2017/08/04 15:03:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:03:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:03:25 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/04 15:03:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:03:27 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/04 15:03:27 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/04 15:03:48 gocql: handling frame: [topology_change change=NEW_NODE host=10.244.9.5 port=9042]
2017/08/04 15:03:48 gocql: handling frame: [status_change change=UP host=10.244.9.5 port=9042]
2017/08/04 15:03:49 gocql: dispatching event: &{change:UP host:[10 244 9 5] port:9042}
2017/08/04 15:03:49 gocql: Session.handleNodeUp: 10.244.9.5:9042
2017/08/04 15:04:23 Session.ring:[10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.9.5:UP][10.244.2.81:UP][10.244.5.5:DOWN][10.244.3.57:UP][10.244.3.66:UP]

...

2017/08/04 15:15:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:15:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:15:25 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/04 15:15:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:15:27 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/04 15:15:27 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/04 15:15:51 gocql: handling frame: [topology_change change=NEW_NODE host=10.244.3.71 port=9042]
2017/08/04 15:15:51 gocql: handling frame: [status_change change=UP host=10.244.3.71 port=9042]
2017/08/04 15:15:52 gocql: dispatching event: &{change:UP host:[10 244 3 71] port:9042}
2017/08/04 15:15:52 gocql: Session.handleNodeUp: 10.244.3.71:9042
2017/08/04 15:16:23 Session.ring:[10.244.2.81:UP][10.244.5.5:DOWN][10.244.3.57:UP][10.244.3.66:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP]
2017/08/04 15:16:23 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/04 15:16:25 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/04 15:16:25 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/04 15:16:25 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:16:27 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:16:27 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:17:23 Session.ring:[10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.9.5:UP][10.244.2.81:UP][10.244.5.5:DOWN][10.244.3.57:UP][10.244.3.66:UP]
2017/08/04 15:17:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:17:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:17:25 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/04 15:17:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:17:26 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/04 15:17:27 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/04 15:18:23 Session.ring:[10.244.3.66:UP][10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN]
2017/08/04 15:18:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:18:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:18:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:19:23 Session.ring:[10.244.3.57:UP][10.244.3.66:UP][10.244.2.81:UP][10.244.5.5:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP]
2017/08/04 15:19:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:19:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:19:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:20:23 Session.ring:[10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.3.66:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP]
2017/08/04 15:20:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:20:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:20:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:21:23 Session.ring:[10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.3.66:UP]
2017/08/04 15:21:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:21:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:21:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:22:23 Session.ring:[10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.3.66:UP]
2017/08/04 15:22:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:22:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:22:25 gocql: Session.handleNodeDown: 10.244.4.47:9042


2017/08/04 15:23:23 Session.ring:[10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.3.66:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP]
2017/08/04 15:23:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:23:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:23:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:24:23 Session.ring:[10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.3.66:UP]
2017/08/04 15:24:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:24:24 gocql: handling frame: [status_change change=UP host=10.244.4.51 port=9042]
2017/08/04 15:24:25 gocql: dispatching event: &{change:UP host:[10 244 4 51] port:9042}
2017/08/04 15:24:25 gocql: Session.handleNodeUp: 10.244.4.51:9042
2017/08/04 15:24:25 Found invalid peer '[HostInfo connectAddress="<nil>" peer="10.244.4.51" rpc_address="10.244.4.51" broadcast_address="<nil>" port=9042 data_centre="dev" rack="test" host_id="1768dbdd-addd-442d-ab8c-fb52b126307d" version="v3.9.0" state=UP num_tokens=0]' Likely due to a gossip or snitch issue, this host will be ignored
2017/08/04 15:24:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:24:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:25:23 Session.ring:[10.244.9.5:UP][10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.4.51:UP][10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.3.66:UP]
2017/08/04 15:25:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:25:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:25:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:26:23 Session.ring:[10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.3.66:UP][10.244.4.51:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP]
2017/08/04 15:26:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:26:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:26:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:27:23 Session.ring:[10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.3.66:UP][10.244.4.51:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP]
2017/08/04 15:27:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:27:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:27:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:28:23 Session.ring:[10.244.3.66:UP][10.244.4.51:UP][10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN]
2017/08/04 15:28:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:28:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:28:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:29:23 Session.ring:[10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.3.66:UP][10.244.4.51:UP][10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP]
2017/08/04 15:29:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:29:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:29:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
^C

Hmm, it seems something is wrong with those UP nodes, right?

I stopped gocql and ran it again, and it reports:

2017/08/04 15:38:12 unable to dial "10.244.3.66": dial tcp 10.244.3.66:9042: i/o timeout
2017/08/04 15:38:12 gocql: Session.handleNodeDown: 10.244.3.66:9042
2017/08/04 15:38:12 Found invalid peer '[HostInfo connectAddress="<nil>" peer="10.244.4.51" rpc_address="10.244.4.51" broadcast_address="<nil>" port=9042 data_centre="dev" rack="test" host_id="1768dbdd-addd-442d-ab8c-fb52b126307d" version="v3.9.0" state=UP num_tokens=0]' Likely due to a gossip or snitch issue, this host will be ignored
2017/08/04 15:38:12 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/04 15:38:14 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/04 15:38:14 gocql: Session.handleNodeUp: 10.244.6.10:9042
2017/08/04 15:38:14 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/04 15:38:16 unable to dial "10.244.6.10": dial tcp 10.244.6.10:9042: i/o timeout
2017/08/04 15:38:16 gocql: Session.handleNodeUp: 10.244.3.71:9042
2017/08/04 15:38:16 gocql: Session.handleNodeDown: 10.244.6.10:9042
2017/08/04 15:38:16 gocql: Session.handleNodeUp: 10.244.3.57:9042
2017/08/04 15:38:18 unable to dial "10.244.3.57": dial tcp 10.244.3.57:9042: i/o timeout
2017/08/04 15:38:18 gocql: Session.handleNodeUp: 10.244.4.34:9042
2017/08/04 15:38:18 gocql: Session.handleNodeDown: 10.244.3.57:9042
2017/08/04 15:38:19 unable to dial "10.244.4.34": dial tcp 10.244.4.34:9042: i/o timeout
2017/08/04 15:38:19 gocql: Session.handleNodeDown: 10.244.4.34:9042
2017/08/04 15:38:19 gocql: Session.handleNodeUp: 10.244.4.36:9042
2017/08/04 15:38:21 unable to dial "10.244.4.36": dial tcp 10.244.4.36:9042: i/o timeout
2017/08/04 15:38:21 gocql: Session.handleNodeUp: 10.244.9.5:9042
2017/08/04 15:38:21 gocql: Session.handleNodeDown: 10.244.4.36:9042
2017/08/04 15:38:21 gocql: Session.handleNodeUp: 10.244.3.66:9042
2017/08/04 15:38:23 unable to dial "10.244.3.66": dial tcp 10.244.3.66:9042: i/o timeout
2017/08/04 15:38:23 gocql: Session.handleNodeUp: 10.244.2.81:9042
to create tables
2017/08/04 15:38:23 gocql: Session.handleNodeDown: 10.244.3.66:9042
2017/08/04 15:39:23 Session.ring:[10.244.6.10:DOWN][10.244.3.57:DOWN][10.244.4.36:DOWN][10.244.3.66:DOWN][10.244.2.81:UP][10.244.5.5:DOWN][10.244.3.71:UP][10.244.4.34:DOWN][10.244.9.5:UP]
2017/08/04 15:39:23 gocql: Session.handleNodeUp: 10.244.6.10:9042
2017/08/04 15:39:25 unable to dial "10.244.6.10": dial tcp 10.244.6.10:9042: i/o timeout

This is more like what nodetool status reports. So does this suggest something is wrong with Session.handleNodeUp?

Also, I found that kubernetes/kubernetes#49618 describes a better way to stop pods. I will change the StatefulSets, clean all data, and try adding and deleting pods.

@Zariel
Contributor

Zariel commented Aug 4, 2017

I think what's going on here is that your Cassandra cluster has stale nodes in gossip. Gocql will get a node up event, then refresh the ring, which returns the down nodes (the system tables do not include gossip state). The question here is how long gocql should keep trying to connect to downed nodes before they are removed from the local ring. If you do nodetool removenode <node>, does gocql drop the node from its cache? Gocql's local ring cache should match the output of nodetool status. Something else to do would be to add a source to the events, so that when the driver itself triggers a node up it is apparent and has a reason.

One issue I can see is that the ring describer won't remove nodes, which is what leads to the first logs you posted: https://github.com/gocql/gocql/blob/b96c067a43582b10f95d9e9dabb926483909908a/host_source.go#L663

What issue do you see when the driver is in this state?

@idealhack

Sorry, I'm not that familiar with Cassandra or gocql.

I think the main issue is that gocql somehow reports a ring containing some UP nodes which are actually DOWN. As time goes on, the number of such nodes keeps growing. Eventually:

2017/08/04 15:29:23 Session.ring:[10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.3.66:UP][10.244.4.51:UP][10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP]

But according to nodetool status, these nodes were DOWN all along. @Zariel are you suggesting this is because gocql is not removing these nodes?

Also, I have not tried nodetool removenode yet.

When I posted the first comment I thought these errors might lead to consistency problems, but it seems consistency is only affected by the consistency level.

@robdefeo

Since gocql does not try to reconnect, what is the best way to handle "gocql: no hosts available in the pool"?

@thrawn01
Contributor

@robdefeo We use gocql for our analytics engine at Mailgun. We currently restart the service once every 2 days to ensure the connection pool is full, and we monitor the pool by emitting metrics on its size (we modified gocql to achieve this). This is a temporary solution until we have enough time to formulate a full patch for gocql.

If I don't find time to work on a patch this quarter I'll be very unhappy. This has been a major pain point for us.

@guanw

guanw commented Apr 17, 2019

Hi folks, any updates on this story? We recently had a big outage that seems to be partially related to this error. I'm testing it locally and can see this error message showing up. Basically, what I tried is pausing the Cassandra Docker container and restarting it (to mimic the whole Cassandra cluster going down). gocql complains "no hosts available in the pool" because it doesn't recreate the session in this case. Are there any suggestions for this scenario? Do we need to manually recreate the session? I'm hesitant to do that because I suspect this error could happen in other scenarios as well, so recreating the session probably won't fit those cases. Could you please confirm? Thanks in advance.

@steebchen

I am using gocql in my Kubernetes cluster with a 3-node Cassandra setup. It works fine. However, when I want to test locally on my machine, I usually use kubectl port-forward xxx to be able to connect to the Cassandra cluster:

kubectl port-forward --namespace cassandra service/cassandra 9042:9042

gocql seems to have a problem with that, as it discovers the cluster but apparently wants to connect to the nodes directly:

2019/07/06 13:18:36 gocql: Session.handleNodeUp: 10.42.96.11:9042
2019/07/06 13:18:36 connection failed "10.42.96.11": dial tcp 10.42.96.11:9042: i/o timeout, reconnecting with *gocql.ConstantReconnectionPolicy
2019/07/06 13:18:38 connection failed "10.42.96.11": dial tcp 10.42.96.11:9042: i/o timeout, reconnecting with *gocql.ConstantReconnectionPolicy
2019/07/06 13:18:40 connection failed "10.42.96.11": dial tcp 10.42.96.11:9042: i/o timeout, reconnecting with *gocql.ConstantReconnectionPolicy
2019/07/06 13:18:41 unable to dial "10.42.96.11": dial tcp 10.42.96.11:9042: i/o timeout
2019/07/06 13:18:41 gocql: Session.handleNodeDown: 10.42.96.11:9042
2019/07/06 13:18:41 Server is running on:
http://localhost:4000
2019/07/06 13:18:41 Playground is available at:
http://localhost:4000/api/playground

10.42.96.11 is the pod IP inside the cluster, but obviously it is not reachable locally on my machine; only localhost:9042 is.

The weird thing is that after ~5 seconds of trying, my application starts and I can query my Cassandra cluster. I tried setting:

cluster.DisableInitialHostLookup = true
cluster.IgnorePeerAddr = true

That didn't help, though.

Also, after another 20-30 seconds, the node seems to start flapping up and down again:

2019/07/06 13:18:41 Server is running on:
http://localhost:4000
2019/07/06 13:18:41 Playground is available at:
http://localhost:4000/api/playground
2019/07/06 13:19:41 Session.ring:[10.42.96.11:DOWN][127.0.0.1:UP]
2019/07/06 13:19:41 gocql: Session.handleNodeUp: 10.42.96.11:9042
2019/07/06 13:19:42 connection failed "10.42.96.11": dial tcp 10.42.96.11:9042: i/o timeout, reconnecting with *gocql.ConstantReconnectionPolicy
2019/07/06 13:19:43 connection failed "10.42.96.11": dial tcp 10.42.96.11:9042: i/o timeout, reconnecting with *gocql.ConstantReconnectionPolicy
2019/07/06 13:19:45 connection failed "10.42.96.11": dial tcp 10.42.96.11:9042: i/o timeout, reconnecting with *gocql.ConstantReconnectionPolicy
2019/07/06 13:19:46 unable to dial "10.42.96.11": dial tcp 10.42.96.11:9042: i/o timeout
[10x the same message]

Is there anything I can do to prevent this? Locally it's totally fine if gocql only connects to a single node; it's just for development purposes, and as I said, gocql works perfectly fine when deployed in the production Kubernetes cluster.

@thrawn01
Contributor

thrawn01 commented Jul 21, 2019

@steebchen I've not tested your setup; I only use a single node locally for development. However, disabling the initial host lookup will only keep gocql from asking the control node (the first connection) about other nodes in the cluster. It will not keep the control node from telling the client (gocql) about changes to the status of other nodes in the cluster (STATUS_CHANGE events, host UP/DOWN events, etc.). If gocql receives an event about another node from the control node, it will attempt to connect to that node using the address provided in the event. This might be why, after a few seconds, gocql attempts to connect to another node and then warns that it's down when it can't connect: it's receiving cluster information about another node and attempting to connect to it. It's annoying, but it shouldn't affect anything. You should be able to send queries through the control node just fine.
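
For local development through a port-forward, something along these lines might keep the driver pinned to the forwarded address. This is only a sketch I haven't verified against your setup, using gocql's whitelist host filter:

package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1") // the kubectl port-forward endpoint
	cluster.DisableInitialHostLookup = true  // don't ask the control node for its peers
	cluster.IgnorePeerAddr = true
	// Only ever dial the forwarded address, even if the control node
	// advertises pod IPs that are unreachable from the workstation.
	cluster.HostFilter = gocql.WhiteListHostFilter("127.0.0.1")

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}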

@sanjimoh

We also ran into a similar issue recently with Cassandra deployed as a StatefulSet in a Kubernetes cluster.

A little detail about our setup -
Our Kubernetes cluster consists of 5 worker nodes hosting a 3-node Cassandra cluster. We have anti-affinity rules defined for Cassandra, which means all 3 Cassandra nodes run on different Kubernetes worker nodes for high availability.

On the Kubernetes cluster, Cassandra is exposed as a Kubernetes service. Go clients then connect to the Cassandra cluster through this Kubernetes service name, which is essentially a DNS name for the IP address of the running pod.

Now about the issue -
The issue shows up whenever a worker node hosting a Cassandra pod goes down. As expected, Kubernetes successfully reschedules that Cassandra pod onto an available worker node, and the pod successfully rejoins the Cassandra cluster.

In this scenario the Cassandra pod comes up with a different IP address but under the same DNS name. However, looking at the gocql documentation, there seems to be an assumption that users only pass IP addresses and never a DNS name, which is not workable in such setups: the moment a Cassandra node goes down and is restarted, it comes up with a different IP address but the same DNS name. Could it be that the gocql driver is unable to re-establish the connection because it is still trying to reconnect to the old IP address?

I feel such an assumption is not appropriate: if the driver is supplied with a DNS name, it should use that name when trying to reconnect.

This has already been resolved in the official Java Cassandra driver by the DataStax team. Here is the ticket for your reference.

So, could you please prioritize this ticket and help with the necessary corrections?
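
In the meantime we are considering a crude application-side workaround along these lines: periodically re-resolve the service DNS name and rebuild the session when the address set changes. This is only a sketch; the service name and polling interval are illustrative:

package main

import (
	"log"
	"net"
	"reflect"
	"sort"
	"time"

	"github.com/gocql/gocql"
)

// resolve returns the sorted set of IPs currently behind the service name.
func resolve(host string) []string {
	ips, err := net.LookupHost(host)
	if err != nil {
		log.Printf("lookup %s: %v", host, err)
		return nil
	}
	sort.Strings(ips)
	return ips
}

func main() {
	const host = "cassandra.default.svc.cluster.local" // illustrative service name

	cluster := gocql.NewCluster(host)
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	last := resolve(host)

	for range time.Tick(30 * time.Second) {
		ips := resolve(host)
		if len(ips) == 0 || reflect.DeepEqual(ips, last) {
			continue // lookup failed or nothing changed; keep the current session
		}
		log.Printf("DNS for %s changed %v -> %v, rebuilding session", host, last, ips)
		session.Close()
		if session, err = cluster.CreateSession(); err != nil {
			log.Printf("rebuild failed, will retry on next tick: %v", err)
			continue
		}
		last = ips
	}
}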

@alourie
Contributor

alourie commented Oct 15, 2019

Okay, I'll try to have a look. Considering we're beginning to work with K8s as well, this may come in handy sooner than expected.

@sanjimoh

@alourie thank you for looking at it!

Is there any timeline for when we could expect a resolution? Unfortunately, it's a critical need for us.

@alourie
Contributor

alourie commented Oct 18, 2019

@sanjimoh Sorry, it would be hard to put a timeline on it. I'm finishing up something else first, then will get to this, probably mid next week. From there it could take some time until I figure it out.

As I said, we need it too, so I won't delay it too much.

@sanjimoh

Hi, did you get a chance to look at this yet?

@alourie
Contributor

alourie commented Oct 30, 2019 via email

@alourie
Contributor

alourie commented Nov 13, 2019

I have some personal circumstances that won't allow me to look at this for a while. Sorry about that.

@vadalikrishna

Hi - we are facing the same problem described by @sanjimoh. Is there any resolution for this yet?

@elbek

elbek commented Jan 2, 2020

This happened today on my local machine when I forgot to close an Iter instance. I haven't run anything in prod with a quorum setup; I run locally with one node.

@sanjimoh

sanjimoh commented Jan 2, 2020

@alourie: Could this be worked on now? If not by you, perhaps by someone else among the library maintainers?

@cdent

cdent commented Nov 2, 2020

Of the people who have developed their own techniques to work around this, which of the two apparent strategies are you using?

  • The Java driver fix, which keeps the original cluster hostname around for reconnections, resolving it each time it is used
  • The more catastrophic fix: recreate the cluster session from scratch when "the problem" is noticed

Are there additional strategies?
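
To make the second option concrete, I mean something roughly like the sketch below. It assumes gocql.ErrNoConnections is the exported error behind the "no hosts available in the pool" message, and the contact point and statement are just examples:

package main

import (
	"errors"
	"log"

	"github.com/gocql/gocql"
)

// store wraps a session so it can be swapped out when the pool goes empty.
type store struct {
	cluster *gocql.ClusterConfig
	session *gocql.Session
}

// exec runs a statement and, if every host has dropped out of the pool,
// rebuilds the session from scratch and retries once. Not goroutine-safe;
// a real implementation would guard the session swap with a mutex.
func (s *store) exec(stmt string, values ...interface{}) error {
	err := s.session.Query(stmt, values...).Exec()
	if err == nil || !errors.Is(err, gocql.ErrNoConnections) {
		return err
	}
	log.Println("pool empty, recreating session from scratch")
	s.session.Close()
	fresh, cerr := s.cluster.CreateSession()
	if cerr != nil {
		return cerr
	}
	s.session = fresh
	return s.session.Query(stmt, values...).Exec()
}

func main() {
	cluster := gocql.NewCluster("cassandra.internal.example") // placeholder contact point
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	s := &store{cluster: cluster, session: session}
	if err := s.exec(`SELECT now() FROM system.local`); err != nil {
		log.Println(err)
	}
}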

@martin-sucha martin-sucha changed the title qocql: no hosts available in the pool gocql does not re-resolve DNS names Dec 22, 2021
@martin-sucha
Contributor

This is related to #1575, particularly #1575 (comment)

@vikage

vikage commented Nov 24, 2022

I faced this problem too. I worked around it by configuring a ConnectObserver on the ClusterConfig to listen for connection failures; when it detects an issue, I recreate the session.
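
Roughly like this; a simplified sketch of what I mean, where the failure threshold and the decision to rebuild are left to the application:

package main

import (
	"log"
	"sync/atomic"

	"github.com/gocql/gocql"
)

// connWatcher counts consecutive failed connection attempts so the
// application can decide when to tear down and recreate the session
// (for example because DNS now points at a new pod IP).
type connWatcher struct {
	consecutiveFailures int64
}

func (w *connWatcher) ObserveConnect(c gocql.ObservedConnect) {
	if c.Err != nil {
		n := atomic.AddInt64(&w.consecutiveFailures, 1)
		log.Printf("connect to %s failed (%d in a row): %v", c.Host.ConnectAddress(), n, c.Err)
		return
	}
	atomic.StoreInt64(&w.consecutiveFailures, 0)
}

func main() {
	watcher := &connWatcher{}
	cluster := gocql.NewCluster("cassandra.default.svc.cluster.local") // placeholder
	cluster.ConnectObserver = watcher

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Elsewhere, poll watcher.consecutiveFailures and recreate the session
	// (session.Close() followed by cluster.CreateSession()) once it crosses
	// whatever threshold makes sense for the deployment.
}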

@avelanarius

I think this issue can now be closed since b9737dd was merged.
