
Reresolve DNS as fallback when all hosts are unreachable #254

Merged: 1 commit into scylladb:master, Sep 4, 2023

Conversation

@sylwiaszunejko (Collaborator) commented Aug 29, 2023

If all nodes in the cluster change their IPs at one time, the driver used to no longer be able to contact the cluster; the only solution was to restart the driver. A fallback is added to the ControlConnection _reconnect_internal() logic so that when no known host is reachable, Cluster once again resolves all the known hostnames and ControlConnection tries to connect to the resulting addresses.
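
Roughly, the new behavior looks like this (a minimal sketch with simplified, hypothetical names such as try_connect and resolve_contact_points, not the driver's actual internals):

```python
import socket

def resolve_contact_points(contact_points, port):
    """Resolve the configured hostnames to their current IP addresses."""
    addresses = []
    for name in contact_points:
        try:
            for *_, sockaddr in socket.getaddrinfo(
                    name, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
                addresses.append(sockaddr[0])
        except socket.gaierror:
            continue  # a name that does not resolve right now is skipped
    return addresses

def reconnect_internal(known_hosts, contact_points, port, try_connect):
    # First pass: try every host the driver already knows about.
    for host in known_hosts:
        conn = try_connect(host)
        if conn is not None:
            return conn
    # Fallback added by this PR: every known host failed, so the cached IPs
    # may all be stale. Re-resolve the original contact points and try the
    # freshly resolved addresses before giving up.
    for address in resolve_contact_points(contact_points, port):
        conn = try_connect(address)
        if conn is not None:
            return conn
    raise ConnectionError("no host reachable, even after re-resolving DNS")
```

The fallback only kicks in after every known host has been tried, so the normal reconnect path is unchanged as long as at least one cached address still works.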

I ran manual tests, which involved:

  • stopping systemd DNS service,
  • running custom local DNS service that maps hostnames to "old" IPs,
  • using a proxy on connections to all nodes, listening on "old" IPs,
  • running a crafted test that periodically sends queries,
  • breaking connections by stopping proxies,
  • changing DNS rules to resolve to new IPs,
  • reestablishing proxies on new IPs,
  • waiting until all pools get populated again,
  • asserting that it happens within a reasonable time (a rough sketch of this check follows the list).
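
The last two steps boiled down to a polling loop with a deadline, roughly like this (a simplified stand-in for the actual test code: it polls with a trivial query rather than inspecting the connection pools directly, and the example.com contact points are placeholders):

```python
import time
from cassandra.cluster import Cluster  # scylla python-driver

def assert_recovery(session, timeout=120, interval=1):
    """Keep sending a trivial query until it succeeds again, and fail if
    the driver does not recover within `timeout` seconds."""
    start = time.time()
    while True:
        try:
            session.execute("SELECT release_version FROM system.local")
            return time.time() - start  # recovered; report how long it took
        except Exception:
            if time.time() - start > timeout:
                raise AssertionError(f"driver did not recover within {timeout}s")
            time.sleep(interval)

# Hostnames (not raw IPs) as contact points, so the driver has something to
# re-resolve after all nodes change their IPs.
cluster = Cluster(contact_points=["node1.example.com", "node2.example.com"])
session = cluster.connect()
# ... break the proxies, flip DNS to the new IPs, re-establish the proxies ...
print(f"driver recovered after {assert_recovery(session):.1f}s")
```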

Fixes: #239

@mykaul commented Aug 29, 2023

What's the use case where all the nodes in the cluster change their IPs at the same time? I imagine a complete abrupt restart of the whole cluster plus DHCP, which sounds very rare. What other scenarios?

@sylwiaszunejko (Collaborator, Author)

This PR is the python-driver equivalent of the gocql PR apache/cassandra-gocql-driver#1708; there were some customer issues (example: https://github.com/scylladb/scylla-enterprise/issues/2212), so it is not that rare, I suppose.

@mykaul commented Aug 29, 2023


Ok - I'm unsure, but I think what happened is: there was an initial list of DNS entries fed to the client. It worked. Then nodes were replaced one by one. For some reason, while the driver learned of each new node via system.peers, when it disconnected it did not re-resolve the initial list of DNS entries. Since those nodes had in the meantime been replaced with different IPs, the connection failed. It happens when you, for example, feed the driver with 3 nodes and do a rolling replacement of those nodes (I think!)
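
To make the stale-DNS situation concrete (a hypothetical snippet using plain socket calls, not driver code; the example.com names are placeholders): after such a rolling replacement the configured hostnames resolve to addresses the driver never dialed, while every address it cached at startup is gone.

```python
import socket

def current_ips(hostname, port=9042):
    """Return the set of IPs a hostname currently resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(
        hostname, port, socket.AF_UNSPEC, socket.SOCK_STREAM)}

contact_points = ["node1.example.com", "node2.example.com", "node3.example.com"]

# Snapshot of what the names resolved to when the driver first connected.
ips_at_startup = {cp: current_ips(cp) for cp in contact_points}

# ... nodes are replaced one by one and DNS is updated after each step ...

# Eventually every name points somewhere else; once the last old connection
# drops, the driver has nothing valid left to dial unless it re-resolves.
ips_now = {cp: current_ips(cp) for cp in contact_points}
stale = [cp for cp in contact_points if ips_at_startup[cp] != ips_now[cp]]
print("contact points whose IPs changed:", stale)
```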

@sylwiaszunejko (Collaborator, Author)


Okay, yes, and when they are replaced one by one, eventually all of them are replaced and we have to re-resolve the initial DNS entries; that is what I added with this PR.

@fruch left a comment

LGTM

@fruch merged commit d735957 into scylladb:master on Sep 4, 2023
13 checks passed
@pdbossman

What's the use case where all the nodes in the cluster change their IPs at the same time? I imagine a complete abrupt restart of the whole cluster plus DHCP, which sounds very rare. What other scenarios?

All of the nodes behind the DNS entries are replaced during scale-up operations. If you have 3 nodes on i3en.2xlarge instances and you scale up to i3en.6xlarge, the process is to add 3 i3en.6xlarge nodes and then decommission the i3en.2xlarge nodes. Every node behind the DNS entries ends up replaced.

@mykaul commented Dec 6, 2023

I thought we update DNS on every change.

@pdbossman

We do, but if you are looking at an operational scenario whose complete process results in replacing every node behind the DNS entries, scale-up is an operational process that does exactly that.

@mykaul commented Dec 6, 2023

Not all at once. One by one, even if we replace all of them eventually. That means the driver has time to re-learn the topology (from existing connections) or re-connect (if needed, and if all connections have failed).
