
Reresolve DNS as fallback when all hosts are unreachable #254

Merged: 1 commit into scylladb:master, Sep 4, 2023

Conversation

@sylwiaszunejko (Collaborator) commented Aug 29, 2023

If all nodes in the cluster change their IPs at one time, the driver used to no longer be able to contact the cluster; the only solution was to restart the driver. A fallback is added to the ControlConnection _reconnect_internal() logic so that when no known host is reachable, Cluster once again resolves all the known hostnames and ControlConnection tries to connect to the resulting addresses.
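
Roughly, the new behavior looks like this (a minimal sketch with simplified, hypothetical names such as try_connect and resolve_contact_points, not the driver's actual internals):

```python
import socket

def resolve_contact_points(contact_points, port):
    """Resolve the configured hostnames to their current IP addresses."""
    addresses = []
    for name in contact_points:
        try:
            for *_, sockaddr in socket.getaddrinfo(
                    name, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
                addresses.append(sockaddr[0])
        except socket.gaierror:
            continue  # a name that does not resolve right now is skipped
    return addresses

def reconnect_internal(known_hosts, contact_points, port, try_connect):
    # First pass: try every host the driver already knows about.
    for host in known_hosts:
        conn = try_connect(host)
        if conn is not None:
            return conn
    # Fallback added by this PR: every known host failed, so the cached IPs
    # may all be stale. Re-resolve the original contact points and try the
    # freshly resolved addresses before giving up.
    for address in resolve_contact_points(contact_points, port):
        conn = try_connect(address)
        if conn is not None:
            return conn
    raise ConnectionError("no host reachable, even after re-resolving DNS")
```

The fallback only kicks in after every known host has been tried, so the normal reconnect path is unchanged as long as at least one cached address still works.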

I ran manual tests, which involved:

  • stopping systemd DNS service,
  • running custom local DNS service that maps hostnames to "old" IPs,
  • using a proxy on connections to all nodes, listening on "old" IPs,
  • running a crafted test that periodically sends queries,
  • breaking connections by stopping proxies,
  • changing DNS rules to resolve to new IPs,
  • reestablishing proxies on new IPs,
  • waiting until all pools get populated again,
  • asserting that it happens within a reasonable time (a rough sketch of this check follows the list).
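
The last two steps boiled down to a polling loop with a deadline, roughly like this (a simplified stand-in for the actual test code: it polls with a trivial query rather than inspecting the connection pools directly, and the example.com contact points are placeholders):

```python
import time
from cassandra.cluster import Cluster  # scylla python-driver

def assert_recovery(session, timeout=120, interval=1):
    """Keep sending a trivial query until it succeeds again, and fail if
    the driver does not recover within `timeout` seconds."""
    start = time.time()
    while True:
        try:
            session.execute("SELECT release_version FROM system.local")
            return time.time() - start  # recovered; report how long it took
        except Exception:
            if time.time() - start > timeout:
                raise AssertionError(f"driver did not recover within {timeout}s")
            time.sleep(interval)

# Hostnames (not raw IPs) as contact points, so the driver has something to
# re-resolve after all nodes change their IPs.
cluster = Cluster(contact_points=["node1.example.com", "node2.example.com"])
session = cluster.connect()
# ... break the proxies, flip DNS to the new IPs, re-establish the proxies ...
print(f"driver recovered after {assert_recovery(session):.1f}s")
```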

Fixes: #239

@mykaul commented Aug 29, 2023

What's the use case where all the nodes in the cluster change their IPs at the same time? I imagine a complete abrupt restart of the whole cluster plus DHCP, which sounds very rare. What other scenarios?

@sylwiaszunejko (Collaborator, Author)

This PR is the python-driver equivalent of the gocql PR apache/cassandra-gocql-driver#1708; there were some customer issues (example: https://github.com/scylladb/scylla-enterprise/issues/2212), so it is not that rare, I suppose.

@mykaul commented Aug 29, 2023


Ok - I'm unsure, but I think what happened is: there was an initial list of DNS entries fed to the client. It worked. Then nodes were replaced one by one. For some reason, while the driver learned of each new node via system.peers, when it disconnected it did not re-resolve the initial list of DNS entries. Since those nodes had in the meantime been replaced with different IPs, the connection failed. It happens when you, for example, feed the driver with 3 nodes and do a rolling replacement of those nodes (I think!)
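
To make the stale-DNS situation concrete (a hypothetical snippet using plain socket calls, not driver code; the example.com names are placeholders): after such a rolling replacement the configured hostnames resolve to addresses the driver never dialed, while every address it cached at startup is gone.

```python
import socket

def current_ips(hostname, port=9042):
    """Return the set of IPs a hostname currently resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(
        hostname, port, socket.AF_UNSPEC, socket.SOCK_STREAM)}

contact_points = ["node1.example.com", "node2.example.com", "node3.example.com"]

# Snapshot of what the names resolved to when the driver first connected.
ips_at_startup = {cp: current_ips(cp) for cp in contact_points}

# ... nodes are replaced one by one and DNS is updated after each step ...

# Eventually every name points somewhere else; once the last old connection
# drops, the driver has nothing valid left to dial unless it re-resolves.
ips_now = {cp: current_ips(cp) for cp in contact_points}
stale = [cp for cp in contact_points if ips_at_startup[cp] != ips_now[cp]]
print("contact points whose IPs changed:", stale)
```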

@sylwiaszunejko (Collaborator, Author)


Okay, yes, and when they are replaced one by one, eventually all of them are replaced and we have to re-resolve the initial DNS entries; that is what I added with this PR.

@fruch left a comment

LGTM

@fruch merged commit d735957 into scylladb:master on Sep 4, 2023
13 checks passed
@pdbossman

What's the use case where all the nodes in the cluster change their IPs at the same time? I imagine a complete abrupt restart of the whole cluster plus DHCP, which sounds very rare. What other scenarios?

All of the nodes behind the DNS entries are replaced during scale-up operations. If you have 3 nodes on i3en.2xlarge instances and you scale up to i3en.6xlarge, the process is to add 3 i3en.6xlarge nodes and then decommission the i3en.2xlarge nodes. Every node behind the DNS entries ends up replaced.

@mykaul commented Dec 6, 2023

I thought we update DNS on every change.

@pdbossman

We do, but if you are looking at an operational scenario whose complete process results in replacing every node behind the DNS entries, scale-up is an operational process that does exactly that.

@mykaul commented Dec 6, 2023

Not all at once. One by one, even if we replace all of them eventually. That means the driver has time to re-learn the topology (from existing connections) or re-connect (if needed, and if all connections have failed).
