Reresolve DNS as fallback when all hosts are unreachable #254
Conversation
If all nodes in the cluster change their IPs at once, the driver used to be unable to ever contact the cluster again; the only solution was to restart the driver. This PR adds a fallback to the ControlConnection._reconnect_internal() logic so that when no known host is reachable, Cluster once again resolves all the known hostnames and ControlConnection tries to connect to them.
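For illustration, here is a minimal sketch of the fallback idea; resolve_contact_points and try_connect are hypothetical stand-ins, not the driver's actual internals:

```python
# Minimal sketch of the fallback; names here are illustrative,
# not the driver's real code.
import socket

def resolve_contact_points(contact_points, port=9042):
    """Re-run DNS for every configured hostname and return fresh IPs."""
    addresses = []
    for name in contact_points:
        try:
            for info in socket.getaddrinfo(name, port,
                                           socket.AF_UNSPEC, socket.SOCK_STREAM):
                addresses.append(info[4][0])
        except socket.gaierror:
            pass  # hostnames that no longer resolve are skipped
    return addresses

def reconnect(known_hosts, contact_points, try_connect):
    # First try every host the driver currently knows about.
    for host in known_hosts:
        conn = try_connect(host)
        if conn is not None:
            return conn
    # Fallback added by this PR: every known host failed, so
    # re-resolve the original hostnames and try the fresh IPs.
    for addr in resolve_contact_points(contact_points):
        conn = try_connect(addr)
        if conn is not None:
            return conn
    raise RuntimeError("no host reachable even after DNS re-resolution")
```

The design point is that re-resolution is a last resort: it only runs after every host the driver already knows about has been tried and failed.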
What's the use case where all nodes in the cluster change their IPs at the same time? I imagine a complete abrupt restart of the whole cluster while they are using DHCP - sounds rare, very rare. What other scenarios?
This PR is the python-driver equivalent of this gocql PR: apache/cassandra-gocql-driver#1708. There were some customer issues (example: https://github.com/scylladb/scylla-enterprise/issues/2212), so it is not so rare, I suppose.
Ok - I'm unsure, but I think what happened is: there was an initial list of DNS entries fed to the client. It worked. Then the nodes were replaced one by one. For some reason, while the driver learned about the new nodes via system_peers, when it disconnected it did not re-resolve the initial list of DNS entries. Since those nodes had in the meantime been replaced with different IPs, the connection failed. It happens when you, for example, feed the driver 3 hostnames and do a rolling replace of those nodes (I think!)
Okay, yes - and when they are replaced one by one, eventually all of them are replaced and we have to re-resolve the initial DNS entries. That is what I added with this PR.
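For illustration, the failure mode can be reproduced with a hypothetical setup like the following (the hostnames are made up; Cluster and contact_points are the driver's standard entry point):

```python
from cassandra.cluster import Cluster

# Hypothetical contact points; each hostname initially resolves
# to one of the three original nodes.
cluster = Cluster(contact_points=["node1.example.com",
                                  "node2.example.com",
                                  "node3.example.com"])
session = cluster.connect()

# A rolling replace now swaps out all three nodes, so every hostname
# resolves to a new IP. Before this PR, once the last connection to an
# old IP dropped, the driver never re-resolved the hostnames and could
# not reconnect; the only remedy was restarting the driver.
```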
LGTM
All DNS nodes are replaced for scale-up operations. If you have 3 nodes of instance type i3en.2xlarge and you scale up to i3en.6xlarge, the process is: add 3 i3en.6xlarge nodes, then decommission the i3en.2xlarge ones. All DNS nodes are replaced.
I thought we update DNS on every change.
We do, but if you are looking at an operational scenario whose complete process results in the replacement of every DNS entry, scale-up is an operational process that does that.
Not all at once - one by one, even if we replace all of them eventually. Which means the driver has time to re-learn the topology (from existing connections) or re-connect (if needed, when all connections have failed).
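For context, the re-learning mentioned here is the control connection refreshing topology from the cluster's peers table while at least one connection is still alive. A rough illustration, reusing the hypothetical session from the snippet above:

```python
# While any connection survives, the driver can discover the new
# nodes' addresses from the cluster itself, without touching DNS.
rows = session.execute("SELECT peer, rpc_address FROM system.peers")
for row in rows:
    print(row.peer, row.rpc_address)
```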
I ran the manual tests; they involved:
Fixes: #239