
dont reresolve dns address #1670

Open
wants to merge 3 commits into
base: trunk

sseidman
Contributor

@sseidman sseidman commented Jan 11, 2023

If a DNS address is used to establish a connection to a Cassandra cluster, a single HostInfo can end up with mismatched connect_address and broadcast_address values when the default HostDialer is in use. This can occur in the following code path:

  1. DNS resolved to IP addresses
  2. Hosts passed to establish control connection
  3. Session dials the host
  4. DialHost with default HostDialer
  5. Use the hostname as dial address instead of IP

At step five the HostInfo struct has the following form:

HostInfo hostname="my-cassandra-dns-address" connectAddress="10.128.189.205" peer="<nil>" 
rpc_address="<nil>" broadcast_address="<nil>" preferred_ip="<nil>" connect_addr="10.128.189.205" 
connect_addr_source="connect_address" port=9042 data_centre="" rack="" 
host_id="" version="v0.0.0" state=UP num_tokens=0

But when the client connects to the Cassandra cluster using the DNS address, it can reach any node in the cluster that the address resolves to. This can eventually result in the following:

HostInfo hostname="" connectAddress="10.128.189.205" peer="<nil>" rpc_address="10.128.89.255" 
broadcast_address="10.128.89.255" preferred_ip="<nil>" connect_addr="10.128.189.205" 
connect_addr_source="connect_address" port=9042 data_centre="us-east" rack="1c" 
host_id="b835f47c-caaf-464f-9f79-aa01eacfa512" version="v3.11.13" state=UP num_tokens=256

This HostInfo has connectAddress="10.128.189.205" and broadcast_address="10.128.89.255", where both IP addresses belong to nodes in the Cassandra cluster. The host_id of the HostInfo is that of the node whose IP equals the broadcast_address. As a result, although the connection is supposed to be established to the node whose IP address equals the broadcast_address, it is actually connected to the node whose IP is the connect_address. Additionally, the ring will now have duplicate hosts:

2023/01/11 19:10:49 Session.ring:[10.128.189.205:UP][10.128.93.223:UP][10.128.189.205:UP]

I believe this was brought about by the following change: #1632. Nodes were previously added and removed by their connect_address (contrary to the PR title), so, using the previous example, if 10.128.189.205 went down, both hosts were removed from the ring. Now, when 10.128.189.205 goes down, only the host whose broadcast_address matches will be removed. As a byproduct of replacing nodes in the cluster, it is therefore possible to end up in a state where a number of clients are attempting to connect to outdated IP addresses that are no longer part of the cluster and that the DNS address no longer resolves to either.
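To make the difference between the two removal strategies concrete, here is a small model of the ring from the log line above. It is a sketch, not gocql's ring implementation: `ringHost`, `removeWhere`, and the two removal helpers are hypothetical, and the broadcast addresses of the two correctly recorded hosts are assumed to equal their connect addresses.

```go
package main

import "fmt"

// ringHost is a hypothetical, minimal model of a ring entry.
type ringHost struct {
	connectAddress   string
	broadcastAddress string
}

// removeWhere returns a copy of ring without the entries that match.
func removeWhere(ring []ringHost, match func(ringHost) bool) []ringHost {
	kept := make([]ringHost, 0, len(ring))
	for _, h := range ring {
		if !match(h) {
			kept = append(kept, h)
		}
	}
	return kept
}

// removeByConnectAddress models the earlier behaviour described above.
func removeByConnectAddress(ring []ringHost, addr string) []ringHost {
	return removeWhere(ring, func(h ringHost) bool { return h.connectAddress == addr })
}

// removeByBroadcastAddress models the behaviour described after the change.
func removeByBroadcastAddress(ring []ringHost, addr string) []ringHost {
	return removeWhere(ring, func(h ringHost) bool { return h.broadcastAddress == addr })
}

// exampleRing builds the three-entry ring from the log line. The last
// entry is the mismatched host; the other broadcast addresses are assumed.
func exampleRing() []ringHost {
	return []ringHost{
		{connectAddress: "10.128.189.205", broadcastAddress: "10.128.189.205"},
		{connectAddress: "10.128.93.223", broadcastAddress: "10.128.93.223"},
		{connectAddress: "10.128.189.205", broadcastAddress: "10.128.89.255"},
	}
}

func main() {
	down := "10.128.189.205"
	fmt.Println("by connect_address:", removeByConnectAddress(exampleRing(), down))
	fmt.Println("by broadcast_address:", removeByBroadcastAddress(exampleRing(), down))
}
```

In this model, removing by connect_address drops both entries that point at the downed node, while removing by broadcast_address leaves the mismatched entry behind with a connect address that no longer leads anywhere useful.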

@sseidman
Contributor Author

sseidman commented Jan 20, 2023

I think the initial commit I proposed had the potential to cause an issue with WrapTLS, since the hostname can be used for SNI. The second commit instead makes sure to use the IP address that the control connection has actually connected to. This is needed because, when establishing the control connection, the HostDialer will connect via the DNS address instead of the connectAddress, potentially causing a mismatch between the IP in the host information parsed from localHostInfo here and the previously assigned connectAddress.
