Rolling restart of k8s db cluster causes all nodes to remain down and not recover #1582

Open
justinfx opened this issue Oct 4, 2021 · 9 comments


justinfx commented Oct 4, 2021

Please answer these questions before submitting your issue. Thanks!

What version of Cassandra are you using?

N/A (Yugabyte 2.9.0)

What version of Gocql are you using?

gocql@v0.0.0-20200602185649-ef3952a45ff4
(https://github.com/yugabyte/gocql)

What version of Go are you using?

1.17.0

What did you do?

Scale all database nodes in the cluster down from 3 -> 0 and then back up from 0 -> 3.

What did you expect to see?

I expected the gocql client to log the connection failures for the 3 database node IP addresses, but then recover with the 3 new IP addresses of the 3 new database pods.

What did you see instead?

The client continues to see the original 3 node IPs as DOWN and never recovers with the new nodes.


Refs #1575, yugabyte/yugabyte-db#10182

Because the forked client I am using is outdated, it is possible that this issue is resolved in a newer release of gocql, or that it is still outstanding as per #1575.

My connection host list looks like the following:

yb-tserver-0.yb-tservers.db-namespace.svc.cluster.local
yb-tserver-1.yb-tservers.db-namespace.svc.cluster.local
yb-tserver-2.yb-tservers.db-namespace.svc.cluster.local

In k8s these are 3 unique IP addresses in the cluster. When I scale down and back up, the pods have new IP addresses.
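For reference, the session is created roughly like the sketch below (a minimal sketch using the standard gocql API with otherwise default options; only the host list comes from this report, the port is an assumption). The stable per-pod DNS names are resolved to pod IPs when the session is created, and it is those resolved IPs that go stale after a full restart.

package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// The three stable per-pod DNS names; the pod IPs behind them change
	// whenever the StatefulSet is scaled down and back up.
	cluster := gocql.NewCluster(
		"yb-tserver-0.yb-tservers.db-namespace.svc.cluster.local",
		"yb-tserver-1.yb-tservers.db-namespace.svc.cluster.local",
		"yb-tserver-2.yb-tservers.db-namespace.svc.cluster.local",
	)
	cluster.Port = 9042

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}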

I am attaching the client application logs, using the build tag gocql_debug, showing two different scenarios.

The first test scales from 3 -> 2 and then back up from 2 -> 3. This log shows that the client does seem to handle the old IP going down and the new IP becoming part of the ring again:
cql_scale_1.log

The second test scales all 3 nodes -> 0 and then back up to 3 again, using a policy that scales the nodes one at a time. The log shows the nodes going down, but the client never recovers or shows any sign of the new nodes:
cql_scale_all.log

If I manually kill the database pods, depending on the order I choose, I may or may not see this topology event:

{"level":"info","ts":"2021-10-04T21:14:26.465Z","caller":"gocql@v0.0.0-20200602185649-ef3952a45ff4/events.go:91","msg":"gocql: handling frame: [topology_change change=NEW_NODE host=10.42.9.48 port=9042]"}

justinfx commented Oct 4, 2021

@martin-sucha

@justinfx

Because the forked client I am using is outdated, it is possible that this issue is resolved in a newer release of gocql, or that it is still outstanding as per #1575.

Just an update to this statement: per yugabyte/yugabyte-db#10182, it is still an active issue even after the forked client was brought up to date with upstream gocql.


mr-andreas commented Nov 11, 2022

We're also hitting this in our production environment when doing rolling restarts of our Cassandra cluster. The restart causes all nodes to get new IP addresses, which gocql does not detect. The problem can be consistently reproduced by following the steps below. Since the steps involve some manual routing configuration, they will probably only work on Linux.

Create a minikube cluster:

$ minikube start --cni calico

Deploy a 3 node cassandra cluster to minikube:

$ cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: cassandra
  labels:
    app: cassandra
spec:
  ports:
  - port: 9042
    name: cql
  type: ClusterIP
  selector:
    app: cassandra
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  serviceName: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      terminationGracePeriodSeconds: 1800
      containers:
      - name: cassandra
        image: cassandra:3.11.14
        readinessProbe:
          tcpSocket:
            port: 9042
        env:
          - name: CASSANDRA_SEEDS
            value: cassandra-0,cassandra-0.cassandra
        ports:
          - containerPort: 9042
            name: cql
          - containerPort: 7000
            name: intra-node
        lifecycle:
          preStop:
            exec:
              command: [ "nodetool", "drain" ]
        volumeMounts:
          - name: data
            mountPath: /var/lib/cassandra
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: standard
        resources:
          requests:
            storage: 10Gi
EOF

Wait for Cassandra to start. Progress can be monitored by running:

$ kubectl get pod -l app=cassandra

Once all pods are marked as running, we are good to go:

NAME          READY   STATUS    RESTARTS       AGE
cassandra-0   1/1     Running   0              2m28s
cassandra-1   1/1     Running   2 (2m2s ago)   2m8s
cassandra-2   1/1     Running   0              48s

On my system, cassandra-1 restarts a couple of times before bootstrapping, as the seed node cassandra-0 is not yet ready. This doesn't matter; eventually it boots and finds the seed node.

Now we need to add a route to the system so that we can directly connect to the
cassandra nodes from our host machine:

$ sudo route add -net $(kubectl cluster-info dump | grep -m 1 cluster-cidr | cut -d = -f 2 | cut -d '"' -f 1) gw $(minikube ip)

After this, we can run the following test program:

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"

	"github.com/gocql/gocql"
)

type connectObs struct{}

func (connectObs) ObserveConnect(c gocql.ObservedConnect) {
	if c.Err == nil {
		fmt.Printf(
			"Connected to host %s in %s\n",
			c.Host.HostnameAndPort(), c.End.Sub(c.Start),
		)
	} else {
		fmt.Printf(
			"Failed connecting to host %s after %s: %s\n",
			c.Host.HostnameAndPort(), c.End.Sub(c.Start), c.Err,
		)
	}
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("Supply a cassandra host")
		os.Exit(1)
	}

	conf := gocql.NewCluster(os.Args[1])
	conf.ProtoVersion = 4
	// Register the observer defined above so connection attempts are reported.
	conf.ConnectObserver = connectObs{}

	_, err := gocql.NewSession(*conf)
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}

	fmt.Println("connected")

	// Block until interrupted; signal.Notify requires a buffered channel.
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGINT, syscall.SIGTERM)
	<-ch
}

Now we can start the test program and connect to a node:

$ go run -tags gocql_debug . $(kubectl get pod cassandra-0 --template '{{.status.podIP}}')
2022/11/11 10:28:14 gocql: Session.handleNodeConnected: 10.244.120.67:9042
2022/11/11 10:28:14 gocql: conns of pool after stopped "10.244.120.67": 2
2022/11/11 10:28:14 gocql: Session.handleNodeConnected: 10.244.120.68:9042
connected
2022/11/11 10:28:14 gocql: Session.handleNodeConnected: 10.244.120.69:9042
2022/11/11 10:28:14 gocql: conns of pool after stopped "10.244.120.68": 2
2022/11/11 10:28:14 gocql: conns of pool after stopped "10.244.120.69": 2

We can see that all three nodes were connected. Now we will restart one of the
nodes:

$ kubectl delete pod cassandra-2
pod "cassandra-2" deleted

It will get a new IP. gocql should discover and connect to the new node, but
this does not happen. We see the following log messages from the program:

2022/11/11 10:28:40 gocql: pool connection error "10.244.120.69:9042": read tcp 192.168.49.1:45340->10.244.120.69:9042: read: connection reset by peer
2022/11/11 10:28:40 gocql: pool connection error "10.244.120.69:9042": read tcp 192.168.49.1:45356->10.244.120.69:9042: read: connection reset by peer
2022/11/11 10:28:40 gocql: unable to dial "[HostInfo hostname=\"10.244.120.69\" connectAddress=\"10.244.120.69\" peer=\"10.244.120.69\" rpc_address=\"10.244.120.69\" broadcast_address=\"<nil>\" preferred_ip=\"<nil>\" connect_addr=\"10.244.120.69\" connect_addr_source=\"connect_address\" port=9042 data_centre=\"datacenter1\" rack=\"rack1\" host_id=\"e9d19d2b-c438-4714-a079-3ee7a024cc15\" version=\"v3.11.14\" state=UP num_tokens=256]": dial tcp 10.244.120.69:9042: connect: connection refused
2022/11/11 10:28:40 gocql: filling stopped "10.244.120.69": dial tcp 10.244.120.69:9042: connect: connection refused
2022/11/11 10:28:40 gocql: handling frame: [status_change change=DOWN host=10.244.120.69 port=9042]
2022/11/11 10:28:40 gocql: conns of pool after stopped "10.244.120.69": 0
2022/11/11 10:28:40 gocql: Session.handleNodeDown: 10.244.120.69:9042
2022/11/11 10:28:41 gocql: dispatching event: &{change:DOWN host:[10 244 120 69] port:9042}
2022/11/11 10:28:41 gocql: Session.handleNodeDown: 10.244.120.69:9042
2022/11/11 10:28:58 gocql: handling frame: [topology_change change=NEW_NODE host=10.244.120.70 port=9042]
2022/11/11 10:28:58 gocql: handling frame: [status_change change=UP host=10.244.120.70 port=9042]
2022/11/11 10:28:59 gocql: dispatching event: &{change:UP host:[10 244 120 70] port:9042}
2022/11/11 10:28:59 gocql: Session.handleNodeUp: 10.244.120.70:9042
2022/11/11 10:29:14 Session.ring:[10.244.120.68:UP][10.244.120.67:UP][10.244.120.69:DOWN]
2022/11/11 10:29:15 gocql: connection failed "10.244.120.69": dial tcp 10.244.120.69:9042: i/o timeout, reconnecting with *gocql.ConstantReconnectionPolicy

gocql misses the message [topology_change change=NEW_NODE host=10.244.120.70 port=9042]
and keeps trying to connect to the old node IP at 10.244.120.69. I think there are
several contributing causes to this bug. The first is that we accidentally
drop the topology_change message. At https://github.com/gocql/gocql/blob/v1.2.1/events.go#L144
we see that all frames are grouped by their host ID. From our logs above, we see
that we get two frames at the same time, about the same host:

2022/11/11 10:28:58 gocql: handling frame: [topology_change change=NEW_NODE host=10.244.120.70 port=9042]
2022/11/11 10:28:58 gocql: handling frame: [status_change change=UP host=10.244.120.70 port=9042]

This will cause the topology_change to be overwritten with the status_change.
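To illustrate, here is a simplified stand-in for that grouping behaviour (illustrative types only, not the real gocql code in events.go): keying the buffered events by host means the second frame for 10.244.120.70 silently replaces the first.

package main

import "fmt"

// frame is a simplified stand-in for the events gocql buffers before
// dispatching them.
type frame struct {
	kind string // e.g. "topology_change NEW_NODE" or "status_change UP"
	host string
}

// dedupeByHost mimics grouping buffered events by host: a later frame for
// the same host overwrites an earlier one.
func dedupeByHost(frames []frame) map[string]frame {
	byHost := make(map[string]frame)
	for _, f := range frames {
		byHost[f.host] = f
	}
	return byHost
}

func main() {
	frames := []frame{
		{kind: "topology_change NEW_NODE", host: "10.244.120.70"},
		{kind: "status_change UP", host: "10.244.120.70"},
	}
	// Only the status_change survives; the NEW_NODE event is dropped.
	fmt.Println(dedupeByHost(frames))
}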

I tried patching this so that the message doesn't get dropped, but the bug still
persists. The message we receive when the node goes down
([status_change change=DOWN host=10.244.120.69 port=9042]) does not actually
remove the node from our host list. This means that when attempting to add the
host at the new address at
https://github.com/gocql/gocql/blob/9cf1e71082de020c21f52a2f77803c4688f578f8/host_source.go#L684
it has no effect, as the old node is still in the list.
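A heavily simplified, hypothetical sketch of that second problem (the types and names below are illustrative, not the actual host_source.go code): a DOWN event only flips the host's state, and the add path treats a host that is already in the map as known, so the stale address is never replaced.

package main

import "fmt"

// hostState, host and ring are illustrative types, not gocql's.
type hostState int

const (
	stateUp hostState = iota
	stateDown
)

type host struct {
	hostID string
	addr   string
	state  hostState
}

type ring struct {
	hosts map[string]*host // keyed by host ID
}

// markDown models a status_change DOWN event: the host stays in the ring,
// only its state changes.
func (r *ring) markDown(hostID string) {
	if h, ok := r.hosts[hostID]; ok {
		h.state = stateDown
	}
}

// addOrUpdate models adding the host discovered at the new address. If the
// host ID is already present, the existing entry (with its old address) is
// kept, so reconnects keep targeting the stale IP.
func (r *ring) addOrUpdate(h *host) *host {
	if existing, ok := r.hosts[h.hostID]; ok {
		return existing
	}
	r.hosts[h.hostID] = h
	return h
}

func main() {
	r := &ring{hosts: map[string]*host{
		"e9d19d2b": {hostID: "e9d19d2b", addr: "10.244.120.69", state: stateUp},
	}}
	r.markDown("e9d19d2b")
	// The restarted node returns with the same host ID but a new address.
	got := r.addOrUpdate(&host{hostID: "e9d19d2b", addr: "10.244.120.70", state: stateUp})
	fmt.Println(got.addr) // still 10.244.120.69: the old entry wins
}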

Hope this helps. I'd be happy to assist further in fixing the bug :)

@mr-andreas

I think this broke at 64cda7b. If I revert to the commit before that (ae365fa), the restarted Cassandra node is detected with its new IP:

2022/11/11 11:37:02 gocql: Session.handleNodeConnected: 10.244.120.67:9042
2022/11/11 11:37:02 gocql: Session.handleNodeConnected: 10.244.120.68:9042
2022/11/11 11:37:02 gocql: conns of pool after stopped "10.244.120.67": 2
2022/11/11 11:37:02 gocql: conns of pool after stopped "10.244.120.68": 2
connected
2022/11/11 11:37:02 gocql: Session.handleNodeConnected: 10.244.120.72:9042
2022/11/11 11:37:02 gocql: conns of pool after stopped "10.244.120.72": 2

(running kubectl delete pod cassandra-2 here)

2022/11/11 11:37:10 gocql: pool connection error "10.244.120.72:9042": read tcp 192.168.49.1:43828->10.244.120.72:9042: read: connection reset by peer
2022/11/11 11:37:10 gocql: pool connection error "10.244.120.72:9042": read tcp 192.168.49.1:43834->10.244.120.72:9042: read: connection reset by peer
2022/11/11 11:37:10 gocql: unable to dial "10.244.120.72": dial tcp 10.244.120.72:9042: connect: connection refused
2022/11/11 11:37:10 gocql: filling stopped "10.244.120.72": dial tcp 10.244.120.72:9042: connect: connection refused
2022/11/11 11:37:10 gocql: handling frame: [status_change change=DOWN host=10.244.120.72 port=9042]
2022/11/11 11:37:10 gocql: conns of pool after stopped "10.244.120.72": 0
2022/11/11 11:37:10 gocql: Session.handleNodeDown: 10.244.120.72:9042
2022/11/11 11:37:11 gocql: dispatching event: &{change:DOWN host:[10 244 120 72] port:9042}
2022/11/11 11:37:11 gocql: Session.handleNodeDown: 10.244.120.72:9042

(new cassandra node becomes ready)

2022/11/11 11:37:29 gocql: handling frame: [topology_change change=NEW_NODE host=10.244.120.73 port=9042]
2022/11/11 11:37:29 gocql: handling frame: [status_change change=UP host=10.244.120.73 port=9042]
2022/11/11 11:37:30 gocql: dispatching event: &{change:UP host:[10 244 120 73] port:9042}
2022/11/11 11:37:30 gocql: Session.handleNodeUp: 10.244.120.73:9042
2022/11/11 11:37:30 gocql: Session.handleNodeConnected: 10.244.120.73:9042
2022/11/11 11:37:30 gocql: conns of pool after stopped "10.244.120.73": 2
2022/11/11 11:38:02 Session.ring:[10.244.120.67:UP][10.244.120.68:UP][10.244.120.73:UP]

The topology_change frame is still dropped, but the new node is detected and added anyway due to the status_change message.

@justinfx

I'm surprised this hasn't affected more users and gotten more attention. That suggests most people are not running on a container-orchestrated platform; maybe the most common deployments are VMs or bare metal?


koovee commented Nov 15, 2022

We have the same problem. I just noticed it when performing disaster recovery in a development environment (restoring a Medusa backup, which causes a datacenter shutdown / restart).

Running k8ssandra-operator 0.38.2 (Cassandra 4.0.1) and gocql v1.2.1.

I was planning to work around this by restarting the pod(s) that have problems connecting to Cassandra, since these errors could also be caused by genuine network problems. A never-ending reconnect loop in the pod is not a good idea; maybe a few reconnect attempts and then a change of the pod's health check response, so that K8s kills the pod and re-schedules it, as sketched below.
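One possible shape for that workaround (purely illustrative; the handler, helper, threshold and port below are all made up): count consecutive Cassandra failures and fail the pod's health endpoint once a limit is exceeded, so Kubernetes restarts the pod and the client reconnects with a fresh host list.

package main

import (
	"net/http"
	"sync/atomic"
)

// maxConsecutiveFailures is an arbitrary threshold for this sketch.
const maxConsecutiveFailures = 5

var consecutiveFailures int64

// recordQueryResult would be called by the application after each
// Cassandra query attempt.
func recordQueryResult(err error) {
	if err != nil {
		atomic.AddInt64(&consecutiveFailures, 1)
		return
	}
	atomic.StoreInt64(&consecutiveFailures, 0)
}

// healthz is wired up as the pod's liveness probe. Once too many
// consecutive failures are seen, it reports unhealthy and Kubernetes
// kills and re-schedules the pod.
func healthz(w http.ResponseWriter, r *http.Request) {
	if atomic.LoadInt64(&consecutiveFailures) >= maxConsecutiveFailures {
		http.Error(w, "cassandra unreachable", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/healthz", healthz)
	_ = http.ListenAndServe(":8080", nil)
}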


ufoot commented Jan 4, 2023

I'm surprised this hasn't affected more users and gotten more attention. That suggests most people are not running on a container-orchestrated platform; maybe the most common deployments are VMs or bare metal?

Hi @justinfx, for context, we are experiencing a very similar situation here. We are running Cassandra 3.11.13 on many (hundreds of) clusters, and following a recent trend (on our side) of node replacements, with new IPs etc., we are seeing this happen daily. It is complex for us to reproduce, though, and we're working on a small isolated use case that could expose the problem. Our GoCQL version is 1.3.0.

The problem is definitely real; we witness it routinely and have thousands of Cassandra nodes running on k8s.

I also think #1667 is related.

We've been studying the GoCQL code on our side but have not identified the root cause yet. What we witness is basically clients dying with the error *errors.errorString: gocql: no hosts available in the pool after all nodes in a cluster have been replaced. So somewhere, the driver is losing track of topology changes. We may never have stumbled on the issue before because we were not rotating nodes so aggressively, or because the problem was introduced by some (recent?) library change as suggested in this thread. Or both.

As I mentioned, we're trying to reproduce the problem on a smaller scale.


justinfx commented Jan 4, 2023

@ufoot I'm glad to hear others are looking at this and trying to narrow down repros. You seem to have quite a massive deployment, whereas my repro focused on what might be equivalent to an outage with all nodes restarting too quickly.
yugabyte/yugabyte-db#10182

If I recall correctly, I did observe a slow but successful topology update when nodes were restarted one at a time with enough time in between. There is a window on the client side where it still has the old node IP and can't yet connect to the new one. I'm mostly concerned about unexpected situations where the nodes of a small cluster restart too quickly, leaving the cluster inaccessible from a running client because of the stale client pool.

@mikelococo

@ufoot, @sseidman and I all work together and have spent time investigating this issue. We've filed #1668 to document our findings. After initially thinking we were seeing a manifestation of what's reported here, I now think we're dealing with something different, for the following reasons:

  • sseidman has bisected this problem down to "Identify nodes by their host_id instead broadcasted_address" (#1632), which I believe was released in GoCQL 1.2.0. This issue was first reported against gocql@v0.0.0-20200602185649-ef3952a45ff4 (or maybe a fork thereof?). That commit didn't exist at the time this issue was first filed.
  • Our issue is being triggered by replacements via -Dcassandra.replace_address_first_boot=<dead_node_ip>, or at least that's what our repro is testing for. This issue, by contrast, seems to be about pod restarts (I think... there's a lot of scroll and it's been a while since I read this one top to bottom).

It may be that there's more than one underlying issue that can cause some kind of topology corruption in gocql, but it feels to me like the acute issue we're currently experiencing is likely to be different from the one originally described here.
