[bitnami/etcd] #1640

Closed · sboschman opened this issue Nov 21, 2019 · 3 comments · Fixed by #1650
@sboschman (Contributor)

Which chart:
bitnami/etcd version 4.4.3

Description
Deployed an etcd cluster with 3 replicas. If a pod restarts (e.g. triggered by a rolling update or a manual pod delete), it is unable to rejoin the cluster.

Describe the results you received:
The pod keeps restarting with the following output:

==> Detected data from previous deployments...
rm: cannot remove '/bitnami/etcd/member_removal.log': No such file or directory
==> Updating member in existing cluster...
2019-11-21 10:13:52.649453 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=new
...
2019-11-21 10:13:52.675047 E | etcdserver: the member has been permanently removed from the cluster
2019-11-21 10:13:52.675059 I | etcdserver: the data-dir used by this member must be removed.
2019-11-21 10:13:52.675092 E | etcdserver: publish error: etcdserver: request cancelled
2019-11-21 10:13:52.675106 I | etcdserver: aborting publish because server is stopped

Unexpected ETCD_INITIAL_CLUSTER_STATE=new, as the member should be joining an existing cluster.
Since the pod keeps getting restarted, I may have missed the log of the very first start, so the output above might only be from restart > 1.
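For comparison, a member re-joining an already-bootstrapped cluster would be expected to come up with something along these lines (hostnames taken from this deployment; the exact variables the chart sets may differ):

ETCD_INITIAL_CLUSTER_STATE=existing
ETCD_INITIAL_CLUSTER=etcd-0=http://etcd-0.etcd-headless.etcd.svc.cluster.local:2380,etcd-1=http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380,etcd-2=http://etcd-2.etcd-headless.etcd.svc.cluster.local:2380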

Describe the results you expected:
The cluster still has 2 healthy members, which satisfies the quorum of 2 for a 3-member cluster, so it should be able to recover on its own. I expect a rolling update to work, and the cluster to recover from a single member failure.
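For reference, the manual recovery path etcd documents for a member that has been permanently removed is roughly the following; the chart's startup and preStop scripts are supposed to automate these steps (member ID and names below are illustrative):

# From a healthy member: drop the old identity and register a new one
etcdctl member remove <old-member-id>
etcdctl member add etcd-1 \
  --peer-urls=http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380

# Then start the replacement with an empty data dir and
# ETCD_INITIAL_CLUSTER_STATE=existing (not "new").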

@juan131 (Contributor) commented Nov 22, 2019

Hi @sboschman

The problem is related to the regex that checks whether the member was successfully removed from the cluster or not. I just created a PR to address the issue. Thanks for reporting it.
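For context, the startup logic involved here does roughly the following; this is a simplified sketch of the flow, not the exact script shipped in the chart's configmap:

# Simplified sketch (paraphrased). member_removal.log is written when the
# preStop hook removes this member from the cluster on shutdown.
if [[ -d "${ETCD_DATA_DIR}/member" ]]; then
  echo "==> Detected data from previous deployments..."
  if grep -q "removed from cluster" /bitnami/etcd/member_removal.log 2>/dev/null; then
    # The member was removed on shutdown: wipe the stale data dir and
    # re-join the cluster as a brand new member.
    echo "==> Adding new member to existing cluster..."
    rm -rf "${ETCD_DATA_DIR:?}"/*
    etcdctl member add "${HOSTNAME}" \
      --peer-urls="http://${HOSTNAME}.etcd-headless.etcd.svc.cluster.local:2380"
  else
    # The member is still registered: keep its data and just update it.
    echo "==> Updating member in existing cluster..."
  fi
  rm -f /bitnami/etcd/member_removal.log
fi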

@sboschman (Contributor, Author)

An etcd pod restart is still unable to recover with the fix from #1650.

Cluster is up and running:

128da9111efa73eb, started, etcd-0, http://etcd-0.etcd-headless.etcd.svc.cluster.local:2380, http://etcd-0.etcd-headless.etcd.svc.cluster.local:2379, false
3ea5934f802e28a0, started, etcd-1, http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380, http://etcd-1.etcd-headless.etcd.svc.cluster.local:2379, false
8cbe679ed0a3c0ea, started, etcd-2, http://etcd-2.etcd-headless.etcd.svc.cluster.local:2380, http://etcd-2.etcd-headless.etcd.svc.cluster.local:2379, false
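(The member listings in this comment are the output of a command along these lines; endpoint flags trimmed in the listings above and below:)

etcdctl --endpoints=http://etcd-0.etcd-headless.etcd.svc.cluster.local:2379 member list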

Let's delete pod etcd-1 and see if it recovers:

==> Bash debug is on
==> Detected data from previous deployments...
http://etcd-0.etcd-headless.etcd.svc.cluster.local:2379 is healthy: successfully committed proposal: took = 19.947775ms
http://etcd-2.etcd-headless.etcd.svc.cluster.local:2379 is healthy: successfully committed proposal: took = 25.196435ms
Member 3ea5934f802e28a0 removed from cluster 62690dd5c1e46bad
==> Adding new member to existing cluster...
{"level":"warn","ts":"2019-11-25T16:08:07.919Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-2fedc309-f821-4316-af38-4357825b99b9/etcd-0.etcd-headless.etcd.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: <nil>

Crashloop restart:

==> Bash debug is on
==> Detected data from previous deployments...
http://etcd-0.etcd-headless.etcd.svc.cluster.local:2379 is healthy: successfully committed proposal: took = 4.35945ms
http://etcd-2.etcd-headless.etcd.svc.cluster.local:2379 is healthy: successfully committed proposal: took = 5.196322ms
grep: /bitnami/etcd/member_removal.log: No such file or directory
==> Adding new member to existing cluster...
{"level":"warn","ts":"2019-11-25T16:14:04.642Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-ace9cb84-e763-40a1-9c2d-d016ec08feb8/etcd-0.etcd-headless.etcd.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: Peer URLs already exists"}
Error: etcdserver: Peer URLs already exists

Cluster status (the member add from the first restart did go through on the server side, leaving an unstarted etcd-1 member behind, which is why the second add above fails with "Peer URLs already exists"):

128da9111efa73eb, started, etcd-0, http://etcd-0.etcd-headless.etcd.svc.cluster.local:2380, http://etcd-0.etcd-headless.etcd.svc.cluster.local:2379, false
19c1063d83341120, unstarted, , http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380, , true
8cbe679ed0a3c0ea, started, etcd-2, http://etcd-2.etcd-headless.etcd.svc.cluster.local:2380, http://etcd-2.etcd-headless.etcd.svc.cluster.local:2379, false

Meanwhile in the logs of pod etcd-0:

 etcdserver/membership: added member 19c1063d83341120 [http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380] to cluster 62690dd5c1e46bad
2019-11-25 16:07:37.913023 I | rafthttp: starting peer 19c1063d83341120...
2019-11-25 16:07:37.913048 I | rafthttp: started HTTP pipelining with peer 19c1063d83341120
2019-11-25 16:07:37.915077 I | rafthttp: started streaming with peer 19c1063d83341120 (writer)
2019-11-25 16:07:37.915179 I | rafthttp: started peer 19c1063d83341120
2019-11-25 16:07:37.915193 I | rafthttp: added peer 19c1063d83341120
2019-11-25 16:07:37.915215 I | rafthttp: started streaming with peer 19c1063d83341120 (writer)
2019-11-25 16:07:37.915238 I | rafthttp: started streaming with peer 19c1063d83341120 (stream MsgApp v2 reader)
2019-11-25 16:07:37.915517 I | rafthttp: started streaming with peer 19c1063d83341120 (stream Message reader)
2019-11-25 16:07:41.395274 W | etcdserver: failed to reach the peerURL(http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380) of member 19c1063d83341120 (Get http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380/version: dial tcp 10.42.6.90:2380: i/o timeout)
2019-11-25 16:07:41.395299 W | etcdserver: cannot get the version of member 19c1063d83341120 (Get http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380/version: dial tcp 10.42.6.90:2380: i/o timeout)
2019-11-25 16:07:42.915295 W | rafthttp: health check for peer 19c1063d83341120 could not connect: <nil>
2019-11-25 16:07:42.915660 W | rafthttp: health check for peer 19c1063d83341120 could not connect: <nil>
2019-11-25 16:07:45.399837 W | etcdserver: failed to reach the peerURL(http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380) of member 19c1063d83341120 (Get http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380/version: dial tcp 10.42.6.91:2380: connect: connection refused)
2019-11-25 16:07:45.399857 W | etcdserver: cannot get the version of member 19c1063d83341120 (Get http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380/version: dial tcp 10.42.6.91:2380: connect: connection refused)
2019-11-25 16:07:47.916164 W | rafthttp: health check for peer 19c1063d83341120 could not connect: dial tcp 10.42.6.90:2380: connect: no route to host
2019-11-25 16:07:47.916234 W | rafthttp: health check for peer 19c1063d83341120 could not connect: dial tcp 10.42.6.90:2380: i/o timeout
... keeps trying to connect to etcd-1, which is still offline

This behaviour looks a lot like the issue in etcd-io/etcd#11186. The latest chart deploys docker.io/bitnami/etcd:3.4.3-debian-9-r12, and the upstream fix is supposed to be in 3.4.2 (https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.4.md#v342-2019-10-11), so it should already be included in this image.
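For anyone stuck in the same state, a possible manual way out (not verified here; the member ID comes from the listing above, and the PVC name assumes the statefulset's default data-<pod> naming) is to drop the half-added member and clear the pod's persisted data so the startup script can register it again cleanly:

# From a healthy member, remove the stuck unstarted member
etcdctl --endpoints=http://etcd-0.etcd-headless.etcd.svc.cluster.local:2379 \
  member remove 19c1063d83341120

# Clear the pod's persisted data and restart it
# (PVC name assumed; adjust to the actual claim created by the statefulset)
kubectl delete pvc data-etcd-1
kubectl delete pod etcd-1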

@juan131 (Contributor) commented Nov 26, 2019

Hi @sboschman

Let's merge this issue into the one below, since other users are experiencing similar issues and we have an associated task for that issue:
