[bitnami/etcd] #1640

Closed · sboschman opened this issue Nov 21, 2019 · 3 comments · Fixed by #1650
@sboschman (Contributor)

Which chart:
bitnami/etcd version 4.4.3

Description
Deployed an etcd cluster with 3 replicas. If a pod restarts (e.g. triggered by a rolling update or a manual pod delete), it is unable to rejoin the cluster.

Describe the results you received:
The pod keeps restarting with the following output:

==> Detected data from previous deployments...
rm: cannot remove '/bitnami/etcd/member_removal.log': No such file or directory
==> Updating member in existing cluster...
2019-11-21 10:13:52.649453 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=new
...
2019-11-21 10:13:52.675047 E | etcdserver: the member has been permanently removed from the cluster
2019-11-21 10:13:52.675059 I | etcdserver: the data-dir used by this member must be removed.
2019-11-21 10:13:52.675092 E | etcdserver: publish error: etcdserver: request cancelled
2019-11-21 10:13:52.675106 I | etcdserver: aborting publish because server is stopped

Unexpected ETCD_INITIAL_CLUSTER_STATE=new, as the member should be joining an existing cluster.
Since the pod keeps getting restarted, I may have missed the log of the very first start, so the output above might only be from restart > 1.
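For comparison, a member re-joining an already-bootstrapped cluster would be expected to come up with something along these lines (hostnames taken from this deployment; the exact variables the chart sets may differ):

ETCD_INITIAL_CLUSTER_STATE=existing
ETCD_INITIAL_CLUSTER=etcd-0=http://etcd-0.etcd-headless.etcd.svc.cluster.local:2380,etcd-1=http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380,etcd-2=http://etcd-2.etcd-headless.etcd.svc.cluster.local:2380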

Describe the results you expected:
The cluster still has 2 healthy members, which satisfies the quorum of 2 for a 3-member cluster, so it should be able to recover on its own. I expect a rolling update to work, and the cluster to recover from a single member failure.
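For reference, the manual recovery path etcd documents for a member that has been permanently removed is roughly the following; the chart's startup and preStop scripts are supposed to automate these steps (member ID and names below are illustrative):

# From a healthy member: drop the old identity and register a new one
etcdctl member remove <old-member-id>
etcdctl member add etcd-1 \
  --peer-urls=http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380

# Then start the replacement with an empty data dir and
# ETCD_INITIAL_CLUSTER_STATE=existing (not "new").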

@juan131 (Contributor) commented Nov 22, 2019

Hi @sboschman

The problem is related to the regex that checks whether the member was successfully removed from the cluster or not. I just created a PR to address the issue. Thanks for reporting it.
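For context, the startup logic involved here does roughly the following; this is a simplified sketch of the flow, not the exact script shipped in the chart's configmap:

# Simplified sketch (paraphrased). member_removal.log is written when the
# preStop hook removes this member from the cluster on shutdown.
if [[ -d "${ETCD_DATA_DIR}/member" ]]; then
  echo "==> Detected data from previous deployments..."
  if grep -q "removed from cluster" /bitnami/etcd/member_removal.log 2>/dev/null; then
    # The member was removed on shutdown: wipe the stale data dir and
    # re-join the cluster as a brand new member.
    echo "==> Adding new member to existing cluster..."
    rm -rf "${ETCD_DATA_DIR:?}"/*
    etcdctl member add "${HOSTNAME}" \
      --peer-urls="http://${HOSTNAME}.etcd-headless.etcd.svc.cluster.local:2380"
  else
    # The member is still registered: keep its data and just update it.
    echo "==> Updating member in existing cluster..."
  fi
  rm -f /bitnami/etcd/member_removal.log
fi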

@sboschman (Contributor, Author)

An etcd pod restart is still unable to recover with the fix from #1650.

Cluster is up and running:

128da9111efa73eb, started, etcd-0, http://etcd-0.etcd-headless.etcd.svc.cluster.local:2380, http://etcd-0.etcd-headless.etcd.svc.cluster.local:2379, false
3ea5934f802e28a0, started, etcd-1, http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380, http://etcd-1.etcd-headless.etcd.svc.cluster.local:2379, false
8cbe679ed0a3c0ea, started, etcd-2, http://etcd-2.etcd-headless.etcd.svc.cluster.local:2380, http://etcd-2.etcd-headless.etcd.svc.cluster.local:2379, false
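(The member listings in this comment are the output of a command along these lines; endpoint flags trimmed in the listings above and below:)

etcdctl --endpoints=http://etcd-0.etcd-headless.etcd.svc.cluster.local:2379 member list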

Let's delete pod etcd-1 and see if it recovers:

==> Bash debug is on
==> Detected data from previous deployments...
http://etcd-0.etcd-headless.etcd.svc.cluster.local:2379 is healthy: successfully committed proposal: took = 19.947775ms
http://etcd-2.etcd-headless.etcd.svc.cluster.local:2379 is healthy: successfully committed proposal: took = 25.196435ms
Member 3ea5934f802e28a0 removed from cluster 62690dd5c1e46bad
==> Adding new member to existing cluster...
{"level":"warn","ts":"2019-11-25T16:08:07.919Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-2fedc309-f821-4316-af38-4357825b99b9/etcd-0.etcd-headless.etcd.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: <nil>

Crashloop restart:

==> Bash debug is on
==> Detected data from previous deployments...
http://etcd-0.etcd-headless.etcd.svc.cluster.local:2379 is healthy: successfully committed proposal: took = 4.35945ms
http://etcd-2.etcd-headless.etcd.svc.cluster.local:2379 is healthy: successfully committed proposal: took = 5.196322ms
grep: /bitnami/etcd/member_removal.log: No such file or directory
==> Adding new member to existing cluster...
{"level":"warn","ts":"2019-11-25T16:14:04.642Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-ace9cb84-e763-40a1-9c2d-d016ec08feb8/etcd-0.etcd-headless.etcd.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: Peer URLs already exists"}
Error: etcdserver: Peer URLs already exists

Cluster status (the member add from the first restart did go through on the server side, leaving an unstarted etcd-1 member behind, which is why the second add above fails with "Peer URLs already exists"):

128da9111efa73eb, started, etcd-0, http://etcd-0.etcd-headless.etcd.svc.cluster.local:2380, http://etcd-0.etcd-headless.etcd.svc.cluster.local:2379, false
19c1063d83341120, unstarted, , http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380, , true
8cbe679ed0a3c0ea, started, etcd-2, http://etcd-2.etcd-headless.etcd.svc.cluster.local:2380, http://etcd-2.etcd-headless.etcd.svc.cluster.local:2379, false

Meanwhile in the logs of pod etcd-0:

 etcdserver/membership: added member 19c1063d83341120 [http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380] to cluster 62690dd5c1e46bad
2019-11-25 16:07:37.913023 I | rafthttp: starting peer 19c1063d83341120...
2019-11-25 16:07:37.913048 I | rafthttp: started HTTP pipelining with peer 19c1063d83341120
2019-11-25 16:07:37.915077 I | rafthttp: started streaming with peer 19c1063d83341120 (writer)
2019-11-25 16:07:37.915179 I | rafthttp: started peer 19c1063d83341120
2019-11-25 16:07:37.915193 I | rafthttp: added peer 19c1063d83341120
2019-11-25 16:07:37.915215 I | rafthttp: started streaming with peer 19c1063d83341120 (writer)
2019-11-25 16:07:37.915238 I | rafthttp: started streaming with peer 19c1063d83341120 (stream MsgApp v2 reader)
2019-11-25 16:07:37.915517 I | rafthttp: started streaming with peer 19c1063d83341120 (stream Message reader)
2019-11-25 16:07:41.395274 W | etcdserver: failed to reach the peerURL(http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380) of member 19c1063d83341120 (Get http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380/version: dial tcp 10.42.6.90:2380: i/o timeout)
2019-11-25 16:07:41.395299 W | etcdserver: cannot get the version of member 19c1063d83341120 (Get http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380/version: dial tcp 10.42.6.90:2380: i/o timeout)
2019-11-25 16:07:42.915295 W | rafthttp: health check for peer 19c1063d83341120 could not connect: <nil>
2019-11-25 16:07:42.915660 W | rafthttp: health check for peer 19c1063d83341120 could not connect: <nil>
2019-11-25 16:07:45.399837 W | etcdserver: failed to reach the peerURL(http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380) of member 19c1063d83341120 (Get http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380/version: dial tcp 10.42.6.91:2380: connect: connection refused)
2019-11-25 16:07:45.399857 W | etcdserver: cannot get the version of member 19c1063d83341120 (Get http://etcd-1.etcd-headless.etcd.svc.cluster.local:2380/version: dial tcp 10.42.6.91:2380: connect: connection refused)
2019-11-25 16:07:47.916164 W | rafthttp: health check for peer 19c1063d83341120 could not connect: dial tcp 10.42.6.90:2380: connect: no route to host
2019-11-25 16:07:47.916234 W | rafthttp: health check for peer 19c1063d83341120 could not connect: dial tcp 10.42.6.90:2380: i/o timeout
... keeps trying to connect to etcd-1, which is still offline

This behaviour looks a lot like the issue in etcd-io/etcd#11186. The latest chart deploys docker.io/bitnami/etcd:3.4.3-debian-9-r12, and the upstream fix is supposed to be in 3.4.2 (https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.4.md#v342-2019-10-11), so it should already be included in this image.
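For anyone stuck in the same state, a possible manual way out (not verified here; the member ID comes from the listing above, and the PVC name assumes the statefulset's default data-<pod> naming) is to drop the half-added member and clear the pod's persisted data so the startup script can register it again cleanly:

# From a healthy member, remove the stuck unstarted member
etcdctl --endpoints=http://etcd-0.etcd-headless.etcd.svc.cluster.local:2379 \
  member remove 19c1063d83341120

# Clear the pod's persisted data and restart it
# (PVC name assumed; adjust to the actual claim created by the statefulset)
kubectl delete pvc data-etcd-1
kubectl delete pod etcd-1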

@juan131 (Contributor) commented Nov 26, 2019

Hi @sboschman

Let's merge this issue into the one below, since other users are experiencing similar issues and we have an associated task for that issue:
