Replies: 1 comment
-
The existing design ensures that once corruption is detected, no requests are served until the corruption is resolved. I agree that it isn't perfect, but it's a prudent solution that avoids possibly worsening the situation. Corruption should be rare in production. FYI, we are working on etcd-operator; one of its goals is to resolve such situations automatically. https://github.com/etcd-io/etcd-operator/blob/main/docs/roadmap.md
-
Learnt from #14828 that we are able to identify corrupted members, which is great.
However, the desired behavior of alarm activation is that only the corrupted members fail when serving KV APIs.
Instead, the availability of the whole etcd cluster is impacted now.
The existing alarm activation behavior makes it difficult to adopt the corruption checker feature in production and to convince users.
See the code references below, where the alarm activation request, once agreed upon through raft, is materialized on each member in the apply stage.
etcd/server/etcdserver/apply/uber_applier.go
Lines 218 to 226 in 6ea81c1
etcd/server/etcdserver/apply/uber_applier.go
Lines 99 to 109 in 6ea81c1
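To make the difference concrete, here is a minimal, self-contained Go sketch contrasting the behavior described above with the behavior being requested. It is not the code from `uber_applier.go`; `Alarm`, `Applier`, `ClusterWide`, `PerMember`, and `ErrCorrupt` are all hypothetical names introduced only for illustration, and the sketch assumes each alarm records the ID of the member it was raised for.

```go
package main

import "fmt"

// sketchError is a tiny error type so the sketch needs no extra imports.
type sketchError string

func (e sketchError) Error() string { return string(e) }

// ErrCorrupt is a hypothetical stand-in for the error a member returns
// while it refuses KV requests because of a CORRUPT alarm.
const ErrCorrupt = sketchError("sketch: member refuses KV requests (corrupt alarm)")

// Alarm is a hypothetical record of a CORRUPT alarm; MemberID is the
// member the alarm was raised for.
type Alarm struct {
	MemberID uint64
}

// Applier is a hypothetical stand-in for a member's KV apply path.
type Applier func(key, value string) error

// baseApplier accepts every write.
func baseApplier(key, value string) error { return nil }

// corruptApplier rejects every write.
func corruptApplier(key, value string) error { return ErrCorrupt }

// ClusterWide models the behavior described in the post: because the alarm
// is replicated through raft and applied on every member, any active alarm
// makes the local member reject KV requests, whoever the alarm names.
func ClusterWide(localID uint64, alarms []Alarm) Applier {
	if len(alarms) > 0 {
		return corruptApplier
	}
	return baseApplier
}

// PerMember models the requested behavior: only a member that is itself
// named in an alarm rejects KV requests.
func PerMember(localID uint64, alarms []Alarm) Applier {
	for _, a := range alarms {
		if a.MemberID == localID {
			return corruptApplier
		}
	}
	return baseApplier
}

func main() {
	// Assume member 3 was identified as corrupted.
	alarms := []Alarm{{MemberID: 3}}
	for _, id := range []uint64{1, 2, 3} {
		cw := ClusterWide(id, alarms)("k", "v")
		pm := PerMember(id, alarms)("k", "v")
		fmt.Printf("member %d: cluster-wide err=%v, per-member err=%v\n", id, cw, pm)
	}
}
```

In this sketch, with an alarm raised for member 3, `ClusterWide` rejects writes on members 1, 2, and 3, while `PerMember` rejects them only on member 3.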
@ahrtr @serathius @jmhbnz