Feature request [Availability improvements] - stop the world only for corrupted members, not for all members #18613

chaochn47 · 2024-09-20T18:30:57Z

chaochn47
Sep 20, 2024
Collaborator

I learnt from #14828 that we are able to identify corrupted members, which is great.

However, the desired behavior of alarm activation is that only the corrupted members fail when serving KV APIs.

Instead, the availability of the whole etcd cluster is impacted today.

The existing alarm activation behavior makes it difficult to adopt the corruption checker feature in production and to convince users to enable it.

See the code references below: once the alarm activation request is agreed upon through raft, it is materialized on each member in the apply stage.

etcd/server/etcdserver/apply/uber_applier.go

Lines 218 to 226 in 6ea81c1

    
    func (a *uberApplier) Alarm(ar *pb.AlarmRequest) (*pb.AlarmResponse, error) {
    	resp, err := a.applyV3.Alarm(ar)
    	if ar.Action == pb.AlarmRequest_ACTIVATE ||
    		ar.Action == pb.AlarmRequest_DEACTIVATE {
    		a.restoreAlarms()
    	}
    	return resp, err
    }

etcd/server/etcdserver/apply/uber_applier.go

Lines 99 to 109 in 6ea81c1

    
    func (a *uberApplier) restoreAlarms() {
    	noSpaceAlarms := len(a.alarmStore.Get(pb.AlarmType_NOSPACE)) > 0
    	corruptAlarms := len(a.alarmStore.Get(pb.AlarmType_CORRUPT)) > 0
    	a.applyV3 = a.applyV3base
    	if noSpaceAlarms {
    		a.applyV3 = newApplierV3Capped(a.applyV3)
    	}
    	if corruptAlarms {
    		a.applyV3 = newApplierV3Corrupt(a.applyV3)
    	}
    }

@ahrtr @serathius @jmhbnz

ahrtr · 2024-09-20T19:39:40Z

ahrtr
Sep 20, 2024
Maintainer

The existing design ensures that once corruption is detected, no requests are served until the corruption is resolved. I agree that it isn't perfect, but it's a prudent way to avoid possibly worsening the situation. Corruption is not something that happens often in production; it should be rare.

FYI, we are working on etcd-operator, and one of its goals is to resolve such situations automatically. https://github.com/etcd-io/etcd-operator/blob/main/docs/roadmap.md
