*: configure server keepalive, optimize client balancer with health check #8477

Closed
wants to merge 8 commits from the keepalive-2 branch

Conversation

gyuho
Contributor

@gyuho gyuho commented Aug 31, 2017

  1. Configure server gRPC keepalive parameters
  2. Optimize client endpoint switch by gray-listing transient-failed nodes
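
For item 1 above, a minimal sketch of the standard grpc-go server keepalive knobs (keepalive.ServerParameters and keepalive.EnforcementPolicy); the durations are illustrative, not the defaults chosen by this PR:

package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer builds a gRPC server with keepalive configured.
// The values below are placeholders for illustration only.
func newServer() *grpc.Server {
	return grpc.NewServer(
		// ServerParameters: how long the server waits before pinging an idle
		// client, and how long it waits for the ack before closing the connection.
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    2 * time.Hour,
			Timeout: 20 * time.Second,
		}),
		// EnforcementPolicy: reject clients that ping more often than MinTime.
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             5 * time.Second,
			PermitWithoutStream: false,
		}),
	)
}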

A keepalive time-out surfaces as 'connectivity.TransientFailure' in gRPC, and gRPC keeps retrying the connection (calling 'Balancer.Up') until it succeeds. This is problematic for a multi-endpoint balancer when one endpoint is blackholed: the balancer can get stuck retrying the blackholed endpoint and take several seconds to find a healthy one.
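
For context, the client-side keepalive whose time-out produces this behavior is configured with grpc-go's WithKeepaliveParams dial option; a minimal sketch with illustrative values (not necessarily the ones clientv3 uses):

package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// dialWithKeepalive opens a connection that pings the server when idle.
// If the ping ack does not arrive within Timeout (e.g. the endpoint is
// blackholed), the connection drops to connectivity.TransientFailure and
// gRPC keeps retrying it via the balancer.
func dialWithKeepalive(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithInsecure(),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second,
			Timeout:             3 * time.Second,
			PermitWithoutStream: false,
		}),
	)
}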

Gray-listing endpoints (#8463) also doesn't work in the following case:

# TestKVGetResetLoneEndpoint

CODE / CURRENT PINNED ADDRESS
01. clientv3.New(ep1, ep2) / NONE
02. notifyAddrs(ep1, ep2) / NONE
03. grpc.lbWatcher receives ep1, ep2 from Notify() / NONE
04. grpc.lbWatcher ADD calls resetAddrConn ep1 / NONE
05. grpc.lbWatcher ADD calls resetAddrConn ep2 / NONE

06. grpc calls resetTransport Up(ep1) / NONE
07. clientv3.Balancer Up pins ep1 / ep1

08. grpc calls resetTransport Up(ep2) / ep1
09. clientv3.Balancer Up DOES NOT pin ep2 / ep1

10. updateNotifyLoop sends b.notifyCh <- pinned ep1 / ep1

11. Stop(ep1) / ep1

12. grpc.lbWatcher receives ep1 from Notify() / ep1
13. grpc.lbWatcher DEL calls tearDown(errConnDrain) ep2 / ep1

14. Stop(ep2) / ep1

15. clientv3.Balancer's down func (returned by Up) gets network I/O error on ep1 / NONE
16. Gray-list ep1

# ep1 is gray-listed, so only notify ep2
# but ep2 is also stopped
# this makes balancer stuck with ep2
17. notifyAddrs(ep2) / NONE
18. notifyAddrs(ep2) / NONE
19. grpc.lbWatcher receives ep2 from Notify() / NONE
20. grpc.lbWatcher ADD calls resetAddrConn ep2 / NONE
21. grpc.lbWatcher DEL calls tearDown(errConnDrain) ep1 / NONE
22. grpc.lbWatcher receives ep2 from Notify() / NONE

23. Restart(ep1) / NONE

24. Get(ep1,ep2) timed out / NONE

At step 17 above, we should instead notify both ep1 and ep2. If we exclude the gray-listed ep1, the balancer gets stuck with the stopped endpoint ep2.
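
A condensed, hypothetical repro of this flow, assuming etcd's integration test harness (integration.NewClusterV3 and member Stop/Restart/GRPCAddr); the real TestKVGetResetLoneEndpoint differs in its details:

package clientv3test

import (
	"context"
	"testing"
	"time"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/integration"
)

func TestGetAfterLoneEndpointRestartSketch(t *testing.T) {
	clus := integration.NewClusterV3(t, &integration.ClusterConfig{Size: 2})
	defer clus.Terminate(t)

	// Client with two endpoints: ep1 gets pinned, ep2 never does (steps 01-10).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{clus.Members[0].GRPCAddr(), clus.Members[1].GRPCAddr()},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		t.Fatal(err)
	}
	defer cli.Close()

	// Stop both members (steps 11 and 14), then bring only ep1 back (step 23).
	clus.Members[0].Stop(t)
	clus.Members[1].Stop(t)
	clus.Members[0].Restart(t)

	// With ep1 gray-listed and ep2 dead, this Get times out (step 24).
	ctx, cancel := context.WithTimeout(context.Background(), 7*time.Second)
	defer cancel()
	if _, err = cli.Get(ctx, "foo"); err != nil {
		t.Fatal(err)
	}
}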

The problem is:

  1. ep2 was never pinned
  2. ep2 is also stopped
  3. So gRPC fails to connect to ep2 with error "transport: Error while dialing dial unix ep2: connect: no such file or directory"
  4. There's no way to check if ep2 is down.

This PR adds an additional health-check API call to discover endpoint status when notifying endpoints.
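
The probe is essentially the standard gRPC health checking RPC (grpc.health.v1.Health/Check); a minimal sketch of such a per-endpoint check, with a hypothetical helper name and an illustrative timeout rather than this PR's exact code:

package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// isEndpointHealthy dials one endpoint and issues a single Health.Check RPC,
// so a balancer can skip endpoints that are down before notifying them.
// Transport credentials (or grpc.WithInsecure) must be supplied via opts.
func isEndpointHealthy(ep string, opts ...grpc.DialOption) bool {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, ep, opts...)
	if err != nil {
		return false
	}
	defer conn.Close()

	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	return err == nil && resp.Status == healthpb.HealthCheckResponse_SERVING
}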

@gyuho gyuho added the WIP label Aug 31, 2017
@gyuho gyuho force-pushed the keepalive-2 branch 4 times, most recently from bd621e1 to c55d608 on August 31, 2017 22:46
@heyitsanthony
Contributor

If we exclude gray-listed ep1, balancer gets stuck with stopped endpoint ep2.

Is this with posting ep1 to notifyCh once the gray list deadline times out?

@gyuho
Contributor Author

gyuho commented Sep 1, 2017

If we exclude gray-listed ep1, balancer gets stuck with stopped endpoint ep2.

Is this with posting ep1 to notifyCh once the gray list deadline times out?

That happens when we only post ep2 (ep1 is gray-listed), which is step 17.
To post ep1 again (after its gray-list time-out), ep2 has to be pinned first and then hit a network I/O error, which triggers notifyAddrs in #8463. That whole process takes more than 5 seconds, so the Get request in TestKVGetResetLoneEndpoint times out even before ep2 gets pinned; this is easy to reproduce on a slow-CPU machine.

@gyuho gyuho changed the title [WIP] *: configure server keepalive, optimize client balancer [WIP] *: configure server keepalive, optimize client balancer with health check Sep 1, 2017
@gyuho gyuho force-pushed the keepalive-2 branch 5 times, most recently from 03b521f to 3fff472 on September 6, 2017 17:56
@gyuho gyuho changed the title [WIP] *: configure server keepalive, optimize client balancer with health check *: configure server keepalive, optimize client balancer with health check Sep 6, 2017
@gyuho gyuho removed the WIP label Sep 6, 2017
@gyuho gyuho force-pushed the keepalive-2 branch 2 times, most recently from 9a51d46 to 7a27a45 on September 8, 2017 16:32
gyuho added 8 commits to the keepalive-2 branch, each with: Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
Contributor

@heyitsanthony heyitsanthony left a comment


balancer will have to be a separate patch before keepalive can be merged in; I'm trying to refactor this into something that can cleanly work with partition failover as well. simpleBalancer is getting too complicated to reason about

if len(addrs) == 0 { // no better alternative found
	addrs = b.addrs
} else { // sort so that the latest-failed endpoints come last
	addrConns := make([]addrConn, 0, len(b.failed))
Contributor


Does gRPC make any guarantees about the ordering? My understanding is it can try all the connections at once.

Contributor Author


Right, there's no ordering and gRPC tries them all at once. Still, the goroutines start in order.
Anyway, this wasn't necessary.

@gyuho
Contributor Author

gyuho commented Sep 9, 2017

@heyitsanthony Agree. This is getting too complicated. I will separate out the server options first.

@gyuho gyuho added the WIP label Sep 9, 2017
@fanminshi
Member

There is something called pickFirstBalancer in the grpc-go client. Has anyone taken a deep look at what it does?

// pickFirst is used to test multi-addresses in one addrConn in which all addresses share the same addrConn.
// It is a wrapper around roundRobin balancer. The logic of all methods works fine because balancer.Get()
// returns the only address Up by resetTransport().
type pickFirst struct {
	*roundRobin
}

func pickFirstBalancer(r naming.Resolver) Balancer {
	return &pickFirst{&roundRobin{r: r}}
}

https://github.com/grpc/grpc-go/blob/master/balancer.go#L399-L408

@gyuho
Contributor Author

gyuho commented Sep 13, 2017

@fanminshi pickFirstBalancer assumes the default grpc-go roundrobin balancer, so we can't use it. Maybe in the new balancer implementation (grpc/grpc-go#1506).

@gyuho gyuho closed this Sep 18, 2017
@gyuho
Contributor Author

gyuho commented Sep 20, 2017

Update: closed in favor of #8545.

@gyuho gyuho deleted the keepalive-2 branch October 5, 2017 17:00