
dns + round_robin not updating address list in v1.26 #3353

Closed
travisgroth opened this issue Feb 3, 2020 · 7 comments · Fixed by #3361

@travisgroth

What version of gRPC are you using?

1.26

What version of Go are you using (go version)?

1.13.4

What operating system (Linux, Windows, …) and version?

Linux / Kubernetes / Docker - various.

What did you do?

When using grpc.WithBalancerName(roundrobin.Name) with a dns:/// dial target, errors do not trigger a re-resolve.

In detail: a client starts with dns:///my-service as its dial target and the round_robin load balancer. DNS for my-service returns the IP 192.168.0.19. Due to an operational change, my-service is moved and DNS is updated to return 192.168.0.20. The service stops running on 192.168.0.19 and refuses connections.

This is approximately the workflow that would occur when using Kubernetes and a headless service if a pod is replaced. This happens for a variety of reasons, including routine updates and rescheduling.
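
For context, a minimal sketch of the client setup described above (the target name and the insecure credentials are illustrative placeholders, not taken from the report):

package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/balancer/roundrobin"
)

func dial() (*grpc.ClientConn, error) {
	// dns:/// target plus the round_robin balancer, as configured on v1.26.
	return grpc.Dial(
		"dns:///my-service",
		grpc.WithInsecure(), // placeholder; the real deployment uses TLS on :443
		grpc.WithBalancerName(roundrobin.Name),
	)
}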

What did you expect to see?

(a) Upon receiving a connection refused error, DNS should be re-queried for an updated endpoint list
(b) Even without receiving an error from an existing endpoint, the load balancer should occasionally be repopulated with the latest DNS records as the endpoint set changes

This was the behavior prior to 1.26. By rolling back to grpc 1.25.1, we were able to restore it.
#3165 is likely the culprit.

What did you see instead?

(a) The endpoint list is never re-queried and the client throws a persistent error:

rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.1.5.253:443: connect: connection refused"

(b) Since every subsequent request fails, the endpoint list does not appear to get updated with new DNS records at all.

Additional

I recognize that the APIs around DNS and load balancing may have changed recently (and are still evolving), but I'm not seeing any clear indication of how to retain the expected behavior.

@dfawley
Member

dfawley commented Feb 3, 2020

If you can reliably reproduce this, can you enable logging and include the output here (or in a pastebin)?

https://github.com/grpc/grpc-go#how-to-turn-on-logging
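
For completeness, the env-var approach from that doc is GRPC_GO_LOG_SEVERITY_LEVEL=info plus GRPC_GO_LOG_VERBOSITY_LEVEL=99; a minimal programmatic equivalent (a sketch, not something required for the repro) would be:

package main

import (
	"os"

	"google.golang.org/grpc/grpclog"
)

func init() {
	// Route gRPC's internal logs to stdout/stderr at maximum verbosity.
	grpclog.SetLoggerV2(grpclog.NewLoggerV2WithVerbosity(os.Stdout, os.Stdout, os.Stderr, 99))
}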

@travisgroth
Author

Yes, very easy to repro:

INFO: 2020/02/04 01:05:10 parsed scheme: "dns"
INFO: 2020/02/04 01:05:10 ccResolverWrapper: sending update to cc: {[{10.1.6.16:443  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/02/04 01:05:10 base.baseBalancer: got new ClientConn state:  {{[{10.1.6.16:443  <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/02/04 01:05:10 base.baseBalancer: handle SubConn state change: 0xc00007f8b0, CONNECTING
INFO: 2020/02/04 01:05:10 base.baseBalancer: handle SubConn state change: 0xc00007f8b0, READY
INFO: 2020/02/04 01:05:10 roundrobinPicker: newPicker called with info: {map[0xc00007f8b0:{{10.1.6.16:443  <nil> 0 <nil>}}]}
INFO: 2020/02/04 01:06:40 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2020/02/04 01:06:40 base.baseBalancer: handle SubConn state change: 0xc00007f8b0, CONNECTING
INFO: 2020/02/04 01:06:40 roundrobinPicker: newPicker called with info: {map[]}
WARNING: 2020/02/04 01:06:40 grpc: addrConn.createTransport failed to connect to {10.1.6.16:443  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.6.16:443: connect: connection refused". Reconnecting...
INFO: 2020/02/04 01:06:40 base.baseBalancer: handle SubConn state change: 0xc00007f8b0, TRANSIENT_FAILURE
WARNING: 2020/02/04 01:06:40 grpc: failed dns A record lookup due to lookup pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 01:06:40 ccResolverWrapper: sending update to cc: {[] <nil> <nil>}
INFO: 2020/02/04 01:06:40 base.baseBalancer: got new ClientConn state:  {{[] <nil> <nil>} <nil>}
INFO: 2020/02/04 01:06:40 base.baseBalancer: handle SubConn state change: 0xc00007f8b0, SHUTDOWN

It looks like I was missing something in the chain of events. I was testing a small-footprint deployment, and for a brief period there may be zero endpoints, hence the resolution error. This seems to be a state it won't recover from. There were no further log messages, even upon subsequent retries.

If I start with one replica of my service, add a second one, then remove the first:

INFO: 2020/02/04 01:41:57 parsed scheme: "dns"
INFO: 2020/02/04 01:41:57 ccResolverWrapper: sending update to cc: {[{10.1.6.21:443  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/02/04 01:41:57 base.baseBalancer: got new ClientConn state:  {{[{10.1.6.21:443  <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/02/04 01:41:57 base.baseBalancer: handle SubConn state change: 0xc000162390, CONNECTING
INFO: 2020/02/04 01:41:57 base.baseBalancer: handle SubConn state change: 0xc000162390, READY
INFO: 2020/02/04 01:41:57 roundrobinPicker: newPicker called with info: {map[0xc000162390:{{10.1.6.21:443  <nil> 0 <nil>}}]}
INFO: 2020/02/04 01:52:35 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2020/02/04 01:52:35 base.baseBalancer: handle SubConn state change: 0xc000162390, CONNECTING
INFO: 2020/02/04 01:52:35 roundrobinPicker: newPicker called with info: {map[]}
WARNING: 2020/02/04 01:52:35 grpc: addrConn.createTransport failed to connect to {10.1.6.21:443  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.6.21:443: connect: connection refused". Reconnecting...
INFO: 2020/02/04 01:52:35 base.baseBalancer: handle SubConn state change: 0xc000162390, TRANSIENT_FAILURE
INFO: 2020/02/04 01:52:35 ccResolverWrapper: sending update to cc: {[{10.1.6.28:443  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/02/04 01:52:35 base.baseBalancer: got new ClientConn state:  {{[{10.1.6.28:443  <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/02/04 01:52:35 base.baseBalancer: handle SubConn state change: 0xc00049f290, CONNECTING
INFO: 2020/02/04 01:52:35 roundrobinPicker: newPicker called with info: {map[]}
INFO: 2020/02/04 01:52:35 base.baseBalancer: handle SubConn state change: 0xc000162390, SHUTDOWN
INFO: 2020/02/04 01:52:35 base.baseBalancer: handle SubConn state change: 0xc00049f290, READY
INFO: 2020/02/04 01:52:35 roundrobinPicker: newPicker called with info: {map[0xc00049f290:{{10.1.6.28:443  <nil> 0 <nil>}}]}

In this case we've missed the addition of the second endpoint until the first one fails. The good news is that there is no user-facing error, but we're still not connecting to all the available endpoints as they change.

Here's the 1.25.1 behavior:
(a) I start with 1 service replica and simply restart it. We get some transient failures, but the new endpoint is added eventually:

INFO: 2020/02/04 02:39:04 parsed scheme: "dns"
INFO: 2020/02/04 02:39:04 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 02:39:04 ccResolverWrapper: got new service config:
WARNING: 2020/02/04 02:39:04 ccResolverWrapper: error parsing service config: no JSON service config provided
INFO: 2020/02/04 02:39:04 ccResolverWrapper: sending new addresses to cc: [{10.1.6.31:443 0  <nil>}]
INFO: 2020/02/04 02:39:04 base.baseBalancer: got new ClientConn state:  {{[{10.1.6.31:443 0  <nil>}] <nil>} <nil>}
INFO: 2020/02/04 02:39:04 base.baseBalancer: handle SubConn state change: 0xc00017d180, CONNECTING
INFO: 2020/02/04 02:39:04 base.baseBalancer: handle SubConn state change: 0xc00017d180, READY
INFO: 2020/02/04 02:39:04 roundrobinPicker: newPicker called with readySCs: map[{10.1.6.31:443 0  <nil>}:0xc00017d180]
INFO: 2020/02/04 02:39:34 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 02:39:34 ccResolverWrapper: got new service config:
WARNING: 2020/02/04 02:39:34 ccResolverWrapper: error parsing service config: no JSON service config provided
INFO: 2020/02/04 02:39:34 ccResolverWrapper: sending new addresses to cc: [{10.1.6.31:443 0  <nil>}]
INFO: 2020/02/04 02:39:34 base.baseBalancer: got new ClientConn state:  {{[{10.1.6.31:443 0  <nil>}] <nil>} <nil>}
INFO: 2020/02/04 02:40:04 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 02:40:04 ccResolverWrapper: got new service config:
WARNING: 2020/02/04 02:40:04 ccResolverWrapper: error parsing service config: no JSON service config provided
INFO: 2020/02/04 02:40:04 ccResolverWrapper: sending new addresses to cc: [{10.1.6.31:443 0  <nil>}]
INFO: 2020/02/04 02:40:04 base.baseBalancer: got new ClientConn state:  {{[{10.1.6.31:443 0  <nil>}] <nil>} <nil>}
INFO: 2020/02/04 02:40:34 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 02:40:34 ccResolverWrapper: got new service config:
WARNING: 2020/02/04 02:40:34 ccResolverWrapper: error parsing service config: no JSON service config provided
INFO: 2020/02/04 02:40:34 ccResolverWrapper: sending new addresses to cc: [{10.1.6.31:443 0  <nil>}]
INFO: 2020/02/04 02:40:34 base.baseBalancer: got new ClientConn state:  {{[{10.1.6.31:443 0  <nil>}] <nil>} <nil>}
INFO: 2020/02/04 02:41:04 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 02:41:04 ccResolverWrapper: got new service config:
WARNING: 2020/02/04 02:41:04 ccResolverWrapper: error parsing service config: no JSON service config provided
INFO: 2020/02/04 02:41:04 ccResolverWrapper: sending new addresses to cc: [{10.1.6.31:443 0  <nil>}]
INFO: 2020/02/04 02:41:04 base.baseBalancer: got new ClientConn state:  {{[{10.1.6.31:443 0  <nil>}] <nil>} <nil>}
INFO: 2020/02/04 02:41:08 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2020/02/04 02:41:08 base.baseBalancer: handle SubConn state change: 0xc00017d180, CONNECTING
INFO: 2020/02/04 02:41:08 roundrobinPicker: newPicker called with readySCs: map[]
WARNING: 2020/02/04 02:41:08 grpc: addrConn.createTransport failed to connect to {10.1.6.31:443 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.6.31:443: connect: connection refused". Reconnecting...
INFO: 2020/02/04 02:41:08 base.baseBalancer: handle SubConn state change: 0xc00017d180, TRANSIENT_FAILURE
INFO: 2020/02/04 02:41:10 base.baseBalancer: handle SubConn state change: 0xc00017d180, CONNECTING
INFO: 2020/02/04 02:41:10 roundrobinPicker: newPicker called with readySCs: map[]
WARNING: 2020/02/04 02:41:10 grpc: addrConn.createTransport failed to connect to {10.1.6.31:443 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.6.31:443: connect: connection refused". Reconnecting...
INFO: 2020/02/04 02:41:10 base.baseBalancer: handle SubConn state change: 0xc00017d180, TRANSIENT_FAILURE
INFO: 2020/02/04 02:41:11 base.baseBalancer: handle SubConn state change: 0xc00017d180, CONNECTING
INFO: 2020/02/04 02:41:11 roundrobinPicker: newPicker called with readySCs: map[]
WARNING: 2020/02/04 02:41:31 grpc: addrConn.createTransport failed to connect to {10.1.6.31:443 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.6.31:443: i/o timeout". Reconnecting...
INFO: 2020/02/04 02:41:31 base.baseBalancer: handle SubConn state change: 0xc00017d180, TRANSIENT_FAILURE
INFO: 2020/02/04 02:41:34 base.baseBalancer: handle SubConn state change: 0xc00017d180, CONNECTING
INFO: 2020/02/04 02:41:34 roundrobinPicker: newPicker called with readySCs: map[]
INFO: 2020/02/04 02:41:34 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 02:41:35 ccResolverWrapper: got new service config:
WARNING: 2020/02/04 02:41:35 ccResolverWrapper: error parsing service config: no JSON service config provided
INFO: 2020/02/04 02:41:35 ccResolverWrapper: sending new addresses to cc: [{10.1.6.44:443 0  <nil>}]
INFO: 2020/02/04 02:41:35 base.baseBalancer: got new ClientConn state:  {{[{10.1.6.44:443 0  <nil>}] <nil>} <nil>}
WARNING: 2020/02/04 02:41:35 grpc: addrConn.createTransport failed to connect to {10.1.6.31:443 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.6.31:443: operation was canceled". Reconnecting...
INFO: 2020/02/04 02:41:35 base.baseBalancer: handle SubConn state change: 0xc00019eaa0, CONNECTING
INFO: 2020/02/04 02:41:35 base.baseBalancer: handle SubConn state change: 0xc00017d180, SHUTDOWN
INFO: 2020/02/04 02:41:35 base.baseBalancer: handle SubConn state change: 0xc00019eaa0, READY
INFO: 2020/02/04 02:41:35 roundrobinPicker: newPicker called with readySCs: map[{10.1.6.44:443 0  <nil>}:0xc00019eaa0]

(b) I start with 1 service replica, then add a second. It shows up ~30 seconds later so that I have two endpoints available:

INFO: 2020/02/04 02:34:59 parsed scheme: "dns"
INFO: 2020/02/04 02:34:59 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 02:34:59 ccResolverWrapper: got new service config:
WARNING: 2020/02/04 02:34:59 ccResolverWrapper: error parsing service config: no JSON service config provided
INFO: 2020/02/04 02:34:59 ccResolverWrapper: sending new addresses to cc: [{10.1.6.31:443 0  <nil>}]
INFO: 2020/02/04 02:34:59 base.baseBalancer: got new ClientConn state:  {{[{10.1.6.31:443 0  <nil>}] <nil>} <nil>}
INFO: 2020/02/04 02:34:59 base.baseBalancer: handle SubConn state change: 0xc0005579c0, CONNECTING
INFO: 2020/02/04 02:34:59 base.baseBalancer: handle SubConn state change: 0xc0005579c0, READY
INFO: 2020/02/04 02:34:59 roundrobinPicker: newPicker called with readySCs: map[{10.1.6.31:443 0  <nil>}:0xc0005579c0]
INFO: 2020/02/04 02:35:29 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 02:35:29 ccResolverWrapper: got new service config:
WARNING: 2020/02/04 02:35:29 ccResolverWrapper: error parsing service config: no JSON service config provided
INFO: 2020/02/04 02:35:29 ccResolverWrapper: sending new addresses to cc: [{10.1.6.31:443 0  <nil>}]
INFO: 2020/02/04 02:35:29 base.baseBalancer: got new ClientConn state:  {{[{10.1.6.31:443 0  <nil>}] <nil>} <nil>}
INFO: 2020/02/04 02:35:59 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 02:35:59 ccResolverWrapper: got new service config:
WARNING: 2020/02/04 02:35:59 ccResolverWrapper: error parsing service config: no JSON service config provided
INFO: 2020/02/04 02:35:59 ccResolverWrapper: sending new addresses to cc: [{10.1.6.41:443 0  <nil>} {10.1.6.31:443 0  <nil>}]
INFO: 2020/02/04 02:35:59 base.baseBalancer: got new ClientConn state:  {{[{10.1.6.41:443 0  <nil>} {10.1.6.31:443 0  <nil>}] <nil>} <nil>}
INFO: 2020/02/04 02:35:59 base.baseBalancer: handle SubConn state change: 0xc000556b90, CONNECTING
INFO: 2020/02/04 02:35:59 base.baseBalancer: handle SubConn state change: 0xc000556b90, READY
INFO: 2020/02/04 02:35:59 roundrobinPicker: newPicker called with readySCs: map[{10.1.6.31:443 0  <nil>}:0xc0005579c0 {10.1.6.41:443 0  <nil>}:0xc000556b90]

@dfawley
Member

dfawley commented Feb 4, 2020

It looks like I was missing something in the chain of events. I was testing a small-footprint deployment, and for a brief period there may be zero endpoints, hence the resolution error. This seems to be a state it won't recover from.

Interesting. I think this is an important step for reproducing; I'll take a look at this.

In this case we've missed the addition of the second endpoint until the first one fails. The good news is that there is no user-facing error, but we're still not connecting to all the available endpoints as they change.

This sounds like it's working as intended. The DNS resolver does not regularly poll as of 1.26; it only refreshes when connection errors occur. This is a behavior change from 1.25 (see #3228), which is why the new backend appears there without any error occurring. You can set MaxConnectionAge on the server to cause connections to be gracefully stopped periodically, which will trigger a re-resolve on the client.
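
To illustrate the MaxConnectionAge suggestion, here is a minimal server-side sketch (the durations are arbitrary example values, not recommendations from this thread):

package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func newServer() *grpc.Server {
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		// Gracefully close each client connection after ~5 minutes; the client
		// then reconnects and re-resolves DNS, picking up any endpoint changes.
		MaxConnectionAge:      5 * time.Minute,
		MaxConnectionAgeGrace: 30 * time.Second,
	}))
}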

@dfawley
Member

dfawley commented Feb 4, 2020

@travisgroth if you don't mind, could you run your zero-endpoint scenario using #3361 and confirm that it fixes the problem, i.e. that the ClientConn now connects to the new endpoint when it is added? Thanks!

@travisgroth
Author

(a) Great! It seems to be resolved with that branch. We get a failed SubConn, the balancer continues through the empty address list, and eventually a new endpoint arrives and reaches a ready state:

INFO: 2020/02/05 00:19:40 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2020/02/05 00:19:40 base.baseBalancer: handle SubConn state change: 0xc0005b86f0, CONNECTING
INFO: 2020/02/05 00:19:40 roundrobinPicker: newPicker called with info: {map[]}
WARNING: 2020/02/05 00:19:40 grpc: addrConn.createTransport failed to connect to {10.1.6.71:443  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.6.71:443: connect: connection refused". Reconnecting...
INFO: 2020/02/05 00:19:40 base.baseBalancer: handle SubConn state change: 0xc0005b86f0, TRANSIENT_FAILURE
INFO: 2020/02/05 00:19:40 ccResolverWrapper: sending update to cc: {[] <nil> <nil>}
INFO: 2020/02/05 00:19:40 base.baseBalancer: got new ClientConn state:  {{[] <nil> <nil>} <nil>}
INFO: 2020/02/05 00:19:41 base.baseBalancer: handle SubConn state change: 0xc0005b86f0, CONNECTING
INFO: 2020/02/05 00:19:41 roundrobinPicker: newPicker called with info: {map[]}
WARNING: 2020/02/05 00:20:01 grpc: addrConn.createTransport failed to connect to {10.1.6.71:443  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.6.71:443: i/o timeout". Reconnecting...
INFO: 2020/02/05 00:20:01 base.baseBalancer: handle SubConn state change: 0xc0005b86f0, TRANSIENT_FAILURE
INFO: 2020/02/05 00:20:03 base.baseBalancer: handle SubConn state change: 0xc0005b86f0, CONNECTING
INFO: 2020/02/05 00:20:03 roundrobinPicker: newPicker called with info: {map[]}
INFO: 2020/02/05 00:20:10 ccResolverWrapper: sending update to cc: {[{10.1.6.76:443  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/02/05 00:20:10 base.baseBalancer: got new ClientConn state:  {{[{10.1.6.76:443  <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/02/05 00:20:10 base.baseBalancer: handle SubConn state change: 0xc00046e410, CONNECTING
INFO: 2020/02/05 00:20:10 base.baseBalancer: handle SubConn state change: 0xc0005b86f0, SHUTDOWN
WARNING: 2020/02/05 00:20:10 grpc: addrConn.createTransport failed to connect to {10.1.6.71:443  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.6.71:443: operation was canceled". Reconnecting...
INFO: 2020/02/05 00:20:10 base.baseBalancer: handle SubConn state change: 0xc00046e410, READY
INFO: 2020/02/05 00:20:10 roundrobinPicker: newPicker called with info: {map[0xc00046e410:{{10.1.6.76:443  <nil> 0 <nil>}}]}

(b) I feel like there are more corner cases that might trip over the complete removal of the client's regular endpoint polling. Artificially closing connections on a schedule seems like a disruptive way to force a rediscovery interval, and it removes a lot of control over discovery semantics from the client. I wouldn't mind having to explicitly enable the feature, but being able to trigger client-side discovery without a server-initiated close sounds much more resilient. Thoughts?

@dfawley
Member

dfawley commented Feb 5, 2020

@travisgroth thanks for testing! When this is merged I'll make a patch release with it.

Regarding (b), this is a common concern; however, I don't think it's a problem in practice. Please see grpc/grpc#12295 (comment) (and several other comments in that issue) for more discussion on this topic. Note that HTTP/1 recreates connections far more often, so cycling these connections periodically (every 20 minutes or so) shouldn't have a major impact. If you need more responsive discovery, I'd recommend using something other than DNS that doesn't need to be polled at all.

@dfawley dfawley added the P1 label Feb 5, 2020
@dfawley dfawley self-assigned this Feb 5, 2020
@travisgroth
Author

I appreciate the quick response on this!

After reviewing the thread, I see the reasoning. I'm happy to sideline the DNS discussion for the moment and see how the settings work in practice.
