DNS storm with round robin load balancing in grpc-js #2023
Comments
I think I see what is happening here: the clients are failing to connect to some of the addresses returned by the DNS. Those connection failures trigger DNS re-resolution attempts, which do not back off in this situation. The lack of a backoff here is a bug that I will fix. The connection failures would also explain the uneven request distribution. You can get logs with more information about what is happening here by setting the gRPC logging environment variables.
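For reference, the standard grpc-js logging switches are the GRPC_VERBOSITY and GRPC_TRACE environment variables; a minimal sketch of enabling them, assuming they are set before @grpc/grpc-js is loaded (the entry point name is hypothetical):

```typescript
// Verbose gRPC tracing is controlled by environment variables, normally set in
// the shell before starting the process:
//   GRPC_VERBOSITY=DEBUG GRPC_TRACE=all node client.js
// To set them from code instead, do it before @grpc/grpc-js is loaded, e.g.
// with a dynamic import (a static `import` would be hoisted above these lines).
process.env.GRPC_VERBOSITY = 'DEBUG';
process.env.GRPC_TRACE = 'all'; // 'all' is noisy; a comma-separated subset of tracer names also works

const grpc = await import('@grpc/grpc-js');
// ...construct clients as usual; trace output goes to stderr.
```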
I don't know anything about your network setup, so I have no information about why you would have connection failures. It's not clear to me what that graph shows, but it's hard to evaluate the behavior of the round robin algorithm here because the connection failures will cause unevenness: each client will send all of its traffic only to the servers it is connected to.
I have published grpc-js 1.5.1 with some throttling on DNS requests. Can you try that out and see what impact it has?
Thanks for the quick response! I will have some cycles early next week to test.
I managed to do a little data collection today with 1.5.1. I can't easily share any raw log data, but what I am seeing is a ton of logs from subchannels entering the transient failure state, which makes sense given the extra DNS lookups happening. Something else I'm seeing a lot of logs for is one particular pick result line.
Is this just broken logging or a smoking gun? Edit: ah, I think that might be a red herring; those are for clients which aren't configured to use the round robin picker.
The first log indicates that a connection was dropped. In most cases, I would expect to see a nearby log with some error details, but it's not guaranteed. The second log is more of an internal debugging log; it just indicates that there is a pending request that has not started yet because a connection has not been established. It's not broken, just kind of lazy. Also, have you checked the DNS requests graph after upgrading to version 1.5.1? It should have a lower peak.
Yeah, can confirm lower peak. So thank you for that fix! I'm getting a little more input from our compute team, and it sounds like in our case part of the issue might be the upstream services. Apparently there is a... pattern? ...of servers sending a GOAWAY to prevent sticky connections to pods which are no longer part of an active deployment. What happens to a subchannel in that case? I'm guessing it gets recreated and the host needs to be re-resolved.
I don't think the GOAWAYs are the issue; those should show up explicitly in the logs. If you search your logs for a single subchannel ID ("43986" in that first log line, for example), you can see the full lifecycle of a subchannel. Looking at a few random subchannels that way should help narrow down why the connections are failing.
So I filtered my logs for a few subchannels; in some cases I seem to be missing the subchannel initialization, but they all seem to behave similarly. I'm a little confused why I'm seeing different IPs for the same subchannel; maybe it's a side effect of the service using cluster mode?
The other notable observation I made was that if I filter for a single IP address, I'm seeing a lot of hits (grpc-js IP logs: grpc-logs.csv). Edit: another finding, I can see a very similar number of events if I filter for the following:
Second edit: I'm realizing one thing that is complicating this investigation is that I'm getting debug logs for clients which are still using pick-first. Currently we've only enabled RRLB for a single client integration, so I think my earlier findings are probably a little flawed. 🤦
I had a chance to run another round of testing and have a clearer picture of what is happening. Here are the trace logs for a single subchannel of a client configured to use round robin LB:
So basically every 2ish seconds this subchannel gets a GOAWAY.
So it looks like the clients are basically running DNS resolutions non-stop. I said originally that the connection backoff patch helped reduce the peaks, but that doesn't seem to be the case. Even a small number of clients which aren't handling traffic are generating a huge number of DNS lookups. For the particular deployment I was testing with, we have 2 pods, each running 4 worker processes, and 128 upstream hosts. According to our internal DNS metrics this was generating ~1.1k DNS lookups/sec. Does a subchannel state transition like that trigger a new DNS resolution? Edit: OK, yeah, I found the line; the RRLB state transition listener requests a re-resolution there.
This confirms what you originally said: subchannels keep getting disconnected, and every disconnect triggers another round of DNS resolution.
I am sorry for the delayed response. This issue dropped off my radar for a little while. Yes, the idea is that when the backend sends a GOAWAY, that is a signal that the set of available backends may have changed, and the client should do name resolution again to check that. One thing I notice is that the last log looks like multiple separate channel objects. Each channel has its own separate DNS resolver object, which will do separate DNS requests. There is a gRPC option that you may be able to use to mitigate this.
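To illustrate the channel/resolver relationship (this is not necessarily the specific option being referred to), a minimal sketch of sharing one channel, and therefore one DNS resolver, across stubs in grpc-js, assuming the channelOverride client option; the client classes and target are hypothetical:

```typescript
import * as grpc from '@grpc/grpc-js';
// Hypothetical generated clients; substitute your own stubs.
import { OrdersClient } from './generated/orders_grpc_pb';
import { UsersClient } from './generated/users_grpc_pb';

const TARGET = 'my-service.internal:50051'; // hypothetical DNS target
const credentials = grpc.credentials.createInsecure();

// Each Channel owns its own DNS resolver, so one Channel per target per
// process means one stream of DNS lookups for that target.
const sharedChannel = new grpc.Channel(TARGET, credentials, {});

// Multiple stubs can reuse that channel via the channelOverride client option
// instead of each constructing (and resolving) a channel of their own.
const orders = new OrdersClient(TARGET, credentials, { channelOverride: sharedChannel });
const users = new UsersClient(TARGET, credentials, { channelOverride: sharedChannel });
```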
I just published grpc-js version 1.6.1 with support for the option mentioned above.
@murgatroid99 does this need to be enabled in any way? I'm not able to see the re-resolution throttling take effect with default config values. I'm using the round robin load balancing config option.
It should just work by default. What exactly are you seeing that indicates that it is not working? Note that this setting only limits the frequency of successful DNS requests. Failing DNS requests are limited by a separate exponential backoff timer.
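For intuition about the failure path, a small sketch of the exponential backoff schedule, assuming the standard gRPC connection-backoff parameters (1s initial delay, 1.6x multiplier, +/-20% jitter, 120s cap); the real timer lives inside grpc-js, so this is only illustrative:

```typescript
// Standard gRPC connection-backoff parameters (assumed here; grpc-js applies
// the same style of backoff to failed DNS resolution attempts).
const INITIAL_DELAY_MS = 1_000;
const MULTIPLIER = 1.6;
const JITTER = 0.2;
const MAX_DELAY_MS = 120_000;

function backoffDelays(attempts: number): number[] {
  const delays: number[] = [];
  let base = INITIAL_DELAY_MS;
  for (let i = 0; i < attempts; i++) {
    // Apply random jitter of +/- 20% around the current base delay.
    const jittered = base * (1 - JITTER + Math.random() * 2 * JITTER);
    delays.push(Math.round(jittered));
    base = Math.min(base * MULTIPLIER, MAX_DELAY_MS);
  }
  return delays;
}

// Example: the first 6 consecutive failures wait roughly
// 1s, 1.6s, 2.6s, 4.1s, 6.6s, 10.5s (before jitter).
console.log(backoffDelays(6));
```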
Problem description
We are having some problems with the load balancing in grpc-js. We are seeing an uneven distribution of calls to our service pods, which sometimes ends up overloading some of them while keeping others at very low load.
We think this might be because of the default load balancing strategy of "pick first", so we tried enabling round robin, but this caused a bunch of issues.
Any ideas how we could address this uneven distribution issue, and what could be wrong with load balancing?
Reproduction steps
Our (singleton) clients get instantiated with the DNS address of the service.
The DNS returns the IP of all the available pods for the given service.
We enable round robin load balancing by providing this configuration to the client:
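The actual snippet is not reproduced above; a minimal sketch of what a round robin service config typically looks like in grpc-js, with a hypothetical generated client and target:

```typescript
import * as grpc from '@grpc/grpc-js';
// Hypothetical generated client; substitute your own stub.
import { MyServiceClient } from './generated/my_service_grpc_pb';

// Service config asking for round_robin instead of the default pick_first.
const serviceConfig = {
  loadBalancingConfig: [{ round_robin: {} }],
};

const client = new MyServiceClient(
  'my-service.internal:50051', // hypothetical DNS name resolving to all pod IPs
  grpc.credentials.createInsecure(),
  { 'grpc.service_config': JSON.stringify(serviceConfig) }
);
```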
There was no other change to the clients besides the lb config.
Environment
Additional context
When we tried to deploy the mentioned config change, this is the behavior we saw:
(the baseline is for ~100 pods, while the spike is for just 4 canary pods where a single client configuration was changed)
CPU:
DNS requests: