Performance degradation in high throughput applications due to DNS resolution bug #2788
I don't think that setting is actually the problem, or that it is behaving incorrectly. Rather, I think it is a symptom of another problem: calls should essentially never wait on DNS resolution like that after DNS resolution succeeds once. The bug was probably introduced in 1.9.0, in #2511. The changes there were necessary, but I'm still working out some of the quirks. I think #2784 might actually help with this. It fixes a bug where the LB policy needs to request name resolution after a disconnect or it could stop connecting entirely. That could also cause this behavior, where it doesn't continue connecting until name resolution happens again, but there's just a delay instead of stopping entirely.

@murgatroid99, thank you for your prompt response.

Yes, that is what I suggest. I expect to be able to get that out today.
It's out now. Please try it out. |
@murgatroid99 thank you for the quick fix. The latest release @grpc/grpc-js 1.10.11 seems to have resolved the issue.

Test results using 1.10.3 to establish a baseline (bug exists):

```
wrk -d 300s -t 16 -c 32 --timeout 60 --latency -s config.lua http://[REDACTED]
Running 5m test @ http://[REDACTED]
  16 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.70s     7.30s   30.34s   83.85%
    Req/Sec     1.22      2.27    10.00    90.07%
  Latency Distribution
     50%    2.11s
     75%    4.21s
     90%   18.96s
     99%   28.38s
  3148 requests in 5.00m, 8.75MB read
Requests/sec:     10.49
Transfer/sec:     29.87KB
```
Test results using 1.10.11 (bug fixed):

I ran 2 tests to make sure and both look OK.

```
wrk -d 300s -t 16 -c 32 --timeout 60 --latency -s config.lua http://[REDACTED]
Running 5m test @ http://[REDACTED]
  16 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.93s   240.49ms   3.17s   77.45%
    Req/Sec     1.35      2.36    10.00    88.60%
  Latency Distribution
     50%    1.90s
     75%    2.03s
     90%    2.23s
     99%    2.64s
  4954 requests in 5.00m, 13.77MB read
Requests/sec:     16.51
Transfer/sec:     47.00KB
```

```
wrk -d 300s -t 16 -c 32 --timeout 60 --latency -s config.lua http://[REDACTED]
Running 5m test @ http://[REDACTED]
  16 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.89s   244.26ms   3.46s   78.44%
    Req/Sec     1.38      2.32    10.00    87.98%
  Latency Distribution
     50%    1.88s
     75%    2.00s
     90%    2.17s
     99%    2.58s
  5046 requests in 5.00m, 14.03MB read
Requests/sec:     16.82
Transfer/sec:     47.87KB
```
Problem Description
We want to report a performance bug in the @grpc/grpc-js library. It severely impacts the performance of our application. Some requests take 30 seconds to execute.
We suspect the bug was introduced with a faulty implementation of grpc.dns_min_time_between_resolutions_ms:
- the initial implementation was introduced with https://github.com/grpc/grpc-node/pull/2076/files in version 1.6.1
- https://github.com/grpc/grpc-node/pull/2571/files introduced this variable to the resolver in version 1.9.3
Please read the Additional Context section for more issues that could be related to this problem.
The bug was introduced somewhere between @grpc/grpc-js tags 1.8.22 and 1.9.3. We tested multiple @grpc/grpc-js versions on our servers to pinpoint the version where it started. The problem is that even the latest tag at the time of testing, 1.10.9, still has this bug.
The only thing that changed during testing was the version of this library; the infrastructure and load test parameters all stayed the same. We do not use any load balancing in the gRPC client. Our client code is initialized as follows:
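(The snippet below is an illustrative sketch rather than our exact code: the proto file, package/service names, and endpoint are placeholders, and credentials are simplified.)

```js
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// Hypothetical proto file, package, and service -- placeholders only.
const packageDefinition = protoLoader.loadSync('example.proto', {
  keepCase: true,
  longs: String,
  enums: String,
  defaults: true,
  oneofs: true,
});
const proto = grpc.loadPackageDefinition(packageDefinition);

// Plain client with no load-balancing service config, so grpc-js falls
// back to its default policy (pick_first).
const client = new proto.example.ExampleService(
  'backend.example.com:50051',          // hypothetical endpoint
  grpc.credentials.createInsecure()
);
```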
Testing was done on a live test server with a tool called wrk, which hit the same endpoint continuously to simulate high throughput. Contents of the config.lua file are as follows:
Below are the results of testing for the latest version, 1.10.9, and the last version where the problem does not occur, 1.8.22:

Results for 1.10.9: the wrk load test shows a max request duration of 30.11s.

Results for 1.8.22: the wrk load test shows a max request duration of 2.59s.
Enabling GRPC_VERBOSITY=debug GRPC_TRACE=all revealed that in high-throughput scenarios, gRPC requests are stuck in QUEUE for a long time. When a request is stuck in QUEUE, it takes a long time to resolve DNS. This is not visible in the transactions-per-second metric but can be seen in the latency stats and distribution.
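As an aside, the same tracing can also be enabled from code rather than the shell; a minimal sketch (grpc-js reads these environment variables when the module loads, so they must be set before the first require of @grpc/grpc-js):

```js
// Equivalent of GRPC_VERBOSITY=debug GRPC_TRACE=all on the command line.
process.env.GRPC_VERBOSITY = 'debug';
process.env.GRPC_TRACE = 'all'; // or narrower, e.g. 'dns_resolver,channel'

const grpc = require('@grpc/grpc-js'); // tracing is active from here on
```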
Reproduction Steps
We were not able to reproduce this locally because DNS resolution is involved; it can be reliably reproduced on our live system. To reproduce it locally, the server would need to close sessions at random, causing the client to reconnect and triggering the bug (a sketch of that idea follows the list below). No load balancing is used on the client side.
Server: A high throughput gRPC server (in our case it is not in Node.js)
Client: Node.js application with @grpc/grpc-js library version 1.9.3 (inclusive) up to 1.10.9 (latest at time of testing)
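As a purely hypothetical way to emulate the random session closing locally (we have not run this against our setup; ports and hostnames are placeholders), a small TCP proxy can sit between client and server and destroy connections at random, forcing the client to reconnect:

```js
// Hypothetical repro aid: a TCP proxy that randomly kills connections so
// the gRPC client must reconnect (and, with a hostname target, re-resolve
// DNS). Ports are placeholders.
const net = require('net');

const server = net.createServer((clientSocket) => {
  const upstream = net.connect(50051, 'localhost'); // the real gRPC server
  clientSocket.pipe(upstream).pipe(clientSocket);   // relay both directions

  // Drop the session at a random point within the next 5 seconds.
  const timer = setTimeout(() => clientSocket.destroy(), Math.random() * 5000);

  clientSocket.on('close', () => { clearTimeout(timer); upstream.destroy(); });
  clientSocket.on('error', () => upstream.destroy());
  upstream.on('error', () => clientSocket.destroy());
});

// Point the gRPC client at this proxy, ideally via a hostname rather than
// an IP address so DNS resolution stays in the path.
server.listen(50052);
```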
Environment
Linux Docker: node:16.20.1-alpine
Node version: v16.20.1
npm version: v8.19.4
@grpc/grpc-js library: version 1.9.3 (inclusive) up to 1.10.9 (latest at time of testing)
Additional Context
Related Issues
We found multiple issues in the grpc-js library that could be related to this problem:
- DNS storm with round robin load balancing in grpc-js #2023 [closed] -> Introduced grpc.dns_min_time_between_resolutions_ms in 1.6.1. We do not use any special load balancing on the client side. It could be related because that issue discusses round robin and a fix for it, but the fix may have introduced a bug for other load-balancing strategies.
- Some requests wait many seconds to be performed #2567 [closed] -> A similar issue where requests are stuck in QUEUE. The reporter worked around it by setting grpc.max_reconnect_backoff_ms to 1000 ms, which the project maintainer cautioned against; the maintainer suggested updating to version 1.9.3, where the issue was supposedly fixed.
- Critical performance bottleneck with high-throughput GRPC stream handling #2669 [open] -> Describes a performance degradation similar to what we noticed, but it only measures transactions per second. Our tests (see the Problem Description section) show transactions per second as well as latency stats and distribution, which reveal that some transactions take significantly longer than the rest; this is not visible in the transactions-per-second metric alone.
Temporary Fix
We implemented a temporary fix by setting grpc.dns_min_time_between_resolutions_ms to 1000 ms. This is not a good solution: it is a global setting and can affect other parts of the application, and it also adds an unnecessary one-second delay to an otherwise fast operation. We would like to see this issue fixed in the library.
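For reference, a minimal sketch of how such a workaround is applied; the proto file, service name, and endpoint are placeholders, and our understanding is that the grpc-js default for this option is 30 seconds, which lines up with the ~30-second tail latencies described above:

```js
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// Placeholder proto/service/endpoint, as in the earlier sketch.
const proto = grpc.loadPackageDefinition(protoLoader.loadSync('example.proto'));

// Channel options go in the third constructor argument. Lowering
// grpc.dns_min_time_between_resolutions_ms to 1000 means calls queued
// behind a DNS re-resolution wait at most ~1s instead of the library
// default (~30s).
const client = new proto.example.ExampleService(
  'backend.example.com:50051',
  grpc.credentials.createInsecure(),
  { 'grpc.dns_min_time_between_resolutions_ms': 1000 }
);
```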