
Performance degradation in high throughput applications due to DNS resolution bug #2788

Closed
plume-mklokocovnik opened this issue Jul 9, 2024 · 5 comments

@plume-mklokocovnik

Problem Description

We want to report a performance bug in the @grpc/grpc-js library that severely impacts the performance of our application: some requests take up to 30 seconds to execute.

We suspect the bug was introduced with a faulty implementation of the grpc.dns_min_time_between_resolutions_ms channel option.

Please read the Additional Context section for more issues that could be related to this problem.

The bug was introduced somewhere between @grpc/grpc-js tags 1.8.22 and 1.9.3. We tested multiple @grpc/grpc-js versions on our servers to pinpoint where the problem started, and it is still present in the latest tag at the time of testing, 1.10.9.

Version tested    Performance issue exists
1.6.7             No
1.8.1             No
1.8.21            No
1.8.22            No
1.9.0             Unstable on our system, test inconclusive
1.9.1             Unstable on our system, test inconclusive
1.9.2             Unstable on our system, test inconclusive
1.9.3             Yes
1.9.7             Yes
1.9.14            Yes
1.10.2            Yes
1.10.3            Yes
1.10.9            Yes

The only thing that changed during the testing was the version of this library. The infrastructure and load test parameters all stayed the same. We do not use any load balancing in the gRPC client. Our client code is initialized as follows:

const client = new proto[serviceName](address, credentials, {});
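
For context, here is a minimal sketch of how such a client is typically constructed with @grpc/grpc-js and @grpc/proto-loader. The proto file, the example package name, the service name, the address, and the credentials below are placeholders, not our actual values:

const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// Load the service definition; 'service.proto' and the 'example' package are placeholders.
const packageDefinition = protoLoader.loadSync('service.proto');
const proto = grpc.loadPackageDefinition(packageDefinition).example;

const serviceName = 'ExampleService';               // placeholder service name
const address = 'grpc.example.internal:443';        // placeholder target
const credentials = grpc.credentials.createSsl();   // or grpc.credentials.createInsecure() for plaintext

// Same construction as above: an empty options object, so channel defaults apply.
const client = new proto[serviceName](address, credentials, {});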

Testing was done on a live test server with a tool called wrk that called the same endpoint continuously to simulate high throughput. Contents of the config.lua file are as follows:

-- auth.lua
wrk.method = "GET"
wrk.headers["Authorization"] = "Bearer [REDACTED]"

Below are the results of testing with the latest version, 1.10.9, and with the last version where the problem does not occur, 1.8.22:

Results for 1.10.9

The wrk load test shows a max request latency of 30.11s.

wrk -d 300s -t 16 -c 32 --timeout 60 --latency -s config.lua http://[REDACTED]
Running 5m test @ http://[REDACTED]
    16 threads and 32 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.25s     7.78s   30.11s    83.46%
    Req/Sec     2.76      3.31    10.00     83.02%
    Latency Distribution
        50%    1.37s 
        75%    4.92s 
        90%   19.74s 
        99%   28.83s 
    4245 requests in 5.00m, 8.80MB read
Requests/sec:     14.15
Transfer/sec:     30.02KB

Results for 1.8.22

The wrk load test shows a max request latency of 2.59s.

wrk -d 300s -t 16 -c 32 --timeout 60 --latency -s config.lua http://[REDACTED]
Running 5m test @ http://[REDACTED]
    16 threads and 32 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.23s   287.10ms   2.59s    70.55%
    Req/Sec     2.69      3.19    10.00     84.39%
    Latency Distribution
        50%    1.23s 
        75%    1.41s 
        90%    1.56s 
        99%    2.13s 
    7789 requests in 5.00m, 16.14MB read
Requests/sec:     25.96
Transfer/sec:     55.08KB

Enabling GRPC_VERBOSITY=debug GRPC_TRACE=all revealed that in high-throughput scenarios, gRPC requests are stuck in QUEUE for a long time while the client waits on DNS resolution. This is not visible in the requests-per-second metric, but it shows up clearly in the latency stats and distribution.
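
For reference, the same tracing can also be enabled from code rather than on the command line. This is a rough sketch and assumes the environment variables are set before @grpc/grpc-js is loaded, since the library reads them when it is first required:

// Must run before @grpc/grpc-js is loaded anywhere in the process
// (assumption: the library reads these variables at first require).
process.env.GRPC_VERBOSITY = 'debug';
process.env.GRPC_TRACE = 'all';

const grpc = require('@grpc/grpc-js');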

Reproduction Steps

We were not able to reproduce this locally because DNS resolution is involved, but it can be reliably reproduced on our live system. To reproduce it locally, the server would have to randomly close sessions, which causes the client to reconnect and triggers the bug. No load balancing is used on the client side. A sketch of the client-side load loop that surfaces the slow calls is included after the client description below.

Server: A high throughput gRPC server (in our case it is not in Node.js)

Client: Node.js application with @grpc/grpc-js library version 1.9.3 (inclusive) up to 1.10.9 (latest at time of testing)
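
As mentioned above, here is a rough sketch of the client-side loop that surfaces the slow calls. It reuses the client constructed in the earlier sketch; the getStatus method and the empty request are placeholders for whatever unary RPC the real service exposes:

// Repeatedly issue the same unary call and log any call that takes
// unusually long; on the affected versions a small fraction of calls
// stalls for many seconds while the channel waits on DNS re-resolution.
function callOnce() {
    const start = Date.now();
    // 'getStatus' is a placeholder unary method on the generated client from the sketch above.
    client.getStatus({}, (err, response) => {
        const elapsedMs = Date.now() - start;
        if (err) {
            console.error(`call failed after ${elapsedMs} ms:`, err.code, err.details);
        } else if (elapsedMs > 5000) {
            console.warn(`slow call: ${elapsedMs} ms`);
        }
        setImmediate(callOnce); // keep constant pressure on the channel
    });
}

// A handful of concurrent loops approximates the wrk load pattern.
for (let i = 0; i < 32; i++) {
    callOnce();
}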

Environment

  • Linux Docker: node:16.20.1-alpine

  • Node version: v16.20.1

  • npm version: v8.19.4

  • @grpc/grpc-js library: version 1.9.3 (inclusive) up to 1.10.9 (latest at time of testing)

Additional Context

Related Issues

We found multiple issues in the grpc-js library that could be related to this problem:

  • DNS storm with round robin load balancing in grpc-js #2023 [closed] -> Introduced grpc.dns_min_time_between_resolutions_ms in 1.6.1. We do not use any special load balancing on the client side, but this issue could still be related: it discusses round robin and a fix for it, and that fix may have introduced a bug for other load-balancing strategies.

  • Some requests wait many seconds to be performed #2567 [closed] -> A similar issue where requests are stuck in QUEUE. The reporter worked around it by setting grpc.max_reconnect_backoff_ms to 1000 ms, which the project maintainer cautioned against; the maintainer instead suggested updating to version 1.9.3, where the issue was supposedly fixed.

  • Critical performance bottleneck with high-throughput GRPC stream handling #2669 [open] -> Describes a performance degradation similar to what we noticed, but it only measures transactions per second. Our tests (see the Problem Description section) report transactions per second as well as latency stats and distribution, which show that some transactions take significantly longer than the rest; this is not visible in the transactions-per-second metric alone.

Temporary Fix

We implemented a temporary fix by setting grpc.dns_min_time_between_resolutions_ms to 1000 ms.

const client = new proto[serviceName](address, credentials, {
    'grpc.dns_min_time_between_resolutions_ms': 1000,
});

This is not a good solution: the option applies to the whole client channel, so it can affect other parts of the application, and it adds an unnecessary delay of up to 1 second to otherwise fast operations. We would like to see this issue fixed in the library.

@murgatroid99
Member

I don't think that setting is actually the problem, or that it is behaving incorrectly. Rather, I think it is a symptom of another problem: calls should essentially never wait on DNS resolution like that after DNS resolution succeeds once. The bug was probably introduced in 1.9.0, in #2511. The changes there were necessary, but I'm still working out some of the quirks.

I think #2784 might actually help with this. It fixes a bug where the LB policy needs to request name resolution after a disconnect or it could stop connecting entirely. That could also cause this behavior, where it doesn't continue connecting until name resolution happens again, but there's just a delay instead of stopping entirely.

@plume-mklokocovnik
Author

@murgatroid99, thank you for your prompt response.
If I understand correctly, our next step is to wait for the merge of #2784 and the creation of a new release, @grpc/grpc-js 1.10.11. Once the new release is available, we will retest it on our servers to check if the bug is resolved.
Do you have any estimate on when the new tag will be available?

@murgatroid99
Member

Yes, that is what I suggest. I expect to be able to get that out today.

@murgatroid99
Member

It's out now. Please try it out.

@plume-mklokocovnik
Author

@murgatroid99, thank you for the quick fix. The latest release, @grpc/grpc-js 1.10.11, appears to have resolved the issue.
I ran a stress test on our live system and the results confirm that the issue no longer occurs, so I think we can close this issue.

Test results using 1.10.3 to establish a baseline (bug exists)

wrk -d 300s -t 16 -c 32 --timeout 60 --latency -s config.lua http://[REDACTED]
Running 5m test @ http://[REDACTED]
  16 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.70s     7.30s   30.34s    83.85%
    Req/Sec     1.22      2.27    10.00     90.07%
  Latency Distribution
     50%    2.11s 
     75%    4.21s 
     90%   18.96s 
     99%   28.38s 
  3148 requests in 5.00m, 8.75MB read
Requests/sec:     10.49
Transfer/sec:     29.87KB

Test results using 1.10.11 (bug fixed)

I ran two tests to be sure, and both look OK.

wrk -d 300s -t 16 -c 32 --timeout 60 --latency -s config.lua http://[REDACTED]
Running 5m test @ http://[REDACTED]
  16 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.93s   240.49ms   3.17s    77.45%
    Req/Sec     1.35      2.36    10.00     88.60%
  Latency Distribution
     50%    1.90s 
     75%    2.03s 
     90%    2.23s 
     99%    2.64s 
  4954 requests in 5.00m, 13.77MB read
Requests/sec:     16.51
Transfer/sec:     47.00KB

wrk -d 300s -t 16 -c 32 --timeout 60 --latency -s config.lua http://[REDACTED]
Running 5m test @ http://[REDACTED]
  16 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.89s   244.26ms   3.46s    78.44%
    Req/Sec     1.38      2.32    10.00     87.98%
  Latency Distribution
     50%    1.88s 
     75%    2.00s 
     90%    2.17s 
     99%    2.58s 
  5046 requests in 5.00m, 14.03MB read
Requests/sec:     16.82
Transfer/sec:     47.87KB
