
Performance degradation in high throughput applications due to DNS resolution bug #2788

Closed
plume-mklokocovnik opened this issue Jul 9, 2024 · 5 comments

@plume-mklokocovnik

Problem Description

We want to report a performance bug in the @grpc/grpc-js library that severely impacts the performance of our application: some requests take up to 30 seconds to execute.

We suspect the bug was introduced with a faulty implementation of the grpc.dns_min_time_between_resolutions_ms channel option.

Please read the Additional Context section for more issues that could be related to this problem.

The bug was introduced somewhere between @grpc/grpc-js tags 1.8.22 and 1.9.3. We tested multiple @grpc/grpc-js versions on our servers to pinpoint where the problem started, and it is still present in the latest tag at the time of testing, 1.10.9.

Version tested    Performance issue exists
1.6.7             No
1.8.1             No
1.8.21            No
1.8.22            No
1.9.0             Unstable on our system, test inconclusive
1.9.1             Unstable on our system, test inconclusive
1.9.2             Unstable on our system, test inconclusive
1.9.3             Yes
1.9.7             Yes
1.9.14            Yes
1.10.2            Yes
1.10.3            Yes
1.10.9            Yes

The only thing that changed during the testing was the version of this library. The infrastructure and load test parameters all stayed the same. We do not use any load balancing in the gRPC client. Our client code is initialized as follows:

const client = new proto[serviceName](address, credentials, {});
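
For context, here is a minimal sketch of how such a client is typically constructed with @grpc/grpc-js and @grpc/proto-loader. The proto file, the example package name, the service name, the address, and the credentials below are placeholders, not our actual values:

const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// Load the service definition; 'service.proto' and the 'example' package are placeholders.
const packageDefinition = protoLoader.loadSync('service.proto');
const proto = grpc.loadPackageDefinition(packageDefinition).example;

const serviceName = 'ExampleService';               // placeholder service name
const address = 'grpc.example.internal:443';        // placeholder target
const credentials = grpc.credentials.createSsl();   // or grpc.credentials.createInsecure() for plaintext

// Same construction as above: an empty options object, so channel defaults apply.
const client = new proto[serviceName](address, credentials, {});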

Testing was done on a live test server with a tool called wrk that called the same endpoint continuously to simulate high throughput. Contents of the config.lua file are as follows:

-- auth.lua
wrk.method = "GET"
wrk.headers["Authorization"] = "Bearer [REDACTED]"

Below are the results of testing with the latest version, 1.10.9, and with the last version where the problem does not occur, 1.8.22:

Results for 1.10.9

The wrk load test shows a max request latency of 30.11s.

wrk -d 300s -t 16 -c 32 --timeout 60 --latency -s config.lua http://[REDACTED]
Running 5m test @ http://[REDACTED]
    16 threads and 32 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.25s     7.78s   30.11s    83.46%
    Req/Sec     2.76      3.31    10.00     83.02%
    Latency Distribution
        50%    1.37s 
        75%    4.92s 
        90%   19.74s 
        99%   28.83s 
    4245 requests in 5.00m, 8.80MB read
Requests/sec:     14.15
Transfer/sec:     30.02KB

Results for 1.8.22

The wrk load test shows a max request latency of 2.59s.

wrk -d 300s -t 16 -c 32 --timeout 60 --latency -s config.lua http://[REDACTED]
Running 5m test @ http://[REDACTED]
    16 threads and 32 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.23s   287.10ms   2.59s    70.55%
    Req/Sec     2.69      3.19    10.00     84.39%
    Latency Distribution
        50%    1.23s 
        75%    1.41s 
        90%    1.56s 
        99%    2.13s 
    7789 requests in 5.00m, 16.14MB read
Requests/sec:     25.96
Transfer/sec:     55.08KB

Enabling GRPC_VERBOSITY=debug GRPC_TRACE=all revealed that in high-throughput scenarios, gRPC requests are stuck in QUEUE for a long time while the client waits on DNS resolution. This is not visible in the requests-per-second metric, but it shows up clearly in the latency stats and distribution.
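
For reference, the same tracing can also be enabled from code rather than on the command line. This is a rough sketch and assumes the environment variables are set before @grpc/grpc-js is loaded, since the library reads them when it is first required:

// Must run before @grpc/grpc-js is loaded anywhere in the process
// (assumption: the library reads these variables at first require).
process.env.GRPC_VERBOSITY = 'debug';
process.env.GRPC_TRACE = 'all';

const grpc = require('@grpc/grpc-js');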

Reproduction Steps

We were not able to reproduce this locally because DNS resolution is involved, but it can be reliably reproduced on our live system. To reproduce it locally, the server would have to randomly close sessions, which causes the client to reconnect and triggers the bug. No load balancing is used on the client side. A sketch of the client-side load loop that surfaces the slow calls is included after the client description below.

Server: A high throughput gRPC server (in our case it is not in Node.js)

Client: Node.js application with @grpc/grpc-js library version 1.9.3 (inclusive) up to 1.10.9 (latest at time of testing)
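
As mentioned above, here is a rough sketch of the client-side loop that surfaces the slow calls. It reuses the client constructed in the earlier sketch; the getStatus method and the empty request are placeholders for whatever unary RPC the real service exposes:

// Repeatedly issue the same unary call and log any call that takes
// unusually long; on the affected versions a small fraction of calls
// stalls for many seconds while the channel waits on DNS re-resolution.
function callOnce() {
    const start = Date.now();
    // 'getStatus' is a placeholder unary method on the generated client from the sketch above.
    client.getStatus({}, (err, response) => {
        const elapsedMs = Date.now() - start;
        if (err) {
            console.error(`call failed after ${elapsedMs} ms:`, err.code, err.details);
        } else if (elapsedMs > 5000) {
            console.warn(`slow call: ${elapsedMs} ms`);
        }
        setImmediate(callOnce); // keep constant pressure on the channel
    });
}

// A handful of concurrent loops approximates the wrk load pattern.
for (let i = 0; i < 32; i++) {
    callOnce();
}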

Environment

  • Linux Docker: node:16.20.1-alpine

  • Node version: v16.20.1

  • npm version: v8.19.4

  • @grpc/grpc-js library: version 1.9.3 (inclusive) up to 1.10.9 (latest at time of testing)

Additional Context

Related Issues

We found multiple issues in the grpc-js library that could be related to this problem:

  • DNS storm with round robin load balancing in grpc-js #2023 [closed] -> Introduced grpc.dns_min_time_between_resolutions_ms in 1.6.1. We do not use any special load balancing on the client side, but this issue could still be related: it discusses round robin and a fix for it, and that fix may have introduced a bug for other load-balancing strategies.

  • Some requests wait many seconds to be performed #2567 [closed] -> A similar issue where requests are stuck in QUEUE. The reporter worked around it by setting grpc.max_reconnect_backoff_ms to 1000 ms, which the project maintainer cautioned against; the maintainer instead suggested updating to version 1.9.3, where the issue was supposedly fixed.

  • Critical performance bottleneck with high-throughput GRPC stream handling #2669 [open] -> Describes a performance degradation similar to what we noticed, but it only measures transactions per second. Our tests (see the Problem Description section) report transactions per second as well as latency stats and distribution, which show that some transactions take significantly longer than the rest; this is not visible in the transactions-per-second metric alone.

Temporary Fix

We implemented a temporary fix by setting grpc.dns_min_time_between_resolutions_ms to 1000 ms.

const client = new proto[serviceName](address, credentials, {
    'grpc.dns_min_time_between_resolutions_ms': 1000,
});

This is not a good solution: the option applies to the whole client channel, so it can affect other parts of the application, and it adds an unnecessary delay of up to 1 second to otherwise fast operations. We would like to see this issue fixed in the library.

@murgatroid99
Member

I don't think that setting is actually the problem, or that it is behaving incorrectly. Rather, I think it is a symptom of another problem: calls should essentially never wait on DNS resolution like that after DNS resolution succeeds once. The bug was probably introduced in 1.9.0, in #2511. The changes there were necessary, but I'm still working out some of the quirks.

I think #2784 might actually help with this. It fixes a bug where the LB policy needs to request name resolution after a disconnect or it could stop connecting entirely. That could also cause this behavior, where it doesn't continue connecting until name resolution happens again, but there's just a delay instead of stopping entirely.

@plume-mklokocovnik
Author

@murgatroid99, thank you for your prompt response.
If I understand correctly, our next step is to wait for the merge of #2784 and the creation of a new release, @grpc/grpc-js 1.10.11. Once the new release is available, we will retest it on our servers to check if the bug is resolved.
Do you have any estimate on when the new tag will be available?

@murgatroid99
Member

Yes, that is what I suggest. I expect to be able to get that out today.

@murgatroid99
Member

It's out now. Please try it out.

@plume-mklokocovnik
Author

@murgatroid99, thank you for the quick fix. The latest release, @grpc/grpc-js 1.10.11, appears to have resolved the issue.
I ran a stress test on our live system and the results confirm that the issue no longer occurs, so I think we can close this issue.

Test results using 1.10.3 to establish a baseline (bug exists)

wrk -d 300s -t 16 -c 32 --timeout 60 --latency -s config.lua http://[REDACTED]
Running 5m test @ http://[REDACTED]
  16 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.70s     7.30s   30.34s    83.85%
    Req/Sec     1.22      2.27    10.00     90.07%
  Latency Distribution
     50%    2.11s 
     75%    4.21s 
     90%   18.96s 
     99%   28.38s 
  3148 requests in 5.00m, 8.75MB read
Requests/sec:     10.49
Transfer/sec:     29.87KB

Test results using 1.10.11 (bug fixed)

I ran two tests to be sure, and both look OK.

wrk -d 300s -t 16 -c 32 --timeout 60 --latency -s config.lua http://[REDACTED]
Running 5m test @ http://[REDACTED]
  16 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.93s   240.49ms   3.17s    77.45%
    Req/Sec     1.35      2.36    10.00     88.60%
  Latency Distribution
     50%    1.90s 
     75%    2.03s 
     90%    2.23s 
     99%    2.64s 
  4954 requests in 5.00m, 13.77MB read
Requests/sec:     16.51
Transfer/sec:     47.00KB

wrk -d 300s -t 16 -c 32 --timeout 60 --latency -s config.lua http://[REDACTED]
Running 5m test @ http://[REDACTED]
  16 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.89s   244.26ms   3.46s    78.44%
    Req/Sec     1.38      2.32    10.00     87.98%
  Latency Distribution
     50%    1.88s 
     75%    2.00s 
     90%    2.17s 
     99%    2.58s 
  5046 requests in 5.00m, 14.03MB read
Requests/sec:     16.82
Transfer/sec:     47.87KB
