
https?.Server.keepAliveTimeout introduced boundary condition that results in client-side socket hang-ups #20256

Open
ggoodman opened this issue Apr 24, 2018 · 10 comments
Labels: help wanted · http · https

Comments

@ggoodman
Contributor

  • Version: v8.11.1
  • Platform: Darwin C02PQ4SHFVH8 17.5.0 Darwin Kernel Version 17.5.0: Mon Mar 5 22:24:32 PST 2018; root:xnu-4570.51.1~1/RELEASE_X86_64 x86_64
  • Subsystem: http

With the introduction of #2534, there now appears to be a window of opportunity after the trailing edge of server.keepAliveTimeout whereby a client and server have an inconsistent view of the state of a keep-alive socket's connection.

In this situation, the server has closed the client's socket; however, the client has not yet recognized this and attempts to re-use the persistent connection, resulting in a read ECONNRESET.

Below is a reproduction. Notice that the 5000 ms interval coincides with the default value of keepAliveTimeout. The script may run for some time before producing the error, as I don't know how to time things more precisely to hit the window of inconsistent state.

const Http = require('http');

const server = Http.createServer((req, res) => {
    res.writeHead(204);
    res.end();
});

const agent = new Http.Agent({
    keepAlive: true,
    maxSockets: Infinity,
});

server.listen(0, '127.0.0.1', () => {
    const address = server.address();

    let requestCount = 0;

    const requestOnce = () => {
        const n = requestCount++;
        const clientReq = Http.get(
            {
                agent,
                hostname: address.address,
                path: '/',
                port: address.port,
            },
            clientRes => {
                clientRes.on('end', () => console.log(`Ended request ${n}`));
                clientRes.on('error', err => {
                    throw err;
                });
                clientRes.resume();
            }
        );

        clientReq.on('error', err => {
            throw err;
        });

        console.log(`Starting request ${n}`);
    };

    setInterval(requestOnce, 5000);
});

On my machine, this produced the following output:

% node node8-socket-timeout.js
Starting request 0
Ended request 0
Starting request 1
Ended request 1
Starting request 2
/path/to/node8-socket-timeout.js:31
        clientReq.on('error', err => { throw err; });
                                       ^

Error: read ECONNRESET
    at _errnoException (util.js:1022:11)
    at TCP.onread (net.js:615:25)

If the above is indeed the intended behaviour of this new feature, I think the community would benefit from a warning in the http / https docs to the effect that this boundary condition exists. Maybe users could be told that one of two things should be done:

  1. Servers should now deliberately set keepAliveTimeout to 0 and suffer the same memory inefficiencies as node < 8 (a sketch follows below); or
  2. Clients should implement retry logic specific to these boundary-condition socket hang-ups

In the 2nd case, I think there is quite a bit of room for doing it incorrectly, especially for requests to services or endpoints that are not idempotent. From this perspective, I think it would be pretty helpful to have some guidance from the Node Core Team on how to properly detect this specific class of error and how to work around this new behaviour.
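For illustration, a minimal sketch of the first option: restoring the pre-v8 behaviour by disabling the server-side keep-alive timeout (the server body here is just a stand-in for any http.Server):

const Http = require('http');

const server = Http.createServer((req, res) => {
    res.writeHead(204);
    res.end();
});

// A value of 0 disables the keep-alive timeout on incoming connections,
// matching the pre-v8 behaviour: idle keep-alive sockets stay open until
// the client (or the OS) closes them, at the cost of holding them in memory.
server.keepAliveTimeout = 0;

server.listen(0, '127.0.0.1');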

@ggoodman
Contributor Author

I think my biggest concern is that this is the new default behaviour, and that it will:

  1. Make keep-alive clients much more likely to receive read ECONNRESET errors; and
  2. Give server authors little or no visibility into these failures

@bnoordhuis
Member

Clients should already be prepared to handle network errors. That never changed, so I don't see what the documentation should specifically call out.

The server authors will likely not have any visibility into this

Handle the 'timeout' event manually if you want insight into that.
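For example, a minimal sketch of that suggestion (the server body and log message are only illustrative; note that, per the docs quoted further down, attaching a 'timeout' listener means the timed-out socket must be destroyed explicitly):

const Http = require('http');

const server = Http.createServer((req, res) => {
    res.writeHead(204);
    res.end();
});

// Gives the server side some visibility into sockets that time out,
// including keep-alive sockets that sat idle past keepAliveTimeout.
server.on('timeout', socket => {
    console.log('socket timed out:', socket.remoteAddress, socket.remotePort);
    // With a 'timeout' listener attached, the socket is no longer destroyed
    // automatically, so destroy it here.
    socket.destroy();
});

server.listen(0, '127.0.0.1');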

@ggoodman
Contributor Author

ggoodman commented Apr 25, 2018

Hi @bnoordhuis thanks for the quick response.

That never changed, so I don't see what the documentation should specifically call out.

This is a change that was tagged as semver-minor in the release notes and really caught us off guard. The only reason we even knew it was happening is that we were operating both the client and server side and noticed a sudden spike in socket hang-ups. For us, it required dropping down to the TCP layer to figure out what was going on and whether this was a regression in our codebase or in node core.

I think my point is that this new default behaviour introduced significant new errors that are invisible from the server side, so we (and likely many others) would have been blind to them if we hadn't happened to be observing both the client and the server.

Handle the 'timeout' event manually if you want insight into that.

I don't see any mention of a 'timeout' event emitted by http.Server, https.Server or net.Server, so I'm not sure how this would give the operator of a server-side component visibility into this behaviour. From what we saw, requests were being rejected at a lower level than node during this window.

At a high level, I agree that well-behaved clients should be written to retry when their socket hangs up before the headers are sent. In practice, though, I haven't seen any popular http clients that behave this way, so I suspect that quite a few services have now introduced significant increases in failed requests without even realizing it (because there is no event or error that can be monitored from the server side).

At the very least, hopefully this issue will help others who find themselves struggling with the same problem when upgrading to 8.x. In the best-case scenario, I'd love to see the docs explain somewhere that there's an unavoidable window of time during which the server has closed the socket but the client doesn't realize it yet (and that this window will be hit much more often due to keepAliveTimeout defaulting to 5000 ms).

@bnoordhuis
Member

I don't see any mention of a 'timeout' event emitted by http.Server, https.Server or net.Server

server.setTimeout() in http.md describes the event and what the default action is. Quoted here for posterity:

If there is a `'timeout'` event listener on the Server object, then it
will be called with the timed-out socket as an argument.

By default, the Server's timeout value is 2 minutes, and sockets are
destroyed automatically if they time out. However, if a callback is assigned
to the Server's `'timeout'` event, timeouts must be handled explicitly.

I'd love to see the docs explain somewhere [..]

You're welcome to open a pull request if you think there is room for improvement. I don't have suggestions on where to add that since you're writing from the perspective of the client.

@joshwilsdon

I came to file an issue as I ran into this same problem in testing when we upgraded some services to node v6.17.0, which we didn't realize included breaking changes. But I found this GH issue, so I'm just adding my notes here.

The reason this caused so much concern for us is that we have a client that polls a server on 5-second intervals to get status while a job is running. After almost 6 years of working across several previous versions of node with no known errors, we suddenly started having 20-50% of these polls fail after upgrading (from v6.15.1) to v6.17.0.

I had tracked the problem down to f23b3b6, which was added to v6.x in v6.17.0, after narrowing my testing down to this example/test server:

var http = require('http');

console.log('running ' + process.version);

var server = http.createServer(function _onRequest(req, res) {
    res.end('hello world\r\n');
});

server.listen(8080, '127.0.0.1', function _onListening() {
    var ip = server.address().address;
    var port = server.address().port;

    console.log('ready on http://%s:%d', ip, port);
});

and this test client, which attempts to emulate what our poller does as closely as possible:

var http = require('http');

var DELAY = 5000; // very sensitive to timing, +/-10ms does not reliably fail
var NUM_TESTS = 100;

var completed = 0;
var keepAliveAgent = new http.Agent({
    keepAlive: true,
    maxSockets: 1
});
var result_fail = 0;
var result_success = 0;

function _doneGet() {
    completed++;
    if (completed < NUM_TESTS) {
        setTimeout(doGet, DELAY);
    } else {
        console.log('SUCCESS: ' + result_success + ' FAIL: ' + result_fail);
    }
}

function doGet() {
    var req = http.request({
        agent: keepAliveAgent,
        host: '127.0.0.1',
        port: 8080,
        method: 'GET'
    });

    // To kick off the request
    req.end();

    req.on('response', function _onResponse(res) {
        res.on('data', function _onData(chunk) {
            // Do nothing, we don't care about data. But we need to read it
            // otherwise node will wait for us to do so.
        });

        res.on('end', function _doneResponse() {
            if (res.statusCode === 200) {
                result_success++;
            } else {
                console.log(completed + ' ERR: status code ' + res.statusCode);
                result_fail++;
            }
            _doneGet();
        });
    });

    req.on('error', function _onError(err) {
        console.log(completed + ' ERR: ' + err.message);
        result_fail++;
        _doneGet();
    });
}

doGet();

which produces output like (here using the "official" v6.17.0 binary on macOS):

$ ./node-v6.17.0-darwin-x64/bin/node client.js
2 ERR: socket hang up
4 ERR: socket hang up
7 ERR: read ECONNRESET
10 ERR: socket hang up
12 ERR: socket hang up
14 ERR: socket hang up
17 ERR: read ECONNRESET
19 ERR: socket hang up
21 ERR: read ECONNRESET
23 ERR: socket hang up
25 ERR: socket hang up
27 ERR: socket hang up
29 ERR: socket hang up
31 ERR: socket hang up
33 ERR: socket hang up
35 ERR: socket hang up
37 ERR: socket hang up
39 ERR: read ECONNRESET
41 ERR: socket hang up
43 ERR: read ECONNRESET
45 ERR: socket hang up
47 ERR: socket hang up
49 ERR: socket hang up
51 ERR: socket hang up
53 ERR: socket hang up
55 ERR: socket hang up
57 ERR: socket hang up
59 ERR: socket hang up
61 ERR: read ECONNRESET
63 ERR: socket hang up
65 ERR: socket hang up
67 ERR: socket hang up
69 ERR: read ECONNRESET
71 ERR: socket hang up
73 ERR: socket hang up
75 ERR: read ECONNRESET
77 ERR: socket hang up
79 ERR: socket hang up
81 ERR: socket hang up
83 ERR: socket hang up
86 ERR: socket hang up
88 ERR: socket hang up
90 ERR: socket hang up
92 ERR: socket hang up
94 ERR: socket hang up
96 ERR: read ECONNRESET
98 ERR: socket hang up
SUCCESS: 53 FAIL: 47

and looking at the traffic in Wireshark, I see that when we get either "socket hang up" or "read ECONNRESET" we're getting an RST immediately after the second GET request shows up at the server on a given TCP connection. When we have 2 consecutive requests that succeed, it's because the subsequent request(s) got lucky and the server closed the connection just before the next request was made.

I verified on macOS that node versions v0.10.48, v0.12.18, v4.9.1, v6.15.1 and v6.16.0 behave as expected with no errors, but v6.17.0 has a high failure rate. On SmartOS, v6.15.1 also had no errors, but older versions were not tested. I also found that v10.15.3 on macOS has this same problem (and from the above I'd assume v8.x does as well, but I did not test).

From the discussion above in this issue it sounds like this breakage is unlikely to get fixed, but I'm leaving this here just in case it helps someone else find this more quickly if/when they upgrade to v6.17.0 and are broken by this. I know it would have been helpful to me if I had found the comments from @ggoodman before I had independently tracked down the same issue.

Any client that polls on 5-second intervals against a server running >= v6.17.0 where keep-alives are used will likely hit this a high percentage of the time.

@lattwood

lattwood commented Sep 4, 2019

This is possibly related to #20256, and AWS ALBs consistently break from this as a result.

@BridgeAR
Member

BridgeAR commented Jan 3, 2020

@ronag I guess you might be interested in this issue.

@BridgeAR added the http and https labels Jan 3, 2020
@ronag
Member

ronag commented Jan 3, 2020

As far as I understand, this is a known race condition with HTTP keep-alive which the HTTP spec does not specify how to resolve.

It's always been a problem, and from what I understand the changes mentioned here make it more common. Preferably, the timeout on the server should always be longer than the one on the client (but there is no way to enforce this).

It's very important to understand the limitations of keepAlive and to keep in mind that a keepAlive request can fail even though everything seems "ok" (since the server may at any time decide to kill what it considers an unused connection).

We added https://nodejs.org/api/http.html#http_request_reusedsocket to help slightly with this situation; the docs there also provide some information and examples describing it.
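Roughly, the pattern those docs describe is to retry only when the failed request went out on a reused socket and died with ECONNRESET, and only for requests that are safe to repeat. A sketch along those lines (the URL is a placeholder, and the retry is deliberately limited to an idempotent GET):

const http = require('http');

const agent = new http.Agent({ keepAlive: true });

function retriableGet(url, onResponse) {
    const req = http
        .get(url, { agent }, onResponse)
        .on('error', err => {
            // The request was queued on a socket that had already served a
            // previous request; the server may have closed it in the race
            // described above, so retrying on a fresh socket is reasonable
            // for an idempotent GET.
            if (req.reusedSocket && err.code === 'ECONNRESET') {
                retriableGet(url, onResponse);
            }
        });
}

retriableGet('http://localhost:3000/', res => res.resume());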

In summary, my recommendation is: if you want to keep the request failure rate to an absolute minimum, don't use keepAlive. Personally, I only use keepAlive with idempotent requests that can be safely retried.

Maybe the docs should be even more explicit about this? Not sure. I don't feel confident enough on the topic to have a solid opinion.

@jasnell added the help wanted label Jun 26, 2020
@Bikeman868

I came to this thread because I get intermittent 502 responses from my AWS Application Load Balancer. Because the ALB keep-alive timeout is 60 seconds and Nest defaults to 5 seconds, every 5 seconds the ALB returns a 502 to the caller.
I searched the web and cannot find anything about how to change the keep-alive timeout in Nest.
If the keep-alive timeout in Nest is fixed at 5 seconds, then it cannot be successfully used in any load-balancing setup, which makes it unsuitable for most production applications.
Please change the default to something more reasonable, or provide a mechanism for overriding the default value.
Thanks.

@zorankova

I came to this thread because I get intermittent 502 responses from my AWS Application Load Balancer. [...]

We have the same issue with 502s using NestJS and ALB.

You can change it with:

const app = await NestFactory.create(AppModule, { });
app.getHttpAdapter().getHttpServer().keepAliveTimeout = 75 * 1000;
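
And for anyone running a plain Node http/https server rather than Nest, the equivalent is to set the property on the server directly; the 75-second figure below just mirrors the snippet above and only needs to exceed the ALB's 60-second idle timeout:

const http = require('http');

const server = http.createServer((req, res) => {
    res.end('ok');
});

// Keep idle keep-alive sockets open longer than the ALB's 60 s idle timeout,
// so the load balancer (not Node) is the side that closes the connection.
server.keepAliveTimeout = 75 * 1000;

server.listen(8080);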
