[Bug/Feature] Keep-alive \n should be sent roughly every 30 seconds in Event Stream responses to prevent connection timeout errors for cases of very long ITL #2470

Open
josephrocca opened this issue Sep 14, 2024 · 8 comments

@josephrocca

josephrocca commented Sep 14, 2024

Note: This is half bug (since it causes unnecessary errors in certain situations), and half feature request (since LMDeploy itself is not responsible for connection timeouts). I wasn't sure which to categorize it as.

Describe the bug/feature

For event streams, it's valid to send \n or : ping\n every so often to prevent the request from timing out if there is, for some reason, a long delay between tokens. This can happen if the server is temporarily overloaded, and if there is a point along the connection between the server and the client which enforces timeouts on "dead-looking" requests - e.g. the browser, or a proxy server, or a service like Cloudflare.
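For illustration, a response stream with keepalives interleaved might look roughly like this on the wire (payloads abbreviated; both the SSE comment line `: ping` and a bare blank line are ignored by clients):

```
data: {"choices": [...]}

data: {"choices": [...]}

: ping

data: {"choices": [...]}
```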

In my case, I have Cloudflare in front, which enforces a maximum timeout of 100 seconds of stalled response stream, at which point it will close the connection.

Reproduction

In my case I'm running this test on a Runpod H100 machine:

lmdeploy serve api_server lmdeploy/llama2-chat-70b-4bit --model-name "lmdeploy/llama2-chat-70b-4bit" --server-port 3000 --session-len 8192 --model-format awq --enable-prefix-caching --quant-policy 4  --log-level INFO

Runpod uses Cloudflare in front of their machines, so if you e.g. send 100 concurrent requests (in my case with 2000 tokens of random numbers, to ensure no prefix cache matches), then a subset of those will return an error, even though nothing went wrong on the machine itself - the error is caused by the connection timing out.

Solution

The solution, as mentioned above, is to send \n every 30 seconds or so for active streaming requests, so that the response stream is never "stalled" for more than 30 seconds.

@josephrocca josephrocca changed the title [Bug] Keep-alive \n should be sent roughly every 30 seconds in Event Stream responses to prevent connection timeout errors for cases of very long ITL [Bug/Feature] Keep-alive \n should be sent roughly every 30 seconds in Event Stream responses to prevent connection timeout errors for cases of very long ITL Sep 14, 2024
@AllentDan
Collaborator

Will sending \n affect TTFT (time to first token)? Besides, the maximum timeout of the client can be adjusted, right?

@josephrocca
Author

> Will sending \n affect TTFT?

@AllentDan No - since the newline does not constitute an event stream data item, it is ignored by the client and serves only as a heartbeat/keepalive. The tokens come in `data: <json>` messages, so TTFT is based on the first one of those that is received.

Sending heartbeat/keepalive newlines is common practice. The standard recommends/mentions 15 seconds, but in practice 30 seconds is more than frequent enough. The standard also recommends using `: ping\n` as the "semantically correct" approach, but in practice `\n` is better since it is practically equivalent and has wider support in client libraries (it's also fewer bytes to send, but that is basically irrelevant).

> Besides, the maximum timeout of the client can be adjusted, right?

Unfortunately this is sometimes not possible, since the developer may not control all of the gateways upstream/downstream of them. For example, there is no way to raise the default 100-second limit in Cloudflare (unless you are an enterprise customer - costing roughly thousands of dollars per month).

@AllentDan
Collaborator

AllentDan commented Sep 19, 2024

I see. Is there an elegant way to do it? For example, at what interval should we send \n? Does 30 seconds meet all users' requirements? I did not see a similar feature in inference engines like vLLM.

@josephrocca
Author

Yes, 30 seconds is a sufficient frequency. I'm not sure how to do this elegantly in Python, since I don't know the language well. In JavaScript I would do something like this (pseudo-code):

```js
async function requestHandler(request, response) {
  // Write a bare newline every 30 seconds so the stream never looks stalled.
  let keepaliveInterval = setInterval(() => response.write("\n"), 1000*30);
  try {
    for await (let chunk of getChunks(request)) {
      // SSE events are terminated by a blank line, hence the double newline.
      response.write(`data: ${JSON.stringify(chunk)}\n\n`);
    }
  } catch(e) {
    // error handling
  } finally {
    clearInterval(keepaliveInterval);
    response.end();
  }
}
```

I'm not familiar enough with Python's asyncio to know what the idiomatic equivalent of setInterval looks like here.
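For illustration only, one way this might look in asyncio is to wrap the token generator so that a bare newline is yielded whenever no chunk has arrived within the keepalive interval. This is just a sketch, not LMDeploy's actual code; `chunks` is a hypothetical async iterator standing in for the real generation stream:

```python
import asyncio


async def with_keepalive(chunks, interval: float = 30.0):
    """Yield items from the async iterator `chunks`; whenever no item has
    arrived within `interval` seconds, yield a bare newline as a keepalive."""
    it = chunks.__aiter__()
    next_item = asyncio.ensure_future(it.__anext__())
    while True:
        # asyncio.wait does not cancel the pending task on timeout,
        # so no generated token is lost while we emit keepalives.
        done, _ = await asyncio.wait({next_item}, timeout=interval)
        if not done:
            yield "\n"  # keepalive: ignored by SSE clients, resets proxy idle timers
            continue
        try:
            item = next_item.result()
        except StopAsyncIteration:
            break
        yield item
        next_item = asyncio.ensure_future(it.__anext__())
```

The wrapped generator could then be handed to whatever streaming response object the server already uses, keeping the keepalive logic in one place - though where exactly that hook would live in LMDeploy's code is an open question.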

This is probably not a big issue for most users, since it's rare for TTFT/ITL to be long enough to trigger timeouts, so if this is very hard to implement in Python, then this issue could be deprioritized. (FWIW, I'm 100x more interested in speculative decoding.) But if it's not too hard, it's a nice addition for robustness. I hit the issue during evals when I was maximizing throughput and some request spikes caused a long enough delay to hit Cloudflare's connection timeout.

> I did not see a similar feature in inference engines like vLLM.

Yep, and I noticed that both vLLM and SGLang suffer from this issue even more than LMDeploy, since IIRC they sometimes show a strange "cliff" in TTFT (instead of a gradual rise) when processing many prefills - especially SGLang. Though I did not adjust their equivalents of max_prefill_token_num/num_tokens_per_iter, so it's possibly an unfair comparison.

@josephrocca
Author

josephrocca commented Sep 19, 2024

Pseudo-code JavaScript for an alternate approach which may be easier to achieve in Python:

```js
// Every 30 seconds, write a keepalive newline to all active responses:
let activeResponses = new Set();
setInterval(() => {
  for(let response of activeResponses) {
    response.write("\n"); // ignored by SSE clients, resets proxy idle timers
  }
}, 1000*30);

async function requestHandler(request, response) {
  try {
    activeResponses.add(response); // tag response as active
    for await (let chunk of getChunks(request)) {
      // SSE events are terminated by a blank line, hence the double newline.
      response.write(`data: ${JSON.stringify(chunk)}\n\n`);
    }
  } catch(e) {
    // error handling
  } finally {
    activeResponses.delete(response); // un-tag response as active
    response.end();
  }
}
```
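For what it's worth, a rough asyncio sketch of this same idea might look like the following. All names here are hypothetical illustrations rather than LMDeploy's actual internals; `chunks` again stands in for the per-request token stream:

```python
import asyncio

# Hypothetical module-level registry of active response queues.
active_queues: set[asyncio.Queue] = set()


async def keepalive_broadcaster(interval: float = 30.0):
    """Single background task: every `interval` seconds, push a bare
    newline into every active response queue."""
    while True:
        await asyncio.sleep(interval)
        for queue in list(active_queues):
            queue.put_nowait("\n")  # ignored by SSE clients


async def stream_with_keepalive(chunks):
    """Per-request generator: forwards generated SSE chunks through a queue
    that the broadcaster can also push keepalive newlines into."""
    queue: asyncio.Queue = asyncio.Queue()
    active_queues.add(queue)

    async def producer():
        async for chunk in chunks:
            await queue.put(chunk)
        await queue.put(None)  # sentinel: generation finished

    task = asyncio.create_task(producer())
    try:
        while (item := await queue.get()) is not None:
            yield item
    finally:
        active_queues.discard(queue)
        task.cancel()
```

The broadcaster would need to be started once at server startup; again, this is only a sketch of the shape, not a claim about how LMDeploy's server is structured.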

@AllentDan
Collaborator

Hi @josephrocca, thanks for your scripts and efforts. I consulted our engineering team, and they said that in this case it is generally considered that the response delay that users can accept has an upper limit. There is little practical value in keeping the client from timing out once that limit is exceeded. In addition, even if timeouts do occur, they will be reported to the front end, which can then display the timeout.

@AllentDan
Collaborator

As for speculative decoding, it is in our scope. Stay tuned.

@josephrocca
Author

josephrocca commented Sep 20, 2024

> the response delay that users can accept has an upper limit

In my case I ran into the problem while sending a large batch of requests for an eval framework, not in a normal "chat" situation where latency is important. In batch/bulk/eval-type scenarios, TTFT is not a concern. And in that case, it makes sense for LMDeploy to follow standard practice for event streams, since many clients/proxies/gateways assume that standard practice is being followed. It makes LMDeploy work robustly "out of the box" in more situations, with no downsides (aside from codebase complexity, which I admit is a very important consideration).

But do feel free to close this if it's too low priority - it's not a big issue for me.

(I'm very excited to hear speculative decoding is in scope! Keenly looking forward to that.)
