client-api: Rewrite websocket loop #2906

Open
wants to merge 24 commits into base: master

Conversation

Contributor

@kim kim commented Jun 26, 2025

Split the websocket stream into send and receive halves and spawn a
new tokio task to handle the sending. Also move message serialization +
compression to a blocking task if the message appears to be large.

This addresses two issues:

  1. The select! loop is not blocked on sending messages, and can thus
    react to auxiliary events. Namely, when a module exits, we want to
    terminate the connection as soon as possible in order to release any
    database handles.

  2. Large outgoing messages should not occupy tokio worker threads, in
    particular when there are a large number of clients receiving large
    initial updates.
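
A minimal sketch of the two points above, not the code from this patch. It uses std threads and channels in place of tokio tasks so that it has no dependencies; in the real code the sender would be a tokio task and the offload would go through something like `tokio::task::spawn_blocking`. `SIZE_THRESHOLD`, `serialize`, `encode`, `run_sender` and `spawn_sender` are hypothetical names.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Hypothetical cutoff above which serialization is moved to another thread.
const SIZE_THRESHOLD: usize = 1024;

/// Stand-in for serialization + compression: it only copies the bytes.
fn serialize(payload: &[u8]) -> Vec<u8> {
    payload.to_vec()
}

/// Serializes inline when the payload is small, on a separate thread when large.
/// Returns the frame and whether the work was offloaded.
fn encode(payload: Vec<u8>) -> (Vec<u8>, bool) {
    if payload.len() > SIZE_THRESHOLD {
        let handle = thread::spawn(move || serialize(&payload));
        // Joining right away still blocks this thread; an async runtime would
        // await the handle instead, freeing the worker thread in the meantime.
        (handle.join().expect("serializer thread panicked"), true)
    } else {
        (serialize(&payload), false)
    }
}

/// Send half: drains the queue until every `Sender` has been dropped.
/// Returns the number of bytes that would have been written to the socket.
fn run_sender(queue: Receiver<Vec<u8>>) -> usize {
    let mut written = 0;
    for payload in queue {
        let (frame, _offloaded) = encode(payload);
        written += frame.len(); // a real sender would write `frame` to the socket
    }
    written
}

/// Spawns the send half so the caller's loop never waits on socket writes.
fn spawn_sender() -> (Sender<Vec<u8>>, thread::JoinHandle<usize>) {
    let (tx, rx) = channel();
    (tx, thread::spawn(move || run_sender(rx)))
}
```

The main loop keeps only the `Sender` handle, so enqueueing a message returns immediately and the loop stays free to react to other events such as a module exit.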

Expected complexity level and risk

4 - The state transitions remain hard to follow.

Testing

  • Ran a stress test with many clients and large initial updates,
    and observed no hangs or delays (both of which occurred before this patch).
    In reconnection scenarios, all clients were disconnected promptly, but
    could reconnect almost immediately.

@kim kim requested review from Centril, gefjon and jsdt June 26, 2025 18:02
Contributor

@gefjon gefjon left a comment

I'd like to figure out what's going on with the SerializeBuffer and fix it before merging, but otherwise this looks good to me.

kim added 2 commits June 27, 2025 10:07
Also close the messages queue after the close went through.
Accordingly, closed and exited are the same -- we can just drop incoming
messages when closed.
Contributor Author

kim commented Jun 27, 2025

Updated to:

  • Reclaim the serialize buffer
  • Not send any more data after sending a Close frame (as mandated by the RFC)

I think that we should also clear the message queue and cancel outstanding execution futures in the latter case, but that can be left to a future change.
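
The "no data after Close" rule can be sketched as a small guard in front of the socket. This is an illustration under assumptions, not the patch's implementation: `Frame` and `SendState` are hypothetical types, and a `Vec` stands in for the socket.

```rust
/// Hypothetical outgoing frame type.
#[derive(Debug, PartialEq)]
enum Frame {
    Data(Vec<u8>),
    Close,
}

/// Tracks whether a Close frame has already been sent.
#[derive(Default)]
struct SendState {
    closed: bool,
    sent: Vec<Frame>, // stands in for the socket
}

impl SendState {
    /// Returns `true` if the frame was sent, `false` if it was dropped
    /// because a Close frame had already gone out.
    fn send(&mut self, frame: Frame) -> bool {
        if self.closed {
            return false;
        }
        if matches!(frame, Frame::Close) {
            self.closed = true;
        }
        self.sent.push(frame);
        true
    }
}
```

Dropping frames silently is only one possible policy; clearing the queue and cancelling outstanding work, as suggested above, would sit on top of the same `closed` flag.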

Contributor

@jsdt jsdt left a comment

I looked through this for a while, and I'm still not very confident that I understand the error cases. I think we should do some bot testing with this to see what effect it has, but I think I'd like to try writing some tests for this, so we can trigger some of these tricky cases.

kim added 2 commits June 29, 2025 11:21
Also fixes the actual resource hog, which is that the ws_actor never
terminated because all receive errors were ignored.
Contributor Author

kim commented Jun 30, 2025

Updated to:

  • split into smaller functions that mainly transform Streams, for readability and testability
  • actually terminate the actor loop when recv from the socket returns an error
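
As a rough illustration of the second point, the sketch below models the receive side as an iterator of `Result`s and stops at the first error. `RecvError` and `actor_loop` are hypothetical names; the actual actor works on async streams rather than iterators.

```rust
/// Hypothetical receive error.
#[derive(Debug, PartialEq)]
struct RecvError;

/// Processes incoming items until the input ends or yields an error.
/// Returns the number of messages handled and the error that stopped
/// the loop, if any.
fn actor_loop<I>(incoming: I) -> (usize, Option<RecvError>)
where
    I: IntoIterator<Item = Result<Vec<u8>, RecvError>>,
{
    let mut handled = 0;
    for item in incoming {
        match item {
            Ok(_msg) => handled += 1,
            // Skipping errors here (e.g. `Err(_) => continue`) would keep
            // the loop, and everything it holds, alive indefinitely.
            Err(e) => return (handled, Some(e)),
        }
    }
    (handled, None)
}
```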

@kim kim changed the title from "client-api: Move websocket sender to its own tokio task" to "client-api: Rewrite websocket loop" Jun 30, 2025
Contributor Author

kim commented Jun 30, 2025

Updated to:

  • consider that buffer reclamation can fail if the socket is already closed

  • re-introduce spawning the send loop

    This seems to be necessary in order to guarantee timely release of the database.
    I'm considering spawning the receive end, too, so that we can get rid of the unbounded buffer and apply backpressure to clients instead.
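
To illustrate what replacing an unbounded buffer with backpressure means, here is a sketch using std's bounded `sync_channel` rather than the tokio channels the real code would use. `accepted_before_full` is a hypothetical helper: it shows that a bounded queue refuses new messages once it is full, which is the signal that lets the receive side stop reading from a client.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Pushes up to `attempts` messages into a bounded queue of `capacity`
/// that nobody is draining, and returns how many were accepted before
/// the queue reported that it was full.
fn accepted_before_full(capacity: usize, attempts: usize) -> usize {
    // `_rx` is kept alive so the channel counts as connected.
    let (tx, _rx) = sync_channel::<Vec<u8>>(capacity);
    let mut accepted = 0;
    for _ in 0..attempts {
        match tx.try_send(vec![0u8; 16]) {
            Ok(()) => accepted += 1,
            // Full: the producer must wait, which is the backpressure signal.
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}
```

With an unbounded queue every attempt would be accepted and memory use would grow with the backlog; with a bound, a slow consumer eventually slows the producer down.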

Contributor Author

kim commented Jun 30, 2025

Updated to:

  • spawn the receive end, too

@bfops bfops added the release-any To be landed in any release window label Jun 30, 2025
kim added 8 commits July 1, 2025 11:56
Pong frames sit in line with previously sent messages, and so may not be
received in time if the server is backlogging.

We also want to time out the connection in "cable plugged" scenarios,
where the kernel doesn't consider the connection gone until
`tcp_keepalive_time`.
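
One way to read the commit message above is that any received frame counts as evidence of liveness, and the connection is closed once nothing has arrived for some idle period. The sketch below shows that idea only; `IdleTimer` and its methods are hypothetical names, and the timeout value is an arbitrary assumption, not taken from the patch.

```rust
use std::time::{Duration, Instant};

/// Tracks the time of the last frame received from the peer.
struct IdleTimer {
    timeout: Duration,
    last_activity: Instant,
}

impl IdleTimer {
    fn new(timeout: Duration, now: Instant) -> Self {
        IdleTimer { timeout, last_activity: now }
    }

    /// Call for every received frame, not only for Pong frames.
    fn on_frame_received(&mut self, now: Instant) {
        self.last_activity = now;
    }

    /// True once nothing has been received for at least `timeout`.
    fn is_expired(&self, now: Instant) -> bool {
        now.duration_since(self.last_activity) >= self.timeout
    }
}
```

Passing `now` in explicitly keeps the timer testable without sleeping; the connection loop would call `is_expired(Instant::now())` on a periodic tick.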
4 participants