client-api: Rewrite websocket loop #2906

kim · 2025-06-26T18:02:14Z

Split the websocket stream into send and receive halves and spawn a
new tokio task to handle the sending. Also move message serialization +
compression to a blocking task if the message appears to be large.

This addresses two issues:

1. The `select!` loop is not blocked on sending messages, and can thus
   react to auxiliary events. Namely, when a module exits, we want to
   terminate the connection as soon as possible in order to release any
   database handles.
2. Large outgoing messages should not occupy tokio worker threads, in
   particular when there are a large number of clients receiving large
   initial updates.
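As a rough illustration of the two points above, here is a std-only sketch that uses a thread and a channel as stand-ins for the tokio send task and its message queue. `LARGE_MESSAGE_BYTES`, `should_offload` and `run_sender` are hypothetical names, and the cutoff value is an assumption; none of them are taken from the patch.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical cutoff: the PR does not state the real threshold.
const LARGE_MESSAGE_BYTES: usize = 4096;

/// Decides whether encoding should be moved off the async worker thread.
fn should_offload(estimated_len: usize) -> bool {
    estimated_len >= LARGE_MESSAGE_BYTES
}

/// Runs a dedicated sender and returns the sizes of the frames it "wrote".
fn run_sender(frames: Vec<Vec<u8>>) -> Vec<usize> {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    let sender = thread::spawn(move || {
        // A real send loop would write each frame to the socket's send half.
        rx.into_iter().map(|frame| frame.len()).collect::<Vec<usize>>()
    });
    for frame in frames {
        // Enqueueing never waits on the socket, so the caller stays free
        // to react to other events (such as a module exit).
        tx.send(frame).expect("sender exited early");
    }
    // Dropping the last queue handle lets the sender loop finish.
    drop(tx);
    sender.join().expect("sender panicked")
}
```

The point of the split is that the caller only ever enqueues; whether the peer is slow to read is the sender's problem alone.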

Expected complexity level and risk

4 - The state transitions remain hard to follow.

Testing

Ran a stress test with many clients and large initial updates,
and observed no hangs / delays (which I did observe before this patch).
In reconnection scenarios, all clients were disconnected promptly, and
could reconnect almost immediately.

crates/client-api/src/routes/subscribe.rs

gefjon

I'd like to figure out what's going on with the SerializeBuffer and fix it before merging, but otherwise this looks good to me.

Also close the messages queue after the close went through. Accordingly, closed and exited are the same -- we can just drop incoming messages when closed.

kim · 2025-06-27T11:09:02Z

Updated to:

- Reclaim the serialize buffer
- Stop sending data after a Close frame has been sent (as mandated by the RFC)

I think that we should also clear the message queue and cancel outstanding execution futures in the latter case, but that can be left to a future change.
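The "no data after Close" rule can be pictured as a small state machine. The following is a minimal sketch with hypothetical types (`Frame`, `Outbound`), not the code from the patch: once a Close frame has gone out, every later frame is dropped.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Frame {
    Data(Vec<u8>),
    Close,
}

/// Hypothetical outbound side of a connection; `sent` stands in for the socket.
#[derive(Default)]
struct Outbound {
    closed: bool,
    sent: Vec<Frame>,
}

impl Outbound {
    /// Returns `true` if the frame was sent, `false` if it was dropped
    /// because a Close frame has already been sent.
    fn send(&mut self, frame: Frame) -> bool {
        if self.closed {
            return false;
        }
        if matches!(frame, Frame::Close) {
            self.closed = true;
        }
        self.sent.push(frame);
        true
    }
}
```

Dropping silently is the simplest policy; clearing the queue and cancelling outstanding work, as suggested above, would sit on top of the same `closed` flag.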

jsdt

I looked through this for a while, and I'm still not very confident that I understand the error cases. I think we should do some bot testing with this to see what effect it has, but I think I'd like to try writing some tests for this, so we can trigger some of these tricky cases.

crates/client-api/src/routes/subscribe.rs

Also fixes the actual resource hog, which is that the ws_actor never terminated because all receive errors were ignored.

kim · 2025-06-30T07:02:01Z

Updated to:

- split into smaller functions that mainly transform Streams, for readability and testability
- actually terminate the actor loop when recv from the socket returns an error
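To make the second point concrete, here is a simplified sketch of a receive loop that stops at the first error instead of ignoring it. It uses a plain iterator in place of the socket stream, and `Exit` and `recv_loop` are hypothetical names chosen for illustration.

```rust
#[derive(Debug, PartialEq)]
enum Exit {
    /// The incoming stream ended normally.
    PeerClosed,
    /// Receiving failed; the loop must not keep polling the socket.
    RecvError(String),
}

/// Handles messages until the stream ends or the first receive error.
fn recv_loop<I>(incoming: I) -> (Vec<String>, Exit)
where
    I: IntoIterator<Item = Result<String, String>>,
{
    let mut handled = Vec::new();
    for item in incoming {
        match item {
            Ok(msg) => handled.push(msg),
            // Returning here is the fix: an ignored error would keep the
            // actor (and the resources it holds) alive indefinitely.
            Err(e) => return (handled, Exit::RecvError(e)),
        }
    }
    (handled, Exit::PeerClosed)
}
```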

kim · 2025-06-30T07:58:34Z

Updated to:

- consider that buffer reclamation can fail if the socket is already closed
- re-introduce spawning the send loop

Spawning the send loop seems to be necessary in order to guarantee timely release of the database.
I'm considering spawning the receive end, too, so that we can get rid of the unbounded buffer and apply backpressure to clients instead.
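For readers unfamiliar with the backpressure idea, the sketch below shows the difference a bounded queue makes, using `std::sync::mpsc::sync_channel` as a stand-in for whatever bounded channel the real code would use. `enqueue_all` is a hypothetical helper; in an async receiver, the "full" case is where reading from the socket would pause rather than reject.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Tries to enqueue every message into a bounded queue that nobody drains.
/// Returns how many were accepted and how many hit the capacity limit.
fn enqueue_all(capacity: usize, messages: &[&str]) -> (usize, usize) {
    // `_rx` is kept alive so the channel counts as connected.
    let (tx, _rx) = sync_channel::<String>(capacity);
    let mut accepted = 0;
    let mut rejected = 0;
    for m in messages {
        match tx.try_send((*m).to_string()) {
            Ok(()) => accepted += 1,
            // The queue refuses to grow: this is the backpressure signal.
            Err(TrySendError::Full(_)) => rejected += 1,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    (accepted, rejected)
}
```

With an unbounded buffer the second counter would always be zero, and memory use would be dictated by the fastest client.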

kim · 2025-06-30T08:53:20Z

Updated to:

- spawn the receive end, too

…t message

Pong frames sit in line with previously sent messages, and so may not be received in time if the server is backlogging. We also want to time out the connection in "cable plugged" scenarios, where the kernel doesn't consider the connection gone until `tcp_keepalive_time`.
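One way to read this, together with the later "Reset idle timeout only on recv" commit, is as an idle timer that only inbound traffic can reset. The sketch below is an assumption about the shape of such a timer, with a hypothetical `IdleTimer` type; the instants are passed in explicitly so the behaviour is easy to check.

```rust
use std::time::{Duration, Instant};

/// Hypothetical idle timer: only received traffic counts as a sign of life.
struct IdleTimer {
    timeout: Duration,
    last_recv: Instant,
}

impl IdleTimer {
    fn new(timeout: Duration, now: Instant) -> Self {
        IdleTimer { timeout, last_recv: now }
    }

    /// Any inbound message (not only a Pong) resets the timer.
    fn on_recv(&mut self, now: Instant) {
        self.last_recv = now;
    }

    /// Sending deliberately does not reset the timer: writes can appear to
    /// succeed for a long time on a connection whose peer is already gone.
    fn on_send(&mut self, _now: Instant) {}

    fn is_expired(&self, now: Instant) -> bool {
        now.duration_since(self.last_recv) >= self.timeout
    }
}
```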

Also loosen `Unpin` requirements and use long names for type variables denoting futures.

kim requested review from Centril, gefjon and jsdt June 26, 2025 18:02

gefjon approved these changes Jun 26, 2025

kim added 2 commits June 27, 2025 10:07

Reclaim those bytes

038aeb0

Don't send more data after sending a close frame.

b67fbd3

jsdt reviewed Jun 27, 2025

kim added 2 commits June 29, 2025 11:21

Use rayon instead of tokio blocking thread for serialize

7e6df49

Rewrite to use more modular stream transformers for testability

4e75fc2

kim changed the title ~~client-api: Move websocket sender to its own tokio task~~ client-api: Rewrite websocket loop Jun 30, 2025

kim added 2 commits June 30, 2025 09:39

Reclaim can actually fail

1cf159d

Send loop needs to be spawned

787f050

kim added 2 commits June 30, 2025 10:05

Merge branch 'master' into kim/ws/unblock

a36d4f0

Spawn receiver

1e8f3c0

Apply timeout to draining the receiver until closed, not just the nex…

e88b76c

…t message

bfops added the release-any label (To be landed in any release window) Jun 30, 2025

kim added 8 commits July 1, 2025 11:56

Refactor

1396eac

Propagate task panics and abort tasks on exit / unwind.

55178e7

Terminate send loop on all send errors

3c982a2

Exit select loop when send task terminates -- connection is probably bad

76f3b6b

Add a batch of tests

abdcad1

fixup! Add a batch of tests

e5b0b96

Another batch of tests

5cf5d3b

kim added 6 commits July 2, 2025 11:22

Well, always return when the send task is gone.

a9aaf68

Final batch of tests

edda56d

Merge remote-tracking branch 'origin/master' into kim/ws/unblock

6a0d928

Hotswap future needs to be recreated after it completed.

2d1d335

Syntactically nicer

e5f42ca

Reset idle timeout only on recv

a855b1d
