
client/server: Don't block the main connection loop for transport IO #73

Closed · wants to merge 1 commit

Conversation

@kevpar (Member) commented Dec 7, 2020

Restructures both the client and server connection management so that
sending messages on the transport is done by a separate "sender"
goroutine. The receiving end was already split out like this.

Without this change, it is possible for a send to block if the other end
isn't reading fast enough, which then would block the main connection
loop and prevent incoming messages from being processed.

Signed-off-by: Kevin Parsons <kevpar@microsoft.com>

Fixes #72

Note: I feel there may be other things that could be cleaned up in the client/server connection code, but this PR focuses specifically on fixing this bug, since we are seeing it in production.
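
For illustration, here is a minimal sketch of the pattern described above. It is not ttrpc's actual code: the message type, runSender function, and write callback are invented names. The idea is that a dedicated sender goroutine drains an outgoing channel and performs the blocking transport writes, so the main connection loop only enqueues messages and stays free to process incoming ones.

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    // message is a stand-in for a framed ttrpc request or response.
    type message struct {
        streamID uint32
        payload  []byte
    }

    // runSender drains the outgoing channel and performs the (potentially slow)
    // transport write. Because it runs in its own goroutine, a slow peer backs
    // up this channel instead of stalling the main connection loop.
    func runSender(ctx context.Context, outgoing <-chan message, write func(message) error) {
        for {
            select {
            case <-ctx.Done():
                return
            case m := <-outgoing:
                if err := write(m); err != nil {
                    return
                }
            }
        }
    }

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        defer cancel()

        // A small buffer lets the enqueueing side return immediately in the
        // common case; only a persistently slow peer makes it wait.
        outgoing := make(chan message, 8)
        go runSender(ctx, outgoing, func(m message) error {
            time.Sleep(10 * time.Millisecond) // simulate a slow transport write
            fmt.Printf("sent stream %d (%d bytes)\n", m.streamID, len(m.payload))
            return nil
        })

        // The "main loop" keeps enqueueing (and could keep processing incoming
        // messages) while the sender goroutine deals with the slow writes.
        for i := uint32(1); i <= 3; i++ {
            outgoing <- message{streamID: i, payload: []byte("hello")}
        }
        time.Sleep(50 * time.Millisecond) // give the sender time to drain
    }

Running this, the enqueue loop finishes almost immediately while the three writes complete in the sender goroutine.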

@kevpar (Member, Author) commented Dec 7, 2020

kevpar added a commit to kevpar/cri that referenced this pull request Dec 7, 2020
This pulls in a new version of github.com/containerd/ttrpc from a fork
to fix the deadlock issue in containerd/ttrpc#72. Will revert back to
the upstream ttrpc vendor once the fix is merged (containerd/ttrpc#73).

Signed-off-by: Kevin Parsons <kevpar@microsoft.com>
kevpar added a commit to kevpar/cri that referenced this pull request Dec 7, 2020
This pulls in a new version of github.com/containerd/ttrpc from a fork
to fix the deadlock issue in containerd/ttrpc#72. Will revert back to
the upstream ttrpc vendor once the fix is merged (containerd/ttrpc#73).

This fix also included some vendoring cleanup from running "vndr".

Signed-off-by: Kevin Parsons <kevpar@microsoft.com>
@anmaxvl left a comment

LGTM

    // the main loop will return and close done, which will cause us to exit as well.
    case <-done:
        return
    case response := <-responses:

May be slightly clearer to defer close(responses) and just have this be for response := range responses.

@jstarks commented Dec 8, 2020

Ah, maybe that's not practical since responses might still be referenced in the call goroutine.

Member:

Yes, generally you don't want to close from the read side.
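
To illustrate the general Go rule being invoked here (a standalone sketch, not code from this PR): the sending side owns the channel and closes it when it is finished, while the receiving side simply ranges until the close, because a send on an already-closed channel panics.

    package main

    import "fmt"

    func main() {
        responses := make(chan int)

        // Write side: the producer closes the channel when it has nothing more
        // to send. Receivers must never close it, because a later send on a
        // closed channel would panic.
        go func() {
            defer close(responses)
            for i := 0; i < 3; i++ {
                responses <- i
            }
        }()

        // Read side: range exits cleanly once the producer closes the channel.
        for r := range responses {
            fmt.Println("got", r)
        }
    }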

client.go (outdated):

    }

    go func(streamID uint32, call *callRequest) {
        requests <- streamCall{

Why not just call c.send here directly, rather than pop over to another goroutine?

Member Author:

That could result in multiple of these goroutines calling c.send concurrently, couldn't it?

If you do keep this model, do you need to select here on ctx.Done() so that this goroutine doesn't hang forever?

(Alternatively, maybe the other goroutine shouldn't select on ctx.Done() and should use some other scheme to determine when it's done.)

Member Author:

Good point, we don't have any way for these to be cleaned up if the connection closes.

Member Author:

I think we need to keep a single sender goroutine that receives messages via channel and calls c.send. That will ensure we don't interleave the bits from multiple messages on the wire.

ctx.Done() seems to be the client's equivalent of the done channel on the server side, so I think that's probably most appropriate to select on to see when we should terminate.
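
A rough sketch of that idea, using invented names (enqueue, callRequest, and streamCall here are placeholders, not the client's real API): the per-call goroutine selects on the connection's context while handing the call to the single sender goroutine, so it cannot block forever once the connection shuts down.

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    // Hypothetical stand-ins for the client's call plumbing.
    type callRequest struct{ method string }

    type streamCall struct {
        streamID uint32
        call     *callRequest
    }

    // enqueue hands a call to the single sender goroutine (which serializes
    // writes so frames from different calls never interleave), but gives up if
    // the connection's context is cancelled, so this goroutine cannot leak
    // after the connection closes.
    func enqueue(ctx context.Context, requests chan<- streamCall, streamID uint32, call *callRequest) {
        select {
        case requests <- streamCall{streamID: streamID, call: call}:
        case <-ctx.Done():
            fmt.Println("connection closing; dropping call instead of blocking forever")
        }
    }

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        requests := make(chan streamCall) // drained by the single sender goroutine

        go enqueue(ctx, requests, 1, &callRequest{method: "/example/Method"})

        // Simulate the connection shutting down before the sender ever reads.
        cancel()
        time.Sleep(10 * time.Millisecond)
    }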

Answering your question: yes, that's true. I thought that was already possible, but I was wrong. So that's out, I suppose.

A possible problem with this approach is that you've eliminated the backpressure on calls: if the socket is busy, we still keep processing messages from calls, allocating more goroutines without bound. Before, we would stop pulling messages off calls, which would let someone select on sending to calls (I doubt this happens, though; I didn't look yet). Also, storing messages on calls is probably more memory- and CPU-efficient than storing them in blocked goroutines.

Not sure if that's a practical consideration.
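
A standalone sketch of the backpressure point (not ttrpc code): when the consuming loop writes synchronously before pulling the next item off calls, a slow socket shows up to producers as blocked sends, which bounds the amount of queued work; spawning a goroutine per message removes that bound.

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        calls := make(chan string) // unbuffered: senders block until the loop is ready

        // Consumer: writes synchronously, so while the "socket" is slow it does
        // not pull the next call. Producers feel that delay as backpressure.
        go func() {
            for c := range calls {
                time.Sleep(20 * time.Millisecond) // simulate a slow socket write
                fmt.Println("wrote", c)
            }
        }()

        for i := 0; i < 3; i++ {
            start := time.Now()
            calls <- fmt.Sprintf("call-%d", i)
            fmt.Printf("enqueued call-%d after waiting %v\n", i, time.Since(start).Round(time.Millisecond))
        }
        time.Sleep(100 * time.Millisecond) // let the last write finish
    }

Running it, the second and third enqueues each wait roughly one write's worth of time, which is the backpressure the per-message goroutines would remove.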

Ah, indeed we do select on sending to calls. So I think this is a problem worth solving.

I'd suggest trying to process calls directly in the new goroutine. You'll need to come up with a new scheme for synchronizing waiters; although you could play more games with channels, perhaps it's reasonable to just use a mutex in this case.

Member Author:

I changed the send to a select with a <-c.ctx.Done() case, so we at least won't leak goroutines. I'll look at refactoring the rest of the flow to add back-pressure soon.

Member:

I would tend to agree with @jstarks' suggestion here, re: processing calls in a new goroutine and using a mutex to sync waiters.
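
One plausible shape for "use a mutex to sync waiters", with all names invented for illustration rather than taken from the eventual implementation: a map from stream ID to a per-call response channel, guarded by a sync.Mutex instead of being owned by the main connection loop.

    package main

    import (
        "fmt"
        "sync"
    )

    // waiters tracks in-flight calls by stream ID. A mutex guards the map, so
    // responses can be routed to callers without going through the main
    // connection loop.
    type waiters struct {
        mu      sync.Mutex
        pending map[uint32]chan string // per-call response channels
    }

    // add registers a new in-flight call and returns the channel its caller
    // will wait on.
    func (w *waiters) add(id uint32) chan string {
        ch := make(chan string, 1) // buffered so delivery never blocks
        w.mu.Lock()
        w.pending[id] = ch
        w.mu.Unlock()
        return ch
    }

    // deliver routes a response to whoever is waiting on that stream ID.
    func (w *waiters) deliver(id uint32, resp string) {
        w.mu.Lock()
        ch, ok := w.pending[id]
        delete(w.pending, id)
        w.mu.Unlock()
        if ok {
            ch <- resp
        }
    }

    func main() {
        w := &waiters{pending: make(map[uint32]chan string)}
        done := w.add(1)
        go w.deliver(1, "response for stream 1")
        fmt.Println(<-done)
    }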

@fuweid self-requested a review December 8, 2020 02:18
@thaJeztah (Member) commented:

@fuweid @crosbymichael PTAL

@thaJeztah (Member) commented:

@cpuguy83 @jstarks @katiewasnothere PTAL (I see the PR was updated since your last review comments)

@kevpar (Member, Author) commented Oct 14, 2021

I (finally) revisited this PR and took a different approach. The new PR is #94. Going to close this PR, but PTAL at the new one. :)

@kevpar closed this Oct 14, 2021
Successfully merging this pull request may close this issue: Deadlock with multiple simultaneous requests from the same client (#72)

6 participants