Race condition when disposing of channels while cancelling calls. #2119

Closed
amanda-tarafa opened this issue May 10, 2023 · 2 comments · Fixed by #2120
Labels
bug Something isn't working

Comments

@amanda-tarafa

I think there might be a race condition when a channel is disposed at the same time that one of its active calls is cancelled separately. I believe this only happens for server or bidi streaming calls. The original report (googleapis/google-cloud-dotnet#10318) comes from a Google.Cloud.PubSub.V1 user who has provided a repro and logs. A couple of us on the Google side haven't been able to reproduce it, but I think my description of the race condition below is accurate, and it's also supported by the user-provided logs (ignoring the Google.Cloud.PubSub.V1 part of the logs).
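For context, my understanding of the disposing side (a simplified sketch, not the actual source; only ActiveCalls and _lock appear in the snippets below, the rest of the shape is assumed): GrpcChannel.Dispose takes the channel lock, copies the active calls, and disposes each call while still holding the lock, and disposing a call waits for any in-flight cancellation callback to finish.

    // Simplified sketch of the current dispose path (assumed shape).
    public void Dispose()
    {
        lock (_lock)
        {
            // Local copy of the active calls (ToArray needs System.Linq).
            var activeCalls = ActiveCalls.ToArray();

            // Still holding _lock: disposing a call waits for its in-flight
            // cancellation callback, which may itself need _lock (see below).
            foreach (var call in activeCalls)
            {
                call.Dispose();
            }
        }
    }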

In the meantime, the thread that first cancelled the GrpcCall has been able to:

  • Execute GrpcCall.CancelCallFromCancellationToken, which is the callback for the call's cancellation token.
  • CancelCallFromCancellationToken calls CancelCall, which (same as Cleanup) sets the GrpcCall TaskCompletionSource to cancelled. The difference here is that this call doesn't block, as it's being executed on the same thread as (and as part of) the cancellation callback, so GrpcCall.Task here is actually transitioned to cancelled.
  • But note that the TaskCompletionSource is deliberately created so that its continuations execute immediately, inline (a small standalone demo of this behaviour follows this list):
    // Run the callTcs continuation immediately to keep the same context. Required for Activity.
    _callTcs = new TaskCompletionSource<Status>();
  • Which means that as soon as GrpcCall.Task is transitioned to cancelled, and in the same context (thread), an OperationCanceledException is thrown and handled in GrpcCall.RunAsync, because we are awaiting GrpcCall.Task to complete (in the case of server and bidi streaming):
    status = await CallTask.ConfigureAwait(false);
  • Handling of that exception ultimately calls Cleanup,

    which in turn calls GrpcChannel.FinishActiveCall:
    Channel.FinishActiveCall(this);

    which requires the channel lock that is being held by the channel-disposing thread, which is in turn waiting for this thread to finish executing the cancellation callback (a self-contained sketch of this deadlock follows the list):
    internal void FinishActiveCall(IDisposable grpcCall)
    {
        lock (_lock)
        {
            ActiveCalls.Remove(grpcCall);
        }
    }
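As a standalone demonstration of that inline-continuation behaviour (my own sketch, not grpc-dotnet code): a TaskCompletionSource created without TaskCreationOptions.RunContinuationsAsynchronously runs an awaiting async method's continuation synchronously, on the thread that completes it.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class InlineContinuationDemo
    {
        static void Main()
        {
            // Created without RunContinuationsAsynchronously, mirroring _callTcs.
            var tcs = new TaskCompletionSource<int>();

            var awaiter = Task.Run(async () =>
            {
                try
                {
                    await tcs.Task.ConfigureAwait(false);
                }
                catch (OperationCanceledException)
                {
                    // Runs inline on the thread that calls SetCanceled below.
                    Console.WriteLine($"continuation on thread {Environment.CurrentManagedThreadId}");
                }
            });

            Thread.Sleep(100); // let the awaiter reach its await

            Console.WriteLine($"cancelling on thread {Environment.CurrentManagedThreadId}");
            tcs.SetCanceled(); // the catch block above executes on this thread

            awaiter.Wait();
        }
    }

Both lines print the same thread id, which is exactly why the OperationCanceledException in GrpcCall.RunAsync surfaces on the cancelling thread.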
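And a minimal self-contained sketch of the resulting deadlock shape (again my own; the lock, the registration and the sleeps are illustrative stand-ins, not the library's code). Thread A holds the "channel" lock and waits for the in-flight callback to finish (CancellationTokenRegistration.Dispose blocks while its callback is running on another thread), while the callback on thread B waits for that same lock:

    using System;
    using System.Threading;

    class DeadlockSketch
    {
        static readonly object _channelLock = new object(); // stands in for GrpcChannel._lock

        static void Main()
        {
            var cts = new CancellationTokenSource();

            // Stands in for CancelCallFromCancellationToken: via the inline
            // TaskCompletionSource continuation it ends up in FinishActiveCall,
            // which needs the channel lock.
            var registration = cts.Token.Register(() =>
            {
                Thread.Sleep(200);      // widen the race window
                lock (_channelLock) { } // blocks forever: thread A holds the lock
            });

            // Thread B: the separate cancellation; registered callbacks run here.
            new Thread(() => cts.Cancel()).Start();

            Thread.Sleep(50); // let the callback start running on thread B

            // Thread A: stands in for GrpcChannel.Dispose, holding the channel
            // lock while cleaning up the call. Disposing the registration waits
            // for the in-flight callback on thread B -> deadlock.
            lock (_channelLock)
            {
                registration.Dispose();
            }

            Console.WriteLine("not reached while the deadlock reproduces");
        }
    }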

Possible solution:

In GrpcChannel.Dispose, release the lock immediately after getting the local copy of the active calls collection. This would avoid the race condition and, as far as I can see, would introduce no side effects (a rough sketch of the adjusted method follows the notes below).

  • Note that new calls can currently be added to the channel's active call collection right after the old calls have been disposed. By releasing the lock earlier we are just allowing that to possibly happen while we are still disposing of the old calls. But this wouldn't affect old-call disposal, as we have a local copy of the active calls collection.
  • The call that would otherwise be involved in the race condition will be removed from the channel's active call collection twice, but that collection is a HashSet anyway, so removing an element that is not present does nothing.
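For illustration, the adjusted method could look roughly like this (a sketch of the suggestion above, not an actual change; only ActiveCalls and _lock come from the snippets in this issue, and ToArray needs System.Linq):

    public void Dispose()
    {
        IDisposable[] activeCalls;
        lock (_lock)
        {
            // Take the local copy under the lock...
            activeCalls = ActiveCalls.ToArray();
        }

        // ...and release the lock before disposing, so a concurrent
        // cancellation callback can enter FinishActiveCall without
        // deadlocking against this thread.
        foreach (var call in activeCalls)
        {
            call.Dispose();
        }
    }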
@amanda-tarafa
Author

@JamesNK we were about to release a patched beta of Google.Cloud.PubSub.V1. But if the release of this fix is more or less imminent, we'd probably want to wait. Do you have a rough idea of when this will be released? Thanks!

@JamesNK
Member

JamesNK commented May 15, 2023

I'm going to start the 2.54.0 release process, but it will take a couple of weeks to have a preview release and then a final release.
