Reconnection blocked in producer by request timed out #697
Comments
Hello @bschofield, can you provide more log info for this? Thanks. #689 does not seem to cause the problem described in this issue, because the reconnect gets stuck on the producer side.
It is hard for me to know exactly which log lines are relevant as I have very many pods running, each with a large number of producers, but here is an example that looks interesting. This may not be the root cause of the issue.
I also see some other errors which might be unrelated problems, e.g.
I think there is a little miscommunication here, I was trying to suggest that you look at #691 and not #689. In PR #691 (https://github.com/apache/pulsar-client-go/pull/691/files), you changed My (untested) hypothesis is that one of the other
I was wondering if this could cause the freeze in Again, I haven't tested this idea, but thought it could be worth mentioning since I experienced a similar problem but only with PR #691 applied. If this turns out to be something unrelated, apologies!
@bschofield Thanks for your work on this. The #691 change ignores some problems. When the producer is closed, it may cause a goroutine to leak, so I submitted a new pull request that listens for the producer close action by adding the In addition, the problem you encountered here seems to be somewhat different from the problem I encountered. Indeed, for some reason, the broker proactively notified and disconnected from the producer, and then in the stage of trying to reconnect, the reconnection failed due to
This looks like an explicit error returned by the broker: when the broker tried to read data from the bookie, the read operation timed out.
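To illustrate the goroutine-leak concern mentioned above, here is a minimal sketch of a reconnect loop that also listens for producer close. It is not the code from the pull request: the `producer` type, `closeCh`, `reconnect`, and `errClosed` are hypothetical names chosen for this example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// producer is a hypothetical stand-in, not the client's real type.
type producer struct {
	closeCh   chan struct{}
	closeOnce sync.Once
}

func newProducer() *producer {
	return &producer{closeCh: make(chan struct{})}
}

// Close signals any in-flight reconnect loop to stop. Safe to call twice.
func (p *producer) Close() {
	p.closeOnce.Do(func() { close(p.closeCh) })
}

var errClosed = errors.New("producer closed during reconnect")

// reconnect retries dial until it succeeds or the producer is closed.
// Without the closeCh case, a producer closed mid-retry would leave this
// loop (and the goroutine running it) alive for as long as dial keeps failing.
func (p *producer) reconnect(dial func() error, backoff time.Duration) error {
	for {
		if err := dial(); err == nil {
			return nil
		}
		select {
		case <-p.closeCh:
			return errClosed
		case <-time.After(backoff):
			// Back off, then try again.
		}
	}
}

func main() {
	p := newProducer()
	done := make(chan error, 1)
	go func() {
		// Assumed failure mode for the demo: every dial attempt fails.
		alwaysFail := func() error { return errors.New("request timed out") }
		done <- p.reconnect(alwaysFail, 10*time.Millisecond)
	}()
	time.Sleep(35 * time.Millisecond)
	p.Close()
	fmt.Println(<-done)
}
```

The point of the sketch is only that the retry loop has a second exit path, so closing the producer ends the goroutine even when every reconnect attempt fails.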
Thanks a lot for the reply, @wolfstudy. I agree that the logs I posted don't seem to show exactly the same issue, but I'm not sure that I got the root-cause logs -- as a broader outline of the problem, several of my producers shut down over a period of an hour or so, and the broker/bookie seemed to get into quite a confused state for the affected topics. I had to shut down the entire system and restart it to get things started again. I'll put some comments on PR #700.
Signed-off-by: xiaolongran <rxl@apache.org>

Fixes #697

### Motivation

As #697 describes, in the Go SDK, when the reconnection logic is triggered under certain conditions, the reconnection does not succeed because of a request timeout.

Comparing with the Java SDK implementation, we can see that each time the reconnection logic is triggered, the original connection is closed and a new connection is created.

![image](https://user-images.githubusercontent.com/20965307/148906906-1cfc5c07-1836-4185-94ec-e43f5565a4a8.png)

So in this PR, we introduce a new `reconnectFlag` field in the `connection` struct to mark the reconnection state. When the broker actively informs the client to close the connection to trigger the reconnection logic, the old connection object is deleted from the `connections` cache of the `connectionPool`, and a new connection is created to complete the reconnection.

### Modifications

- Add `reconnectFlag` in `connection` struct
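The idea in this commit message can be sketched roughly as follows. This is a simplified illustration, not the actual change: only `reconnectFlag`, `connections`, and the notion of a connection pool come from the description above, while the `conn` and `pool` types and their methods are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a stand-in for the client's connection type; only the pieces
// relevant to the reconnect flag are modelled.
type conn struct {
	id            int
	mu            sync.Mutex
	reconnectFlag bool // set when the broker asks the client to close and reconnect
}

func (c *conn) markReconnect() {
	c.mu.Lock()
	c.reconnectFlag = true
	c.mu.Unlock()
}

func (c *conn) needsReconnect() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.reconnectFlag
}

// pool caches one connection per broker address.
type pool struct {
	mu          sync.Mutex
	connections map[string]*conn
	nextID      int
}

func newPool() *pool {
	return &pool{connections: make(map[string]*conn)}
}

// get returns the cached connection for addr unless it has been flagged for
// reconnection, in which case the stale entry is evicted and replaced.
func (p *pool) get(addr string) *conn {
	p.mu.Lock()
	defer p.mu.Unlock()
	if c, ok := p.connections[addr]; ok {
		if !c.needsReconnect() {
			return c
		}
		delete(p.connections, addr) // drop the old connection object
	}
	p.nextID++
	c := &conn{id: p.nextID}
	p.connections[addr] = c
	return c
}

func main() {
	p := newPool()
	first := p.get("pulsar://broker:6650")
	first.markReconnect() // simulate the broker closing the producer
	second := p.get("pulsar://broker:6650")
	fmt.Println(first.id, second.id, first == second)
}
```

A real implementation would also need to close the evicted connection's socket and fail its pending requests; the sketch leaves that out to keep the cache behaviour visible.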
Expected behavior
When the reconnection logic is triggered, the reconnection can be successful.
Actual behavior
When the reconnection logic is triggered, Go SDK has been trying to reconnect, and it has been unable to reconnect successfully.
When the reconnection logic continues to be triggered, the log information of the Go SDK is as follows:
The `send` action of this topic has not been restored successfully, and a request timeout is reported: `request timed out`.
When this happens, the client keeps trying to reconnect until the Go SDK is restarted.
The error shows that `cmdProducer` was blocked when the producer reconnected. When `grabCnx()` is triggered, the block happens on:
Broker
During this time period, we checked the broker-side logs and found that the broker did not receive any `cmdProducer` request in `handleProduce` of `clientCnx`, so it can be judged that the Go SDK is blocked and `cmdProducer` is not sent to the broker.
Where the Go SDK is blocked occurs:
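The original snippet is not preserved here. Separately from it, the following generic sketch shows one way a request can time out on the client without ever reaching the broker. It is an assumption-laden illustration, not the client's code: `request`, `sendRequest`, `writeLoop`, and `errRequestTimedOut` are hypothetical names, and the stale connection is modelled simply as a channel that nothing reads from.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errRequestTimedOut = errors.New("request timed out")

// request is a hypothetical command queued for a connection's write loop.
type request struct {
	name string
	done chan struct{}
}

// sendRequest queues a request and waits for the write loop to handle it.
// If nothing is draining incoming (for example because the loop that owned
// the channel has exited), the request never leaves the process and the
// caller only sees a timeout.
func sendRequest(incoming chan<- *request, name string, timeout time.Duration) error {
	req := &request{name: name, done: make(chan struct{})}
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case incoming <- req:
	case <-timer.C:
		return errRequestTimedOut
	}
	select {
	case <-req.done:
		return nil
	case <-timer.C:
		return errRequestTimedOut
	}
}

// writeLoop stands in for a live connection: it acknowledges every request.
func writeLoop(incoming <-chan *request) {
	for req := range incoming {
		close(req.done)
	}
}

func main() {
	live := make(chan *request)
	go writeLoop(live)
	fmt.Println("live connection:", sendRequest(live, "cmdProducer", 50*time.Millisecond))

	dead := make(chan *request) // no goroutine reads from this channel
	fmt.Println("stale connection:", sendRequest(dead, "cmdProducer", 50*time.Millisecond))
}
```

In this model, retrying against the same stale channel fails the same way every time, which matches the reported symptom of reconnecting until restart. Whether this is the actual mechanism in the Go SDK is not established by this issue.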
Steps to reproduce
This issue does not reproduce reliably.
System configuration
Pulsar version: 2.7.2
Go SDK: master branch