kafka/client: Mitigate std::errc::timed_out #6885
Conversation
Signed-off-by: Ben Pope <ben@redpanda.com>
Something we were discussing on the Zoom call, so I want to ask here: do we see value in catching all unknown errors and forcing a reconnection? We have had a couple of PRs iterating on different flavors of errors, handling them one after another, and each error seems to be getting hit on customer clusters.
The log message was designed to stand out since its introduction two years ago: 99a1356#diff-4017a2e1dff1e18edcaa63b2447965a054f3a281996229eafa2e5aa966f0ffd6R100-R101 - I stopped short of introducing an assert, but at this point the mitigation is nearly always the same. We should take the opportunity to trawl the logs we now have access to and properly consider the mitigation strategy; I don't think there's a long tail here. My main concern is that switching the default error handling to the usual mechanism of reconnecting and refreshing metadata risks entering a tight loop that could end up impacting Redpanda proper. I think switching the default to "always reconnect" carries some risks. Those can be mitigated with a backoff strategy for the mitigation, but in general, attempting to handle errors that are unknown is a fool's errand.
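For illustration, here is a minimal sketch of the kind of backoff strategy discussed above, in plain C++ rather than Seastar futures; reconnect_with_backoff, try_connect, and the specific delays are hypothetical, not the client's actual API:

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

using namespace std::chrono_literals;

// Retry `try_connect` with exponential backoff, capped at `max_delay`, so a
// persistent unknown error cannot turn "always reconnect" into a tight loop.
void reconnect_with_backoff(const std::function<bool()>& try_connect) {
    auto delay = 100ms;          // initial backoff (illustrative value)
    const auto max_delay = 30s;  // cap on the wait between attempts
    while (!try_connect()) {
        std::this_thread::sleep_for(delay);
        delay = std::min<std::chrono::milliseconds>(delay * 2, max_delay);
    }
}

int main() {
    int attempts = 0;
    // Pretend the broker becomes reachable on the third attempt.
    reconnect_with_backoff([&] { return ++attempts >= 3; });
}
```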
Aren't most of the internal users of our Kafka client used in such a way that "always reconnect" is the right policy? I do agree that not retrying in a tight loop makes sense, and we can add a backoff strategy that doesn't have that property. @BenPope, do you have an idea of how much coverage we have for error codes? Since this is the Kafka client, we effectively know every possible return error. Can you ping Travis and see what franz-go does for an unknown error? Generally I would expect an unknown error to be immediately fatal in the context of a CLI, but the internal client is sort of unique because it is being used on behalf of services rather than a CLI user.
It's hard to say what the correct strategy is for an unknown error.
I think Kafka error codes are mostly handled. This wasn't a Kafka error code.
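As a rough illustration of the distinction being drawn here (hypothetical names, not the client's API): Kafka error codes arrive inside protocol responses, while a connection timeout surfaces as a transport-level std::system_error, so it has to be classified separately:

```cpp
#include <iostream>
#include <system_error>

// Hypothetical helper: decide whether a transport-level failure is one we
// know how to mitigate by reconnecting.
bool should_reconnect(const std::system_error& e) {
    // Comparing against std::errc matches by error condition, so this works
    // regardless of the platform's raw errno value for ETIMEDOUT.
    return e.code() == std::errc::timed_out;
}

int main() {
    try {
        throw std::system_error(
          std::make_error_code(std::errc::timed_out), "Broker 0 dispatch");
    } catch (const std::system_error& e) {
        if (should_reconnect(e)) {
            std::cout << "known transport error, reconnecting: " << e.what()
                      << '\n';
        } else {
            throw; // unknown errors stay fatal rather than being retried
        }
    }
}
```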
/backport v22.2.x |
/backport v22.1.x |
/backport v21.11.x |
Cover letter
If a connection times out, this will now result in the client reconnecting, rather than getting stuck in a broken state.
The error message that drove this fix:
Which leads to a response of:
This is related: #6687
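A minimal sketch of the flow this cover letter describes, using hypothetical broker and refresh_metadata stand-ins rather than the client's real types: on a timeout, drop the connection and refresh metadata so the next request reconnects instead of reusing the broken connection.

```cpp
#include <iostream>
#include <system_error>

// Hypothetical stand-ins for the client's internals.
struct broker {
    void dispatch() { // simulate the timeout that drove this fix
        throw std::system_error(
          std::make_error_code(std::errc::timed_out), "dispatch");
    }
    void disconnect() { std::cout << "disconnecting broker\n"; }
};
void refresh_metadata() { std::cout << "refreshing metadata\n"; }

void dispatch_with_mitigation(broker& b) {
    try {
        b.dispatch();
    } catch (const std::system_error& e) {
        if (e.code() != std::errc::timed_out) {
            throw; // only the known timeout is mitigated here
        }
        b.disconnect();     // tear down the broken connection
        refresh_metadata(); // the next request reconnects with fresh state
    }
}

int main() {
    broker b;
    dispatch_with_mitigation(b);
}
```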
Signed-off-by: Ben Pope <ben@redpanda.com>
Backport Required
UX changes
Release notes
Improvements
If a connection times out (std::errc::timed_out), the client now reconnects rather than getting stuck in a broken state.