
kafka/client: Mitigate std::errc::timed_out #6885

Merged
merged 1 commit into redpanda-data:dev on Oct 24, 2022

Conversation

@BenPope (Member) commented Oct 21, 2022

Cover letter

If a connection times out, this will now result in the client reconnecting, rather than getting stuck in a broken state.

The error message that drove this fix:

2022-10-21T22:42:46.777790351Z stderr F ERROR 2022-10-21 22:42:46,777 [shard 0] pandaproxy - reply.h:109 - exception_reply: std::__1::system_error (error system:110, sendmsg: Connection timed out)

Which leads to a response of:

{"error_code":500,"message":"HTTP 500 Internal Server Error"}

This is related: #6687
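For context on the mitigation, here is a minimal hypothetical C++ sketch (not the code from this PR; should_reconnect is an illustrative name) of classifying a caught std::system_error such as the ETIMEDOUT above so that it triggers a reconnect rather than leaving the connection stuck:

```cpp
#include <cerrno>
#include <iostream>
#include <system_error>

// Returns true when the error indicates a dead or broken connection that is
// best mitigated by dropping the broker connection and reconnecting.
bool should_reconnect(const std::system_error& e) {
    const auto& code = e.code();
    return code == std::errc::timed_out // e.g. sendmsg: Connection timed out
        || code == std::errc::broken_pipe
        || code == std::errc::connection_reset;
}

int main() {
    try {
        // Simulate the failure from the log above: system:110 (ETIMEDOUT).
        throw std::system_error(ETIMEDOUT, std::system_category(), "sendmsg");
    } catch (const std::system_error& e) {
        std::cout << e.what() << " -> "
                  << (should_reconnect(e) ? "reconnect" : "propagate") << '\n';
    }
}
```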

Signed-off-by: Ben Pope <ben@redpanda.com>

Backport Required

  • not a bug fix
  • issue does not exist in previous branches
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

  • none

Release notes

Improvements

  • Improve robustness of Schema Registry and HTTP Proxy under std::errc::timed_out.

@piyushredpanda (Contributor) commented:

Something we were discussing on the Zoom call, so I want to ask here: do we see value in catching all unknown errors and forcing a reconnection? We have had a couple of PRs iterating on different flavors of errors, handling them one after another, and each error seems to be getting hit on customer clusters.

@BenPope (Member, Author) commented Oct 22, 2022

> Something we were discussing on the Zoom call, so I want to ask here: do we see value in catching all unknown errors and forcing a reconnection? We have had a couple of PRs iterating on different flavors of errors, handling them one after another, and each error seems to be getting hit on customer clusters.

The log message was designed to stand out since its introduction 2 years ago: 99a1356#diff-4017a2e1dff1e18edcaa63b2447965a054f3a281996229eafa2e5aa966f0ffd6R100-R101 - I stopped short of introducing the assert, but at this point the mitigation is nearly always the same.

We should take the opportunity to trawl the logs we now have access to, and properly consider the mitigation strategy. I don't think there's a long tail here.

My main concern is that switching the default error handling to the usual mechanism of reconnect and refresh metadata risks entering a tight loop that could end up impacting Redpanda proper. I think switching the default to "always reconnect" carries some risks. Those can be mitigated with a backoff strategy, but in general, attempting to handle errors that are unknown is a fool's errand.
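To make the backoff point concrete, here is a hypothetical sketch (not code from this PR; reconnect_backoff is an illustrative name) of a capped exponential backoff with jitter that an "always reconnect" mitigation could use to avoid a tight reconnect loop:

```cpp
#include <algorithm>
#include <chrono>
#include <random>

// Hypothetical sketch: capped exponential backoff with jitter for a
// "reconnect on unknown error" mitigation, so repeated failures do not
// turn into a tight reconnect loop against the brokers.
class reconnect_backoff {
public:
    explicit reconnect_backoff(
      std::chrono::milliseconds base = std::chrono::milliseconds{100},
      std::chrono::milliseconds cap = std::chrono::seconds{30})
      : _base(base)
      , _cap(cap) {}

    // Delay before the next reconnect attempt: doubles on each consecutive
    // failure, is capped, and is jittered so clients do not retry in lockstep.
    std::chrono::milliseconds next() {
        auto exp = std::min(_cap, _base * (1L << std::min(_failures, 8)));
        ++_failures;
        std::uniform_int_distribution<std::chrono::milliseconds::rep> jitter(
          exp.count() / 2, exp.count());
        return std::chrono::milliseconds{jitter(_rng)};
    }

    // Call after a successful request so the next failure starts small again.
    void reset() { _failures = 0; }

private:
    std::chrono::milliseconds _base;
    std::chrono::milliseconds _cap;
    int _failures{0};
    std::mt19937 _rng{std::random_device{}()};
};
```

A client would call next() before each reconnect attempt and reset() after the first successful request on the new connection.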

@piyushredpanda added this to the v22.2.7 milestone on Oct 22, 2022
@dotnwat (Member) commented Oct 23, 2022

> My main concern is that switching the default error handling to the usual mechanism of reconnect and refresh metadata risks entering a tight loop that could end up impacting Redpanda proper. I think switching the default to "always reconnect" carries some risks. Those can be mitigated with a backoff strategy, but in general, attempting to handle errors that are unknown is a fool's errand.

Aren't most of the internal users of our kafka client used in such a way that 'always reconnect' is the right policy? I do agree that not retrying in a tight loop makes sense, and we can add a backoff strategy that doesn't have that property.

@BenPope do you have an idea of how much coverage we have for error codes? Since this is the kafka client, we effectively know every possible return error. Can you ping Travis and see what franz-go does for an unknown error? Generally I would expect an unknown error to be immediately fatal in the context of a CLI, but the internal client is sort of unique because it is being used on behalf of services rather than a CLI user, for instance.

@BenPope (Member, Author) commented Oct 24, 2022

> My main concern is that switching the default error handling to the usual mechanism of reconnect and refresh metadata risks entering a tight loop that could end up impacting Redpanda proper. I think switching the default to "always reconnect" carries some risks. Those can be mitigated with a backoff strategy, but in general, attempting to handle errors that are unknown is a fool's errand.

> Aren't most of the internal users of our kafka client used in such a way that 'always reconnect' is the right policy? I do agree that not retrying in a tight loop makes sense, and we can add a backoff strategy that doesn't have that property.

It's hard to say what the correct strategy is for an unknown error.

> @BenPope do you have an idea of how much coverage we have for error codes? Since this is the kafka client, we effectively know every possible return error. Can you ping Travis and see what franz-go does for an unknown error? Generally I would expect an unknown error to be immediately fatal in the context of a CLI, but the internal client is sort of unique because it is being used on behalf of services rather than a CLI user, for instance.

I think Kafka error codes are mostly handled; this wasn't a Kafka error code.
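To illustrate the distinction (a hypothetical example, not Redpanda code; produce_response and handle are made-up names): a Kafka error code arrives in-band as a field of a successfully parsed response and gets per-code handling, whereas the timeout here surfaced out-of-band as a std::system_error from the transport, before any response existed.

```cpp
#include <cstdint>
#include <iostream>
#include <system_error>

// In-band protocol error: the response parsed fine, but the broker
// reported an error code inside it.
struct produce_response {
    int16_t error_code; // Kafka wire-protocol error code; 0 means no error
};

void handle(const produce_response& r) {
    if (r.error_code != 0) {
        std::cout << "kafka error code " << r.error_code
                  << " -> per-code handling (e.g. refresh metadata and retry)\n";
    }
}

int main() {
    handle(produce_response{6}); // e.g. 6 = NOT_LEADER_FOR_PARTITION

    try {
        // Out-of-band transport failure: the send itself failed, so there is
        // no Kafka error code to dispatch on; this is what the PR mitigates.
        throw std::system_error(
          std::make_error_code(std::errc::timed_out), "sendmsg");
    } catch (const std::system_error& e) {
        std::cout << "transport error: " << e.what() << " -> reconnect\n";
    }
}
```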

@BenPope merged commit 2764e4c into redpanda-data:dev on Oct 24, 2022
@BenPope (Member, Author) commented Oct 24, 2022

/backport v22.2.x

@BenPope (Member, Author) commented Oct 24, 2022

/backport v22.1.x

@BenPope (Member, Author) commented Oct 24, 2022

/backport v21.11.x

Labels: area/redpanda, area/schema-registry (Schema Registry service within Redpanda), kind/enhance (New feature or request)