
kafka/client: Mitigate std::errc::timed_out #6885

Merged
merged 1 commit into redpanda-data:dev on Oct 24, 2022

Conversation

@BenPope (Member) commented Oct 21, 2022

Cover letter

If a connection times out, this will now result in the client reconnecting, rather than getting stuck in a broken state.

The error message that drove this fix:

2022-10-21T22:42:46.777790351Z stderr F ERROR 2022-10-21 22:42:46,777 [shard 0] pandaproxy - reply.h:109 - exception_reply: std::__1::system_error (error system:110, sendmsg: Connection timed out)

Which leads to a response of:

{"error_code":500,"message":"HTTP 500 Internal Server Error"}

This is related: #6687
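For context on the mitigation, here is a minimal hypothetical C++ sketch (not the code from this PR; should_reconnect is an illustrative name) of classifying a caught std::system_error such as the ETIMEDOUT above so that it triggers a reconnect rather than leaving the connection stuck:

```cpp
#include <cerrno>
#include <iostream>
#include <system_error>

// Returns true when the error indicates a dead or broken connection that is
// best mitigated by dropping the broker connection and reconnecting.
bool should_reconnect(const std::system_error& e) {
    const auto& code = e.code();
    return code == std::errc::timed_out // e.g. sendmsg: Connection timed out
        || code == std::errc::broken_pipe
        || code == std::errc::connection_reset;
}

int main() {
    try {
        // Simulate the failure from the log above: system:110 (ETIMEDOUT).
        throw std::system_error(ETIMEDOUT, std::system_category(), "sendmsg");
    } catch (const std::system_error& e) {
        std::cout << e.what() << " -> "
                  << (should_reconnect(e) ? "reconnect" : "propagate") << '\n';
    }
}
```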

Signed-off-by: Ben Pope <ben@redpanda.com>

Backport Required

  • not a bug fix
  • issue does not exist in previous branches
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

  • none

Release notes

Improvements

  • Improve robustness of Schema Registry and HTTP Proxy under std::errc::timed_out.

@piyushredpanda (Contributor) commented:

Something we were discussing on the Zoom call, so I want to ask here: do we see value in catching all unknown errors and forcing a reconnection? We have had a couple of PRs iterating on different flavors of errors, handling them one after another, and each error seems to be getting hit on customer clusters.

@BenPope (Member, Author) commented Oct 22, 2022

> Something we were discussing on the Zoom call, so I want to ask here: do we see value in catching all unknown errors and forcing a reconnection? We have had a couple of PRs iterating on different flavors of errors, handling them one after another, and each error seems to be getting hit on customer clusters.

The log message was designed to stand out since its introduction 2 years ago: 99a1356#diff-4017a2e1dff1e18edcaa63b2447965a054f3a281996229eafa2e5aa966f0ffd6R100-R101 - I stopped short of introducing the assert, but at this point the mitigation is nearly always the same.

We should take the opportunity to trawl the logs we now have access to, and properly consider the mitigation strategy. I don't think there's a long tail here.

My main concern is that switching the default error handling to the usual mechanism of reconnect and refresh metadata risks entering a tight loop that could end up impacting Redpanda proper. I think switching the default to "always reconnect" carries some risks. Those can be mitigated with a backoff strategy, but in general, attempting to handle errors that are unknown is a fool's errand.
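To make the backoff point concrete, here is a hypothetical sketch (not code from this PR; reconnect_backoff is an illustrative name) of a capped exponential backoff with jitter that an "always reconnect" mitigation could use to avoid a tight reconnect loop:

```cpp
#include <algorithm>
#include <chrono>
#include <random>

// Hypothetical sketch: capped exponential backoff with jitter for a
// "reconnect on unknown error" mitigation, so repeated failures do not
// turn into a tight reconnect loop against the brokers.
class reconnect_backoff {
public:
    explicit reconnect_backoff(
      std::chrono::milliseconds base = std::chrono::milliseconds{100},
      std::chrono::milliseconds cap = std::chrono::seconds{30})
      : _base(base)
      , _cap(cap) {}

    // Delay before the next reconnect attempt: doubles on each consecutive
    // failure, is capped, and is jittered so clients do not retry in lockstep.
    std::chrono::milliseconds next() {
        auto exp = std::min(_cap, _base * (1L << std::min(_failures, 8)));
        ++_failures;
        std::uniform_int_distribution<std::chrono::milliseconds::rep> jitter(
          exp.count() / 2, exp.count());
        return std::chrono::milliseconds{jitter(_rng)};
    }

    // Call after a successful request so the next failure starts small again.
    void reset() { _failures = 0; }

private:
    std::chrono::milliseconds _base;
    std::chrono::milliseconds _cap;
    int _failures{0};
    std::mt19937 _rng{std::random_device{}()};
};
```

A client would call next() before each reconnect attempt and reset() after the first successful request on the new connection.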

@piyushredpanda added this to the v22.2.7 milestone on Oct 22, 2022
@dotnwat (Member) commented Oct 23, 2022

> My main concern is that switching the default error handling to the usual mechanism of reconnect and refresh metadata risks entering a tight loop that could end up impacting Redpanda proper. I think switching the default to "always reconnect" carries some risks. Those can be mitigated with a backoff strategy, but in general, attempting to handle errors that are unknown is a fool's errand.

Aren't most of the internal users of our kafka client used in such a way that 'always reconnect' is the right policy? I do agree that not retrying in a tight loop makes sense, and we can add a backoff strategy that doesn't have that property.

@BenPope do you have an idea of how much coverage we have for error codes? Since this is the kafka client, we effectively know every possible return error. Can you ping Travis and see what franz-go does for an unknown error? Generally I would expect an unknown error to be immediately fatal in the context of a CLI, but the internal client is sort of unique because it is being used on behalf of services rather than a CLI user, for instance.

@BenPope (Member, Author) commented Oct 24, 2022

> My main concern is that switching the default error handling to the usual mechanism of reconnect and refresh metadata risks entering a tight loop that could end up impacting Redpanda proper. I think switching the default to "always reconnect" carries some risks. Those can be mitigated with a backoff strategy, but in general, attempting to handle errors that are unknown is a fool's errand.

> Aren't most of the internal users of our kafka client used in such a way that 'always reconnect' is the right policy? I do agree that not retrying in a tight loop makes sense, and we can add a backoff strategy that doesn't have that property.

It's hard to say what the correct strategy is for an unknown error.

> @BenPope do you have an idea of how much coverage we have for error codes? Since this is the kafka client, we effectively know every possible return error. Can you ping Travis and see what franz-go does for an unknown error? Generally I would expect an unknown error to be immediately fatal in the context of a CLI, but the internal client is sort of unique because it is being used on behalf of services rather than a CLI user, for instance.

I think Kafka error codes are mostly handled; this wasn't a Kafka error code.
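To illustrate the distinction (a hypothetical example, not Redpanda code; produce_response and handle are made-up names): a Kafka error code arrives in-band as a field of a successfully parsed response and gets per-code handling, whereas the timeout here surfaced out-of-band as a std::system_error from the transport, before any response existed.

```cpp
#include <cstdint>
#include <iostream>
#include <system_error>

// In-band protocol error: the response parsed fine, but the broker
// reported an error code inside it.
struct produce_response {
    int16_t error_code; // Kafka wire-protocol error code; 0 means no error
};

void handle(const produce_response& r) {
    if (r.error_code != 0) {
        std::cout << "kafka error code " << r.error_code
                  << " -> per-code handling (e.g. refresh metadata and retry)\n";
    }
}

int main() {
    handle(produce_response{6}); // e.g. 6 = NOT_LEADER_FOR_PARTITION

    try {
        // Out-of-band transport failure: the send itself failed, so there is
        // no Kafka error code to dispatch on; this is what the PR mitigates.
        throw std::system_error(
          std::make_error_code(std::errc::timed_out), "sendmsg");
    } catch (const std::system_error& e) {
        std::cout << "transport error: " << e.what() << " -> reconnect\n";
    }
}
```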

@BenPope merged commit 2764e4c into redpanda-data:dev on Oct 24, 2022
@BenPope (Member, Author) commented Oct 24, 2022

/backport v22.2.x

@BenPope (Member, Author) commented Oct 24, 2022

/backport v22.1.x

@BenPope (Member, Author) commented Oct 24, 2022

/backport v21.11.x

Labels: area/redpanda, area/schema-registry (Schema Registry service within Redpanda), kind/enhance (New feature or request)