
The default backoff policy does not have mandatory stop, which could cause unexpected send timeout #1270

Open
BewareMyPower opened this issue Aug 14, 2024 · 5 comments


Expected behavior

For a producer, when the next retry delay would exceed the time left before a pending message fails with a timeout error, the delay should be reduced to less than that remaining time.

See ProducerImpl's constructor in the Java client:

        this.connectionHandler = new ConnectionHandler(this,
            new BackoffBuilder()
                .setInitialTime(client.getConfiguration().getInitialBackoffIntervalNanos(), TimeUnit.NANOSECONDS)
                .setMax(client.getConfiguration().getMaxBackoffIntervalNanos(), TimeUnit.NANOSECONDS)
                .setMandatoryStop(Math.max(100, conf.getSendTimeoutMs() - 100), TimeUnit.MILLISECONDS)
                .create(),

There is a mandatoryStop field tied to the send timeout (sendTimeoutMs - 100).

Actual behavior

The default backoff policy simply keeps increasing the next delay until it reaches the maximum backoff interval (60s), regardless of the send timeout.
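
Conceptually, it behaves roughly like this simplified sketch (jitter omitted, hypothetical names, not the actual source):

package sketch

import "time"

// sketchDefaultNext is a simplified model of the observed default behavior:
// the delay roughly doubles on every call and is only capped at the 60s
// maximum, with no relation to the producer's send timeout.
func sketchDefaultNext(prev time.Duration) time.Duration {
	if prev == 0 {
		return 100 * time.Millisecond
	}
	next := prev * 2
	if next > time.Minute {
		next = time.Minute
	}
	return next
}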

Steps to reproduce

func TestBackoff_NextMinValue(t *testing.T) {
	backoff := &DefaultBackoff{}
	// Print the first 10 delays produced by the default policy.
	for i := 0; i < 10; i++ {
		delay := backoff.Next()
		fmt.Println(delay)
	}
}

Modify the test in backoff_test.go as shown above and run go test -run TestBackoff_NextMinValue in the same directory.

111.555037ms
213.388077ms
466.046518ms
940.55396ms
1.614542567s
3.383705308s
6.806575903s
15.159677039s
25.637999048s
1m0.834600978s

System configuration

Pulsar version: x.y


nodece commented Aug 14, 2024

The Go and Java designs are different. In the Go client, we have a goroutine to check the timeout.

Maybe your application needs to wait for 60 seconds, is that right? I'm improving this in #1256.

What problem have you encountered?

@BewareMyPower

There was a case where the server side took 25 seconds to recover while the client side failed with a timeout.

we have a goroutine to check the timeout.

I didn't mean the Go client does not check the timeout. Let's take a look at the following case (reusing the delays from the run above):

  • 0.000: reconnect after 0.111s
  • 0.111: reconnect after 0.213s
  • 0.324: reconnect after 0.466s
  • 0.790: reconnect after 0.940s
  • 1.730: reconnect after 1.614s
  • 3.344: reconnect after 3.383s
  • 6.727: reconnect after 6.806s
  • 13.533: reconnect after 15.159s
  • 28.692: reconnect after 25.637s
  • 29.500: assume the server recovered at this moment
  • 30.000: the message in the pending queue fails with a timeout (send timeout: 30s)
  • 54.329: the reconnection succeeds, but the message had already failed more than 20 seconds earlier

Let's see how the Java client solves the issue.

Recall the mandatory stop configured above:

.setMandatoryStop(Math.max(100, conf.getSendTimeoutMs() - 100), TimeUnit.MILLISECONDS)

With a 30s send timeout, the equivalent Backoff produces the following delays:

        Backoff backoff = new Backoff(100, TimeUnit.MILLISECONDS, 60, TimeUnit.SECONDS, 30000 - 100, TimeUnit.MILLISECONDS);
        for (int i = 0; i < 10; i++) {
            System.out.println(backoff.next());
        }
100
195
398
799
1525
3136
6272
12157
24279
29355

It's a little different from what I'd thought, but generally, having a delay greater than the send timeout does not make sense.
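
A Go equivalent could be sketched roughly like this (hypothetical names, not a patch against the actual client): the backoff tracks how much delay it has already handed out and clamps the next delay so the total never exceeds the send timeout.

package sketch

import "time"

// cappedBackoff is a hypothetical sketch that mirrors the Java client's
// mandatoryStop: the accumulated retry delay never passes the producer's
// send timeout.
type cappedBackoff struct {
	next          time.Duration
	elapsed       time.Duration
	maxBackoff    time.Duration // e.g. 60 * time.Second
	mandatoryStop time.Duration // e.g. sendTimeout - 100*time.Millisecond
}

func (b *cappedBackoff) Next() time.Duration {
	if b.next == 0 {
		b.next = 100 * time.Millisecond
	} else {
		b.next *= 2
	}
	if b.maxBackoff > 0 && b.next > b.maxBackoff {
		b.next = b.maxBackoff
	}
	// Clamp the delay so that elapsed + next never exceeds the mandatory stop.
	if remaining := b.mandatoryStop - b.elapsed; remaining > 0 && b.next > remaining {
		b.next = remaining
	}
	b.elapsed += b.next
	return b.next
}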


nodece commented Aug 14, 2024

It looks like you want to reduce the maximum retry time and retry interval.

The problem I see with this change is that it allows misconfigured clients that retry too quickly to overwhelm brokers with a flood of requests.

Please see #454

We can consider the design from https://aws.amazon.com/cn/blogs/architecture/exponential-backoff-and-jitter/.
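
For example, the "full jitter" variant from that post looks roughly like this (just a sketch of the idea, not a proposal against the client's code):

package sketch

import (
	"math/rand"
	"time"
)

// fullJitter follows the "full jitter" strategy from the AWS post: pick a
// random delay between 0 and the current exponential ceiling.
func fullJitter(attempt int, base, maxDelay time.Duration) time.Duration {
	ceiling := base << attempt // base * 2^attempt
	if ceiling <= 0 || ceiling > maxDelay {
		ceiling = maxDelay // guard against overflow and cap at the maximum
	}
	return time.Duration(rand.Int63n(int64(ceiling) + 1))
}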

@BewareMyPower

Yes, I want to limit the max retry time according to the send timeout config.

The exponential backoff mechanism is good and we can support it in the future, but for now we'd better adopt the same behavior as the Java client.


nodece commented Aug 15, 2024

OK, I see. We need to reduce the retry interval, not limit the max retry time; that is the root cause.

https://github.com/cenkalti/backoff/blob/v4/exponential.go#L39-L50 provides an example, could you check it out?
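
A minimal usage sketch with that module (v4; the values are only illustrative, not a concrete proposal):

package main

import (
	"fmt"
	"time"

	"github.com/cenkalti/backoff/v4"
)

func main() {
	b := backoff.NewExponentialBackOff()
	b.InitialInterval = 100 * time.Millisecond
	b.MaxInterval = 60 * time.Second
	// Stop retrying once the send-timeout budget (30s here) has elapsed.
	b.MaxElapsedTime = 30 * time.Second
	for {
		d := b.NextBackOff()
		if d == backoff.Stop {
			break
		}
		fmt.Println(d)
		time.Sleep(d) // wait as a real reconnect loop would
	}
}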

BTW, setMandatoryStop seems equivalent to the max retry time, is that right?
