
The default backoff policy does not have mandatory stop, which could cause unexpected send timeout #1270

Open
BewareMyPower opened this issue Aug 14, 2024 · 5 comments


Expected behavior

For a producer, when the next retry delay would exceed the time left before a pending message fails with a timeout error, the delay should be reduced to less than that remaining time.

See ProducerImpl's constructor in the Java client:

        this.connectionHandler = new ConnectionHandler(this,
            new BackoffBuilder()
                .setInitialTime(client.getConfiguration().getInitialBackoffIntervalNanos(), TimeUnit.NANOSECONDS)
                .setMax(client.getConfiguration().getMaxBackoffIntervalNanos(), TimeUnit.NANOSECONDS)
                .setMandatoryStop(Math.max(100, conf.getSendTimeoutMs() - 100), TimeUnit.MILLISECONDS)
                .create(),

There is a mandatoryStop field tied to the send timeout (sendTimeoutMs - 100).

Actual behavior

The default backoff policy simply keeps increasing the next delay until it reaches the maximum backoff interval (60s), regardless of the send timeout.
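
Conceptually, it behaves roughly like this simplified sketch (jitter omitted, hypothetical names, not the actual source):

package sketch

import "time"

// sketchDefaultNext is a simplified model of the observed default behavior:
// the delay roughly doubles on every call and is only capped at the 60s
// maximum, with no relation to the producer's send timeout.
func sketchDefaultNext(prev time.Duration) time.Duration {
	if prev == 0 {
		return 100 * time.Millisecond
	}
	next := prev * 2
	if next > time.Minute {
		next = time.Minute
	}
	return next
}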

Steps to reproduce

func TestBackoff_NextMinValue(t *testing.T) {
	backoff := &DefaultBackoff{}
	// Print the first 10 delays produced by the default policy.
	for i := 0; i < 10; i++ {
		delay := backoff.Next()
		fmt.Println(delay)
	}
}

Modify the test in backoff_test.go as shown above and run go test -run TestBackoff_NextMinValue in the same directory.

111.555037ms
213.388077ms
466.046518ms
940.55396ms
1.614542567s
3.383705308s
6.806575903s
15.159677039s
25.637999048s
1m0.834600978s

System configuration

Pulsar version: x.y


nodece commented Aug 14, 2024

The Go and Java designs are different. In the Go client, we have a goroutine to check the timeout.

Maybe your application needs to wait for 60 seconds, is that right? I'm improving this in #1256.

What problem have you encountered?

@BewareMyPower

There was a case where the server side took 25 seconds to recover while the client side failed with a timeout.

we have a goroutine to check the timeout.

I didn't mean the Go client does not check the timeout. Let's take a look at the following case (reusing the delays from the run above):

  • 0.000: reconnect after 0.111s
  • 0.111: reconnect after 0.213s
  • 0.324: reconnect after 0.466s
  • 0.790: reconnect after 0.940s
  • 1.730: reconnect after 1.614s
  • 3.344: reconnect after 3.383s
  • 6.727: reconnect after 6.806s
  • 13.533: reconnect after 15.159s
  • 28.692: reconnect after 25.637s
  • 29.500: assume the server recovered at this moment
  • 30.000: the message in the pending queue fails with a timeout (send timeout: 30s)
  • 54.329: the reconnection succeeds, but the message had already failed more than 20 seconds earlier

Let's see how the Java client solves the issue.

Recall the mandatory stop configured above:

.setMandatoryStop(Math.max(100, conf.getSendTimeoutMs() - 100), TimeUnit.MILLISECONDS)

With a 30s send timeout, the equivalent Backoff produces the following delays:

        Backoff backoff = new Backoff(100, TimeUnit.MILLISECONDS, 60, TimeUnit.SECONDS, 30000 - 100, TimeUnit.MILLISECONDS);
        for (int i = 0; i < 10; i++) {
            System.out.println(backoff.next());
        }
100
195
398
799
1525
3136
6272
12157
24279
29355

It's a little different from what I'd thought, but generally, having a delay greater than the send timeout does not make sense.
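
A Go equivalent could be sketched roughly like this (hypothetical names, not a patch against the actual client): the backoff tracks how much delay it has already handed out and clamps the next delay so the total never exceeds the send timeout.

package sketch

import "time"

// cappedBackoff is a hypothetical sketch that mirrors the Java client's
// mandatoryStop: the accumulated retry delay never passes the producer's
// send timeout.
type cappedBackoff struct {
	next          time.Duration
	elapsed       time.Duration
	maxBackoff    time.Duration // e.g. 60 * time.Second
	mandatoryStop time.Duration // e.g. sendTimeout - 100*time.Millisecond
}

func (b *cappedBackoff) Next() time.Duration {
	if b.next == 0 {
		b.next = 100 * time.Millisecond
	} else {
		b.next *= 2
	}
	if b.maxBackoff > 0 && b.next > b.maxBackoff {
		b.next = b.maxBackoff
	}
	// Clamp the delay so that elapsed + next never exceeds the mandatory stop.
	if remaining := b.mandatoryStop - b.elapsed; remaining > 0 && b.next > remaining {
		b.next = remaining
	}
	b.elapsed += b.next
	return b.next
}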


nodece commented Aug 14, 2024

It looks like you want to reduce the maximum retry time and retry interval.

The problem I see with this change is that it allows misconfigured clients that retry too quickly to overwhelm brokers with a flood of requests.

Please see #454

We can consider the design from https://aws.amazon.com/cn/blogs/architecture/exponential-backoff-and-jitter/.
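
For example, the "full jitter" variant from that post looks roughly like this (just a sketch of the idea, not a proposal against the client's code):

package sketch

import (
	"math/rand"
	"time"
)

// fullJitter follows the "full jitter" strategy from the AWS post: pick a
// random delay between 0 and the current exponential ceiling.
func fullJitter(attempt int, base, maxDelay time.Duration) time.Duration {
	ceiling := base << attempt // base * 2^attempt
	if ceiling <= 0 || ceiling > maxDelay {
		ceiling = maxDelay // guard against overflow and cap at the maximum
	}
	return time.Duration(rand.Int63n(int64(ceiling) + 1))
}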

@BewareMyPower

Yes, I want to limit the max retry time according to the send timeout config.

The exponential backoff mechanism is good and we can support it in the future, but for now we'd better adopt the same behavior as the Java client.


nodece commented Aug 15, 2024

OK, I see. We need to reduce the retry interval, not limit the max retry time; that is the root cause.

https://github.com/cenkalti/backoff/blob/v4/exponential.go#L39-L50 provides an example, could you check it out?
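
A minimal usage sketch with that module (v4; the values are only illustrative, not a concrete proposal):

package main

import (
	"fmt"
	"time"

	"github.com/cenkalti/backoff/v4"
)

func main() {
	b := backoff.NewExponentialBackOff()
	b.InitialInterval = 100 * time.Millisecond
	b.MaxInterval = 60 * time.Second
	// Stop retrying once the send-timeout budget (30s here) has elapsed.
	b.MaxElapsedTime = 30 * time.Second
	for {
		d := b.NextBackOff()
		if d == backoff.Stop {
			break
		}
		fmt.Println(d)
		time.Sleep(d) // wait as a real reconnect loop would
	}
}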

BTW, setMandatoryStop seems equivalent to the max retry time, is that right?
