Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue 1027][producer] fix: split sendRequest and make reconnectToBroker and other operate in the same coroutine #1029

Merged
merged 3 commits into from
Jun 16, 2023

Conversation

graysonzeng
Copy link
Contributor

@graysonzeng graysonzeng commented Jun 12, 2023

Fixes #1027

Motivation

fix send timeout error cause by reconnection failures

Modifications

After #691 , client avoid blocking the reconnection logic, but it also make reconnection and internalSend/internalFlush have concurrency issues。

when the client connection is block, it run into p.grabCnx() to get new connection and resend pendingQueue messages. At the same time, the internalSend/internalFlush keep running by use pendingQueue and send mssage by p._getConn().WriteData. Therefore, it received an ack larger than expected, and trigger the ConnectionClosed logic, then the client enter p.reconnectToBroker() and repeat the concurrency issues with internalSend/internalFlush, fall into a reconnect-resend-connectionClosed loop

Verifying this change

  • Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable / docs / GoDocs / not documented)
  • If a feature is not applicable for documentation, explain why?
  • If a feature is not documented yet in this PR, please create a followup issue for adding the documentation

@graysonzeng graysonzeng changed the title [Issue 1027][producer] fix: add read-write lock to makesure pendingQueue sending order [Issue 1027][producer] fix: add read-write lock to avoid continuous reconnection Jun 12, 2023
@graysonzeng
Copy link
Contributor Author

graysonzeng commented Jun 12, 2023

BTW, I think @bschofield about some guess said in #697 is right.

@shibd
Copy link
Member

shibd commented Jun 13, 2023

@zengguan Thanks for your contribute.

I have a question, As #687. Can we let the sendRequest event alone in one channel?

In this way, Let reconnectToBroker and other operate are kept in the same coroutine.

	for {
		select {
		case sendRequest <-p.sendRequestChan:
		    p.internalSend(sendRequest)
		case i := <-p.eventsChan:
			switch v := i.(type) {
			case *flushRequest:
				p.internalFlush(v)
			case *closeProducer:
				p.internalClose(v)
				return
			}
		case <-p.connectClosedCh:
			p.reconnectToBroker()
		case <-p.batchFlushTicker.C:
			if p.batchBuilder.IsMultiBatches() {
				p.internalFlushCurrentBatches()
			} else {
				p.internalFlushCurrentBatch()
			}
		}
	}

I think we can discuss clearly here, using channels should avoid using locks.

@graysonzeng graysonzeng changed the title [Issue 1027][producer] fix: add read-write lock to avoid continuous reconnection [Issue 1027][producer] fix: split sendRequest and make reconnectToBroker and other operate in the same coroutine Jun 14, 2023
@graysonzeng
Copy link
Contributor Author

graysonzeng commented Jun 14, 2023

@shibd great idea ! Now I split sendRequest and changed eventsChan to dataChan and cmdChan to handle sendRequest and flushRequest/closeProducer separately。I've tested it and it doesn't reproduce anymore

@shibd
Copy link
Member

shibd commented Jun 15, 2023

@BewareMyPower @gaoran10 Can you help take a look?

pulsar/producer_partition.go Outdated Show resolved Hide resolved
@BewareMyPower BewareMyPower added this to the v0.11.0 milestone Jun 15, 2023
@BewareMyPower
Copy link
Contributor

Could you update your PR description since you don't use RWLock now?

@BewareMyPower
Copy link
Contributor

It seems the CI is broken.

@graysonzeng
Copy link
Contributor Author

graysonzeng commented Jun 16, 2023

Could you update your PR description since you don't use RWLock now?

Done.

@BewareMyPower
Copy link
Contributor

Please rebase to master to have tests fixed.

@shibd shibd merged commit 7f91b2b into apache:master Jun 16, 2023
RobertIndie pushed a commit that referenced this pull request Jul 20, 2023
…to flush all messages (#1058)

Fixes #1057 

### Motivation

`dataChan` is introduced by #1029  to fix the problem of reconnectToBroker. But it missed that if a flush operation excuted, there may still be some messages in `dataChan`. And these messages can't be flushed.

### Modifications

- Fix the producer flush opertion is not guarantee to flush all messages
RobertIndie pushed a commit that referenced this pull request Jul 27, 2023
### Motivation

After #1029 , `eventChan` is split into `dataChan` and `cmdChan`.  But the description of `SendAsync()` is not modified.

https://github.com/apache/pulsar-client-go/blob/9867c29ca329302e97ddd9c6a99f66853c7f447f/pulsar/producer.go#L226-L231

### Modifications

- Correct the description of SendAsync() description
graysonzeng pushed a commit to graysonzeng/pulsar-client-go that referenced this pull request Jul 31, 2023
### Motivation

After apache#1029 , `eventChan` is split into `dataChan` and `cmdChan`.  But the description of `SendAsync()` is not modified.

https://github.com/apache/pulsar-client-go/blob/9867c29ca329302e97ddd9c6a99f66853c7f447f/pulsar/producer.go#L226-L231

### Modifications

- Correct the description of SendAsync() description
RobertIndie pushed a commit that referenced this pull request Sep 7, 2023
…to flush all messages (#1058)

Fixes #1057

### Motivation

`dataChan` is introduced by #1029  to fix the problem of reconnectToBroker. But it missed that if a flush operation excuted, there may still be some messages in `dataChan`. And these messages can't be flushed.

### Modifications

- Fix the producer flush opertion is not guarantee to flush all messages

(cherry picked from commit 9867c29)
RobertIndie pushed a commit that referenced this pull request Sep 7, 2023
### Motivation

After #1029 , `eventChan` is split into `dataChan` and `cmdChan`.  But the description of `SendAsync()` is not modified.

https://github.com/apache/pulsar-client-go/blob/9867c29ca329302e97ddd9c6a99f66853c7f447f/pulsar/producer.go#L226-L231

### Modifications

- Correct the description of SendAsync() description

(cherry picked from commit 50015d3)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Continuous reconnection failures lead to message send timeout
3 participants