Possible data race in internalFlush #1258

Gilthoniel · 2024-07-26T07:22:30Z

Expected behavior

When calling Flush or FlushWithCtx, enqueued messages should all be sent.

Actual behavior

When flushing, switching to a new channel can lead to message loss:

	if len(p.dataChan) != 0 {
		oldDataChan := p.dataChan
		p.dataChan = make(chan *sendRequest, p.options.MaxPendingMessages)
		for len(oldDataChan) != 0 {
			pendingData := <-oldDataChan
			p.internalSend(pendingData)
		}
	}

If internalSendAsync sends a request to the channel at the same time as the switch, the drain loop may see a length of zero and stop, and the length then becomes one right after. The message ends up stuck in the old channel.
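The interleaving can be replayed deterministically in a single goroutine. The following is a minimal sketch, not the client's code: the `producer` type, `flushBySwapping` and `simulateStaleSend` are hypothetical names, and `internalSend` only records the request ID. The `stale` variable stands in for a concurrent sender that has already read the `dataChan` field but has not yet performed the channel send.

```go
package main

import "fmt"

// sendRequest is a stand-in for the client's request type (assumption).
type sendRequest struct{ id int }

// producer is a hypothetical, stripped-down stand-in for partitionProducer.
type producer struct {
	dataChan chan *sendRequest
	sent     []int
}

// internalSend only records the request ID in this sketch.
func (p *producer) internalSend(r *sendRequest) { p.sent = append(p.sent, r.id) }

// flushBySwapping mirrors the snippet from the issue: swap the channel,
// then drain the old one until its length reaches zero.
func (p *producer) flushBySwapping() {
	if len(p.dataChan) != 0 {
		old := p.dataChan
		p.dataChan = make(chan *sendRequest, cap(old))
		for len(old) != 0 {
			p.internalSend(<-old)
		}
	}
}

// simulateStaleSend replays the problematic ordering step by step and
// returns the IDs that were sent plus the number of stranded requests.
func simulateStaleSend() (sent []int, stranded int) {
	p := &producer{dataChan: make(chan *sendRequest, 4)}
	p.dataChan <- &sendRequest{id: 1}

	// The "concurrent" sender has loaded the channel reference already.
	stale := p.dataChan

	// The flush swaps the channel and drains the old one down to zero.
	p.flushBySwapping()

	// The sender now completes its send on the old channel.
	stale <- &sendRequest{id: 2}

	// A later flush only looks at the new channel, which is empty.
	p.flushBySwapping()

	return p.sent, len(stale)
}

func main() {
	sent, stranded := simulateStaleSend()
	fmt.Println("sent:", sent, "stranded:", stranded) // prints "sent: [1] stranded: 1"
}
```

In the real producer the two steps run on different goroutines, so the unsynchronized read and write of the `dataChan` field is also something the Go race detector would be expected to flag.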

Steps to reproduce

I never actually observed this in practice but I noticed this bug while reading the code for a different issue. I'm confident this can happen but I'd like your opinion.

System configuration

Pulsar version: v3.0.5


Gilthoniel · 2024-07-26T07:29:18Z

A different approach would be something like this:

// flushDataChan first empties the data channel as much as possible, then sends
// the pending requests.
func (p *partitionProducer) flushDataChan() {
	var reqs []*sendRequest

	for {
		select {
		case pendingData := <-p.dataChan:
			reqs = append(reqs, pendingData)
		default:
			for _, req := range reqs {
				p.internalSend(req)
			}
			return
		}
	}
}

It would also avoid the channel allocation for every flush in high load scenarios.

gunli · 2024-07-29T02:11:36Z

Hmm, internalSendAsync() is called by Send() or SendAsync(). IMO, you should not call Flush() and Send()/SendAsync() at the same time; I think that should be a convention.

Fixes #1258

Motivation

While flushing, the data channel is swapped for a newly allocated one. This can cause message loss: the length of the old channel can reach zero, which stops the drain loop, while at the same time a new message is sent to that channel.

Modifications

Instead of allocating a new channel, the flush empties the existing one, up to the length of the channel's buffer, before proceeding.

(cherry picked from commit 8dd4ed1)
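The merged diff is not shown in this thread, so the following is only a sketch of what "emptying the existing channel up to the length of its buffer" could look like. `drainBounded` and the `send` callback are hypothetical names, and the real change may differ in its details.

```go
package main

import "fmt"

// sendRequest is a stand-in for the client's request type (assumption).
type sendRequest struct{ id int }

// drainBounded receives at most cap(ch) requests from ch without blocking
// and passes each one to send. It returns how many requests were handled.
// The channel is never replaced, so a concurrent sender cannot write into
// a channel that nobody reads anymore.
func drainBounded(ch chan *sendRequest, send func(*sendRequest)) int {
	drained := 0
	for i := 0; i < cap(ch); i++ {
		select {
		case req := <-ch:
			send(req)
			drained++
		default:
			// Nothing is buffered right now: stop early.
			return drained
		}
	}
	return drained
}

func main() {
	ch := make(chan *sendRequest, 3)
	ch <- &sendRequest{id: 1}
	ch <- &sendRequest{id: 2}

	var sent []int
	n := drainBounded(ch, func(r *sendRequest) { sent = append(sent, r.id) })
	fmt.Println(n, sent, len(ch)) // prints "2 [1 2] 0"
}
```

Bounding the loop by the buffer capacity keeps a flush from running indefinitely when senders keep refilling the channel. Requests that arrive after the bound is reached stay in the same channel and are picked up by whatever reads it next.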

Gilthoniel added a commit to Gilthoniel/pulsar-client-go that referenced this issue Jul 26, 2024

Avoid a data race when flushing with load (apache#1258)

f2831a4

Gilthoniel added a commit to Gilthoniel/pulsar-client-go that referenced this issue Jul 26, 2024

Avoid a data race when flushing with load (apache#1258)

b364322

Gilthoniel mentioned this issue Jul 26, 2024

[Issue 1258][producer] Avoid a data race when flushing with load #1261

Merged

1 task

RobertIndie closed this as completed in #1261 Jul 30, 2024
