newPartitionProducer() occasionally hangs, waiting for a ready connection #662

bschofield · 2021-11-08T15:19:45Z

Expected behavior

When creating a partitioned producer, producers for all partitions should be created.

Actual behavior

I am writing to a topic with six partitions. Very occasionally, I experience a process which is stuck because the producer for one of the partitions is not creating.

I recently had a process in this state, and I attached delve to it in order to see exactly where it was stuck. Looking first at the main goroutine, then we can see that it is stuck in internalCreatePartitionsProducers():

(dlv) bt
 0  0x00000000004425b6 in runtime.gopark
    at /usr/local/go/src/runtime/proc.go:367
 1  0x0000000000411bec in runtime.chanrecv
    at /usr/local/go/src/runtime/chan.go:576
 2  0x0000000000411658 in runtime.chanrecv2
    at /usr/local/go/src/runtime/chan.go:444
 3  0x000000000095e9ec in github.com/apache/pulsar-client-go/pulsar.(*producer).internalCreatePartitionsProducers
    at /go/pkg/mod/github.com/apache/pulsar-client-go@v0.7.0-candidate-1.0.20211104025721-d80a722ac1ab/pulsar/producer_impl.go:230
 4  0x000000000095dfaf in github.com/apache/pulsar-client-go/pulsar.newProducer
    at /go/pkg/mod/github.com/apache/pulsar-client-go@v0.7.0-candidate-1.0.20211104025721-d80a722ac1ab/pulsar/producer_impl.go:127
 5  0x0000000000945ba9 in github.com/apache/pulsar-client-go/pulsar.(*client).CreateProducer
    at /go/pkg/mod/github.com/apache/pulsar-client-go@v0.7.0-candidate-1.0.20211104025721-d80a722ac1ab/pulsar/client_impl.go:154
[...]

Line 230 of producer_impl.go is waiting for a channel read:

pe, ok := <-c

The goroutine on the other end of this channel is launched at line 219 of producer_impl.go, and the issue lies in the call to newPartitionProducer() on line 220. The full stack trace for that other goroutine is:

(dlv) bt
 0  0x00000000004425b6 in runtime.gopark
    at /usr/local/go/src/runtime/proc.go:367
 1  0x000000000046c3dd in runtime.goparkunlock
    at /usr/local/go/src/runtime/proc.go:372
 2  0x000000000046c3dd in sync.runtime_notifyListWait
    at /usr/local/go/src/runtime/sema.go:513
 3  0x0000000000477fac in sync.(*Cond).Wait
    at /usr/local/go/src/sync/cond.go:56
 4  0x00000000008e33cd in github.com/apache/pulsar-client-go/pulsar/internal.(*connection).waitUntilReady
    at /go/pkg/mod/github.com/apache/pulsar-client-go@v0.7.0-candidate-1.0.20211104025721-d80a722ac1ab/pulsar/internal/connection.go:339
 5  0x00000000008e8668 in github.com/apache/pulsar-client-go/pulsar/internal.(*connectionPool).GetConnection
    at /go/pkg/mod/github.com/apache/pulsar-client-go@v0.7.0-candidate-1.0.20211104025721-d80a722ac1ab/pulsar/internal/connection_pool.go:108
 6  0x00000000008f37ed in github.com/apache/pulsar-client-go/pulsar/internal.(*rpcClient).Request
    at /go/pkg/mod/github.com/apache/pulsar-client-go@v0.7.0-candidate-1.0.20211104025721-d80a722ac1ab/pulsar/internal/rpc_client.go:120
 7  0x0000000000960d63 in github.com/apache/pulsar-client-go/pulsar.(*partitionProducer).grabCnx
    at /go/pkg/mod/github.com/apache/pulsar-client-go@v0.7.0-candidate-1.0.20211104025721-d80a722ac1ab/pulsar/producer_partition.go:228
 8  0x00000000009602fa in github.com/apache/pulsar-client-go/pulsar.newPartitionProducer
    at /go/pkg/mod/github.com/apache/pulsar-client-go@v0.7.0-candidate-1.0.20211104025721-d80a722ac1ab/pulsar/producer_partition.go:161
 9  0x000000000095edc9 in github.com/apache/pulsar-client-go/pulsar.(*producer).internalCreatePartitionsProducers.func1
    at /go/pkg/mod/github.com/apache/pulsar-client-go@v0.7.0-candidate-1.0.20211104025721-d80a722ac1ab/pulsar/producer_impl.go:220
10  0x000000000095ed55 in github.com/apache/pulsar-client-go/pulsar.(*producer).internalCreatePartitionsProducers·dwrap·38
    at /go/pkg/mod/github.com/apache/pulsar-client-go@v0.7.0-candidate-1.0.20211104025721-d80a722ac1ab/pulsar/producer_impl.go:226
11  0x000000000046ff01 in runtime.goexit
    at /usr/local/go/src/runtime/asm_amd64.s:1581

For some reason, the call to waitUntilReady() in connection.go never completes.

The other five partitions seem to have created fine: I see creation log messages for all five, and in delve I can see five running copies of failTimeoutMessages and runEventsLoop goroutines.

Steps to reproduce

This is not an easily bug to reproduce. It seems to occur around once per 10,000 invocations of the code in question, which means it happens around once per week for me.

It may be relevant that there are many new connections being made to the broker, at the time the bug occurs -- could a packet loss somewhere cause this, perhaps?

I do not know how to debug this further. However, if there is debug code I can insert, please let me know and I will do so.

System configuration

Pulsar version: 2.8.1
Client version: 0.7.0-candidate-1.0.20211104025721-d80a722ac1ab

The connection here is direct to the brokers, not via a proxy.

The text was updated successfully, but these errors were encountered:

bschofield · 2021-11-08T15:44:23Z

In this code:

	for c.getState() != connectionReady {
		c.log.Debugf("Wait until connection is ready state=%s", c.getState().String())
		if c.getState() == connectionClosed {
			return errors.New("connection error")
		}
		// wait for a new connection state change
		c.cond.Wait()
	}

If the state of the connection changes after the for statement, but before c.cond.Wait() is called, what prevents a race condition? It looks to me like the cond.Broadcast() could happen before the cond.Wait() registers the listener.

sijie mentioned this issue Nov 8, 2021

ISSUE-662: newPartitionProducer() occasionally hangs, waiting for a ready connection streamnative/pulsar-client-go#240

Open

bschofield mentioned this issue Nov 8, 2021

[Issue 662] Fix race in connection.go waitUntilReady() #663

Merged

cckellogg closed this as completed in #663 Nov 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

newPartitionProducer() occasionally hangs, waiting for a ready connection #662

newPartitionProducer() occasionally hangs, waiting for a ready connection #662

bschofield commented Nov 8, 2021

bschofield commented Nov 8, 2021 •

edited

Loading

newPartitionProducer() occasionally hangs, waiting for a ready connection #662

newPartitionProducer() occasionally hangs, waiting for a ready connection #662

Comments

bschofield commented Nov 8, 2021

Expected behavior

Actual behavior

Steps to reproduce

System configuration

bschofield commented Nov 8, 2021 • edited Loading

bschofield commented Nov 8, 2021 •

edited

Loading