Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stuck Write Message when there is a issue "multiple Read calls return no data or error" #426

Open
anhtuanluu36 opened this issue Apr 9, 2020 · 11 comments
Assignees
Labels

Comments

@anhtuanluu36
Copy link

anhtuanluu36 commented Apr 9, 2020

I have a strange issue
Reproduce Steps:

  1. I call w.WriteMessages(ctx, msgs...)
  2. I have an error "multiple Read calls return no data or error"
  3. I call w.WriteMessages(ctx, msgs...) and it is stuck here

Here is my code:

 kafka.NewWriter(kafka.WriterConfig{
				Brokers:  brokers,
				Topic:    topic,
				Balancer: &kafka.Hash{},
				Dialer: &kafka.Dialer{
					Timeout:   10 * time.Second,
					DualStack: true,
					TLS:       tlsConfig,
				},
			})

Sometimes I see this error on Kafka:

 [2020-04-09 14:05:40,353] WARN Failed to send SSL Close message  (org.apache.kafka.common.network.SslTransportLayer)
java.io.IOException: Broken pipe
	at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at java.base/sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at java.base/sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113)
	at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:79)
	at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:50)
	at java.base/sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:466)
	at org.apache.kafka.common.network.SslTransportLayer.flush(SslTransportLayer.java:210)
	at org.apache.kafka.common.network.SslTransportLayer.close(SslTransportLayer.java:173)
	at org.apache.kafka.common.utils.Utils.closeAll(Utils.java:718)
	at org.apache.kafka.common.network.KafkaChannel.close(KafkaChannel.java:61)
	at org.apache.kafka.common.network.Selector.doClose(Selector.java:748)
	at org.apache.kafka.common.network.Selector.close(Selector.java:736)
	at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:534)
	at org.apache.kafka.common.network.Selector.poll(Selector.java:424)
	at kafka.network.Processor.poll(SocketServer.scala:665)
	at kafka.network.Processor.run(SocketServer.scala:582)
	at java.base/java.lang.Thread.run(Thread.java:834)

Do you have any ideas?

@anhtuanluu36
Copy link
Author

I think it relate to use SSL and send a lot of messages at the same time. It works fine when I only push 5 messages with the method "w.WriteMessages"

@pandaadb
Copy link
Contributor

What version are you on?

I ran into a recent problem where any error on writing caused the driver to retry all messages. However the driver also wrote all errors to a dedicated error channel. Since the channel was read by the main routine, the error was immediately returned, however the go routine retrying the write request on the next error was stuck waiting on the channel to be read (which would never happen, because the reading routine had returned). This caused the writing to hang forever.

@orn1983
Copy link

orn1983 commented Aug 25, 2020

The same thing just happened to me, I believe. I had a network fault where the leader data for Kafka was stale. Then I saw this error, and my writer thread seemed to hang forever.

2020-08-25 14:47:37.713482500 ERROR: writer.go:745: error writing messages to TopicRedacted (partition 0): [6] Not Leader For Partition: the client attempted to send messages to a replica that is not the leader for some partition, the client's metadata are likely out of date
2020-08-25 14:47:37.814596500 ERROR: writer.go:745: error writing messages to TopicRedacted (partition 0): multiple Read calls return no data or error
2020-08-25 14:47:38.214899500 ERROR: writer.go:745: error writing messages to TopicRedacted (partition 0): multiple Read calls return no data or error

Presumably I've encountered the same error as you have. I haven't been able to dig down into it just yet, but it looks like the same problem.

Edit: It was the same problem. Got fixed by updating.

@achille-roussel
Copy link
Contributor

Hello, thank you all for reporting.

Would you be able to mention which version of kafka-go you're experiencing the issue with?

@achille-roussel achille-roussel self-assigned this Sep 29, 2020
@NitroCao
Copy link

NitroCao commented Aug 26, 2021

I also encountered this problem. The version of kafka-go is 0.4.17, the version of kafka is 2.13-2.8.0. I triggered the problem when writing plenty of data. My kafka doesn't config TLS.

@achille-roussel
Copy link
Contributor

@pandaadb do you have more details to share about this issue? Is the channel you are referring to managed within kafka-go?

I ran into a recent problem where any error on writing caused the driver to retry all messages. However the driver also wrote all errors to a dedicated error channel. Since the channel was read by the main routine, the error was immediately returned, however the go routine retrying the write request on the next error was stuck waiting on the channel to be read (which would never happen, because the reading routine had returned). This caused the writing to hang forever.

@vmihailenco
Copy link
Contributor

Hard to say if it is the same issue, but I get multiple Read calls return no data or error from batch.Close when closing a kafka.Batch with unread messages.

@orn1983
Copy link

orn1983 commented Jan 30, 2022

This is a recurring issue for me, though I am not able to reproduce it manually. It seems to always be the came cadence. Starts out with:

2022-01-29 16:47:30.393884500 ERROR: writer.go:745: error writing messages to TOPIC (partition N): [6] Not Leader For Partition: the client attempted to send messages to a replica that is not the leader for some partition, the client's metadata are likely out of date

Followed by:

2022-01-29 16:47:30.496844500 ERROR: writer.go:745: error writing messages to TOPIC (partition N): multiple Read calls return no data or error
2022-01-29 16:47:30.897083500 ERROR: writer.go:745: error writing messages to TOPIC (partition N): multiple Read calls return no data or error

There seems to have been a network fault and one of the zookeeper nodes lost connectivity to the cluster, to reconnect seconds later. This triggered a re-election and for some reason the client seems to be unaware of it and tries to write to a follower. Upon failing, it just gives up.

When I try to recreate this, the client always seems to just be aware of all of this and just recover. I can't figure out what's different.

Am I missing some obvious thing where I should be performing some manual action?

@orn1983
Copy link

orn1983 commented Jan 30, 2022

I see I wrote above (August 25th) that it got fixed by updating, but apparently I just wasn't able to recreate it.

@achille-roussel
Copy link
Contributor

Network connectivity issues can manifest in complex ways, I think the work in #846 will help better surface the conditions that triggered those errors.

@NehaSachan
Copy link

NehaSachan commented Jan 9, 2023

i also got this same issue-
[6] Not Leader For Partition: the client attempted to send messages to a replica that is not the leader for some partition, the client's metadata are likely out of date },

github.com/segmentio/kafka-go v0.4.38

There is no load in the kafka consumer OR publisher as such.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants