
Telegraf stops consuming from partition on GetOffset error, does not try again (affects entire consumer group) #3553

Closed
HristoMohamed opened this issue Dec 7, 2017 · 15 comments · Fixed by #3560
Labels
area/kafka, bug

Comments

@HristoMohamed

Telegraf version: telegraf-1.4.5-1.x86_64, running on kernel 3.10.0-327.36.3.el7.x86_64. Kafka is 1.0.0, built for Scala 2.12.

Everything runs fine, until:

Dec 07 18:15:53 telecons02 telegraf[2820]: 2017-12-07T17:15:53Z E! Error in plugin [inputs.kafka_consumer]: Consumer Error: kafka: error while consuming telegrafcommon/26: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.

After this, the specified partition is not consumed anymore and messages pile up.
This is fixable by restarting Telegraf.

Interestingly, I have a few Telegraf instances consuming Kafka messages, and all of them hit this issue on a few partitions (random partitions; I cannot localize it to a single one). When I restart one Telegraf instance, the entire consumer group goes back to normal and messages flow again (even on partitions served by the other instances that were stuck).

Pls help ;(

@HristoMohamed
Author

HristoMohamed commented Dec 7, 2017

This is what my config looks like:
[[inputs.kafka_consumer]]
  brokers = ["kafka01:9092", "kafka02:9092", "kafka03:9092", "kafka04:9092", "kafka05:9092", "kafka06:9092", "kafka07:9092"]
  consumer_group = "telegraf_metrics_consumers"
  data_format = "influx"
  max_message_len = 65536
  offset = "oldest"
  topics = ["telegrafcommon"]
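
For context, here is a rough sketch of how these settings map onto the bsm/sarama-cluster client the plugin was built on at the time; the names and the loop below are illustrative, not the plugin's actual code:

package main

import (
	"log"

	"github.com/Shopify/sarama"
	cluster "github.com/bsm/sarama-cluster"
)

func main() {
	brokers := []string{"kafka01:9092", "kafka02:9092"} // brokers = [...]
	group := "telegraf_metrics_consumers"               // consumer_group
	topics := []string{"telegrafcommon"}                // topics

	config := cluster.NewConfig()
	config.Consumer.Return.Errors = true                  // deliver broker errors on consumer.Errors()
	config.Consumer.Offsets.Initial = sarama.OffsetOldest // offset = "oldest"

	consumer, err := cluster.NewConsumer(brokers, group, topics, config)
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	for {
		select {
		case msg := <-consumer.Messages():
			// parse msg.Value with the "influx" data_format, then commit
			consumer.MarkOffset(msg, "")
		case err := <-consumer.Errors():
			// this is where "offset is outside the range ..." surfaces
			log.Printf("Consumer Error: %v", err)
		}
	}
}

With Consumer.Return.Errors enabled, broker-side errors arrive on the Errors() channel and get logged, which appears to match the behaviour reported here: the process keeps running, but the stalled partition never recovers on its own.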

@danielnelson
Contributor

@seuf Have you seen this error before?

@seuf
Contributor

seuf commented Dec 8, 2017

@danielnelson Sorry, nope. Maybe we should switch to the Confluent Kafka client: https://github.com/confluentinc/confluent-kafka-go

@HristoMohamed
Author

HristoMohamed commented Dec 8, 2017

I tried restarting my entire Kafka cluster, to no avail.
At the same time, I do not see this on another topic (with the same partitioning and replication as the one my Telegraf instances consume from) that is being consumed by Logstash.
At the moment my only solution is a reload/restart of the telegraf daemon :(

@HristoMohamed
Author

Could the high number of partitions somehow play a role? I have 140 partitions in the topic.
Funnily enough, I also have over 70 FreeBSD 11.0 Telegraf hosts acting as consumers of various other topics, but those topics only have 7 partitions each.

@danielnelson added the area/kafka and bug labels on Dec 8, 2017
@danielnelson
Contributor

I think this might be bsm/sarama-cluster#121, which has been fixed upstream. I will update to the latest upstream version, v2.1.10. I'm a little nervous about putting this into 1.5.0 since I'm hoping to release next week; do you think you could test it if I add it to 1.5.0-rc2?
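
The real change is whatever landed upstream in bsm/sarama-cluster#121 and v2.1.10. Purely to illustrate the recovery idea (continuing the sketch under the config above): when a committed offset has been trimmed away by retention, the broker answers with ErrOffsetOutOfRange, and one blunt way out is to drop the session and rejoin the group so the consumer falls back to its configured initial offset instead of staying stalled. A sketch under those assumptions, not the upstream patch:

package kafkaconsumer

import (
	"log"

	"github.com/Shopify/sarama"
	cluster "github.com/bsm/sarama-cluster"
)

// rebuildOnOffsetError is illustrative only. Errors from the partition
// consumers typically arrive as *sarama.ConsumerError; when the underlying
// cause is ErrOffsetOutOfRange, tear the session down and rejoin the group,
// which falls back to config.Consumer.Offsets.Initial (OffsetOldest above).
func rebuildOnOffsetError(err error, c *cluster.Consumer,
	brokers []string, group string, topics []string, cfg *cluster.Config) (*cluster.Consumer, error) {

	cerr, ok := err.(*sarama.ConsumerError)
	if !ok || cerr.Err != sarama.ErrOffsetOutOfRange {
		log.Printf("Consumer Error: %v", err) // anything else: just log and keep going
		return c, nil
	}
	c.Close()                                               // drop the stalled session...
	return cluster.NewConsumer(brokers, group, topics, cfg) // ...and start a fresh one
}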

@danielnelson added this to the 1.5.0 milestone on Dec 8, 2017
@HristoMohamed
Author

Yep, no problemo =)

@danielnelson
Contributor

Here are the builds of 1.5.0-rc2.

@HristoMohamed
Author

I will test now and report back

@HristoMohamed
Author

It has been running for an hour and everything is OK so far.
I will report back tomorrow as well =)

@HristoMohamed
Author

HristoMohamed commented Dec 13, 2017

Right, I still get this error, albeit on a much smaller scale.
https://pastebin.com/BJRY2n8D

Over the last 24 hours I have only had this happen to 4 partitions, with a total loss of ~20k messages (I produce around 120 0000 messages per minute from my producers).
Dec 13 10:36:37 telecons01 telegraf[30152]: 2017-12-13T09:36:37Z E! Error in plugin [inputs.kafka_consumer]: Consumer Error: kafka: error while consuming telegrafcommon/110: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Dec 13 11:01:09 telecons01 telegraf[30152]: 2017-12-13T10:01:09Z E! Error in plugin [inputs.kafka_consumer]: Consumer Error: kafka: error while consuming telegrafcommon/99: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Dec 13 11:11:09 telecons01 telegraf[30152]: 2017-12-13T10:11:09Z E! Error in plugin [inputs.kafka_consumer]: Consumer Error: kafka: error while consuming telegrafcommon/105: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Dec 13 11:16:33 telecons01 telegraf[30152]: 2017-12-13T10:16:33Z E! Error in plugin [inputs.kafka_consumer]: Consumer Error: kafka: error while consuming telegrafcommon/103: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.

@HristoMohamed
Author

I noticed something interesting: the affected partitions do indeed change.
Now my affected partitions are 105, 109, 99, and 76, whereas in the morning they were 110, 99, 105, and 103.
Also, the affected partitions did get consumed at some point.

@danielnelson reopened this on Dec 13, 2017
@danielnelson
Contributor

So now it is recovering without a restart, but you are still losing messages?

@danielnelson removed this from the 1.5.0 milestone on Dec 14, 2017
@HristoMohamed
Author

I am not losing messages so much as they are "late".
To put it this way:
Moment A:
  Partition   Lag
  Part 0      0
  Part 1      10000
  Part 2      5

Moment A1:
  Partition   Lag
  Part 0      10
  Part 1      11000
  Part 2      0

Now some time elapses and we get to:

Moment B:
  Partition   Lag
  Part 0      10000
  Part 1      0
  Part 2      0

It is not a major bug now that the messages do get consumed (whereas before 1.5.0-rc2 they were just stuck forever), just a tad annoying :)
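
As an aside, a minimal sketch of how the per-partition lag in those snapshots can be measured with the plain sarama client; consumerLag is a hypothetical helper, not part of Telegraf:

package kafkalag

import "github.com/Shopify/sarama"

// consumerLag returns the broker's newest offset minus the next offset the
// consumer group would read, i.e. the "Lag" column above.
func consumerLag(client sarama.Client, group, topic string, partition int32) (int64, error) {
	newest, err := client.GetOffset(topic, partition, sarama.OffsetNewest)
	if err != nil {
		return 0, err
	}
	om, err := sarama.NewOffsetManagerFromClient(group, client)
	if err != nil {
		return 0, err
	}
	defer om.Close()
	pom, err := om.ManagePartition(topic, partition)
	if err != nil {
		return 0, err
	}
	defer pom.Close()
	next, _ := pom.NextOffset() // second value is the commit metadata, unused here
	if next < 0 {               // nothing committed yet: sarama hands back the configured initial offset
		return 0, nil
	}
	return newest - next, nil
}

Because lag is measured against the group's committed offset, messages that are merely delayed show up as a temporary spike that later drains, rather than a permanent gap.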

@sspaink
Contributor

sspaink commented May 12, 2022

Closing, as there hasn't been any activity in this bug report for a long time. If you are still facing this issue, I recommend using the latest version of Telegraf and posting your configuration and any debug information. Thanks!

@sspaink closed this as completed on May 12, 2022