
Telegraf stops consuming from partition on GetOffset error, does not try again (affects entire consumer group) #3553

Closed
HristoMohamed opened this issue Dec 7, 2017 · 15 comments · Fixed by #3560
Labels
area/kafka, bug

Comments

@HristoMohamed

Telegraf version: telegraf-1.4.5-1.x86_64, running on kernel 3.10.0-327.36.3.el7.x86_64. Kafka is 1.0.0, built for Scala 2.12.

Everything runs fine, until:

Dec 07 18:15:53 telecons02 telegraf[2820]: 2017-12-07T17:15:53Z E! Error in plugin [inputs.kafka_consumer]: Consumer Error: kafka: error while consuming telegrafcommon/26: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.

After this, the specified partition is not consumed anymore and messages pile up.
This is fixable by restarting Telegraf.

Interestingly, I have a few Telegraf instances consuming Kafka messages, and all of them hit this issue on a few partitions (random partitions; I cannot localize it to a single one). When I restart one Telegraf instance, the entire consumer group goes back to normal and messages flow again (even on partitions served by the other instances that were stuck).

Pls help ;(

@HristoMohamed
Author

HristoMohamed commented Dec 7, 2017

This is what my config looks like:
[[inputs.kafka_consumer]]
  brokers = ["kafka01:9092", "kafka02:9092", "kafka03:9092", "kafka04:9092", "kafka05:9092", "kafka06:9092", "kafka07:9092"]
  consumer_group = "telegraf_metrics_consumers"
  data_format = "influx"
  max_message_len = 65536
  offset = "oldest"
  topics = ["telegrafcommon"]
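
For context, here is a rough sketch of how these settings map onto the bsm/sarama-cluster client the plugin was built on at the time; the names and the loop below are illustrative, not the plugin's actual code:

package main

import (
	"log"

	"github.com/Shopify/sarama"
	cluster "github.com/bsm/sarama-cluster"
)

func main() {
	brokers := []string{"kafka01:9092", "kafka02:9092"} // brokers = [...]
	group := "telegraf_metrics_consumers"               // consumer_group
	topics := []string{"telegrafcommon"}                // topics

	config := cluster.NewConfig()
	config.Consumer.Return.Errors = true                  // deliver broker errors on consumer.Errors()
	config.Consumer.Offsets.Initial = sarama.OffsetOldest // offset = "oldest"

	consumer, err := cluster.NewConsumer(brokers, group, topics, config)
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	for {
		select {
		case msg := <-consumer.Messages():
			// parse msg.Value with the "influx" data_format, then commit
			consumer.MarkOffset(msg, "")
		case err := <-consumer.Errors():
			// this is where "offset is outside the range ..." surfaces
			log.Printf("Consumer Error: %v", err)
		}
	}
}

With Consumer.Return.Errors enabled, broker-side errors arrive on the Errors() channel and get logged, which appears to match the behaviour reported here: the process keeps running, but the stalled partition never recovers on its own.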

@danielnelson
Contributor

@seuf Have you seen this error before?

@seuf
Contributor

seuf commented Dec 8, 2017

@danielnelson Sorry, nope. Maybe we should switch to the Confluent Kafka client: https://github.com/confluentinc/confluent-kafka-go

@HristoMohamed
Author

HristoMohamed commented Dec 8, 2017

I tried restarting my entire Kafka cluster, to no avail.
At the same time, I do not see this on another topic (with the same partitioning and replication as the one my Telegraf instances consume from) that is being consumed by Logstash.
At the moment my only solution is a reload/restart of the telegraf daemon :(

@HristoMohamed
Author

Could the high number of partitions somehow play a role? I have 140 partitions in the topic.
Funnily enough, I also have over 70 FreeBSD 11.0 Telegraf hosts acting as consumers of various other topics, but those topics only have 7 partitions each.

@danielnelson added the area/kafka and bug labels on Dec 8, 2017
@danielnelson
Contributor

I think this might be bsm/sarama-cluster#121, which has been fixed upstream. I will update to the latest upstream version, v2.1.10. I'm a little nervous about putting this into 1.5.0 since I'm hoping to release next week; do you think you could test it if I add it to 1.5.0-rc2?
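
The real change is whatever landed upstream in bsm/sarama-cluster#121 and v2.1.10. Purely to illustrate the recovery idea (continuing the sketch under the config above): when a committed offset has been trimmed away by retention, the broker answers with ErrOffsetOutOfRange, and one blunt way out is to drop the session and rejoin the group so the consumer falls back to its configured initial offset instead of staying stalled. A sketch under those assumptions, not the upstream patch:

package kafkaconsumer

import (
	"log"

	"github.com/Shopify/sarama"
	cluster "github.com/bsm/sarama-cluster"
)

// rebuildOnOffsetError is illustrative only. Errors from the partition
// consumers typically arrive as *sarama.ConsumerError; when the underlying
// cause is ErrOffsetOutOfRange, tear the session down and rejoin the group,
// which falls back to config.Consumer.Offsets.Initial (OffsetOldest above).
func rebuildOnOffsetError(err error, c *cluster.Consumer,
	brokers []string, group string, topics []string, cfg *cluster.Config) (*cluster.Consumer, error) {

	cerr, ok := err.(*sarama.ConsumerError)
	if !ok || cerr.Err != sarama.ErrOffsetOutOfRange {
		log.Printf("Consumer Error: %v", err) // anything else: just log and keep going
		return c, nil
	}
	c.Close()                                               // drop the stalled session...
	return cluster.NewConsumer(brokers, group, topics, cfg) // ...and start a fresh one
}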

@danielnelson added this to the 1.5.0 milestone on Dec 8, 2017
@HristoMohamed
Author

Yep, no problemo =)

@danielnelson
Contributor

Here are the builds of 1.5.0-rc2.

@HristoMohamed
Author

I will test now and report back

@HristoMohamed
Author

It has been running for an hour and everything is OK so far.
I will report back tomorrow as well =)

@HristoMohamed
Author

HristoMohamed commented Dec 13, 2017

Right, I still get this error, albeit on a much smaller scale.
https://pastebin.com/BJRY2n8D

Over the last 24 hours I have only had this happen to 4 partitions, with a total loss of ~20k messages (I produce around 120 0000 messages per minute from my producers).
Dec 13 10:36:37 telecons01 telegraf[30152]: 2017-12-13T09:36:37Z E! Error in plugin [inputs.kafka_consumer]: Consumer Error: kafka: error while consuming telegrafcommon/110: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Dec 13 11:01:09 telecons01 telegraf[30152]: 2017-12-13T10:01:09Z E! Error in plugin [inputs.kafka_consumer]: Consumer Error: kafka: error while consuming telegrafcommon/99: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Dec 13 11:11:09 telecons01 telegraf[30152]: 2017-12-13T10:11:09Z E! Error in plugin [inputs.kafka_consumer]: Consumer Error: kafka: error while consuming telegrafcommon/105: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
Dec 13 11:16:33 telecons01 telegraf[30152]: 2017-12-13T10:16:33Z E! Error in plugin [inputs.kafka_consumer]: Consumer Error: kafka: error while consuming telegrafcommon/103: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.

@HristoMohamed
Author

I noticed something interesting: the affected partitions do indeed change.
Now my affected partitions are 105, 109, 99, and 76, whereas in the morning they were 110, 99, 105, and 103.
Also, the affected partitions did get consumed at some point.

@danielnelson reopened this on Dec 13, 2017
@danielnelson
Contributor

So now it is recovering without a restart, but you are still losing messages?

@danielnelson removed this from the 1.5.0 milestone on Dec 14, 2017
@HristoMohamed
Author

I am not losing messages so much as they are "late".
To put it this way:
Moment A:
  Partition   Lag
  Part 0      0
  Part 1      10000
  Part 2      5

Moment A1:
  Partition   Lag
  Part 0      10
  Part 1      11000
  Part 2      0

Now some time elapses and we get to:

Moment B:
  Partition   Lag
  Part 0      10000
  Part 1      0
  Part 2      0

It is not a major bug now that the messages do get consumed (whereas before 1.5.0-rc2 they were just stuck forever), just a tad annoying :)
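
As an aside, a minimal sketch of how the per-partition lag in those snapshots can be measured with the plain sarama client; consumerLag is a hypothetical helper, not part of Telegraf:

package kafkalag

import "github.com/Shopify/sarama"

// consumerLag returns the broker's newest offset minus the next offset the
// consumer group would read, i.e. the "Lag" column above.
func consumerLag(client sarama.Client, group, topic string, partition int32) (int64, error) {
	newest, err := client.GetOffset(topic, partition, sarama.OffsetNewest)
	if err != nil {
		return 0, err
	}
	om, err := sarama.NewOffsetManagerFromClient(group, client)
	if err != nil {
		return 0, err
	}
	defer om.Close()
	pom, err := om.ManagePartition(topic, partition)
	if err != nil {
		return 0, err
	}
	defer pom.Close()
	next, _ := pom.NextOffset() // second value is the commit metadata, unused here
	if next < 0 {               // nothing committed yet: sarama hands back the configured initial offset
		return 0, nil
	}
	return newest - next, nil
}

Because lag is measured against the group's committed offset, messages that are merely delayed show up as a temporary spike that later drains, rather than a permanent gap.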

@sspaink
Contributor

sspaink commented May 12, 2022

Closing, as there hasn't been any activity in this bug report for a long time. If you are still facing this issue, I recommend using the latest version of Telegraf and posting your configuration and any debug information. Thanks!

@sspaink closed this as completed on May 12, 2022