
[RFC] Implement a fixed-size buffer of points #285

Closed
daviesalex opened this issue Oct 19, 2015 · 11 comments

@daviesalex

We are running many thousands of telegraf clients with a frequency of collection in the low seconds. We would like to be able to run some collectors at least once per second and potentially more frequently (not currently supported by telegraf, and irrelevant until other scaling issues are resolved).

From my reading of the code, telegraf batches points from a given run, one per schedule. So if all your plugins run every second except for 5 that run every minute, you will have one batched post to InfluxDB every second and 6 on the 60th second.

From our observations, if InfluxDB is down these POSTs fail with minimal data stored locally, and most if not all data collected during a downtime window (or network partition) never makes it into the data store.

We would like to propose two changes:

  • Writing all data that is being sent to InfluxDB into a fixed-size, in-memory ring buffer on the client.
  • Setting a minimum time to flush to InfluxDB. For example, if you set it to 60 seconds, every 60 seconds all data in the buffer would be posted to InfluxDB in one batch and the buffer cleared. (A rough sketch of such a buffer is shown after this list.)
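
For concreteness, here is a minimal sketch of what such a fixed-size, FIFO-evicting buffer could look like in Go. The package name, Point type, and method names are all hypothetical illustrations, not Telegraf's actual code.

// Illustrative sketch only: the Point type and names here are hypothetical,
// not Telegraf's actual internals.
package ringbuffer

import "sync"

// Point stands in for a single collected metric.
type Point struct {
	Name   string
	Tags   map[string]string
	Fields map[string]interface{}
	Time   int64
}

// RingBuffer holds at most limit points; when full, the oldest point is
// dropped (FIFO eviction), so memory use stays bounded.
type RingBuffer struct {
	mu     sync.Mutex
	points []Point
	limit  int
}

func New(limit int) *RingBuffer {
	return &RingBuffer{limit: limit}
}

// Add appends a point, evicting the oldest one if the buffer is at capacity.
func (b *RingBuffer) Add(p Point) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.points) >= b.limit {
		b.points = b.points[1:] // drop the oldest point
	}
	b.points = append(b.points, p)
}

// Drain returns everything currently buffered and clears the buffer, e.g.
// once per flush interval for a single batched POST to InfluxDB.
func (b *RingBuffer) Drain() []Point {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := b.points
	b.points = nil
	return out
}

On a failed POST the drained batch could simply be re-added (subject to the same size limit), which gives the retry-until-evicted behaviour discussed later in this thread.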

We believe that this would enormously increase the average number of points per POST, which currently appears critical to scaling InfluxDB (particularly with clustering). Furthermore, it would minimize data loss in the case of network issues between clients and the InfluxDB server, and it is in keeping with the "eventually consistent" InfluxDB design.

For environments that require data almost immediately in the InfluxDB server, this minimum flush time could be set very low. For environments more concerned with eventually consistent data (within a few minutes), it could be set much higher. The default should likely be relatively low, since increasing it would be a tuning optimization for scale.


sparrc commented Oct 19, 2015

I agree that this would be a better design at scale.

The first thing that stands out to me about the design you're suggesting is the configurable minimum interval combined with a circular buffer. Does this mean you would like Telegraf to simply overwrite and drop metrics if more metrics are coming in than the buffer has room for?

cc @pauldix

@daviesalex

@sparrc, that is certainly an accurate summary of my suggestion (perhaps adding "with POST attempts retried at that minimum interval until they succeed, or until the writes fall out of the ring buffer (FIFO)").


sparrc commented Oct 19, 2015

Do you mean the ring buffer would hold batches of points, or individual points?

Right now I happen to be working on a change that gets us closer to what you're suggesting, but without sophisticated buffering.

Basically the accumulator is changing to only track a channel of points, which a separate goroutine reads into a slice. Then there will be a configurable flush_interval parameter that will default to 10s. So this would satisfy the desire to separate flushing from gathering, but more work will need to be done to add the buffering that you're describing here.
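
A rough sketch of the gather/flush split being described, assuming a single points channel drained by one flushing goroutine; runFlusher is an illustrative name, Point is the hypothetical type from the earlier sketch, and the standard library "time" package is required.

// Rough sketch of the pattern described above; not the actual accumulator code.
func runFlusher(points <-chan Point, flushInterval time.Duration, write func([]Point) error) {
	var batch []Point
	ticker := time.NewTicker(flushInterval) // e.g. the proposed flush_interval, default 10s
	defer ticker.Stop()
	for {
		select {
		case p, ok := <-points:
			if !ok {
				write(batch) // final flush when the accumulator channel closes
				return
			}
			batch = append(batch, p) // drain the channel into a slice
		case <-ticker.C:
			if len(batch) > 0 && write(batch) == nil {
				batch = nil // clear only after a successful write
			}
		}
	}
}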

I will try to add support for the write retries while I'm at it.

@daviesalex

That channel will be what, a buffered one? Presumably if flush_interval differs from the interval at which points are being collected, they have to be buffered somewhere.

Buffering is mostly to allow us to survive a brief network outage (or InfluxDB service event) without losing data (what happens today) or leaking memory (what would happen if the buffer were unbounded!). Other approaches may well get us to the same place though... a buffered channel that multiple goroutines write into, and that drops points when the buffer is full, would achieve this (except for the FIFO behaviour, but that's really just a detail).
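
As a sketch of that alternative, a plain buffered channel with a non-blocking send bounds memory and drops points when full, though it drops the newest point rather than the oldest, so it is not strictly the FIFO eviction described above. Point is again the hypothetical type from the earlier sketch; the capacity is an illustrative value.

// Sketch of "drop when full" behaviour using a plain buffered channel.
var points = make(chan Point, 10000) // fixed capacity keeps memory bounded

func addPoint(p Point) {
	select {
	case points <- p: // buffered; a flusher goroutine will pick it up
	default:
		// buffer full: drop the point rather than block collectors or grow without bound
	}
}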


sparrc commented Oct 20, 2015

We can buffer the channel, but the flusher will make every effort to get points off the channel and into a slice, which can be passed to and read by multiple output sinks. My thought was that the code that writes to the output sinks (see agent.flush) would retry a few times on failure.

I really don't want to see telegraf buffering a large amount of data; I feel that a moderate number of retries should be enough to get the data written.
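
A sketch of what such bounded retries might look like; maxRetries and the backoff are illustrative values, not actual Telegraf options, and "time" from the standard library is required.

// Sketch of bounded retries around an output write, as suggested above.
func flushWithRetries(batch []Point, write func([]Point) error) error {
	const maxRetries = 3
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = write(batch); err == nil {
			return nil
		}
		time.Sleep(time.Duration(attempt+1) * time.Second) // simple linear backoff
	}
	return err // give up after a moderate number of retries; this batch is lost
}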

One other thing I wanted to get your input on... I was going to add an option in Telegraf for adding a random jitter to the flushing interval. This way you could set up thousands of Telegraf instances all with the same flushing interval and config file, but they would write to the database at different times.

i.e., you would have something like

flushing_interval = "60s"
flushing_jitter = "10s"

which means that the output would flush after anywhere from 50 to 70 seconds.
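
A sketch of how that jitter could be computed; the function arguments mirror the flushing_interval/flushing_jitter settings above, but the implementation is purely illustrative and requires "math/rand" and "time".

// Sketch of applying a random jitter to the flush interval.
func nextFlushDelay(interval, jitter time.Duration) time.Duration {
	if jitter <= 0 {
		return interval
	}
	// random offset in [-jitter, +jitter), e.g. 60s ± 10s => flush after 50-70s
	offset := time.Duration(rand.Int63n(int64(2*jitter))) - jitter
	return interval + offset
}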

@daviesalex

So we did briefly go up to thousands of clients (although we are now back at ~1k). We actually found that overall the distribution was pretty good (slightly to our surprise; we expected a huge micro-burst every second or every ten seconds), because each client schedules relative to its startup time, which when n is large enough is almost entirely random.

So the jitter idea is a nice touch, but not as necessary as I had expected.

I agree that telegraf should not hold on to much data, although I can't see a problem with a few megabytes being buffered. In real-world environments, nodes get disconnected and InfluxDB servers go down briefly; it's not essential that no data ever be missed, but it would be nice to handle this gracefully.


sparrc commented Oct 20, 2015

Agreed, totally understand that non-100% uptime should be handled gracefully (this is the real world ;-))

I hope that simple retries can get us there, and @pauldix also had the idea that after a certain number of retries the lines could get written to disk and read on restart.

@daviesalex

Writing to disk is a further detail. I still think there should be a fixed-size buffer (I can think of lots of reasons why you would not want it to grow unbounded, even on disk!), and in most real-world environments I suspect that buffer will be so small you can keep it in memory and still achieve its goals. You could write it to disk (either regularly, or on shutdown), but I almost wonder if we are going a bit over the top... a buffer of a few tens of MB in RAM would be just fine for us, would gracefully handle pretty much any real-world outage we would expect, and we are quite happy to devote a bit of RAM to it. An on-disk buffer would also let us survive a network event and a simultaneous telegraf restart, but that (certainly for us) isn't something we are too worried about.

One detail that is essential: currently telegraf schedules relative to its start-up time (as mentioned above, we tested and verified this using packet captures). The new feature needs to keep this; you don't want thousands of clients doing anything exactly on the second. We have time sync across all our nodes to within tens of microseconds, so firing exactly on the second would trigger a monumental microburst (even with the jitter idea, if the jittered times still landed on the second). Many environments will be within 10ms, so the effect would be bad for them too. Doing everything based on the startup time of the process seems to work well, or just randomizing the fraction of a second.
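
A purely illustrative sketch of that sub-second randomization: each agent sleeps a random fraction of a second before starting its flush ticker, so time-synced fleets do not all POST in the same instant. Names are hypothetical; "math/rand" and "time" are required.

// Sketch of the "don't fire exactly on the second" idea.
func startFlushLoop(interval time.Duration, flush func()) {
	phase := time.Duration(rand.Int63n(int64(time.Second))) // random fraction of a second
	time.Sleep(phase)                                       // desynchronise the first tick
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		flush()
	}
}

Deriving the phase from the process startup time instead of rand would give the same effect, matching the behaviour observed in the packet captures.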

sparrc changed the title from "[RFC] Provide additional batching support to increase efficiency server side" to "[RFC] Implement a fixed-size buffer of points" on Oct 22, 2015

sparrc commented Oct 22, 2015

@daviesalex I'm going to re-open this case with a new title, since I still think there is a good argument for having a fixed-size buffer rather than simply retrying writes.

sparrc reopened this on Oct 22, 2015
@acherunilam

@sparrc Have we considered an on-disk ring buffer, TCollector-style?
Two concurrent threads: one writing metrics into the buffer, the other writing from the buffer to the configured output. What say?


sparrc commented Dec 2, 2015

@AdithyaBenny Yes, I have some thoughts on this; we probably need to make a buffer package within telegraf that would handle the ring buffer. Currently the accumulator just throws all metrics into a single channel that gets filtered into a single list for all the outputs to write.

What we probably want is a buffer that tracks a channel for each output, so that we can filter metrics (output filters) as they are added. This would also allow each output to not flush its metrics until it successfully writes them.

Once each output is filtering and holding its own buffer of points, it won't be very difficult to make that buffer on-disk rather than in-memory, although I think the default behavior should be to keep it in-memory.
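
A sketch of what such a per-output buffer could look like; the names are hypothetical and not the internal/models code that eventually closed this issue, and Point is the illustrative type from the first sketch.

// Sketch of a per-output buffer that filters on Add and only clears on a
// successful write, evicting the oldest point when full.
type OutputBuffer struct {
	write  func([]Point) error // this output's write function
	filter func(Point) bool    // this output's metric filter
	points []Point
	limit  int
}

// Add filters and buffers a point for this output, dropping the oldest point
// if the buffer is full so memory stays bounded.
func (o *OutputBuffer) Add(p Point) {
	if o.filter != nil && !o.filter(p) {
		return // filtered out for this output
	}
	if len(o.points) >= o.limit {
		o.points = o.points[1:]
	}
	o.points = append(o.points, p)
}

// Flush writes the buffered points and only clears them if the write
// succeeds, so metrics survive a failed flush until they age out.
func (o *OutputBuffer) Flush() error {
	if len(o.points) == 0 {
		return nil
	}
	if err := o.write(o.points); err != nil {
		return err
	}
	o.points = nil
	return nil
}

Keeping the buffer behind a small Add/Flush surface like this would also make an on-disk variant a drop-in replacement later.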

sparrc added a commit that referenced this issue Jan 22, 2016
Also moved some objects out of config.go and put them in their own
package, internal/models

fixes #568
closes #285
sparrc closed this as completed in 5349a3b on Jan 22, 2016