
[RFC] Implement a fixed-size buffer of points #285

Closed
daviesalex opened this issue Oct 19, 2015 · 11 comments

@daviesalex

We are running many thousands of telegraf clients with a frequency of collection in the low seconds. We would like to be able to run some collectors at least once per second and potentially more frequently (not currently supported by telegraf, and irrelevant until other scaling issues are resolved).

From my reading of the code, telegraf batches points from a given run, one per schedule. So if all your plugins run every second except for 5 that run every minute, you will have one batched post to InfluxDB every second and 6 on the 60th second.

From our observations, if InfluxDB is down these POSTs fail with minimal data stored locally, and most if not all data collected during a downtime window (or network partition) never makes it into the data store.

We would like to propose two changes:

  • Writing all data that is being sent to InfluxDB into a fixed-size, in-memory ring buffer on the client.
  • Setting a minimum time to flush to InfluxDB. For example, if you set it to 60 seconds, every 60 seconds all data in the buffer would be posted to InfluxDB in one batch and the buffer cleared. (A rough sketch of such a buffer is shown after this list.)
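
For concreteness, here is a minimal sketch of what such a fixed-size, FIFO-evicting buffer could look like in Go. The package name, Point type, and method names are all hypothetical illustrations, not Telegraf's actual code.

// Illustrative sketch only: the Point type and names here are hypothetical,
// not Telegraf's actual internals.
package ringbuffer

import "sync"

// Point stands in for a single collected metric.
type Point struct {
	Name   string
	Tags   map[string]string
	Fields map[string]interface{}
	Time   int64
}

// RingBuffer holds at most limit points; when full, the oldest point is
// dropped (FIFO eviction), so memory use stays bounded.
type RingBuffer struct {
	mu     sync.Mutex
	points []Point
	limit  int
}

func New(limit int) *RingBuffer {
	return &RingBuffer{limit: limit}
}

// Add appends a point, evicting the oldest one if the buffer is at capacity.
func (b *RingBuffer) Add(p Point) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.points) >= b.limit {
		b.points = b.points[1:] // drop the oldest point
	}
	b.points = append(b.points, p)
}

// Drain returns everything currently buffered and clears the buffer, e.g.
// once per flush interval for a single batched POST to InfluxDB.
func (b *RingBuffer) Drain() []Point {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := b.points
	b.points = nil
	return out
}

On a failed POST the drained batch could simply be re-added (subject to the same size limit), which gives the retry-until-evicted behaviour discussed later in this thread.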

We believe that this would enormously increase the average number of points per POST, which currently appears critical to scaling InfluxDB (particularly with clustering). Furthermore, it would minimize data loss in the case of network issues between clients and the InfluxDB server, and it is in keeping with the "eventually consistent" InfluxDB design.

For environments that require data almost immediately in the InfluxDB server, this minimum flush time could be set very low. For environments more concerned with eventually consistent data (within a few minutes), it could be set much higher. The default should likely be relatively low, since increasing it would be a tuning optimization for scale.


sparrc commented Oct 19, 2015

I agree that this would be a better design at scale.

The first thing that stands out to me about the design you're suggesting is the configurable minimum interval combined with a circular buffer. Does this mean you would like Telegraf to simply overwrite and drop metrics if more metrics are coming in than the buffer has room for?

cc @pauldix

@daviesalex

@sparrc, that is certainly an accurate summary of my suggestion (perhaps adding "with POST attempts retried at that minimum interval until they succeed, or until the writes fall out of the ring buffer (FIFO)").


sparrc commented Oct 19, 2015

Do you mean the ring buffer would hold batches of points, or individual points?

Right now I happen to be working on a change that gets us closer to what you're suggesting, but without sophisticated buffering.

Basically the accumulator is changing to only track a channel of points, which a separate goroutine reads into a slice. Then there will be a configurable flush_interval parameter that will default to 10s. So this would satisfy the desire to separate flushing from gathering, but more work will need to be done to add the buffering that you're describing here.
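
A rough sketch of the gather/flush split being described, assuming a single points channel drained by one flushing goroutine; runFlusher is an illustrative name, Point is the hypothetical type from the earlier sketch, and the standard library "time" package is required.

// Rough sketch of the pattern described above; not the actual accumulator code.
func runFlusher(points <-chan Point, flushInterval time.Duration, write func([]Point) error) {
	var batch []Point
	ticker := time.NewTicker(flushInterval) // e.g. the proposed flush_interval, default 10s
	defer ticker.Stop()
	for {
		select {
		case p, ok := <-points:
			if !ok {
				write(batch) // final flush when the accumulator channel closes
				return
			}
			batch = append(batch, p) // drain the channel into a slice
		case <-ticker.C:
			if len(batch) > 0 && write(batch) == nil {
				batch = nil // clear only after a successful write
			}
		}
	}
}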

I will try to add support for the write retries while I'm at it.

@daviesalex

That channel will be what, a buffered one? Presumably if flush_interval differs from the interval at which points are being collected, they have to be buffered somewhere.

Buffering is mostly to allow us to survive a brief network outage (or InfluxDB service event) without losing data (what happens today) or leaking memory (what would happen if the buffer were unbounded!). Other approaches may well get us to the same place though... a buffered channel that multiple goroutines write into, and that drops points when the buffer is full, would achieve this (except for the FIFO behaviour, but that's really just a detail).
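
As a sketch of that alternative, a plain buffered channel with a non-blocking send bounds memory and drops points when full, though it drops the newest point rather than the oldest, so it is not strictly the FIFO eviction described above. Point is again the hypothetical type from the earlier sketch; the capacity is an illustrative value.

// Sketch of "drop when full" behaviour using a plain buffered channel.
var points = make(chan Point, 10000) // fixed capacity keeps memory bounded

func addPoint(p Point) {
	select {
	case points <- p: // buffered; a flusher goroutine will pick it up
	default:
		// buffer full: drop the point rather than block collectors or grow without bound
	}
}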


sparrc commented Oct 20, 2015

We can buffer the channel, but the flusher will make every effort to get points off the channel and into a slice, which can be passed to and read by multiple output sinks. My thought was that the code that writes to the output sinks (see agent.flush) would retry a few times on failure.

I really don't want to see telegraf buffering a large amount of data; I feel that a moderate number of retries should be enough to get the data written.
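
A sketch of what such bounded retries might look like; maxRetries and the backoff are illustrative values, not actual Telegraf options, and "time" from the standard library is required.

// Sketch of bounded retries around an output write, as suggested above.
func flushWithRetries(batch []Point, write func([]Point) error) error {
	const maxRetries = 3
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = write(batch); err == nil {
			return nil
		}
		time.Sleep(time.Duration(attempt+1) * time.Second) // simple linear backoff
	}
	return err // give up after a moderate number of retries; this batch is lost
}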

One other thing I wanted to get your input on... I was going to add an option in Telegraf for adding a random jitter to the flushing interval. This way you could set up thousands of Telegraf instances all with the same flushing interval and config file, but they would write to the database at different times.

i.e., you would have something like

flushing_interval = "60s"
flushing_jitter = "10s"

which means that the output would flush after anywhere from 50 to 70 seconds.
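
A sketch of how that jitter could be computed; the function arguments mirror the flushing_interval/flushing_jitter settings above, but the implementation is purely illustrative and requires "math/rand" and "time".

// Sketch of applying a random jitter to the flush interval.
func nextFlushDelay(interval, jitter time.Duration) time.Duration {
	if jitter <= 0 {
		return interval
	}
	// random offset in [-jitter, +jitter), e.g. 60s ± 10s => flush after 50-70s
	offset := time.Duration(rand.Int63n(int64(2*jitter))) - jitter
	return interval + offset
}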

@daviesalex

So we did briefly go up to thousands of clients (although we are now back at ~1k). We actually found that overall the distribution was pretty good (slightly to our surprise; we expected a huge micro-burst every second or every ten seconds), because each client schedules relative to its startup time, which when n is large enough is almost entirely random.

So the jitter idea is a nice touch, but not as necessary as I had expected.

I agree that telegraf should not hold on to much data, although I can't see a problem with a few megabytes being buffered. In real-world environments, nodes get disconnected and InfluxDB servers go down briefly; it's not essential that no data ever be missed, but it would be nice to handle this gracefully.


sparrc commented Oct 20, 2015

Agreed, totally understand that non-100% uptime should be handled gracefully (this is the real world ;-))

I hope that simple retries can get us there, and @pauldix also had the idea that after a certain number of retries the lines could get written to disk and read on restart.

@daviesalex

Writing to disk is a further detail. I still think there should be a fixed-size buffer (I can think of lots of reasons why you would not want it to grow unbounded, even on disk!), and in most real-world environments I suspect that buffer will be so small you can keep it in memory and still achieve its goals. You could write it to disk (either regularly, or on shutdown), but I almost wonder if we are going a bit over the top... a buffer of a few tens of MB in RAM would be just fine for us, would gracefully handle pretty much any real-world outage we would expect, and we are quite happy to devote a bit of RAM to it. An on-disk buffer would also let us survive a network event and a simultaneous telegraf restart, but that (certainly for us) isn't something we are too worried about.

One detail that is essential: currently telegraf schedules relative to its start-up time (as mentioned above, we tested and verified this using packet captures). The new feature needs to keep this; you don't want thousands of clients doing anything exactly on the second. We have time sync across all our nodes to within tens of microseconds, so firing exactly on the second would trigger a monumental microburst (even with the jitter idea, if the jittered times still landed on the second). Many environments will be within 10ms, so the effect would be bad for them too. Doing everything based on the startup time of the process seems to work well, or just randomizing the fraction of a second.
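
A purely illustrative sketch of that sub-second randomization: each agent sleeps a random fraction of a second before starting its flush ticker, so time-synced fleets do not all POST in the same instant. Names are hypothetical; "math/rand" and "time" are required.

// Sketch of the "don't fire exactly on the second" idea.
func startFlushLoop(interval time.Duration, flush func()) {
	phase := time.Duration(rand.Int63n(int64(time.Second))) // random fraction of a second
	time.Sleep(phase)                                       // desynchronise the first tick
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		flush()
	}
}

Deriving the phase from the process startup time instead of rand would give the same effect, matching the behaviour observed in the packet captures.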

sparrc changed the title from "[RFC] Provide additional batching support to increase efficiency server side" to "[RFC] Implement a fixed-size buffer of points" on Oct 22, 2015

sparrc commented Oct 22, 2015

@daviesalex I'm going to re-open this case with a new title, since I still think there is a good argument for having a fixed-size buffer rather than simply retrying writes.

sparrc reopened this on Oct 22, 2015
@acherunilam

@sparrc Have we considered an on-disk ring buffer, TCollector-style?
Two concurrent threads: one writing metrics into the buffer, the other writing from the buffer to the configured output. What say?


sparrc commented Dec 2, 2015

@AdithyaBenny Yes, I have some thoughts on this; we probably need to make a buffer package within telegraf that would handle the ring buffer. Currently the accumulator just throws all metrics into a single channel that gets filtered into a single list for all the outputs to write.

What we probably want is a buffer that tracks a channel for each output, so that we can filter metrics (output filters) as they are added. This would also allow each output to not flush its metrics until it successfully writes them.

Once each output is filtering and holding its own buffer of points, it won't be very difficult to make that buffer on-disk rather than in-memory, although I think the default behavior should be to keep it in-memory.
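
A sketch of what such a per-output buffer could look like; the names are hypothetical and not the internal/models code that eventually closed this issue, and Point is the illustrative type from the first sketch.

// Sketch of a per-output buffer that filters on Add and only clears on a
// successful write, evicting the oldest point when full.
type OutputBuffer struct {
	write  func([]Point) error // this output's write function
	filter func(Point) bool    // this output's metric filter
	points []Point
	limit  int
}

// Add filters and buffers a point for this output, dropping the oldest point
// if the buffer is full so memory stays bounded.
func (o *OutputBuffer) Add(p Point) {
	if o.filter != nil && !o.filter(p) {
		return // filtered out for this output
	}
	if len(o.points) >= o.limit {
		o.points = o.points[1:]
	}
	o.points = append(o.points, p)
}

// Flush writes the buffered points and only clears them if the write
// succeeds, so metrics survive a failed flush until they age out.
func (o *OutputBuffer) Flush() error {
	if len(o.points) == 0 {
		return nil
	}
	if err := o.write(o.points); err != nil {
		return err
	}
	o.points = nil
	return nil
}

Keeping the buffer behind a small Add/Flush surface like this would also make an on-disk variant a drop-in replacement later.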

sparrc added a commit that referenced this issue Jan 22, 2016
Also moved some objects out of config.go and put them in their own
package, internal/models

fixes #568
closes #285
sparrc closed this as completed in 5349a3b on Jan 22, 2016