
telegraf needs better support for partial failure/success #1446

Closed
phemmer opened this issue Jul 3, 2016 · 24 comments

@phemmer
Contributor

phemmer commented Jul 3, 2016

Feature Request

Proposal:

The telegraf code needs a better way of supporting plugins with partial success/failure.

Current behavior:

Currently, partial success/failure is handled (or not handled) in numerous ways, many of which are very buggy. I started going through the plugins looking for issues (see #1440, #1441, #1442, #1443, #1444, #1445), but stopped because of how prevalent they are.

There seem to be four general categories that plugins fall into when they have the possibility of partial success/failure.

  1. The plugin captures the first error, returns it, and discards all the rest.
  2. The plugin gathers all errors and concatenates them together.
  3. The plugin attempts to capture one error, but fails to do so properly (race condition in dovecot plugin #1440, Race condition in mongodb plugin #1442, Race condition in nginx plugin #1444, race condition in nsq plugin #1445); a minimal sketch of this pattern follows the list.
  4. The plugin returns on the first error it encounters, skipping any further measurements which might actually succeed (dns_query fails all hosts when just one fails #1439, Memcached plugin fails all servers if one fails #1441, mysql plugin fails all servers if one fails #1443).
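
To make category 3 concrete, here is a minimal sketch of the pattern (a made-up plugin, not code copied from any of the issues above): several goroutines assign to one shared error variable with no synchronization, so the writes race and at most one error survives.

```go
package example

import (
	"sync"

	"github.com/influxdata/telegraf"
)

// Example is a hypothetical input plugin used only to illustrate the pattern.
type Example struct {
	Servers []string
}

func (e *Example) gatherServer(server string, acc telegraf.Accumulator) error {
	// ... collect metrics from one server and call acc.AddFields ...
	return nil
}

// Gather fans out one goroutine per server; each goroutine writes to the same
// outerr variable without synchronization, which is both a data race and a
// silent discard of all but (at most) one error.
func (e *Example) Gather(acc telegraf.Accumulator) error {
	var wg sync.WaitGroup
	var outerr error // shared across goroutines: racy

	for _, server := range e.Servers {
		wg.Add(1)
		go func(server string) {
			defer wg.Done()
			outerr = e.gatherServer(server, acc) // unsynchronized write
		}(server)
	}

	wg.Wait()
	return outerr // which error (if any) survives is undefined
}
```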

Desired behavior:

We instead need some simplified way for plugins to report multiple errors back to telegraf core.
I think the simplest way might be to just add an AddError() method to telegraf.Accumulator.
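
For illustration, the addition could look something like this (just a sketch of the proposal, not existing code):

```go
package telegraf

// Accumulator is what plugins write metrics to; only the proposed addition is
// shown here, the existing Add* methods are elided.
type Accumulator interface {
	// ... existing methods (AddFields, etc.) ...

	// AddError reports an error from a plugin to the core. A plugin can call
	// it several times per Gather instead of collapsing everything into
	// Gather's single return value.
	AddError(err error)
}
```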

The other option might be to expose a general purpose logger on the accumulator. This might be better, as plugins could then log things at different log levels (error, warning, info, debug, etc.), and when a user wants to troubleshoot an issue, they can just bump up the log level.

Use case:

Make it easier (and more consistent) for plugin developers to handle errors, and less likely for them to introduce buggy code.

@sparrc
Contributor

sparrc commented Jul 4, 2016

I like the idea of having a function on the accumulator. PRs are welcome.

@phemmer
Contributor Author

phemmer commented Jul 5, 2016

Have a (very lightly tested) implementation at: master...phemmer:add_error

I'm not sure about it though. It's a little messy because the places where accumulator channels (shutdown, ticker, metrics) are created and read, and where input errors are handled, are scattered around, and it isn't implemented the same way between service inputs, normal inputs, and agent.Test.

This also maintains support for the existing Accumulator.Gather() error interface. Should that be kept, or should we force all plugins to use Accumulator.AddError()?

@kaey
Contributor

kaey commented Jul 5, 2016

This also maintains support for the existing Accumulator.Gather() error interface. Should that be kept, or should we force all plugins to use Accumulator.AddError()?

The latter seems better, to avoid confusion about which error a plugin should return from Gather if it has already used AddError().

I will help update plugins as necessary.

@phemmer phemmer mentioned this issue Jul 8, 2016
@phemmer
Contributor Author

phemmer commented Jul 8, 2016

@sparrc Does influxdata/influxdb#6976 affect your decision any? If influxdb supports a logger with multiple log levels per-subsystem, should telegraf also support such a thing to be consistent between projects? Would this replace the need for AddError if each plugin has its own logger?

@sparrc
Contributor

sparrc commented Jul 18, 2016

@phemmer I think one way you can clean up your code would be to make the errorC channel internal to the accumulator. That way you don't need to change the call signature of NewAccumulator. You can simply spin off a goroutine when NewAccumulator is called, which monitors an internal chan error.
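
Roughly something like this, just to sketch the shape (names are illustrative, not actual code):

```go
package agent

import (
	"log"

	"github.com/influxdata/telegraf"
)

// accumulator keeps the error channel private, so NewAccumulator's call
// signature doesn't change and callers never see the channel.
type accumulator struct {
	metrics chan telegraf.Metric
	errC    chan error
}

func NewAccumulator(metrics chan telegraf.Metric) *accumulator {
	a := &accumulator{
		metrics: metrics,
		errC:    make(chan error, 100), // small buffer so bursts of errors don't immediately block Gather
	}
	// Spun off at construction time; it only drains and logs errors.
	go func() {
		for err := range a.errC {
			log.Printf("ERROR: %s", err)
		}
	}()
	return a
}

// AddError reports a non-fatal error without aborting the current Gather.
func (a *accumulator) AddError(err error) {
	if err == nil {
		return
	}
	a.errC <- err
}
```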

Also, two things to note: (1) Gather() of service inputs is always a dummy call; errors are handled internally by each plugin, and (2) don't worry at all about the Test() function; it has no real utility, so it can just be made to conform to whatever makes sense for the real gathering routines.

I would support replacing the interface to return errors from Gather with an AddError function call.

I'm not sure each plugin should have its own logger, and I'm also not sure if this change would prevent us from doing that in the future anyways.

@phemmer
Contributor Author

phemmer commented Jul 19, 2016

I'm not sure each plugin should have its own logger, and I'm also not sure if this change would prevent us from doing that in the future anyways.

It wouldn't prevent it, no, but it would make AddError unnecessary: the only thing done with a received error is to log it.

I'll formalize the code and put up a PR (with AddError).

@phemmer
Contributor Author

phemmer commented Jul 20, 2016

Just adding another note, more as a reminder for myself: The jolokia plugin will need to be adjusted to remove the fmt.Printf() calls and use the AddError interface instead. Currently it's logging messages such as:

Error handling response: Not expected status value in response body: 400

This is very confusing, as nothing in the message indicates that it came from jolokia.
We will also need to scan through other plugins for usages of fmt.Printf() and fmt.Fprintf(), and perhaps usages of the log package as well.
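
For example, the call site producing that message could change roughly like this (paraphrased; handleResponse and the exact wording are made up, not the actual jolokia code):

```go
package jolokia

import (
	"fmt"

	"github.com/influxdata/telegraf"
)

// handleResponse is only a paraphrase: the point is to route the message
// through acc.AddError so it gets attributed to the plugin, instead of being
// printed straight to stdout.
func handleResponse(acc telegraf.Accumulator, status int) {
	if status != 200 {
		// Before: fmt.Printf("Error handling response: Not expected status value in response body: %d\n", status)
		acc.AddError(fmt.Errorf("error handling response: unexpected status value in response body: %d", status))
	}
}
```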

@sparrc
Contributor

sparrc commented Jul 20, 2016

@phemmer if you are working on this code, please submit it in stages.

i.e., just submit the addition of the AddError function before changing every plugin please :)

@sparrc
Contributor

sparrc commented Jul 25, 2016

re: changing the Gather(acc) error interface:

I'm not sure we should change this interface; I think it's OK to leave it as-is and document that returning an error will be treated as a "fatal" error, exiting telegraf.

Otherwise users should be encouraged to use acc.AddError. I think this will be easy to do with documentation and examples.

The reason behind this is that there needs to be a way to indicate "fatal" errors, and returning them to the accumulator doesn't seem like the right way to do it. To me they should get returned directly to the "agent", which is calling the Gather() function.
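
In other words, a typical multi-target plugin would look roughly like this (just a sketch assuming AddError lands as proposed; the plugin and helper names are made up):

```go
package example

import (
	"fmt"

	"github.com/influxdata/telegraf"
)

// MultiServer is a made-up input with several targets, showing the split:
// per-target problems go to acc.AddError, and only an error the plugin treats
// as fatal is returned from Gather.
type MultiServer struct {
	Servers []string
}

func (m *MultiServer) Gather(acc telegraf.Accumulator) error {
	for _, server := range m.Servers {
		// A failing server is reported and skipped; the remaining servers are
		// still gathered, which is the partial-success behavior we want.
		if err := m.gatherServer(server, acc); err != nil {
			acc.AddError(fmt.Errorf("%s: %s", server, err))
		}
	}
	// Returning nil means "nothing fatal"; a returned error would be the
	// "fatal" signal discussed above, however the agent ends up treating it.
	return nil
}

func (m *MultiServer) gatherServer(server string, acc telegraf.Accumulator) error {
	// ... connect to the server, collect fields, call acc.AddFields ...
	return nil
}
```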

@phemmer
Contributor Author

phemmer commented Jul 25, 2016

document that returning an error will be treated as a "fatal" error, exiting telegraf.

I don't think I'm fond of this idea. If just one plugin goes bad, this would stop all plugins, not just the one. What about reinitializing that plugin instead?

@sparrc
Contributor

sparrc commented Jul 25, 2016

If a plugin is misconfigured or experiences a fatal error, I think telegraf should fail and exit. Not sure what you mean by "going bad", but if telegraf is being run as a service it will get reloaded anyways.

@phemmer
Contributor Author

phemmer commented Jul 25, 2016

If a plugin is misconfigured or experiences a fatal error, I think telegraf should fail and exit.

If a plugin is misconfigured, telegraf shouldn't even start. A separate mechanism should be provided for a config check (#1438). Having Gather() return a fatal error allows telegraf to die after it's already up and running.

Not sure what you mean by "going bad"

It sounds like you want telegraf to exit if a plugin goes into an unrecoverable state.

if telegraf is being run as a service it will get reloaded anyways.

Not all service managers automatically restart telegraf if it dies.

@sparrc
Contributor

sparrc commented Jul 25, 2016

seems to me like it would be difficult to anticipate all runtime errors ahead-of-time.

"reloading" of a plugin sounds like a simple concept but is actually quite difficult and dangerous. What if the plugin has spun off hung processes? What if it has unclosable connections still open?

The only way to guarantee that some of these get cleaned up is by restarting the process. This is one of the drawbacks of a "monolith" style of architecture, which Telegraf certainly is. We are not spinning up plugins as independent processes: they all run inside the same process, and thus fatal errors can happen, and it's impossible to completely isolate and protect one plugin from another.

@sparrc
Contributor

sparrc commented Jul 25, 2016

And by the way, I'm not advocating for a change in behavior at the moment; currently we do not exit on fatal errors, and we can continue with the same behavior.

@phemmer
Contributor Author

phemmer commented Jul 25, 2016

What if the plugin has spun off hung processes?

A restart won't fix anything. The processes will still be hung.

What if it has unclosable connections still open?

There's no such thing as an unclosable connection. A connection can always be closed. It might go into a TIME_WAIT state, but this only lasts a little while.

 

I just don't like the idea of one part of my monitoring system failing to be able to cause the whole thing to fail. I would rather have some metrics than no metrics.

@sparrc
Contributor

sparrc commented Jul 25, 2016

Restarting the process will kill child processes, so that would cleanup the hung processes.

From a kernel perspective there are no closable connections. From a programming perspective there are many client libraries, and even Go idioms, which don't provide proper Close() methods.

The benefit of failing is that users cannot complain about "bad" or "wrong" telegraf behavior. If the process is failing then the user obviously has the responsibility to fix the error. If the process is not failing, and is just logging or internally trying to clean itself up, then it's telegraf's fault for not fixing itself. Mainly I worry that it gets into an "unwinnable" state where it can't really do anything to fix itself.

@phemmer
Contributor Author

phemmer commented Jul 25, 2016

Restarting the process will kill child processes

No it won't. Restarting will abandon the child processes.

From a kernel perspective there are no closable connections.

This doesn't make any sense. Can you clarify what you mean?
From man 2 close, the only possible failures of a close() call are:

EBADF - fd isn't a valid open file descriptor.
EINTR - The close() call was interrupted by a signal; see signal(7).
EIO - An I/O error occurred.

EBADF - obviously already closed
EINTR - just have to retry the call
EIO - only relevant to file-based I/O; I've never heard of socket I/O generating this error.

The benefit of failing is that users cannot complain about "bad" or "wrong" telegraf behavior.

I disagree. If a telegraf plugin has some sort of issue where it continuously fails, preventing telegraf from staying running, I would describe this as bad behavior.

If the process is failing then the user obviously has the responsibility to fix the error.

By "failing" I assume you mean exiting with error code. And if so, I would argue that it's the users responsibility to restart telegraf, not telegraf restart itself.

If the process is not failing, and is just logging or internally trying to clean itself up, then it's telegraf's fault for not fixing itself.

I don't think one can argue that it's ever a process's responsibility to fix itself. Self-heal mechanisms are next to unheard of. When applications get into a bad state, but are still partially operational, they log it. The admin then has to come along and fix it. I think that trying to self-heal will cause more problems than it solves.

@sparrc
Contributor

sparrc commented Jul 25, 2016

sorry, I meant to say "from a kernel perspective, there are no _un_closable connections"

@sparrc
Contributor

sparrc commented Jul 25, 2016

I disagree. If a telegraf plugin has some sort of issue where it continuously fails, preventing telegraf from staying running, I would describe this as bad behavior.

That's my entire point: if it's failing, it's either a bug that should be fixed or a user error. Why do you think it should cover up a bug? Telegraf should fail if it's failing.

By "failing" I assume you mean exiting with error code. And if so, I would argue that it's the users responsibility to restart telegraf, not telegraf restart itself.

I'm not advocating for telegraf to exit on every failure, I'm advocating for Telegraf to exit on fatal errors.

I don't think one can argue that it's ever a process's responsibility to fix itself. Self-heal mechanisms are next to unheard of. When applications get into a bad state, but are still partially operational, they log it. The admin then has to come along and fix it. I think that trying to self-heal will cause more problems than it solves.

CORRECT! That is what AddError does. There are still fatal errors that can occur upon initialization. There are also panics, which can be recovered from properly and then exited upon.

@phemmer
Contributor Author

phemmer commented Jul 25, 2016

Obviously neither of us is going to back down from our viewpoint, so how about an option to control the behavior? As long as I have the ability to disable it, I'll be happy.

@kaey
Contributor

kaey commented Jul 25, 2016

returning an error will be treated as a "fatal" error, exiting telegraf.

Please don't. I'm running telegraf on FreeBSD, which is not very well supported. The ZFS zpool plugin, for example:

  - FreeBSD 10.3 - works
  - 10.0 - no -p flag
  - 9.3 - zpool list shows fewer columns (no frag and expandsz)
  - 9.1 - no -p and fewer columns

You can't possibly catch all of those cases in development on every OS version I have.
Exiting the monitoring process because of one failing plugin is wrong.
What it should do is report the error to the central server (influxdb) and stop running Gather() on the fatally failed plugin.
It's the admin's job to clean up any possible mess later.
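
Something like this hypothetical wrapper in the agent is what I mean by "stop running Gather()" (sketch only; nothing like it exists today, and it only covers the disabling half, not reporting to the server):

```go
package agent

import (
	"log"
	"sync/atomic"

	"github.com/influxdata/telegraf"
)

// disablingInput is a hypothetical wrapper: once the wrapped plugin returns an
// error from Gather, the agent stops calling it instead of exiting the whole
// process.
type disablingInput struct {
	telegraf.Input
	disabled int32 // set to 1 once the plugin has failed fatally
}

func (d *disablingInput) Gather(acc telegraf.Accumulator) error {
	if atomic.LoadInt32(&d.disabled) == 1 {
		return nil // plugin is parked; everything else keeps running
	}
	if err := d.Input.Gather(acc); err != nil {
		atomic.StoreInt32(&d.disabled, 1)
		log.Printf("ERROR: input failed fatally and was disabled: %s", err)
	}
	return nil
}
```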

@sparrc
Contributor

sparrc commented Jul 25, 2016

Correct, I am talking about fatal errors... In some ways we are arguing for the same thing. I am saying that plugins should be encouraged to use AddError(err) (and probably should almost always use it).

But as @kaey said, stopping Gather on a particular plugin cannot be done via AddError.

So returning an error from Gather would be a way to handle "fatal" errors. I was suggesting that failing the process could be one way to deal with them, but simply exiting a particular plugin could be another way as well. Leaving the interface as it is leaves it open to finding the best solution, because I don't think we know what that is at the moment.

@kaey
Contributor

kaey commented Aug 14, 2016

@sparrc So what's the resolution? Return err on fatal error? Should we start updating plugins or wait until after 1.0 is released?

@sparrc
Contributor

sparrc commented Aug 15, 2016

wait until after 1.0
