
Scalability overhaul of [inputs.vsphere] #5113

Merged (20 commits) · Dec 28, 2018

Conversation

@prydin (Contributor) commented Dec 5, 2018

Required for all PRs:

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

This PR resolves some issues found during scalability testing and testing under heavy load. The following changes have been made:

  • Added an absolute maximum of 100,000 metrics per query to avoid query-size overrun errors at vCenter.
  • Added a lookback window when querying for metrics to avoid missing metrics that vCenter posted late due to high load.
  • Moved all calls to vCenter into client.go and made sure they have the correct timeout value configured in their contexts.
  • Instead of querying every resource for metric metadata, only a subset (100) is sampled, since the metadata is very likely to be identical across resources.
  • When no wildcards are present in the include clause and no excludes are specified, we just look up the metric names and skip metadata discovery entirely.
  • Removed the WorkerPool and replaced it with a ThrottledExecutor to simplify the concurrency logic (a sketch of the idea follows this list).
  • Fixed a bug that caused a deadlock when more than 10 errors were detected during a single round of collection.
  • Changed the default value of cluster_instances to false, since it was causing problems and clusters don't have instance metrics anyway.
  • Changed collection of different resource types to run in separate goroutines.
  • Cleaned up error handling to funnel all error reporting through Accumulator.AddError.
  • Changed the default timeout from 20s to 60s.
  • Slightly simplified the chunking code.
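
For readers unfamiliar with the throttled-executor and chunking ideas mentioned above, here is a minimal, self-contained sketch in Go. The names (ThrottledExecutor, maxMetricsPerQuery, the fake query in main) are illustrative assumptions, not the plugin's actual API:

    package main

    import (
        "context"
        "fmt"
        "sync"
    )

    // ThrottledExecutor caps the number of concurrently running jobs using
    // a buffered channel as a counting semaphore.
    type ThrottledExecutor struct {
        sem chan struct{}
        wg  sync.WaitGroup
    }

    func NewThrottledExecutor(limit int) *ThrottledExecutor {
        return &ThrottledExecutor{sem: make(chan struct{}, limit)}
    }

    // Run schedules job on its own goroutine; at most cap(sem) jobs execute
    // at once, and a cancelled context abandons work still waiting its turn.
    func (t *ThrottledExecutor) Run(ctx context.Context, job func()) {
        t.wg.Add(1)
        go func() {
            defer t.wg.Done()
            select {
            case t.sem <- struct{}{}:
                defer func() { <-t.sem }()
                job()
            case <-ctx.Done():
            }
        }()
    }

    // Wait blocks until every submitted job has finished.
    func (t *ThrottledExecutor) Wait() { t.wg.Wait() }

    func main() {
        const maxMetricsPerQuery = 3 // the plugin's cap is much larger (100,000)

        metrics := []string{"cpu.usage", "mem.active", "disk.read", "disk.write", "net.rx", "net.tx", "cpu.ready"}
        te := NewThrottledExecutor(2) // at most two queries in flight

        // Chunk the metric list so no single query exceeds the cap, then
        // hand each chunk to the executor.
        for start := 0; start < len(metrics); start += maxMetricsPerQuery {
            end := start + maxMetricsPerQuery
            if end > len(metrics) {
                end = len(metrics)
            }
            chunk := metrics[start:end]
            te.Run(context.Background(), func() {
                fmt.Println("querying chunk:", chunk)
            })
        }
        te.Wait()
    }

In this sketch the buffered channel bounds concurrency, submission never blocks the caller, and Wait provides the join point once every chunk has been handed off.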

Addresses the following issues:

#5057 Vphere-Input for Telegraf stops working
#5041 [Inputs.vsphere] Error in plugin: ServerFaultCode: XML document element count exceeds configured maximum 500000
#5037 [inputs.vsphere]: Error in plugin: ServerFaultCode: This operation is restricted by the administrator
#5034 [input.vsphere] Option query for maxQueryMetrics failed. Using default

@danielnelson modified the milestones: 1.9.1, 1.10.0 (Dec 10, 2018)
Review threads (resolved):
  plugins/inputs/vsphere/client.go (×3)
  plugins/inputs/vsphere/endpoint.go
  plugins/inputs/vsphere/vsphere_test.go
The review discussion below concerns this panic-recovery handler (the enclosing defer is reconstructed here for context):

    defer func() {
        if p := recover(); p != nil {
            log.Printf("E! [input.vsphere] PANIC (recovered): %s", p)
        }
    }()
Contributor:

We have a strict no-panic-handling rule in the plugins; if this is required in order to work, it will need to be updated to work without panicking.

Contributor (Author):

I added this because much of the data collection happens in separate goroutines which, if they panic, will crash the entire process unless the panic is recovered. I can remove this, but there's a chance panics will crash the entire Telegraf process.

Contributor (Author):

Some more detail: Telegraf gracefully handles panics in the main thread, which is great. However, panics in goroutines will crash Telegraf for the reason that's illustrated in this example: https://play.golang.org/p/jQHzZm5zMQB

This is caused by the fact that panic handlers don't cross goroutine boundaries.
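
A minimal, self-contained illustration of that point (not the linked playground code verbatim; the panic message is just an example): a recover deferred in the calling function does not catch a panic raised in a goroutine that function started, so each goroutine needs its own deferred recover.

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        // This recover never fires for the panic below, because the panic
        // happens on a different goroutine.
        defer func() {
            if p := recover(); p != nil {
                fmt.Println("recovered in main:", p)
            }
        }()

        var wg sync.WaitGroup
        wg.Add(1)
        go func() {
            defer wg.Done()
            // Remove this per-goroutine recover and the panic takes down
            // the whole process, despite the handler in main.
            defer func() {
                if p := recover(); p != nil {
                    fmt.Println("recovered inside goroutine:", p)
                }
            }()
            panic("mismatched result arrays")
        }()
        wg.Wait()
    }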

I get that you want to handle panics in a single place so you can log them correctly, but I also would like to make sure Telegraf doesn't crash due to temporary glitches. The reason I implemented this in the first place was a strange, very intermittent problem related to vCenter. It's supposed to return two arrays that are guaranteed to have the same length. At some point, the arrays came back with one being longer than the other, which caused a panic. Of course, I should have checked the length and not assumed that they would be equal, but this illustrates the problem with intermittent panics that are hard to predict and that should, in my opinion, be handled in a graceful and recoverable way.

OK. I'll get off the soap box now. If you want me to toss out the panic handler, I will do so.
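
As an aside, a short sketch of the defensive length check described above; the sample type, its fields, and zipSamples are hypothetical stand-ins, not the plugin's actual data structures:

    package main

    import (
        "fmt"
        "time"
    )

    // sample pairs a timestamp with a value, standing in for the two
    // parallel arrays vCenter is expected to return with equal lengths.
    type sample struct {
        ts    time.Time
        value int64
    }

    // zipSamples refuses to index blindly: a length mismatch becomes an
    // error the caller can report instead of a panic.
    func zipSamples(timestamps []time.Time, values []int64) ([]sample, error) {
        if len(timestamps) != len(values) {
            return nil, fmt.Errorf("mismatched result lengths: %d timestamps vs %d values",
                len(timestamps), len(values))
        }
        out := make([]sample, len(timestamps))
        for i := range timestamps {
            out[i] = sample{ts: timestamps[i], value: values[i]}
        }
        return out, nil
    }

    func main() {
        _, err := zipSamples([]time.Time{time.Now()}, nil)
        fmt.Println(err) // mismatched result lengths: 1 timestamps vs 0 values
    }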

Contributor:

Recovery in many cases isn't really possible due to resource leaks, and there will always be goroutines that you don't control in libraries. I also don't want to hide programming errors by handling them. I think the Go best practice here is to only handle panics if you are expecting them to occur for some reason.

I'm also torn on whether I should change the existing panic handling to log where to report bugs and then re-panic to crash Telegraf, or just remove it altogether. For now I've left it as is, but I remember noticing that it doesn't actually work very well.

So yeah, could you please remove the panic handling?

@prydin (Contributor, Author) commented Dec 28, 2018:

OK, what you're saying makes sense. My concern was consistency, since removing the panic handler means that some parts of the code are "safe" from panics and some aren't (i.e. code that runs in goroutines).

Anyway, I'll drop the handling now. Stand by!

Contributor (Author):

Done!

Review threads (resolved):
  plugins/inputs/vsphere/throttled_exec.go
  plugins/inputs/vsphere/endpoint.go
@danielnelson merged commit 78c1ffb into influxdata:master on Dec 28, 2018
trevorwhitney pushed a commit to trevorwhitney/telegraf that referenced this pull request Feb 14, 2019
otherpirate pushed a commit to otherpirate/telegraf that referenced this pull request Mar 15, 2019
dupondje pushed a commit to dupondje/telegraf that referenced this pull request Apr 22, 2019
bitcharmer pushed a commit to bitcharmer/telegraf that referenced this pull request Oct 18, 2019
athoune pushed a commit to bearstech/telegraf that referenced this pull request Apr 17, 2020
Labels: area/vsphere, fix
3 participants