vSphere input plugin: Panic if network error when calling a specific vCenter method #4764

prydin · 2018-09-27T18:55:53Z

Relevant telegraf.conf:

# Read metrics from VMware vCenter
[[inputs.vsphere]]
  interval = "20s"
  ## List of vCenter URLs to be monitored. These three lines must be uncommented
  ## and edited for the plugin to work.
  vcenters = [ "https://example.com:8989/sdk" ]
  username = "user@corp.local"
  password = "secret"

  ## VMs
  ## Typical VM metrics (if omitted or empty, all metrics are collected)
  vm_metric_include = ["*"]
  
  # vm_instances = true ## true by default

  ## Hosts
  ## Typical host metrics (if omitted or empty, all metrics are collected)
  host_metric_include = [ "*" ]
  
  # host_metric_exclude = [] ## Nothing excluded by default
  # host_instances = true ## true by default

  ## Clusters
  cluster_metric_include = [ "*" ] ## if omitted or empty, all metrics are collected
  # cluster_metric_exclude = [] ## Nothing excluded by default
  # cluster_instances = true ## true by default

  ## Datastores
  datastore_metric_include = [ "*" ] ## if omitted or empty, all metrics are collected
  # datastore_metric_exclude = [] ## Nothing excluded by default
  # datastore_instances = false ## false by default for Datastores only

  ## Datacenters
  datacenter_metric_include = [ "*" ] ## if omitted or empty, all metrics are collected
  # datacenter_instances = false ## false by default

  ## Plugin Settings
  ## separator character to use for measurement and field names (default: "_")
  # separator = "_"

  ## number of objects to retrieve per query for realtime resources (vms and hosts)
  ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
  # max_query_objects = 256

  ## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
  ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
  # max_query_metrics = 256

  ## number of goroutines to use for collection and discovery of objects and metrics
  collect_concurrency = 3
  discover_concurrency = 3

  ## whether or not to force discovery of new objects on initial gather call before collecting metrics
  ## when true for large environments this may cause errors for time elapsed while collecting metrics
  ## when false (default) the first collection cycle may result in no or limited metrics while objects are discovered
  # force_discover_on_init = false

  ## the interval before (re)discovering objects subject to metrics collection (default: 300s)
  # object_discovery_interval = "300s"

  ## timeout applies to any API request made to vCenter
  # timeout = "20s"

  ## Optional SSL Config
  # ssl_ca = "/path/to/cafile"
  # ssl_cert = "/path/to/certfile"
  # ssl_key = "/path/to/keyfile"
  ## Use SSL but skip chain & host verification
  insecure_skip_verify = true

System info:

Ubuntu 16.04 AWS "Small" configuration.
Telegraf 1.8

Steps to reproduce:

Very hard to reproduce. You have to get a network error at exactly the right time. This happened when I was deliberately trying to overload an undersized system.

See the logfile for details on how this happened.

Expected behavior:

Data should be collected without error.

Actual behavior:

Panic in workerpool.go, caused by unlocking an unlocked Mutex. See the attached logfile!

Additional info:

This bug is due to a typo in the code that handles errors from the goroutine querying for metadata. If you lose your network connection at the exact moment the metadata query is issued, you hit a section of code where a mux.Lock() call was accidentally typed as mux.Unlock(), so the mutex is unlocked without ever having been locked.

telegraf/plugins/inputs/vsphere/endpoint.go

Lines 660 to 666 in af0ef55

    
    var mux sync.Mutex
    err := make(multiError, 0)
    wp.Drain(ctx, func(ctx context.Context, in interface{}) bool {
    	if in != nil {
    		mux.Unlock()
    		defer mux.Unlock()
    		err = append(err, in.(error))

Logfile: https://gist.github.com/prydin/82976e39378434bc2cc97cbdddf806fc


sbengo · 2018-09-28T06:34:48Z

Hi again @prydin !

What a coincidence! Just yesterday I tried the new plugin and it gave me a panic when trying to collect data and created the gist with the panic, but didn't have enough time to write up the issue!

Reviewing your log, it seems to be the same panic, but to be sure, here it is:

Panic log

It seemed to happen every 300s, when the agent was trying to retrieve metrics from cluster resources:

...
2018-09-27T08:43:00Z D! [input.vsphere]: Latest: 2018-09-27 10:38:00.22902964 +0200 CEST m=+53.726632359, elapsed: 304.788621, resource: datacenter
2018-09-27T08:43:00Z D! [input.vsphere]: Start of sample period deemed to be 2018-09-27 10:38:00.22902964 +0200 CEST m=+53.726632359
2018-09-27T08:43:00Z D! [input.vsphere]: Collecting metrics for 1 objects of type datacenter for myvcenter.mydomain.com
2018-09-27T08:43:00Z D! [input.vsphere]: Query returned 20 metrics
2018-09-27T08:43:00Z D! [input.vsphere]: Latest: 2018-09-27 10:38:00.229495562 +0200 CEST m=+53.727098115, elapsed: 305.015600, resource: cluster
2018-09-27T08:43:00Z D! [input.vsphere]: Start of sample period deemed to be 2018-09-27 10:38:00.229495562 +0200 CEST m=+53.727098115
2018-09-27T08:43:00Z D! [input.vsphere]: Collecting metrics for 1 objects of type cluster for myvcenter.mydomain.com
2018-09-27T08:43:00Z D! [input.vsphere]: Query returned 0 metrics

[telegraf stopped]

I will try your fix and will give you some feedback!

prydin · 2018-09-28T12:18:55Z

@sbengo Hmmm... I thought it would take longer for someone to hit that bug, but OK. :) Do you get any other error just before the panic? This bug is in the error handling code, so some other issue must have triggered it.

Also, if you want, I can build you a "hotfix" binary with that bug fixed.

prydin · 2018-10-02T19:37:35Z

Unofficial hotfix. Linux only. Let me know if you need anything else. (Also fixes #4783)
https://github.com/prydin/telegraf/releases/tag/PRYDIN-HOTFIX-4783

russorat added this to the 1.8.1 milestone Sep 27, 2018

russorat added the bug label Sep 27, 2018

prydin mentioned this issue Sep 27, 2018

Fixed #4764 (Panic when error during call to GetAvailableMetrics) #4765

Merged

danielnelson closed this as completed Sep 28, 2018
