
Statsd registry stops polling meters #2543

Closed
matthewlowry opened this issue Mar 31, 2021 · 5 comments
Labels
bug (A general bug), registry: statsd (A StatsD Registry related issue)
Milestone
1.3.19

Comments

@matthewlowry
Contributor

matthewlowry commented Mar 31, 2021

I'm seeing this problem in a large Spring Boot application - Spring Boot 2.4.3 and Micrometer 1.6.4. We are using the StatsD registry to send metrics to a local Telegraf process over UDP, and from there on to an InfluxDB + Grafana setup. We have many instances of these applications running, and we have observed that some (but not all) of them are failing to send gauge metrics.
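
For reference, the registry setup is just what Spring Boot auto-configures for us. A rough plain-Micrometer equivalent would look something like the following sketch (the host, port, and flavor values are illustrative, not our exact config):

import io.micrometer.core.instrument.Clock;
import io.micrometer.statsd.StatsdConfig;
import io.micrometer.statsd.StatsdFlavor;
import io.micrometer.statsd.StatsdMeterRegistry;

class StatsdSetupSketch {
    static StatsdMeterRegistry buildRegistry() {
        StatsdConfig config = new StatsdConfig() {
            @Override
            public String get(String key) {
                return null; // fall back to Micrometer defaults for anything not overridden
            }

            @Override
            public StatsdFlavor flavor() {
                return StatsdFlavor.TELEGRAF; // Telegraf-style line protocol
            }

            @Override
            public String host() {
                return "localhost"; // local Telegraf agent listening on UDP
            }

            @Override
            public int port() {
                return 8125;
            }
        };
        return new StatsdMeterRegistry(config, Clock.SYSTEM);
    }
}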

If I snoop on the UDP traffic from the application to Telegraf, I can see that all the counter and timer meters that we are using internally, and the ones that Spring Boot has created (logback log event counters et cetera), are flowing as expected. But there are no lines for polled meters. When I do a thread dump on the affected applications, I can see that the threads that would normally be there for the Reactor machinery (the Exec Stream Pumper, Exec Default Executor threads) are missing.

The application is subject to occasional restarts, and in the Influx database we can see a few points for a handful (but not all) of our gauge meters, with timestamps corresponding to the application restarts.

It seems to me as if the Reactor machinery for doing the polling is dying early in the application's run and not coming back. Perhaps we are tickling a bug somewhere deep in the Reactor machinery? I'm not familiar with the Reactor library, so I'm not really sure how to diagnose any further. It's difficult for me to get more clues about what is going on because these are production instances, and we don't observe the same behaviour in our CI/CD and test/staging environments.

UPDATE:

Just spotted something in the logs of affected instances: when they shut down, the logs show:

2021-03-19 16:42:34.498  WARN 3182 --- [SpringContextShutdownHook] o.s.b.f.support.DisposableBeanAdapter    : Destroy method 'close' on bean with name 'statsdMeterRegistry' threw an exception: java.lang.NullPointerException

I'm not seeing that on the instances which are not affected by this problem.

@shakuzen shakuzen added the registry: statsd A StatsD Registry related issue label Apr 1, 2021
@shakuzen
Member

shakuzen commented Apr 1, 2021

If you could get a line number or more of the stack trace for the NPE, that would help dig into this.

@shakuzen shakuzen added the bug A general bug label Apr 1, 2021
@shakuzen
Member

shakuzen commented Apr 1, 2021

I would also note: I'm assuming you have your reasons for using statsd and telegraf, but we do offer an influxdb registry so you could publish metrics directly to InfluxDB, which probably avoids the issue in the meantime.
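
In a Spring Boot app that usually just means adding the micrometer-registry-influx dependency and setting the management.metrics.export.influx.* properties. For completeness, a minimal plain-Java sketch of a registry publishing directly to InfluxDB would be along these lines (the URI and database name are illustrative):

import io.micrometer.core.instrument.Clock;
import io.micrometer.influx.InfluxConfig;
import io.micrometer.influx.InfluxMeterRegistry;

class InfluxSetupSketch {
    static InfluxMeterRegistry buildRegistry() {
        InfluxConfig config = new InfluxConfig() {
            @Override
            public String get(String key) {
                return null; // take Micrometer defaults for anything not overridden
            }

            @Override
            public String uri() {
                return "http://localhost:8086"; // illustrative InfluxDB endpoint
            }

            @Override
            public String db() {
                return "metrics"; // illustrative database name
            }
        };
        return new InfluxMeterRegistry(config, Clock.SYSTEM);
    }
}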

@matthewlowry
Contributor Author

@shakuzen Thanks

I'll see if I can get some stack traces for the NPE. The Spring class in question will dump them to the log if I crank its logging level to DEBUG, but it may be a week or two before I can get that.

To be honest, I don't think there's anything stopping us from using the InfluxDB registry; we went with StatsD mostly because we already had Telegraf in place with a variety of input plugins collecting metrics on other things. I'll look at switching over and see how that goes.

@matthewlowry
Contributor Author

matthewlowry commented Apr 6, 2021

OK, got a stack trace, and it makes clear that the problem is in our code. Specifically, we have a gauge with a custom value function, and that function is a bit dodgy - it can throw an NPE.

java.lang.NullPointerException: null
	at [ourCode]
	at io.micrometer.statsd.StatsdGauge.value(StatsdGauge.java:49)
	at io.micrometer.statsd.StatsdGauge.poll(StatsdGauge.java:54)
	at io.micrometer.statsd.StatsdMeterRegistry.poll(StatsdMeterRegistry.java:178)
	at io.micrometer.statsd.StatsdMeterRegistry.close(StatsdMeterRegistry.java:312)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.beans.factory.support.DisposableBeanAdapter.invokeCustomDestroyMethod(DisposableBeanAdapter.java:339)
	at [...Spring machinery...]
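
Our actual code is different, but the shape of the mistake is roughly this (names are made up for illustration): a gauge whose value function dereferences something that can legitimately be null, so Gauge#value() - and therefore StatsdGauge#poll() - throws.

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;

class QueueMetrics {
    private volatile WorkQueue queue; // may still be null early in startup

    void bind(MeterRegistry registry) {
        // NPE whenever the gauge is polled while queue is still null
        Gauge.builder("work.queue.depth", this, metrics -> metrics.queue.depth())
                .register(registry);
    }

    interface WorkQueue {
        double depth();
    }
}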

It's a bit unfortunate that a misbehaving value function can completely and silently(!) kill the publication of pollable meters. It causes the loop over pollable meters in StatsdMeterRegistry.poll() to terminate early, and then propagates up into the Reactor machinery, causing the Flux that triggers periodic polling to terminate with an error. None of this is logged or in any way surfaced in the application. The only reason we can see it in the above stack trace is that the registry does one last poll while closing, so for this final poll the NPE bubbles up the stack of a thread managed by Spring, which logs it.

I think it would be good if the StatsdMeterRegistry.poll() method isolated the polling of each pollable meter, so that one misbehaving meter doesn't affect ongoing polling and publication of the others - something along the lines of the sketch below.
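
Purely as an illustration of the idea (not the actual Micrometer source; the Pollable type here is a stand-in for the registry's internal pollable meters):

import java.util.List;

class IsolatedPollingSketch {

    interface Pollable {
        void poll(); // stand-in for a pollable meter such as a gauge
    }

    // Poll each meter in isolation so that one throwing value function cannot
    // abort the loop or error out the Flux that drives periodic polling.
    static void pollAll(List<Pollable> pollableMeters) {
        for (Pollable meter : pollableMeters) {
            try {
                meter.poll();
            } catch (RuntimeException e) {
                // ideally log at WARN with the meter id, then keep going
                // so the remaining meters still get polled and published
            }
        }
    }
}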

@matthewlowry
Contributor Author

@shakuzen Happy for this issue to be closed and/or relabelled to reflect that the bug is not really in Micrometer code, but perhaps the Statsd registry could handle misbehaving pollables a bit better.

My suggestion for that is in PR #2549. For our application, if the polling of each pollable were isolated as in that PR, all our other gauges would have continued publishing, we would have seen that the metrics not flowing through to Influx all came from one specific gauge, and we would have been able to track it down a lot more easily.

Thanks!

@shakuzen shakuzen added this to the 1.3.19 milestone Apr 13, 2021
@shakuzen shakuzen changed the title from "Statsd registry not polling gauge meters" to "Statsd registry stops polling meters" Apr 14, 2021