
Statsd registry stops polling meters #2543

Closed
matthewlowry opened this issue Mar 31, 2021 · 5 comments
Labels
bug (A general bug), registry: statsd (A StatsD Registry related issue)
Milestone
1.3.19

Comments

@matthewlowry
Contributor

matthewlowry commented Mar 31, 2021

I'm seeing this problem in a large Spring Boot application - Spring Boot 2.4.3 and Micrometer 1.6.4. We are using the StatsD registry to send metrics to a local Telegraf process over UDP, and from there on to an InfluxDB + Grafana setup. We have many instances of these applications running, and we have observed that some (but not all) of them are failing to send gauge metrics.
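
For reference, the registry setup is just what Spring Boot auto-configures for us. A rough plain-Micrometer equivalent would look something like the following sketch (the host, port, and flavor values are illustrative, not our exact config):

import io.micrometer.core.instrument.Clock;
import io.micrometer.statsd.StatsdConfig;
import io.micrometer.statsd.StatsdFlavor;
import io.micrometer.statsd.StatsdMeterRegistry;

class StatsdSetupSketch {
    static StatsdMeterRegistry buildRegistry() {
        StatsdConfig config = new StatsdConfig() {
            @Override
            public String get(String key) {
                return null; // fall back to Micrometer defaults for anything not overridden
            }

            @Override
            public StatsdFlavor flavor() {
                return StatsdFlavor.TELEGRAF; // Telegraf-style line protocol
            }

            @Override
            public String host() {
                return "localhost"; // local Telegraf agent listening on UDP
            }

            @Override
            public int port() {
                return 8125;
            }
        };
        return new StatsdMeterRegistry(config, Clock.SYSTEM);
    }
}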

If I snoop on the UDP traffic from the application to Telegraf, I can see that all the counter and timer meters that we are using internally, and the ones that Spring Boot has created (logback log event counters et cetera), are flowing as expected. But there are no lines for polled meters. When I do a thread dump on the affected applications, I can see that the threads that would normally be there for the Reactor machinery (the Exec Stream Pumper, Exec Default Executor threads) are missing.

The application is subject to occasional restarts, and in the Influx database we can see a few points for a handful (but not all) of our gauge meters, with timestamps corresponding to the application restarts.

It seems to me as if the Reactor machinery for doing the polling is dying early in the application's run and not coming back. Perhaps we are tickling a bug somewhere deep in the Reactor machinery? I'm not familiar with the Reactor library, so I'm not really sure how to diagnose any further. It's difficult for me to get more clues about what is going on because these are production instances, and we don't observe the same behaviour in our CI/CD and test/staging environments.

UPDATE:

Just spotted something in the logs of affected instances: when they shut down, the logs show:

2021-03-19 16:42:34.498  WARN 3182 --- [SpringContextShutdownHook] o.s.b.f.support.DisposableBeanAdapter    : Destroy method 'close' on bean with name 'statsdMeterRegistry' threw an exception: java.lang.NullPointerException

I'm not seeing that on the instances which are not affected by this problem.

@shakuzen shakuzen added the registry: statsd A StatsD Registry related issue label Apr 1, 2021
@shakuzen
Member

shakuzen commented Apr 1, 2021

If you could get a line number or more of the stack trace for the NPE, that would help dig into this.

@shakuzen shakuzen added the bug A general bug label Apr 1, 2021
@shakuzen
Member

shakuzen commented Apr 1, 2021

I would also note: I'm assuming you have your reasons for using statsd and telegraf, but we do offer an influxdb registry so you could publish metrics directly to InfluxDB, which probably avoids the issue in the meantime.
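
In a Spring Boot app that usually just means adding the micrometer-registry-influx dependency and setting the management.metrics.export.influx.* properties. For completeness, a minimal plain-Java sketch of a registry publishing directly to InfluxDB would be along these lines (the URI and database name are illustrative):

import io.micrometer.core.instrument.Clock;
import io.micrometer.influx.InfluxConfig;
import io.micrometer.influx.InfluxMeterRegistry;

class InfluxSetupSketch {
    static InfluxMeterRegistry buildRegistry() {
        InfluxConfig config = new InfluxConfig() {
            @Override
            public String get(String key) {
                return null; // take Micrometer defaults for anything not overridden
            }

            @Override
            public String uri() {
                return "http://localhost:8086"; // illustrative InfluxDB endpoint
            }

            @Override
            public String db() {
                return "metrics"; // illustrative database name
            }
        };
        return new InfluxMeterRegistry(config, Clock.SYSTEM);
    }
}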

@matthewlowry
Contributor Author

@shakuzen Thanks

I'll see if I can get some stack traces for the NPE. The Spring class in question will dump them to the log if I crank its logging level to DEBUG, but it may be a week or two before I can get that.

To be honest, I don't think there's anything stopping us from using the InfluxDB registry; we went with StatsD mostly because we already had Telegraf in place with a variety of input plugins collecting metrics on other things. I'll look at switching over and see how that goes.

@matthewlowry
Contributor Author

matthewlowry commented Apr 6, 2021

OK, got a stack trace, and it makes clear that the problem is in our code. Specifically, we have a gauge with a custom value function, and that function is a bit dodgy - it can throw an NPE.

java.lang.NullPointerException: null
	at [ourCode]
	at io.micrometer.statsd.StatsdGauge.value(StatsdGauge.java:49)
	at io.micrometer.statsd.StatsdGauge.poll(StatsdGauge.java:54)
	at io.micrometer.statsd.StatsdMeterRegistry.poll(StatsdMeterRegistry.java:178)
	at io.micrometer.statsd.StatsdMeterRegistry.close(StatsdMeterRegistry.java:312)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.beans.factory.support.DisposableBeanAdapter.invokeCustomDestroyMethod(DisposableBeanAdapter.java:339)
	at [...Spring machinery...]
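
Our actual code is different, but the shape of the mistake is roughly this (names are made up for illustration): a gauge whose value function dereferences something that can legitimately be null, so Gauge#value() - and therefore StatsdGauge#poll() - throws.

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;

class QueueMetrics {
    private volatile WorkQueue queue; // may still be null early in startup

    void bind(MeterRegistry registry) {
        // NPE whenever the gauge is polled while queue is still null
        Gauge.builder("work.queue.depth", this, metrics -> metrics.queue.depth())
                .register(registry);
    }

    interface WorkQueue {
        double depth();
    }
}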

It's a bit unfortunate that a misbehaving value function can completely and silently(!) kill the publication of pollable meters. It causes the loop over pollable meters in StatsdMeterRegistry.poll() to terminate early, and then propagates up into the Reactor machinery, causing the Flux that triggers periodic polling to terminate with an error. None of this is logged or in any way surfaced in the application. The only reason we can see it in the above stack trace is that the registry does one last poll while closing, so for this final poll the NPE bubbles up the stack of a thread managed by Spring, which logs it.

I think it would be good if the StatsdMeterRegistry.poll() method isolated the polling of each pollable meter, so that one misbehaving meter doesn't affect ongoing polling and publication of the others - something along the lines of the sketch below.
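
Purely as an illustration of the idea (not the actual Micrometer source; the Pollable type here is a stand-in for the registry's internal pollable meters):

import java.util.List;

class IsolatedPollingSketch {

    interface Pollable {
        void poll(); // stand-in for a pollable meter such as a gauge
    }

    // Poll each meter in isolation so that one throwing value function cannot
    // abort the loop or error out the Flux that drives periodic polling.
    static void pollAll(List<Pollable> pollableMeters) {
        for (Pollable meter : pollableMeters) {
            try {
                meter.poll();
            } catch (RuntimeException e) {
                // ideally log at WARN with the meter id, then keep going
                // so the remaining meters still get polled and published
            }
        }
    }
}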

@matthewlowry
Contributor Author

@shakuzen Happy for this issue to be closed and/or relabelled to reflect that the bug is not really in Micrometer code, but perhaps the Statsd registry could handle misbehaving pollables a bit better.

My suggestion for that is in PR #2549. For our application, if the polling of each pollable were isolated as in that PR, all our other gauges would have continued publishing, we would have seen that the metrics not flowing through to Influx all came from one specific gauge, and we would have been able to track it down a lot more easily.

Thanks!

@shakuzen shakuzen added this to the 1.3.19 milestone Apr 13, 2021
@shakuzen shakuzen changed the title from "Statsd registry not polling gauge meters" to "Statsd registry stops polling meters" Apr 14, 2021