Statsd registry stops polling meters #2543
If you could get a line number or more of the stack trace for the NPE, that would help dig into this.
I would also note: I'm assuming you have your reasons for using StatsD and Telegraf, but we do offer an InfluxDB registry, so you could publish metrics directly to InfluxDB, which would probably avoid the issue in the meantime.
@shakuzen Thanks, I'll see if I can get some stack traces for the NPE. The Spring class in question will dump them to the log if I crank its logging level to DEBUG, but it may be a week or two before I can get that. To be honest, I don't think there's anything stopping us using the InfluxDB registry; we went with StatsD mostly because we already had Telegraf with a variety of input plugins collecting metrics on other things. I'll look at switching over and see how that goes.
OK, got a stack trace, and that makes it clear the problem is with our code. Specifically, we have a gauge with a custom value function, and that function is a bit dodgy: it can throw an NPE.
It's a bit unfortunate that a misbehaving value function can completely and silently(!) kill the publication of pollable meters. I think it would be good if the registry handled exceptions from value functions, so that one misbehaving gauge couldn't stop publication for all the others.
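Until the registry handles this itself, one defensive option (not from the thread, purely a sketch) is to wrap the gauge's value function so any runtime exception becomes `NaN`, which Micrometer treats as "no value" rather than an error. The `safe` helper and the `queue.depth` metric below are illustrative names, not from the issue.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

public class SafeGauge {
    // Wrap a gauge value function so a RuntimeException surfaces as NaN
    // instead of propagating into the registry's polling machinery.
    static <T> ToDoubleFunction<T> safe(ToDoubleFunction<T> f) {
        return obj -> {
            try {
                return f.applyAsDouble(obj);
            } catch (RuntimeException e) {
                return Double.NaN;
            }
        };
    }

    public static void main(String[] args) {
        Map<String, Object> stats = new HashMap<>(); // "depth" never populated
        // Dodgy value function: stats.get("depth") is null, so doubleValue() throws an NPE
        ToDoubleFunction<Map<String, Object>> dodgy =
                m -> ((Number) m.get("depth")).doubleValue();
        System.out.println(safe(dodgy).applyAsDouble(stats)); // prints NaN, not a stack trace
        // With Micrometer you would register the wrapped function, e.g.:
        // Gauge.builder("queue.depth", stats, safe(dodgy)).register(registry);
    }
}
```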
@shakuzen Happy for this issue to be closed and/or relabelled to reflect that the bug is not really in Micrometer code, but perhaps the StatsD registry could handle misbehaving pollables a bit better. My suggestion for that is in PR #2549. For our application, if the polling of each pollable had been isolated as in that PR, all our other gauges would have continued publishing; we would have seen that the metrics not flowing through to Influx all came from one specific gauge, and we would have been able to track it down a lot more easily. Thanks!
I'm seeing this problem in a large Spring Boot application (Spring Boot 2.4.3, Micrometer 1.6.4). We are using the StatsD registry to send metrics to a local Telegraf process over UDP, and from there on to an InfluxDB + Grafana setup. We have many instances of this application running, and we have observed that some (but not all) of them are failing to send gauge metrics.
If I snoop on the UDP traffic from the application to Telegraf, I can see that all the counter and timer meters that we are using internally, and the ones that Spring Boot has created (logback log event counters et cetera), are flowing as expected. But there are no lines for polled meters. When I do a thread dump on the affected applications, I can see that the threads that would normally be there for the Reactor machinery (the Exec Stream Pumper and Exec Default Executor threads) are missing. The application is subject to occasional restarts, and in the Influx database we can see a few points for a handful (but not all) of our gauge meters, with timestamps corresponding to the application restarts.
It seems to me as if the Reactor machinery for doing the polling is dying early in the application's run and not coming back. Perhaps we are tickling a bug somewhere deep in the Reactor machinery? I'm not familiar with the Reactor library, so I'm not really sure how to diagnose this any further. It's difficult for me to get more clues about what is going on because these are production instances, and we don't observe the same behaviour in our CI/CD and test/staging environments.
UPDATE: Just spotted something in the logs of affected instances: when they shut down, the logs show:
I'm not seeing that on the instances which are not affected by this problem.