
movingAverage movingMedian very slow in cluster #1521

Closed
arielnh56 opened this issue May 19, 2016 · 7 comments

@arielnh56
Contributor

These seem to re-pull the data using the _fetchWithBootstrap() function.

For a specific query that was inserted into one of our grafana dashboards, this was mind-bogglingly slow.

In our environment we have 5 storage nodes. The original data query uses a wildcard which in this case matches 1411 metrics (these are munged and then combined with averageSeries), so each query to the storage nodes returns about 282 metrics. This returns pretty quickly.

Someone wrapped one of these queries in a movingAverage(x,100). This generated an additional 1411 separate data queries to the storage nodes (logged in rendering.log) on every render. Because the result took longer than the grafana refresh interval to generate (60s; it actually took >200s sometimes) and is displayed in multiple locations, something like 90% of the queries to the storage nodes ended up being this one thing, and the whole cluster responds like a slug.

We are working around it by not using movingAverage on large queries, but wouldn't it be better if _fetchWithBootstrap() could match the wildcard originally used?

I'm going to stare at it a bit and see if I can figure out a fix, but my python-fu may not be up to it.
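
To illustrate the shape of the problem, here's a minimal sketch (made-up names and counts, not the actual graphite-web code): the wildcard is evaluated once for the main render, but the bootstrap then re-fetches every expanded series on its own.

backend_queries = []

def fetch(path_expression):
    # Stand-in for a storage-node fetch; records every query issued.
    backend_queries.append(path_expression)
    if "*" in path_expression:
        # pretend the wildcard expands to 1411 concrete metrics
        return ["services.app.node%04d.metric" % i for i in range(1411)]
    return [path_expression]

def moving_average_with_per_series_bootstrap(target, window):
    # window only affects how much history is re-pulled, so it is unused here
    series = fetch(target)      # 1 query, wildcard expanded by the backend
    for path in series:         # 1411 re-fetches for the bootstrap history
        fetch(path)
    return series

moving_average_with_per_series_bootstrap("services.app.*.metric", 100)
print(len(backend_queries))     # 1412 backend queries for a single render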

@arielnh56
Contributor Author

Or maybe pre-process the query to look for the need for a bootstrap and pull the correct data the first time?

@arielnh56
Contributor Author

This seems to apply whenever movingAverage is applied to a series that already has a function applied.

For instance:

movingAverage(services.real-time-bidder.pnc.*.Heap.heap.percent,11)

generates two retrievals

Thu May 19 15:40:51 2016 :: Retrieval of services.real-time-bidder.pnc.*.Heap.heap.percent took 0.182778
Thu May 19 15:40:52 2016 :: Retrieval of services.real-time-bidder.pnc.*.Heap.heap.percent took 0.130155

while

movingAverage(nonNegativeDerivative(services.real-time-bidder.pnc.*.Heap.heap.percent),11)

gets the initial wildcard, then a hit for each node, and takes an age. As most of our interesting metrics are counters that reset to zero at daemon restart, not taking nonNegativeDerivative() on them first would generate silly numbers in a movingAverage.

@arielnh56
Contributor Author

I dug deep and I think I get it now - and I think I see the fix - and I have tested it narrowly.

It appears to come down to inconsistent handling of TimeSeries.pathExpression.

The field is initially set in fetchData as follows:

series.pathExpression = pathExpr #hack to pass expressions through to render functions

Some functions (all of the combination functions) pass this uphill nicely.

Many of the functions set it to be the same as 'name', after putting their own wrapper on name.

I tweaked one of these, nonNegativeDerivative(), and got the problem to go away. I still get the double-tap on the storage nodes, but only double, not 300X.

The change was as follows:

series.name = "nonNegativeDerivative(%s)" % series.name
series.pathExpression = "nonNegativeDerivative(%s)" % series.pathExpression
# series.pathExpression = series.name
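
For what it's worth, here's a rough sketch of why that tweak helps. It is illustrative only (shortened paths, not the actual _fetchWithBootstrap code) and assumes the bootstrap re-fetch is driven by the distinct pathExpression values, which is what the query counts above suggest: keeping the wildcard inside the wrapped expression collapses the 1411 re-fetches back down to one.

def bootstrap_refetch_count(path_expressions):
    # each distinct expression would trigger its own backend retrieval
    return len(set(path_expressions))

# before the tweak: pathExpression was set to the expanded per-series name
before = ["nonNegativeDerivative(services.rtb.pnc.node%03d.Heap.heap.percent)" % i
          for i in range(1411)]

# after the tweak: the wrapper is applied to the original wildcard expression
after = ["nonNegativeDerivative(services.rtb.pnc.*.Heap.heap.percent)"] * 1411

print(bootstrap_refetch_count(before))   # 1411 extra retrievals
print(bootstrap_refetch_count(after))    # 1 extra retrieval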

@arielnh56
Contributor Author

The more I dig into this, the more work it seems to be to fix. It appears that the intent of pathExpression is to be able to re-pull the data with different parameters (typically different time periods). Therefore it should be consistently updated in every function, but it is not.

It is initialized in fetchData, and then rolled up through the stack of functions.

Some functions, e.g. sumSeries(), derive it from the pathExpression values in the received seriesList; those seem to work correctly.

Some functions derive it from the name values in the received seriesList (many just set it equal to name). As name can be populated with arbitrary strings, that's going to generate odd behavior in many cases and result in a pathExpression that cannot be used to re-pull the data.

Some functions, e.g. maxSeries(), don't update it at all, so the resulting pathExpression will not generate the original result if used.
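
To make those three patterns concrete, here's a simplified sketch (illustrative stand-ins only, not the actual functions.py code):

class Series(object):
    # stand-in for TimeSeries; only the two fields that matter here
    def __init__(self, name, path_expression):
        self.name = name
        self.pathExpression = path_expression

# Pattern 1: combination functions derive the new pathExpression from their
# inputs' pathExpressions, so a wildcard survives the roll-up.
def combine_like_sum_series(series_list):
    expr = ",".join(sorted(set(s.pathExpression for s in series_list)))
    name = "sumSeries(%s)" % expr
    return Series(name, name)

# Pattern 2: wrapper functions set pathExpression from name; name holds the
# expanded per-series path, so the wildcard is lost.
def wrap_like_many_functions(series):
    series.name = "someFunction(%s)" % series.name
    series.pathExpression = series.name
    return series

# Pattern 3: functions that never touch pathExpression at all leave behind a
# value that no longer describes how to regenerate their output.

a = Series("services.app.node01.metric", "services.app.*.metric")
b = Series("services.app.node02.metric", "services.app.*.metric")
print(combine_like_sum_series([a, b]).pathExpression)   # wildcard preserved
print(wrap_like_many_functions(a).pathExpression)       # wildcard gone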

I see two approaches to fixing this:

  1. Fix all the functions to consistently update pathExpression as the result walks back up the tree. This is a lot of code, and would take quite a bit of maintenance going forward.

  2. Top-down approach. When a seriesList is passed to a function, pass the pathExpression that was used to generate it along with it. I'm not sure how best to achieve this (weak python-fu), maybe a tuple, or make seriesList a class? This would eliminate the need for every function to syntactically correctly update the pathExpression in each series. If it is a function that needs to re-pull the data, it can just pull the pathExpression out of its arguments. A rough sketch of this idea follows below.
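
Purely hypothetical sketch of option 2 (SeriesList is not an existing graphite-web class, just an illustration of the "make seriesList a class" idea):

class SeriesList(list):
    # a list of TimeSeries plus the target expression that produced it
    def __init__(self, series, path_expression):
        super(SeriesList, self).__init__(series)
        self.pathExpression = path_expression

# fetchData (or the evaluator) would build one of these per target...
series_list = SeriesList([], "services.real-time-bidder.pnc.*.Heap.heap.percent")

# ...and a function that needs to re-pull data, like movingAverage, could read
# the expression straight off its argument:
print(series_list.pathExpression)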

Am I completely off-base with this? It's entirely possible I am missing some design history on this one.

@arielnh56
Contributor Author

Argh. It occurs to me that the current behaviour has some advantages.

e.g. maxSeries() does not need to add itself if the series it is filtering are already split out with pathExpressions expanded to individual series. If you time-shift, re-running the original wildcard through the maxSeries filter might give you a different set of series, which would make that graph invalid.

Dang this is hard.

@arielnh56
Contributor Author

Had fun digging and coding.

Fixed. Filed pull request:

#1523

This is against 0.9.x as we are still on 0.9.15.

@obfuscurity
Member

@arielnh56 Thank you for all of your work into this fix. Much appreciated. 🍰 💯 🍭 👍 😎
