
movingAverage movingMedian very slow in cluster #1521

Closed
arielnh56 opened this issue May 19, 2016 · 7 comments

@arielnh56
Contributor

These seem to re-pull the data using the _fetchWithBootstrap() function.

For a specific query that was inserted into one of our grafana dashboards, this was mind-bogglingly slow.

In our environment we have 5 storage nodes. The original data query uses a wildcard which in this case matches 1411 metrics (these are munged and then combined with averageSeries), so each query to the storage nodes returns about 282 metrics. This returns pretty quickly.

Someone wrapped one of these queries in a movingAverage(x,100). This generated an additional 1411 separate data queries to the storage nodes (logged in rendering.log) on every render. Because the result took longer than the grafana refresh interval to generate (60s; it actually took >200s sometimes) and is displayed in multiple locations, something like 90% of the queries to the storage nodes ended up being this one thing, and the whole cluster responds like a slug.

We are working around it by not using movingAverage on large queries, but wouldn't it be better if _fetchWithBootstrap() could match the wildcard originally used?

I'm going to stare at it a bit and see if I can figure out a fix, but my python-fu may not be up to it.
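
To illustrate the shape of the problem, here's a minimal sketch (made-up names and counts, not the actual graphite-web code): the wildcard is evaluated once for the main render, but the bootstrap then re-fetches every expanded series on its own.

backend_queries = []

def fetch(path_expression):
    # Stand-in for a storage-node fetch; records every query issued.
    backend_queries.append(path_expression)
    if "*" in path_expression:
        # pretend the wildcard expands to 1411 concrete metrics
        return ["services.app.node%04d.metric" % i for i in range(1411)]
    return [path_expression]

def moving_average_with_per_series_bootstrap(target, window):
    # window only affects how much history is re-pulled, so it is unused here
    series = fetch(target)      # 1 query, wildcard expanded by the backend
    for path in series:         # 1411 re-fetches for the bootstrap history
        fetch(path)
    return series

moving_average_with_per_series_bootstrap("services.app.*.metric", 100)
print(len(backend_queries))     # 1412 backend queries for a single render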

@arielnh56
Contributor Author

Or maybe pre-process the query to look for the need for a bootstrap and pull the correct data the first time?

@arielnh56
Contributor Author

This seems to apply whenever movingAverage is applied to a series that already has a function applied.

For instance:

movingAverage(services.real-time-bidder.pnc.*.Heap.heap.percent,11)

generates two retrievals

Thu May 19 15:40:51 2016 :: Retrieval of services.real-time-bidder.pnc.*.Heap.heap.percent took 0.182778
Thu May 19 15:40:52 2016 :: Retrieval of services.real-time-bidder.pnc.*.Heap.heap.percent took 0.130155

while

movingAverage(nonNegativeDerivative(services.real-time-bidder.pnc.*.Heap.heap.percent),11)

gets the initial wildcard, then a hit for each node, and takes an age. As most of our interesting metrics are counters that reset to zero at daemon restart, not taking nonNegativeDerivative() on them first would generate silly numbers in a movingAverage.

@arielnh56
Contributor Author

I dug deep and I think I get it now - and I think I see the fix - and I have tested it narrowly.

It appears to come down to inconsistent handling of TimeSeries.pathExpression.

The field is initially set in fetchData as follows:

series.pathExpression = pathExpr #hack to pass expressions through to render functions

Some functions (all of the combination functions) pass this uphill nicely.

Many of the functions set it to be the same as 'name', after putting their own wrapper on name.

I tweaked one of these, nonNegativeDerivative(), and got the problem to go away. I still get the double-tap on the storage nodes, but only double, not 300X.

The change was as follows:

series.name = "nonNegativeDerivative(%s)" % series.name
series.pathExpression = "nonNegativeDerivative(%s)" % series.pathExpression
# series.pathExpression = series.name
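
For what it's worth, here's a rough sketch of why that tweak helps. It is illustrative only (shortened paths, not the actual _fetchWithBootstrap code) and assumes the bootstrap re-fetch is driven by the distinct pathExpression values, which is what the query counts above suggest: keeping the wildcard inside the wrapped expression collapses the 1411 re-fetches back down to one.

def bootstrap_refetch_count(path_expressions):
    # each distinct expression would trigger its own backend retrieval
    return len(set(path_expressions))

# before the tweak: pathExpression was set to the expanded per-series name
before = ["nonNegativeDerivative(services.rtb.pnc.node%03d.Heap.heap.percent)" % i
          for i in range(1411)]

# after the tweak: the wrapper is applied to the original wildcard expression
after = ["nonNegativeDerivative(services.rtb.pnc.*.Heap.heap.percent)"] * 1411

print(bootstrap_refetch_count(before))   # 1411 extra retrievals
print(bootstrap_refetch_count(after))    # 1 extra retrieval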

@arielnh56
Contributor Author

The more I dig into this, the more work it seems to be to fix. It appears that the intent of pathExpression is to be able to re-pull the data with different parameters (typically different time periods). Therefore it should be consistently updated in every function, but it is not.

It is initialized in fetchData, and then rolled up through the stack of functions.

Some functions, e.g. sumSeries(), derive it from the pathExpression values in the received seriesList; those seem to work correctly.

Some functions derive it from the name values in the received seriesList (many just set it equal to name). As name can be populated with arbitrary strings, that's going to generate odd behavior in many cases and result in a pathExpression that cannot be used to re-pull the data.

Some functions, e.g. maxSeries(), don't update it at all, so the resulting pathExpression will not generate the original result if used.
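
To make those three patterns concrete, here's a simplified sketch (illustrative stand-ins only, not the actual functions.py code):

class Series(object):
    # stand-in for TimeSeries; only the two fields that matter here
    def __init__(self, name, path_expression):
        self.name = name
        self.pathExpression = path_expression

# Pattern 1: combination functions derive the new pathExpression from their
# inputs' pathExpressions, so a wildcard survives the roll-up.
def combine_like_sum_series(series_list):
    expr = ",".join(sorted(set(s.pathExpression for s in series_list)))
    name = "sumSeries(%s)" % expr
    return Series(name, name)

# Pattern 2: wrapper functions set pathExpression from name; name holds the
# expanded per-series path, so the wildcard is lost.
def wrap_like_many_functions(series):
    series.name = "someFunction(%s)" % series.name
    series.pathExpression = series.name
    return series

# Pattern 3: functions that never touch pathExpression at all leave behind a
# value that no longer describes how to regenerate their output.

a = Series("services.app.node01.metric", "services.app.*.metric")
b = Series("services.app.node02.metric", "services.app.*.metric")
print(combine_like_sum_series([a, b]).pathExpression)   # wildcard preserved
print(wrap_like_many_functions(a).pathExpression)       # wildcard gone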

I see two approaches to fixing this:

  1. Fix all the functions to consistently update pathExpression as the result walks back up the tree. This is a lot of code, and would take quite a bit of maintenance going forward.

  2. Top-down approach. When a seriesList is passed to a function, pass the pathExpression that was used to generate it along with it. I'm not sure how best to achieve this (weak python-fu), maybe a tuple, or make seriesList a class? This would eliminate the need for every function to syntactically correctly update the pathExpression in each series. If it is a function that needs to re-pull the data, it can just pull the pathExpression out of its arguments. A rough sketch of this idea follows below.
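
Purely hypothetical sketch of option 2 (SeriesList is not an existing graphite-web class, just an illustration of the "make seriesList a class" idea):

class SeriesList(list):
    # a list of TimeSeries plus the target expression that produced it
    def __init__(self, series, path_expression):
        super(SeriesList, self).__init__(series)
        self.pathExpression = path_expression

# fetchData (or the evaluator) would build one of these per target...
series_list = SeriesList([], "services.real-time-bidder.pnc.*.Heap.heap.percent")

# ...and a function that needs to re-pull data, like movingAverage, could read
# the expression straight off its argument:
print(series_list.pathExpression)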

Am I completely off-base with this? It's entirely possible I am missing some design history on this one.

@arielnh56
Contributor Author

Argh. It occurs to me that the current behaviour has some advantages.

e.g. maxSeries() does not need to add itself if the series it is filtering are already split out with pathExpressions expanded to individual series. If you time-shift, re-running the original wildcard through the maxSeries filter might give you a different set of series, which would make that graph invalid.

Dang this is hard.

@arielnh56
Contributor Author

Had fun digging and coding.

Fixed. Filed pull request:

#1523

This is against 0.9.x as we are still on 0.9.15.

@obfuscurity
Member

@arielnh56 Thank you for all of your work into this fix. Much appreciated. 🍰 💯 🍭 👍 😎
