
[Stack Monitoring] Improve Missing Monitoring Data rule #126709

Open
neptunian opened this issue Mar 2, 2022 · 4 comments
Labels
Feature:Stack Monitoring · Team: Actionable Observability - DEPRECATED · Team:Monitoring

Comments


neptunian commented Mar 2, 2022

After investigating the slow performance of this rule when it is created with the default look-back of 1 day, we found that it has some shortcomings. The way this rule works is that we query for all data in the range of now - lookback. For each cluster and each node, we subtract the last document's timestamp from now, and if that value is greater than duration we fire an alert. Both duration and lookback are configurable by the user, and when we create an OOTB rule of this type for the user we set the defaults below:

[Screenshot: the default duration and look-back values used for the out-of-the-box rule]
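
A rough TypeScript sketch of the current evaluation logic described above (the names and shapes are illustrative, not the actual rule implementation):

// Hypothetical shape: one entry per node that has at least one document inside now - lookback.
interface NodeLastSeen {
  clusterUuid: string;
  nodeId: string;
  lastSeen: number; // timestamp (ms) of the node's most recent monitoring document
}

// durationMs corresponds to the user-configurable `duration` setting.
function findNodesMissingData(nodes: NodeLastSeen[], durationMs: number, now: number): NodeLastSeen[] {
  // A node only appears in `nodes` if it has data inside the look-back window,
  // which is exactly why it stops alerting once the window has passed.
  return nodes.filter((node) => now - node.lastSeen > durationMs);
}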

When it alerts, it specifies which node has the issue. The problem with this approach is that once the time range has passed and the data no longer exists, the rule will no longer report missing data for that node. Some changes we could make:

  • Similar to the Metrics Threshold rule, which keeps track of the groups it has seen from one execution to the next, we could track the groups (nodes) here (see the sketch after this list)
  • Remove the lookback option if we can track the groups
  • Consider changing this rule to only alert on a per-product basis (this was changed for ES only due to issues with other products). So in the case of ES, alert me when there is no Elasticsearch data at all, instead of having to track nodes. Or this could be a different rule.
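
A minimal sketch of what that group tracking could look like, assuming a hypothetical rule-state shape (the Metrics Threshold rule uses a similar idea, but this is not its actual code):

interface GroupTrackingState {
  trackedGroups: string[]; // "clusterUuid:nodeId" keys seen in previous executions
}

function detectMissingGroups(state: GroupTrackingState, groupsSeenNow: string[]) {
  const current = new Set(groupsSeenNow);
  // Any group we saw before but did not see in this execution is reported as missing.
  const missing = state.trackedGroups.filter((key) => !current.has(key));
  // Keep tracking both the currently seen groups and the missing ones so the alert keeps firing
  // until the group reports data again (or some expiry removes it).
  const nextState: GroupTrackingState = {
    trackedGroups: Array.from(new Set([...groupsSeenNow, ...missing])),
  };
  return { missing, nextState };
}
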
@neptunian added the Team:Infra Monitoring UI - DEPRECATED and Team: Actionable Observability - DEPRECATED labels on Mar 2, 2022
@elasticmachine

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@ravikesarwani

I like the idea of 2 separate rules: one focused on the whole cluster and one focused on the node.

  • Cluster missing monitoring data
  • Node missing monitoring data

For both of these rules I feel we will require some concept of alerting only when there was data before and the data has now been missing for a little while. We should gracefully handle the scenario where nodes are taken out of the cluster, something that happens all the time in the field over the lifetime of an Elasticsearch cluster.
When customers are working with an Elasticsearch cluster of 40/60/100 nodes, we need to think about how often we should generate the alert so that customers don't get overwhelmed.
A cluster-level rule/alert provides value, but my take is that it provides only very limited value, given that our architecture requires Metricbeat/Agent running on each node of the cluster.


simianhacker commented Mar 2, 2022

I would recommend we create a separate missing data rule for every entity in the system: Kibana, Metricbeat, Filebeat, APM Server, Nodes, and Clusters. As a customer, I would expect to be notified when any of these disappears from the cluster. As for the rule evaluation, we should use an Elasticsearch query to push the missing-entity detection down to Elasticsearch.

The following example detects nodes that drop out of the cluster or stop reporting. The idea is to query Elasticsearch using a range filter that spans the last rule execution and the current rule execution. To determine whether a node has gone missing or is new/recovered, we create two buckets using a filter aggregation, lastPeriod and currentPeriod (each using a range filter); this gives us a document count for each period.

Once we have the document count for each period, we can use a bucket_script named isNodeMissing to evaluate whether the node is missing, by checking that the document count for lastPeriod is greater than 0 and the count for currentPeriod is less than 1. To determine whether a node is recovered or new, we can use a second bucket_script, named isNodeRecoveredOrNew, to check whether lastPeriod is less than 1 and currentPeriod is greater than 0. Each of these bucket scripts returns either 1 or 0, since a bucket script cannot return a boolean.

With isNodeMissing and isNodeRecoveredOrNew, we can use a bucket_selector to only return the nodes where isNodeMissing > 0 or isNodeRecoveredOrNew > 0. In Kibana, we will need to keep track of only the nodes where isNodeMissing === 1 in the rule state. If a node recovers (isNodeRecoveredOrNew === 1), we need to delete the node from the rule state. Finally, for every node we are tracking in the rule state, from past executions and the current one, we need to trigger a "NO DATA" alert every time the rule executes.

Along with the missing nodes, we also need to track the timestamp of the previous execution so we can use it to build the range query that covers both periods. For most of the monitoring data, a 5 minute window for each period should be sufficient. This means we would actually query approximately 10 minutes of data, from the start of the last execution to the end of the current one. In a perfect world we could simply create 2 equal-sized buckets, but unfortunately the Kibana Alerting framework has some drift between executions, which is why we need to use the timestamp of the last execution rather than assuming it never drifts.
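
For illustration, the range bounds could be derived from the stored last-execution timestamp roughly like this (the state field name is hypothetical):

// Derive the query range from the last execution timestamp stored in rule state,
// so drift between executions never opens a gap in coverage.
function getQueryRange(lastExecutionTimestamp: number | undefined, windowMs = 5 * 60 * 1000) {
  const now = Date.now();
  // On the very first execution there is no previous timestamp, so fall back to one window back.
  const lastExecution = lastExecutionTimestamp ?? now - windowMs;
  return {
    gte: new Date(lastExecution - windowMs).toISOString(), // start of lastPeriod
    lte: new Date(now).toISOString(), // end of currentPeriod
  };
}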

In the example query below, I'm just using a 10 minute time range with two equal 5 minute periods, but in the final implementation the lastPeriod bucket should use the last execution time minus the window size (5m), and the range query should span from the last execution timestamp, minus the window, to now.

POST .monitoring-es-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "timestamp": {
              "gte": "now-10m",
              "lte": "now"
            }
          }
        },
        {
          "term": {
            "type": "node_stats"
          }
        }
      ]
    }
  },
  "aggs": {
    "nodes": {
      "composite": {
        "size": 10000,
        "sources": [
          {
            "cluster": {
              "terms": {
                "field": "cluster_uuid"
              }
            }
          },
          {
            "node": {
              "terms": {
                "field": "node_stats.node_id"
              }
            }
          }
        ]
      },
      "aggs": {
        "lastPeriod": {
          "filter": {
            "range": {
              "timestamp": {
                "gte": "now-10m",
                "lte": "now-5m"
              }
            }
          }
        },
        "currentPeriod": {
          "filter": {
            "range": {
              "timestamp": {
                "gte": "now-5m",
                "lte": "now"
              }
            }
          }
        },
        "isNodeMissing": {
          "bucket_script": {
            "buckets_path": {
              "lastPeriod": "lastPeriod>_count",
              "currentPeriod": "currentPeriod>_count"
            },
            "script": "params.lastPeriod > 0 && params.currentPeriod < 1 ? 1 : 0"
          }
        },
        "isNodeRecoveredOrNew": {
          "bucket_script": {
            "buckets_path": {
              "lastPeriod": "lastPeriod>_count",
              "currentPeriod": "currentPeriod>_count"
            },
            "script": "params.lastPeriod < 1 && params.currentPeriod > 0 ? 1 : 0"
          }
        },
        "evaluation": {
          "bucket_selector": {
            "buckets_path": {
              "isNodeMissing": "isNodeMissing",
              "isNodeRecoveredOrNew": "isNodeRecoveredOrNew"
            },
            "script": "params.isNodeMissing > 0 || params.isNodeRecoveredOrNew > 0"
          }
        }
      }
    }
  }
}

This should simplify the Kibana code to just a few parts (see the sketch after this list):

  • Build the query DSL
  • Query Elasticsearch for every page of the composite agg
  • Add/Delete missing entities from rule state
  • Save the current execution timestamp in the rule state
  • Trigger alerts for all the missing entities in the rule state
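
A rough sketch of those parts, assuming hypothetical shapes for the evaluated buckets and the rule state (the actual Kibana rule-type wiring is omitted):

interface MissingDataRuleState {
  lastExecutionTimestamp?: number;
  missingEntities: string[]; // "clusterUuid:nodeId" keys currently considered missing
}

interface EntityEvaluation {
  key: string; // built from the composite agg bucket key (cluster + node)
  isMissing: boolean; // isNodeMissing === 1
  isRecoveredOrNew: boolean; // isNodeRecoveredOrNew === 1
}

// `evaluations` is collected by paging through every page of the composite aggregation.
function updateMissingEntities(
  state: MissingDataRuleState,
  evaluations: EntityEvaluation[],
  now: number
): { nextState: MissingDataRuleState; entitiesToAlert: string[] } {
  const missing = new Set(state.missingEntities);
  for (const evaluation of evaluations) {
    if (evaluation.isMissing) missing.add(evaluation.key); // newly missing entities start alerting
    if (evaluation.isRecoveredOrNew) missing.delete(evaluation.key); // recovered entities stop alerting
  }
  return {
    nextState: { lastExecutionTimestamp: now, missingEntities: Array.from(missing) },
    // Trigger a "NO DATA" alert for every entity still tracked as missing.
    entitiesToAlert: Array.from(missing),
  };
}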

This will also improve the performance of these rules because we only need to query approximately 10 minutes of data instead of looking back 24 hours every time the rule runs. It also eliminates the bug where, after 24 hours, missing nodes appear to recover because they no longer show up in the query.

@miltonhultgren

I'm wondering how these kinds of rules intersect with the planned Health and Topology APIs?
