
[Stack Monitoring] Improve Missing Monitoring Data rule #126709

Open
neptunian opened this issue Mar 2, 2022 · 4 comments
Labels
Feature:Stack Monitoring · Team: Actionable Observability - DEPRECATED · Team:Monitoring

Comments


neptunian commented Mar 2, 2022

After investigating the slow performance of this rule when it is created with the default look-back of 1 day, we found that it has some shortcomings. The way this rule works is that we query for all data in the range of now - lookback. For each cluster and each node, we subtract the last document's timestamp from now, and if that value is greater than duration we fire an alert. Both duration and lookback are configurable by the user, and when we create an OOTB rule of this type for the user we set the defaults below:

[Screenshot: the default duration and look-back values used for the out-of-the-box rule]
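
A rough TypeScript sketch of the current evaluation logic described above (the names and shapes are illustrative, not the actual rule implementation):

// Hypothetical shape: one entry per node that has at least one document inside now - lookback.
interface NodeLastSeen {
  clusterUuid: string;
  nodeId: string;
  lastSeen: number; // timestamp (ms) of the node's most recent monitoring document
}

// durationMs corresponds to the user-configurable `duration` setting.
function findNodesMissingData(nodes: NodeLastSeen[], durationMs: number, now: number): NodeLastSeen[] {
  // A node only appears in `nodes` if it has data inside the look-back window,
  // which is exactly why it stops alerting once the window has passed.
  return nodes.filter((node) => now - node.lastSeen > durationMs);
}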

When it alerts, it specifies which node has the issue. The problem with this approach is that once the time range has passed and the data no longer exists, the rule will no longer report missing data for that node. Some changes we could make:

  • Similar to the Metrics Threshold rule, which keeps track of the groups it has seen from one execution to the next, we could track the groups (nodes) here (see the sketch after this list)
  • Remove the lookback option if we can track the groups
  • Consider changing this rule to only alert on a per-product basis (this was changed for ES only due to issues with other products). So in the case of ES, alert me when there is no Elasticsearch data at all, instead of having to track nodes. Or this could be a different rule.
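
A minimal sketch of what that group tracking could look like, assuming a hypothetical rule-state shape (the Metrics Threshold rule uses a similar idea, but this is not its actual code):

interface GroupTrackingState {
  trackedGroups: string[]; // "clusterUuid:nodeId" keys seen in previous executions
}

function detectMissingGroups(state: GroupTrackingState, groupsSeenNow: string[]) {
  const current = new Set(groupsSeenNow);
  // Any group we saw before but did not see in this execution is reported as missing.
  const missing = state.trackedGroups.filter((key) => !current.has(key));
  // Keep tracking both the currently seen groups and the missing ones so the alert keeps firing
  // until the group reports data again (or some expiry removes it).
  const nextState: GroupTrackingState = {
    trackedGroups: Array.from(new Set([...groupsSeenNow, ...missing])),
  };
  return { missing, nextState };
}
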
@neptunian added the Team:Infra Monitoring UI - DEPRECATED and Team: Actionable Observability - DEPRECATED labels on Mar 2, 2022
@elasticmachine

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@ravikesarwani

I like the idea of 2 separate rules: one focused on the whole cluster and one focused on the node.

  • Cluster missing monitoring data
  • Node missing monitoring data

For both of these rules I feel we will require some concept of alerting only when there was data before and the data has now been missing for a little while. We should gracefully handle the scenario where nodes are taken out of the cluster, something that happens all the time in the field over the lifetime of an Elasticsearch cluster.
When customers are working with an Elasticsearch cluster of 40/60/100 nodes, we need to think about how often we should generate the alert so that customers don't get overwhelmed.
A cluster-level rule/alert provides value, but my take is that it provides only very limited value, given that our architecture requires Metricbeat/Agent running on each node of the cluster.


simianhacker commented Mar 2, 2022

I would recommend we create a separate missing data rule for every entity in the system: Kibana, Metricbeat, Filebeat, APM Server, Nodes, and Clusters. As a customer, I would expect to be notified when any of these disappears from the cluster. As for the rule evaluation, we should use an Elasticsearch query to push the missing-entity detection down to Elasticsearch.

The following example detects nodes that drop out of the cluster or stop reporting. The idea is to query Elasticsearch using a range filter that spans the last rule execution and the current rule execution. To determine whether a node has gone missing or is new/recovered, we create two buckets using a filter aggregation, lastPeriod and currentPeriod (each using a range filter); this gives us a document count for each period.

Once we have the document count for each period, we can use a bucket_script named isNodeMissing to evaluate whether the node is missing, by checking that the document count for lastPeriod is greater than 0 and the count for currentPeriod is less than 1. To determine whether a node is recovered or new, we can use a second bucket_script, named isNodeRecoveredOrNew, to check whether lastPeriod is less than 1 and currentPeriod is greater than 0. Each of these bucket scripts returns either 1 or 0, since a bucket script cannot return a boolean.

With isNodeMissing and isNodeRecoveredOrNew, we can use a bucket_selector to only return the nodes where isNodeMissing > 0 or isNodeRecoveredOrNew > 0. In Kibana, we will need to keep track of only the nodes where isNodeMissing === 1 in the rule state. If a node recovers (isNodeRecoveredOrNew === 1), we need to delete the node from the rule state. Finally, for every node we are tracking in the rule state, from past executions and the current one, we need to trigger a "NO DATA" alert every time the rule executes.

Along with the missing nodes, we also need to track the timestamp of the previous execution so we can use it to build the range query that covers both periods. For most of the monitoring data, a 5 minute window for each period should be sufficient. This means we would actually query approximately 10 minutes of data, from the start of the last execution to the end of the current one. In a perfect world we could simply create 2 equal-sized buckets, but unfortunately the Kibana Alerting framework has some drift between executions, which is why we need to use the timestamp of the last execution rather than assuming it never drifts.
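
For illustration, the range bounds could be derived from the stored last-execution timestamp roughly like this (the state field name is hypothetical):

// Derive the query range from the last execution timestamp stored in rule state,
// so drift between executions never opens a gap in coverage.
function getQueryRange(lastExecutionTimestamp: number | undefined, windowMs = 5 * 60 * 1000) {
  const now = Date.now();
  // On the very first execution there is no previous timestamp, so fall back to one window back.
  const lastExecution = lastExecutionTimestamp ?? now - windowMs;
  return {
    gte: new Date(lastExecution - windowMs).toISOString(), // start of lastPeriod
    lte: new Date(now).toISOString(), // end of currentPeriod
  };
}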

In the example query below, I'm just using a 10 minute time range with two equal 5 minute periods, but in the final implementation the lastPeriod bucket should use the last execution time minus the window size (5m), and the range query should span from the last execution timestamp, minus the window, to now.

POST .monitoring-es-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "timestamp": {
              "gte": "now-10m",
              "lte": "now"
            }
          }
        },
        {
          "term": {
            "type": "node_stats"
          }
        }
      ]
    }
  },
  "aggs": {
    "nodes": {
      "composite": {
        "size": 10000,
        "sources": [
          {
            "cluster": {
              "terms": {
                "field": "cluster_uuid"
              }
            }
          },
          {
            "node": {
              "terms": {
                "field": "node_stats.node_id"
              }
            }
          }
        ]
      },
      "aggs": {
        "lastPeriod": {
          "filter": {
            "range": {
              "timestamp": {
                "gte": "now-10m",
                "lte": "now-5m"
              }
            }
          }
        },
        "currentPeriod": {
          "filter": {
            "range": {
              "timestamp": {
                "gte": "now-5m",
                "lte": "now"
              }
            }
          }
        },
        "isNodeMissing": {
          "bucket_script": {
            "buckets_path": {
              "lastPeriod": "lastPeriod>_count",
              "currentPeriod": "currentPeriod>_count"
            },
            "script": "params.lastPeriod > 0 && params.currentPeriod < 1 ? 1 : 0"
          }
        },
        "isNodeRecoveredOrNew": {
          "bucket_script": {
            "buckets_path": {
              "lastPeriod": "lastPeriod>_count",
              "currentPeriod": "currentPeriod>_count"
            },
            "script": "params.lastPeriod < 1 && params.currentPeriod > 0 ? 1 : 0"
          }
        },
        "evaluation": {
          "bucket_selector": {
            "buckets_path": {
              "isNodeMissing": "isNodeMissing",
              "isNodeRecoveredOrNew": "isNodeRecoveredOrNew"
            },
            "script": "params.isNodeMissing > 0 || params.isNodeRecoveredOrNew > 0"
          }
        }
      }
    }
  }
}

This should simplify the Kibana code to just a few parts (see the sketch after this list):

  • Build the query DSL
  • Query Elasticsearch for every page of the composite agg
  • Add/Delete missing entities from rule state
  • Save the current execution timestamp in the rule state
  • Trigger alerts for all the missing entities in the rule state
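
A rough sketch of those parts, assuming hypothetical shapes for the evaluated buckets and the rule state (the actual Kibana rule-type wiring is omitted):

interface MissingDataRuleState {
  lastExecutionTimestamp?: number;
  missingEntities: string[]; // "clusterUuid:nodeId" keys currently considered missing
}

interface EntityEvaluation {
  key: string; // built from the composite agg bucket key (cluster + node)
  isMissing: boolean; // isNodeMissing === 1
  isRecoveredOrNew: boolean; // isNodeRecoveredOrNew === 1
}

// `evaluations` is collected by paging through every page of the composite aggregation.
function updateMissingEntities(
  state: MissingDataRuleState,
  evaluations: EntityEvaluation[],
  now: number
): { nextState: MissingDataRuleState; entitiesToAlert: string[] } {
  const missing = new Set(state.missingEntities);
  for (const evaluation of evaluations) {
    if (evaluation.isMissing) missing.add(evaluation.key); // newly missing entities start alerting
    if (evaluation.isRecoveredOrNew) missing.delete(evaluation.key); // recovered entities stop alerting
  }
  return {
    nextState: { lastExecutionTimestamp: now, missingEntities: Array.from(missing) },
    // Trigger a "NO DATA" alert for every entity still tracked as missing.
    entitiesToAlert: Array.from(missing),
  };
}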

This will also improve the performance of these rules because we only need to query approximately 10 minutes of data instead of looking back 24 hours every time the rule runs. It also eliminates the bug where, after 24 hours, missing nodes appear to recover because they no longer show up in the query.

@miltonhultgren

I'm wondering how these kinds of rules intersect with the planned Health and Topology APIs?
