
Alert statuses #51099

Closed
mikecote opened this issue Nov 19, 2019 · 17 comments · Fixed by #75553
Labels: discuss · Feature:Alerting · Team:ResponseOps (label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@mikecote
Contributor

To enrich the user experience within the alerts table (under Kibana management section), we should display the status for each alert.

To make sure we're on the same page about which alert statuses we should have, I've opened this issue for discussion. The UI would display them as a column within the alerts table, and there would be a filter for the status. The statuses would be calculated on read, based on the results of a few queries (activity log, alert instances, etc.).

As a starting point, the mockups contain four potential statuses:

  • Active: The alert is actively firing
  • OK: The alert is running periodically and not firing anything
  • Error: The alert is throwing errors during execution
  • No Data: I'm thinking this is for when the alert hasn't run yet?

Is there any proposal for different statuses?

cc @elastic/kibana-stack-services @alexfrancoeur @peterschretlen

@peterschretlen
Contributor

I think those 4 are sufficient and having a small number is preferable.

No Data will depend on the alert type, but I think for a time-series metric it would mean there are no data points in the period being checked (which can happen if a beat is removed or stops sending data, for example).

Say, for a CPU usage alert, if none of my Metricbeat instances have sent data for 1 hour and my alert is "when avg CPU is above 90% over the last 5 minutes", there would be no documents in Elasticsearch and I would expect this to show the "No Data" state.

@pmuellr
Member

pmuellr commented Nov 20, 2019

No Data sounds like:

  • from @mikecote: the alert type function has not yet run for this alert
  • from @peterschretlen: the meaning depends on the alert type; the alert type function may have run, but not done anything "semantically" because it hasn't gotten enough data yet

Both are interesting, but as far as I'm aware we don't have a mechanism that lets an alert type return a "No Data" condition in the sense of Peter's definition.

I'd say get rid of No Data for now, or change it to something like "has not run yet" (Mike's definition). No Data sounds a bit confusing and vague to me.

For the remaining statuses, how do we determine the values - from the last run of the alert function? It either threw an error (Error), ran but scheduled no actions (OK), or ran and scheduled actions (Active). Just the last state seen? If so, perhaps storing that in the alert itself would be appropriate.

Presumably things like muted and throttled show up in a separate column/icon/property indicating those states, so that kind of state isn't appropriate for this "status".
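
To make the mapping concrete, here's a rough TypeScript sketch of that read-time calculation; the type and field names are illustrative, not existing Kibana types.

```ts
// Sketch only: map the outcome of the last execution to a status value.
type AlertStatus = 'active' | 'ok' | 'error' | 'no-data';

interface LastExecution {
  ran: boolean;                 // did the alert type function run at all?
  errored: boolean;             // did it throw?
  scheduledActionCount: number; // how many actions were scheduled?
}

function deriveStatus(last?: LastExecution): AlertStatus {
  if (!last || !last.ran) return 'no-data'; // the "has not run yet" reading
  if (last.errored) return 'error';
  return last.scheduledActionCount > 0 ? 'active' : 'ok';
}
```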

@peterschretlen
Contributor

If someone is authoring an alert, what do we expect them to do in the case where they don't have enough data to evaluate the condition? Throw an error? Return and treat it as normal?

No data/missing data is a pretty common scenario and I think it's an important cue. Data often arrives late, and it's not really an error, but I wouldn't consider it OK either. Some systems will also let you notify on no data. A few examples:

If we don't treat it as a state here, we need to account for it somewhere. I understand if we don't have a mechanism for it, but we could create one. It could, for example, be an expected type of error thrown by an alert execution?

@mikecote
Contributor Author

One option I can see for adding a mechanism to handle the "no data" scenario is to change the return structure of the alert type executor.

Currently it returns something like this:

```js
return {
  // my updated alert level state
};
```

and we could change it to something like this:

```js
return {
  noData: true,
  state: {
    // my updated alert level state
  },
};
```

This should be fairly straightforward to do, and more future-proof if we ever want to return more attributes than state from the executor.

Other options instead of `noData: true` could be `status: 'no-data'` or something similar.
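
As a rough TypeScript sketch of what that shape could look like (the type names and the `noData` field are illustrative, not the framework's actual executor types):

```ts
// Proposed executor return shape: status info plus the updated state.
interface AlertExecutorResult<State extends Record<string, unknown>> {
  noData?: boolean; // alternatively: status?: 'no-data'
  state: State;     // the updated alert-level state
}

// Hypothetical executor returning the new shape when its query finds nothing.
async function executor(): Promise<AlertExecutorResult<{ lastValue?: number }>> {
  const hits: number[] = []; // imagine the alert's query results here
  if (hits.length === 0) {
    return { noData: true, state: {} };
  }
  return { state: { lastValue: hits[hits.length - 1] } };
}
```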

@mikecote
Contributor Author

mikecote commented Nov 21, 2019

For the remaining, how do we determine these values - the last state when the alert function ran?

From how I see it, yes it would be based on the last execution / interval.

If so, perhaps storing that in the alert itself would be appropriate.

I think since we'll have a filter in the UI for statuses, it would make sense to store the status with the alert for searchability. After each execution, we would do an update on the alert document to update its status.
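
Roughly, the task runner could do something like the following after each execution (a sketch only; the client shape and attribute names are assumptions, not the exact Kibana APIs):

```ts
// Sketch: persist the computed status onto the alert saved object after each
// run, so the UI filter can query on it.
interface MinimalSavedObjectsClient {
  update(type: string, id: string, attributes: object): Promise<unknown>;
}

async function persistAlertStatus(
  soClient: MinimalSavedObjectsClient,
  alertId: string,
  status: 'ok' | 'active' | 'error'
): Promise<void> {
  // Partial update touching only the status fields, ideally piggy-backed on
  // (or next to) the update that schedules the next run.
  await soClient.update('alert', alertId, {
    executionStatus: { status, lastExecutionDate: new Date().toISOString() },
  });
}
```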

@pmuellr
Member

pmuellr commented Nov 21, 2019

re: the "no data" status

It sounds like this could just be treated as an action group, for alert types that are sensitive to this. E.g., if they didn't have enough data, they'd schedule the action group "no-data", and could have whatever actions they wanted associated with that.

That would at least make that state "actionable", but it wouldn't let it show up as a "status" value without some kind of API change, such as what Mike suggested.

If we end up making this part of the API signature and the alert status, it feels like "not enough data" is probably better phrasing than "no data". Maybe something in the vein of "inconclusive" or such ...
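
A minimal sketch of the action-group idea, assuming an executor that receives an alertInstanceFactory-style service (as Kibana alert executors do); the 'no-data' group id, instance id, and context fields are illustrative:

```ts
// Trimmed-down service shapes, just enough to show the idea.
interface AlertInstance {
  scheduleActions(group: string, context?: Record<string, unknown>): void;
}
interface ExecutorServices {
  alertInstanceFactory(id: string): AlertInstance;
}

async function executor({ services }: { services: ExecutorServices }) {
  const docsFound = 0; // imagine the alert's query result here
  if (docsFound === 0) {
    // Fire a dedicated "no-data" group so users can attach actions to it.
    services.alertInstanceFactory('no-data').scheduleActions('no-data', {
      reason: 'no documents in the checked time window',
    });
  }
  return { state: {} };
}
```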

@pmuellr
Member

pmuellr commented Nov 21, 2019

After each execution, we would do an update on the alert document to update its status

Ya, what I was thinking. Hopefully we can piggy-back this on top of an existing update, like the scheduling of the next run.

This also means we won't need the event log to determine that status ...

@mdefazio
Contributor

mdefazio commented Dec 6, 2019

Posting this question here instead of slack:

If an alert is disabled, is the status then also disabled or is it the last status before it was disabled?

Perhaps just No data?

@mdefazio
Contributor

mdefazio commented Dec 6, 2019

Also, is the warning level a status?

@peterschretlen
Contributor

Also, is the warning level a status?

I think the status would be active (active = has one or more alert instances)?

If an alert is disabled, is the status then also disabled or is it the last status before it was disabled?

A disabled alert has no status - could it be blank? If we need a value to filter on, then I think disabled as a state is OK. No data has a special meaning; I don't think it works for a disabled alert.

@peterschretlen
Contributor

Repeating a comment from #58366 (comment): we should be able to filter alerts by their status if possible.
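
For example, once the status is stored on the alert saved object, something along these lines might work (a sketch only; the HTTP route and the KQL attribute path are assumptions, not verified against a specific Kibana version):

```ts
// Sketch: query alerts by execution status over HTTP.
async function findAlertsByStatus(
  kibanaUrl: string,
  status: 'ok' | 'active' | 'error'
): Promise<unknown> {
  const params = new URLSearchParams({
    filter: `alert.attributes.executionStatus.status: ${status}`, // assumed path
  });
  const res = await fetch(`${kibanaUrl}/api/alerts/_find?${params.toString()}`, {
    headers: { 'kbn-xsrf': 'true' },
  });
  return res.json();
}
```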

@pmuellr
Member

pmuellr commented Jul 13, 2020

One thing not mentioned yet is "alert instance status". It seems like an alert instance can have most of the status values of the alert itself, except perhaps "error", since "error" indicates the alert executor ran into some problem. Note this specifically includes "no data", as some alert types may know the possible domain of their instances and be able to determine whether an instance has not produced data. But not all alert types will be able to do this - index threshold, for instance, doesn't know the domain of the possible groupings it uses for its instance IDs.

@pmuellr
Member

pmuellr commented Aug 18, 2020

It just occurred to me in a chat with Mike: we'll have the opportunity to "migrate" old alerts to contain data in this new executionStatus object, but how could we possibly get data to put into it? Presumably we could get some parts of it from the alert state, but I don't think you can access other SOs during a migration (seems horribly complicated!).

I think we're only talking about the status and date fields - the error field can always be null.

And it's not really what's in the SO itself that matters, but what we return from alertsClient methods and HTTP requests. So, do we want these to be optional? What a PITA that would be, when the only possible time they could be null is right after a migration, up until the alert function is executed for the first time after the migration.

I'm thinking we can have another status value of "unknown" that we can use in a case like this; it may come in handy later as well. We'll want to add a release note about this if it ends up showing up in the UI - not sure whether it will.

I don't think we will, looking at the current web UI. But that made me realize we probably want this new status field in the alerts table view:

[screenshot of the alerts table view]
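
Back to the migration point above: a sketch of how existing alerts could pick up the "unknown" status during a saved-object migration (the document shape and values are assumptions, not the shipped migration):

```ts
// Trimmed-down alert saved-object document.
interface AlertDoc {
  attributes: Record<string, unknown>;
}

const migrateAddExecutionStatus = (doc: AlertDoc): AlertDoc => ({
  ...doc,
  attributes: {
    ...doc.attributes,
    // There's no way to know the real status during a migration, so mark it
    // unknown until the alert executor runs again and overwrites it.
    executionStatus: {
      status: 'unknown',
      lastExecutionDate: new Date(0).toISOString(),
      error: null,
    },
  },
});
```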

pmuellr added a commit to pmuellr/kibana that referenced this issue Aug 20, 2020
[Alerting] formalize alert status and add status fields to alert saved object

resolves elastic#51099

This formalizes the concept of "alert status" in terms of its execution, with
some new fields in the alert saved object and types used with the alert client
and http APIs.

These fields are read-only from the client point-of-view; they are provided in
the alert structures, but are only updated by the alerting framework itself.
The values will be updated after each run of the alert type executor.
pmuellr added a commit to pmuellr/kibana that referenced this issue Sep 19, 2020
During development of elastic#75553,
some issues came up with the optimistic concurrency control (OCC) we
were using internally within the alertsClient, via the `version`
option/property of the saved object.  The referenced PR updates new
fields in the alert from the taskManager task after the alertType
executor runs.  In some alertsClient methods, OCC is used to update
the alert which are requested via user requests.  And so in some
cases, version conflict errors were coming up when the alert was
updated by task manager, in the middle of one of these methods.  Note:
the SIEM function test cases stress test this REALLY well.

In this PR, we remove OCC from methods that were currently using it,
namely `update()`, `updateApiKey()`, `enable()`, `disable()`, and the
`[un]mute[All,Instance]()` methods.  Of these methods, OCC is really
only _practically_ needed by `update()`, but even for that we don't
support OCC in the API, yet; see: issue elastic#74381 .

For cases where we know that only attributes not contributing to AAD are
being updated, a new function is provided that does a partial update on
just those attributes, making such updates a bit safer.  That will be
used by PR elastic#75553.
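
Schematically, that partial-update approach could look like this (a simplified sketch; the client shape and the retry condition are assumptions, not the exact implementation in the referenced PR):

```ts
// Trimmed-down saved-objects client.
interface MinimalSoClient {
  update(type: string, id: string, attributes: object): Promise<unknown>;
}

async function partialUpdateAlertAttributes(
  soClient: MinimalSoClient,
  alertId: string,
  attributes: { executionStatus?: object; scheduledTaskId?: string }, // non-AAD fields only
  retries = 2
): Promise<void> {
  try {
    // No OCC version is passed: these framework-owned fields never contribute
    // to AAD, so a last-writer-wins partial update is acceptable here.
    await soClient.update('alert', alertId, attributes);
  } catch (err) {
    // Real code would retry only on a version-conflict (409) error.
    if (retries > 0) {
      return partialUpdateAlertAttributes(soClient, alertId, attributes, retries - 1);
    }
    throw err;
  }
}
```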

pmuellr added a commit that referenced this issue Oct 1, 2020
[Alerting] formalize alert status and add status fields to alert saved object (#75553)

resolves #51099

This formalizes the concept of "alert status" in terms of its execution, with
some new fields in the alert saved object and types used with the alert client
and http APIs.

These fields are read-only from the client point-of-view; they are provided in
the alert structures, but are only updated by the alerting framework itself.
The values will be updated after each run of the alert type executor.

The data is added to the alert as the `executionStatus` field, with the
following shape:

```ts
interface AlertExecutionStatus {
  status: 'ok' | 'active' | 'error' | 'pending' | 'unknown';
  lastExecutionDate: Date;
  error?: {
    reason: 'read' | 'decrypt' | 'execute' | 'unknown';
    message: string;
  };
}
```
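
For consumers of the alerts client / HTTP API, reading the new field might look roughly like this (a sketch only; the alert shape is trimmed to the fields used here):

```ts
// Trimmed alert shape as returned by the API, per the interface above.
interface AlertWithStatus {
  id: string;
  name: string;
  executionStatus: {
    status: 'ok' | 'active' | 'error' | 'pending' | 'unknown';
    lastExecutionDate: Date;
    error?: { reason: 'read' | 'decrypt' | 'execute' | 'unknown'; message: string };
  };
}

// Map the execution status to a label for a UI badge or table column.
function statusBadge(alert: AlertWithStatus): string {
  const { status, error } = alert.executionStatus;
  if (status === 'error' && error) {
    return `Error (${error.reason}): ${error.message}`;
  }
  return status === 'pending' ? 'Pending (has not run yet)' : status.toUpperCase();
}
```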
@kobelb added the needs-team (issues missing a team label) label on Jan 31, 2022
@botelastic bot removed the needs-team label on Jan 31, 2022