Convert alerts to use task manager intervals #46001

mikecote · 2019-09-18T12:50:11Z

This portion will require #45152. Once completed, alerting should use the task manager's interval property as well as the update API whenever the interval changes.

elasticmachine · 2019-09-18T12:50:24Z

Pinging @elastic/kibana-stack-services

mikecote · 2019-09-18T12:57:07Z

This will require an update API in task manager in order to update interval value (causing a re-schedule). This is getting more similar to #45152 the more I dig into it. Both issues could be resolved in 1-2 PRs. cc @gmmorris.

mikecote · 2019-09-18T13:15:47Z

Updating issue: This portion will require #45152. Once completed, alerting should use the task manager's interval property as well as the update API whenever the interval changes.

gmmorris · 2019-11-20T12:07:28Z

I've began investigating the changes needed in alerting and a couple of questions spring to mind:

Do we still want to store interval on Alert SavedObjects? If interval is used to schedule the Task, then, in theory, we have no need to store the interval on the Alert, but before I clean off that field I want to make sure this is the case. UPDATE: I have now realised we do need it, as enabling/disabling an Alert deletes the Task, but you don't need to specify the interval when enabling, so we'd need to store the interval somewhere as a source of truth.
How do we reschedule existing Alerts once we've made this change? This is a unique migration as it's not a Alert -> Alert migration, but rather, we want to migrate an Alerting Task based on a version change in such a way that info that currently resides on the Alert is used to migrate the Task. As far as I'm aware, migrations can't currently do this.
Any ideas?

mikecote · 2019-11-20T13:07:05Z

This is a unique migration as it's not a Alert -> Alert migration, but rather, we want to migrate an Alerting Task based on a version change in such a way that info that currently resides on the Alert is used to migrate the Task.

Luckily we don't need to support migrations yet (ex: 7.5 -> 7.6) as we currently don't support existing alerts in pre 7.6 releases.

gmmorris · 2019-11-20T13:10:32Z

Ahh so we're fine with just breaking existing alerts when moving form 7.5 to 7.6?
I guess, as we're not even in beta yet that is reasonable.... lucky 😬

mikecote · 2019-12-10T22:28:17Z

@gmmorris I think we can close this one now that it is no longer relevant and the updated work is tracked in #51575?

gmmorris · 2019-12-30T12:38:43Z

@mikecote I don't think we should close this, as long term, I don't think we want to maintain two separate interval implementations, but I no longer think this is higher priority than the migration to the Kibana Platform.

Shall we move this to 7.7 bellow migration to the platform?

mikecote · 2019-12-31T14:16:53Z

@gmmorris good point, we should re-triage this issue at the next sync. I guess it can be 7.7 or back into "code debt" column.

gmmorris · 2020-04-23T14:31:52Z

This has come up again in #63188 (comment) so we should probably reprioritise this again.

Looking into the above issue we've realised that Alerting needs quite a bit of what Task Manager offers out of the box in terms of retries for scheduled tasks, but it isn't a 1-to-1 fit and we'll likely need to make some changes to TM to properly support Alertings's needs.

Whoever picks up this issue, please read through the comments here: #63188 (comment)

In addition, something worth noting:
https://github.com/elastic/kibana/blob/46901f557b11e3e53256c44534e579712d2e18e9/x-pack/plugins/task_manager/server/task_runner.ts#L237-245

In the above section you'll note that Task Manager uses a different path when a task has a schedule on it, which doesn't currently apply to Alerts, and we want it to. That said - the implementation is problematic for us as it uses now + timeout as the new schedule, when in fact, that might be completely wrong as it might be way too late for an interval of 5s but also way too soon in certain scenarios.

This will have to be rethought, and is likely wrong from TM as a whole, not just Alerting.
I'm not sure what the right solution is, but this seems the wrong scheduling for a timeout. 🤔

mikecote · 2020-05-15T13:27:19Z

Adding a note that we should have a discussion on this issue before starting any development. Just to review what level of effort will be required, what approach and alternatives we have, what problem(s) we're solving, etc.

gmmorris · 2020-09-23T11:00:53Z

The work on TM observability (#77456) has brought to my attention that we can't visualize the "stress" that alerting puts on TM until this issue is done.
As TM has no understanding of Alert tasks as being scheduled tasks (because they reschedule themselves), there is no way for TM to know the cadence of Alerting tasks which means the observability info there isn't quite as useful.

Luckily this issue is prioritised, so we'll have this pre GA, but I wanted to mention this so that we understand just how valuable this issue is. 👍

gmmorris · 2020-09-23T11:01:46Z

Also worth noting - TM doesn't support Day / Hour scheduling, only Seconds and Minutes, so we'll need to add that to TM as part of this work.

mikecote · 2020-10-01T00:37:11Z

I think some of the early issues attempting to do this was OCC issues when a task was running and would / could revert the updated schedule. There's probably some lessons learned from updating alerts w/ OCC logic that could be applied here (cc @pmuellr)

mikecote · 2020-10-01T17:01:36Z

This issue should solve test 1 and 3 from #53650 (comment).

Alert dies after 3 consecutive timeouts

Alerts execution that times out will try again 10 minutes later

Otherwise separate issues should be opened for those as I believe they are the driver why this issue is in To-Do for GA.

gmmorris · 2020-10-01T18:22:57Z

I think some of the early issues attempting to do this was OCC issues when a task was running and would / could revert the updated schedule. There's probably some lessons learned from updating alerts w/ OCC logic that could be applied here (cc @pmuellr)

We might be able to side step this by allowing partial updates of certain parts of the task, but there's a danger here that it might still cause a failure of a running task if we don't limit what fields it touches 🤔

gmmorris · 2020-10-02T09:34:47Z

One important note about this issue - if we do not migrate Alerting to use the Task manager intervals, the alerting tasks will not appear to Task Manager as a recurring task, but rather as a single execution task.

This means that some of the observability added by #77456 won't be as useful for diagnosing alerting problems as they would be otherwise.
This is because some of the workload stats we aggregate include information such as interval frequency (to answer the question of "how many tasks are scheduled to run every 1s?). This will not include alerting tasks as they do not have an interval field on them.

We could address this by providing these stats in the Alerting health endpoint, but that's far from ideal.

Another thought is that we could possibly have an interval field on the alerts even if it isn't used and the alert task reschedules itself (the way it does today). Some investigation might be needed to see if that has a negative impact or not. But it's a thought. 🤷

mikecote · 2020-10-02T20:15:02Z

Side note: this issue should be paired with another person due to the potential complexity it will have.

gmmorris · 2020-10-12T12:26:27Z

I've been spiking this locally and it seems this isn't nearly as complicated as we originally thought. We can break this down into 2 steps:

Add support for h and d intervals to Task Manager to align with Alerting
Add support for schedule in Task return type. This would allow tasks to return a recurring schedule instead of just a runAt when they return a value. This addresses the update/OCC issues we encountered around this in the past.
Add fresh pull of schedule in case it was updated during a task run.

Spiked here: #80149

mikecote added Feature:Alerting Team:Stack Services labels Sep 18, 2019

mikecote self-assigned this Sep 18, 2019

gmmorris self-assigned this Sep 18, 2019

mikecote unassigned gmmorris Sep 18, 2019

This was referenced Nov 14, 2019

[DISCUSS] Task manager update API to allow changing a task's interval #45152

Closed

[Task Manager] Adds a reschedule api to Task Manager #50718

Closed

gmmorris mentioned this issue Nov 20, 2019

[Alerting] Alerting now uses Task Manager intervals internally #51182

Closed

7 tasks

mikecote assigned gmmorris and unassigned mikecote Nov 21, 2019

bmcconaghy added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed Team:Stack Services labels Dec 12, 2019

gmmorris mentioned this issue Dec 19, 2019

[Discuss] Resurrect Zombie Alert tasks #53603

Closed

gmmorris mentioned this issue Apr 23, 2020

Alert types to have customizable execution timeouts / retries #63188

Closed

mikecote unassigned gmmorris May 7, 2020

mikecote mentioned this issue Aug 11, 2020

Alerting GA #74788

Closed

36 tasks

gmmorris mentioned this issue Sep 16, 2020

[Alerting] Exempt Alerts pre 7.10 from RBAC on their Action execution until updated #75563

Merged

3 tasks

mikecote mentioned this issue Sep 30, 2020

Test alerting and action error handling to ensure it works as designed #53650

Closed

gmmorris self-assigned this Oct 5, 2020

gmmorris assigned YulNaumenko Oct 26, 2020

YulNaumenko linked a pull request Oct 30, 2020 that will close this issue

[Task Manager] Changed alerts schedule logic to use Task Manager internals #80149

Merged

3 tasks

YulNaumenko mentioned this issue Oct 30, 2020

[Task Manager] Changed alerts schedule logic to use Task Manager internals #80149

Merged

3 tasks

YulNaumenko closed this as completed in #80149 Nov 2, 2020

kobelb added the needs-team Issues missing a team label label Jan 31, 2022

botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Convert alerts to use task manager intervals #46001

Convert alerts to use task manager intervals #46001

mikecote commented Sep 18, 2019 •

edited

Loading

elasticmachine commented Sep 18, 2019

mikecote commented Sep 18, 2019

mikecote commented Sep 18, 2019

gmmorris commented Nov 20, 2019 •

edited

Loading

mikecote commented Nov 20, 2019

gmmorris commented Nov 20, 2019

mikecote commented Dec 10, 2019

gmmorris commented Dec 30, 2019

mikecote commented Dec 31, 2019

gmmorris commented Apr 23, 2020

mikecote commented May 15, 2020

gmmorris commented Sep 23, 2020

gmmorris commented Sep 23, 2020

mikecote commented Oct 1, 2020

mikecote commented Oct 1, 2020

gmmorris commented Oct 1, 2020

gmmorris commented Oct 2, 2020

mikecote commented Oct 2, 2020

gmmorris commented Oct 12, 2020

Convert alerts to use task manager intervals #46001

Convert alerts to use task manager intervals #46001

Comments

mikecote commented Sep 18, 2019 • edited Loading

elasticmachine commented Sep 18, 2019

mikecote commented Sep 18, 2019

mikecote commented Sep 18, 2019

gmmorris commented Nov 20, 2019 • edited Loading

mikecote commented Nov 20, 2019

gmmorris commented Nov 20, 2019

mikecote commented Dec 10, 2019

gmmorris commented Dec 30, 2019

mikecote commented Dec 31, 2019

gmmorris commented Apr 23, 2020

mikecote commented May 15, 2020

gmmorris commented Sep 23, 2020

gmmorris commented Sep 23, 2020

mikecote commented Oct 1, 2020

mikecote commented Oct 1, 2020

gmmorris commented Oct 1, 2020

gmmorris commented Oct 2, 2020

mikecote commented Oct 2, 2020

gmmorris commented Oct 12, 2020

mikecote commented Sep 18, 2019 •

edited

Loading

gmmorris commented Nov 20, 2019 •

edited

Loading