[Monitoring] Disk usage alerting #75419

Merged 28 commits on Sep 30, 2020

@igoristic igoristic commented Aug 19, 2020

Resolves #74819

This is part of the "Additional Alerting" effort for Stack Monitoring
Screen Shot 2020-09-02 at 9 51 45 AM

The check evaluates each data node to make sure its disk usage is below the configured threshold.
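
As a rough illustration of the per-node check described above, the sketch below flags every node whose used-disk percentage exceeds a threshold. The `NodeDiskStats` shape, the `findNodesOverThreshold` name, and the sample numbers are assumptions made for illustration; they are not taken from the code in this PR.

```typescript
// Hypothetical shape; the real alert reads these values from monitoring data.
interface NodeDiskStats {
  nodeId: string;
  totalBytes: number;
  availableBytes: number;
}

// Returns the ids of nodes whose used-disk percentage is above `thresholdPct`.
function findNodesOverThreshold(nodes: NodeDiskStats[], thresholdPct: number): string[] {
  return nodes
    .filter((node) => node.totalBytes > 0) // skip nodes without usable stats
    .filter((node) => {
      const usedPct = ((node.totalBytes - node.availableBytes) / node.totalBytes) * 100;
      return usedPct > thresholdPct;
    })
    .map((node) => node.nodeId);
}

const sampleNodes: NodeDiskStats[] = [
  { nodeId: 'node-a', totalBytes: 1000, availableBytes: 100 }, // roughly 90% used
  { nodeId: 'node-b', totalBytes: 1000, availableBytes: 700 }, // roughly 30% used
];
const overThreshold = findNodesOverThreshold(sampleNodes, 80);
```

With a very low threshold such as the 2% suggested in the testing steps, both sample nodes would be flagged, which is why a low value is convenient for triggering the alert manually.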

Testing:

  1. Create a Stack Monitoring environment
  2. Through the regular Setup Mode > Alert Edit flow/ux, set the threshold to something low (like 2%)
    Screen Shot 2020-09-02 at 9 51 29 AM

@igoristic igoristic requested a review from a team September 2, 2020 13:48
@igoristic igoristic marked this pull request as ready for review September 2, 2020 13:49
@elasticmachine

Pinging @elastic/stack-monitoring (Team:Monitoring)

@chrisronline chrisronline left a comment

Overall, nice work so far!

Found a couple of things right away that are making it hard to continue testing.

@chrisronline chrisronline left a comment

Added another comment about the next steps; hopefully we can get feedback from Ravi.

@chrisronline chrisronline left a comment

@chrisronline

image

Also, @hbharding came up with some changes to the panel that I think will make it look better. WDYT?

@igoristic

> Also, @hbharding came up with some changes to the panel that I think will make it look better. WDYT?

This seems like it should be a separate issue, maybe?

Or I could try doing it here; I just feel like there'll be some back and forth if there isn't a picture of how it should look.

@chrisronline

> This seems like it should be a separate issue, maybe?

Yes, good point! Let's tackle it separately

createLink(
'xpack.monitoring.alerts.diskUsage.ui.nextSteps.resizeYourDeployment',
'Resize your deployment (ECE)',
`{elasticWebsiteUrl}/guide/en/cloud-enterprise/{docLinkVersion}/ece-resize-deployment.html`

}, [] as string[]);
firingNodeUuids.sort(); // It doesn't matter how we sort, but keep the order consistent
- const instanceId = `${this.type}:${cluster.clusterUuid}:${firingNodeUuids.join(',')}`;
+ const instanceId = `.monitoring:${this.type}:${cluster.clusterUuid}`;

While this does help maintain instance state after an alert resolves, it introduces another problem. Throttling is enforced per unique instance id, meaning that if you create an instance with a previously used id, its actions are subject to the throttle period that started the first time that id was used.

In this case, since the instance id is based on the cluster id, it never changes for a given cluster, even if the set of firing nodes changes.

Imagine a 3-node cluster (A, B, C) where node A is firing an alert. The instance id will be .monitoring:cpu_usage:clusterUuid, the actions will fire, and throttling will start for all .monitoring:cpu_usage:clusterUuid instances. Now imagine node A resolves itself, but node B starts firing an alert. The alert will run and try to fire the actions, but they will be subject to the throttle period (which by default is 1d), so users wouldn't see any messaging about it.

This is why I originally built the id from firingNodeUuids: to ensure we generated a unique instance id based on what was actually firing.

I'm not sure we can go in this direction, because I worry that our alerting will miss valid cases where it should send actions and we will lose our users' trust.

WDYT?
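
To make the throttling concern above concrete, here is a minimal sketch of the A-then-B scenario. `ToyThrottle` is a hypothetical stand-in that only mimics the per-instance-id throttling described in the comment; it is not the alerting framework's implementation, and the helper names, the `'abc'` cluster id, and the one-hour gap are assumptions for illustration.

```typescript
// Toy throttle keyed by instance id (assumed behaviour, for illustration only).
class ToyThrottle {
  private lastFired: Record<string, number> = {};

  constructor(private readonly periodMs: number) {}

  // Returns true if actions would run for this instance id at time `nowMs`.
  tryFire(instanceId: string, nowMs: number): boolean {
    const last = this.lastFired[instanceId];
    if (last !== undefined && nowMs - last < this.periodMs) {
      return false; // still inside the throttle window started earlier
    }
    this.lastFired[instanceId] = nowMs;
    return true;
  }
}

const ONE_HOUR_MS = 60 * 60 * 1000;
const ONE_DAY_MS = 24 * ONE_HOUR_MS;

// Id scoped to the cluster only (the new approach in the diff).
const clusterScopedId = (type: string, clusterUuid: string): string =>
  `.monitoring:${type}:${clusterUuid}`;

// Id that also encodes which nodes are firing (the original approach).
const nodeScopedId = (type: string, clusterUuid: string, firingNodeUuids: string[]): string =>
  `${type}:${clusterUuid}:${firingNodeUuids.slice().sort().join(',')}`;

// Node A fires at t=0; node A resolves and node B fires one hour later.
const clusterThrottle = new ToyThrottle(ONE_DAY_MS);
const clusterFirst = clusterThrottle.tryFire(clusterScopedId('cpu_usage', 'abc'), 0);
const clusterSecond = clusterThrottle.tryFire(clusterScopedId('cpu_usage', 'abc'), ONE_HOUR_MS);

const nodeThrottle = new ToyThrottle(ONE_DAY_MS);
const nodeFirst = nodeThrottle.tryFire(nodeScopedId('cpu_usage', 'abc', ['A']), 0);
const nodeSecond = nodeThrottle.tryFire(nodeScopedId('cpu_usage', 'abc', ['B']), ONE_HOUR_MS);
```

In this toy model the second cluster-scoped attempt is suppressed because it reuses the id from node A's alert, while the node-scoped id for B is new and fires. The trade-off noted in the comment still applies: the cluster-scoped id makes it easier to keep instance state across a resolution.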

oldState.ui.isFiring !== newAlertState.ui.isFiring &&
oldState.ui.resolvedMS !== newAlertState.ui.resolvedMS
);
if (!relatedOldState) {

I'm not sure I understand the logic here.

If the old state is for the same node as the new state, but isFiring flipped and resolvedMS is different, that means we need to fire a resolution? I'm not sure we even need to look at resolvedMS. If isFiring went from true to false, then we need to fire resolved actions.

Or am I just not reading this properly?
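
The simpler rule suggested in this comment (a node is resolved when isFiring goes from true to false, without consulting resolvedMS) could be sketched as below. The `NodeAlertState` shape and the `nodeId` field are hypothetical simplifications; only `ui.isFiring` and `ui.resolvedMS` come from the snippet under review.

```typescript
// Hypothetical, simplified per-node state.
interface NodeAlertState {
  nodeId: string;
  ui: { isFiring: boolean; resolvedMS: number };
}

// A node counts as resolved when it was firing before and is not firing now.
function findResolvedNodes(oldStates: NodeAlertState[], newStates: NodeAlertState[]): string[] {
  const resolved: string[] = [];
  for (const next of newStates) {
    // Match the previous state for the same node, if there is one.
    const previous = oldStates.filter((s) => s.nodeId === next.nodeId)[0];
    if (previous && previous.ui.isFiring && !next.ui.isFiring) {
      resolved.push(next.nodeId);
    }
  }
  return resolved;
}

const previousStates: NodeAlertState[] = [
  { nodeId: 'A', ui: { isFiring: true, resolvedMS: 0 } },
  { nodeId: 'B', ui: { isFiring: false, resolvedMS: 0 } },
];
const currentStates: NodeAlertState[] = [
  { nodeId: 'A', ui: { isFiring: false, resolvedMS: 1000 } }, // flipped true -> false
  { nodeId: 'B', ui: { isFiring: true, resolvedMS: 0 } }, // newly firing, not resolved
  { nodeId: 'C', ui: { isFiring: false, resolvedMS: 0 } }, // no previous state
];
const resolvedNodes = findResolvedNodes(previousStates, currentStates);
```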

}

if (deltaFiringStates.length) {
const instance = services.alertInstanceFactory(`${deltaInstanceIdPrefix}:firing`);

I like this idea a lot - We can just execute actions off a unique instance id for firing and resolved
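
A small sketch of that idea follows: one instance id per kind of state change, so firing and resolved actions are keyed separately. The `:firing` suffix appears in the snippet above; the `:resolved` suffix, the function names, and the sample prefix are assumptions for illustration.

```typescript
type DeltaKind = 'firing' | 'resolved';

// Builds one id per delta kind from a shared prefix.
function deltaInstanceId(prefix: string, kind: DeltaKind): string {
  return `${prefix}:${kind}`;
}

// Lists the instance ids that would need actions scheduled for this run.
function instanceIdsToSchedule(
  prefix: string,
  firingCount: number,
  resolvedCount: number
): string[] {
  const ids: string[] = [];
  if (firingCount > 0) ids.push(deltaInstanceId(prefix, 'firing'));
  if (resolvedCount > 0) ids.push(deltaInstanceId(prefix, 'resolved'));
  return ids;
}

// Example: one node started firing and one resolved during the same run.
const scheduledIds = instanceIdsToSchedule('example-prefix', 1, 1);
```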

}
}

protected async processData(

What if we did something like this for processData to handle the resolutions?

https://gist.github.com/chrisronline/8cc094cd1876e895746d5e91db84be7c

This working properly actually uncovers a UX issue in the UI where we don't really surface this well. We should really defer this to a separate PR IMO.

@chrisronline chrisronline left a comment

Looking good! A couple of things I noticed

@kibanamachine

💚 Build Succeeded

Metrics [docs]

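@kbn/optimizer bundle module count

| id | value | diff | baseline |
| --- | --- | --- | --- |
| monitoring | 628 | +4 | 624 |

async chunks size

| id | value | diff | baseline |
| --- | --- | --- | --- |
| monitoring | 1.2MB | +137.0B | 1.2MB |

distributable file count

| id | value | diff | baseline |
| --- | --- | --- | --- |
| default | 45784 | +3 | 45781 |

page load bundle size

| id | value | diff | baseline |
| --- | --- | --- | --- |
| monitoring | 183.3KB | +22.7KB | 160.6KB |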

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@chrisronline chrisronline left a comment

LGTM! Awesome job!

@igoristic igoristic merged commit c49d546 into elastic:master Sep 30, 2020
@igoristic igoristic deleted the disk-usage-alerting branch September 30, 2020 16:38
igoristic added a commit that referenced this pull request Sep 30, 2020
* Disk usage alert draft

* Fixed typings and defaults

* Fixed tests

* Fixed tests

* Addressed code feedback

* Fixed disk and cpu usage states

* Fixed resolve state and throttle

* CR feedback

* Fixed links
@igoristic

Backport:
7.x: bf93191

gmmorris added a commit to gmmorris/kibana that referenced this pull request Sep 30, 2020
* master: (97 commits)
  [Actions] Adds a "Test Connector" button on the Connectors List to make discovery of the Test tab easier (elastic#78746)
  [Discover] Fix functional time picker test permissions (elastic#78564)
  [ML] Fixing module datafeed overrides (elastic#78925)
  Adds some missing licenses to the CSV export (elastic#78719)
  [dev/cli] ensure plugins/ and all watch source dirs exist (elastic#78973)
  [Lens] Stop using scripted metric to collect telemetry (elastic#78687)
  [Lens] fix wrong message in fields accordion (elastic#78924)
  [Enterprise Search][App Search] Credentials Logic updates (elastic#78644)
  [Monitoring] Disk usage alerting (elastic#75419)
  [SECURITY_SOLUTION] Trusted apps list expand/collapse details (elastic#78601)
  Update content on interstitial page (elastic#78881)
  chore(NA): include hjson as a prod dependency (elastic#78941)
  Fix empty meta fields input in Advanced Settings  (elastic#78576)
  [Lens] Maintain order of operations in dimension panel (elastic#78864)
  Fix plugin doc title (elastic#78880)
  load apm-rum agent lazily (elastic#78760)
  [ML] Skip full ML access permission test
  Optimize charts plugin (elastic#78922)
  ui_actions service initial docs (elastic#78902)
  skip failing suite (elastic#78942)
  ...

Successfully merging this pull request may close these issues.

[Monitoring][Additional-Alerting] Disk Capacity
4 participants