make 1M1min possible #620
This PR improves performance to make the 1M1min experiment possible.

First, I changed coordinating node pagination from sync to async mode in AnomalyResultTransportAction so that the coordinating node does not have to wait for model nodes' responses before fetching the next page.

Second, during the million-entity evaluation, CPU usage stays around 1% most of the time, with hourly spikes up to 65%. An internal hourly maintenance job can account for the spikes: it saves hundreds of thousands of model checkpoints, clears unused models, and performs bookkeeping for internal states. This PR introduces CheckpointMaintainWorker to spread that resource usage more evenly across a large maintenance window.

Third, when a model is corrupted, I retrigger cold start as mitigation. See ModelManager.score, EntityResultTransportAction, and CheckpointReadWorker.

Testing done:
1. Added unit tests.
2. Manually confirmed 1M1min is possible after the above changes.

Signed-off-by: Kaituo Li <kaituo@amazon.com>
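To make the second point concrete, here is a minimal sketch of spreading checkpoint writes evenly over a maintenance window instead of doing them in one burst. It is not the plugin's CheckpointMaintainWorker: the class name `MaintenanceSpreadSketch`, its methods, and the idea of a fixed gap between writes are assumptions made for illustration only.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: not the plugin's CheckpointMaintainWorker. */
final class MaintenanceSpreadSketch {

    /** Gap between two consecutive checkpoint writes so that all writes fit in the window. */
    static Duration spacing(int modelCount, Duration window) {
        if (modelCount <= 0) {
            throw new IllegalArgumentException("modelCount must be positive");
        }
        return window.dividedBy(modelCount);
    }

    /** Start offset of each write, measured from the beginning of the window. */
    static List<Duration> offsets(int modelCount, Duration window) {
        Duration gap = spacing(modelCount, window);
        List<Duration> result = new ArrayList<>(modelCount);
        for (int i = 0; i < modelCount; i++) {
            // Write i starts i gaps after the window opens.
            result.add(gap.multipliedBy(i));
        }
        return result;
    }
}
```

With a hypothetical 360,000 models and a one-hour window, `spacing` returns 10 ms, so the work becomes a steady trickle rather than an hourly spike.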
Codecov Report
@@ Coverage Diff @@
## main #620 +/- ##
============================================
+ Coverage 78.93% 79.15% +0.21%
- Complexity 4204 4240 +36
============================================
Files 296 301 +5
Lines 17686 17795 +109
Branches 1880 1891 +11
============================================
+ Hits 13960 14085 +125
+ Misses 2826 2818 -8
+ Partials 900 892 -8
Signed-off-by: Kaituo Li <kaituo@amazon.com>
src/main/java/org/opensearch/ad/feature/CompositeRetriever.java
src/main/java/org/opensearch/ad/settings/AnomalyDetectorSettings.java
@@ -304,6 +303,9 @@ class PageListener implements ActionListener<CompositeRetriever.Page> {

    @Override
    public void onResponse(CompositeRetriever.Page entityFeatures) {
        if (pageIterator.hasNext()) {
If there are too many pages and the last page is not reached before the current job interval ends, will the current job keep iterating to the next page while the next job is triggered?
Yes.
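The async pagination pattern discussed in this thread can be sketched as follows: the listener requests the next page before it starts working on the current one, so fetching does not wait on processing. This is a simplified stand-in, not the code in AnomalyResultTransportAction; `PageSource`, `run`, and the single-thread executor are assumptions used to keep the example self-contained and deterministic.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

/** Illustrative only: a toy version of "fetch the next page before handling this one". */
final class AsyncPaginationSketch {

    /** Hypothetical page source that delivers each page through a callback. */
    static final class PageSource {
        private final Iterator<List<String>> pages;
        private final ExecutorService executor;

        PageSource(List<List<String>> pages, ExecutorService executor) {
            this.pages = pages.iterator();
            this.executor = executor;
        }

        boolean hasNext() {
            return pages.hasNext();
        }

        void next(Consumer<List<String>> listener) {
            List<String> page = pages.next();
            // Deliver asynchronously; the caller returns immediately.
            executor.execute(() -> listener.accept(page));
        }
    }

    /** Walks all pages and returns the entities in the order they were dispatched. */
    static List<String> run(List<List<String>> pages) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        List<String> dispatched = new ArrayList<>();
        CountDownLatch done = new CountDownLatch(1);
        PageSource source = new PageSource(pages, executor);
        Consumer<List<String>> listener = new Consumer<List<String>>() {
            @Override
            public void accept(List<String> page) {
                boolean last = !source.hasNext();
                if (!last) {
                    // Kick off the next fetch first, then handle the current page.
                    source.next(this);
                }
                dispatched.addAll(page); // stands in for sending entities to model nodes
                if (last) {
                    done.countDown();
                }
            }
        };
        try {
            if (source.hasNext()) {
                source.next(listener);
                done.await();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
        return dispatched;
    }
}
```

Nothing in this sketch stops a long page walk from overlapping with the next scheduled run, which is the situation the question above asks about.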
@@ -158,14 +162,14 @@ private ActionListener<Optional<AnomalyDetector>> onGetDetector(
     ) {
         return ActionListener.wrap(detectorOptional -> {
             if (!detectorOptional.isPresent()) {
-                listener.onFailure(new EndRunException(detectorId, "AnomalyDetector is not available.", true));
+                listener.onFailure(new EndRunException(detectorId, "AnomalyDetector is not available.", false));
Why change to false? If the detector is not found, do we still need to run the job?
The node holding the detector config might be temporarily unavailable. Changing to false to accommodate that.
                 return;
             }

             AnomalyDetector detector = detectorOptional.get();

             if (request.getEntities() == null) {
-                listener.onResponse(null);
+                listener.onFailure(new EndRunException(detectorId, "Fail to get any entities from request.", false));
For sparse data, it's possible that some intervals have no data. With this change, will we stop the AD job if there is no data for X intervals?
If there is no data, the model node should not receive the request in the first place. If request.getEntities() == null, there might be a bug in our code.
    }
} catch (IllegalArgumentException e) {
    // fail to score likely due to model corruption. Re-cold start to recover.
    cache.get().removeEntityModel(detectorId, modelId);
Should we track how many times model corruption occurs? It's fine to add the stats in the next PR; you could add a TODO here.
I added the tracking code. Will send the new commit.
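The recovery path discussed in this thread (drop the model on a scoring failure, count the event, and let a later cold start rebuild it) could look roughly like the sketch below. It does not reproduce ModelManager.score or the entity cache: the class `CorruptionRecoverySketch`, the in-memory map, and the counter are hypothetical, and treating IllegalArgumentException as the corruption signal simply mirrors the catch clause in the diff above.

```java
import java.util.Map;
import java.util.OptionalDouble;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.ToDoubleFunction;

/** Illustrative only: a toy cache that evicts a model when scoring fails. */
final class CorruptionRecoverySketch {
    private final Map<String, ToDoubleFunction<double[]>> models = new ConcurrentHashMap<>();
    private final AtomicLong corruptionCount = new AtomicLong();

    void putModel(String modelId, ToDoubleFunction<double[]> model) {
        models.put(modelId, model);
    }

    boolean hasModel(String modelId) {
        return models.containsKey(modelId);
    }

    long corruptionCount() {
        return corruptionCount.get();
    }

    /** Returns a score, or empty when the model is missing and needs a cold start. */
    OptionalDouble score(String modelId, double[] point) {
        ToDoubleFunction<double[]> model = models.get(modelId);
        if (model == null) {
            return OptionalDouble.empty();
        }
        try {
            return OptionalDouble.of(model.applyAsDouble(point));
        } catch (IllegalArgumentException e) {
            // Scoring failed, likely due to a corrupted model: evict it so that
            // the next request finds no model and goes through cold start again.
            models.remove(modelId);
            corruptionCount.incrementAndGet();
            return OptionalDouble.empty();
        }
    }
}
```

Keeping the counter next to the eviction means every recovery is recorded exactly once, which is the kind of stat the reviewer asked for.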
Signed-off-by: Kaituo Li <kaituo@amazon.com>
BWC test failed because we deprecated ODFE and the download link is archived.
Have a fix coming soon for this - see #625
Thanks Tyler
Besides BWC tests, do you have other comments? @ohltyler @ylwu-amzn
I've taken a look at your latest commit - LGTM
     .intSetting(
-        "plugins.anomaly_detection.expected_cold_entity_execution_time_in_secs",
-        3,
+        "plugins.anomaly_detection.expected_cold_entity_execution_time_in_millisecs",
Why change from seconds to milliseconds? Will it cause any BWC issue?
It gives more control over the speed. Previously, we could only specify down to 1 second; now we can specify down to 1 millisecond. It won't cause a BWC issue, as these settings are not publicly documented.
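A quick way to see why the finer unit matters is to turn the expected per-entity time into a throughput ceiling. The sketch below is plain arithmetic under an assumption: that the setting acts as a pacing interval between cold-entity executions on one worker. The class and method names are hypothetical, and how the plugin actually consumes the setting is not shown in this thread.

```java
import java.time.Duration;

/** Illustrative only: throughput ceiling implied by a per-entity pacing interval. */
final class ColdEntityPacingSketch {

    /** Upper bound on entities one worker can start per minute at the given pace. */
    static long maxEntitiesPerMinute(Duration expectedExecutionTime) {
        if (expectedExecutionTime.isZero() || expectedExecutionTime.isNegative()) {
            throw new IllegalArgumentException("expectedExecutionTime must be positive");
        }
        // Use nanoseconds so that sub-millisecond inputs do not divide by zero.
        return Duration.ofMinutes(1).toNanos() / expectedExecutionTime.toNanos();
    }
}
```

Under that assumption, a 1-second floor caps a worker at 60 entities per minute, while millisecond granularity allows values such as 250 ms (240 per minute) or 1 ms (60,000 per minute).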
LGTM. Thanks for the change!
LGTM!
Failed tests will be fixed in #625, I think we can merge this and backport first.
* make 1M1min possible This PR improves performance to make the 1M1min experiment possible. First, I changed coordinating node pagination from sync to async mode in AnomalyResultTransportAction so that the coordinating node does not have to wait for model nodes' responses before fetching the next page. Second, during the million-entity evaluation, CPU is mostly around 1% with hourly spikes up to 65%. An internal hourly maintenance job can account for the spike due to saving hundreds of thousands of model checkpoints, clearing unused models, and performing bookkeeping for internal states. This PR evens out the resource usage more fairly across a large maintenance window by introducing CheckpointMaintainWorker. Third, during a model corruption, I retrigger cold start for mitigation. Check ModelManager.score, EntityResultTransportAction, and CheckpointReadWorker. Testing done: 1. Added unit tests. 2. Manually confirmed 1M1min is possible after the above changes. Signed-off-by: Kaituo Li <kaituo@amazon.com> (cherry picked from commit 08fdbdd)
Issues Resolved
#338
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.