Expose model accuracy metrics in tests #600
Conversation
LGTM, thanks for adding this!
Can you add this under 'Enhancements' in the 2.1 release notes?
This PR adds an option flag to print logs during tests and turns the flag on in the CI workflow. The flag is disabled by default. This lets us record model accuracy metrics in GitHub workflows and later retrieve them for analysis.
Testing done:
1. We can turn logging on and off during tests.
2. The accuracy logs are recorded.
Signed-off-by: Kaituo Li <kaituo@amazon.com>
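As a rough illustration of how such a flag could be wired, here is a minimal sketch in Java. The system property name (`test.logs`) and helper class are hypothetical; the plugin's actual flag and wiring may differ.

```java
// Minimal sketch (not the plugin's actual implementation): accuracy logging is
// gated behind a system property that a CI workflow could pass to the test JVM,
// for example with -Dtest.logs=true. The property name and class are hypothetical.
public final class TestAccuracyLog {
    // Disabled by default; CI turns it on explicitly.
    private static final boolean ENABLED =
        Boolean.parseBoolean(System.getProperty("test.logs", "false"));

    private TestAccuracyLog() {}

    public static void record(String model, double precision, double recall) {
        if (ENABLED) {
            // Printing to standard output makes the metrics appear in the CI job log,
            // where they can be retrieved later for analysis.
            System.out.printf("model=%s precision=%.3f recall=%.3f%n", model, precision, recall);
        }
    }
}
```

A test would then call something like `TestAccuracyLog.record("hcad", precision, recall)` after evaluating a detector, and the CI workflow would set the property for its test run.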
Added.
Codecov Report
@@             Coverage Diff              @@
##               main     #600      +/-   ##
============================================
- Coverage     79.21%   79.02%   -0.20%
+ Complexity     4222     4207      -15
============================================
  Files           296      296
  Lines         17686    17686
  Branches       1880     1880
============================================
- Hits          14010    13976      -34
- Misses         2783     2811      +28
- Partials        893      899       +6
Flags with carried forward coverage won't be shown.
LGTM. You might want to check other info-level logs just to make sure you don't print too much verbose output during testing.
* Expose model accuracy metrics in tests: This PR adds an option flag to print logs during tests and turns the flag on in the CI workflow. The flag is disabled by default. This lets us record model accuracy metrics in GitHub workflows and later retrieve them for analysis. Testing done: 1. We can turn logging on and off during tests. 2. The accuracy logs are recorded. Signed-off-by: Kaituo Li <kaituo@amazon.com> (cherry picked from commit f630c8f)
This PR adds an HCAD model performance benchmark so that we can compare model performance across versions. For benchmark data, we randomly generated synthetic data with known anomalies inserted throughout the signal. In particular, these are one-, two-, and four-dimensional data sets where each dimension is a noisy cosine wave. Anomalies are inserted into one dimension with 0.003 probability, and anomalies across dimensions can be independent or dependent. Each data set has approximately 5,000 observations. The data sets are generated with the same random seed so that results are comparable across versions. We also backported opensearch-project#600 so that we can capture the performance data in CI output. Testing done: 1. Added unit tests to run the benchmark. Signed-off-by: Kaituo Li <kaituo@amazon.com>
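As a rough sketch of the kind of generator described above: each dimension is a noisy cosine wave, anomalies are injected with probability 0.003 into one dimension, and a fixed seed keeps runs comparable. The noise level, period, and spike magnitude below are assumptions; the repository's actual generator may differ.

```java
import java.util.Random;

// Illustrative generator for the synthetic benchmark data described above.
// Noise level, period, and spike size are assumptions made for this sketch.
public class SyntheticCosine {
    public static double[][] generate(int points, int dimensions, long seed) {
        Random rng = new Random(seed); // fixed seed keeps results comparable across versions
        double[][] data = new double[points][dimensions];
        for (int t = 0; t < points; t++) {
            boolean anomalous = rng.nextDouble() < 0.003; // rare anomaly at this timestamp
            for (int d = 0; d < dimensions; d++) {
                double base = Math.cos(2.0 * Math.PI * t / 100.0); // periodic cosine signal
                double noise = 0.1 * rng.nextGaussian();           // additive Gaussian noise
                double spike = (anomalous && d == 0) ? 5.0 : 0.0;  // inject into one dimension
                data[t][d] = base + noise + spike;
            }
        }
        return data;
    }

    public static void main(String[] args) {
        double[][] series = generate(5000, 2, 42L); // ~5,000 observations per data set
        System.out.println("generated " + series.length + " points");
    }
}
```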
This PR adds an AD model performance benchmark so that we can compare model performance across versions. For the single-stream detector, we refactored the tests in DetectionResultEvalutationIT and moved them to SingleStreamModelPerfIT. For the HCAD detector, we use the randomly generated synthetic data described above (noisy cosine waves with known anomalies inserted throughout the signal). We also backported opensearch-project#600 so that we can capture the performance data in CI output, and fixed opensearch-project#712 by revising the client setup code. Testing done: added unit tests to run the benchmark. Signed-off-by: Kaituo Li <kaituo@amazon.com>
This PR adds an AD model performance benchmark so that we can compare model performance across versions. We run the benchmark in separate GitHub workflows since it can be time consuming; for example, running the HCAD benchmark alone takes 25+ minutes in 1.1. We also print the benchmarking results to standard output for recording purposes. For HCAD, we use the randomly generated synthetic data described above, generated with the same random seed so that results are comparable across versions. For single-stream detectors, we use a curated data set with known anomaly windows. We also backported opensearch-project#600 so that we can capture the performance data in CI output. Testing done: 1. Added unit tests to run the benchmark. Signed-off-by: Kaituo Li <kaituo@amazon.com>
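For the curated single-stream data with known anomaly windows, window-level precision and recall are one plausible way to summarize accuracy. The sketch below shows that scoring scheme; it is not necessarily the exact evaluation used by SingleStreamModelPerfIT.

```java
import java.util.List;

// Sketch of window-level precision/recall: a detection is a true positive if it
// falls inside any labeled anomaly window, and a window is "caught" if at least
// one detection lands inside it. This is one plausible scoring scheme, not
// necessarily the one the benchmark uses.
public class WindowMetrics {
    public record Window(long start, long end) {
        boolean contains(long t) { return t >= start && t <= end; }
    }

    public static double[] precisionRecall(List<Long> detections, List<Window> windows) {
        long truePositives = detections.stream()
            .filter(t -> windows.stream().anyMatch(w -> w.contains(t)))
            .count();
        long caughtWindows = windows.stream()
            .filter(w -> detections.stream().anyMatch(w::contains))
            .count();
        double precision = detections.isEmpty() ? 0.0 : (double) truePositives / detections.size();
        double recall = windows.isEmpty() ? 0.0 : (double) caughtWindows / windows.size();
        return new double[] { precision, recall };
    }
}
```

Precision here counts the fraction of detections that land in some labeled window; recall counts the fraction of windows that receive at least one detection.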
Description
This PR adds an option flag to print logs during tests and turns the flag on in the CI workflow. The flag is disabled by default. This lets us record model accuracy metrics in GitHub workflows and later retrieve them for analysis.
Testing done:
1. We can turn logging on and off during tests.
2. The accuracy logs are recorded.
Signed-off-by: Kaituo Li <kaituo@amazon.com>
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.