[BUG] A sufficiently small interval value on a histogram can crash the node #14558

Closed
icercel opened this issue Jun 26, 2024 · 2 comments · Fixed by #14754


icercel commented Jun 26, 2024

Describe the bug

Given an index with a long field whose values span an extreme min/max range, requesting a histogram aggregation on that field with a small interval value crashes the node with an OutOfMemoryError (OOM).
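
As a rough back-of-the-envelope estimate (my reasoning, not taken from OpenSearch internals): with the default min_doc_count of 0, a histogram ends up with roughly (max - min) / interval + 1 buckets once the empty buckets between the extremes are filled in. For the two sample documents below (values 1 and 1234567890), an interval of 100 works out to roughly 12.3 million buckets, far beyond the default search.max_buckets of 65535, and materializing them during the reduce phase exhausts the default 512m heap.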

Related component

Search:Aggregations

To Reproduce

  1. Use the default docker-compose provided on the OpenSearch site (it uses :latest, which at the time of writing is 2.15.0)

  2. Add two documents:

curl -k -XPUT -u "admin:$OPENSEARCH_INITIAL_ADMIN_PASSWORD" \
  'https://localhost:9200/sample-index/_doc/1' \
  -H 'Content-Type: application/json' \
  -d '{"some_value": 1}'

curl -k -XPUT -u "admin:$OPENSEARCH_INITIAL_ADMIN_PASSWORD" \
  'https://localhost:9200/sample-index/_doc/2' \
  -H 'Content-Type: application/json' \
  -d '{"some_value": 1234567890}'
  3. Attempt a histogram with a sufficiently large interval:
curl -k -XGET -u "admin:$OPENSEARCH_INITIAL_ADMIN_PASSWORD" \
  'https://localhost:9200/sample-index/_search' \
  -H 'Content-Type: application/json' \
  -d '{"size":0, "aggs": { "test": { "histogram": { "field": "some_value", "interval": 300000000 }}}}'
  4. OpenSearch correctly (I think) returns the buckets:
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "test": {
      "buckets": [
        {
          "key": 0,
          "doc_count": 1
        },
        {
          "key": 300000000,
          "doc_count": 0
        },
        {
          "key": 600000000,
          "doc_count": 0
        },
        {
          "key": 900000000,
          "doc_count": 0
        },
        {
          "key": 1200000000,
          "doc_count": 1
        }
      ]
    }
  }
}
  5. Change the interval value to 1000:
curl -k -XGET -u "admin:$OPENSEARCH_INITIAL_ADMIN_PASSWORD"  \
 'https://localhost:9200/sample-index/_search' \
 -H 'Content-Type: application/json' \
 -d '{"size":0, "aggs": { "test": { "histogram": { "field": "some_value", "interval": 1000 }}}}'
  6. OpenSearch correctly responds with:
{
  "error": {
    "root_cause": [],
    "type": "search_phase_execution_exception",
    "reason": "",
    "phase": "fetch",
    "grouped": true,
    "failed_shards": [],
    "caused_by": {
      "type": "too_many_buckets_exception",
      "reason": "Trying to create too many buckets. Must be less than or equal to: [65535] but was [1234568]. This limit can be set by changing the [search.max_buckets] cluster level setting.",
      "max_buckets": 65535
    }
  },
  "status": 503
}
  7. Change the interval to 100:
curl -k -XGET -u "admin:$OPENSEARCH_INITIAL_ADMIN_PASSWORD"  \
 'https://localhost:9200/sample-index/_search' \
 -H 'Content-Type: application/json' \
 -d '{"size":0, "aggs": { "test": { "histogram": { "field": "some_value", "interval": 100 }}}}'
  8. OpenSearch responds with something like curl: (56) OpenSSL SSL_read: error:0A000126:SSL routines::unexpected eof while reading, errno 0, because opensearch-node1 just died:
opensearch-node1         | [2024-06-26T12:12:51,906][INFO ][o.o.m.j.JvmGcMonitorService] [opensearch-node1] [gc][1318] overhead, spent [366ms] collecting in the last [1.1s]
opensearch-node1         | java.lang.OutOfMemoryError: Java heap space
opensearch-node1         | Dumping heap to data/java_pid1.hprof ...
opensearch-node1         | Unable to create data/java_pid1.hprof: File exists
opensearch-node1         | [2024-06-26T12:12:52,440][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [opensearch-node1] fatal error in thread [opensearch[opensearch-node1][search][T#24]], exiting
opensearch-node1         | java.lang.OutOfMemoryError: Java heap space
opensearch-node1         | 	at java.base/java.util.Arrays.copyOf(Arrays.java:3482) ~[?:?]
opensearch-node1         | 	at java.base/java.util.ArrayList.grow(ArrayList.java:237) ~[?:?]
opensearch-node1         | 	at java.base/java.util.ArrayList.grow(ArrayList.java:244) ~[?:?]
opensearch-node1         | 	at java.base/java.util.ArrayList.add(ArrayList.java:515) ~[?:?]
opensearch-node1         | 	at java.base/java.util.ArrayList$ListItr.add(ArrayList.java:1150) ~[?:?]
opensearch-node1         | 	at org.opensearch.search.aggregations.bucket.histogram.InternalHistogram.addEmptyBuckets(InternalHistogram.java:416) ~[opensearch-2.15.0.jar:2.15.0]
opensearch-node1         | 	at org.opensearch.search.aggregations.bucket.histogram.InternalHistogram.reduce(InternalHistogram.java:436) ~[opensearch-2.15.0.jar:2.15.0]
opensearch-node1         | 	at org.opensearch.search.aggregations.InternalAggregations.reduce(InternalAggregations.java:290) ~[opensearch-2.15.0.jar:2.15.0]
opensearch-node1         | 	at org.opensearch.search.aggregations.InternalAggregations.topLevelReduce(InternalAggregations.java:225) ~[opensearch-2.15.0.jar:2.15.0]
opensearch-node1         | 	at org.opensearch.action.search.SearchPhaseController.reduceAggs(SearchPhaseController.java:557) ~[opensearch-2.15.0.jar:2.15.0]
opensearch-node1         | 	at org.opensearch.action.search.SearchPhaseController.reducedQueryPhase(SearchPhaseController.java:528) ~[opensearch-2.15.0.jar:2.15.0]
opensearch-node1         | 	at org.opensearch.action.search.QueryPhaseResultConsumer.reduce(QueryPhaseResultConsumer.java:153) ~[opensearch-2.15.0.jar:2.15.0]
opensearch-node1         | 	at org.opensearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:136) ~[opensearch-2.15.0.jar:2.15.0]
opensearch-node1         | 	at org.opensearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:122) ~[opensearch-2.15.0.jar:2.15.0]
opensearch-node1         | 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.15.0.jar:2.15.0]
opensearch-node1         | 	at org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78) ~[opensearch-2.15.0.jar:2.15.0]
opensearch-node1         | 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.15.0.jar:2.15.0]
opensearch-node1         | 	at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59) ~[opensearch-2.15.0.jar:2.15.0]
opensearch-node1         | 	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:941) ~[opensearch-2.15.0.jar:2.15.0]
opensearch-node1         | 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.15.0.jar:2.15.0]
opensearch-node1         | 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
opensearch-node1         | 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
opensearch-node1         | 	at java.base/java.lang.Thread.runWith(Thread.java:1596) ~[?:?]
opensearch-node1         | 	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
opensearch-node1         | fatal error in thread [opensearch[opensearch-node1][search][T#24]], exiting

Expected behavior

I would have expected (or liked, if possible) to get the same too_many_buckets_exception instead of a node crash.

Additional Details

Plugins
n/a

Screenshots
n/a

Host/Environment (please complete the following information):

  • OS: Ubuntu
  • Version 22.04.4

Additional context

  • the OpenSearch version is 2.15.0; no changes were made to the docker-compose.yml

Workarounds

  • adding "min_doc_count": 1 prevents the crash (and it returns 2 buckets, key: 0 and key: 1234567800); this means clients have to reconstruct the rest of the empty buckets themselves (not always possible in my particular case, sadly); see the example request after this list
  • changing the heap from 512m to 1024m, for example, prevents the crash for "interval": 100, but it still crashes for "interval": 10
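
For reference, a sketch of the first workaround, reusing sample-index and the some_value field from the reproduction above (min_doc_count is a standard histogram parameter; the exact request shown here is just an illustration):

curl -k -XGET -u "admin:$OPENSEARCH_INITIAL_ADMIN_PASSWORD" \
 'https://localhost:9200/sample-index/_search' \
 -H 'Content-Type: application/json' \
 -d '{"size":0, "aggs": { "test": { "histogram": { "field": "some_value", "interval": 100, "min_doc_count": 1 }}}}'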
icercel added the bug and untriaged labels on Jun 26, 2024
bowenlan-amzn added the enhancement label on Jul 4, 2024
bowenlan-amzn (Member) commented:

Did some work on this today.
The aggregation only knows the correct number of buckets for the response at the final reduce phase. Before that, aggregations run on every shard, which can produce a larger number of buckets (larger than the 65535 limit) to reduce at the end.

if (docCounts.increment(bucketOrd, docCount) == docCount) {
    // We calculate the final number of buckets only during the reduce phase. But we still need to
    // trigger bucket consumer from time to time in order to give it a chance to check available memory and break
    // the execution if we are running out. To achieve that we are passing 0 as a bucket count.
    multiBucketConsumer.accept(0);
}

So before the reduce phase, the best we can do is fail gracefully before the OOM; the snippet above does that.
I will raise a PR later to add this logic to the histogram aggregation.


icercel commented Jul 22, 2024

@bowenlan-amzn, hi, terribly sorry for the delay; I wanted to thank you for looking into this and providing a fix 🙇
Can't wait to see it in AWS!
