
[Bug]: Intermittent startup issues with Opensearch 1.3.3 Docker container #2568

Closed
threesquared opened this issue Sep 7, 2022 · 24 comments

Labels: bug (Something isn't working)

@threesquared

Describe the bug

We are seeing intermittent but fairly regular issues when using the OpenSearch Docker image in CI. All we see in the docker-compose output for the container is:

opensearch_1 | Killing opensearch process 10
opensearch_1 | Killing performance analyzer process 11

This is even after enabling trace level logging with rootLogger.level = trace in the log4j2 config. The full docker-compose config is:

elasticsearch:
    image: public.ecr.aws/peakon/opensearch:1.3.3
    environment:
      - discovery.type=single-node
      - http.cors.enabled=true
      - http.cors.allow-origin=*
      - bootstrap.memory_lock=true
      - 'OPENSEARCH_JAVA_OPTS=-Xms2G -Xmx2G'
    restart: always
    shm_size: 2gb
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - ./docker/log4j2.properties:/usr/share/opensearch/config/log4j2.properties

We are also extending the OpenSearch image to remove the security plugin and add the analysis-icu plugin. Our image can be found on public ECR as public.ecr.aws/peakon/opensearch:1.3.3:

FROM opensearchproject/opensearch:1.3.3

RUN /usr/share/opensearch/bin/opensearch-plugin install --batch --verbose analysis-icu \
    && /usr/share/opensearch/bin/opensearch-plugin remove --verbose opensearch-security

The container continues to run, and after exec-ing into it, the logs directory contains some output from the Performance Analyzer plugin with errors like:

ERROR org.opensearch.performanceanalyzer.rca.RcaController - Error reading file /usr/share/opensearch/data/rca_enabled.conf

However, simply running opensearch within the container works, while using the entrypoint script results in the same error. It seems to be related to the terminateProcesses function.

We have found that removing CHLD from the line trap terminateProcesses TERM INT EXIT CHLD in the script allows OpenSearch to start up.
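Roughly, the change we tested looks like this (only the trap line quoted above is from the actual opensearch-docker-entrypoint.sh; the rest of the script is left unchanged):

# Original trap line in opensearch-docker-entrypoint.sh:
#   trap terminateProcesses TERM INT EXIT CHLD
# Workaround we tested: stop trapping CHLD, so a child state change does not
# tear down both the opensearch and performance analyzer processes.
trap terminateProcesses TERM INT EXIT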

To reproduce

This issue is hard for us to reproduce locally but happens fairly regularly in our CI infrastructure. If there is anything else we can do to debug this or get more information, that would be extremely handy.

Expected behavior

The Docker image should either start OpenSearch or exit with an error in the logs.

Screenshots

No response

Host / Environment

Docker version 20.10.17, build 100c701

Additional context

No response

Relevant log output

No response

@threesquared added the bug (Something isn't working) and untriaged (Issues that have not yet been triaged) labels Sep 7, 2022
@dblock transferred this issue from opensearch-project/opensearch-build Sep 7, 2022
@dblock
Member

dblock commented Sep 7, 2022

Moving to opensearch-devops, but if you think it's a PA problem, open a related issue in https://github.com/opensearch-project/performance-analyzer/

@peterzhuamazon transferred this issue from opensearch-project/opensearch-devops Sep 7, 2022
@peterzhuamazon
Member

Hi @threesquared,

The design is to ensure that if either the PA or OS process gets killed, both processes are killed, so that the two always run together.

This was a design decision made back in the day to remove the dependency on supervisord, as it is not favored by Docker Hub official images.
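Roughly, the pattern looks like this (a hypothetical, simplified sketch, not the actual opensearch-docker-entrypoint.sh; the process paths and function body are placeholders, only the trap signals and the "Killing ... process" messages are taken from this thread):

#!/bin/bash
# Start both processes in the background and remember their PIDs.
./opensearch &
OPENSEARCH_PID=$!
./performance-analyzer-agent-cli &
PA_PID=$!

terminateProcesses() {
  echo "Killing opensearch process $OPENSEARCH_PID"
  kill -TERM "$OPENSEARCH_PID" 2>/dev/null
  echo "Killing performance analyzer process $PA_PID"
  kill -TERM "$PA_PID" 2>/dev/null
}

# If either child exits (CHLD) or the container is stopped, tear both down.
trap terminateProcesses TERM INT EXIT CHLD

# Block until the children exit.
wait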

The thing that seems suspicious is the PA error log about accessing:

ERROR org.opensearch.performanceanalyzer.rca.RcaController - Error reading file /usr/share/opensearch/data/rca_enabled.conf

I don't recall whether PA would stop if the conf is not accessible, but even then there should not be any permission problem with that file; let me take a look at it.

Have you guys tried using the official docker-compose.yml on opensearch.org?
https://opensearch.org/samples/docker-compose.yml

We have tested with the above compose file and haven't seen issues yet, at least not when starting the cluster.
Do you guys see the issues right after startup, or hours later?

Thanks.

@dblock
Member

dblock commented Sep 8, 2022

@peterzhuamazon Did you mean to move it back to opensearch-build? Curious why.

@peterzhuamazon removed the untriaged (Issues that have not yet been triaged) label Sep 15, 2022
@prudhvigodithi
Collaborator

Hey @threesquared, just checking back: have you tried the docker-compose file suggested by @peterzhuamazon, and do you still notice the issue?

@threesquared
Author

So it's really hard to pin down as it's very intermittent. The compose file we use in CI is single-node only, but otherwise it is the same as the official compose file.

We have hacked around this with a script that catches the exit code from dockerize indicating the port never came up, and then uses docker-compose restart to try again.

I will try to catch the issue again. There are no logs in the container other than the Performance Analyzer ones I mentioned above, but I will attach the full contents next time.

@nhtruong

nhtruong commented Oct 4, 2022

We also encounter this issue in the OpenSearch JavaScript repo. One of our workflows runs a matrix of 24 configurations, and often one or two of them hit this issue.

Those "Killing ... process" messages came from opensearch-docker-entrypoint.sh. Most likely the script encountered an error, triggered the EXIT trap, and killed these two processes.

Grepping the container's logs for those messages and restarting the container is a workaround we're considering.
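Something along these lines (the opensearch service name here is an assumption; adjust it to whatever the compose file actually calls the service):

# Restart the service if the entrypoint reports that it killed the OpenSearch process.
if docker-compose logs opensearch | grep -q "Killing opensearch process"; then
  docker-compose restart opensearch
fi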

@dblock
Member

dblock commented Oct 4, 2022

The workaround in opensearch-project/opensearch-js#304 is not pretty, but it seems to work.

@nhtruong

nhtruong commented Oct 6, 2022

Added an issue on the meta repo with my findings and a couple of workarounds: opensearch-project/project-meta#62

@dblock
Member

dblock commented Oct 6, 2022

Those "Killing ... process" messages came from opensearch-docker-entrypoint.sh. Most likely the script encountered an error, triggered the EXIT trap, and killed these two processes.

If you look at that code, it receives a signal to terminate, then terminates the processes. Look for the cause of that signal.

@dblock
Member

dblock commented Oct 6, 2022

@threesquared do you have a complete set of logs from a failed-to-start container, by any chance?

@threesquared
Author

@dblock we just see the same as opensearch-project/project-meta#62; all that comes out of the container logs is:

opensearch_1 | Killing opensearch process 10
opensearch_1 | Killing performance analyzer process 11

@dblock
Member

dblock commented Oct 6, 2022

@threesquared But you also have a log inside opensearch/logs inside the container. That's what I'm after.

@threesquared
Author

Ah ok, sorry, I wish I had saved the whole output when I last caught it. I will try to reproduce and upload a full log.

@threesquared
Author

So I managed to catch this again with a script like this, using dockerize in our CI platform:

RETRIES=100

for ((i=0; i<RETRIES; i++)); do
  # Wait up to 30s for the node to answer on port 9200
  ./docker/scripts/dockerize.sh \
    -wait http://elasticsearch:9200 \
    -timeout 30s

  if [[ $? -eq 1 ]]
  then
    # dockerize timed out: keep the broken container around so it can be inspected
    sleep 1h
  else
    # The node came up: recreate the container and try to trigger the failure again
    docker-compose up -d --force-recreate --no-deps elasticsearch
  fi
done

This gist contains the last 200 lines of logs/performance-analyzer.log.

Let me know if you need any more info from the container. There are no other log files in that directory and, as usual, only those two lines in the docker logs.

@nhtruong

nhtruong commented Oct 7, 2022

I've collected some logs via a GitHub workflow. The OpenSearch container can fail in two ways:

I also collected OpenSearch logs inside the container along with Kernel and System logs of the workflow runner.
For comparison, here's what a successful run looks like.

@dblock
Member

dblock commented Oct 7, 2022

@nhtruong I compared successful and failed runs, and it's pretty clear that in both OpenSearch actually starts ([2022-10-06T22:08:58,320][INFO ][o.o.s.c.ConfigurationRepository] [5fb0eef93fa0] Node '5fb0eef93fa0' initialized). So I don't think it's to blame for the shutdown; something else is terminating it, and maybe it's just cleanup?

There's a 20-second timeout in https://github.com/opensearch-project/opensearch-js/blob/main/Makefile#L6. Can we replace that with an active wait that polls for the OpenSearch cluster start, with a much longer timeout?

@threesquared same in your case: can you check whether you can actually hit that node in the active wait loop, and increase the timeout to minutes, so we can see if the node ever starts responding?
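Something like this, assuming the node is reachable at http://localhost:9200 with security disabled (otherwise add credentials or -k):

# Poll the cluster every 5 seconds for up to 5 minutes instead of sleeping a fixed 20s.
for i in $(seq 1 60); do
  if curl -s -o /dev/null http://localhost:9200; then
    echo "OpenSearch responded after roughly $((i * 5)) seconds"
    break
  fi
  sleep 5
done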

@nhtruong

nhtruong commented Oct 7, 2022

@dblock There are two ways the runs could fail; one of them produces almost no logs, just:

Killing opensearch process 10
Killing performance analyzer process 11

Where did you see [2022-10-06T22:08:58,320][INFO ][o.o.s.c.ConfigurationRepository] [5fb0eef93fa0] Node '5fb0eef93fa0' in that run?

In the Makefile I used to recreate these failed runs, I actually used a 45-second sleep. In both failed runs, whether the container was stopped or not, the docker logs show the "Kill process" messages. Are you saying that this service will restart automatically after some time if OpenSearch is not running?

@dblock
Member

dblock commented Oct 7, 2022

I was comparing the other two runs that had more logs. This one puzzles me.

My running theory right now is that we start the container, do something that fails (run a test), don't properly report it or fail to see it, then visibly shut down the container (e.g. the job is killed in GHA), which causes the container to kill the process because it has received a termination signal itself.

Make that sleep 2 minutes and log when it's done waiting. Do you see a kill signal before?

@nhtruong

nhtruong commented Oct 7, 2022

The workflow I ran did not do any tests. It marked a run as failed if the container was unhealthy, and collected a bunch of logs. So the only interactions OpenSearch had were from the curl command that ran every 5 seconds as a health check. The first health check happened 20 seconds after the container had booted up.

I don't think anything from outside the container sent the kill signal to OpenSearch; OOMKilled is false when you inspect the container after a crash. The trap was likely triggered by the EXIT or CHLD signal.

@dblock
Member

dblock commented Oct 11, 2022

See opensearch-project/opensearch-js#304 for a retry implementation. I propose we leave this open until we can get to the bottom of the issue; maybe someone will be more successful?

@prudhvigodithi
Collaborator

Hey, the resolved issue #1185 should fix these intermittent startup issues. Leaving the memory setting/adjustment aside, we can now disable the Performance Analyzer (https://github.com/opensearch-project/opensearch-build/tree/main/docker/release#disable-performance-analyzer-agent-cli-and-related-configurations); see the sketch below. This will be out in 2.4.0; should we backport it to 1.3.x? @gaiksaya @dblock @bbarani @peterzhuamazon WDYT?
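For example, roughly (the DISABLE_PERFORMANCE_ANALYZER_AGENT_CLI variable comes from the linked README; the image tag and other flags here are just placeholders):

# Disable the Performance Analyzer agent so the entrypoint only has to manage
# the opensearch process (variable documented in the opensearch-build docker release README).
docker run -d \
  -e DISABLE_PERFORMANCE_ANALYZER_AGENT_CLI=true \
  -e discovery.type=single-node \
  -p 9200:9200 \
  opensearchproject/opensearch:1.3.7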
Thank you

@peterzhuamazon
Member

Hi @prudhvigodithi, this change is already in 1.3.x, as we released it in 1.3.7.

Thanks.

@prudhvigodithi
Collaborator

Thanks @peterzhuamazon. Since 1.3.7 is already out, we should close this issue. @threesquared, can you please test with 1.3.7 and add your thoughts here?
Thank you

@bbarani
Member

bbarani commented Jan 30, 2023

Closing this issue since we have released version 1.3.7 and have not noticed this behavior since. Feel free to re-open if required.

@bbarani closed this as completed Jan 30, 2023