Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hazelcast may take much longer time to restart - need to wait longer on snapshot restore #10605

Closed
rymsha opened this issue Jun 22, 2024 · 0 comments · Fixed by #10616
Closed
Assignees
Labels
Milestone

Comments

@rymsha
Copy link
Contributor

rymsha commented Jun 22, 2024

Here are logs where after snapshot restore Hazelcast was still alive for ~31 seconds while Felix Framework gave up waiting after 10 seconds - and tied to boot up XP. XP could not start again - because Hazelcast port (5701) was still occupied.

024-05-30T17:58:39,181 INFO  c.e.x.l.i.framework.FrameworkService - Resetting server...
2024-05-30T17:58:39,181 INFO  c.e.x.l.i.framework.FrameworkService - Server has been reset..

2024-05-30T18:05:52,941 INFO  c.e.x.l.i.framework.FrameworkService - Restarting server...
2024-05-30T18:05:52,941 INFO  c.e.x.l.i.framework.FrameworkService - Stopping server...

2024-05-30T18:06:02,946 INFO  c.e.x.l.i.framework.FrameworkService - Starting Enonic XP...
2024-05-30T18:06:08,337 ERROR com.hazelcast.instance.AddressPicker - [LOCAL] [3.12.13#OSS] [3.12.13] Cannot bind to a given address: /10.186.32.60. Hazelcast cannot start. Port [5701] is already in use and auto-increment is disabled.
com.hazelcast.core.HazelcastException: Cannot bind to a given address: /10.186.32.60. Hazelcast cannot start. Port [5701] is already in use and auto-increment is disabled.

2024-05-30T18:06:02,946 INFO  c.e.x.l.i.framework.FrameworkService - Server has been stopped
2024-05-30T18:06:02,946 INFO  c.e.x.l.i.framework.FrameworkService - Starting Enonic XP...

2024-05-30T18:06:09,795 INFO  c.h.i.partition.impl.MigrationThread - [10.186.32.60]:5701 [3.12.13#OSS] [3.12.13] All migration tasks have been completed. (lastRepartitionTime=Thu May 30 18:06:05 CEST 2024, completedMigrations=1686, totalCompletedMigrations=2328, elapsedMigrationTime=3588ms, totalElapsedMigrationTime=3588ms)

2024-05-30T18:06:23,780 INFO  com.hazelcast.instance.Node - [10.186.32.60]:5701 [3.12.13#OSS] [3.12.13] Shutting down node engine...
2024-05-30T18:06:23,956 INFO  org.elasticsearch.node - [b31apvl00112] stopped
2024-05-30T18:06:23,956 INFO  org.elasticsearch.node - [b31apvl00112] closing ...
2024-05-30T18:06:23,970 INFO  org.elasticsearch.node - [b31apvl00112] closed

Solution: increase XP wait (FrameworkService.WAIT_FOR_STOP_TIMEOUT_MS) to 600 seconds for restart. And don't attempt to start the server, it shutdown timed out - it might end up in a bad condition (because other Java threads not belonging to OSGi might still be alive) and requires manual restart.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants