
Properly monitor the embedded ansible service #13978

Merged 2 commits into ManageIQ:master on Feb 21, 2017

Conversation

carbonin
Member

This PR implements a better worker lifecycle (start, do_work, kill) for the EmbeddedAnsibleWorker.

With this change, the worker heartbeats only when the locally running ansible service is alive, and starts the service if it isn't alive and none of its processes are running (EmbeddedAnsible.running?).

We are aware that this opens the potential for a race condition when the worker is killed and the server starts a replacement before the original worker is completely down.

The combination of implementing EmbeddedAnsibleWorker#kill as #stop and starting the service every time we can't heartbeat should allow the new worker to recover even if the stop runs after the new worker runs EmbeddedAnsible.start.

If this does become a problem, we will need to implement some new mechanisms for dealing with such "singleton" workers.
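The lifecycle described above can be sketched roughly as follows. This is a hypothetical, simplified stand-in, not the real ManageIQ code: the stub `EmbeddedAnsible` module and the `heartbeats` counter are illustrative only.

```ruby
# Illustrative stub standing in for the real EmbeddedAnsible class.
module EmbeddedAnsible
  class << self
    attr_accessor :alive, :running

    # True when the locally running ansible service responds.
    def alive?
      @alive
    end

    # True while any of the service's processes are still up.
    def running?
      @running
    end

    def start
      @alive = @running = true
    end
  end
end

class EmbeddedAnsibleWorker
  attr_reader :heartbeats

  def initialize
    @heartbeats = 0
  end

  # Heartbeat only when the service is alive; (re)start it only when
  # none of its processes are running, per the PR description.
  def do_work
    if EmbeddedAnsible.alive?
      @heartbeats += 1
    elsif !EmbeddedAnsible.running?
      EmbeddedAnsible.start
    end
  end
end
```

With the service fully down, one `do_work` cycle starts it and the next cycle heartbeats, which is what lets a replacement worker recover on its own.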

Before this change, if someone called start before calling configure, they would end up in a bad state because the secret key would not have been saved.

Tower would then fall back to the key on the filesystem that was generated at RPM install time, which means every build would have access to that key.

This change makes sure that start also does the work of configure if configure has not been called yet.
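The configure-before-start guard described in that commit can be sketched as below. The class and method names (`AnsibleService`, `configured?`) are illustrative assumptions, not the real ManageIQ API.

```ruby
require "securerandom"

# Hypothetical sketch of making start do configure's work when needed.
class AnsibleService
  def initialize
    @secret_key = nil
  end

  def configured?
    !@secret_key.nil?
  end

  # Generate and persist a fresh secret key rather than inheriting the
  # one laid down on the filesystem at RPM install time, which every
  # build would otherwise share.
  def configure
    @secret_key = SecureRandom.hex(16)
  end

  # start now guarantees configuration has happened first.
  def start
    configure unless configured?
    @secret_key
  end
end
```

Calling start on an unconfigured service configures it first; subsequent starts reuse the saved key instead of regenerating it.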
if EmbeddedAnsible.running?
  _log.info("#{log_prefix} supervisord is ok!")
if EmbeddedAnsible.alive?
  heartbeat
Member

Doesn't this introduce a delay? That is, you have to wait for the worker heartbeat detection to stop the worker.

Member

@Fryguy yeah, we'll keep trying to fix the situation by reconfiguring and starting the service via the setup script as long as the worker can't heartbeat. It was the simplest thing to do until we figure out the ways the service can fail.
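The recovery sequence this relies on can be sketched as below: even when the old worker's stop lands after the new worker's start (the race noted in the description), the next monitoring cycle notices the dead service and starts it again. `ServiceState` and the lambdas are illustrative stand-ins, not ManageIQ code.

```ruby
# Minimal stand-in for the embedded ansible service's up/down state.
class ServiceState
  attr_reader :up

  def initialize
    @up = false
  end

  def start
    @up = true
  end

  def stop
    @up = false
  end
end

service = ServiceState.new

new_worker_start = -> { service.start }  # replacement worker comes up
old_worker_kill  = -> { service.stop }   # kill implemented as stop, runs late
monitor_cycle    = -> { service.start unless service.up } # next do_work can't heartbeat

new_worker_start.call
old_worker_kill.call  # race: the late stop leaves the service down
monitor_cycle.call    # recovery on the following cycle
```

The key point is that the monitor cycle is idempotent: starting an already-running service is harmless, and starting a stopped one repairs the race.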

@miq-bot
Member

miq-bot commented Feb 17, 2017

Checked commits carbonin/manageiq@3d7229f~...678c4e7 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
4 files checked, 0 offenses detected
Everything looks good. 🍰

@Fryguy Fryguy merged commit 1c4217d into ManageIQ:master Feb 21, 2017
@Fryguy Fryguy added this to the Sprint 55 Ending Feb 27, 2017 milestone Feb 21, 2017
@gtanzillo gtanzillo mentioned this pull request Mar 1, 2017
@carbonin carbonin deleted the properly_monitor_the_ansible_service branch March 7, 2017 16:31