
Properly monitor the embedded ansible service #13978

Merged 2 commits into ManageIQ:master on Feb 21, 2017

Conversation

carbonin
Member

This PR implements a better worker lifecycle (start, do_work, kill) for the EmbeddedAnsibleWorker.

With this change, the worker heartbeats only when the locally running ansible service is alive, and starts the service if it isn't alive and none of its processes are running (EmbeddedAnsible.running?).

We are aware that this opens the potential for a race condition when the worker is killed and the server starts a replacement before the original worker is completely down.

The combination of implementing EmbeddedAnsibleWorker#kill as #stop and starting the service every time we can't heartbeat should allow the new worker to recover even if the stop runs after the new worker runs EmbeddedAnsible.start.

If this does become a problem, we will need to implement some new mechanisms for dealing with such "singleton" workers.
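The lifecycle described above can be sketched roughly as follows. This is a hypothetical, simplified stand-in, not the real ManageIQ code: the stub `EmbeddedAnsible` module and the `heartbeats` counter are illustrative only.

```ruby
# Illustrative stub standing in for the real EmbeddedAnsible class.
module EmbeddedAnsible
  class << self
    attr_accessor :alive, :running

    # True when the locally running ansible service responds.
    def alive?
      @alive
    end

    # True while any of the service's processes are still up.
    def running?
      @running
    end

    def start
      @alive = @running = true
    end
  end
end

class EmbeddedAnsibleWorker
  attr_reader :heartbeats

  def initialize
    @heartbeats = 0
  end

  # Heartbeat only when the service is alive; (re)start it only when
  # none of its processes are running, per the PR description.
  def do_work
    if EmbeddedAnsible.alive?
      @heartbeats += 1
    elsif !EmbeddedAnsible.running?
      EmbeddedAnsible.start
    end
  end
end
```

With the service fully down, one `do_work` cycle starts it and the next cycle heartbeats, which is what lets a replacement worker recover on its own.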

Before this change, if someone called start before calling configure, they would end up in a bad state because the secret key would not have been saved.

Tower would then fall back to the key on the filesystem that was generated at RPM install time, which means every build would have access to that key.

This change makes sure that start also does the work of configure if configure has not been called yet.
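The configure-before-start guard described in that commit can be sketched as below. The class and method names (`AnsibleService`, `configured?`) are illustrative assumptions, not the real ManageIQ API.

```ruby
require "securerandom"

# Hypothetical sketch of making start do configure's work when needed.
class AnsibleService
  def initialize
    @secret_key = nil
  end

  def configured?
    !@secret_key.nil?
  end

  # Generate and persist a fresh secret key rather than inheriting the
  # one laid down on the filesystem at RPM install time, which every
  # build would otherwise share.
  def configure
    @secret_key = SecureRandom.hex(16)
  end

  # start now guarantees configuration has happened first.
  def start
    configure unless configured?
    @secret_key
  end
end
```

Calling start on an unconfigured service configures it first; subsequent starts reuse the saved key instead of regenerating it.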
if EmbeddedAnsible.running?
  _log.info("#{log_prefix} supervisord is ok!")
if EmbeddedAnsible.alive?
  heartbeat
Member

Doesn't this introduce a delay? That is, you have to wait for the worker heartbeat detection to stop the worker.

Member

@Fryguy yeah, we'll keep trying to fix the situation by reconfiguring and starting the service via the setup script as long as the worker can't heartbeat. It was the simplest thing to do until we figure out the ways the service can fail.
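The recovery sequence this relies on can be sketched as below: even when the old worker's stop lands after the new worker's start (the race noted in the description), the next monitoring cycle notices the dead service and starts it again. `ServiceState` and the lambdas are illustrative stand-ins, not ManageIQ code.

```ruby
# Minimal stand-in for the embedded ansible service's up/down state.
class ServiceState
  attr_reader :up

  def initialize
    @up = false
  end

  def start
    @up = true
  end

  def stop
    @up = false
  end
end

service = ServiceState.new

new_worker_start = -> { service.start }  # replacement worker comes up
old_worker_kill  = -> { service.stop }   # kill implemented as stop, runs late
monitor_cycle    = -> { service.start unless service.up } # next do_work can't heartbeat

new_worker_start.call
old_worker_kill.call  # race: the late stop leaves the service down
monitor_cycle.call    # recovery on the following cycle
```

The key point is that the monitor cycle is idempotent: starting an already-running service is harmless, and starting a stopped one repairs the race.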

@miq-bot
Member

miq-bot commented Feb 17, 2017

Checked commits carbonin/manageiq@3d7229f~...678c4e7 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
4 files checked, 0 offenses detected
Everything looks good. 🍰

@Fryguy Fryguy merged commit 1c4217d into ManageIQ:master Feb 21, 2017
@Fryguy Fryguy added this to the Sprint 55 Ending Feb 27, 2017 milestone Feb 21, 2017
@gtanzillo gtanzillo mentioned this pull request Mar 1, 2017
@carbonin carbonin deleted the properly_monitor_the_ansible_service branch March 7, 2017 16:31