Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set default playbook service timeout to 100 minutes #19279

Merged
merged 1 commit into from
Sep 11, 2019

Conversation

lfu
Copy link
Member

@lfu lfu commented Sep 10, 2019

Automate has its own timeout. This Ansible runner timeout does not play nice with automate timeout.

https://bugzilla.redhat.com/show_bug.cgi?id=1750370

@miq-bot assign @tinaafitz
@miq-bot add_label bug, Ivanchuk/yes, changelog/yes, blocker

cc @Fryguy @NickLaMuro

Copy link
Member

@NickLaMuro NickLaMuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to update the job_timeout for ansible-runner as well?

https://ansible-runner.readthedocs.io/en/latest/intro.html#runnersettings

Also, not sure where this error in the ticket is coming from.

@lfu
Copy link
Member Author

lfu commented Sep 10, 2019

Do we need to update the job_timeout for ansible-runner as well?

@NickLaMuro That is a good point. Is there a value so the ansible job can run forever?
Ideally job_timeout should be the same as automate timeout. But I'm not sure where to get this automate timeout from.
@tinaafitz @gmcculloug Thoughts?

Also, not sure where this error in the ticket is coming from.

From this line.

@NickLaMuro
Copy link
Member

NickLaMuro commented Sep 10, 2019

Also, not sure where this error in the ticket is coming from.

From this line.

Oh butts, I was looking around for this line:

EVM.log -> "restarting worker after [3600] job aborting ..."

Still don't know where that is from, but what you provided is enough. Thanks!

@NickLaMuro
Copy link
Member

@NickLaMuro That is a good point. Is there a value so the ansible job can run forever?

@lfu I think the quick fix is we can manually set the job_timeout to None:

https://github.com/ansible/ansible-runner/blob/f6f6a24538817a84d09feeac85a747b5b9238c0e/ansible_runner/runner.py#L193

But I would have to check if that works as expected myself. Unsure, but I will look into it for the time being.

@dmetzger57
Copy link
Contributor

@lfu @NickLaMuro so the proposal is to essentially disable the timeout in Runner and rely on Automate's timeout? Is this how things worked with Embedded Tower - we relied solely on Automate's timeout?

@Fryguy
Copy link
Member

Fryguy commented Sep 10, 2019

Since the AnsibleRunnerWorkflow can be used outside of an automate context (e.g. for provider operations), I don't think we should use the automate timeout solely. I could see the automate specific caller setting it to None in order to allow automate's timeout to supercede (however, we still have to call .stop on the thing somehow), but I don't we should set the default of AnsibleRunnerWorkflow to None.

@Fryguy
Copy link
Member

Fryguy commented Sep 10, 2019

Also, I don't understand how this PR turns off the Ansible runner timeout, since timeout parameter is still being passed to the job.

@lfu lfu force-pushed the ansible_runner_timeout_1750370 branch from c4f9573 to 57e8844 Compare September 10, 2019 15:52
@NickLaMuro
Copy link
Member

Since the AnsibleRunnerWorkflow can be used outside of an automate context (e.g. for provider operations), I don't think we should use the automate timeout solely...

Are you suggesting that automate is the only one that uses this timeout, because I am pretty sure that services will make use of it too:

https://github.com/ManageIQ/manageiq/blob/575def3be/app/models/manageiq/providers/ansible_runner_workflow.rb#L51

Everything should be using AnsiblePlaybookWorkflow (inherits from the above) to handle running jobs, so I don't see how this wouldn't be used by all of them.

Also, I don't understand how this PR turns off the Ansible runner timeout, since timeout parameter is still being passed to the job.

I don't think anyone is suggesting it is.

@NickLaMuro
Copy link
Member

however, we still have to call .stop on the thing somehow

Yeah, that I agree is a probably with using job_timeout: None, we would be probably risking that, so it would be a poor choice long term to go with that option.

@lfu
Copy link
Member Author

lfu commented Sep 10, 2019

Also, I don't understand how this PR turns off the Ansible runner timeout, since timeout parameter is still being passed to the job.

timeout parameter is passed with nil which would bypass the timeout checking.

@lfu
Copy link
Member Author

lfu commented Sep 10, 2019

however, we still have to call .stop on the thing somehow

Does .stop actually kill the ansible job? If that is the case, we might want to do it in process_abort when job got timeout from automate.

[----] W, [2019-09-10T10:36:55.287875 #1866:2b17c6bfc5bc]  WARN -- : MIQ(ManageIQ::Providers::AnsiblePlaybookWorkflow#timeout!) Job: guid: [356c7324-14ac-48f8-aa0b-174ae45028a1], job timed out after 305.327338904 seconds of inactivity.  Inactivity threshold [300 seconds], aborting
[----] I, [2019-09-10T10:36:55.297601 #1866:2b17c6bfc5bc]  INFO -- : MIQ(MiqQueue.put) Message id: [22439],  id: [], Zone: [default], Role: [smartstate], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [2], Task id: [], Command: [Job.signal_abort], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: ["job timed out after 305.327338904 seconds of inactivity.  Inactivity threshold [300 seconds]", "error"]

@NickLaMuro
Copy link
Member

@lfu I think the concern with this is that there is a default option for :timeout in the kwargs of this class:

timeout: 1.hour, poll_interval: 1.second, verbosity: 0, become_enabled: false)

So this is already set by default, unless specifically :timeout => nil is passed.

@Fryguy
Copy link
Member

Fryguy commented Sep 10, 2019

Does .stop actually kill the ansible job? If that is the case, we might want to do it in process_abort when job got timeout from automate.

Yes. So, yes, if the service chooses to avoid the internal AnsiblePlaybookRuner's timeout by passing nil, then process_abort seems like a good place to do the stop, however,

  • you have to call the stop on the Ansible::Runner::AsyncResponse somehow
  • the stop has to be called on the same appliance that the ansible runner instance, which is not the same box that the service's check_completed loop is running on. That is, process_abort will have to asynchronously call it on a remote appliance via the queue and then, I guess, wait_for_task, or assume it does it's job.

So, given that, it feels rather complicated.

@Fryguy
Copy link
Member

Fryguy commented Sep 10, 2019

What is the actual issue here? Is there a timeout race of some sort? If so, what happens when the service "wins". What is the downside to just using the ansible-runner built-in timeout set to the same value as the service timeout?

@NickLaMuro
Copy link
Member

What is the actual issue here? Is there a timeout race of some sort?

I think it is simply that something in our system (either the "Automate timeout" or the "runner timeout") are causing the playbook that should take an hour or more to not complete because it reaches the timeout. I don't think the timeout conflict is happening yet, but I think we are concern about this solution not being enough if the underlying timeout in ansible-runner/ansible is then triggered.

What is the downside to just using the ansible-runner built-in timeout set to the same value as the service timeout?

I am personally fine with this, though it was one more "special snowflake" instance instead of having ansible-runner conform to our standard. Don't have strong feelings one way or the other, but that is the counter argument I can think of.

@lfu lfu force-pushed the ansible_runner_timeout_1750370 branch from 57e8844 to 64f841d Compare September 10, 2019 16:42
@lfu
Copy link
Member Author

lfu commented Sep 10, 2019

Automate has its own timeout which is set by max retries and service timeout is set by execution_ttl. Currently service timeout is passed to the job if it is set.

In the case reported in the BZ, automate timeout is 100 minutes, service timeout is not set and playbook takes around 65 minutes to complete. So with old version, the service was able to finish while with new version, service failed due to the 60 minutes timeout from ansible runner.

@@ -16,6 +16,12 @@ def launch_runner
Ansible::Runner.run_async(env_vars, extra_vars, playbook_path, kwargs)
end

def process_abort(*args)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh nice! I didn't realize process_abort was part of the ansible_playbook_workflow...I thought it was something on the service side.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lfu Can you move this into the base class? Role will have to do the exact same thing, so this should be in the base runner class.

@Fryguy
Copy link
Member

Fryguy commented Sep 10, 2019

service timeout is not set
...
service failed due to the 60 minutes timeout from ansible runner.

So if the service timeout (execution_ttl) is not set, then it's inherting the default ansible_playbook_workflow timeout. It sounds to me like the caller should pass the automation timeout if the service timeout is not set, so as to avoid inherting that default.

That is, somehow tweak this [1] to pass the automation timeout in addition, or change the method to use execution_ttl || automation_timeout

[1] https://github.com/ManageIQ/manageiq/blob/master/app/models/service_ansible_playbook.rb#L112-L126

@Fryguy
Copy link
Member

Fryguy commented Sep 10, 2019

@NickLaMuro Thoughts on my proposal vs what's in this PR? The PR at the moment I think looks good, and I'd be ok with merging, but I'm not sure I like the general idea of AnsiblePlaybookWorkflow not "owning" it's own timeout. Feels like we could get in trouble if it's allowed to run forever and expecting something else to clean it up. I'm not sure....thoughts?

@lfu
Copy link
Member Author

lfu commented Sep 10, 2019

That is, somehow tweak this [1] to pass the automation timeout in addition, or change the method to use execution_ttl || automation_timeout

This would be the perfect solution! But I doubt it is possible to get the automate timeout from a service or a configuration_script. That is why I have to set it to nil to bypass the timeout checking and default to automate timeout.

@lfu lfu force-pushed the ansible_runner_timeout_1750370 branch from 64f841d to fb65d9a Compare September 10, 2019 17:52
@NickLaMuro
Copy link
Member

Thoughts on my proposal vs what's in this PR? The PR at the moment I think looks good, and I'd be ok with merging, but I'm not sure I like the general idea of AnsiblePlaybookWorkflow not "owning" it's own timeout. Feels like we could get in trouble if it's allowed to run forever and expecting something else to clean it up. I'm not sure....thoughts?

@Fryguy I think it is fine, I just don't know if the expectation from QA is that it is able to run longer than an hour:

https://github.com/sbulage/ansible_playbooks/blob/096f0ab9b4b7eb22bb136e8ce350c58ee73ce3aa/timeout_60mins_playbook.yaml#L2

And I don't know if awx/Tower is allowing it to run for longer. Unsure, but this might get pushed back after this is merged if that is the case.

@dmetzger57
Copy link
Contributor

And I don't know if awx/Tower is allowing it to run for longer. Unsure, but this might get pushed back after this is merged if that is the case.

@NickLaMuro According to QE, the AWX / Tower solution in 5.10 supported greater than an hour runs based on https://bugzilla.redhat.com/show_bug.cgi?id=1750370#c3 if I read that correctly

@lfu lfu force-pushed the ansible_runner_timeout_1750370 branch from fb65d9a to 29188fe Compare September 10, 2019 20:12
@@ -55,7 +55,7 @@ def execute
def poll_runner
response = Ansible::Runner::ResponseAsync.load(context[:ansible_runner_response])
if response.running?
if started_on + options[:timeout] < Time.now.utc
if options[:timeout].present? && (started_on + options[:timeout]) < Time.now.utc
response.stop

queue_signal(:abort, "ansible #{execution_type} has been running longer than timeout", "error")
Copy link
Member

@Fryguy Fryguy Sep 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lfu in the non-automate case where timeout is passed, then this will try to abort after doing response.stop and then I think it will hit the process_abort method and try to response.stop again. If so, I think you need to remove the response.stop 2 lines up and just let it flwo through your new abort method. Can you verify?

EDIT: Also, I'm not sure this path even works...does queue_signal(:abort) even work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure queue_signal(:abort) works!

@lfu lfu force-pushed the ansible_runner_timeout_1750370 branch 2 times, most recently from 692d813 to e475d8e Compare September 10, 2019 21:58
@lfu lfu changed the title Need to turn off Ansible runner timeout for automate. Set default playbook service timeout to 100 minutes Sep 10, 2019
@lfu lfu force-pushed the ansible_runner_timeout_1750370 branch from e475d8e to 861cc99 Compare September 11, 2019 13:35
@miq-bot
Copy link
Member

miq-bot commented Sep 11, 2019

Checked commit lfu@861cc99 with ruby 2.4.6, rubocop 0.69.0, haml-lint 0.20.0, and yamllint 1.10.0
3 files checked, 0 offenses detected
Everything looks fine. 🍪

@Fryguy Fryguy merged commit e1e730f into ManageIQ:master Sep 11, 2019
@Fryguy Fryguy added this to the Sprint 120 Ending Sep 16, 2019 milestone Sep 11, 2019
simaishi pushed a commit that referenced this pull request Sep 11, 2019
Set default playbook service timeout to 100 minutes

(cherry picked from commit e1e730f)

https://bugzilla.redhat.com/show_bug.cgi?id=1750370
@simaishi
Copy link
Contributor

Ivanchuk backport details:

$ git log -1
commit 55bc9be7150764c2fa69037d781308bedc7c8327
Author: Jason Frey <jfrey@redhat.com>
Date:   Wed Sep 11 12:25:56 2019 -0400

    Merge pull request #19279 from lfu/ansible_runner_timeout_1750370
    
    Set default playbook service timeout to 100 minutes
    
    (cherry picked from commit e1e730fb136f2702a02b04f815eb93f94934a426)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1750370

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants