
[14.0][IMP] queue_job: add cron to purge dead jobs. #653

Open · wants to merge 3 commits into base: 14.0

Conversation

Contributor

@hparfr hparfr commented May 23, 2024

I'm trying to improve the remediation of stuck started jobs.

The Jobs Garbage Collector cron, with its second parameter, lets you requeue started jobs. But if the root cause, such as a CPU limit error, is still present after the requeue, the issue keeps reappearing.

With this new cron, the job is marked as failed and not requeued.
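For reference, the selection logic being proposed can be sketched framework-free (this is an illustrative stand-in, not the module's actual ORM query; `jobs` here is a plain list of dicts rather than `queue.job` records):

```python
from datetime import datetime, timedelta

def find_dead_jobs(jobs, started_delta, now=None):
    """Select jobs stuck in the 'started' state for more than started_delta minutes.

    Jobs past the deadline are presumed dead (killed worker, SIGTERM, power
    loss, ...) and are candidates to be marked as failed, not requeued.
    """
    now = now or datetime.now()
    deadline = now - timedelta(minutes=started_delta)
    return [
        job for job in jobs
        if job["state"] == "started" and job["date_started"] < deadline
    ]
```

A job that started five hours ago trips a 240-minute threshold, while a job started 30 minutes ago does not.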

@OCA-git-bot
Contributor

Hi @guewen,
some modules you are maintaining are being modified, check this out!

@@ -10,6 +10,15 @@
<field name="state">code</field>
<field name="code">model.requeue_stuck_jobs()</field>
</record>
<record id="ir_cron_queue_job_fail_dead_jobs()" model="ir.cron">
<field name="name">Jobs Garbage Collector</field>
Contributor

This has the same name as the job above.

"""
now = fields.datetime.now()
started_dl = now - timedelta(minutes=started_delta)
if started_delta <= 10:
Contributor

Instead of hardcoded 10 minutes here, perhaps use config['limit_time_real'] seconds?

Contributor Author

You may have a short limit_time_real for the regular HTTP workers and a long one for the queue workers on a separate instance.

The idea is to use this cron as a last line of defence against a poorly configured instance.

Contributor

I still think this would be better served with something like

if started_delta <= int(self.env['ir.config_parameter'].sudo().get_param('queue_job.limit_time_dead', 10)):

Contributor Author

I added an argument to bypass the 10-minute limit without having to change the code.
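The guard under discussion can be sketched in plain Python (argument names follow the PR; the 10-minute floor is the hardcoded value being debated, and this helper name is illustrative only):

```python
def check_started_delta(started_delta, force_low_delta=False, floor_minutes=10):
    """Refuse a suspiciously low threshold unless explicitly overridden.

    A too-low started_delta would mark still-running jobs as failed, so
    values at or below floor_minutes require force_low_delta=True.
    """
    if started_delta <= floor_minutes and not force_low_delta:
        raise ValueError(
            "started_delta <= %d min; pass force_low_delta=True to override"
            % floor_minutes
        )
    return started_delta
```

With this shape, `model.fail_dead_jobs(240)` passes silently, while `model.fail_dead_jobs(5)` raises unless `force_low_delta=True` is given.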

queue_job/data/queue_data.xml (outdated, resolved)
@@ -418,6 +418,61 @@ def requeue_stuck_jobs(self, enqueued_delta=5, started_delta=0):
).requeue()
return True

def fail_dead_jobs(self, started_delta, force_low_delta=False):
Contributor

Not a required change, but I think it would be reasonable to force_low_delta=True here to turn your safety check on by default. I'm going to override

            <field name="code">model.fail_dead_jobs(240)</field>

anyway to significantly decrease the started_delta for my own purposes, so it will be no inconvenience to me to also disable that check.

Contributor Author

It's the opposite; you will have to change the cron to something like:

model.fail_dead_jobs(5, force_low_delta=True)

BTW, why are you planning to put a low value here? What's your use case?

Contributor

BTW, why are you planning to put a low value here? What's your use case?

I'm impatient. 😉 I don't use a dedicated queue_job server, so all my jobs run within the default 60/120 cpu/real time limits. Ten minutes is an eternity. 😉 I also plan to decrease the Dead Job cron interval.

Contributor

@florian-dacosta florian-dacosta left a comment

LGTM
I believe in a new version (18?) it would be nice to merge both crons.

The cron would then requeue the enqueued jobs and set to failed the jobs started for too long (and make this behaviour the default).
But it is better not to do it in an already released version.

<field name="numbercall">-1</field>
<field name="model_id" ref="model_queue_job" />
<field name="state">code</field>
<field name="code">model.fail_dead_jobs(240)</field>
Contributor

I think it would be nice to add advice somewhere on how to choose the value, in case someone wants to adapt it to their config.
In the README or in the method description? The ideal value would be the cpu_time limit from the queue job server config, right?

Contributor

I agree. I think a comment here in the code would be nice too 😉

Comment on lines +428 to +436
This function, mark jobs started longtime ago
as failed.

Cause of death can be CPU Time limit reached
a SIGTERM, a power shortage, we can't know, etc.

This mechanism should be very exceptionnal.
It may help, for instance, if someone forget to configure
properly his system.
Contributor

Suggested change
This function, mark jobs started longtime ago
as failed.
Cause of death can be CPU Time limit reached
a SIGTERM, a power shortage, we can't know, etc.
This mechanism should be very exceptionnal.
It may help, for instance, if someone forget to configure
properly his system.
This function marks jobs started longtime ago
as failed.
Cause of death can be CPU Time limit reached
a SIGTERM, a power shortage, etc. We can't know.
This mechanism should be very exceptional.
It may help, for instance, if someone forgot to configure
properly his system.

= you know what you do
"""
now = fields.datetime.now()
started_dl = now - timedelta(minutes=started_delta)
Contributor

Suggested change
started_dl = now - timedelta(minutes=started_delta)
started_dl = fields.Datetime.subtract(now, minutes=started_delta)
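Both forms compute the same cutoff; `fields.Datetime.subtract` is Odoo's helper for date arithmetic on datetime values. Outside Odoo, the equivalence can be shown with a plain stand-in (the helper below is illustrative, not the Odoo API itself):

```python
from datetime import datetime, timedelta

def subtract_minutes(value, minutes):
    # Plain-Python stand-in for fields.Datetime.subtract(value, minutes=...),
    # which performs the same subtraction on an Odoo datetime value.
    return value - timedelta(minutes=minutes)

now = datetime(2024, 5, 23, 12, 0)
started_dl = subtract_minutes(now, 240)  # cutoff 240 minutes in the past
```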

@@ -418,6 +418,61 @@ def requeue_stuck_jobs(self, enqueued_delta=5, started_delta=0):
).requeue()
return True

def fail_dead_jobs(self, started_delta, force_low_delta=False):
Contributor

Suggested change
def fail_dead_jobs(self, started_delta, force_low_delta=False):
def gc_dead_jobs(self, started_delta, force_low_delta=False):


@@ -10,6 +10,15 @@
<field name="state">code</field>
<field name="code">model.requeue_stuck_jobs()</field>
</record>
<record id="ir_cron_queue_job_fail_dead_jobs" model="ir.cron">
<field name="name">Take care of unresponsive jobs</field>
Contributor

While thinking of a better name for this cron, I wondered... why don't we add another method and use only one cron?

Eg:

<field name="code">model.gc_stuck_jobs()</field>
[...]

def gc_stuck_jobs(self, started_delta=None, force_low_delta=False):
    self.requeue_stuck_jobs()
    if started_delta:
        self.gc_dead_jobs(started_delta, force_low_delta=force_low_delta)

WDYT?
