
[14.0][IMP] queue_job: add cron to purge dead jobs. #653

Open · wants to merge 3 commits into base: 14.0

Conversation

Contributor

@hparfr hparfr commented May 23, 2024

I'm trying to improve the remediation of stuck started jobs.

The Jobs Garbage Collector cron, with its second parameter, lets you requeue started jobs. But if the root cause, such as a CPU limit error, is still present after the requeue, the issue keeps reappearing.

With this new cron, the job is marked as failed and not requeued.
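For reference, the selection logic being proposed can be sketched framework-free (this is an illustrative stand-in, not the module's actual ORM query; `jobs` here is a plain list of dicts rather than `queue.job` records):

```python
from datetime import datetime, timedelta

def find_dead_jobs(jobs, started_delta, now=None):
    """Select jobs stuck in the 'started' state for more than started_delta minutes.

    Jobs past the deadline are presumed dead (killed worker, SIGTERM, power
    loss, ...) and are candidates to be marked as failed, not requeued.
    """
    now = now or datetime.now()
    deadline = now - timedelta(minutes=started_delta)
    return [
        job for job in jobs
        if job["state"] == "started" and job["date_started"] < deadline
    ]
```

A job that started five hours ago trips a 240-minute threshold, while a job started 30 minutes ago does not.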

@OCA-git-bot
Contributor

Hi @guewen,
some modules you are maintaining are being modified, check this out!

@@ -10,6 +10,15 @@
<field name="state">code</field>
<field name="code">model.requeue_stuck_jobs()</field>
</record>
<record id="ir_cron_queue_job_fail_dead_jobs()" model="ir.cron">
<field name="name">Jobs Garbage Collector</field>
Contributor

This has the same name as the job above.

"""
now = fields.datetime.now()
started_dl = now - timedelta(minutes=started_delta)
if started_delta <= 10:
Contributor

Instead of hardcoded 10 minutes here, perhaps use config['limit_time_real'] seconds?

Contributor Author

You may have a short limit_time_real for the regular HTTP workers and a long one for the queue workers on a separate instance.

The idea is to use this cron as a last line of defence against a poorly configured instance.

Contributor

I still think this would be better served with something like

if started_delta <= int(self.env['ir.config_parameter'].sudo().get_param('queue_job.limit_time_dead', 10)):

Contributor Author

I added an argument to bypass the 10-minute limit without having to change the code.
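The guard under discussion can be sketched in plain Python (argument names follow the PR; the 10-minute floor is the hardcoded value being debated, and this helper name is illustrative only):

```python
def check_started_delta(started_delta, force_low_delta=False, floor_minutes=10):
    """Refuse a suspiciously low threshold unless explicitly overridden.

    A too-low started_delta would mark still-running jobs as failed, so
    values at or below floor_minutes require force_low_delta=True.
    """
    if started_delta <= floor_minutes and not force_low_delta:
        raise ValueError(
            "started_delta <= %d min; pass force_low_delta=True to override"
            % floor_minutes
        )
    return started_delta
```

With this shape, `model.fail_dead_jobs(240)` passes silently, while `model.fail_dead_jobs(5)` raises unless `force_low_delta=True` is given.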

queue_job/data/queue_data.xml (outdated, resolved)
@@ -418,6 +418,61 @@ def requeue_stuck_jobs(self, enqueued_delta=5, started_delta=0):
).requeue()
return True

def fail_dead_jobs(self, started_delta, force_low_delta=False):
Contributor

Not a required change, but I think it would be reasonable to force_low_delta=True here to turn your safety check on by default. I'm going to override

            <field name="code">model.fail_dead_jobs(240)</field>

anyway to significantly decrease the started_delta for my own purposes, so it will be no inconvenience to me to also disable that check.

Contributor Author

It's the opposite; you will have to change the cron to something like:

model.fail_dead_jobs(5, force_low_delta=True)

BTW, why are you planning to put a low value here? What's your use case?

Contributor

BTW, why are you planning to put a low value here? What's your use case?

I'm impatient. 😉 I don't use a dedicated queue_job server, so all my jobs run within the default 60/120 cpu/real time limits. Ten minutes is an eternity. 😉 I also plan to decrease the Dead Job cron interval.

Contributor

@florian-dacosta florian-dacosta left a comment

LGTM
I believe in a new version (18?) it would be nice to merge both crons.

The cron would then requeue the enqueued jobs and set to failed the jobs started for too long (and make this behaviour the default).
But it is better not to do it in an already released version.

<field name="numbercall">-1</field>
<field name="model_id" ref="model_queue_job" />
<field name="state">code</field>
<field name="code">model.fail_dead_jobs(240)</field>
Contributor

I think it would be nice to add advice somewhere on how to choose the value, in case someone wants to adapt it to their config.
In the README or in the method description? The ideal value would be the cpu_time limit from the queue job server config, right?

Contributor

I agree. I think a comment here in the code would be nice too 😉

Comment on lines +428 to +436
This function, mark jobs started longtime ago
as failed.

Cause of death can be CPU Time limit reached
a SIGTERM, a power shortage, we can't know, etc.

This mechanism should be very exceptionnal.
It may help, for instance, if someone forget to configure
properly his system.
Contributor

Suggested change
This function, mark jobs started longtime ago
as failed.
Cause of death can be CPU Time limit reached
a SIGTERM, a power shortage, we can't know, etc.
This mechanism should be very exceptionnal.
It may help, for instance, if someone forget to configure
properly his system.
This function marks jobs started longtime ago
as failed.
Cause of death can be CPU Time limit reached
a SIGTERM, a power shortage, etc. We can't know.
This mechanism should be very exceptional.
It may help, for instance, if someone forgot to configure
properly his system.

= you know what you do
"""
now = fields.datetime.now()
started_dl = now - timedelta(minutes=started_delta)
Contributor

Suggested change
started_dl = now - timedelta(minutes=started_delta)
started_dl = fields.Datetime.subtract(now, minutes=started_delta)
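Both forms compute the same cutoff; `fields.Datetime.subtract` is Odoo's helper for date arithmetic on datetime values. Outside Odoo, the equivalence can be shown with a plain stand-in (the helper below is illustrative, not the Odoo API itself):

```python
from datetime import datetime, timedelta

def subtract_minutes(value, minutes):
    # Plain-Python stand-in for fields.Datetime.subtract(value, minutes=...),
    # which performs the same subtraction on an Odoo datetime value.
    return value - timedelta(minutes=minutes)

now = datetime(2024, 5, 23, 12, 0)
started_dl = subtract_minutes(now, 240)  # cutoff 240 minutes in the past
```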

@@ -418,6 +418,61 @@ def requeue_stuck_jobs(self, enqueued_delta=5, started_delta=0):
).requeue()
return True

def fail_dead_jobs(self, started_delta, force_low_delta=False):
Contributor

Suggested change
def fail_dead_jobs(self, started_delta, force_low_delta=False):
def gc_dead_jobs(self, started_delta, force_low_delta=False):


@@ -10,6 +10,15 @@
<field name="state">code</field>
<field name="code">model.requeue_stuck_jobs()</field>
</record>
<record id="ir_cron_queue_job_fail_dead_jobs" model="ir.cron">
<field name="name">Take care of unresponsive jobs</field>
Contributor

While thinking of a better name for this cron, I wondered... why don't we add another method and use only one cron?

Eg:

<field name="code">model.gc_stuck_jobs()</field>
[...]

def gc_stuck_jobs(self, started_delta=None, force_low_delta=False):
    self.requeue_stuck_jobs()
    if started_delta:
        self.gc_dead_jobs(started_delta, force_low_delta=force_low_delta)

WDYT?
