Schedule the check for timed out queue messages once for the region #19585

Merged
jrafanie merged 5 commits into ManageIQ:master from carbonin:schedule_queue_timeout_check on Dec 13, 2019

Conversation

@carbonin (Member) commented Dec 4, 2019

Previously, for some reason, we were checking the queue for timed out messages every time we called MiqServer#monitor_workers (by default every 15 seconds). The logic to determine which workers needed their messages checked for timeout was also a bit too complex.

This PR moves the check for timed out queue messages to a schedule which is run once every 10 minutes for the entire region. Ten minutes was chosen as the default because that is also the default queue message timeout.
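For a sense of how this fits together, here is a minimal sketch of the kind of queue message the scheduler-role server could put up once per interval. MiqQueue.put is the project's real queueing API, but the method name and exact options below are illustrative assumptions rather than the code merged in this PR:

# Illustrative sketch only: enqueue one region-wide check per interval from the
# scheduler role. check_for_timed_out_messages is a hypothetical method name.
MiqQueue.put(
  :class_name  => "MiqQueue",
  :method_name => "check_for_timed_out_messages", # hypothetical
  :zone        => nil                             # nil zone: any server in the region can work it
)

Because only the server holding the scheduler role evaluates schedules_for_scheduler_role, queueing the check there (instead of in every server's monitor loop) is what makes it run once per region.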

@@ -118,7 +118,7 @@ def schedules_for_all_roles
   end

   def schedules_for_scheduler_role
-    # These schedules need to run only once in a zone per interval, so let the single scheduler role handle them
+    # These schedules need to run only once in a region per interval, so let the single scheduler role handle them

@jrafanie (Member) left a comment

🥇 This is great. One scheduler per region queueing this up vs. all servers doing it is amazing.

@Fryguy (Member) commented Dec 5, 2019

I'm not sure if there is a side effect of the timed out messages only being checked every 10 minutes

@Fryguy (Member) commented Dec 5, 2019

Oops hit comment too early...

I'm concerned that there are other timing mechanisms (like automate state machines) relying on timing out before triggering the next state or something. Changing the poll interval to 10 minutes could cause a message to not be detected until an additional 10 minutes after it times out if you catch the timing wrong.
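
To make the worst case concrete (an illustration using the defaults discussed in this PR, not a measured number): a message dispatched at t = 0 with the default 10-minute msg_timeout times out at t = 10:00; if the region-wide check happened to run at t = 9:59, the next run is not until t = 19:59, so the timed-out message sits undetected for almost a full extra poll interval, roughly doubling the effective timeout in the worst case.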

app/models/miq_queue.rb (outdated review thread, resolved)
@carbonin (Member, Author) commented Dec 5, 2019

"I'm concerned that there are other timing mechanisms (like automate state machines) relying on timing out before triggering the next state or something. Changing the poll interval to 10 minutes could cause a message to not be detected until an additional 10 minutes after it times out if you catch the timing wrong."

Yeah, I'm up for suggestions for the schedule time. 10 minutes seemed like a good starting point just because of the queue timeout. We could go with 5 minutes to minimize the chance that something waits for an entire extra timeout period before being found.

We're also validating all of the workers' messages every time they sync config, but with this patch that's happening less frequently as well (we're only calling request_workers_to_sync_config when they actually need to sync config).

@carbonin (Member, Author) commented Dec 5, 2019

"relying on timing out before triggering the next state or something"

But generally this seems ... not great.

config/settings.yml (outdated review thread, resolved)
@carbonin carbonin changed the title Schedule the check for timed out queue messages once for the region [WIP] Schedule the check for timed out queue messages once for the region Dec 5, 2019
@carbonin carbonin added the wip label Dec 5, 2019
@carbonin (Member, Author) commented Dec 5, 2019

Marking this WIP as @jrafanie and I just noticed that the only place we persist the worker heartbeat from DRb to the database is in request_workers_to_sync_config, so we really don't want to be calling that significantly less frequently (as in only when we need workers to sync config).

Going to be looking into refactoring that in a separate patch, then I'll rebase this onto that one.

@carbonin (Member, Author) commented Dec 6, 2019

Rebased onto #19609. Will rebase onto master once that is merged.

@carbonin carbonin changed the title [WIP] Schedule the check for timed out queue messages once for the region Schedule the check for timed out queue messages once for the region Dec 10, 2019
@miq-bot miq-bot removed the wip label Dec 10, 2019
@carbonin (Member, Author) commented

Rebased and updated the schedule to every 1 minute instead of 10.

This should be ready for re-review @Fryguy @jrafanie

@jrafanie (Member) left a comment

LGTM

app/models/miq_queue.rb (outdated review thread, resolved)
config/settings.yml (outdated review thread, resolved)
Commit messages:

- We will schedule a check every ten minutes (the default queue message timeout value). This will allow us to stop doing this every 15 seconds in the server monitor loop.

- It's a bit overkill to be checking for timed out queue messages every time we monitor workers (by default every 15 seconds). This has been moved to a schedule, which means that we can also remove all of the "processed worker" tracking from the monitor_workers method.

- This prevents us from bringing back every message in dequeue in the region just to check its timeout values when we could do the check in SQL (a rough sketch of such a check follows below).

- This removes the collection editing entirely while also removing the need to track individual worker ids.
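
As a rough sketch of what "do the check in SQL" could mean here (the scope name and exact predicate are assumptions; state, msg_timeout, and updated_on are existing MiqQueue columns, with updated_on standing in for the time the message was dequeued):

class MiqQueue < ApplicationRecord
  # Hypothetical scope: find dequeued messages whose per-message timeout has
  # elapsed, entirely in SQL, instead of instantiating every dequeued row.
  scope :timed_out, -> {
    where(:state => STATE_DEQUEUE)
      .where("updated_on < (NOW() - (msg_timeout * INTERVAL '1 second'))")
  }
end

Whatever the merged query actually looks like, doing the comparison in the database is much cheaper than loading every dequeued message in the region into memory just to compare timestamps in Ruby.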
@Fryguy (Member) left a comment

This LGTM... the only thing left I think is some specs for the scope method, but otherwise I'm good, so approving.

@miq-bot commented Dec 11, 2019

Checked commits carbonin/manageiq@0bbfe67~...6698bf2 with ruby 2.5.5, rubocop 0.69.0, haml-lint 0.20.0, and yamllint 1.10.0
8 files checked, 0 offenses detected
Everything looks fine. 🏆

-    miq_workers.delete(*processed_workers) unless processed_workers.empty?
-    processed_workers.collect(&:id)
+    miq_workers.reload if worker_deleted
A member commented:

👍 I like this better than modifying the in-memory association

A member commented:

I want this merged to test if my other change in this area is needed on master... 😉

-    miq_workers.delete(*processed_workers) unless processed_workers.empty?
-    processed_workers.collect(&:id)
+    miq_workers.reload if worker_deleted
A member commented:

💯 Would recommend

@jrafanie (Member) left a comment

LGTM, feel free to merge if your concerns were addressed @Fryguy

-    processed_worker_ids += miq_workers.where(:status => MiqWorker::STATUSES_CURRENT_OR_STARTING).each do |worker|
-      # Check their queue messages for timeout
-      worker.validate_active_messages
+    miq_workers.where(:status => MiqWorker::STATUSES_CURRENT_OR_STARTING).each do |worker|
@jrafanie (Member) commented Dec 12, 2019

@carbonin FYI, this commit: e6cbfa8 fixed the "can't modify frozen Hash" error. If I change the above line to remove the .where, I can recreate that error for a simulated "exceeding memory worker" because it attempts to update a deleted worker from the cached association. The .where makes it run the query each time through the method.
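
A simplified illustration of the difference being described here (not the actual MiqServer code):

# Filtering records already cached on the association can touch a worker row
# that was deleted earlier in the same monitor pass, which is how errors like
# "can't modify frozen Hash" surface.
miq_workers.select { |w| MiqWorker::STATUSES_CURRENT_OR_STARTING.include?(w.status) }.each do |worker|
  # worker may be stale or already destroyed here
end

# Chaining .where builds a new relation, so the query runs fresh on each pass
# through the method and deleted workers simply never show up in the results.
miq_workers.where(:status => MiqWorker::STATUSES_CURRENT_OR_STARTING).each do |worker|
  # worker was just loaded from the database
end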

@carbonin (Member, Author) commented Dec 12, 2019

Oh, interesting. We may still want to make that explicit at some point though ... I wonder if we should do a miq_workers.all.select here and a miq_workers.reload at the start of the method? I guess we're really splitting hairs at that point though ...

A member commented:

I think it's ok here as I'd expect a query like this to not be cached, so there's no need to reload anything. Relying on an association previously made it very unclear whether we had any expectations on caching. I believe your change is much more explicit. Maybe it's just me though?

@carbonin (Member, Author) commented

My concern is more clearly expressed in https://github.com/ManageIQ/manageiq/pull/19638/files#r357318923

I'm worried that someone removing the where for whatever reason wouldn't realize that we were relying on this line reloading the association for reasons other than this particular query.

@carbonin (Member, Author) commented

@jrafanie @Fryguy Is this one good to go?

@jrafanie (Member) commented

Merging, we can address any concerns in a followup PR.

@jrafanie jrafanie merged commit bab00c7 into ManageIQ:master Dec 13, 2019
@jrafanie jrafanie added this to the Sprint 127 Ending Jan 6, 2020 milestone Dec 13, 2019
@carbonin carbonin deleted the schedule_queue_timeout_check branch April 23, 2020 15:37