Schedule the check for timed out queue messages once for the region #19585
Conversation
@@ -118,7 +118,7 @@ def schedules_for_all_roles
   end

   def schedules_for_scheduler_role
-    # These schedules need to run only once in a zone per interval, so let the single scheduler role handle them
+    # These schedules need to run only once in a region per interval, so let the single scheduler role handle them
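For illustration, here is a hedged sketch of how a region-wide schedule entry for the timeout check could be registered in this method; the interval value, the job symbol, and the scheduler/enqueue helpers shown are assumptions, not necessarily the PR's exact code.

```ruby
def schedules_for_scheduler_role
  # These schedules need to run only once in a region per interval, so let the single scheduler role handle them

  # Hypothetical entry: enqueue the queue-timeout check every 10 minutes
  # (10 minutes matches the default queue message timeout).
  every = 10.minutes
  scheduler.schedule_every(every, :first_in => every) do
    enqueue(:miq_queue_check_for_timeout)
  end
end
```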
🥇 This is great. One scheduler per region queueing this up vs. all servers doing it is amazing.
I'm not sure if there is a side effect of the timed out messages only being checked every 10 minutes.
Oops, hit comment too early... I'm concerned that there are other timing mechanisms (like automate state machines) relying on timing out before triggering the next state or something. Changing the poll interval to 10 minutes could cause a message to not be detected for an additional 10 minutes after it times out if you catch the timing wrong.
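To put that concern in concrete terms: with both the message timeout and the poll interval at the default 10 minutes, a message that times out just after a check has run could wait nearly another full interval before being noticed, so detection can lag up to roughly 20 minutes after the message was dispatched; a 5-minute poll would cap that worst case at about 15 minutes.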
Yeah, I'm up for suggestions for the schedule time. 10 minutes seemed like a good starting point just because of the queue timeout. We could go with 5 to minimize the chance that something waits for an entire extra timeout period before being found. We're also validating all the workers' messages every time they sync config, but with this patch that's happening less frequently as well (we're only calling …
But generally this seems ... not great.
Marking this WIP as @jrafanie and I just noticed that the only place we persist the worker heartbeat from drb to the database is in … Going to be looking into refactoring that in a separate patch, then I'll rebase this onto that one.
(force-pushed from 8a02100 to dd242c3)
Rebased onto #19609. Will rebase on master once that is merged.
(force-pushed from dd242c3 to 08b9e22)
(force-pushed from 08b9e22 to 4c674bf)
LGTM
We will schedule a check every ten minutes (the default queue message timeout value). This will allow us to stop doing this every 15 seconds in the server monitor loop.
It's a bit overkill to be checking for timed out queue messages every time we monitor workers (by default every 15 seconds). This has been moved to a schedule, which means that we can also remove all of the "processed worker" tracking from the monitor_workers method.
This prevents us from bringing back every message in dequeue in the region to just check its timeout values when we could do the check in SQL.
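As a rough sketch of the SQL-side check that last commit message describes (the scope name and the exact SQL here are assumptions and may differ from the PR's implementation):

```ruby
class MiqQueue < ApplicationRecord
  # Hypothetical scope: select dequeued messages whose msg_timeout has
  # elapsed, letting the database filter rows instead of instantiating
  # every dequeued message in the region.
  scope :timed_out, lambda {
    where(:state => STATE_DEQUEUE)
      .where("(updated_on + (msg_timeout * INTERVAL '1 second')) < ?", Time.now.utc)
  }
end
```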
(force-pushed from 4c674bf to ecc9b73)
This removes the collection editing entirely while also removing the need to track individual worker ids.
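A simplified sketch of the pattern this commit message describes (the method and variable names here are assumptions): rather than collecting the processed workers, deleting them from the loaded association, and returning their ids, the loop just records whether anything was destroyed and reloads the association once at the end.

```ruby
def clean_worker_records
  worker_deleted = false
  miq_workers.each do |worker|
    next unless worker.is_stopped?  # assumed predicate for records that can be cleaned up
    worker.destroy
    worker_deleted = true
  end
  miq_workers.reload if worker_deleted
end
```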
(force-pushed from ecc9b73 to 4640044)
This LGTM... the only thing left I think is some specs for the scope method, but otherwise I'm good, so approving.
(force-pushed from 4640044 to 6698bf2)
Checked commits carbonin/manageiq@0bbfe67~...6698bf2 with ruby 2.5.5, rubocop 0.69.0, haml-lint 0.20.0, and yamllint 1.10.0
-    miq_workers.delete(*processed_workers) unless processed_workers.empty?
-    processed_workers.collect(&:id)
+    miq_workers.reload if worker_deleted
👍 I like this better than modifying the in-memory association
I want this merged to test if my other change in this area is needed on master... 😉
💯 Would recommend
LGTM, feel free to merge if your concerns were addressed @Fryguy
-    processed_worker_ids += miq_workers.where(:status => MiqWorker::STATUSES_CURRENT_OR_STARTING).each do |worker|
-      # Check their queue messages for timeout
-      worker.validate_active_messages
+    miq_workers.where(:status => MiqWorker::STATUSES_CURRENT_OR_STARTING).each do |worker|
@carbonin FYI, this commit: e6cbfa8 fixed the "can't modify frozen Hash" error. If I change the above line to remove the `.where`, I can recreate that error for a simulated "exceeding memory worker" because it attempts to update a deleted worker from the cached association. The `.where` makes it run the query each time through the method.
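A standalone illustration of the ActiveRecord behavior being discussed (not code from this PR): iterating the bare association reuses its cached target, while a `.where` off the association builds a new relation and queries the database each time.

```ruby
server = MiqServer.my_server

server.miq_workers.load                 # loads and caches the association target
# ... a worker row is destroyed elsewhere ...
server.miq_workers.each { |w| w.id }    # still yields the stale, cached records

# A scoped relation off the association is not the cached target,
# so this runs a fresh query each time through the method:
server.miq_workers.where(:status => MiqWorker::STATUSES_CURRENT_OR_STARTING).each { |w| w.id }
```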
Oh, interesting. We may still want to make that explicit at some point though ... I wonder if we should do a `miq_workers.all.select` here and a `miq_workers.reload` at the start of the method? I guess we're really splitting hairs at that point though ...
I think it's ok here as I'd expect a query like this to not be cached, so, there's no need to reload anything. Relying on an association previously made it very unclear whether we had any expectations on caching. I believe your change is much more explicit. Maybe it's just me though?
My concern is more clearly expressed in https://github.com/ManageIQ/manageiq/pull/19638/files#r357318923. I'm worried that someone removing the `where` for whatever reason wouldn't realize that we were relying on this line reloading the association for reasons other than this particular query.
Merging, we can address any concerns in a followup PR.
Previously, for some reason, we were checking the queue for timed out messages every time we called `MiqServer#monitor_workers` (by default every 15 seconds). The logic to determine which workers needed their messages checked for timeout was also a bit too complex.

This PR moves the check for timed out queue messages to a schedule which is run once every 10 minutes for the entire region. Ten minutes was chosen as the default because that is also the default queue message timeout.
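For completeness, a hedged sketch of what the scheduled region-wide check could look like, building on the `timed_out` scope sketched earlier; the method names are assumptions based on the description above, not necessarily the PR's exact code.

```ruby
class MiqQueue < ApplicationRecord
  # Hypothetical entry point for the 10-minute schedule: the timed-out
  # messages are selected in SQL, then each one is handled individually.
  def self.check_for_timeout
    timed_out.find_each do |msg|
      # assumed per-message handling (e.g. log the timeout, destroy, or requeue)
      msg.check_for_timeout
    end
  end
end
```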