Schedule the check for timed out queue messages once for the region #19585

Merged
jrafanie merged 5 commits into ManageIQ:master from carbonin:schedule_queue_timeout_check on Dec 13, 2019

Conversation

@carbonin (Member) commented Dec 4, 2019

Previously, for some reason, we were checking the queue for timed out messages every time we called MiqServer#monitor_workers (by default every 15 seconds). The logic to determine which workers needed their messages checked for timeout was also a bit too complex.

This PR moves the check for timed out queue messages to a schedule which is run once every 10 minutes for the entire region. Ten minutes was chosen as the default because that is also the default queue message timeout.
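For a sense of how this fits together, here is a minimal sketch of the kind of queue message the scheduler-role server could put up once per interval. MiqQueue.put is the project's real queueing API, but the method name and exact options below are illustrative assumptions rather than the code merged in this PR:

# Illustrative sketch only: enqueue one region-wide check per interval from the
# scheduler role. check_for_timed_out_messages is a hypothetical method name.
MiqQueue.put(
  :class_name  => "MiqQueue",
  :method_name => "check_for_timed_out_messages", # hypothetical
  :zone        => nil                             # nil zone: any server in the region can work it
)

Because only the server holding the scheduler role evaluates schedules_for_scheduler_role, queueing the check there (instead of in every server's monitor loop) is what makes it run once per region.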

@@ -118,7 +118,7 @@ def schedules_for_all_roles
   end

   def schedules_for_scheduler_role
-    # These schedules need to run only once in a zone per interval, so let the single scheduler role handle them
+    # These schedules need to run only once in a region per interval, so let the single scheduler role handle them

@jrafanie (Member) left a comment

🥇 This is great. One scheduler per region queueing this up vs. all servers doing it is amazing.

@Fryguy (Member) commented Dec 5, 2019

I'm not sure if there is a side effect of the timed out messages only being checked every 10 minutes

@Fryguy (Member) commented Dec 5, 2019

Oops hit comment too early...

I'm concerned that there are other timing mechanisms (like automate state machines) relying on timing out before triggering the next state or something. Changing the poll interval to 10 minutes could cause a message to not be detected until an additional 10 minutes after it times out if you catch the timing wrong.
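
To make the worst case concrete (an illustration using the defaults discussed in this PR, not a measured number): a message dispatched at t = 0 with the default 10-minute msg_timeout times out at t = 10:00; if the region-wide check happened to run at t = 9:59, the next run is not until t = 19:59, so the timed-out message sits undetected for almost a full extra poll interval, roughly doubling the effective timeout in the worst case.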

app/models/miq_queue.rb (outdated review thread, resolved)
@carbonin (Member, Author) commented Dec 5, 2019

"I'm concerned that there are other timing mechanisms (like automate state machines) relying on timing out before triggering the next state or something. Changing the poll interval to 10 minutes could cause a message to not be detected until an additional 10 minutes after it times out if you catch the timing wrong."

Yeah, I'm up for suggestions for the schedule time. 10 minutes seemed like a good starting point just because of the queue timeout. We could go with 5 minutes to minimize the chance that something waits for an entire extra timeout period before being found.

We're also validating all of the workers' messages every time they sync config, but with this patch that's happening less frequently as well (we're only calling request_workers_to_sync_config when they actually need to sync config).

@carbonin (Member, Author) commented Dec 5, 2019

"relying on timing out before triggering the next state or something"

But generally this seems ... not great.

config/settings.yml (outdated review thread, resolved)
@carbonin carbonin changed the title Schedule the check for timed out queue messages once for the region [WIP] Schedule the check for timed out queue messages once for the region Dec 5, 2019
@carbonin carbonin added the wip label Dec 5, 2019
@carbonin (Member, Author) commented Dec 5, 2019

Marking this WIP as @jrafanie and I just noticed that the only place we persist the worker heartbeat from DRb to the database is in request_workers_to_sync_config, so we really don't want to be calling that significantly less frequently (as in only when we need workers to sync config).

Going to be looking into refactoring that in a separate patch, then I'll rebase this onto that one.

@carbonin (Member, Author) commented Dec 6, 2019

Rebased onto #19609. Will rebase onto master once that is merged.

@carbonin carbonin changed the title [WIP] Schedule the check for timed out queue messages once for the region Schedule the check for timed out queue messages once for the region Dec 10, 2019
@miq-bot miq-bot removed the wip label Dec 10, 2019
@carbonin (Member, Author) commented

Rebased and updated the schedule to every 1 minute instead of 10.

This should be ready for re-review @Fryguy @jrafanie

@jrafanie (Member) left a comment

LGTM

app/models/miq_queue.rb (outdated review thread, resolved)
config/settings.yml (outdated review thread, resolved)
Commit messages:

- We will schedule a check every ten minutes (the default queue message timeout value). This will allow us to stop doing this every 15 seconds in the server monitor loop.

- It's a bit overkill to be checking for timed out queue messages every time we monitor workers (by default every 15 seconds). This has been moved to a schedule, which means that we can also remove all of the "processed worker" tracking from the monitor_workers method.

- This prevents us from bringing back every message in dequeue in the region just to check its timeout values when we could do the check in SQL (a rough sketch of such a check follows below).

- This removes the collection editing entirely while also removing the need to track individual worker ids.
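
As a rough sketch of what "do the check in SQL" could mean here (the scope name and exact predicate are assumptions; state, msg_timeout, and updated_on are existing MiqQueue columns, with updated_on standing in for the time the message was dequeued):

class MiqQueue < ApplicationRecord
  # Hypothetical scope: find dequeued messages whose per-message timeout has
  # elapsed, entirely in SQL, instead of instantiating every dequeued row.
  scope :timed_out, -> {
    where(:state => STATE_DEQUEUE)
      .where("updated_on < (NOW() - (msg_timeout * INTERVAL '1 second'))")
  }
end

Whatever the merged query actually looks like, doing the comparison in the database is much cheaper than loading every dequeued message in the region into memory just to compare timestamps in Ruby.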
@Fryguy (Member) left a comment

This LGTM... the only thing left I think is some specs for the scope method, but otherwise I'm good, so approving.

@miq-bot commented Dec 11, 2019

Checked commits carbonin/manageiq@0bbfe67~...6698bf2 with ruby 2.5.5, rubocop 0.69.0, haml-lint 0.20.0, and yamllint 1.10.0
8 files checked, 0 offenses detected
Everything looks fine. 🏆

-    miq_workers.delete(*processed_workers) unless processed_workers.empty?
-    processed_workers.collect(&:id)
+    miq_workers.reload if worker_deleted
A member commented:

👍 I like this better than modifying the in-memory association

A member commented:

I want this merged to test if my other change in this area is needed on master... 😉

-    miq_workers.delete(*processed_workers) unless processed_workers.empty?
-    processed_workers.collect(&:id)
+    miq_workers.reload if worker_deleted
A member commented:

💯 Would recommend

@jrafanie (Member) left a comment

LGTM, feel free to merge if your concerns were addressed @Fryguy

-    processed_worker_ids += miq_workers.where(:status => MiqWorker::STATUSES_CURRENT_OR_STARTING).each do |worker|
-      # Check their queue messages for timeout
-      worker.validate_active_messages
+    miq_workers.where(:status => MiqWorker::STATUSES_CURRENT_OR_STARTING).each do |worker|
@jrafanie (Member) commented Dec 12, 2019

@carbonin FYI, this commit: e6cbfa8 fixed the "can't modify frozen Hash" error. If I change the above line to remove the .where, I can recreate that error for a simulated "exceeding memory worker" because it attempts to update a deleted worker from the cached association. The .where makes it run the query each time through the method.
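
A simplified illustration of the difference being described here (not the actual MiqServer code):

# Filtering records already cached on the association can touch a worker row
# that was deleted earlier in the same monitor pass, which is how errors like
# "can't modify frozen Hash" surface.
miq_workers.select { |w| MiqWorker::STATUSES_CURRENT_OR_STARTING.include?(w.status) }.each do |worker|
  # worker may be stale or already destroyed here
end

# Chaining .where builds a new relation, so the query runs fresh on each pass
# through the method and deleted workers simply never show up in the results.
miq_workers.where(:status => MiqWorker::STATUSES_CURRENT_OR_STARTING).each do |worker|
  # worker was just loaded from the database
end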

@carbonin (Member, Author) commented Dec 12, 2019

Oh, interesting. We may still want to make that explicit at some point though ... I wonder if we should do a miq_workers.all.select here and a miq_workers.reload at the start of the method? I guess we're really splitting hairs at that point though ...

A member commented:

I think it's ok here as I'd expect a query like this to not be cached, so there's no need to reload anything. Relying on an association previously made it very unclear whether we had any expectations on caching. I believe your change is much more explicit. Maybe it's just me though?

@carbonin (Member, Author) commented

My concern is more clearly expressed in https://github.com/ManageIQ/manageiq/pull/19638/files#r357318923

I'm worried that someone removing the where for whatever reason wouldn't realize that we were relying on this line reloading the association for reasons other than this particular query.

@carbonin (Member, Author) commented

@jrafanie @Fryguy Is this one good to go?

@jrafanie (Member) commented

Merging, we can address any concerns in a followup PR.

@jrafanie jrafanie merged commit bab00c7 into ManageIQ:master Dec 13, 2019
@jrafanie jrafanie added this to the Sprint 127 Ending Jan 6, 2020 milestone Dec 13, 2019
@carbonin carbonin deleted the schedule_queue_timeout_check branch April 23, 2020 15:37