Fix master server failover race condition #13065
https://bugzilla.redhat.com/show_bug.cgi?id=1402943

Previously, we would abort if a different master existed, even if it was shut down.

* server 1 is master and shuts down
* server 3 runs monitor_servers, becomes master and shuts down
* server 2 runs monitor_servers AFTER 3 becomes master

server 2 wouldn't take over as master because it sees the inactive server 3 as master.
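For reviewers who want to see the scenario in isolation, here is a rough sketch in plain Ruby. It is not the code in server_monitor.rb: `Server`, `takeover_old?`, `takeover_new?`, and `make_master` are hypothetical names made up for illustration, and the real models are not involved.

```ruby
# Hypothetical stand-in for a server record; NOT the real MiqServer model.
Server = Struct.new(:name, :active, :is_master)

# Old check (as described above): any other server flagged as master
# blocks the takeover, even one that is shut down.
def takeover_old?(me, servers)
  servers.none? { |s| !s.equal?(me) && s.is_master }
end

# New check: only an *active* master blocks the takeover.
def takeover_new?(me, servers)
  servers.none? { |s| !s.equal?(me) && s.is_master && s.active }
end

# Set is_master on the winner and reset every other server,
# inactive ones included.
def make_master(me, servers)
  servers.each { |s| s.is_master = s.equal?(me) }
end

server1 = Server.new("server 1", false, false) # was master, shut down
server2 = Server.new("server 2", true,  false) # running, wants to take over
server3 = Server.new("server 3", false, true)  # became master, then shut down
servers = [server1, server2, server3]

old_result = takeover_old?(server2, servers) # blocked by inactive server 3
new_result = takeover_new?(server2, servers) # inactive master is ignored
make_master(server2, servers) if new_result
```

With the old check server 2 never takes over; with the new one it does, and the stale is_master flag on server 3 is cleared as part of the takeover.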
all_servers.each do |s|
_log.debug "Setting this server, #{name}, as master server"
I think it wouldn't hurt to always log this one too. It'll give us some more insight as to what's going on the next time we have to chase a bug in this code.
Looks good 👍
# Set is_master on self, reset every other server in the region, including
# inactive ones.
parent.miq_servers.each do |s|
So this is to ensure that we properly set the is_master on all servers. Even the ones that are inactive, right?
Also, is there a chance that miq_servers could have been cached? Should we do MiqRegion.my_region(true) above when parent is set? Or parent.reload.miq_servers here.
Yeah, my_region(true) makes sense to me.
👍 LGTM
https://bugzilla.redhat.com/show_bug.cgi?id=1402943

We lock on the region row and base all of our server is_master queries and changes on it; therefore, it's really important we don't have a cached region.
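To illustrate why a cached region is risky here, a minimal sketch in plain Ruby follows. `FakeRegion` and its `DB` hash are hypothetical stand-ins; only the idea of a force-reload argument, as in the `my_region(true)` call discussed above, is taken from this PR. The real lookup and locking are not shown.

```ruby
# Hypothetical in-memory "database row" plus a memoized lookup.
# This is NOT ManageIQ code; it only mimics a cached region.
class FakeRegion
  DB = { master: "server 3" }

  # Returns a snapshot of the row; pass true to drop the cache first.
  def self.my_region(force_reload = false)
    @cached = nil if force_reload
    @cached ||= DB.dup
  end
end

FakeRegion.my_region                         # first call caches a snapshot
FakeRegion::DB[:master] = "server 2"         # another process updates the row

cached_view = FakeRegion.my_region[:master]       # still the old snapshot
fresh_view  = FakeRegion.my_region(true)[:master] # re-read after forcing reload
```

Any is_master decision made from `cached_view` would be based on a master that has already been replaced, which is the kind of staleness the forced reload is meant to avoid.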
4c213f0 to bbf28c2
@gtanzillo @carbonin I was able to recreate a failing scenario (now fixed) for a server that "restarted"... see the test change in the last commit.
Checked commits jrafanie/manageiq@b5c09d8~...c7d72a0 with ruby 2.2.5, rubocop 0.37.2, and haml-lint 0.16.1 app/models/miq_server/server_monitor.rb
Thanks for helping me get to the bottom of this issue @carbonin @gtanzillo @jdeubel 🙇 👏
Great work @jrafanie !! Nice find and fix.
…ce_condition
Fix master server failover race condition
(cherry picked from commit 1eafadd)
https://bugzilla.redhat.com/show_bug.cgi?id=1403983
Euwe backport details:
…lover_race_condition
Fix master server failover race condition
(cherry picked from commit 1eafadd)
https://bugzilla.redhat.com/show_bug.cgi?id=1402943
Opened #13977 for Darga backport.
[DARGA] Fix master server failover race condition (backport #13065)
Backported to Darga via #13977
Abort takeover only if an active master exists
https://bugzilla.redhat.com/show_bug.cgi?id=1402943
Previously, we would abort if a different master existed, even if it was
shut down.
server 2 wouldn't take over as master because it sees the inactive
server 3 as master.
Note: commit 2 is the 🍖 of the change. Commit 1 is just logging.