Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix master server failover race condition #13065

Merged

Conversation

jrafanie
Copy link
Member

@jrafanie jrafanie commented Dec 8, 2016

Abort takeover only if an active master exists

https://bugzilla.redhat.com/show_bug.cgi?id=1402943

Previously, we would abort if a different master existed, even if it was
shut down.

  • server 1 is master and shuts down
  • server 3 runs monitor_servers, becomes master and shuts down
  • server 2 runs monitor_servers AFTER 3 becomes master

server 2 wouldn't take over as master because it sees the inactive
server 3 as master.

Note, commit 2 is the 🍖 of the change. commit 1 is just logging.

https://bugzilla.redhat.com/show_bug.cgi?id=1402943

Previously, we would abort if a different master existed, even if it was
shut down.

* server 1 is master and shuts down
* server 3 runs monitor_servers, becomes master and shuts down
* server 2 runs monitor_servers AFTER 3 becomes master

server 2 wouldn't take over as master because it sees the inactive
server 3 as master.
@jrafanie
Copy link
Member Author

jrafanie commented Dec 8, 2016

cc @carbonin @gtanzillo

@carbonin carbonin self-assigned this Dec 8, 2016
all_servers.each do |s|
_log.debug "Setting this server, #{name}, as master server"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it wouldn't hurt to always log this one too. It'll give us some more insight as to what's going the next time we have to chase a bug in this code.

@carbonin carbonin added this to the Sprint 51 Ending Jan 2, 2017 milestone Dec 8, 2016
Copy link
Member

@carbonin carbonin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 👍


# Set is_master on self, reset every other server in the region, including
# inactive ones.
parent.miq_servers.each do |s|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is to ensure that we properly set the is_master on all servers. Even the ones that are inactive, right?

Also, is there a chance that miq_servers could have been cached? Should we do MiqRegion.my_region(true) above when parent is set? Or parent.reload.miq_servers here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, my_region(true) makes sense to me.

Copy link
Member

@gtanzillo gtanzillo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM

https://bugzilla.redhat.com/show_bug.cgi?id=1402943

We lock on the region row and base all of our server is_master
queries and changes on it, therefore, it's really important we don't
have a cached region.
@jrafanie jrafanie force-pushed the fix_master_server_failover_race_condition branch from 4c213f0 to bbf28c2 Compare December 8, 2016 20:05
@jrafanie
Copy link
Member Author

jrafanie commented Dec 8, 2016

@gtanzillo @carbonin I was able recreate a failing scenario (now fixed) for a server that "restarted"... see the test change in the last commit.

@miq-bot
Copy link
Member

miq-bot commented Dec 8, 2016

Checked commits jrafanie/manageiq@b5c09d8~...c7d72a0 with ruby 2.2.5, rubocop 0.37.2, and haml-lint 0.16.1
2 files checked, 1 offense detected

app/models/miq_server/server_monitor.rb

@jrafanie
Copy link
Member Author

jrafanie commented Dec 8, 2016

Thanks for helping me get to the bottom of this issue @carbonin @gtanzillo @jdeubel 🙇 👏

@carbonin carbonin merged commit 1eafadd into ManageIQ:master Dec 8, 2016
@Fryguy
Copy link
Member

Fryguy commented Dec 9, 2016

Great work @jrafanie !! Nice find and fix.

@jrafanie jrafanie deleted the fix_master_server_failover_race_condition branch December 9, 2016 20:01
simaishi pushed a commit that referenced this pull request Jan 9, 2017
…ce_condition

Fix master server failover race condition
(cherry picked from commit 1eafadd)

https://bugzilla.redhat.com/show_bug.cgi?id=1403983
@simaishi
Copy link
Contributor

simaishi commented Jan 9, 2017

Euwe backport details:

$ git log -1
commit db36d94a8db938e5f6221561b65c83502f3f528b
Author: Nick Carboni <ncarboni@redhat.com>
Date:   Thu Dec 8 17:15:23 2016 -0500

    Merge pull request #13065 from jrafanie/fix_master_server_failover_race_condition
    
    Fix master server failover race condition
    (cherry picked from commit 1eafadd79813e7472d12fca8842fa12ac60bd6ee)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1403983

jrafanie pushed a commit to jrafanie/manageiq that referenced this pull request Feb 17, 2017
…lover_race_condition

Fix master server failover race condition
(cherry picked from commit 1eafadd)

https://bugzilla.redhat.com/show_bug.cgi?id=1402943
@jrafanie
Copy link
Member Author

Opened #13977 for darga backport.

simaishi added a commit that referenced this pull request Apr 20, 2017
[DARGA] Fix master server failover race condition (backport #13065)
@simaishi
Copy link
Contributor

Backported to Darga via #13977

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants