Fix master server failover race condition #13065

jrafanie · 2016-12-08T17:27:44Z

Abort takeover only if an active master exists

https://bugzilla.redhat.com/show_bug.cgi?id=1402943

Previously, we would abort if a different master existed, even if it was
shut down.

* server 1 is master and shuts down
* server 3 runs monitor_servers, becomes master and shuts down
* server 2 runs monitor_servers AFTER 3 becomes master

server 2 wouldn't take over as master because it sees the inactive
server 3 as master.
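
A minimal sketch of the before/after decision described above, for readers who want to see the scenario play out. It uses a plain Ruby Struct instead of the real MiqServer model; `Server`, the `active` flag, `abort_takeover_old?`, and `abort_takeover?` are hypothetical names chosen for illustration and are not taken from the actual change.

```ruby
# Hypothetical stand-in for a server record; not the real MiqServer model.
Server = Struct.new(:name, :is_master, :active)

# Old behaviour: abort whenever any other server is flagged as master.
def abort_takeover_old?(me, servers)
  servers.any? { |s| !s.equal?(me) && s.is_master }
end

# New behaviour: abort only if the other master is also active.
def abort_takeover?(me, servers)
  servers.any? { |s| !s.equal?(me) && s.is_master && s.active }
end

server1 = Server.new("server 1", false, false) # was master, then shut down
server3 = Server.new("server 3", true,  false) # became master, then shut down
server2 = Server.new("server 2", false, true)  # still running
servers = [server1, server2, server3]

puts abort_takeover_old?(server2, servers) # true: nobody active ends up as master
puts abort_takeover?(server2, servers)     # false: server 2 may take over
```

With the old check, server 2 backs off because the stopped server 3 still carries the master flag; with the new check it proceeds.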

Note: commit 2 is the 🍖 of the change; commit 1 is just logging.

jrafanie · 2016-12-08T17:28:55Z

cc @carbonin @gtanzillo

gtanzillo · 2016-12-08T18:33:29Z

app/models/miq_server/server_monitor.rb

      all_servers.each do |s|
+      _log.debug "Setting this server, #{name}, as master server"


I think it wouldn't hurt to always log this one too. It'll give us some more insight as to what's going on the next time we have to chase a bug in this code.

carbonin

Looks good 👍

gtanzillo · 2016-12-08T18:58:26Z

app/models/miq_server/server_monitor.rb

+
+      # Set is_master on self, reset every other server in the region, including
+      # inactive ones.
+      parent.miq_servers.each do |s|


So this is to ensure that we properly set the is_master on all servers. Even the ones that are inactive, right?

Also, is there a chance that miq_servers could have been cached? Should we do MiqRegion.my_region(true) above when parent is set? Or parent.reload.miq_servers here.

jrafanie

Yeah, my_region(true) makes sense to me.
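
To illustrate the caching concern raised in this thread, here is a rough sketch in plain Ruby. It does not use ActiveRecord or the real MiqRegion API; `DB`, `Region`, `servers`, and `reload` are hypothetical stand-ins that only mimic how a memoized association can return stale data until it is reloaded.

```ruby
# Hypothetical in-memory "database"; not the real schema.
DB = { servers: [{ name: "server 2", is_master: false }] }

# Hypothetical region object that memoizes its server list,
# similar in spirit to a cached association.
class Region
  # The first call snapshots the rows; later calls reuse the snapshot.
  def servers
    @servers ||= DB[:servers].map(&:dup)
  end

  # Drops the snapshot so the next read goes back to the store.
  def reload
    @servers = nil
    self
  end
end

region = Region.new
region.servers # first read populates the cache

# Another process flips the master flag behind our back.
DB[:servers][0][:is_master] = true

stale = region.servers.first[:is_master]        # cached value
fresh = region.reload.servers.first[:is_master] # value after reload
puts "stale=#{stale} fresh=#{fresh}"            # stale=false fresh=true
```

The point is only that decisions about is_master made on the cached list can miss a change made by another server, which is why fetching an uncached region before iterating was suggested.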

gtanzillo

👍 LGTM

jrafanie · 2016-12-08T20:42:47Z

@gtanzillo @carbonin I was able to recreate a failing scenario (now fixed) for a server that "restarted"... see the test change in the last commit.

miq-bot · 2016-12-08T20:47:40Z

Checked commits jrafanie/manageiq@b5c09d8~...c7d72a0 with ruby 2.2.5, rubocop 0.37.2, and haml-lint 0.16.1
2 files checked, 1 offense detected

app/models/miq_server/server_monitor.rb

❕ - Line 19, Col 121 - Metrics/LineLength - Line is too long. [153/120]

jrafanie · 2016-12-08T21:08:03Z

Thanks for helping me get to the bottom of this issue @carbonin @gtanzillo @jdeubel 🙇 👏

Fryguy · 2016-12-09T19:55:07Z

Great work @jrafanie !! Nice find and fix.

…ce_condition Fix master server failover race condition (cherry picked from commit 1eafadd) https://bugzilla.redhat.com/show_bug.cgi?id=1403983

simaishi · 2017-01-09T18:43:14Z

Euwe backport details:

$ git log -1
commit db36d94a8db938e5f6221561b65c83502f3f528b
Author: Nick Carboni <ncarboni@redhat.com>
Date:   Thu Dec 8 17:15:23 2016 -0500

    Merge pull request #13065 from jrafanie/fix_master_server_failover_race_condition
    
    Fix master server failover race condition
    (cherry picked from commit 1eafadd79813e7472d12fca8842fa12ac60bd6ee)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1403983

…lover_race_condition Fix master server failover race condition (cherry picked from commit 1eafadd) https://bugzilla.redhat.com/show_bug.cgi?id=1402943

jrafanie · 2017-02-17T20:53:44Z

Opened #13977 for darga backport.

[DARGA] Fix master server failover race condition (backport #13065)

simaishi · 2017-04-20T22:47:52Z

Backported to Darga via #13977

jrafanie added 2 commits December 8, 2016 12:11

Add logging around master server failover

b5c09d8

https://bugzilla.redhat.com/show_bug.cgi?id=1402943

jrafanie added bug core euwe/yes labels Dec 8, 2016

carbonin self-assigned this Dec 8, 2016

gtanzillo reviewed Dec 8, 2016

carbonin added this to the Sprint 51 Ending Jan 2, 2017 milestone Dec 8, 2016

carbonin approved these changes Dec 8, 2016

gtanzillo reviewed Dec 8, 2016

gtanzillo approved these changes Dec 8, 2016

make_master_server uncached again!

bbf28c2

https://bugzilla.redhat.com/show_bug.cgi?id=1402943

We lock on the region row and base all of our server is_master queries and changes on it; therefore, it's really important we don't have a cached region.

jrafanie force-pushed the fix_master_server_failover_race_condition branch from 4c213f0 to bbf28c2 Compare December 8, 2016 20:05

Test a restarted server takeover from a stopped master

c7d72a0

https://bugzilla.redhat.com/show_bug.cgi?id=1402943

carbonin merged commit 1eafadd into ManageIQ:master Dec 8, 2016

jrafanie deleted the fix_master_server_failover_race_condition branch December 9, 2016 20:01

simaishi added euwe/backported and removed euwe/yes labels Jan 9, 2017

jrafanie mentioned this pull request Feb 17, 2017

[DARGA] Fix master server failover race condition (backport #13065) #13977

Merged

jrafanie added the darga/yes label Feb 17, 2017

simaishi added a commit that referenced this pull request Apr 20, 2017

Merge pull request #13977 from jrafanie/backport_13065_to_darga

07a82c2

[DARGA] Fix master server failover race condition (backport #13065)

simaishi added darga/backported and removed darga/yes labels Apr 20, 2017
