Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change corosync and sbd parameters based on the profile environment detected #842

Merged
merged 4 commits into from
Aug 11, 2021

Conversation

liangxin1300
Copy link
Collaborator

@liangxin1300 liangxin1300 commented Jul 6, 2021

Motivation

In different environment, some corosync and sbd parameters should be adjusted.
It should be nice when bootstrapping if crmsh could detect the environment, then load the different profile data for different platforms.

Changes

  • I add /etc/crm/profiles.yml for different platforms
# The valid profile names are:
# "microsoft-azure", "google-cloud-platform", "amazon-web-services", "s390", "default"
#
# "default" profile is loaded in the beginning.
#
# Those specific profile will override the corresponding values in "default"
# profile if the specific environment is detected.
#
# Users could customize the "default" profile for their needs, for example,
# those on-premise environments which is not defined yet.
#
# Profiles are only loaded on bootstrap init node.
#
# More details please see man corosync.conf, man sbd

default:
  corosync.totem.crypto_hash: sha1
  corosync.totem.crypto_cipher: aes256
  corosync.totem.token: 5000
  corosync.totem.join: 60
  corosync.totem.max_messages: 20
  corosync.totem.token_retransmits_before_loss_const: 10
  # sbd.msgwait is set to sbd.watchdog_timeout*2 by crmsh
  # or, you can define your own value in profiles.yml
  sbd.watchdog_timeout: 15

microsoft-azure:
  corosync.totem.token: 30000
  sbd.watchdog_timeout: 60
  • The whole init process in Azure
    When starting pacemaker, show the time should wait for sbd(sbd.watchdog_timeout * 2 * 1.2)
# crm cluster init -s /dev/sdc1 -y
  Loading "default" profile from /etc/crm/profiles.yml
  Detected "microsoft-azure" platform
  Loading "microsoft-azure" profile from /etc/crm/profiles.yml
  Configuring csync2
  Generating csync2 shared key (this may take a while)...done
  csync2 checking files...done
  Configuring corosync (unicast)
  Initializing SBD......done
  Hawk cluster interface is now running. To see cluster status, open:
    https://10.0.0.4:7630/
  Log in with username 'hacluster'
  Starting pacemaker(waiting for sbd 144s)...done
  Waiting for cluster..............done
  Loading initial cluster configuration
  Done (log saved to /var/log/crmsh/ha-cluster-bootstrap.log)
  • When starting cluster service, show the time should wait for sbd
# crm cluster start
Starting pacemaker(waiting for sbd 144s)...done
INFO: Cluster services started
  • sbd dump after bootstrapping in Azure
# sbd -d /dev/sdb1 dump
==Dumping header on disk /dev/sdb1
Header version     : 2.1
UUID               : c1489a11-02f6-4475-ac98-daa5e119d44e
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 60
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 120
==Header on disk /dev/sdb1 is dumped
  • Cat /etc/sysconfig/sbd in Azure
# cat /etc/sysconfig/sbd|grep SBD_WATCHDOG_TIMEOUT
SBD_WATCHDOG_TIMEOUT=60
  • show crmsh config in Azure
# crm configure show
node 1: crmsh15sp2-1
primitive stonith-sbd stonith:external/sbd \
	params pcmk_delay_max=30s
property cib-bootstrap-options: \
	have-watchdog=true \
	dc-version="2.0.4+20200616.2deceaa3a-3.6.1-2.0.4+20200616.2deceaa3a" \
	cluster-infrastructure=corosync \
	cluster-name=hacluster \
	stonith-enabled=true \
	stonith-timeout=172
rsc_defaults rsc-options: \
	resource-stickiness=1 \
	migration-threshold=3
op_defaults op-options: \
	timeout=600 \
	record-pending=true
  • corosync.conf in Azure
        token: 30000
        max_messages: 20
        token_retransmits_before_loss_const: 10

@liangxin1300 liangxin1300 force-pushed the 20210706_value_in_azure branch 2 times, most recently from ec5d707 to 9a97a4e Compare July 7, 2021 02:14
crmsh/sbd.py Outdated Show resolved Hide resolved
@liangxin1300 liangxin1300 force-pushed the 20210706_value_in_azure branch 2 times, most recently from 2f678d7 to 4978e21 Compare July 7, 2021 03:01
crmsh/sbd.py Outdated Show resolved Hide resolved
Copy link
Contributor

@zzhou1 zzhou1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please re-split the commits: one to accomodate Azure parameters, another one to deal with TimeoutStartUSec alone. Their relationship is not strong enough to mix them together.

Copy link
Contributor

@arbulu89 arbulu89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liangxin1300 ,
I have added a pair of comment I think you should check.

Besides them, some concerns:

  • I don't see any new UT here, and the change looks enough sensitive to be good to have them
  • I don't personally like this option to hardcode some provider specific values. If for some reason they change, you will need to change the code and package a new version... And more, knowing that we already have projects that do the same (like the habootstrap-formula). Personally, I don't like this way. In the other hand, if you have decided to go this way, it is fine, I'm just talking as a more outsider point of view

crmsh/constants.py Outdated Show resolved Hide resolved
crmsh/sbd.py Outdated Show resolved Hide resolved
@liangxin1300
Copy link
Collaborator Author

@liangxin1300 ,
I have added a pair of comment I think you should check.

Besides them, some concerns:

  • I don't see any new UT here, and the change looks enough sensitive to be good to have them

Yeah, I will add UT codes for this, after most of logic archived agreement between us:)

  • I don't personally like this option to hardcode some provider specific values. If for some reason they change, you will need to change the code and package a new version... And more, knowing that we already have projects that do the same (like the habootstrap-formula). Personally, I don't like this way. In the other hand, if you have decided to go this way, it is fine, I'm just talking as a more outsider point of view

The issue I concerned is, where can we get the data/values, from only one place, that could be shared with all HA stack projects? I remember we had discussions in confluence page before, Dario suggest we make a specific package which contains all profiles data, and then all projects could use that
@zzhou1 @gao-yan Any suggestions here?

@zzhou1
Copy link
Contributor

zzhou1 commented Jul 7, 2021

  • I don't personally like this option to hardcode some provider specific values. If for some reason they change, you will need to change the code and package a new version... And more, knowing that we already have projects that do the same (like the habootstrap-formula). Personally, I don't like this way. In the other hand, if you have decided to go this way, it is fine, I'm just talking as a more outsider point of view

The issue I concerned is, where can we get the data/values, from only one place, that could be shared with all HA stack projects? I remember we had discussions in confluence page before, Dario suggest we make a specific package which contains all profiles data, and then all projects could use that
@zzhou1 @gao-yan Any suggestions here?

Sounds like, some improvement is possible indeed. How about move the profile dictionary and also the matching string (eg. "microsoft-azure", "google-cloud-platform", "amazon-web-services") into crm.conf. Hence, it can be flexible enough for any one of these CSPs to further customize those values by users, if they don't like the default.

@liangxin1300 liangxin1300 force-pushed the 20210706_value_in_azure branch 2 times, most recently from 520948f to 0ccf3a3 Compare July 15, 2021 08:16
@liangxin1300
Copy link
Collaborator Author

@zzhou1 @arbulu89 I put the profile into /etc/crm/profiles.yml
What's your think?

Copy link
Contributor

@zzhou1 zzhou1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to tweak the commit header

from
"Fix: sbd: adjust sbd parameters when in Azure(bsc#1175896)"

to
"Fix: sbd: adjust sbd systemd TimeoutStartSec together with SBD_DELAY_START"

It is the genric logic more than just Azure(bsc#1175896).

etc/profiles.yml Outdated Show resolved Hide resolved
@liangxin1300
Copy link
Collaborator Author

@zzhou1 @gao-yan If in virtualization environment but not in Azure, what the value should be for sbd msgwait? Keep default or same with Azure?

Copy link
Contributor

@zzhou1 zzhou1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to tweak the commit header from
"Fix: bootstrap: bootstrap parameters better cope with Azure for corosync(bsc#1175896)"

to something like,
"Fix: bootstrap: corosync and sbd parameters better cope with the environment profile"

Since the code is generic to both CSP and on-premise.

crmsh/bootstrap.py Show resolved Hide resolved
crmsh/bootstrap.py Show resolved Hide resolved
@zzhou1
Copy link
Contributor

zzhou1 commented Jul 19, 2021

@zzhou1 @gao-yan If in virtualization environment but not in Azure, what the value should be for sbd msgwait? Keep default or same with Azure?

Aha, then, you brought up three folds in my view.

First of all, don't use Azure's default sbd parameters for others.

Secondly, I think you touch the part I call it "user_customized" in profiles.yaml, in particular, those ones not in CSP.

Thirdly, there are several hardcoded defaults values in the code currently. That's fine. They are the last resort if no "profiles.yaml" exisits. However, I feel it worth to bring them up front and write them into "user_customized" explicitly. These will slightly improve the user experience than the black magic only known by reading the code. Well, this is not a high priority improvment, I feel.

@liangxin1300 liangxin1300 force-pushed the 20210706_value_in_azure branch 2 times, most recently from 79cd89c to 6d2ddfa Compare July 20, 2021 03:18
@liangxin1300
Copy link
Collaborator Author

@zzhou1 @gao-yan If in virtualization environment but not in Azure, what the value should be for sbd msgwait? Keep default or same with Azure?

Aha, then, you brought up three folds in my view.

First of all, don't use Azure's default sbd parameters for others.

Secondly, I think you touch the part I call it "user_customized" in profiles.yaml, in particular, those ones not in CSP.

Thirdly, there are several hardcoded defaults values in the code currently. That's fine. They are the last resort if no "profiles.yaml" exisits. However, I feel it worth to bring them up front and write them into "user_customized" explicitly. These will slightly improve the user experience than the black magic only known by reading the code. Well, this is not a high priority improvment, I feel.

Changed as

user_customized:
  totem.token: 5000
  totem.consensus: 6000
  totem.join: 60
  totem.max_messages: 20
  totem.token_retransmits_before_loss_const: 10
  sbd.watchdog_timeout: 5
  sbd.msgwait: 10

microsoft-azure:
  totem.token: 30000
  totem.consensus: 36000
  totem.max_messages: 20
  totem.token_retransmits_before_loss_const: 10
  sbd.watchdog_timeout: 60
  sbd.msgwait: 120

@zzhou1 zzhou1 changed the title Change corosync and sbd parameters when detected in Azure Change corosync and sbd parameters based on the profile environment detected Jul 20, 2021
crmsh/bootstrap.py Show resolved Hide resolved
etc/profiles.yml Outdated Show resolved Hide resolved
crmsh/bootstrap.py Outdated Show resolved Hide resolved
etc/profiles.yml Outdated Show resolved Hide resolved
crmsh/bootstrap.py Outdated Show resolved Hide resolved
@liangxin1300 liangxin1300 force-pushed the 20210706_value_in_azure branch 2 times, most recently from c870994 to bbfefa0 Compare July 28, 2021 06:09
@liangxin1300
Copy link
Collaborator Author

liangxin1300 commented Jul 28, 2021

@gao-yan I added this commit bbfefa0
to make sure diskless sbd could load sbd_watchdog_value from profiles.yml if that value exists

My question is, should I plus self._sbd_watchdog_timeout in function _determine_sbd_watchdog_timeout(

def _determine_sbd_watchdog_timeout(self):
)?

crmsh/sbd.py Outdated Show resolved Hide resolved
crmsh/sbd.py Outdated Show resolved Hide resolved
crmsh/sbd.py Show resolved Hide resolved
crmsh/sbd.py Show resolved Hide resolved
crmsh/sbd.py Outdated Show resolved Hide resolved
crmsh/sbd.py Outdated Show resolved Hide resolved
crmsh/sbd.py Outdated Show resolved Hide resolved
crmsh/sbd.py Outdated Show resolved Hide resolved
crmsh/sbd.py Outdated Show resolved Hide resolved
crmsh/sbd.py Outdated Show resolved Hide resolved
@liangxin1300
Copy link
Collaborator Author

@zzhou1 @gao-yan
Latest changes:

  • Set some default value in SBDManager class
    SBD_WATCHDOG_TIMEOUT_DEFAULT = 5
    SBD_WATCHDOG_TIMEOUT_DEFAULT_S390 = 15
    STONITH_WATCHDOG_TIMEOUT_DEFAULT = SBD_WATCHDOG_TIMEOUT_DEFAULT * 2
    STONITH_WATCHDOG_TIMEOUT_DEFAULT_S390 = SBD_WATCHDOG_TIMEOUT_DEFAULT_S390 * 2
  • Set above default value when init class
        if context.is_s390:
            self._sbd_watchdog_timeout = self.SBD_WATCHDOG_TIMEOUT_DEFAULT_S390
            self._stonith_watchdog_timeout = self.STONITH_WATCHDOG_TIMEOUT_DEFAULT_S390
        else:
            self._sbd_watchdog_timeout = self.SBD_WATCHDOG_TIMEOUT_DEFAULT
            self._stonith_watchdog_timeout = self.STONITH_WATCHDOG_TIMEOUT_DEFAULT
  • In _initialize_sbd function, make sure diskless SBD also has the sbd_watchdog_timeout value from profile
        if "sbd.watchdog_timeout" in self._context.profiles_dict:
            self._sbd_watchdog_timeout = self._context.profiles_dict["sbd.watchdog_timeout"]
            self._correct_sbd_watchdog_timeout_for_s390()
        if self.diskless_sbd:
            return
  • Make sure msgwait always 2*sbd_watchdog_timeout
        sbd_msgwait_default = int(self._sbd_watchdog_timeout) * 2
        sbd_msgwait = sbd_msgwait_default
        if "sbd.msgwait" in self._context.profiles_dict:
            sbd_msgwait = self._context.profiles_dict["sbd.msgwait"]
            if int(sbd_msgwait) < sbd_msgwait_default:
                bootstrap.warn("sbd msgwait is set to {}, it was {}".format(sbd_msgwait_default, sbd_msgwait))
                sbd_msgwait = sbd_msgwait_default
  • Make sure sbd_watchdog_timeout never less than default on s390 platform
    def _correct_sbd_watchdog_timeout_for_s390(self):
        """
        Corrent watchdog timeout if less than s390 default
        """
        if self._context.is_s390 and self._sbd_watchdog_timeout < self.SBD_WATCHDOG_TIMEOUT_DEFAULT_S390:
            bootstrap.warn("sbd_watchdog_timeout is set to {} for s390, it was {}".format(self.SBD_WATCHDOG_TIMEOUT_DEFAULT_S390, self._sbd_watchdog_timeout))
            self._sbd_watchdog_timeout = self.SBD_WATCHDOG_TIMEOUT_DEFAULT_S390

crmsh/sbd.py Show resolved Hide resolved
crmsh/sbd.py Outdated Show resolved Hide resolved
Copy link
Contributor

@zzhou1 zzhou1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good progress, though still some changes needed.

crmsh/sbd.py Outdated Show resolved Hide resolved
@liangxin1300 liangxin1300 force-pushed the 20210706_value_in_azure branch 2 times, most recently from 07c3fc5 to eddd7a8 Compare July 30, 2021 07:48
Copy link
Contributor

@zzhou1 zzhou1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Now, it covers diskless-sbd and qdevice too !

crmsh/sbd.py Outdated Show resolved Hide resolved
crmsh/sbd.py Outdated Show resolved Hide resolved
@liangxin1300 liangxin1300 force-pushed the 20210706_value_in_azure branch 3 times, most recently from 9318b9c to dd438e1 Compare August 2, 2021 06:40
@arbulu89 arbulu89 dismissed their stale review August 2, 2021 07:17

There are other recent reviews

Copy link
Member

@gao-yan gao-yan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks much neater. Thanks for the awesome work, @liangxin1300 !

Only a couple of nitpickings here.

self._sbd_watchdog_timeout = self.SBD_WATCHDOG_TIMEOUT_DEFAULT_WITH_QDEVICE
if self._sbd_watchdog_timeout < self.SBD_WATCHDOG_TIMEOUT_DEFAULT_WITH_QDEVICE:
bootstrap.warn("sbd_watchdog_timeout is set to {} for qdevice, it was {}".format(self.SBD_WATCHDOG_TIMEOUT_DEFAULT_WITH_QDEVICE, self._sbd_watchdog_timeout))
self._sbd_watchdog_timeout = self.SBD_WATCHDOG_TIMEOUT_DEFAULT_WITH_QDEVICE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurred to me, in case that corosync.quorum.device.timeout could be configured in profiles, of course it should check against that rather than the default. But no urgency. Feel free to do the improvement now or some time in the future.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any suggest value for this timeout on different platform? If not clear yet, we could add it in the future

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's similar to the logic in above for the case where sbd is being added with existing qdevice. It's just that now qdevice sync_timeout is according to any configured in profile rather than in corosync.conf.

crmsh/sbd.py Outdated
Check if SBD_DELAY_START is yes
"""
res = SBDManager.get_sbd_value_from_config("SBD_DELAY_START")
return True if res and res == "yes" else False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically sbd determines this based on crm_is_true() function and whether it's explicitly set with a delay value:

https://github.com/ClusterLabs/sbd/blob/master/src/sbd-inquisitor.c#L989-L999

https://github.com/ClusterLabs/pacemaker/blob/master/lib/common/strings.c#L414-L444

Feel free to decide whether we need to make it cover all the cases of True here :-)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed as using return utils.is_boolean_true(res)
Thanks!

profiles.yaml contains corosync and sbd parameters for different platform
…START

* Make sure sbd has suitable msgwait and watchdog timeout when initializing
* Enable SBD_DELAY_START for both diskless and disk based cases when virtual environment detected
* Adjust start timeout for sbd when set SBD_DELAY_START
* Refactors and more useful warnings
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants