
Support --all or specific node to manage cluster and nodes #797

Merged
8 commits, merged Nov 16, 2021

Conversation

Collaborator

@liangxin1300 liangxin1300 commented Apr 1, 2021

Motivation

  • To start and stop the pacemaker/corosync stack at the whole-cluster level
  • To avoid resource migration/node reset during the whole-cluster shutdown procedure

Changes

  • Dev: ui_cluster: Support multiple sub-commands with the --all option or a specific node
    • crm cluster start/stop/restart [--all | <node>... ]
    • crm cluster enable/disable [--all | <node>... ]
    • crm node standby/online [--all | <node>... ]
  • Dev: doc: Consolidate help info for those sub-commands using argparse
  • Stop cluster procedure (see the sketch after this list):
    1. When dlm is running and quorum is lost, set enable_quorum_fencing=0 and enable_quorum_lockspace=0 in the dlm config options, to avoid dlm hanging
    2. Stop pacemaker first, since it ensures the cluster keeps quorum until corosync is stopped
    3. Then stop qdevice, if it is active
    4. Finally, stop corosync
  • Standby/online action for nodes:
    To avoid a race condition with the --all option, merge all standby/online values into a single CIB replace operation
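
For orientation, a minimal sketch of that stop ordering. The helper names (is_dlm_configured, is_quorate, set_dlm_option, service_is_active, stop_service) mirror crmsh's utils module as used elsewhere in this PR, but the function below is illustration only, not the actual implementation:

```python
# Illustration only, not the actual crmsh implementation.
def stop_cluster(node_list):
    # 1. If dlm is configured and quorum is already lost, relax the dlm quorum
    #    options so the shutdown does not hang waiting for quorum.
    if is_dlm_configured(node_list[0]) and not is_quorate(node_list[0]):
        set_dlm_option(peer=node_list[0],
                       enable_quorum_fencing=0, enable_quorum_lockspace=0)
    # 2. Stop pacemaker first; it keeps the cluster quorate until corosync stops.
    stop_service("pacemaker.service", node_list=node_list)
    # 3. Stop qdevice only if it is currently active.
    if service_is_active("corosync-qdevice.service", remote_addr=node_list[0]):
        stop_service("corosync-qdevice.service", node_list=node_list)
    # 4. Finally, stop corosync itself.
    stop_service("corosync.service", node_list=node_list)
```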

Member

@gao-yan gao-yan left a comment

Thanks for the nice work, @liangxin1300!

# enable_quorum_fencing=0, enable_quorum_lockspace=0 for dlm config option
if utils.is_dlm_configured(node_list[0]) and not utils.is_quorate(node_list[0]):
    logger.debug("Quorum is lost; Set enable_quorum_fencing=0 and enable_quorum_lockspace=0 for dlm")
    utils.set_dlm_option(peer=node_list[0], enable_quorum_fencing=0, enable_quorum_lockspace=0)
Member
IIUC, both enable_quorum_fencing and enable_quorum_lockspace need to be disabled for an inquorate dlm to gracefully stop? Could there be any risks?


Yes, both.
In my opinion, there isn't any risk, because this action is only triggered by the command "crm cluster stop".
Before this patch, the cluster (via dlm_controld) refused to do any further action until quorum was regained.
After this patch, the cluster will stop directly; there is no other behavior change.

Member

The biggest concern is: given that this cluster partition is inquorate, there might be a quorate partition still standing... For example, this partition has been split off and is inquorate, but somehow hasn't been fenced. If we are shutting down this cluster partition, and these settings make the inquorate partition able to acquire access to the lockspace and corrupt data, that would be a disaster. We should never sacrifice data integrity, even for a graceful shutdown...


When enable_quorum_lockspace is disabled, dlm lockspace-related operations can keep going even when cluster quorum is lost.


Your comment makes sense to me.

Member
@gao-yan gao-yan Nov 16, 2021

Thinking about it on the safer/more paranoid side: even if the user has confirmed to proceed, allowing this simultaneously on multiple nodes is more like opening Pandora's box... Right after that, during stop, this cluster partition might split apart into even more partitions...

If we go for it, we could ask the user once, but we'd better run this specific procedure in a serialized way (see the sketch below):
set_dlm_option -> stop dlm/pacemaker on only one node at a time, and proceed to another node only after it has succeeded on this one.
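
A minimal sketch of that serialized flow, assuming the same utils-style helpers (set_dlm_option, stop_service) referred to elsewhere in this PR; not actual crmsh code:

```python
# Illustration of the serialized proposal above, not actual crmsh code.
def stop_serialized(node_list):
    for node in node_list:
        # Relax the dlm quorum options on this node only.
        set_dlm_option(peer=node,
                       enable_quorum_fencing=0, enable_quorum_lockspace=0)
        # Stop dlm/pacemaker on this node; an exception here stops the loop,
        # so we never proceed to the next node after a failed stop.
        stop_service("pacemaker.service", remote_addr=node)
```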

Contributor

My previous point concerns the last-standing single-node situation. Given that the node is already inquorate, the proposed behavior for crm cluster stop (i.e. crmsh -> set_dlm_option -> stop dlm/pacemaker) is just a "fencing" operation to protect data integrity, though not necessarily STONITH at the node level with a reboot.

For the situation with multiple inquorate nodes, i.e. multiple partitions, the code here does have a problem in the '--all' case, simply because set_dlm_option only applies to one node. I'm not sure whether it is simple enough to address in this PR, or whether we should open an issue and clarify it in another PR.


set_dlm_option is implemented with "dlm_tool set_config", which only runs on a single node at a time.

Contributor
@zzhou1 zzhou1 Nov 23, 2021

set_dlm_option is implemented with "dlm_tool set_config", which only runs on a single node at a time.

Fine.

Given the above situation, when all nodes are inquorate, what are the suggested internal graceful shutdown steps for "stop --all"? My reading of the code is that it only changes the current local node and does not repeat the same on the other nodes.

The situation is even more interesting when, in theory, in a big cluster some nodes are quorate and some are not. Agreed, it is a transient corner case. What are the suggested internal steps for "stop --all"? @zhaohem

Contributor
@zzhou1 zzhou1 Nov 23, 2021

Maybe crmsh should never operate on a cluster in such a transient state by default, and instead ask the user to answer y/n, or require --force?

for node in node_list[:]:
    if utils.service_is_active("pacemaker.service", remote_addr=node):
        logger.info("Cluster services already started on {}".format(node))
        node_list.remove(node)
Member

pacemaker.service being active doesn't necessarily mean corosync-qdevice.service is active as well, right? Should corosync-qdevice be checked on the full set of nodes as well?

Collaborator Author

When using bootstrap to set up a cluster with qdevice, the qdevice service will be enabled, so after a reboot corosync-qdevice will be started.
@zzhou1 What do you think?

Contributor
@zzhou1 zzhou1 Nov 16, 2021

"node_list.remove(node)" can be wrong in the use case of normal stop/start, eg. when pacemaker.service is up but qdevice.service is not in the mean time. No necessary in the case of reboot.

Collaborator Author

This will be improved in #898.
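
One possible shape for that improvement (a sketch only, not the actual change from #898), reusing the utils.service_is_active / utils.is_qdevice_configured helpers shown in the excerpts in this PR: only skip a node when every required service is already active.

```python
# Sketch only; not the actual change from #898.
for node in node_list[:]:
    pacemaker_up = utils.service_is_active("pacemaker.service", remote_addr=node)
    qdevice_up = (not utils.is_qdevice_configured()
                  or utils.service_is_active("corosync-qdevice.service", remote_addr=node))
    # Skip the node only when everything that should run there is already running.
    if pacemaker_up and qdevice_up:
        logger.info("Cluster services already started on {}".format(node))
        node_list.remove(node)
```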


bootstrap.start_pacemaker(node_list)
if utils.is_qdevice_configured():
    utils.start_service("corosync-qdevice", node_list=node_list)
Member

Although there are no dependencies between corosync-qdevice and pacemaker, it'd be better to start corosync-qdevice before pacemaker.

Collaborator Author

bootstrap.start_pacemaker here starts pacemaker.service; according to the dependency, corosync.service will be started first, so starting qdevice afterwards makes sense. Otherwise, starting qdevice without corosync running would fail :)

Member
@gao-yan gao-yan Nov 16, 2021

Given that corosync-qdevice.service has these defined as well:

Wants=corosync.service
After=corosync.service

I'd expect a plain start of corosync-qdevice.service to resolve the dependency as well.

Contributor

I get the point of moving the corosync-qdevice.service start operation before pacemaker. In theory, there is no harm in having multiple systemctl start calls.

Collaborator Author

Done
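
For illustration, a minimal sketch of what the reordered start might look like, assuming the helpers shown in the excerpt above (utils.is_qdevice_configured, utils.start_service, bootstrap.start_pacemaker); not necessarily the final code:

```python
# Sketch of starting corosync-qdevice before pacemaker (assumed, not the final code).
# systemd's Wants=/After=corosync.service on corosync-qdevice.service pulls in corosync,
# so starting qdevice first no longer fails for lack of a running corosync.
if utils.is_qdevice_configured():
    utils.start_service("corosync-qdevice", node_list=node_list)
bootstrap.start_pacemaker(node_list)
```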

if not node_list:
    return

bootstrap.wait_for_cluster("Waiting for {} online".format(' '.join(node_list)), node_list)
Member

Is there a special purpose in inserting this into the stop procedure and waiting for the listed nodes to be online?

Collaborator Author

Yes.

  • Just after crm cluster start --all, the status of all nodes is UNCLEAN; if crm cluster stop --all is executed at this point, I found that the stop process hangs, and on some nodes the output of crm_mon shows all nodes' status as pending
  • If all nodes are already Online, wait_for_cluster does a check first, and the line Waiting for .... online stays quiet
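
For context, a hypothetical polling loop of the kind wait_for_cluster performs; crmsh's real implementation differs, this only illustrates "wait until the listed nodes are Online" via crm_mon output:

```python
# Hypothetical sketch, not crmsh's actual wait_for_cluster.
import re
import subprocess
import time

def wait_for_online(node_list, timeout=60):
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Poll a one-shot crm_mon and look at the "Online: [ ... ]" list.
        out = subprocess.run(["crm_mon", "-1"], capture_output=True, text=True).stdout
        match = re.search(r"Online:\s*\[\s*(.*?)\s*\]", out)
        online = set(match.group(1).split()) if match else set()
        if set(node_list) <= online:
            return True  # all requested nodes are online
        time.sleep(2)
    return False
```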

Contributor
@zzhou1 zzhou1 Nov 16, 2021

Good point.

I can understand the wait experience in bootstrap. However, this makes me think it's probably not necessary to force the user to wait for crm cluster start, and the same goes for crm cluster stop. I would rather remove that kind of wait for both of them.

It makes sense to let crm cluster stop abort directly if the criteria are not met. That's reasonable error handling, rather than having the sysadmin actually wait in front of the screen.

For scripts that do want to 'wait', before we implement a '--wait' option, the following example steps could help:
crm cluster start
crm wait_for_startup
crm cluster stop

Well, my idea is debatable as a different flavor of user experience. It is not a critical one, I think.

action = context.get_command_name()
utils.cluster_run_cmd("systemctl {} pacemaker.service".format(action), node_list)
if utils.is_qdevice_configured():
    utils.cluster_run_cmd("systemctl {} corosync-qdevice.service".format(action), node_list)
Member

Technically, maybe it should be possible to disable corosync-qdevice.service even if qdevice is not configured?

Collaborator Author

corosync-qdevice.service will not be started if it is not configured, right? So I think checking whether it is configured before doing the action does no harm?

Member

I guess it depends on how is_qdevice_configured() does the check. Of course one could remove qdevice configuration from corosync.conf before stopping the running corosync-qdevice.service...

Contributor

Good point, I think I would naturally expect crm cluster stop to stop corosync-qdevice.service even if corosync.conf has no qdevice.

Collaborator Author
@liangxin1300 liangxin1300 Dec 1, 2021

PR to improve this: #895
It checks whether corosync-qdevice.service is available, rather than whether it is configured.
Stopping qdevice should not have this issue, since I check whether the service is active before stopping it.
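
A hypothetical sketch of such an "available" check, using a plain systemctl query; #895 may implement it differently:

```python
# Hypothetical sketch only; #895 may implement the check differently.
import subprocess

def service_is_available(name):
    # "systemctl list-unit-files <unit>" lists the unit when its unit file is
    # installed, regardless of whether corosync.conf mentions qdevice.
    result = subprocess.run(["systemctl", "list-unit-files", name],
                            capture_output=True, text=True)
    return name in result.stdout

# e.g. gate the enable/disable action on availability instead of configuration:
# if service_is_available("corosync-qdevice.service"):
#     utils.cluster_run_cmd("systemctl {} corosync-qdevice.service".format(action), node_list)
```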

node_list = parse_option_for_nodes(context, *args)
for node in node_list[:]:
    if utils.is_standby(node):
        logger.info("Node %s already standby", node)
Member

What if the node is in standby but with a different lifetime from the one currently specified?

Collaborator Author

Both "reboot" and "forever" lifetime, the status from crm_mon will be standby,
And using crm node online will revert the status for both cases

Member
@gao-yan gao-yan Nov 16, 2021

I mean, for example, if a user intends to put a node into "forever standby" when it is already in "reboot standby"...

Collaborator Author

I mean, for example, if a user intends to put a node into "forever standby" when it is already in "reboot standby"...

Then here this action will be rejected with "already standby" until the node is brought online first.
Do we need to support this change, i.e. changing the lifetime from "reboot" to "forever"?
The current code in production supports that, but in my view it might cause confusion/conflicts, like:

  1. crm node standby reboot
  2. crm node standby
  3. crm node online
    Then from crm_mon we will still see this node in standby status

To make these actions clearer, I think we should reject the above action and say "already standby".

Member

I wouldn't check whether it's already in any kind of standby :-). I'd just set what the user wants, as long as the "online" command can bring it out of any case/combination of standby.

cib_str = xmlutil.xml_tostring(cib)
for node in node_list:
    node_id = utils.get_nodeid_from_name(node)
    cib_str = re.sub(constants.STANDBY_NV_RE.format(node_id=node_id, value="on"), r'\1value="off"\2', cib_str)
Member

Similarly, a more reliable way would be to do XPath queries and modify/delete the nvpairs.
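
For illustration, a minimal sketch of that XPath-based alternative, assuming cib is an lxml element (as suggested by xmlutil.xml_tostring in the excerpt above); the XPath expression for the standby nvpair is an assumption about the CIB layout, not code from this PR:

```python
# Sketch of the XPath-based approach suggested above (illustration only).
for node in node_list:
    node_id = utils.get_nodeid_from_name(node)
    # Find the standby nvpair in this node's attributes and flip it to "off".
    xpath = '//nodes/node[@id="{}"]//nvpair[@name="standby"]'.format(node_id)
    for nvpair in cib.xpath(xpath):
        nvpair.set("value", "off")
cib_str = xmlutil.xml_tostring(cib)
```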

…ific node

    * crm cluster start/stop/restart [--all | <node>... ]
    * crm cluster enable/disable [--all | <node>... ]
    * crm node standby/online [--all | <node>... ]
When dlm is configured and quorum is lost, before stopping the cluster service,
the enable_quorum_fencing=0 and enable_quorum_lockspace=0 options should be set