
Support --all or specific node to manage cluster and nodes #797

Merged
8 commits, merged Nov 16, 2021

Conversation

Collaborator

@liangxin1300 liangxin1300 commented Apr 1, 2021

Motivation

  • To start and stop the pacemaker/corosync stack at the whole-cluster level
  • To avoid resource migration/node reset during the whole-cluster shutdown procedure

Changes

  • Dev: ui_cluster: Support multiple sub-commands with the --all option or a specific node
    • crm cluster start/stop/restart [--all | <node>... ]
    • crm cluster enable/disable [--all | <node>... ]
    • crm node standby/online [--all | <node>... ]
  • Dev: doc: Consolidate help info for those sub-commands using argparse
  • Stop cluster procedure (see the sketch after this list):
    1. When dlm is running and quorum is lost, set enable_quorum_fencing=0 and enable_quorum_lockspace=0 in the dlm config options, to avoid dlm hanging
    2. Stop pacemaker first, since it ensures the cluster keeps quorum until corosync is stopped
    3. Then stop qdevice, if it is active
    4. Finally, stop corosync
  • Standby/online action for nodes:
    To avoid a race condition with the --all option, merge all standby/online values into a single CIB replace operation
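
For orientation, a minimal sketch of that stop ordering. The helper names (is_dlm_configured, is_quorate, set_dlm_option, service_is_active, stop_service) mirror crmsh's utils module as used elsewhere in this PR, but the function below is illustration only, not the actual implementation:

```python
# Illustration only, not the actual crmsh implementation.
def stop_cluster(node_list):
    # 1. If dlm is configured and quorum is already lost, relax the dlm quorum
    #    options so the shutdown does not hang waiting for quorum.
    if is_dlm_configured(node_list[0]) and not is_quorate(node_list[0]):
        set_dlm_option(peer=node_list[0],
                       enable_quorum_fencing=0, enable_quorum_lockspace=0)
    # 2. Stop pacemaker first; it keeps the cluster quorate until corosync stops.
    stop_service("pacemaker.service", node_list=node_list)
    # 3. Stop qdevice only if it is currently active.
    if service_is_active("corosync-qdevice.service", remote_addr=node_list[0]):
        stop_service("corosync-qdevice.service", node_list=node_list)
    # 4. Finally, stop corosync itself.
    stop_service("corosync.service", node_list=node_list)
```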

Member

@gao-yan gao-yan left a comment

Thanks for the nice work, @liangxin1300!

# enable_quorum_fencing=0, enable_quorum_lockspace=0 for dlm config option
if utils.is_dlm_configured(node_list[0]) and not utils.is_quorate(node_list[0]):
    logger.debug("Quorum is lost; Set enable_quorum_fencing=0 and enable_quorum_lockspace=0 for dlm")
    utils.set_dlm_option(peer=node_list[0], enable_quorum_fencing=0, enable_quorum_lockspace=0)
Member
IIUC, both enable_quorum_fencing and enable_quorum_lockspace need to be disabled for an inquorate dlm to gracefully stop? Could there be any risks?


Yes, both.
In my opinion, there isn't any risk, because this action is only triggered by the command "crm cluster stop".
Before this patch, the cluster (via dlm_controld) refused to do any further action until quorum was regained.
After this patch, the cluster will stop directly; there is no other behavior change.

Member

The biggest concern is: given that this cluster partition is inquorate, there might be a quorate partition still standing... For example, this partition has been split off and is inquorate, but somehow hasn't been fenced. If we are shutting down this cluster partition, and these settings make the inquorate partition able to acquire access to the lockspace and corrupt data, that would be a disaster. We should never sacrifice data integrity, even for a graceful shutdown...


When enable_quorum_lockspace is disabled, dlm lockspace-related operations can keep going even when cluster quorum is lost.


Your comment makes sense to me.

Member
@gao-yan gao-yan Nov 16, 2021

Thinking about it on the safer/more paranoid side: even if the user has confirmed to proceed, allowing this simultaneously on multiple nodes is more like opening Pandora's box... Right after that, during stop, this cluster partition might split apart into even more partitions...

If we go for it, we could ask the user once, but we'd better run this specific procedure in a serialized way (see the sketch below):
set_dlm_option -> stop dlm/pacemaker on only one node at a time, and proceed to another node only after it has succeeded on this one.
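
A minimal sketch of that serialized flow, assuming the same utils-style helpers (set_dlm_option, stop_service) referred to elsewhere in this PR; not actual crmsh code:

```python
# Illustration of the serialized proposal above, not actual crmsh code.
def stop_serialized(node_list):
    for node in node_list:
        # Relax the dlm quorum options on this node only.
        set_dlm_option(peer=node,
                       enable_quorum_fencing=0, enable_quorum_lockspace=0)
        # Stop dlm/pacemaker on this node; an exception here stops the loop,
        # so we never proceed to the next node after a failed stop.
        stop_service("pacemaker.service", remote_addr=node)
```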

Contributor

My previous point concerns the last-standing single-node situation. Given that the node is already inquorate, the proposed behavior for crm cluster stop (i.e. crmsh -> set_dlm_option -> stop dlm/pacemaker) is just a "fencing" operation to protect data integrity, though not necessarily STONITH at the node level with a reboot.

For the situation with multiple inquorate nodes, i.e. multiple partitions, the code here does have a problem in the '--all' case, simply because set_dlm_option only applies to one node. I'm not sure whether it is simple enough to address in this PR, or whether we should open an issue and clarify it in another PR.


set_dlm_option is implemented with "dlm_tool set_config", which only runs on a single node at a time.

Contributor
@zzhou1 zzhou1 Nov 23, 2021

set_dlm_option is implemented with "dlm_tool set_config", which only runs on a single node at a time.

Fine.

Given the above situation, when all nodes are inquorate, what are the suggested internal graceful shutdown steps for "stop --all"? My reading of the code is that it only changes the current local node and does not repeat the same on the other nodes.

The situation is even more interesting when, in theory, in a big cluster some nodes are quorate and some are not. Agreed, it is a transient corner case. What are the suggested internal steps for "stop --all"? @zhaohem

Contributor
@zzhou1 zzhou1 Nov 23, 2021

Maybe crmsh should never operate on a cluster in such a transient state by default, and instead ask the user to answer y/n, or require --force?

for node in node_list[:]:
    if utils.service_is_active("pacemaker.service", remote_addr=node):
        logger.info("Cluster services already started on {}".format(node))
        node_list.remove(node)
Member

pacemaker.service being active doesn't necessarily mean corosync-qdevice.service is active as well, right? Should corosync-qdevice be checked on the full set of nodes as well?

Collaborator Author

When using bootstrap to set up a cluster with qdevice, the qdevice service will be enabled, so after a reboot corosync-qdevice will be started.
@zzhou1 What do you think?

Contributor
@zzhou1 zzhou1 Nov 16, 2021

"node_list.remove(node)" can be wrong in the use case of normal stop/start, eg. when pacemaker.service is up but qdevice.service is not in the mean time. No necessary in the case of reboot.

Collaborator Author

This will be improved in #898.
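
One possible shape for that improvement (a sketch only, not the actual change from #898), reusing the utils.service_is_active / utils.is_qdevice_configured helpers shown in the excerpts in this PR: only skip a node when every required service is already active.

```python
# Sketch only; not the actual change from #898.
for node in node_list[:]:
    pacemaker_up = utils.service_is_active("pacemaker.service", remote_addr=node)
    qdevice_up = (not utils.is_qdevice_configured()
                  or utils.service_is_active("corosync-qdevice.service", remote_addr=node))
    # Skip the node only when everything that should run there is already running.
    if pacemaker_up and qdevice_up:
        logger.info("Cluster services already started on {}".format(node))
        node_list.remove(node)
```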


bootstrap.start_pacemaker(node_list)
if utils.is_qdevice_configured():
    utils.start_service("corosync-qdevice", node_list=node_list)
Member

Although there are no dependencies between corosync-qdevice and pacemaker, it'd be better to start corosync-qdevice before pacemaker.

Collaborator Author

bootstrap.start_pacemaker here starts pacemaker.service; according to the dependency, corosync.service will be started first, so starting qdevice afterwards makes sense. Otherwise, starting qdevice without corosync running would fail :)

Member
@gao-yan gao-yan Nov 16, 2021

Given that corosync-qdevice.service has these defined as well:

Wants=corosync.service
After=corosync.service

I'd expect a plain start of corosync-qdevice.service to resolve the dependency as well.

Contributor

I get the point of moving the corosync-qdevice.service start operation before pacemaker. In theory, there is no harm in having multiple systemctl start calls.

Collaborator Author

Done
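
For illustration, a minimal sketch of what the reordered start might look like, assuming the helpers shown in the excerpt above (utils.is_qdevice_configured, utils.start_service, bootstrap.start_pacemaker); not necessarily the final code:

```python
# Sketch of starting corosync-qdevice before pacemaker (assumed, not the final code).
# systemd's Wants=/After=corosync.service on corosync-qdevice.service pulls in corosync,
# so starting qdevice first no longer fails for lack of a running corosync.
if utils.is_qdevice_configured():
    utils.start_service("corosync-qdevice", node_list=node_list)
bootstrap.start_pacemaker(node_list)
```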

if not node_list:
    return

bootstrap.wait_for_cluster("Waiting for {} online".format(' '.join(node_list)), node_list)
Member

Is there a special purpose in inserting this into the stop procedure and waiting for the listed nodes to be online?

Collaborator Author

Yes.

  • Just after crm cluster start --all, the status of all nodes is UNCLEAN; if crm cluster stop --all is executed at this point, I found that the stop process hangs, and on some nodes the output of crm_mon shows all nodes' status as pending
  • If all nodes are already Online, wait_for_cluster does a check first, and the line Waiting for .... online stays quiet
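
For context, a hypothetical polling loop of the kind wait_for_cluster performs; crmsh's real implementation differs, this only illustrates "wait until the listed nodes are Online" via crm_mon output:

```python
# Hypothetical sketch, not crmsh's actual wait_for_cluster.
import re
import subprocess
import time

def wait_for_online(node_list, timeout=60):
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Poll a one-shot crm_mon and look at the "Online: [ ... ]" list.
        out = subprocess.run(["crm_mon", "-1"], capture_output=True, text=True).stdout
        match = re.search(r"Online:\s*\[\s*(.*?)\s*\]", out)
        online = set(match.group(1).split()) if match else set()
        if set(node_list) <= online:
            return True  # all requested nodes are online
        time.sleep(2)
    return False
```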

Contributor
@zzhou1 zzhou1 Nov 16, 2021

Good point.

I can understand the wait experience in bootstrap. However, this makes me think it's probably not necessary to force the user to wait for crm cluster start, and the same goes for crm cluster stop. I would rather remove that kind of wait for both of them.

It makes sense to let crm cluster stop abort directly if the criteria are not met. That's reasonable error handling, rather than having the sysadmin actually wait in front of the screen.

For scripts that do want to 'wait', before we implement a '--wait' option, the following example steps could help:
crm cluster start
crm wait_for_startup
crm cluster stop

Well, my idea is debatable as a different flavor of user experience. It is not a critical one, I think.

action = context.get_command_name()
utils.cluster_run_cmd("systemctl {} pacemaker.service".format(action), node_list)
if utils.is_qdevice_configured():
    utils.cluster_run_cmd("systemctl {} corosync-qdevice.service".format(action), node_list)
Member

Technically, maybe it should be possible to disable corosync-qdevice.service even if qdevice is not configured?

Collaborator Author

corosync-qdevice.service will not be started if it is not configured, right? So I think checking whether it is configured before doing the action does no harm?

Member

I guess it depends on how is_qdevice_configured() does the check. Of course one could remove qdevice configuration from corosync.conf before stopping the running corosync-qdevice.service...

Contributor

Good point, I think I would naturally expect crm cluster stop to stop corosync-qdevice.service even if corosync.conf has no qdevice.

Collaborator Author
@liangxin1300 liangxin1300 Dec 1, 2021

PR to improve this: #895
It checks whether corosync-qdevice.service is available, rather than whether it is configured.
Stopping qdevice should not have this issue, since I check whether the service is active before stopping it.
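
A hypothetical sketch of such an "available" check, using a plain systemctl query; #895 may implement it differently:

```python
# Hypothetical sketch only; #895 may implement the check differently.
import subprocess

def service_is_available(name):
    # "systemctl list-unit-files <unit>" lists the unit when its unit file is
    # installed, regardless of whether corosync.conf mentions qdevice.
    result = subprocess.run(["systemctl", "list-unit-files", name],
                            capture_output=True, text=True)
    return name in result.stdout

# e.g. gate the enable/disable action on availability instead of configuration:
# if service_is_available("corosync-qdevice.service"):
#     utils.cluster_run_cmd("systemctl {} corosync-qdevice.service".format(action), node_list)
```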

node_list = parse_option_for_nodes(context, *args)
for node in node_list[:]:
    if utils.is_standby(node):
        logger.info("Node %s already standby", node)
Member

What if the node is in standby but with a different lifetime from the one currently specified?

Collaborator Author

Both "reboot" and "forever" lifetime, the status from crm_mon will be standby,
And using crm node online will revert the status for both cases

Member
@gao-yan gao-yan Nov 16, 2021

I mean, for example, if a user intends to put a node into "forever standby" when it is already in "reboot standby"...

Collaborator Author

I mean, for example, if a user intends to put a node into "forever standby" when it is already in "reboot standby"...

Then here this action will be rejected with "already standby" until the node is brought online first.
Do we need to support this change, i.e. changing the lifetime from "reboot" to "forever"?
The current code in production supports that, but in my view it might cause confusion/conflicts, like:

  1. crm node standby reboot
  2. crm node standby
  3. crm node online
    Then from crm_mon we will still see this node in standby status

To make these actions clearer, I think we should reject the above action and say "already standby".

Member

I wouldn't check whether it's already in any kind of standby :-). I'd just set what the user wants, as long as the "online" command can bring it out of any case/combination of standby.

cib_str = xmlutil.xml_tostring(cib)
for node in node_list:
    node_id = utils.get_nodeid_from_name(node)
    cib_str = re.sub(constants.STANDBY_NV_RE.format(node_id=node_id, value="on"), r'\1value="off"\2', cib_str)
Member

Similarly, a more reliable way would be to do XPath queries and modify/delete the nvpairs.
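
For illustration, a minimal sketch of that XPath-based alternative, assuming cib is an lxml element (as suggested by xmlutil.xml_tostring in the excerpt above); the XPath expression for the standby nvpair is an assumption about the CIB layout, not code from this PR:

```python
# Sketch of the XPath-based approach suggested above (illustration only).
for node in node_list:
    node_id = utils.get_nodeid_from_name(node)
    # Find the standby nvpair in this node's attributes and flip it to "off".
    xpath = '//nodes/node[@id="{}"]//nvpair[@name="standby"]'.format(node_id)
    for nvpair in cib.xpath(xpath):
        nvpair.set("value", "off")
cib_str = xmlutil.xml_tostring(cib)
```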

…ific node

    * crm cluster start/stop/restart [--all | <node>... ]
    * crm cluster enable/disable [--all | <node>... ]
    * crm node standby/online [--all | <node>... ]
When dlm is configured and quorum is lost, before stopping the cluster service,
the enable_quorum_fencing=0 and enable_quorum_lockspace=0 options should be set