
Fix: fencer: Refresh CIB devices on change to nodes section #3855


Status: Open · wants to merge 5 commits into main

Conversation

nrwahl2 (Contributor) commented Mar 28, 2025

This fixes a regression introduced by bf7ffcd. As of that commit, the fake local node is created after all resources have been unpacked. So it doesn't get added to resources' allowed_nodes tables.

This prevents registration of fencing devices when the fencer receives a CIB diff that removes the local node. For example, the user may have replaced the CIB with a boilerplate configuration that has an empty nodes section. Registering a fencing device requires that the local node be in the resource's allowed nodes table.
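To make the failure mode concrete, here is a minimal sketch of that precondition, using GLib as Pacemaker does. The `resource_t` layout and `local_node_allowed()` are hypothetical stand-ins for this demo, not the fencer's real types:

```c
/* Minimal sketch of the registration precondition described above.
 * Build: gcc demo.c $(pkg-config --cflags --libs glib-2.0)
 */
#include <glib.h>
#include <stdio.h>

typedef struct {
    const char *id;
    GHashTable *allowed_nodes;  /* node name -> node object */
} resource_t;

static gboolean
local_node_allowed(const resource_t *rsc, const char *local_node_name)
{
    return g_hash_table_lookup(rsc->allowed_nodes, local_node_name) != NULL;
}

int
main(void)
{
    resource_t rsc = {
        .id = "fence1-scsi",
        .allowed_nodes = g_hash_table_new(g_str_hash, g_str_equal),
    };

    /* The fake local node was never added to the table, so device
     * registration would be skipped here. */
    printf("allowed before: %d\n", local_node_allowed(&rsc, "rh94-dev01"));

    g_hash_table_insert(rsc.allowed_nodes, (gpointer) "rh94-dev01",
                        (gpointer) "node-object");
    printf("allowed after:  %d\n", local_node_allowed(&rsc, "rh94-dev01"));

    g_hash_table_destroy(rsc.allowed_nodes);
    return 0;
}
```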

One option would be to add the fake local node to all resources' allowed nodes tables immediately after creating it. However, it shouldn't necessarily be an allowed node for all resources. For example, if symmetric-cluster="false", then a node should not be placed in a resource's allowed nodes table by default.

It's difficult to ensure correctness of the allowed nodes tables when a fake local node is involved. It may involve repeated code or a fragile and confusing dependence on the order of unpacking.

Since the fake local node is a hack in the first place, it seems better to avoid using it where possible. Currently the only code that even sets the local_node_name member of scheduler->priv is in:

  • the fencer
  • crm_mon when showing the "times" section

This commit works as follows. If the fencer receives a CIB diff notification that contains the nodes section, it triggers a full refresh of fencing device registrations. In our example above, where a user has replaced the CIB with a configuration that has an empty nodes section, this means all fencing device registrations will be removed.

However, the controller also has a CIB diff notification callback: do_cib_updated(). The controller's callback repopulates the nodes section with up-to-date information from the cluster layer (or its node cache) if it finds that an untrusted client like cibadmin has modified the nodes section. Then it updates the CIB accordingly.

The fencer will receive this updated CIB and refresh fencing device registrations again, re-registering the fencing devices that were just removed.
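For illustration, a standalone sketch (using libxml2, not the fencer's actual implementation) of deciding whether a format-2 patchset touches the nodes section might look like this:

```c
/* Standalone sketch: decide whether a CIB format-2 patchset touches the
 * nodes section. Build: gcc sketch.c $(xml2-config --cflags --libs)
 */
#include <stdio.h>
#include <string.h>
#include <libxml/parser.h>
#include <libxml/xpath.h>

static int
diff_touches_nodes(const char *patchset_xml)
{
    int touches = 0;
    xmlDocPtr doc = xmlReadMemory(patchset_xml, (int) strlen(patchset_xml),
                                  "patchset.xml", NULL, 0);

    if (doc == NULL) {
        return 0;
    }
    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);

    /* Format-2 patchsets record each change's target in a "path" attribute */
    xmlXPathObjectPtr obj = xmlXPathEvalExpression(
        (const xmlChar *) "//change[contains(@path, '/configuration/nodes')]",
        ctx);

    if ((obj != NULL) && (obj->nodesetval != NULL)
        && (obj->nodesetval->nodeNr > 0)) {
        touches = 1;   /* would trigger a full refresh of device registrations */
    }
    xmlXPathFreeObject(obj);
    xmlXPathFreeContext(ctx);
    xmlFreeDoc(doc);
    return touches;
}

int
main(void)
{
    const char *diff =
        "<diff format='2'>"
        "  <change operation='delete'"
        "          path='/cib/configuration/nodes/node[@id=\"1\"]'/>"
        "</diff>";

    printf("refresh devices: %s\n", diff_touches_nodes(diff) ? "yes" : "no");
    return 0;
}
```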

Note that in the common case, we don't do any of this wasted work. The "remove and then re-register" sequence should happen only when a user has modified the CIB in a sloppy way (for example, by deleting nodes from the CIB's nodes section that have not been removed from the cluster).

In short, it seems better to rely on the controller's maintenance of the CIB's node list than to rely on a "fake local node" hack in the scheduler.

See the following pull requests from Hideo Yamauchi and their discussions: ClusterLabs#3849 and ClusterLabs#3852.

Thanks to Hideo for the report and finding the cause.

nrwahl2 (Contributor, Author) commented Mar 28, 2025

@HideoYamauchi Can you test this if you have time and see if it fixes your issue? I made sure it compiles, but I haven't had time to set up a test for your issue yet, and I have to step away for a while.

HideoYamauchi (Contributor) commented:

@nrwahl2

I'll check.
Please wait a moment.

Best Regards,
Hideo Yamauchi.

HideoYamauchi (Contributor) commented:

@nrwahl2

Great!

With this fix, we have confirmed that fenced correctly registers STONITH devices both when there is a node section in cib.xml and when there is not.

- If cib.xml has a node section:
[root@rh94-dev01 pacemaker]# crm_mon -rfA1
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: rh94-dev02 (version 3.0.0-6d6734a86b) - partition with quorum
  * Last updated: Fri Mar 28 18:32:46 2025 on rh94-dev01
  * Last change:  Fri Mar 28 18:27:52 2025 by hacluster via hacluster on rh94-dev01
  * 2 nodes configured
  * 11 resource instances configured

Node List:
  * Online: [ rh94-dev01 rh94-dev02 ]

Full List of Resources:
  * Resource Group: pgsql-group:
    * filesystem1       (ocf:heartbeat:Dummy):   Started rh94-dev01
    * filesystem2       (ocf:heartbeat:Dummy):   Started rh94-dev01
    * filesystem3       (ocf:heartbeat:Dummy):   Started rh94-dev01
    * ipaddr    (ocf:heartbeat:Dummy):   Started rh94-dev01
    * pgsql     (ocf:heartbeat:Dummy):   Started rh94-dev01
  * Clone Set: ping-clone [ping]:
    * Started: [ rh94-dev01 rh94-dev02 ]
  * Clone Set: storage-mon-clone [storage-mon]:
    * Started: [ rh94-dev01 rh94-dev02 ]
  * fence1-scsi (stonith:fence_scsi):    Started rh94-dev01
  * fence2-scsi (stonith:fence_scsi):    Started rh94-dev02

Node Attributes:
  * Node: rh94-dev01:
    * ping-status                       : 1         
  * Node: rh94-dev02:
    * ping-status                       : 1         

Migration Summary:
[root@rh94-dev01 pacemaker]# 
[root@rh94-dev01 pacemaker]# egrep "Ignore|Creating" /var/log/pacemaker/pacemaker.log
(no output)

- If cib.xml does not have a node section:
[root@rh94-dev01 pacemaker]# crm_mon -rfA1
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: rh94-dev01 (version 3.0.0-6d6734a86b) - partition with quorum
  * Last updated: Fri Mar 28 18:37:18 2025 on rh94-dev01
  * Last change:  Fri Mar 28 18:36:23 2025 by hacluster via hacluster on rh94-dev01
  * 2 nodes configured
  * 11 resource instances configured

Node List:
  * Online: [ rh94-dev01 rh94-dev02 ]

Full List of Resources:
  * Resource Group: pgsql-group:
    * filesystem1       (ocf:heartbeat:Dummy):   Started rh94-dev01
    * filesystem2       (ocf:heartbeat:Dummy):   Started rh94-dev01
    * filesystem3       (ocf:heartbeat:Dummy):   Started rh94-dev01
    * ipaddr    (ocf:heartbeat:Dummy):   Started rh94-dev01
    * pgsql     (ocf:heartbeat:Dummy):   Started rh94-dev01
  * Clone Set: ping-clone [ping]:
    * Started: [ rh94-dev01 rh94-dev02 ]
  * Clone Set: storage-mon-clone [storage-mon]:
    * Started: [ rh94-dev01 rh94-dev02 ]
  * fence1-scsi (stonith:fence_scsi):    Started rh94-dev01
  * fence2-scsi (stonith:fence_scsi):    Started rh94-dev02

Node Attributes:
  * Node: rh94-dev01:
    * ping-status                       : 1         
  * Node: rh94-dev02:
    * ping-status                       : 1         

Migration Summary:
[root@rh94-dev01 pacemaker]# 

[root@rh94-dev01 pacemaker]# egrep "Ignore|Creating" /var/log/pacemaker/pacemaker.log
Mar 28 18:36:01.645 rh94-dev01 pacemaker-fenced    [540611] (cluster_status)    info: Creating a fake local node for rh94-dev01
Mar 28 18:36:03.300 rh94-dev01 pacemaker-fenced    [540611] (cluster_status)    info: Creating a fake local node for rh94-dev01
[root@rh94-dev01 pacemaker]# 

In both cases, STONITH with fence_scsi also worked correctly.

Please merge this fix.

Many thanks,
Hideo Yamauchi.

nrwahl2 (Contributor, Author) commented Mar 28, 2025

> With this fix, we have confirmed that fenced correctly registers STONITH devices both when there is a node section in cib.xml and when there is not.

Excellent! I'm about to push an edit that adds a sanity check ensuring that the local node is not already part of a resource's allowed nodes table. (I think that should be impossible, but the scheduler is complex :) ).

> Please merge this fix.

I'll let the new CI build finish and then we'll merge it during NA hours if @clumens and @wenningerk have no objection.

nrwahl2 requested a review from clumens March 28, 2025 10:10
nrwahl2 (Contributor, Author) commented Mar 28, 2025

@clumens We'll need to make sure this gets into RHEL 10. If we can get it into 10.0 without a z-stream, that would be ideal.

nrwahl2 (Contributor, Author) commented Mar 28, 2025

I'm going to bed soon, but I am starting to wonder if this approach is still too eager. If symmetric-cluster is false, then adding the fake local node to all resources' allowed nodes tables might allow resources to run in places where they should not.

If that concern has merit, then another possibility might be:

  • Create the fake local node at the very beginning of cluster_status()
  • Set a node flag that indicates this is a fake/temporary node
  • Later, if we want to create the node during unpack, look up and "overwrite" the fake node. (By this I mean overwrite any fields that may differ between the fake node and the unpacked node, or remove the fake node and create a new one. Removing and recreating may get a bit complicated, since there may already be copies of the node in resources' allowed nodes tables.) My current idea is to write a "reset node" helper function, which pe_create_node() will call if it finds that the node already exists and has the pcmk__node_fake flag set.

I almost never think about clusters with symmetric-cluster=false, because they're rare in practice. However, if it is set to false, then we'd probably want to rely on the rest of the unpacking logic to see whether each resource is allowed (via constraints) to run on each node :(

BTW, I'm looking at pe__unpack_resource(), which calls resource_location() if symmetric-cluster is true or the resource represents a guest node. resource_location() in turn adds all nodes to the resource's allowed nodes table.
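A rough standalone sketch of that "reset node" idea follows (GLib). The `node_t` layout, the `NODE_FAKE` flag (standing in for the proposed pcmk__node_fake), and `create_or_reset_node()` are all hypothetical:

```c
/* Sketch of "create early, flag as fake, overwrite during unpack".
 * Build: gcc demo.c $(pkg-config --cflags --libs glib-2.0)
 */
#include <glib.h>
#include <stdio.h>

#define NODE_FAKE (1U << 0)

typedef struct {
    char *uname;
    char *id;
    unsigned int flags;
} node_t;

static node_t *
create_or_reset_node(GHashTable *nodes, const char *uname, const char *id,
                     unsigned int flags)
{
    node_t *node = g_hash_table_lookup(nodes, uname);

    if ((node != NULL) && (node->flags & NODE_FAKE)) {
        /* Reset in place: keep the existing allocation so that copies or
         * pointers already handed out stay valid, and overwrite only the
         * fields that may differ between the fake node and the real one. */
        g_free(node->id);
        node->id = g_strdup(id);
        node->flags = flags;

    } else if (node == NULL) {
        node = g_new0(node_t, 1);
        node->uname = g_strdup(uname);
        node->id = g_strdup(id);
        node->flags = flags;
        g_hash_table_insert(nodes, node->uname, node);
    }
    return node;
}

int
main(void)
{
    GHashTable *nodes = g_hash_table_new(g_str_hash, g_str_equal);

    /* Fake local node created at the start of cluster_status() ... */
    create_or_reset_node(nodes, "rh94-dev01", "fake", NODE_FAKE);

    /* ... later overwritten while unpacking the CIB's nodes section */
    node_t *n = create_or_reset_node(nodes, "rh94-dev01", "1", 0);

    printf("%s: id=%s fake=%u\n", n->uname, n->id, n->flags & NODE_FAKE);
    g_hash_table_destroy(nodes);   /* demo only; leaks the node members */
    return 0;
}
```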

clumens (Contributor) commented Mar 28, 2025

Is this something we can add a scheduler regression test for?

nrwahl2 (Contributor, Author) commented Mar 28, 2025

> Is this something we can add a scheduler regression test for?

I don't think so.

  1. The observed symptom required the fencer.
  2. Even if we try to set up a contrived test to make sure the local node gets added to resources' allowed nodes if appropriate, AFAIK there's no way to set local_node_name via an input CIB.

nrwahl2 added 4 commits March 30, 2025 13:53
* Remove some nesting.
* Use const where possible.
* Improve variable names and scope.
* Use a macro for repeated string literal.
* Use sizeof() - 1 instead of strlen() for string literal.
* Remove unnecessary escaping of single quote within double quotes.
* Copy only what is needed of the xpath string, if anything.
* Log assertion on malformed patchset, since we created it in
  cib_perform_op().
* Add a comment to clarify why we're (apparently) doing all this in the
  first place.

Keep the watchdog_device_update() comment as-is because I haven't looked
into it.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
An ID is required when schema validation is enabled anyway, and this
allows us to be sure rsc_id will be non-NULL within the block. It also
allows us to shorten some strstr() searches by starting later in the
string, not that this will matter in practice.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
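As a small aside, here is a standalone illustration of the sizeof() - 1 idiom and the "shorten strstr() searches by starting later" point from the commit messages above. RSC_PREFIX is a made-up macro name for this demo:

```c
/* For a string literal, sizeof counts the bytes including the trailing NUL,
 * so sizeof(X) - 1 is a compile-time strlen(X). */
#include <stdio.h>
#include <string.h>

#define RSC_PREFIX "/cib/configuration/resources"

int
main(void)
{
    const char *xpath = RSC_PREFIX "/primitive[@id='fence1-scsi']";

    /* Skip past the known prefix without a runtime strlen() call, so later
     * strstr() searches start deeper in the string. */
    const char *rest = xpath + sizeof(RSC_PREFIX) - 1;

    printf("rest: %s\n", rest);
    printf("found primitive: %s\n", strstr(rest, "primitive") ? "yes" : "no");
    return 0;
}
```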
The sole caller has already done the checking; we don't need event, and
we don't need msg except as a way to get the patchset.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
If the child of a primitive element representing a fencing device was
deleted, and we didn't catch it earlier as instance_attributes or
meta_attributes, then previously we would remove the fencing device. We
should remove the device only if the primitive itself was deleted.

By inspection, it looks like this would unregister a fencing device if
one of its operations (op elements) were deleted.

If we find that only a child of the primitive was deleted, fall through
to the next if block -- that is, "continue" only if the primitive itself
was deleted and we call stonith_device_remove().

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
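By way of illustration, the distinction could be drawn from the deletion's xpath roughly like this (a standalone sketch only; the real fencer code differs):

```c
/* Unregister a fencing device only when the <primitive> itself was deleted,
 * not merely one of its children (instance_attributes, meta_attributes,
 * op, ...). Assumes an ID predicate, which schema validation requires. */
#include <stdio.h>
#include <string.h>

static int
primitive_itself_deleted(const char *xpath)
{
    const char *prim = strstr(xpath, "/primitive[");

    if (prim == NULL) {
        return 0;                  /* no primitive in the path at all */
    }
    const char *end = strchr(prim + 1, ']');

    if (end == NULL) {
        return 0;                  /* malformed; treat as a child change */
    }
    /* If anything follows the primitive's closing bracket, a descendant
     * was deleted rather than the primitive itself. */
    return end[1] == '\0';
}

int
main(void)
{
    const char *a = "/cib/configuration/resources/primitive[@id='f1']";
    const char *b = "/cib/configuration/resources/primitive[@id='f1']"
                    "/operations/op[@id='f1-monitor-10s']";

    printf("%s -> %d\n", a, primitive_itself_deleted(a));  /* 1: remove */
    printf("%s -> %d\n", b, primitive_itself_deleted(b));  /* 0: keep   */
    return 0;
}
```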
nrwahl2 changed the title from "Fix: libpe_status: Add fake local node to all resources' allowed_nodes" to "Fix: fencer: Refresh CIB devices on change to nodes section" on Mar 30, 2025
nrwahl2 (Contributor, Author) commented Mar 30, 2025

I hope this new version works, and that temporarily unregistering devices doesn't break anything. As mentioned in the commit message, I'm afraid to rely on the allowed nodes table being correct when a fake local node is involved. I'd prefer to use the cluster layer (via the controller) as the source of truth.

Fix: fencer: Refresh CIB devices on change to nodes section

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
nrwahl2 (Contributor, Author) commented Mar 31, 2025

Pushed a new version with a FIXME comment. So I'm still not totally happy with this :(

HideoYamauchi (Contributor) commented:

@nrwahl2

I have not yet confirmed that the fixes made after my earlier check work.
I will check that they work after this PR is finally merged.

However, there is something that concerns me.
schedulerd and fenced are different processes.
It's fine for fenced to calculate the placement based on the re-notification, but if the unfencing notification from schedulerd comes first, won't unfencing fail in the same way?

I haven't looked at the contents of your fix PR in detail, but I thought I'd comment just to be sure.

Best Regards,
Hideo Yamauchi.

clumens (Contributor) commented Apr 4, 2025

I think this is okay, but I've never even heard of the fake local node thing until this PR came up. If @HideoYamauchi confirms it works for him, and ctslab testing doesn't show any other problems, I am okay with merging this.

HideoYamauchi (Contributor) commented:

Hi @clumens @nrwahl2

> I think this is okay, but I've never even heard of the fake local node thing until this PR came up. If @HideoYamauchi confirms it works for him, and ctslab testing doesn't show any other problems, I am okay with merging this.

Okay!
Once the final PR is complete, I will run the tests again.

Many thanks,
Hideo Yamauchi.

nrwahl2 (Contributor, Author) commented Apr 7, 2025

As a quick status update, these two registration flags (cib_registered and api_registered) have caused me a lot of frustration.

For some reason, in 2009 we thought it was a good idea to allow registering fence devices without adding them to the CIB. Over the years, code has been added to handle corner cases stemming from that. For example, what should the fencer do when it finds a new fence device in the CIB with the same ID as a fence device that was already registered via the C API (including stonith_admin)? What should the fencer do when a device is deleted from the CIB but was also registered via the API? And so on. These PRs give an example.

I thought a simple fix would be to avoid deleting a fence device due to CIB change if it's registered via the API, but that has its own complications (like the PRs above).

Additionally, the executor registers fence devices via the API as part of their start operation ("to avoid a race condition"). So a running fence device that exists in the CIB is registered via both the CIB and the API, and the fencer refuses to execute the monitor operation unless the device is registered via the API.
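To sketch the interaction, here is a toy model only (not Pacemaker source; the real flags live on the fencer's device objects, and the policy is exactly what's being worked out in this PR):

```c
/* Toy model of the two registration flags and the corner case when a device
 * leaves the CIB while still registered via the API. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    const char *id;
    bool cib_registered;  /* device appears in the CIB */
    bool api_registered;  /* registered via the C API (executor, stonith_admin) */
} device_t;

/* Called when the device's <primitive> is deleted from the CIB: one possible
 * policy is to forget the device entirely only if the CIB was its sole
 * source of registration. */
static bool
should_forget_device(const device_t *dev)
{
    return !dev->api_registered;
}

int
main(void)
{
    device_t running  = { "fence1-scsi", true, true };  /* started via executor */
    device_t cib_only = { "fence2-scsi", true, false };

    printf("%s: forget on CIB delete? %s\n", running.id,
           should_forget_device(&running) ? "yes" : "no");
    printf("%s: forget on CIB delete? %s\n", cib_only.id,
           should_forget_device(&cib_only) ? "yes" : "no");
    return 0;
}
```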

Okay, so this status update wasn't quick after all :) I feel like this is all more complicated than it needs to be, and I'm trying to disentangle it. I appreciate your patience.

wenningerk (Contributor) commented:

> As a quick status update, these two registration flags (cib_registered and api_registered) have caused me a lot of frustration.

iirc background was the ability to use fenced independently from pacemaker. Don't know if that is still relevant or even possible.

nrwahl2 (Contributor, Author) commented Apr 9, 2025

> iirc background was the ability to use fenced independently from pacemaker. Don't know if that is still relevant or even possible.

It's possible via stonith_admin and via the public C API, which no one uses (except for sbd and dlm in very limited ways) but which we have to maintain "just in case."

cts-fencing also runs the fencer in stand-alone mode, so there is no CIB involved. It registers devices in cts-fence-helper.c.

I find it highly unlikely that anyone is registering fencing devices directly via the API, except for cts-fencing. We can probably avoid doing so in the executor, but it's not entirely trivial. The api_registered flag has been overloaded to also mean "started," so we'll have to disentangle that.

HideoYamauchi (Contributor) commented:

Hi @nrwahl2,

Regarding the fix for this issue, what is its status since then?
I understand that you are busy preparing for RHEL 10.0, but please let us know the status.

Best Regards,
Hideo Yamauchi.

nrwahl2 (Contributor, Author) commented Apr 27, 2025

@HideoYamauchi This is my main project right now. However, I have not been able to work on it in the past 1 to 1.5 weeks. I'm hoping to return to it in this upcoming week.

I don't have major updates. #3863 and #3866 were both prep work for this fix. I wanted to make the fencer code easier for me to read and to reason about, so that I can try to come up with a fix that I'm confident in.

HideoYamauchi (Contributor) commented Apr 27, 2025

Hi @nrwahl2,

Thanks for your comment.
Understood.

Many thanks,
Hideo Yamauchi.

HideoYamauchi (Contributor) commented:

Hi @nrwahl2

I have one question.
Is this fix managed in RHEL's Bugzilla or in Pacemaker's project (https://projects.clusterlabs.org/)?

Best Regards,
Hideo Yamauchi.
