
Fix: fencer: Refresh CIB devices on change to nodes section #3855


Status: Open · wants to merge 5 commits into main

Conversation

nrwahl2 (Contributor) commented Mar 28, 2025

This fixes a regression introduced by bf7ffcd. As of that commit, the fake local node is created after all resources have been unpacked. So it doesn't get added to resources' allowed_nodes tables.

This prevents registration of fencing devices when the fencer receives a CIB diff that removes the local node. For example, the user may have replaced the CIB with a boilerplate configuration that has an empty nodes section. Registering a fencing device requires that the local node be in the resource's allowed nodes table.
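To make the failure mode concrete, here is a minimal sketch of that precondition, using GLib as Pacemaker does. The `resource_t` layout and `local_node_allowed()` are hypothetical stand-ins for this demo, not the fencer's real types:

```c
/* Minimal sketch of the registration precondition described above.
 * Build: gcc demo.c $(pkg-config --cflags --libs glib-2.0)
 */
#include <glib.h>
#include <stdio.h>

typedef struct {
    const char *id;
    GHashTable *allowed_nodes;  /* node name -> node object */
} resource_t;

static gboolean
local_node_allowed(const resource_t *rsc, const char *local_node_name)
{
    return g_hash_table_lookup(rsc->allowed_nodes, local_node_name) != NULL;
}

int
main(void)
{
    resource_t rsc = {
        .id = "fence1-scsi",
        .allowed_nodes = g_hash_table_new(g_str_hash, g_str_equal),
    };

    /* The fake local node was never added to the table, so device
     * registration would be skipped here. */
    printf("allowed before: %d\n", local_node_allowed(&rsc, "rh94-dev01"));

    g_hash_table_insert(rsc.allowed_nodes, (gpointer) "rh94-dev01",
                        (gpointer) "node-object");
    printf("allowed after:  %d\n", local_node_allowed(&rsc, "rh94-dev01"));

    g_hash_table_destroy(rsc.allowed_nodes);
    return 0;
}
```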

One option would be to add the fake local node to all resources' allowed nodes tables immediately after creating it. However, it shouldn't necessarily be an allowed node for all resources. For example, if symmetric-cluster="false", then a node should not be placed in a resource's allowed nodes table by default.

It's difficult to ensure correctness of the allowed nodes tables when a fake local node is involved. It may involve repeated code or a fragile and confusing dependence on the order of unpacking.

Since the fake local node is a hack in the first place, it seems better to avoid using it where possible. Currently the only code that even sets the local_node_name member of scheduler->priv is in:

  • the fencer
  • crm_mon when showing the "times" section

This commit works as follows. If the fencer receives a CIB diff notification that contains the nodes section, it triggers a full refresh of fencing device registrations. In our example above, where a user has replaced the CIB with a configuration that has an empty nodes section, this means all fencing device registrations will be removed.

However, the controller also has a CIB diff notification callback: do_cib_updated(). The controller's callback repopulates the nodes section with up-to-date information from the cluster layer (or its node cache) if it finds that an untrusted client like cibadmin has modified the nodes section. Then it updates the CIB accordingly.

The fencer will receive this updated CIB and refresh fencing device registrations again, re-registering the fencing devices that were just removed.
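For illustration, a standalone sketch (using libxml2, not the fencer's actual implementation) of deciding whether a format-2 patchset touches the nodes section might look like this:

```c
/* Standalone sketch: decide whether a CIB format-2 patchset touches the
 * nodes section. Build: gcc sketch.c $(xml2-config --cflags --libs)
 */
#include <stdio.h>
#include <string.h>
#include <libxml/parser.h>
#include <libxml/xpath.h>

static int
diff_touches_nodes(const char *patchset_xml)
{
    int touches = 0;
    xmlDocPtr doc = xmlReadMemory(patchset_xml, (int) strlen(patchset_xml),
                                  "patchset.xml", NULL, 0);

    if (doc == NULL) {
        return 0;
    }
    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);

    /* Format-2 patchsets record each change's target in a "path" attribute */
    xmlXPathObjectPtr obj = xmlXPathEvalExpression(
        (const xmlChar *) "//change[contains(@path, '/configuration/nodes')]",
        ctx);

    if ((obj != NULL) && (obj->nodesetval != NULL)
        && (obj->nodesetval->nodeNr > 0)) {
        touches = 1;   /* would trigger a full refresh of device registrations */
    }
    xmlXPathFreeObject(obj);
    xmlXPathFreeContext(ctx);
    xmlFreeDoc(doc);
    return touches;
}

int
main(void)
{
    const char *diff =
        "<diff format='2'>"
        "  <change operation='delete'"
        "          path='/cib/configuration/nodes/node[@id=\"1\"]'/>"
        "</diff>";

    printf("refresh devices: %s\n", diff_touches_nodes(diff) ? "yes" : "no");
    return 0;
}
```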

Note that in the common case, we don't do any of this wasted work. The "remove and then re-register" sequence should happen only when a user has modified the CIB in a sloppy way (for example, by deleting nodes from the CIB's nodes section that have not been removed from the cluster).

In short, it seems better to rely on the controller's maintenance of the CIB's node list than to rely on a "fake local node" hack in the scheduler.

See the following pull requests from Hideo Yamauchi and their discussions: ClusterLabs#3849 and ClusterLabs#3852.

Thanks to Hideo for the report and finding the cause.

nrwahl2 (Contributor, Author) commented Mar 28, 2025

@HideoYamauchi Can you test this if you have time and see if it fixes your issue? I made sure it compiles, but I haven't had time to set up a test for your issue yet, and I have to step away for a while.

HideoYamauchi (Contributor) commented:

@nrwahl2

I'll check.
Please wait a moment.

Best Regards,
Hideo Yamauchi.

HideoYamauchi (Contributor) commented:

@nrwahl2

Great!

With this fix, we have confirmed that fenced correctly registers STONITH devices both when there is a node section in cib.xml and when there is not.

- If cib.xml has a node section:
[root@rh94-dev01 pacemaker]# crm_mon -rfA1
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: rh94-dev02 (version 3.0.0-6d6734a86b) - partition with quorum
  * Last updated: Fri Mar 28 18:32:46 2025 on rh94-dev01
  * Last change:  Fri Mar 28 18:27:52 2025 by hacluster via hacluster on rh94-dev01
  * 2 nodes configured
  * 11 resource instances configured

Node List:
  * Online: [ rh94-dev01 rh94-dev02 ]

Full List of Resources:
  * Resource Group: pgsql-group:
    * filesystem1       (ocf:heartbeat:Dummy):   Started rh94-dev01
    * filesystem2       (ocf:heartbeat:Dummy):   Started rh94-dev01
    * filesystem3       (ocf:heartbeat:Dummy):   Started rh94-dev01
    * ipaddr    (ocf:heartbeat:Dummy):   Started rh94-dev01
    * pgsql     (ocf:heartbeat:Dummy):   Started rh94-dev01
  * Clone Set: ping-clone [ping]:
    * Started: [ rh94-dev01 rh94-dev02 ]
  * Clone Set: storage-mon-clone [storage-mon]:
    * Started: [ rh94-dev01 rh94-dev02 ]
  * fence1-scsi (stonith:fence_scsi):    Started rh94-dev01
  * fence2-scsi (stonith:fence_scsi):    Started rh94-dev02

Node Attributes:
  * Node: rh94-dev01:
    * ping-status                       : 1         
  * Node: rh94-dev02:
    * ping-status                       : 1         

Migration Summary:
[root@rh94-dev01 pacemaker]# 
[root@rh94-dev01 pacemaker]# egrep "Ignore|Creating" /var/log/pacemaker/pacemaker.log
(no output)

- If cib.xml does not have a node section:
[root@rh94-dev01 pacemaker]# crm_mon -rfA1
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: rh94-dev01 (version 3.0.0-6d6734a86b) - partition with quorum
  * Last updated: Fri Mar 28 18:37:18 2025 on rh94-dev01
  * Last change:  Fri Mar 28 18:36:23 2025 by hacluster via hacluster on rh94-dev01
  * 2 nodes configured
  * 11 resource instances configured

Node List:
  * Online: [ rh94-dev01 rh94-dev02 ]

Full List of Resources:
  * Resource Group: pgsql-group:
    * filesystem1       (ocf:heartbeat:Dummy):   Started rh94-dev01
    * filesystem2       (ocf:heartbeat:Dummy):   Started rh94-dev01
    * filesystem3       (ocf:heartbeat:Dummy):   Started rh94-dev01
    * ipaddr    (ocf:heartbeat:Dummy):   Started rh94-dev01
    * pgsql     (ocf:heartbeat:Dummy):   Started rh94-dev01
  * Clone Set: ping-clone [ping]:
    * Started: [ rh94-dev01 rh94-dev02 ]
  * Clone Set: storage-mon-clone [storage-mon]:
    * Started: [ rh94-dev01 rh94-dev02 ]
  * fence1-scsi (stonith:fence_scsi):    Started rh94-dev01
  * fence2-scsi (stonith:fence_scsi):    Started rh94-dev02

Node Attributes:
  * Node: rh94-dev01:
    * ping-status                       : 1         
  * Node: rh94-dev02:
    * ping-status                       : 1         

Migration Summary:
[root@rh94-dev01 pacemaker]# 

[root@rh94-dev01 pacemaker]# egrep "Ignore|Creating" /var/log/pacemaker/pacemaker.log
Mar 28 18:36:01.645 rh94-dev01 pacemaker-fenced    [540611] (cluster_status)    info: Creating a fake local node for rh94-dev01
Mar 28 18:36:03.300 rh94-dev01 pacemaker-fenced    [540611] (cluster_status)    info: Creating a fake local node for rh94-dev01
[root@rh94-dev01 pacemaker]# 

In both cases, STONITH with fence_scsi also worked correctly.

Please merge this fix.

Many thanks,
Hideo Yamauchi.

nrwahl2 (Contributor, Author) commented Mar 28, 2025

> With this fix, we have confirmed that fenced correctly registers STONITH devices both when there is a node section in cib.xml and when there is not.

Excellent! I'm about to push an edit that adds a sanity check ensuring that the local node is not already part of a resource's allowed nodes table. (I think that should be impossible, but the scheduler is complex :) ).

> Please merge this fix.

I'll let the new CI build finish and then we'll merge it during NA hours if @clumens and @wenningerk have no objection.

nrwahl2 requested a review from clumens March 28, 2025 10:10
nrwahl2 (Contributor, Author) commented Mar 28, 2025

@clumens We'll need to make sure this gets into RHEL 10. If we can get it into 10.0 without a z-stream, that would be ideal.

nrwahl2 (Contributor, Author) commented Mar 28, 2025

I'm going to bed soon, but I am starting to wonder if this approach is still too eager. If symmetric-cluster is false, then adding the fake local node to all resources' allowed nodes tables might allow resources to run in places where they should not.

If that concern has merit, then another possibility might be:

  • Create the fake local node at the very beginning of cluster_status()
  • Set a node flag that indicates this is a fake/temporary node
  • Later, if we want to create the node during unpack, look up and "overwrite" the fake node. (By this I mean overwrite any fields that may differ between the fake node and the unpacked node, or remove the fake node and create a new one. Removing and recreating may get a bit complicated, since there may already be copies of the node in resources' allowed nodes tables.) My current idea is to write a "reset node" helper function, which pe_create_node() will call if it finds that the node already exists and has the pcmk__node_fake flag set.

I almost never think about clusters with symmetric-cluster=false, because they're rare in practice. However, if it is set to false, then we'd probably want to rely on the rest of the unpacking logic to see whether each resource is allowed (via constraints) to run on each node :(

BTW, I'm looking at pe__unpack_resource(), which calls resource_location() if symmetric-cluster is true or the resource represents a guest node. resource_location() in turn adds all nodes to the resource's allowed nodes table.
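A rough standalone sketch of that "reset node" idea follows (GLib). The `node_t` layout, the `NODE_FAKE` flag (standing in for the proposed pcmk__node_fake), and `create_or_reset_node()` are all hypothetical:

```c
/* Sketch of "create early, flag as fake, overwrite during unpack".
 * Build: gcc demo.c $(pkg-config --cflags --libs glib-2.0)
 */
#include <glib.h>
#include <stdio.h>

#define NODE_FAKE (1U << 0)

typedef struct {
    char *uname;
    char *id;
    unsigned int flags;
} node_t;

static node_t *
create_or_reset_node(GHashTable *nodes, const char *uname, const char *id,
                     unsigned int flags)
{
    node_t *node = g_hash_table_lookup(nodes, uname);

    if ((node != NULL) && (node->flags & NODE_FAKE)) {
        /* Reset in place: keep the existing allocation so that copies or
         * pointers already handed out stay valid, and overwrite only the
         * fields that may differ between the fake node and the real one. */
        g_free(node->id);
        node->id = g_strdup(id);
        node->flags = flags;

    } else if (node == NULL) {
        node = g_new0(node_t, 1);
        node->uname = g_strdup(uname);
        node->id = g_strdup(id);
        node->flags = flags;
        g_hash_table_insert(nodes, node->uname, node);
    }
    return node;
}

int
main(void)
{
    GHashTable *nodes = g_hash_table_new(g_str_hash, g_str_equal);

    /* Fake local node created at the start of cluster_status() ... */
    create_or_reset_node(nodes, "rh94-dev01", "fake", NODE_FAKE);

    /* ... later overwritten while unpacking the CIB's nodes section */
    node_t *n = create_or_reset_node(nodes, "rh94-dev01", "1", 0);

    printf("%s: id=%s fake=%u\n", n->uname, n->id, n->flags & NODE_FAKE);
    g_hash_table_destroy(nodes);   /* demo only; leaks the node members */
    return 0;
}
```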

clumens (Contributor) commented Mar 28, 2025

Is this something we can add a scheduler regression test for?

nrwahl2 (Contributor, Author) commented Mar 28, 2025

> Is this something we can add a scheduler regression test for?

I don't think so.

  1. The observed symptom required the fencer.
  2. Even if we try to set up a contrived test to make sure the local node gets added to resources' allowed nodes if appropriate, AFAIK there's no way to set local_node_name via an input CIB.

nrwahl2 added 4 commits March 30, 2025 13:53
* Remove some nesting.
* Use const where possible.
* Improve variable names and scope.
* Use a macro for repeated string literal.
* Use sizeof() - 1 instead of strlen() for string literal.
* Remove unnecessary escaping of single quote within double quotes.
* Copy only what is needed of the xpath string, if anything.
* Log assertion on malformed patchset, since we created it in
  cib_perform_op().
* Add a comment to clarify why we're (apparently) doing all this in the
  first place.

Keep the watchdog_device_update() comment as-is because I haven't looked
into it.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
An ID is required when schema validation is enabled anyway, and this
allows us to be sure rsc_id will be non-NULL within the block. It also
allows us to shorten some strstr() searches by starting later in the
string, not that this will matter in practice.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
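As a small aside, here is a standalone illustration of the sizeof() - 1 idiom and the "shorten strstr() searches by starting later" point from the commit messages above. RSC_PREFIX is a made-up macro name for this demo:

```c
/* For a string literal, sizeof counts the bytes including the trailing NUL,
 * so sizeof(X) - 1 is a compile-time strlen(X). */
#include <stdio.h>
#include <string.h>

#define RSC_PREFIX "/cib/configuration/resources"

int
main(void)
{
    const char *xpath = RSC_PREFIX "/primitive[@id='fence1-scsi']";

    /* Skip past the known prefix without a runtime strlen() call, so later
     * strstr() searches start deeper in the string. */
    const char *rest = xpath + sizeof(RSC_PREFIX) - 1;

    printf("rest: %s\n", rest);
    printf("found primitive: %s\n", strstr(rest, "primitive") ? "yes" : "no");
    return 0;
}
```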
The sole caller has already done the checking; we don't need event, and
we don't need msg except as a way to get the patchset.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
If the child of a primitive element representing a fencing device was
deleted, and we didn't catch it earlier as instance_attributes or
meta_attributes, then previously we would remove the fencing device. We
should remove the device only if the primitive itself was deleted.

By inspection, it looks like this would unregister a fencing device if
one of its operations (op elements) were deleted.

If we find that only a child of the primitive was deleted, fall through
to the next if block -- that is, "continue" only if the primitive itself
was deleted and we call stonith_device_remove().

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
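By way of illustration, the distinction could be drawn from the deletion's xpath roughly like this (a standalone sketch only; the real fencer code differs):

```c
/* Unregister a fencing device only when the <primitive> itself was deleted,
 * not merely one of its children (instance_attributes, meta_attributes,
 * op, ...). Assumes an ID predicate, which schema validation requires. */
#include <stdio.h>
#include <string.h>

static int
primitive_itself_deleted(const char *xpath)
{
    const char *prim = strstr(xpath, "/primitive[");

    if (prim == NULL) {
        return 0;                  /* no primitive in the path at all */
    }
    const char *end = strchr(prim + 1, ']');

    if (end == NULL) {
        return 0;                  /* malformed; treat as a child change */
    }
    /* If anything follows the primitive's closing bracket, a descendant
     * was deleted rather than the primitive itself. */
    return end[1] == '\0';
}

int
main(void)
{
    const char *a = "/cib/configuration/resources/primitive[@id='f1']";
    const char *b = "/cib/configuration/resources/primitive[@id='f1']"
                    "/operations/op[@id='f1-monitor-10s']";

    printf("%s -> %d\n", a, primitive_itself_deleted(a));  /* 1: remove */
    printf("%s -> %d\n", b, primitive_itself_deleted(b));  /* 0: keep   */
    return 0;
}
```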
nrwahl2 changed the title from "Fix: libpe_status: Add fake local node to all resources' allowed_nodes" to "Fix: fencer: Refresh CIB devices on change to nodes section" on Mar 30, 2025
nrwahl2 (Contributor, Author) commented Mar 30, 2025

I hope this new version works, and that temporarily unregistering devices doesn't break anything. As mentioned in the commit message, I'm afraid to rely on the allowed nodes table being correct when a fake local node is involved. I'd prefer to use the cluster layer (via the controller) as the source of truth.

Fix: fencer: Refresh CIB devices on change to nodes section

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
nrwahl2 (Contributor, Author) commented Mar 31, 2025

Pushed a new version with a FIXME comment. So I'm still not totally happy with this :(

HideoYamauchi (Contributor) commented:

@nrwahl2

I have not yet confirmed that the fixes made after my earlier check work.
I will check that they work after this PR is finally merged.

However, there is something that concerns me.
schedulerd and fenced are different processes.
It's fine for fenced to calculate the placement based on the re-notification, but if the unfencing notification from schedulerd comes first, won't unfencing fail in the same way?

I haven't looked at the contents of your fix PR in detail, but I thought I'd comment just to be sure.

Best Regards,
Hideo Yamauchi.

clumens (Contributor) commented Apr 4, 2025

I think this is okay, but I've never even heard of the fake local node thing until this PR came up. If @HideoYamauchi confirms it works for him, and ctslab testing doesn't show any other problems, I am okay with merging this.

HideoYamauchi (Contributor) commented:

Hi @clumens @nrwahl2

> I think this is okay, but I've never even heard of the fake local node thing until this PR came up. If @HideoYamauchi confirms it works for him, and ctslab testing doesn't show any other problems, I am okay with merging this.

Okay!
Once the final PR is complete, I will run the tests again.

Many thanks,
Hideo Yamauchi.

nrwahl2 (Contributor, Author) commented Apr 7, 2025

As a quick status update, these two registration flags (cib_registered and api_registered) have caused me a lot of frustration.

For some reason, in 2009 we thought it was a good idea to allow registering fence devices without adding them to the CIB. Over the years, code has been added to handle corner cases stemming from that. For example, what should the fencer do when it finds a new fence device in the CIB with the same ID as a fence device that was already registered via the C API (including stonith_admin)? What should the fencer do when a device is deleted from the CIB but was also registered via the API? And so on. These PRs give an example.

I thought a simple fix would be to avoid deleting a fence device due to CIB change if it's registered via the API, but that has its own complications (like the PRs above).

Additionally, the executor registers fence devices via the API as part of their start operation ("to avoid a race condition"). So a running fence device that exists in the CIB is registered via both the CIB and the API, and the fencer refuses to execute the monitor operation unless the device is registered via the API.
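To sketch the interaction, here is a toy model only (not Pacemaker source; the real flags live on the fencer's device objects, and the policy is exactly what's being worked out in this PR):

```c
/* Toy model of the two registration flags and the corner case when a device
 * leaves the CIB while still registered via the API. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    const char *id;
    bool cib_registered;  /* device appears in the CIB */
    bool api_registered;  /* registered via the C API (executor, stonith_admin) */
} device_t;

/* Called when the device's <primitive> is deleted from the CIB: one possible
 * policy is to forget the device entirely only if the CIB was its sole
 * source of registration. */
static bool
should_forget_device(const device_t *dev)
{
    return !dev->api_registered;
}

int
main(void)
{
    device_t running  = { "fence1-scsi", true, true };  /* started via executor */
    device_t cib_only = { "fence2-scsi", true, false };

    printf("%s: forget on CIB delete? %s\n", running.id,
           should_forget_device(&running) ? "yes" : "no");
    printf("%s: forget on CIB delete? %s\n", cib_only.id,
           should_forget_device(&cib_only) ? "yes" : "no");
    return 0;
}
```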

Okay, so this status update wasn't quick after all :) I feel like this is all more complicated than it needs to be, and I'm trying to disentangle it. I appreciate your patience.

wenningerk (Contributor) commented:

> As a quick status update, these two registration flags (cib_registered and api_registered) have caused me a lot of frustration.

iirc background was the ability to use fenced independently from pacemaker. Don't know if that is still relevant or even possible.

nrwahl2 (Contributor, Author) commented Apr 9, 2025

> iirc background was the ability to use fenced independently from pacemaker. Don't know if that is still relevant or even possible.

It's possible via stonith_admin and via the public C API, which no one uses (except for sbd and dlm in very limited ways) but which we have to maintain "just in case."

cts-fencing also runs the fencer in stand-alone mode, so there is no CIB involved. It registers devices in cts-fence-helper.c.

I find it highly unlikely that anyone is registering fencing devices directly via the API, except for cts-fencing. We can probably avoid doing so in the executor, but it's not entirely trivial. The api_registered flag has been overloaded to also mean "started," so we'll have to disentangle that.

HideoYamauchi (Contributor) commented:

Hi @nrwahl2,

Regarding the fix for this issue, what is its status since then?
I understand that you are busy preparing for RHEL 10.0, but please let us know the status.

Best Regards,
Hideo Yamauchi.

nrwahl2 (Contributor, Author) commented Apr 27, 2025

@HideoYamauchi This is my main project right now. However, I have not been able to work on it in the past 1 to 1.5 weeks. I'm hoping to return to it in this upcoming week.

I don't have major updates. #3863 and #3866 were both prep work for this fix. I wanted to make the fencer code easier for me to read and to reason about, so that I can try to come up with a fix that I'm confident in.

HideoYamauchi (Contributor) commented Apr 27, 2025

Hi @nrwahl2,

Thanks for your comment.
Understood.

Many thanks,
Hideo Yamauchi.

HideoYamauchi (Contributor) commented:

Hi @nrwahl2

I have one question.
Is this fix managed in RHEL's Bugzilla or in Pacemaker's project (https://projects.clusterlabs.org/)?

Best Regards,
Hideo Yamauchi.
