KCP Doesn't Remediate Faulty Machines During Cluster Formation #7496

Closed
jweite-amazon opened this issue Nov 4, 2022 · 6 comments · Fixed by #7963
Labels
area/control-plane Issues or PRs related to control-plane lifecycle management kind/feature Categorizes issue or PR as related to a new feature. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@jweite-amazon

What steps did you take and what happened:

  • Defined a CAPI/CAPC cluster with three CAPC failure domains, one of which used a network that was not routable, to simulate a transient network failure our client experiences.
  • Launched a single-replica KCP and confirmed its machine became a node. (Repeated this step whenever the machine was assigned to the "bad" failure domain, until it landed in one of the two "good" ones.)
  • Installed a CNI (Cilium) and an MHC with maxUnhealthy==100%, a 5m startup timeout, and detection of Unknown and False unhealthy conditions with a 5m timeout. (A sketch of a comparable MHC manifest follows this list.)
  • Added a worker machine/node (successfully).
  • Scaled-up the KCP to three replicas.
  • Observed that only a single machine was created, on the "bad" FD, and that it only achieved the Provisioned state (it did not join the cluster and become a Node).
  • Observed that this machine had condition NodeHealthy==False.
  • Observed that this machine did not have conditions APIServerPodHealthy, ControllerManagerPodHealthy, SchedulerPodHealthy, EtcdPodHealthy or EtcdMemberHealthy.
  • Observed that this machine was not remediated by KCP after 15 minutes.
  • Observed the following recurring logging messages from KCP Manager:
    I1104 14:18:03.141623 1 controller.go:364] "Scaling up control plane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" kubeadmControlPlane="default/jweite-test-control-plane" namespace="default" name="jweite-test-control-plane" reconcileID=5f541f90-9549-496e-81c0-9befe23c1994 cluster="jweite-test" Desired=3 Existing=2
    I1104 14:18:03.141831 1 scale.go:212] "msg"="Waiting for control plane to pass preflight checks" "cluster-name"="jweite-test" "name"="jweite-test-control-plane" "namespace"="default" "failures"="[machine jweite-test-control-plane-zqgfk does not have APIServerPodHealthy condition, machine jweite-test-control-plane-zqgfk does not have ControllerManagerPodHealthy condition, machine jweite-test-control-plane-zqgfk does not have SchedulerPodHealthy condition, machine jweite-test-control-plane-zqgfk does not have EtcdPodHealthy condition, machine jweite-test-control-plane-zqgfk does not have EtcdMemberHealthy condition]"
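
For reference, the MHC described in the steps above looks roughly like the following sketch; the metadata name, namespace, and selector labels are assumptions for illustration, not values taken from the actual cluster:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: jweite-test-control-plane-mhc   # hypothetical name
  namespace: default
spec:
  clusterName: jweite-test
  # Remediate even when every matched machine is unhealthy.
  maxUnhealthy: 100%
  # A machine that never becomes a Node is considered unhealthy after 5m.
  nodeStartupTimeout: 5m
  selector:
    matchLabels:
      # Assumed selector: target the control plane machines.
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 5m
    - type: Ready
      status: "False"
      timeout: 5m
```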

What did you expect to happen:
The KCP to remediate the bad machine by deleting it.

Anything else you would like to add:

From my reading of controlplane/kubeadm/internal/controllers/remediation.go, reconcileUnhealthyMachines() insists that the cluster be fully formed (provisioned machines == desired replicas) before it will act. But the cluster can never fully form if a machine that started successfully cannot join the cluster because of an external issue such as the one I simulated. IMO remediation would be an appropriate response to this situation.

Environment:

  • Cluster-api version: v1.2.4
  • minikube/kind version: v0.11.1 go1.16.4 darwin/amd64
  • Kubernetes version: (use kubectl version): v1.20.10
  • OS (e.g. from /etc/os-release): Darwin: MacOS 12.6.1

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 4, 2022
@killianmuldoon
Contributor

killianmuldoon commented Nov 4, 2022

This is as designed right now: KCP will not remediate based on MHC until at least the desired number of healthy KCP machines are running. This is to ensure stability while a cluster is coming up. On your unhealthy machine you should see a log like:

KCP waiting for having at least 3 control plane machines before triggering remediation

If that's there, then the MHC is correctly marking the machine for remediation, but KCP is specifically deciding not to remediate until there is a stable control plane.

That said, if there's a safe, stable way to do this it could be interesting. One option today is to implement externalRemediation to manage this outside of core Cluster API. It's a hard problem: when the underlying infrastructure isn't working, it's likely that another control plane machine will also fail, because there is a real environmental issue (in your case, the network being cut off for one of the KCP nodes).
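
As an illustration of the externalRemediation option mentioned above, here is a hedged sketch of an MHC that delegates remediation to an external controller via spec.remediationTemplate; the template group, kind, and name are placeholders, since the concrete remediation provider depends on the infrastructure:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: jweite-test-control-plane-external-mhc   # hypothetical name
  namespace: default
spec:
  clusterName: jweite-test
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 5m
    - type: Ready
      status: "False"
      timeout: 5m
  # With remediationTemplate set, the MHC controller does not delete the
  # machine itself; it creates an object from this template and lets an
  # external remediation controller decide what to do.
  remediationTemplate:
    apiVersion: remediation.example.com/v1alpha1   # placeholder API group
    kind: ExampleRemediationTemplate               # placeholder kind
    name: example-remediation-template             # placeholder name
```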

@jweite-amazon
Author

jweite-amazon commented Nov 4, 2022

Thanks for that feedback @killianmuldoon. I certainly don't know the basis behind the design decision here (i.e., why remediating during CP formation is risky). Its downside, as demonstrated, is that the partially provisioned CP will remain stuck in that state: the new CP machine can never join the cluster, and CAPI keeps waiting for it to. Stable, yes, but not in a useful way. I'd like CAPI to be able to recover from provisioning problems occurring during cluster formation that it "knows how to" recover from after cluster formation completes.

Can you or anyone shed more light on the risk of remediating during CP formation?

@killianmuldoon
Contributor

The major risk at this point is that the etcd cluster is knocked into a state that it can't automatically recover from, e.g. losing the leader or losing the majority.

Given that this is happening at bootstrap time, it's probably easier and faster to just start the cluster creation over if you're confident the KCP machine failure is something flaky rather than something clearly wrong with the underlying infrastructure.

@fabriziopandini
Member

/triage accepted

I agree this is an interesting new use case to cover if we can find a safe, stable way to do it.

Some context that I hope can help in shaping the discussion:

  • KCP remediation was not originally designed to act with fewer than 3 nodes;
  • in a follow-up iteration, we relaxed some of the original constraints in order to support remediation of errors during the rollout of single-machine control planes; see 🐛 Allow KCP remediation when the etcd member being remediated is missing #4591 and 🌱 Update KCP remediation docs and messages to support > 1 replicas #4594;
  • remediation during cluster formation wasn't a use case considered in the original design nor in the follow-up iteration;
  • reading the comment on the line highlighted above, it seems to me that this check was implemented in the first iteration to prevent KCP from remediating "aggressively" when there is more than 1 machine with problems; in other words, remediate 1 failing machine, restore desired replicas, then remediate the next one, instead of remediating all the failing machines in sequence and "aggressively" downsizing the CP.

Now, as reported above, the last condition prevents remediation during cluster formation; before relaxing this check in a new iteration, IMO we should address at least the following questions:

  • whether aggressive remediation is still a concern, and if so, how to continue to prevent it while allowing remediation during cluster formation;
  • whether there are other use cases where current replicas < desired replicas, e.g. when doing a rollout with the scale-in strategy, and if/how the proposed change impacts those use cases.

/area control-plane
/remove-kind bug
/kind feature

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. area/control-plane Issues or PRs related to control-plane lifecycle management kind/feature Categorizes issue or PR as related to a new feature. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Nov 4, 2022
@fabriziopandini
Member

/assign

I'm working on some ideas to solve this problem; I will follow up with more details here or in a PR with an amendment to the KCP proposal.

@fabriziopandini
Member

#7855 proposes an amendment to the KCP proposal so that it becomes possible to remediate failures happening while provisioning the CP (both the first CP machine and additional CP machines while current replicas < desired replicas).

In order to make this more robust and less aggressive on the infrastructure (e.g. to avoid infinite remediation if the first machine fails consistently), I have added optional support for controlling the number of retries and the delay between each retry.
I'm working on a PR that implements the proposed changes.
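
Purely as an illustration of the knobs described above, and not the final API (the field names below are hypothetical), such retry controls could surface on the KubeadmControlPlane spec roughly like this:

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: jweite-test-control-plane
  namespace: default
spec:
  replicas: 3
  # Hypothetical remediation tuning, sketching the "number of retries" and
  # "delay between each retry" mentioned in the comment above.
  remediationStrategy:
    maxRetry: 3       # give up after 3 failed remediation attempts
    retryPeriod: 5m   # wait 5 minutes between remediation attempts
  # (version, machineTemplate, and kubeadmConfigSpec omitted for brevity)
```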
