KCP Doesn't Remediate Faulty Machines During Cluster Formation #7496

Closed
jweite-amazon opened this issue Nov 4, 2022 · 6 comments · Fixed by #7963
Labels
area/control-plane Issues or PRs related to control-plane lifecycle management kind/feature Categorizes issue or PR as related to a new feature. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@jweite-amazon

What steps did you take and what happened:

  • Defined a CAPI/CAPC cluster with three CAPC failure domains, one of which used a network that was not routable, to simulate a transient network failure our client experiences.
  • Launched a single-replica KCP and confirmed its machine became a node. (Repeated this step whenever the machine was assigned to the "bad" failure domain, until it landed in one of the two "good" ones.)
  • Installed a CNI (Cilium) and an MHC with maxUnhealthy==100%, a 5m startup timeout, and detection of Unknown and False unhealthy conditions with a 5m timeout. (A sketch of a comparable MHC manifest follows this list.)
  • Added a worker machine/node (successfully).
  • Scaled-up the KCP to three replicas.
  • Observed that only a single machine was created, on the "bad" FD, and that it only achieved the Provisioned state (it did not join the cluster and become a Node).
  • Observed that this machine had condition NodeHealthy==False.
  • Observed that this machine did not have conditions APIServerPodHealthy, ControllerManagerPodHealthy, SchedulerPodHealthy, EtcdPodHealthy or EtcdMemberHealthy.
  • Observed that this machine was not remediated by KCP after 15 minutes.
  • Observed the following recurring logging messages from KCP Manager:
    I1104 14:18:03.141623 1 controller.go:364] "Scaling up control plane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" kubeadmControlPlane="default/jweite-test-control-plane" namespace="default" name="jweite-test-control-plane" reconcileID=5f541f90-9549-496e-81c0-9befe23c1994 cluster="jweite-test" Desired=3 Existing=2
    I1104 14:18:03.141831 1 scale.go:212] "msg"="Waiting for control plane to pass preflight checks" "cluster-name"="jweite-test" "name"="jweite-test-control-plane" "namespace"="default" "failures"="[machine jweite-test-control-plane-zqgfk does not have APIServerPodHealthy condition, machine jweite-test-control-plane-zqgfk does not have ControllerManagerPodHealthy condition, machine jweite-test-control-plane-zqgfk does not have SchedulerPodHealthy condition, machine jweite-test-control-plane-zqgfk does not have EtcdPodHealthy condition, machine jweite-test-control-plane-zqgfk does not have EtcdMemberHealthy condition]"
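
For reference, the MHC described in the steps above looks roughly like the following sketch; the metadata name, namespace, and selector labels are assumptions for illustration, not values taken from the actual cluster:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: jweite-test-control-plane-mhc   # hypothetical name
  namespace: default
spec:
  clusterName: jweite-test
  # Remediate even when every matched machine is unhealthy.
  maxUnhealthy: 100%
  # A machine that never becomes a Node is considered unhealthy after 5m.
  nodeStartupTimeout: 5m
  selector:
    matchLabels:
      # Assumed selector: target the control plane machines.
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 5m
    - type: Ready
      status: "False"
      timeout: 5m
```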

What did you expect to happen:
The KCP to remediate the bad machine by deleting it.

Anything else you would like to add:

From my reading of controlplane/kubeadm/internal/controllers/remediation.go, reconcileUnhealthyMachines() insists that the cluster be fully formed (provisioned machines == desired replicas) before it will act. But the cluster can never fully form if a machine that started successfully cannot join the cluster because of an external issue such as the one I simulated. IMO remediation would be an appropriate response to this situation.

Environment:

  • Cluster-api version: v1.2.4
  • minikube/kind version: v0.11.1 go1.16.4 darwin/amd64
  • Kubernetes version: (use kubectl version): v1.20.10
  • OS (e.g. from /etc/os-release): Darwin: MacOS 12.6.1

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 4, 2022
@killianmuldoon
Contributor

killianmuldoon commented Nov 4, 2022

This is as designed right now: KCP will not remediate based on MHC until at least the desired number of healthy KCP machines are running. This is to ensure stability while a cluster is coming up. On your unhealthy machine you should see a log like:

KCP waiting for having at least 3 control plane machines before triggering remediation

If that's there, then the MHC is correctly marking the machine for remediation, but KCP is specifically deciding not to remediate until there is a stable control plane.

That said, if there's a safe, stable way to do this it could be interesting. One option today is to implement externalRemediation to manage this outside of core Cluster API. It's a hard problem: when the underlying infrastructure isn't working, it's likely that another control plane machine will also fail, because there is a real environmental issue (in your case, the network being cut off for one of the KCP nodes).
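
As an illustration of the externalRemediation option mentioned above, here is a hedged sketch of an MHC that delegates remediation to an external controller via spec.remediationTemplate; the template group, kind, and name are placeholders, since the concrete remediation provider depends on the infrastructure:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: jweite-test-control-plane-external-mhc   # hypothetical name
  namespace: default
spec:
  clusterName: jweite-test
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 5m
    - type: Ready
      status: "False"
      timeout: 5m
  # With remediationTemplate set, the MHC controller does not delete the
  # machine itself; it creates an object from this template and lets an
  # external remediation controller decide what to do.
  remediationTemplate:
    apiVersion: remediation.example.com/v1alpha1   # placeholder API group
    kind: ExampleRemediationTemplate               # placeholder kind
    name: example-remediation-template             # placeholder name
```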

@jweite-amazon
Author

jweite-amazon commented Nov 4, 2022

Thanks for that feedback @killianmuldoon. I certainly don't know the basis behind the design decision here (i.e., why remediating during CP formation is risky). Its downside, as demonstrated, is that the partially provisioned CP will remain stuck in that state: the new CP machine can never join the cluster, and CAPI keeps waiting for it to. Stable, yes, but not in a useful way. I'd like CAPI to be able to recover from provisioning problems occurring during cluster formation that it "knows how to" recover from after cluster formation completes.

Can you or anyone shed more light on the risk of remediating during CP formation?

@killianmuldoon
Contributor

The major risk at this point is that the etcd cluster is knocked into a state that it can't automatically recover from, e.g. losing the leader or losing the majority.

Given that this is happening at bootstrap time, it's probably easier and faster to just start the cluster creation over if you're confident the KCP machine failure is something flaky rather than something clearly wrong with the underlying infrastructure.

@fabriziopandini
Member

/triage accepted

I agree this is an interesting new use case to cover if we can find a safe, stable way to do it.

Some context that I hope can help in shaping the discussion:

  • KCP remediation was not originally designed to act with fewer than 3 nodes;
  • in a follow-up iteration, we relaxed some of the original constraints in order to support remediation of errors during the rollout of single-machine control planes; see 🐛 Allow KCP remediation when the etcd member being remediated is missing #4591 and 🌱 Update KCP remediation docs and messages to support > 1 replicas #4594;
  • remediation during cluster formation wasn't a use case considered in the original design nor in the follow-up iteration;
  • reading the comment on the line highlighted above, it seems to me that this check was implemented in the first iteration to prevent KCP from remediating "aggressively" when there is more than 1 machine with problems; in other words, remediate 1 failing machine, restore desired replicas, then remediate the next one, instead of remediating all the failing machines in sequence and "aggressively" downsizing the CP.

Now, as reported above, the last condition prevents remediation during cluster formation; before relaxing this check in a new iteration, IMO we should address at least the following questions:

  • whether aggressive remediation is still a concern, and if so, how to continue to prevent it while allowing remediation during cluster formation;
  • whether there are other use cases where current replicas < desired replicas, e.g. when doing a rollout with the scale-in strategy, and if/how the proposed change impacts those use cases.

/area control-plane
/remove-kind bug
/kind feature

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. area/control-plane Issues or PRs related to control-plane lifecycle management kind/feature Categorizes issue or PR as related to a new feature. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Nov 4, 2022
@fabriziopandini
Member

/assign

I'm working on some ideas to solve this problem; I will follow up with more details here or in a PR with an amendment to the KCP proposal.

@fabriziopandini
Member

#7855 proposes an amendment to the KCP proposal so that it becomes possible to remediate failures happening while provisioning the CP (both the first CP machine and additional CP machines while current replicas < desired replicas).

In order to make this more robust and less aggressive on the infrastructure (e.g. to avoid infinite remediation if the first machine fails consistently), I have added optional support for controlling the number of retries and the delay between each retry.
I'm working on a PR that implements the proposed changes.
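
Purely as an illustration of the knobs described above, and not the final API (the field names below are hypothetical), such retry controls could surface on the KubeadmControlPlane spec roughly like this:

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: jweite-test-control-plane
  namespace: default
spec:
  replicas: 3
  # Hypothetical remediation tuning, sketching the "number of retries" and
  # "delay between each retry" mentioned in the comment above.
  remediationStrategy:
    maxRetry: 3       # give up after 3 failed remediation attempts
    retryPeriod: 5m   # wait 5 minutes between remediation attempts
  # (version, machineTemplate, and kubeadmConfigSpec omitted for brevity)
```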
