
While creating a new cluster, CAPI fails to remediate new machines that aren't functional #7353

Closed
mrog opened this issue Oct 5, 2022 · 29 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

mrog commented Oct 5, 2022

What steps did you take and what happened:
I'm using EKS Anywhere (EKS-A) to create clusters. EKS-A uses CAPI. I'm using Apache CloudStack via CAPC for the infrastructure.

CAPI's machine health checker (MHC) works well, but only after a cluster is fully created. If a problem during cluster creation makes a machine unusable, CAPI is unable to identify and remediate the situation. Such problems include a VM that boots but lacks the network connectivity needed to join the cluster, or a VM that starts running but whose services fail to start because of a configuration problem.

I started by creating a management cluster. Then I created a workload cluster. To simulate a failure, I ran a script on one of the new workload cluster VMs as soon as it was reachable by SSH. The script disabled the VM's network adapters.

What did you expect to happen:
CAPI should notice that the Machine associated with the failed VM is stuck in the Provisioned phase for far too long, and then replace it. This never happens. EKS-A eventually times out after about 2 hours and leaves the workload cluster in an incomplete state.

Anything else you would like to add:
I haven't done extensive testing with cluster upgrades, but it seems that cluster upgrades can be affected in a similar way to cluster creation.

EKS-A normally adds the machine health checks to the cluster at the very end of the process, after all the machines are created and Cilium and kube-vip are installed. I made a custom build that adds those same health checks before machine creation instead of at the end. The CAPI log showed that MHC was unable to connect to the workload cluster's endpoint, which makes sense because that cluster hadn't been created yet.

MHC should be able to use the management cluster's endpoint during cluster creation because all the objects initially exist on the management cluster. So, I modified the MHC code in CAPI to make it connect to the management cluster endpoint instead of the workload cluster endpoint. That resolved the errors in the CAPI log, but MHC saw all the new VMs as unhealthy and started endlessly deleting and replacing them as fast as they could be provisioned. This may be because the new Machines weren't associated with Nodes yet.

I also experimented with solutions that don't use MHC. The most reliable way to detect that a new Machine needs to be replaced seems to be if it stays in the Provisioned phase for more than a few minutes. When this happens, deleting the Machine object usually results in the machine being replaced.

Deleting a Machine object in the Provisioned phase from code inside CAPI seems to be safe in itself, but it often leads to problems with the cluster, especially if the Machine being deleted is the first control plane machine. The first CP machine is used as the cluster endpoint by kube-vip, and the replacement Machine is somehow initialized differently from the original, which leaves the cluster in an unusable state.

If a workload Machine object in the Provisioned phase is deleted using code outside of CAPI, it sometimes leaves the management cluster in a bad state. This seems to be caused by a race condition, and it happens more frequently if the management cluster has 3 CP nodes than if it has only 1. The result of this race condition is that the Machine never gets replaced and the remnants of the workload cluster have to be manually discovered and removed from the management cluster. (Deleting the workload cluster's clusters.cluster.x-k8s.io object from the management cluster doesn't fully clean up the workload cluster like it normally would.) This demonstrates the need to remediate the failure using code inside CAPI rather than an external process.

Environment:

  • Cluster-api version: 1.2.0
  • minikube/kind version: 0.16.0
  • Kubernetes version: (use kubectl version): 1.24.2
  • OS (e.g. from /etc/os-release):
    • Kind and kubectl are running on macOS Monterey (12.6).
    • The management cluster and workload cluster VMs are running RHEL 8.

/kind bug
/area/health
/area/machine

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 5, 2022
fabriziopandini (Member) commented Oct 6, 2022

/triage support

might be I'm missing something from the issue description, but it seems to me that what you are describing can be achieved by MHC's nodeStartupTimeout, am I wrong?

otherwise, I'm not sure I fully grasp the meaning of

CAPI's machine health checker (MHC) works well, but only after a cluster is fully created

k8s-ci-robot (Contributor):

@fabriziopandini: The label(s) triage/support cannot be applied, because the repository doesn't have them.


fabriziopandini (Member):

/triage accepted
/kind support

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. kind/support Categorizes issue or PR as a support question. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 6, 2022
mrog (Author) commented Oct 6, 2022

might be I'm missing something from the issue description, but it seems to me that what you are describing can be achieved by MHC's nodeStartupTimeout, am I wrong?

Even with nodeStartupTimeout set to 5 minutes, MHC sees the newly provisioned machines as unhealthy. It replaces them a few seconds after they transition to the Provisioned phase.

otherwise, I'm not sure I fully grasp the meaning of

CAPI's machine health checker (MHC) works well, but only after a cluster is fully created

EKS-A uses clusterctl to create a new cluster. Once this cluster creation process is complete, MHC works as designed, but it doesn't work properly during the creation of a new cluster. During the cluster creation process, MHC replaces machines that shouldn't be replaced. And if MHC is disabled after it has replaced a machine, the machine replacement can leave the cluster in a bad state. It's possible that the replacement machine isn't being configured the same way as the original, or maybe there's machine-specific information somewhere in the cluster that's not being updated when the machine is replaced.

fabriziopandini (Member):

/remove-triage support
/triage needs-information

I'm trying to better understand your problem...
@JoelSpeed could you take a look and see if anything in the description above rings a bell?

Even with nodeStartupTimeout set to 5 minutes, MHC sees the newly provisioned machines as unhealthy. It replaces them a few seconds after they transition to the Provisioned phase.

MHC marks a machine as unhealthy according to the rules defined in its spec. What is your MHC configuration? Can you provide more info about your setup (YAML resources) or logs to look at?

During the cluster creation process, MHC replaces machines that shouldn't be replaced.

Are we sure that the machine selectors are set properly and that there are no overlapping MHCs in the cluster?
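
For comparison, a MachineHealthCheck scoped to a single MachineDeployment looks roughly like this (a sketch with hypothetical names and values, not a known configuration from this issue):

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: my-cluster-md-0-unhealthy     # hypothetical name
  namespace: default                  # hypothetical namespace
spec:
  clusterName: my-cluster
  nodeStartupTimeout: 5m
  maxUnhealthy: 100%
  selector:
    matchLabels:
      # label set by the MachineDeployment controller on its Machines
      cluster.x-k8s.io/deployment-name: my-cluster-md-0
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 300s
  - type: Ready
    status: "False"
    timeout: 300s

If two MHCs have selectors that match the same Machines, both will act on them, which is why non-overlapping selectors matter.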

k8s-ci-robot (Contributor):

@fabriziopandini: Those labels are not set on the issue: triage/support


@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Oct 6, 2022
@fabriziopandini fabriziopandini removed kind/support Categorizes issue or PR as a support question. triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Oct 6, 2022
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Oct 6, 2022
k8s-ci-robot (Contributor):

@mrog: This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


mrog (Author) commented Oct 7, 2022

I reproduced the issue again using unmodified versions of EKS-A and all the k8s tools. The management cluster is named mark-mgmt, and the workload cluster that's being created is named mark-work. The attached cluster yaml file was generated by EKS-A. I used kubectl to manually apply the health checks near the beginning of the cluster creation process.

As soon as the first MD VM was running in the new workload cluster, I disabled its network adapters. The machine name was mark-work-md-0-5fff4f8596-hkjzd. MHC replaced that machine with a new one, and then began an endless loop of deleting and replacing all the machines in the cluster, even though all the other machines were fully functional.

capi-controller-manager.log.gz
machinehealthchecks-mark-work.yaml.gz
mark-work-eks-a-cluster.yaml.gz

mrog (Author) commented Oct 7, 2022

Here's another example, again using unmodified code. The setup is the same as the last one, except this time I disabled the first CP machine as it was being added to the new cluster. The machine name was mark-work-5c42l. That machine remained in the Provisioned phase for over 20 minutes before I manually deleted the cluster. During this time, the cluster creation process was blocked. Kubectl showed that the machine's status was unhealthy.

status:
  addresses:
  - address: 10.11.129.157
    type: InternalIP
  bootstrapReady: true
  conditions:
  - lastTransitionTime: "2022-10-07T21:03:28Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2022-10-07T21:02:49Z"
    status: "True"
    type: BootstrapReady
  - lastTransitionTime: "2022-10-07T21:03:28Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2022-10-07T21:02:49Z"
    reason: WaitingForNodeRef
    severity: Info
    status: "False"
    type: NodeHealthy

capi-controller-manager.log.gz
machine_status.yaml.gz
machinehealthchecks-mark-work.yaml.gz
mark-work-eks-a-cluster.yaml.gz

JoelSpeed (Contributor):

Even with nodeStartupTimeout set to 5 minutes, MHC sees the newly provisioned machines as unhealthy. It replaces them a few seconds after they transition to the Provisioned phase.

When you were testing this, did you have the MHC attached only to the management cluster? Bear in mind that the MHC needs to check both the workload and management clusters as part of its logic; if you've connected it to just the management cluster, it won't be able to review the Nodes from the workload cluster and won't operate as intended.

might be I'm missing something from the issue description, but it seems to me that what you are describing can be achieved by MHC's nodeStartupTimeout, am I wrong?

From what I've read, it seems like Fabrizio is on to something here. The MHC deliberately doesn't do anything until the cluster is bootstrapped (IIRC it's been a while), because, as mentioned above, we need the workload API to be stable before we can start making decisions.

Can you confirm if the experiments from your last two comments were with the MHC attached as intended or only attached to the management cluster as you described in the original post?

From what I can see, the deletion loop is because the MHC can't see the nodes for the Machines joining, can you confirm that the Nodes are joining the guest cluster correctly?

mrog (Author) commented Oct 12, 2022

In both scenarios, the MHC was only attached to the management cluster. In the scenario where I disabled one of the MD VMs, the corresponding Machine object never got a nodeRef value, which suggests to me that it did not join the cluster. The other Machines got nodeRef values, so it seems that they joined the cluster successfully.
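
For reference, "got a nodeRef value" means the Machine's status looked roughly like this (trimmed; the node name below is a hypothetical placeholder):

status:
  nodeRef:
    apiVersion: v1
    kind: Node
    name: mark-work-md-0-5fff4f8596-xxxxx   # placeholder node name
  phase: Running

The disabled machine never got this field and stayed in the Provisioned phase.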

I wasn't able to attach the MHC to the workload cluster because the CAPI controller and CRDs weren't yet installed on the workload cluster at that point in time.

% kubectl apply -f machinehealthchecks-mark-work.yaml
resource mapping not found for name: "mark-work-kcp-unhealthy-test" namespace: "eksa-system" from "machinehealthchecks-mark-work.yaml": no matches for kind "MachineHealthCheck" in version "cluster.x-k8s.io/v1beta1"
ensure CRDs are installed first
resource mapping not found for name: "mark-work-md-0-worker-unhealthy-test" namespace: "eksa-system" from "machinehealthchecks-mark-work.yaml": no matches for kind "MachineHealthCheck" in version "cluster.x-k8s.io/v1beta1"
ensure CRDs are installed first

In the scenario where I disabled the first CP VM, it would have been impossible to attach MHC to the workload cluster because no CP VMs were running besides the one that I sabotaged. Until the first CP Machine is in the Running phase, no other VMs get provisioned, and the cluster is completely non-functional. The corresponding Machine object also never joined the cluster.

mrog (Author) commented Oct 13, 2022

In my last comment, I said that the CAPI controller and CRDs weren't installed on the workload cluster at that point in time. That suggests that they would be installed later. However, that's not the case. CAPI never gets installed on the workload cluster, so the MHC has to monitor the workload cluster from the management cluster.

JoelSpeed (Contributor):

I think there may have been some confusion about what "attached to" means here.
The MHC runs in the management cluster and IIRC uses a service account to communicate with that cluster to look up machines etc.
The MHC also uses a client to connect to the workload cluster to look up objects such as the Nodes. It does this by fetching the kubeconfig generated by the ControlPlane implementation.

The Nodes will never exist in the management cluster so the modifications you made to the code (perhaps you can share a link?), if they are to remove the remote cluster tracker, will cause the MHC to malfunction in unpredictable ways. This is not how it was designed to operate. It needs to observe the Node objects from the workload cluster, not from the management cluster.
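
For reference, that generated kubeconfig lives in a Secret alongside the Cluster object on the management cluster; its shape is roughly (a sketch, contents elided):

apiVersion: v1
kind: Secret
type: cluster.x-k8s.io/secret
metadata:
  name: <cluster-name>-kubeconfig
  namespace: <cluster-namespace>
  labels:
    cluster.x-k8s.io/cluster-name: <cluster-name>
data:
  value: <base64-encoded kubeconfig for the workload cluster>

The MHC builds its workload-cluster client from this kubeconfig, so it can only see Nodes once the API server endpoint in that kubeconfig is reachable.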

mrog (Author) commented Oct 17, 2022

Thanks for the clarification. MHC on the management cluster isn't able to connect to the workload cluster until Cilium and kube-vip are added to it, and that doesn't happen until all the machines are seen as running. While the cluster is still being created, the CAPI controller log shows many error messages from MHC's failed attempts to contact the workload cluster.

Even if MHC could connect to the workload cluster, it still wouldn't be able to help if there's a problem with the first machine to be created. That machine is always a CP machine, and seems to get special treatment. CAPI waits for it to be in the Running phase before it allows any other machines to advance to Running.

I didn't save any of my code changes because they weren't helpful. MHC fails with or without those changes.

JoelSpeed (Contributor):

Thanks for the clarification. MHC on the management cluster isn't able to connect to the workload cluster until Cilium and kube-vip are added to it, and that doesn't happen until all the machines are seen as running. While the cluster is still being created, the CAPI controller log shows many error messages from MHC's failed attempts to contact the workload cluster.

Based on this, I don't think MHC is the component we should be looking at here. It seems that MHC just isn't going to work until the cluster is up and running. The design of the component prevents it from being used this early in the cluster bootstrap.

I wonder if there's some other way to help resolve your scenario.

mrog (Author) commented Oct 19, 2022

The conversation in #1205 suggests that remediation should be done by CAPI, and I agree with that. My own experiments showed that remediation outside of CAPI (by deleting the CAPI Machine) sometimes causes problems. My guess is that it's due to a race condition when the Machine is deleted while CAPI is busy doing something else with it. The result is that the machine never gets replaced and the cluster is left in a bad state. (Deleting the workload cluster's clusters.cluster.x-k8s.io object fails to completely clean up the cluster when it's in this state.)

It's easy to add a check in CAPI that detects when a Machine has been stuck in the Provisioned phase for too long. The second step is remediation, and I don't know how to handle that. Simply deleting the Machine and waiting for a replacement isn't enough. Some work needs to be done to make sure the replacement is added correctly, and it might involve moving some other objects back to previous states.

JoelSpeed (Contributor):

MHC currently looks at the Node conditions. If it also looked at Machine conditions, then perhaps we could use the bootstrap conditions to determine that the Machine didn't bootstrap within a certain time period. You would have to make sure the MHC gracefully handles not being able to find the Node for the Machine, though.

mrog (Author) commented Nov 1, 2022

I tried another experiment. This time, I added MHC and CNI after the first CP machine was running, and before the other machines were provisioned. Then I disabled the network interfaces on one of the new worker machines as soon as it started running. This ensured that the cluster would take longer than 5 minutes to be ready. Even with the CNI installed, MHC still marked all the machines as unhealthy (not just the disabled machine).

fabriziopandini (Member):

Can you provide your MHC configuration so someone can take a look if they have some bandwidth?

mrog (Author) commented Nov 2, 2022

I discovered a mistake in my experiment from two days ago. The EKS-A changes I made weren't always working, so the CNI might not have been installed in time. I fixed my code changes today and tried a couple more times. With the CNI working, MHC was able to detect the health of all the machines, and it remediated the unhealthy ones.

I only made worker machines unhealthy in this test. And EKS-A failed once after MHC replaced a machine, but that might be more of an EKS-A problem than a CAPI problem. I still need to test these changes with unhealthy control plane machines.

fabriziopandini (Member) commented Nov 3, 2022

Thanks for the feedback.
I'm closing the issue for now. If you find more problems, you can either re-open this issue or create a new one focused on your findings.
/close

k8s-ci-robot (Contributor):

@fabriziopandini: Closing this issue.


mrog (Author) commented Nov 3, 2022

This looks like a combination of at least two issues. I can partially fix it in EKS-A by adding the machine health checks and CNI earlier in the cluster creation process. This allows MHC to remediate worker machines. There are still sometimes failures in EKS-A after MHC does this, and I'm investigating the reason for that. So, for worker machines, this seems to be an EKS-A issue.

CP machines are another matter. When a CP machine fails to join the cluster, EKS-A gets stuck waiting for the control plane to be ready. (It waits for kubectl wait --timeout 3600.00s --for=condition=ControlPlaneReady clusters.cluster.x-k8s.io/mark-work --kubeconfig mark-mgmt/mark-mgmt-eks-a-cluster.kubeconfig -n eksa-system to complete.) Because EKS-A can't install the CNI until the control plane is ready, MHC isn't able to connect to the workload cluster. That prevents MHC from determining the machine's health.

There's a second obstacle when a CP machine needs to be remediated. CAPI refuses to remediate any CP machines until every CP machine has joined the cluster. When a CP machine fails to join the cluster, we get an error like this from CAPI:

I1103 16:24:04.908783       1 remediation.go:123] "A control plane machine needs remediation, but the current number of replicas is lower that expected. Skipping remediation" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" kubeadmControlPlane="default/jweite-test-control-plane" namespace="default" name="jweite-test-control-plane" reconcileID=fb6ee40a-c800-45ac-8483-1919b0cebdc4 Replicas=3 CurrentReplicas=2

I'm going to reopen the issue to maintain visibility while we learn more about what needs to be fixed.

/reopen

k8s-ci-robot (Contributor):

@mrog: Reopened this issue.


@k8s-ci-robot k8s-ci-robot reopened this Nov 3, 2022
fabriziopandini (Member):

EKS-A gets stuck waiting for the control plane to be ready.

From a first look, this should be fixed in EKS-A; I'm not sure what we can do to help here.

mrog (Author) commented Nov 3, 2022

I agree. That part is an EKS-A issue, and I think I have a fix for it (as long as the first CP machine succeeds). But we're still faced with CAPI's inability to remediate CP machines that fail during cluster creation.

fabriziopandini (Member):

Thanks for reporting back the feedback from EKS-A.
If the remaining part is KCP remediation while the cluster is being provisioned, then this is a duplicate of #7496.

Let's continue discussion there
/close

k8s-ci-robot (Contributor):

@fabriziopandini: Closing this issue.


mrog (Author) commented Nov 7, 2022

I agree with closing the issue at this time. Let's use #7496 for the CAPI changes. The needed EKS-A changes are at aws/eks-anywhere#3979.
