Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the issue where rancher-machine does not clean up vSphere VMs that are in failed status #254

Merged
merged 2 commits into from
Jul 31, 2024

Conversation

jiaqiluo
Copy link
Member

@jiaqiluo jiaqiluo commented Jul 22, 2024

Issue

rancher/rancher#44212
Possibly: rancher/rancher#46340

Problem

In general, rancher-machine takes at least three steps to provision a vSphere machine: first, create the VM instance; then customize the VM's configuration; and finally bootstrap the VM.
Previously, rancher-machine saves the VM's ID only after the bootstrapping step succeeds. If rancher-machine creates but fails to start the VM, it will not destroy the VM because the VM's uuid is missing.

Solution

Now, rancher-machine retrieves and saves the uuid as machineID as soon as the VM is created, and saves the updated Host object to the store even if the creation process fails. This ensures the machineID is always preserved so that rancher-machine can find and destroy the VM on the vSphere server side properly.

Validation

Steps:

  • build rancher-machine for Linux amd64 from this branch
  • run Rancher v2.8.5 in docker-install mode
  • replace the rancher-machine binary in the container by using the docker scp command
  • create vSphere node template with four different creation methods: Deploy from template: Data Center, Deploy from template: Content Library, Clone an existing virtual machine, and Install from boot2docker ISO(Legacy)
  • on the node template set the CPUs to 36 to ensure the VM can NOT be assigned to any vSpherer server
  • provision RKE1 vSphere node-driver clusters with those 4 node templates
  • provision RKE2 vSphere node-driver clusters with those 4 node templates

Results:

During the cluster provisioning process, Rancher keeps creating new nodes and removing old ones from the DS clusters, meanwhile creating new VMs and removing old ones on the vSphere server.

We also see the following messages in Rancher logs:

2024/07/23 15:33:37 [INFO] [node-controller] Removing node jiaqi-a5
2024/07/23 15:33:37 [INFO] [node-controller] About to remove jiaqi-a5
2024/07/23 15:33:38 [INFO] [node-controller] (jiaqi-a5) Using datacenter /Datacenter
2024/07/23 15:33:39 [INFO] [node-controller] (jiaqi-a5) Using network /Datacenter/network/VM Network
2024/07/23 15:33:44 [INFO] [node-controller] (jiaqi-a5) Using hostsystem /Datacenter/host/Cluster/xx.xx.xx.xx
2024/07/23 15:33:44 [INFO] [node-controller] (jiaqi-a5) Using ResourcePool /Datacenter/host/Cluster/Resources/jiaqi-resource-pool
2024/07/23 15:33:48 [INFO] [node-controller] (jiaqi-a5) Removing user-data.iso
2024/07/23 15:33:49 [INFO] [node-controller] (jiaqi-a5) Destroying VM jiaqi-a5 (ID: 42183429-a34e-3fce-7857-a912af50c0af)
2024/07/23 15:33:51 [INFO] [node-controller] Successfully removed jiaqi-a5
2024/07/23 15:33:51 [INFO] [node-controller] (jiaqi-a5) Closing plugin on server side
2024/07/23 15:33:51 [INFO] [node-controller] (jiaqi-a5) Closing plugin on server side
2024/07/23 15:33:51 [INFO] [node-controller] Removing node jiaqi-a5 done
2024/07/23 15:33:51 [DEBUG] Cleaning up [management-state/node/nodes/jiaqi-a5]

… the provision fails.

In general, rancher-machine takes at least three steps to provision a vSphere machine: first, create the VM instance; then customize the VM's configuration; and finally bootstrap the VM.
Previously, rancher-machine saves the VM's ID only after the bootstrapping step succeeds. If rancher-machine creates but fails to start the VM, it will not destroy the VM because the VM's uuid is missing.

Now, rancher-machine retrieves and saves the uuid as machineID as soon as the VM is created, and saves the updated Host object to the store even if the creation process fails. This ensures the machineID is always preserved so that rancher-machine can find and destroy the VM on the vSphere server side properly.
@jiaqiluo jiaqiluo marked this pull request as ready for review July 23, 2024 22:52
@jiaqiluo jiaqiluo requested a review from a team July 23, 2024 22:52
@@ -136,6 +136,11 @@ func (api *Client) Create(h *host.Host) error {
log.Info("Creating machine...")

if err := api.performCreate(h); err != nil {
// it is possible that the VM is instantiated but fails to bootstrap,
// save the machine to the store, so the VM and associated resources can be found and destroyed later
if err := api.Save(h); err != nil {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can be a little more nuanced with this. We should try to analyze the error we've received downstream (i.e. in the driver) to determine whether we can safely save the data we have retrieved.

Notably, I believe this will cause a subsequent rm to accidentally tear down a machine that shouldn't be torn down, if it matches the machine name specified on the create command and the local rancher-machine state has no record of it.

If we can determine what type of error we've received from govmomi, then we can more safely determine whether we should retrieve/safely store the machine ID.

Copy link
Member Author

@jiaqiluo jiaqiluo Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize that it is not a good idea to save the machine when api.performCreate(h) fails because the driver being invoked by api.performCreate(h) could be any supported driver depending on the --driver flag being passed to the rancher-machine rm command. Such change will affect all drives by theory.

Each driver has different requirements for creating and removing a VM, I don't see a universal pattern for the errors or error messages returned when the creation fails.

Even more interestingly, the Amazonec2 driver removes the failed VM and associated resources if the creation fails (code).

That being said, I think instead of the above change, I should keep the changes inside the vmwarevsphere driver.

I see 2 options to implement it:

option 1: change the creation function in the vmwarevsphere driver such that it removes the VM and associated resources automatically if the creation fails.
I tend to treat this as a (breaking) behavior change rather than a bug fixing, so we have option 2.

option 2: add a new driver-level flag --vmwarevsphere-cleanup-on-creation-failure which is false by default, if the flag is passed with the value true, then rancher-machine will remove the VM and associated resource if creation fails. On Rancher side, we need to expose the flag on the Vsphere node template. For reference, this is similar to how we add the support for graceful shutdown: rancher/rancher#42308.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @Oats87 @snasovich, I'd like to hear your thoughts about the above ^^

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No new driver flags please, they are a headache to trickle through the various release lines.

I don't see the issue with adding logic to performCreate to check the type of error and save if it's a specific type of error. For all of the other drivers, it won't save. You can see how we do this in the planner here: https://github.com/rancher/rancher/blob/release/v2.9/pkg/capr/planner/errors.go

Copy link
Member Author

@jiaqiluo jiaqiluo Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @Oats87,

Notably, I believe this will cause a subsequent rm to accidentally tear down a machine that shouldn't be torn down, if it matches the machine name specified on the create command and the local rancher-machine state has no record of it.

Further testing shows that the above scenario will not happen because rancher-machine needs the machineID to destroy the VM on the vSphere side, and it doesn't have the machineID if the VM is not created by it or if the creation fails due to name collision.

In the case of name collision, the rancher-machine create -d vmwarevsphere ... xxx command will return the following error:

 "Error creating machine: Error in driver during machine creation: The name 'xxx' already exists."

And more importantly, no machineID will be returned from govmomi.

Now if we run rancher-machine rm -y xxx, it will not attempt to remove the VM from vSphere because it doesn't have the machineID, we will see the following message that proves that:

"(xxx) MachineID is empty, skipping removing VM"

rancher-machine will get and save the machineID only if it creates the VM successfully, no matter if it can start the VM at a later step or not.

That being said, I think we don't need to detect the error types.

Besides that, the PR is updated such that it attempts to save the machine only if the driver is vmwarevsphere, this way all other drives will not be affected.

I hope I didn't misunderstand anything. Please let me what you think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, @Oats87 suggested another more general approach: define a new error type, which is errResourcesCreated in this case, that represents the failure where external resources have been created, return this error at the appropriate places in the vSphere driver, and then check the error types instead of the driver name to determine if we should save the machine config when the creation fails.

I attempted to implement the approach but it turned out not to work. The function api.performCreate(h) returns an error whose type is rpc.ServerError which is a string that implements the error interface. it indicates that the rancher-machine does not invoke the underlying driver directly, instead, it calls an RPC server and the server handles the requests to the underlying driver. When any error is returned from the driver, the RPC server creates and returns an error of type rpc.ServerError and wipes out all other information, including the error type, about the original error. As a result, we could not use the errors.As(err, &errResourcesCreated) method to check if the error is of the type errResourcesCreated.

That being said, we decided to stay with the current approach.

@jiaqiluo jiaqiluo requested a review from Oats87 July 26, 2024 22:46
@jakefhyde jakefhyde self-requested a review July 30, 2024 16:24
@jiaqiluo jiaqiluo requested a review from a team July 30, 2024 23:53
@jiaqiluo jiaqiluo merged commit 42a1241 into rancher:master Jul 31, 2024
1 check passed
@jiaqiluo jiaqiluo deleted the fix-vsphere branch July 31, 2024 23:31
@janeczku
Copy link

@jiaqiluo Could you please check if this is expected to fix the following issue: rancher/rancher#46210?

@jiaqiluo
Copy link
Member Author

@jiaqiluo Could you please check if this is expected to fix the following issue: rancher/rancher#46210?

@janeczku, yes, this PR should fix the linked issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants