This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

Provisioning of VM extension 'vmssCSE' has timed out #1860

Closed
vijaygos opened this issue Aug 26, 2019 · 82 comments
Labels
bug Something isn't working

Comments

@vijaygos

What happened:
VMSS status is set to 'failed' with error message - "Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained."

As a result, the SLB does not allow binding a Service (type LoadBalancer) to a public IP resource. The service status is always Pending:

NAME           TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)                      AGE

my-service-1   ClusterIP      10.0.52.61    <none>        8125/UDP                     57m

my-service-2   LoadBalancer   10.0.165.41   <pending>     80:31330/TCP,443:30354/TCP   46m

What you expected to happen:
No CSE errors, and the service should bind to the given public IP resource without errors.
How to reproduce it (as minimally and precisely as possible):
No real repro steps; it happens at random when the cluster attempts to scale out and add a new VM.

Anything else we need to know?:
AKS engine version is 0.28.1
Environment:

  • Kubernetes version (use kubectl version): 1.12.2
  • Size of cluster (how many worker nodes are in the cluster?) 31
  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.)
  • Others: Azure US Government environment

While I am tempted to say this looks like a duplicate of #802, I would appreciate another look.

@vijaygos vijaygos added the bug Something isn't working label Aug 26, 2019
@welcome

welcome bot commented Aug 26, 2019

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.


@CecileRobertMichon
Contributor

@vijaygos are you sure these are logs from an instance with a failure? These logs look like a successful run to me. (Custom script finished successfully)

@vijaygos
Author

@CecileRobertMichon , I can see the VMSS status as "Failed". However, I am not sure how to determine which is the "failed instance". Is there any way to find the failed instance from the portal?

@CecileRobertMichon
Contributor

@vijaygos I don't think the portal shows which instance failed, unfortunately. We have an issue open to improve the extension logs to print the hostname, but in the meantime there's no easy way to get the instance ID that I know of; see #1496. If you are able to repro the issue by scaling a few instances at a time, that might be the easiest way to know which instance to get the logs from.
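A hedged sketch for pulling the CSE log off a specific instance without SSH, assuming the aks-engine default log path /var/log/azure/cluster-provision.log and that the RunCommand path isn't itself affected (resource names are placeholders):

az vmss run-command invoke --resource-group {RESOURCE_GROUP} --name {VMSS_NAME} --instance-id {INSTANCE_ID} \
  --command-id RunShellScript --scripts "tail -n 50 /var/log/azure/cluster-provision.log"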

@CecileRobertMichon
Contributor

Apparently you can also get to the extension status in the portal from the instance status:

[screenshot: extension status shown from the VMSS instance view in the portal]

@vijaygos
Author

Ah! Thanks @CecileRobertMichon for pointing me to that. This is rather strange: I have looked at the status of all 31 VMs in our cluster and they all show "ProvisioningState/succeeded", yet the VMSS Overview page shows a Failed status.
[screenshot: VMSS Overview page showing a Failed status]
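For anyone else trying to spot a failed instance from the CLI, a rough sketch (resource group and VMSS names are placeholders) that lists each instance's provisioning state:

az vmss list-instances --resource-group {RESOURCE_GROUP} --name {VMSS_NAME} \
  --query '[].{name:name, state:provisioningState}' -o table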

@chpspiir

We are experiencing the exact same thing. Any updates on what could be wrong?

@rsingh612

Also experiencing the same issue after updating the ServicePrincipal credentials:
'Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.'

@sylus
Contributor

sylus commented Oct 16, 2019

Also hitting this issue today, though nothing changed from the cluster's perspective. We were seeing a rate-limit issue earlier.

@jackfrancis
Member

Is there a common pattern in terms of cluster operations? Are these new cluster buildouts? Or a result of scale events on existing clusters?

@zachomedia
Contributor

Answering for @sylus - it's an existing cluster. We haven't made changes to the scale set ourselves - the last scale operation was last week. We found the issue when we had teams reporting that pods with disks weren't coming up.

It now seems that it's unable to mount the disks because it can't unmount them from the old nodes.

We seemed to hit a subscription write limit earlier today - though I'm not sure if that's related to this issue or not (if it was retrying too often).

@MoergJ

MoergJ commented Oct 17, 2019

We are experiencing the same issue: existing cluster, no changes to the scale sets, but pods can't mount data disks and keep hanging in 'Init'.

@lorenzo-biava
Member

Same here; it worked fine last evening, and today no VM is able to attach any disks, showing the "Provisioning of VM extension 'vmssCSE' has timed out" failure state on the VMSS (the AKS cluster is in westeurope).

@sylus
Contributor

sylus commented Oct 17, 2019

We filed an Azure support ticket and are supposed to get a call back this morning; we'll post back any info.

@rsingh612

I was able to restart our VMSS nodes gracefully by running the following az CLI command. Previously, restarting the VMSS node through the GUI also resulted in "Provisioning of VM extension 'vmssCSE' has timed out."

Command:
az vmss update-instances -g {RESOUCE_GROUP} --name {VMSS_AGENTPOOL} --instance-id {INSTANCE_ID}

After that, the cluster status resolved: all hung disks freed themselves and reattached without issue.
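In case it helps others, a hedged loop variant of the same remediation over every instance (resource names are placeholders; newer az releases spell the flag --instance-ids):

for id in $(az vmss list-instances -g {RESOURCE_GROUP} -n {VMSS_AGENTPOOL} --query '[].instanceId' -o tsv); do
  az vmss update-instances -g {RESOURCE_GROUP} --name {VMSS_AGENTPOOL} --instance-ids "$id"
done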

@jarobar435

Unfortunately, in my case, the outcome of the above wasn't any different from the GUI-triggered process:

$ az vmss update-instances -g {group} --name {vmss} --instance-id {id}
Deployment failed. Correlation ID: {id}. Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.

@sharkymcdongles

We have the same issue, except it affects both the vmssCSE and RunCommandLinux extensions. We didn't enable or disable any extensions. I tried the update command, but it never completes successfully, failing with this:

cli.azure.cli.core.util : Deployment failed. Correlation ID: abeff543-0a78-4c26-aa0f-dc2d01b71222. Provisioning of VM extension 'RunCommandLinux' has timed out. Extension installation may be taking too long, or extension status could not be obtained.
Deployment failed. Correlation ID: abeff543-0a78-4c26-aa0f-dc2d01b71222. Provisioning of VM extension 'RunCommandLinux' has timed out. Extension installation may be taking too long, or extension status could not be obtained.

I tried to reinstall the extension with force and it also failed. This is a critical bug and it is affecting our production. How has this not been addressed? We don't need this extension; nobody needs to run Linux commands via a CLI extension, and it should not be there by default. Forcing this on people is simply bad design.

@jackfrancis
Member

How many folks on this thread can correlate these failures with active Azure API rate limiting events?

@rsingh612, were you able to determine operational differences between your original failure events, and your successful (if manual) az vmss update-instances remediation?

@sharkymcdongles

@jackfrancis we saw this:

failed with failed to get azure instance id for node "aks-e16sv3-79416110-vmss00000w" (compute.VirtualMachineScaleSetVMsClient#Get: Failure responding to request: StatusCode=429 -- Original Error: autorest/azure: Service returned an error. Status=429 Code="OperationNotAllowed" Message="The server rejected the request because too many requests have been received for this subscription." Details=[{"code":"TooManyRequests","message":"{\"operationGroup\":\"GetVMScaleSetVM30Min\",\"startTime\":\"2019-10-18T08:32:36.6723441+00:00\",\"endTime\":\"2019-10-18T08:47:36.6723441+00:00\",\"allowedRequestCount\":2500,\"measuredRequestCount\":2843}","target":"GetVMScaleSetVM30Min"}] InnerError={"internalErrorCode":"TooManyRequestsReceived"})

I assumed what happened was that a VM extension auto-updated, leading to the VMSS entering a failed state. While in a failed state, disk attach/detach operations fail but keep retrying. My hypothesis was that, because of the failed state, the disk retries triggered the API limit, but that may be an incorrect hypothesis.

@jackfrancis
Member

@feiskyer @khenidak will cloudprovider backoff v2 help recover from these events?

@sylus
Contributor

sylus commented Oct 18, 2019

We are now getting this on a completely different cluster that has been up for months, after resolving the issue in another cluster in a separate subscription. Are we 100% sure there was no rollout change? It seems suspicious with all the people here having the same problem, and over in:

Azure/AKS#1278

Obviously something has had to change; is Microsoft looking into this?

It almost seems as if all the extension statuses for the VMSS were lost all of a sudden.

Failed to upgrade virtual machine instance 'k8s-linuxpool1-13820493-vmss_1'. Error: Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.

On the instance itself however it shows this:

vmssCSE (Microsoft.Azure.Extensions.CustomScript, 2.0.7)
Provisioning succeeded
Info
ProvisioningState/succeeded

We can't even enable boot diagnostics; going to try slowly replacing each instance.

@sylus
Contributor

sylus commented Oct 18, 2019

We were able to add a node, and that process seems to have fixed the scale set, at least for now. ^_^

@devigned
Member

We are actively working on a resolution for this issue.

@sharkymcdongles

Adding a node didn't work for us sadly.

@devigned Any idea on when you might expect a resolution to this and whether or not it will involve upgrading or migrating to a new aks / nodepool?

@devigned
Member

It seems like most of these issues are related to Azure API request limits being exceeded. It would be extremely helpful if everyone experiencing this issue could reply with the following:

  • Number of nodes in your cluster?
  • Number of disks attached to those nodes?
  • Is cluster auto scaler enabled?
  • What operation did you run when this issue was triggered?
  • What version of aks-engine?
  • What version of K8s?

Thank you all. We will do our best to root cause this and get a fix out as soon as possible.

@sharkymcdongles

Number of nodes in your cluster?
30

Number of disks attached to those nodes?
It varies, but we have 131 PVCs with 5 or so per node.

Is cluster auto scaler enabled?
No, it is disabled.

What operation did you run when this issue was triggered?
We didn't do anything. The cluster just stopped mounting PVCs and said there was an API limit. We then checked and saw the VMSS was in a failed state due to the VM extensions RunCommandLinux and vmssCSE.

What version of aks-engine?
aksEngineVersion: v0.39.2-aks

What version of K8s?
1.14.6

@devigned
Member

devigned commented Oct 18, 2019

@sharkymcdongles give this a run please:

kubectl logs {{ your master named like: kube-controller-manager-k8s-master-XXXXXXX-0 }} \
-n kube-system | grep azure

Please report back any interesting rate limiting logs you find.

@sharkymcdongles

@devigned this is AKS, I don't have access to the masters.

this is the affected vmss in azure:

/subscriptions/63ffcd66-455e-4369-87e6-a4a55983f44e/resourceGroups/MC_prod-eu-mip-aks-rg_euprod_westeurope/providers/Microsoft.Compute/virtualMachineScaleSets/aks-e16sv3-79416110-vmss

and aks:

/subscriptions/63ffcd66-455e-4369-87e6-a4a55983f44e/resourcegroups/prod-eu-mip-aks-rg/providers/Microsoft.ContainerService/managedClusters/euprod

@zachomedia
Contributor

@jackfrancis We upgraded to 1.15.5 with aks-engine 0.42.2 and enabled the v2 backoff. But it seems that when it hits a rate-limit scenario it just hammers the Azure API, and the only way to recover is to turn off the controller-manager for a while to let it clear.

For example:

The server rejected the request because too many requests have been received for this subscription. (Code: OperationNotAllowed) {"operationgroup":"HighCostGetVMScaleSet30Min","starttime":"2019-10-23T14:18:11.960853+00:00","endtime":"2019-10-23T14:33:11.960853+00:00","allowedrequestcount":900,"measuredrequestcount":3157} (Code: TooManyRequests, Target: HighCostGetVMScaleSet30Min)

(We also hit the HighCostGetVMScaleSet3Min one too)
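For reference, a hedged sketch of the "turn off controller-manager for a while" step on an aks-engine master, assuming the control plane runs as static pods from /etc/kubernetes/manifests (the aks-engine default):

# on the master node
sudo mv /etc/kubernetes/manifests/kube-controller-manager.yaml /tmp/   # kubelet stops the static pod
# wait for the throttling window (e.g. the 30-minute bucket) to expire
sudo mv /tmp/kube-controller-manager.yaml /etc/kubernetes/manifests/   # kubelet brings it back up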

@AustinSmart

Will the fixes being addressed over at the AKS issue be applicable here?

@sylus
Contributor

sylus commented Oct 23, 2019

Yeah, as @zachomedia said, it appears worse, and we barely get 15-20 minutes of proper Kubernetes operation before being rate-limited again. @devigned @jackfrancis can we arbitrarily increase our limit?

Our clients are getting pretty insistent (lol) about the state of things this week, and it is putting a lot of pressure on us. We're really worried this will result in us having to move workloads somewhere else, maybe not on VMSS; but we love the rolling upgrades, so we don't particularly want to. We have been stalled for the past 2-3 days. We can maybe escalate our support ticket to priority.

Note: We do seem to be able to find the disk potentially causing the problem, but there are a lot of moving parts, so we're trying to isolate it further.

@devigned
Member

/cc @aramase for insight into the Cloud Provider for Azure which is what's issuing the calls to ARM.

@jackfrancis
Member

For folks who are having issues w/ VMSS CSE timeouts (not necessarily related to throttling), there has been an identified CSE bug being triaged. This bug correlates with folks experiencing this issue last week. (This CSE bug has nothing to do w/ AKS Engine's CSE script(s).)

If you have one or more VMSS instances in this state, please try manually re-imaging the instance. We've seen that workaround help restore nodes for folks.

And please report back if that unblocks you.

And apologies. :(

@zachomedia
Contributor

OK, so I managed to get our cluster back into a good state for now. It seems that running most operations that cause a disk unmount can trigger the problem again (@sylus can fill in more there).

Basically, to find out which disk was stuck, I looked at the controller-manager logs and saw a bunch of:

I1023 20:02:29.782144       1 attacher.go:89] Attach volume "/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME" to instance "k8s-linuxpool1-12345678-vmss00000c" failed with disk(/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME) already attached to node(/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-12345678-vmss/virtualMachines/k8s-linuxpool1-12345678-vmss_10), could not be attached to node(k8s-linuxpool1-12345678-vmss00000c)
I1023 20:02:29.782187       1 actual_state_of_world.go:322] Volume "kubernetes.io/azure-disk//subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME" is already added to attachedVolume list to node "k8s-linuxpool1-12345678-vmss_10", update device path ""
E1023 20:02:29.782390       1 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/azure-disk//subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME\"" failed. No retries permitted until 2019-10-23 20:02:30.282260017 +0000 UTC m=+391.358857249 (durationBeforeRetry 500ms). Error: "AttachVolume.Attach failed for volume \"pvc-$PVC_NAME\" (UniqueName: \"kubernetes.io/azure-disk//subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME\") from node \"k8s-linuxpool1-12345678-vmss00000c\" : disk(/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME) already attached to node(/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-12345678-vmss/virtualMachines/k8s-linuxpool1-12345678-vmss_10), could not be attached to node(k8s-linuxpool1-12345678-vmss00000c)"
I1023 20:02:29.782447       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"adp", Name:"appname-7587887f5d-8pndl", UID:"2b777c28-859e-45cd-984f-af2734a436a5", APIVersion:"v1", ResourceVersion:"10305952", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' AttachVolume.Attach failed for volume "pvc-$PVC_NAME" : disk(/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/disks/k8s-cancentral-01-development-dyna-pvc-$PVC_NAME) already attached to node(/subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-12345678-vmss/virtualMachines/k8s-linuxpool1-12345678-vmss_10), could not be attached to node(k8s-linuxpool1-12345678-vmss00000c)

To fix it, I ran az vmss list-instances --resource-group k8s-cancentral-01-dev-rg --name k8s-linuxpool1-12345678-vmss --query '[].[name, storageProfile.dataDisks[]]' | less, found the disk and got the lun number. I then ran az vmss disk detach --resource-group k8s-cancentral-01-dev-rg --vmss-name k8s-linuxpool1-12345678-vmss --instance-id 12 --lun 3 to detach it.

Once it was detached, the cluster slowly started to recover.
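For anyone repeating this, a slightly more targeted sketch of the same LUN lookup (the $PVC_NAME variable is a placeholder), filtering the data-disk list instead of paging through it:

az vmss list-instances --resource-group k8s-cancentral-01-dev-rg --name k8s-linuxpool1-12345678-vmss \
  --query "[].{instance:name, stuck:storageProfile.dataDisks[?contains(name, '$PVC_NAME')].{disk:name, lun:lun}}" -o json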

@sylus
Contributor

sylus commented Oct 24, 2019

Kubernetes Version: v1.15.5 (1 master, 4 nodes)
AKS Engine: v0.42.2
Kernel Version: 5.0.0-1023-azure
OS Image: Ubuntu 18.04.3 LTS
Date Issue Started: Last week (Oct 15, 2019), on clusters that had been running without issue for 3 months

Note: We don't seem to have the CSE issue anymore, but we're still keeping this here for the moment since it all seems related. We also added cloudProviderBackoffMode: v2, as @jackfrancis indicated, and removed cloudProviderBackoffExponent and cloudProviderBackoffJitter.

Re: the comment above, the disk (un)mount does seem to be the problem. I was able to reproduce a base case by just deleting a few pods and/or running a helm upgrade on a deployment, and that started to trigger the following disk-unmount errors almost right away.

I1023 23:49:18.644247       1 pv_controller.go:1270] isVolumeReleased[pvc-b2220d20-9e95-4973-923d-95cc6e49ff4c]: volume is released
I1023 23:49:18.644258       1 pv_controller.go:1270] isVolumeReleased[pvc-653fdf5a-7408-460e-89d3-fdd0a6dd5fdf]: volume is released
E1023 23:49:19.821907       1 goroutinemap.go:150] Operation for "delete-pvc-b2220d20-9e95-4973-923d-95cc6e49ff4c[52a16d3e-b5dd-4cc1-a64e-b03f6d61948b]" failed. No retries permitted until 2019-10-23 23:51:21.821868366 +0000 UTC m=+12222.629161185 (durationBeforeRetry 2m2s). Error: "compute.DisksClient#Delete: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code=\"OperationNotAllowed\" Message=\"Disk k8s-cancentral-01-development-dyna-pvc-b2220d20-9e95-4973-923d-95cc6e49ff4c is attached to VM /subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-28391316-vmss/virtualMachines/k8s-linuxpool1-28391316-vmss_12.\""
E1023 23:49:19.826714       1 goroutinemap.go:150] Operation for "delete-pvc-653fdf5a-7408-460e-89d3-fdd0a6dd5fdf[4c641582-1171-4c98-8189-29185623fc1c]" failed. No retries permitted until 2019-10-23 23:51:21.826677075 +0000 UTC m=+12222.633969894 (durationBeforeRetry 2m2s). Error: "compute.DisksClient#Delete: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code=\"OperationNotAllowed\" Message=\"Disk k8s-cancentral-01-development-dyna-pvc-653fdf5a-7408-460e-89d3-fdd0a6dd5fdf is attached to VM /subscriptions/$SUBSCRIPTION/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-28391316-vmss/virtualMachines/k8s-linuxpool1-28391316-vmss_12.\""

The suspicion is that, with multiple teams (re)deploying their apps, enough of these disk failures eventually make us hit the rate limits set by Azure. Other operations against the VMSS then don't succeed once it gets into this state, and, as mentioned above, we need to stop the controller-manager pod for a while to clear the rate limit, as illustrated below.

The server rejected the request because too many requests have been received for this subscription. (Code: OperationNotAllowed) {"operationgroup":"HighCostGetVMScaleSet30Min","starttime":"2019-10-23T14:18:11.960853+00:00","endtime":"2019-10-23T14:33:11.960853+00:00","allowedrequestcount":900,"measuredrequestcount":3157} (Code: TooManyRequests, Target: HighCostGetVMScaleSet30Min)

The controller-manager logs list the PVC and the instance ID of the disk that can't be detached, and we can use the PVC name to find the LUN with the command below.

az vmss list-instances --resource-group k8s-cancentral-01-dev-rg --name k8s-linuxpool1-12345678-vmss --query '[].[name, storageProfile.dataDisks[]]' 

We then have to run the following for ALL disks listed in the controller-manager that have this problem:

az vmss disk detach --resource-group k8s-cancentral-01-dev-rg --vmss-name k8s-linuxpool1-12345678-vmss --instance-id $ID --lun $LUN
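Since several disks are usually affected at once, a hedged loop form of the same detach (the instance-id/LUN pairs below are placeholders taken from the controller-manager errors):

for pair in "12 3" "10 5"; do
  set -- $pair   # $1 = instance id, $2 = lun
  az vmss disk detach --resource-group k8s-cancentral-01-dev-rg --vmss-name k8s-linuxpool1-12345678-vmss --instance-id "$1" --lun "$2"
done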

The cluster is then back in a working state until the next deployment triggers the PVC issue again; rinse and repeat ^_^

Related Issues

a) All of this is explained in further detail by Microsoft over at the AKS issue: Azure/AKS#1278 (comment)

b) Additionally, this looks directly related as well, although our cluster is an order of magnitude smaller: kubernetes-sigs/cloud-provider-azure#247

@jackfrancis
Member

@sylus @zachomedia in your failure scenarios, are you ever encountering this error:

Cannot attach data disk '<disk_id>' to VM '<vmss_id>' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."

And if so, are you observing that the disk it's complaining about seems, in fact, to be totally unattached?

We encountered this with another customer, and were able to follow your guidance to manually detach the offending disk (even though it wasn't attached to anything! — we detached it from the vmss instance id that it was trying to attach itself to 🤷‍♂ ).

In any event, FYI for folks continuing to struggle with this.
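A hedged way to double-check whether such a disk is really unattached (resource group and disk name are placeholders):

az disk show --resource-group {RESOURCE_GROUP} --name {DISK_NAME} --query '{state:diskState, attachedTo:managedBy}' -o json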

@zachomedia
Contributor

@jackfrancis Yeah, we've seen that error in our logs too. I don't think we've ever checked whether it was actually attached or not; usually we just detach it through the CLI.

We did have a weird state today where apparently one of our instances had a disk attached that no longer existed, so all other disk operations failed. Once we removed that attachment, it recovered.

@jackfrancis
Member

@zachomedia How did you get the lun number in the case where the disk is not actually attached to any VMSS instances?

In our troubleshooting, the following command didn't yield the LUN in such a scenario:

az vmss list-instances --resource-group <rg> --name <vmss_name> --query '[].[name, storageProfile.dataDisks[]]'

@zachomedia
Contributor

@jackfrancis Oh, I see, for all of our cases the disk was in the list. So I guess that means it was attached.

@jackfrancis
Member

@sylus @zachomedia do you think this is an appropriate repro?:

  1. Install this statefulset on a cluster w/ 50 VMSS nodes:
$ cat statefulset.yaml 
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx
  serviceName: "nginx"
  podManagementPolicy: "Parallel"
  replicas: 50
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "managed-standard"
      resources:
        requests:
          storage: 1Gi

(note the ratio of nodes-replicas is 1:1)

  2. When pods come online, start deleting them
  3. StatefulSet reconciliation will spin up more pods to fulfill the replica count

I wonder if the above will induce the weird zombie detach state we're seeing.
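A hedged sketch of steps 2 and 3 (the app=nginx label and the web name come from the spec above; everything else is assumed):

kubectl apply -f statefulset.yaml
kubectl rollout status statefulset/web          # wait for the 50 replicas to come online
kubectl delete pod -l app=nginx --wait=false    # then start deleting pods
kubectl get pods -l app=nginx -w                # watch the StatefulSet recreate them, re-attaching the disks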

@zachomedia
Contributor

@jackfrancis

I would say that you should have a pretty reasonable chance at reproducing the issue with that setup. Our cluster is much smaller (about 5 nodes) and usually just a couple of pods with disks being deleted can trigger it.

@jackfrancis
Member

@zachomedia Node count is static? I.e., disk re-attachment operations aren't happening as a result of underlying VMSS instances disappearing, re-appearing, etc?

@zachomedia
Contributor

@jackfrancis That's correct, node count is static.

@jackfrancis
Member

(So far unable to repro, but will keep trying.)

It's also possible that disk detach/attach operations during throttle events are the edge case causing this behavior (my test cluster is not being actively throttled atm).

@zachomedia
Contributor

@jackfrancis So, something you can try: it seems most of our problems now stem from PVCs being deleted (one of our teams deletes their deployments and re-creates them right now). We seem to hit two things:

  1. A disk is still attached to a VMSS instance but apparently no longer exists. No disk operations succeed until we detach that disk (the error appears when we attempt to manually detach a different disk that was failing - sorry, I forgot to save the message)
  2. Some deletes seem to never unmount the disk (maybe it passed some timeouts and gave up trying to unmount). See a bunch of:
I1025 15:21:21.911563       1 pv_controller.go:1270] isVolumeReleased[pvc-PVC1]: volume is released
I1025 15:21:21.911914       1 pv_controller.go:1270] isVolumeReleased[pvc-PVC2]: volume is released
I1025 15:21:21.913827       1 pv_controller.go:1270] isVolumeReleased[pvc-PVC3]: volume is released
E1025 15:21:23.028284       1 goroutinemap.go:150] Operation for "delete-pvc-PVC2[5dcfd7e1-e61b-4206-9cac-6e7758e3366c]" failed. No retries permitted until 2019-10-25 15:23:25.028194036 +0000 UTC m=+154545.835486755 (durationBeforeRetry 2m2s). Error: "compute.DisksClient#Delete: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code=\"OperationNotAllowed\" Message=\"Disk k8s-cancentral-01-development-dyna-pvc-PVC2 is attached to VM /subscriptions/<subscription>/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-12345678-vmss/virtualMachines/k8s-linuxpool1-12345678-vmss_12.\""
E1025 15:21:23.029986       1 goroutinemap.go:150] Operation for "delete-pvc-PVC3[3507cbda-868e-4198-b771-620234d258b5]" failed. No retries permitted until 2019-10-25 15:23:25.029908242 +0000 UTC m=+154545.837201061 (durationBeforeRetry 2m2s). Error: "compute.DisksClient#Delete: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code=\"OperationNotAllowed\" Message=\"Disk k8s-cancentral-01-development-dyna-pvc-PVC3 is attached to VM /subscriptions/<subscription>/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-12345678-vmss/virtualMachines/k8s-linuxpool1-12345678-vmss_12.\""
E1025 15:21:23.069595       1 goroutinemap.go:150] Operation for "delete-pvc-PVC1[4d9dbfa3-1dba-4478-9eae-f475d033257a]" failed. No retries permitted until 2019-10-25 15:23:25.069557061 +0000 UTC m=+154545.876849880 (durationBeforeRetry 2m2s). Error: "compute.DisksClient#Delete: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code=\"OperationNotAllowed\" Message=\"Disk k8s-cancentral-01-development-dyna-pvc-PVC1 is attached to VM /subscriptions/<subscription>/resourceGroups/k8s-cancentral-01-dev-rg/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-linuxpool1-12345678-vmss/virtualMachines/k8s-linuxpool1-12345678-vmss_12.\""

@andyzhangx
Contributor

@zachomedia You need to drain k8s-linuxpool1-12345678-vmss_12 and then reimage it to clean up that VMSS instance's state as a workaround.
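A hedged sketch of that drain + reimage workaround (assuming VMSS instance _12 maps to node suffix 00000c, i.e. 12 in base 36; double-check the mapping on your cluster):

kubectl drain k8s-linuxpool1-12345678-vmss00000c --ignore-daemonsets --delete-local-data
az vmss reimage --resource-group k8s-cancentral-01-dev-rg --name k8s-linuxpool1-12345678-vmss --instance-id 12
kubectl uncordon k8s-linuxpool1-12345678-vmss00000c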

@sylus
Contributor

sylus commented Oct 29, 2019

Problems again today, with 5 of our 6 VMs in a failed state, and we can't even reimage because we get the error below. I also launched a few AKS clusters over the weekend, and as soon as they were turned off overnight they all ended up in a failed state with disk issues. Really hoping a fix is forthcoming; this is plainly reproducible.

Failed to reimage virtual machine instances k8s-linuxpool1-12345678-vmss_12, k8s-linuxpool1-12345678-vmss_10, k8s-linuxpool1-12345678-vmss_9, k8s-linuxpool1-12345678-vmss_11. Error: The processing of VM 'k8s-linuxpool1-12345678-vmss_10' is halted because of one or more disk processing errors encountered by VM 'k8s-linuxpool1-12345678-vmss_12' in the same Availability Set. Please resolve the error with VM 'k8s-linuxpool1-12345678-vmss_12' before retrying the operation.

@jackfrancis
Member

@sylus is the StatefulSet spec here not a viable repro input for inducing this symptom on a test cluster?

#2221

As the issue describes, I was able to witness some badness (along with the working remediation steps I came up with at the time), but I haven't been able to reliably repeat all the badness so many folks are seeing now. I would love to get that repro process so that we can more effectively help drive fixes.

Thanks for hanging in there. :/

@craiglpeters
Contributor

A VMSS bug in the update on Oct 17 was identified and remediated globally over Oct 28 and 29. Disks should no longer be stuck in the 'detaching' state in VMSS, and so any Kubernetes operations should now be able to proceed without running into this issue. If you observe any new instances of this same problem please reopen this bug and I'll work to determine the cause.

@mbarry-msdn

We're getting this error with version 1.22.6: "Provisioning of VM extension vmssCSE has timed out. Extension provisioning has taken too long to complete. The extension last reported "Plugin enabled". More information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot"

@Lu1234

Lu1234 commented Nov 3, 2022

We're getting this error with version 1.22.6: "Provisioning of VM extension vmssCSE has timed out. Extension provisioning has taken too long to complete. The extension last reported "Plugin enabled". More information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot"

I'm facing the same error while nothing has been changed on the cluster; I really need some advice on this.

@sharkymcdongles

sharkymcdongles commented Nov 3, 2022 via email

@CecileRobertMichon
Contributor

Hi all, I suggest opening a new issue in https://github.com/Azure/AKS/issues with details of the problem/error you are facing.

I want to make sure you're getting the help you need. This is a closed issue from 3 years ago in a deprecated project (https://github.com/Azure/aks-engine#project-status) so commenting on here likely won't get the right people to look into it.
