This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

Provisioning of VM extension "cse" Fails #802

Closed
barakAtSoluto opened this issue Mar 20, 2019 · 4 comments

Comments

@barakAtSoluto

This is an ISSUE


What version of aks-engine?:
Version: v0.32.0
GitCommit: c2b6148
GitTreeState: clean


Kubernetes version: 1.11.8

What happened:
When running aks-engine deploy, most of the deployment completes after ~5 min, but one of the master nodes stays stuck installing VM extensions. Eventually the deployment fails.

The error from the ARM failed deployment:

{
   "id":"<path_to_RG>/providers/Microsoft.Resources/deployments/<RG_name>-1122390847/operations/88B1F2D60893E36E",
   "operationId":"88B1F2D60893E36E",
   "properties":{
      "provisioningOperation":"Create",
      "provisioningState":"Failed",
      "timestamp":"2019-03-19T15:05:02.9439217Z",
      "duration":"PT1H30M30.253074S",
      "trackingId":"9a393c08-d428-495a-93b3-341e522e2491",
      "serviceRequestId":"107860d4-cdff-4157-b202-b0fdbb4312b1",
      "statusCode":"Conflict",
      "statusMessage":{
         "status":"Failed",
         "error":{
            "code":"ResourceDeploymentFailure",
            "message":"The resource operation completed with terminal provisioning state 'Failed'.",
            "details":[
               {
                  "code":"VMExtensionProvisioningTimeout",
                  "message":"Provisioning of VM extension 'cse-master-0' has timed out. Extension installation may be taking too long, or extension status could not be obtained."
               }
            ]
         }
      },
      "targetResource":{
         "id":"<path_to_RG>/providers/Microsoft.Compute/virtualMachines/k8s-master-24726254-0/extensions/cse-master-0",
         "resourceType":"Microsoft.Compute/virtualMachines/extensions",
         "resourceName":"k8s-master-24726254-0/cse-master-0"
      }
   }
}
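
For anyone triaging a similar failure, here is a minimal Python sketch that pulls the useful fields out of an operation record like the one above. The JSON is a trimmed copy of that record, and the helper names (duration_minutes, summarize_failure) are made up for illustration; they are not part of aks-engine or any Azure SDK.

```python
import json
import re

# Trimmed copy of the failed operation above; only the fields read below are kept.
operation_json = """
{
  "properties": {
    "provisioningState": "Failed",
    "duration": "PT1H30M30.253074S",
    "statusMessage": {
      "error": {
        "code": "ResourceDeploymentFailure",
        "details": [
          {"code": "VMExtensionProvisioningTimeout"}
        ]
      }
    },
    "targetResource": {
      "resourceName": "k8s-master-24726254-0/cse-master-0"
    }
  }
}
"""


def duration_minutes(iso_duration):
    """Convert a PT#H#M#S duration (the only form handled here) to minutes."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:([\d.]+)S)?", iso_duration)
    if match is None:
        raise ValueError(f"unsupported duration: {iso_duration}")
    hours, minutes, seconds = (float(part) if part else 0.0 for part in match.groups())
    return hours * 60 + minutes + seconds / 60


def summarize_failure(operation):
    """Collect the VM, extension, error codes and elapsed minutes (hypothetical helper)."""
    props = operation["properties"]
    error = props["statusMessage"]["error"]
    vm_name, extension = props["targetResource"]["resourceName"].split("/", 1)
    return {
        "vm": vm_name,
        "extension": extension,
        "codes": [error["code"]] + [d["code"] for d in error.get("details", [])],
        "minutes": round(duration_minutes(props["duration"]), 1),
    }


summary = summarize_failure(json.loads(operation_json))
print(summary)
```

The duration field works out to roughly 90 minutes, which matches the timeout wait mentioned further down.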

The aks-engine config file:

{
    "apiVersion": "vlabs",
    "properties": {
      "orchestratorProfile": {
        "orchestratorType": "Kubernetes",
        "orchestratorRelease": "1.11",
        "orchestratorVersion": "1.11.8",
        "kubernetesConfig": {
          "networkPolicy": "calico",
          "cloudProviderRateLimit": false,
          "kubeletConfig": {
            "--max-pods": "100"
          },
          "apiServerConfig": {
            "--admission-control":  "NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota,DenyEscalatingExec,AlwaysPullImages,ValidatingAdmissionWebhook,ResourceQuota"
          },
          "addons": [
            {
              "name": "cluster-autoscaler",
              "enabled": true,
              "config": {
                "min-nodes": "10",
                "max-nodes": "20"
              }
            },
            {
              "name": "smb-flexvolume",
              "enabled" : true,
              "containers": [
                {
                  "name": "keyvault-flexvolume",
                  "cpuRequests": "50m",
                  "memoryRequests": "10Mi",
                  "cpuLimits": "50m",
                  "memoryLimits": "10Mi"
                }
              ]
            },
            { 
              "name": "blobfuse-flexvolume",
              "enabled": true,
              "containers": [
                {
                  "name": "keyvault-flexvolume",
                  "cpuRequests": "50m",
                  "memoryRequests": "10Mi",
                  "cpuLimits": "50m",
                  "memoryLimits": "10Mi"
                }
              ]
            },
            {
              "name": "keyvault-flexvolume",
              "enabled" : true,
              "containers": [
                {
                  "name": "keyvault-flexvolume",
                  "cpuRequests": "50m",
                  "memoryRequests": "10Mi",
                  "cpuLimits": "50m",
                  "memoryLimits": "10Mi"
                }
              ]
            },
            {
              "name": "tiller",
              "enabled": false
            },
            {
              "name": "kubernetes-dashboard",
              "enabled": false
            }
          ]
        }
      },
      "masterProfile": {
        "count": 5,
        "vmSize": "Standard_D2_v2",
        "vnetSubnetID": "<cidr>",
        "vnetCidr": "<cidr>",
        "firstConsecutiveStaticIP": "<ip>",
        "storageProfile": "StorageAccount"
      },
      "agentPoolProfiles": [
        {
          "name": "devops",
          "count": 10,
          "vmSize": "Standard_D4_v2",
          "osType": "Linux",
          "availabilityProfile": "VirtualMachineScaleSets",
          "diskSizesGB": [100],
          "vnetSubnetID": "<agent_subnet_id>",
          "acceleratedNetworkingEnabled": false,
          "storageProfile": "ManagedDisks"
        }
      ],
      "linuxProfile": {
        "adminUsername": "<user_name>",
        "ssh": {
          "publicKeys": [
          ]
        }
      },
      "servicePrincipalProfile": {
        "clientId": "",
        "secret": "",
        "objectId": ""
      },
      "aadProfile": {
        "clientAppID": "<guid>",
        "serverAppID": "<guid>",
        "tenantID": "<guid>",
        "adminGroupID": "<guid>"
      }
    }
  }
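
A quick sanity check on the config above, as a rough Python sketch. It flags two things visible in the posted JSON: repeated entries in the --admission-control value, and addon containers whose name differs from the addon name. The second check rests on an assumption (that a container is normally named after its addon), and nothing in this thread says either item is related to the CSE timeout. The helper names are hypothetical.

```python
from collections import Counter

# Values copied from the config above; the addon list is trimmed to names only.
admission_control = (
    "NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,"
    "DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,"
    "ResourceQuota,DenyEscalatingExec,AlwaysPullImages,"
    "ValidatingAdmissionWebhook,ResourceQuota"
)
addon_containers = {
    "smb-flexvolume": ["keyvault-flexvolume"],
    "blobfuse-flexvolume": ["keyvault-flexvolume"],
    "keyvault-flexvolume": ["keyvault-flexvolume"],
}


def repeated_entries(comma_separated):
    """Return the names that appear more than once in a comma-separated list."""
    counts = Counter(name.strip() for name in comma_separated.split(",") if name.strip())
    return sorted(name for name, count in counts.items() if count > 1)


def mismatched_containers(addons):
    """Return (addon, container) pairs whose names differ.

    Assumption: a container is expected to share its addon's name.
    """
    return [
        (addon, container)
        for addon, containers in addons.items()
        for container in containers
        if container != addon
    ]


duplicates = repeated_entries(admission_control)
mismatches = mismatched_containers(addon_containers)
print("repeated admission plugins:", duplicates)
print("addon/container name mismatches:", mismatches)
```

Both findings may well be harmless; they are only worth a second look before redeploying.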
  

What you expected to happen:
The deploy command should have provisioned a full cluster without any VMs failing to provision.
All of the installed extensions should have completed successfully.
On failure, the command should return an informative error and log what is going on in more detail, so there is no need to wait 90 min for a timeout.

How to reproduce it (as minimally and precisely as possible):
Fill in the JSON config above and run:

aks-engine deploy --subscription-id <sub_id> --dns-prefix <RG_name> \
--location eastus --auto-suffix --api-model kubernetes_1.11.8.json  \
--resource-group <RG_name> --auth-method device --debug

Anything else we need to know:

I tried to get the CSE run output from Azure support and was referred back to the AKS-Engine team.
This failure is sporadic, but at times it happens a lot. For example, yesterday 4 out of 6 clusters failed this way.

@welcome

welcome bot commented Mar 20, 2019

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

@barakAtSoluto changed the title from 'Provisioning of VM extension cse Fails' to 'Provisioning of VM extension "cse" Fails' on Mar 20, 2019

@CecileRobertMichon
Contributor

Can you please try with aks-engine v0.32.3? https://github.com/Azure/aks-engine/releases/tag/v0.32.3

There was a new walinuxagent release last week which revealed a bug in the CSE implementation that causes race conditions. The fix is in 0.32.3.

https://github.com/Azure/aks-engine/blob/master/docs/howto/troubleshooting.md#vmextensionprovisioningerror-or-vmextensionprovisioningtimeout
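
To check whether a given install already includes that fix, a small Python sketch like the following can compare the version text quoted at the top of this issue against 0.32.3. The function names are illustrative only, and the check assumes plain MAJOR.MINOR.PATCH versions with an optional leading "v".

```python
import re

FIXED_IN = (0, 32, 3)  # release named in the comment above as containing the fix


def parse_version(version_output):
    """Extract (major, minor, patch) from text containing a 'Version: vX.Y.Z' line."""
    match = re.search(r"Version:\s*v?(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no 'Version: vX.Y.Z' line found")
    return tuple(int(part) for part in match.groups())


def has_cse_fix(version_output):
    """True when the reported version is at or above the release with the fix."""
    return parse_version(version_output) >= FIXED_IN


reported = "Version: v0.32.0\nGitCommit: c2b6148\nGitTreeState: clean"
print(has_cse_fix(reported))
print(has_cse_fix("Version: v0.32.3"))
```

Tuples compare element by element, so 0.32.10 would correctly rank above 0.32.3, which a plain string comparison would get wrong.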

@barakAtSoluto
Author

I had multiple successful deployments with a 1-master, 2-worker config, and one with the full-scale config (5 masters and 10 workers).

@CecileRobertMichon
Contributor

Closing as the bug is fixed in 0.32.3.
