
Disk attachment/mounting problems, all pods with PVCs stuck in ContainerCreating #1278

Closed
kwikwag opened this issue Oct 17, 2019 · 123 comments

@kwikwag

kwikwag commented Oct 17, 2019

What happened:
Pods with PVCs are stuck in ContainerCreating state, due to a problem with attachment/mounting.

I am using a VMSS-backed, westus-located K8S cluster (1.14.6; aksEngineVersion: v0.40.2-aks). Following a crash of the Kafka pods (using Confluent Helm charts v5.3.1; see configuration below, under Environment), 2 of the 3 pods got stuck in the ContainerCreating state. The dashboard seems to show that all the PVCs are failing to mount because of one volume that was not detached properly:

kafka-cp-kafka
Unable to mount volumes for pod "kafka-cp-kafka-0_default(kafkapod-guid-xxxx-xxxx-xxxxxxxxxxxx)": timeout expired waiting for volumes to attach or mount for pod "default"/"kafka-cp-kafka-0". list of unmounted volumes=[datadir-0]. list of unattached volumes=[datadir-0 jmx-config default-token-xxxcc]

kafka-cp-zookeeper
AttachVolume.Attach failed for volume "pvc-zookepvc-guid-xxxx-xxxx-xxxxxxxxxxxx" : Attach volume "kubernetes-dynamic-pvc-zookepvc-guid-xxxx-xxxx-xxxxxxxxxxxx" to instance "aks-nodepool1-00011122-vmss000000" failed with compute.VirtualMachineScaleSetVMsClient#Update: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk 'diskfail-guid-xxxx-xxxx-xxxxxxxxxxxx' to VM 'aks-nodepool1-00011122-vmss_0' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."
Unable to mount volumes for pod "kafka-cp-zookeeper-0_default(zookepod-guid-xxxx-xxxx-xxxxxxxxxxxx)": timeout expired waiting for volumes to attach or mount for pod "default"/"kafka-cp-zookeeper-0". list of unmounted volumes=[datadir datalogdir]. list of unattached volumes=[datadir datalogdir jmx-config default-token-xxxcc]

es-data-efk-logging-cluster-default
AttachVolume.Attach failed for volume "pvc-eslogdpvc-guid-xxxx-xxxx-xxxxxxxxxxxx" : Attach volume "kubernetes-dynamic-pvc-eslogdpvc-guid-xxxx-xxxx-xxxxxxxxxxxx" to instance "aks-nodepool1-00011122-vmss000000" failed with compute.VirtualMachineScaleSetVMsClient#Update: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk 'diskfail-guid-xxxx-xxxx-xxxxxxxxxxxx' to VM 'aks-nodepool1-00011122-vmss_0' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."
Unable to mount volumes for pod "es-data-efk-logging-cluster-default-0_logging(eslogdpod-guid-xxxx-xxxx-xxxxxxxxxxxx)": timeout expired waiting for volumes to attach or mount for pod "logging"/"es-data-efk-logging-cluster-default-0". list of unmounted volumes=[es-data]. list of unattached volumes=[es-data default-token-xxxdd]

es-master-efk-logging-cluster-default
AttachVolume.Attach failed for volume "pvc-eslogmpvc-guid-xxxx-xxxx-xxxxxxxxxxxx" : Attach volume "kubernetes-dynamic-pvc-eslogmpvc-guid-xxxx-xxxx-xxxxxxxxxxxx" to instance "aks-nodepool1-00011122-vmss000000" failed with compute.VirtualMachineScaleSetVMsClient#Update: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk 'diskfail-guid-xxxx-xxxx-xxxxxxxxxxxx' to VM 'aks-nodepool1-00011122-vmss_0' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."
Unable to mount volumes for pod "es-master-efk-logging-cluster-default-0_logging(eslogmpod-guid-xxxx-xxxx-xxxxxxxxxxxx)": timeout expired waiting for volumes to attach or mount for pod "logging"/"es-master-efk-logging-cluster-default-0". list of unmounted volumes=[es-data]. list of unattached volumes=[es-data default-token-xxxdd]

prometheus-prom-prometheus-operator-prometheus
AttachVolume.Attach failed for volume "pvc-promppvc-guid-xxxx-xxxx-xxxxxxxxxxxx" : Attach volume "kubernetes-dynamic-pvc-promppvc-guid-xxxx-xxxx-xxxxxxxxxxxx" to instance "aks-nodepool1-00011122-vmss000000" failed with compute.VirtualMachineScaleSetVMsClient#Update: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk 'diskfail-guid-xxxx-xxxx-xxxxxxxxxxxx' to VM 'aks-nodepool1-00011122-vmss_0' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."
Unable to mount volumes for pod "prometheus-prom-prometheus-operator-prometheus-0_monitoring(promppod-guid-xxxx-xxxx-xxxxxxxxxxxx)": timeout expired waiting for volumes to attach or mount for pod "monitoring"/"prometheus-prom-prometheus-operator-prometheus-0". list of unmounted volumes=[prometheus-prom-prometheus-operator-prometheus-db]. list of unattached volumes=[prometheus-prom-prometheus-operator-prometheus-db config config-out prometheus-prom-prometheus-operator-prometheus-rulefiles-0 prom-prometheus-operator-prometheus-token-xxxee]

alertmanager-prom-prometheus-operator-alertmanager
AttachVolume.Attach failed for volume "pvc-promapvc-guid-xxxx-xxxx-xxxxxxxxxxxx" : Attach volume "kubernetes-dynamic-pvc-promapvc-guid-xxxx-xxxx-xxxxxxxxxxxx" to instance "aks-nodepool1-00011122-vmss000000" failed with compute.VirtualMachineScaleSetVMsClient#Update: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk 'diskfail-guid-xxxx-xxxx-xxxxxxxxxxxx' to VM 'aks-nodepool1-00011122-vmss_0' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."
Unable to mount volumes for pod "alertmanager-prom-prometheus-operator-alertmanager-0_monitoring(promapod-guid-xxxx-xxxx-xxxxxxxxxxxx)": timeout expired waiting for volumes to attach or mount for pod "monitoring"/"alertmanager-prom-prometheus-operator-alertmanager-0". list of unmounted volumes=[alertmanager-prom-prometheus-operator-alertmanager-db]. list of unattached volumes=[alertmanager-prom-prometheus-operator-alertmanager-db config-volume prom-prometheus-operator-alertmanager-token-xxxff]
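For reference, the error message itself suggests detaching the stuck disk explicitly. A minimal az CLI sketch of that (resource group, scale set name, instance ID and LUN are placeholders; I have not verified that this clears the AttachDiskWhileBeingDetached state):

# list the data disks (and their LUNs) still attached to the VMSS instances
az vmss list-instances -g <RESOURCE_GROUP_NAME> -n <VMSS_NAME> \
  --query "[].{instance:instanceId, disks:storageProfile.dataDisks[].{name:name, lun:lun}}" -o json

# explicitly detach the stuck disk from the affected instance by LUN
az vmss disk detach -g <RESOURCE_GROUP_NAME> --vmss-name <VMSS_NAME> --instance-id <INSTANCE_ID> --lun <LUN>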

Running kubectl get pvc shows the PVCs in Bound state (full JSON from the Dashboard below, under Environment):

NAMESPACE    NAME                                                                                                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
default      datadir-0-kafka-cp-kafka-0                                                                                   Bound    pvc-kafkad01-guid-xxxx-xxxx-xxxxxxxxxxxx  200Gi      RWO            default        3d16h
default      datadir-0-kafka-cp-kafka-1                                                                                   Bound    pvc-kafkad02-guid-xxxx-xxxx-xxxxxxxxxxxx  200Gi      RWO            default        3d16h
default      datadir-0-kafka-cp-kafka-2                                                                                   Bound    pvc-kafkad03-guid-xxxx-xxxx-xxxxxxxxxxxx  200Gi      RWO            default        3d16h
default      datadir-kafka-cp-zookeeper-0                                                                                 Bound    pvc-zookepvc-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        3d16h
default      datadir-kafka-cp-zookeeper-1                                                                                 Bound    pvc-zooked02-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        3d16h
default      datadir-kafka-cp-zookeeper-2                                                                                 Bound    pvc-zooked03-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        3d16h
default      datalogdir-kafka-cp-zookeeper-0                                                                              Bound    pvc-zookel01-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        3d16h
default      datalogdir-kafka-cp-zookeeper-1                                                                              Bound    pvc-zookel02-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        3d16h
default      datalogdir-kafka-cp-zookeeper-2                                                                              Bound    pvc-zookel03-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        3d16h
logging      es-data-es-data-efk-logging-cluster-default-0                                                                Bound    pvc-eslogdpvc-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        10d
logging      es-data-es-master-efk-logging-cluster-default-0                                                              Bound    pvc-eslogmpvc-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        10d
monitoring   alertmanager-prom-prometheus-operator-alertmanager-db-alertmanager-prom-prometheus-operator-alertmanager-0   Bound    pvc-promapvc-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        10d
monitoring   prom-grafana                                                                                                 Bound    pvc-grafad01-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        10d
monitoring   prometheus-prom-prometheus-operator-prometheus-db-prometheus-prom-prometheus-operator-prometheus-0           Bound    pvc-promppvc-guid-xxxx-xxxx-xxxxxxxxxxxx  10Gi       RWO            default        10d

I tried scaling the Kafka StatefulSet down to 0, waiting a long while, then scaling back to 3, but the pods didn't recover.
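Roughly, the scale-down/up attempt was (StatefulSet name inferred from the pod names):

kubectl scale statefulset kafka-cp-kafka --replicas=0
# waited well past the attach/detach timeouts before scaling back up
kubectl scale statefulset kafka-cp-kafka --replicas=3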

Then I tried to scale all Deployments and StatefulSets down to 0 and do a same-version upgrade of the K8S cluster. Unfortunately, because of a problem (reported here) with the VMAccessForLinux extension I had installed on the VMSS (following this guide to update SSH credentials on the nodes), the upgrade failed 2.5 hours later and the cluster remained in a Failed state. Now all of the pods with PVCs got stuck in ContainerCreating. I successfully added a second nodepool, but pods placed on the new nodes still reported the same error, so I removed the second nodepool and scaled the first nodepool down to 1. I then tried to reboot the node, both from the Azure portal and from within an SSH connection; both attempts failed because of the issue with the extension. I then tried to gradually scale down all StatefulSets (I had to uninstall the prometheus-operator Helm release, since it insisted on scaling the alertmanager StatefulSet back up) and enable only the logging StatefulSets, as they are smaller. It didn't help.

After taking down all StatefulSets, running kubectl get nodes --output json | jq '.items[].status.volumesInUse' returns null.
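For completeness, the attach side can be checked the same way (a sketch; assumes jq is available and the storage.k8s.io VolumeAttachment API is served, which it is on 1.14):

# what each node reports as attached (as opposed to in use)
kubectl get nodes --output json | jq '.items[].status.volumesAttached'

# what the attach/detach controller has recorded
kubectl get volumeattachments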

What you expected to happen:
Pods with PVCs should start normally, and if mounting fails, the mount should eventually (and reasonably quickly) be retried and succeed.

How to reproduce it (as minimally and precisely as possible):

I have no idea. This happens randomly.
Up to now, we have worked around it by removing our PVCs, but I don't want to do this anymore; I need a real solution.

Anything else we need to know?:

This is similar to the following issues, reported on Kubernetes and AKS. All of them have been closed, but none with a real solution AFAIK.

I replaced the GUIDs to anonymize the logs, but kept them distinct so they can still be told apart.

Environment:

  • Kubernetes version (use kubectl version): VMSS-backed westus-located K8S (1.14.6; aksEngineVersion : v0.40.2-aks)
  • Size of cluster (how many worker nodes are in the cluster?) 1 nodepool with 3 Standard_DS3_v2 instances.
  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.) Kafka, several dotnet core HTTP microservices, logging (FluentBit + ElasticSearch + Kibana stack), monitoring (prometheus + grafana).
  • Others:
cp-kafka:
  enabled: true
  brokers: 3
  persistence:
    enabled: true
    size: 200Gi
    storageClass: ~
    disksPerBroker: 1

  configurationOverrides:
    "auto.create.topics.enable": "true"
    "num.partitions": "10"
    "log.retention.bytes": "180000000000"

  • Kafka PVC YAML (kubectl get pvc xxx --output json):
{
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {
        "annotations": {
            "pv.kubernetes.io/bind-completed": "yes",
            "pv.kubernetes.io/bound-by-controller": "yes",
            "volume.beta.kubernetes.io/storage-provisioner": "kubernetes.io/azure-disk"
        },
        "creationTimestamp": "2019-10-13T12:00:00Z",
        "finalizers": [
            "kubernetes.io/pvc-protection"
        ],
        "labels": {
            "app": "cp-kafka",
            "release": "kafka"
        },
        "name": "datadir-0-kafka-cp-kafka-0",
        "namespace": "default",
        "resourceVersion": "3241128",
        "selfLink": "/api/v1/namespaces/default/persistentvolumeclaims/datadir-0-kafka-cp-kafka-0",
        "uid": "kafkad01-guid-xxxx-xxxx-xxxxxxxxxxxx"
    },
    "spec": {
        "accessModes": [
            "ReadWriteOnce"
        ],
        "resources": {
            "requests": {
                "storage": "200Gi"
            }
        },
        "storageClassName": "default",
        "volumeMode": "Filesystem",
        "volumeName": "pvc-kafkad01-guid-xxxx-xxxx-xxxxxxxxxxxx"
    },
    "status": {
        "accessModes": [
            "ReadWriteOnce"
        ],
        "capacity": {
            "storage": "200Gi"
        },
        "phase": "Bound"
    }
}
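To map a PVC/PV back to the underlying Azure managed disk, a sketch that assumes the in-tree azureDisk volume source used by the kubernetes.io/azure-disk provisioner:

kubectl get pv pvc-kafkad01-guid-xxxx-xxxx-xxxxxxxxxxxx \
  -o jsonpath='{.spec.azureDisk.diskURI}'

The resulting URI identifies the managed disk named in the AttachDiskWhileBeingDetached messages above.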
@Aaron-ML

Aaron-ML commented Oct 17, 2019

I'm also seeing this in west-europe. I can occasionally get them to connect over time, but it's hit or miss.

Also using a VMSS-backed AKS cluster.

@jason-epp

I have a similar problem.
Also using a VMSS cluster, v1.14.6.
If the pod moves to a different node, the disk (PVC) does not detach from the old node and reattach to the new one.

@worldspawn

I'm using a VMSS cluster, 1.14.6, in Australia Southeast. I have this same issue trying to start the Redis Helm chart. I made a simple YAML and hit the same problem.

apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  ports:
    - port: 6379
      name: redis
    - port: 16379
      name: gossip
  clusterIP: None
  selector:
    app: redis
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    app: redis
    role: master
  name: redis-data-drawboard-teams
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
  storageClassName: default
---
apiVersion: apps/v1beta2
kind: StatefulSet
metadata:
  name: redis
spec:
  selector:
    matchLabels:
      app: redis  # has to match .spec.template.metadata.labels
  serviceName: redis
  replicas: 1
  template:
    metadata:
      labels:
        app: redis  # has to match .spec.selector.matchLabels
    spec:
      containers:
        - name: redis
          image: redis:5.0.6-alpine
          imagePullPolicy: Always
          #args: ["--requirepass", "$(REDIS_PASS)", "--appendonly", "yes", "--save", "900", "1", "--save", "30", "2"]
          ports:
            - containerPort: 6379
              name: redis
          env:
          - name: REDIS_PASS
            valueFrom:
              secretKeyRef:
                name: env-secrets
                key: REDIS_PASS
          volumeMounts:
            - name: redis-volume
              mountPath: /data
      volumes:
        - name: redis-volume
          persistentVolumeClaim:
            claimName: redis-data-drawboard-teams

  Type     Reason              Age                From                     Message
  ----     ------              ----               ----                     -------
  Normal   NotTriggerScaleUp   41s                cluster-autoscaler       pod didn't trigger scale-up (it wouldn't fit if a new node is added):
  Warning  FailedScheduling    39s (x2 over 49s)  default-scheduler        pod has unbound immediate PersistentVolumeClaims (repeated 2 times)
  Normal   Scheduled           37s                default-scheduler        Successfully assigned au-prod-preprod/redis-0 to aks-default-15847111-vmss000002
  Warning  FailedAttachVolume  1s (x7 over 36s)   attachdetach-controller  AttachVolume.Attach failed for volume "pvc-1b8dd06c-f135-11e9-9b7a-5eca0f4c09d2" : Attach volume "kubernetes-dynamic-pvc-1b8dd06c-f135-11e9-9b7a-5eca0f4c09d2" to instance "aks-default-15847111-vmss000002" failed with compute.VirtualMachineScaleSetVMsClient#Update: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk 'bc080a10-b994-4a12-8ff7-f21d0575c0f9' to VM 'aks-default-15847111-vmss_2' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."

@Glorf

Glorf commented Oct 18, 2019

The same problem appears in europe-north, after PVCs were deleted and then re-instantiated.

@AustinSmart

Same issue:
VMSS
West US
v1.13.10

@sharkymcdongles

Also have this issue. It was discussed in Azure/aks-engine#1860, but we were sent back here.

Number of nodes in your cluster?
30

Number of disks attached to those nodes?
Varies, but we have 131 PVCs with 5 or so per node.

Is cluster autoscaler enabled?
No, it is disabled.

What operation did you run when this issue was triggered?
We didn't do anything. The cluster just stopped mounting PVCs and said there was an API limit. We then checked and saw the VMSS was in a failed state due to the VM extensions RunCommandLinux and vmssCSE.

What version of aks-engine?
aksEngineVersion : v0.39.2-aks

What version of K8s?
1.14.6

@palma21 palma21 self-assigned this Oct 18, 2019
@andyzhangx
Contributor

andyzhangx commented Oct 19, 2019

This could be a VMSS issue due to rate limiting. Would you try updating the Scale Set VM manually:

az vmss update-instances -g <RESOURCE_GROUP_NAME> --name <VMSS_NAME> --instance-id <ID(number)>
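A sketch for looking up the instance IDs to pass in (resource group and scale set name are placeholders):

az vmss list-instances -g <RESOURCE_GROUP_NAME> -n <VMSS_NAME> \
  --query "[].{name:name, instanceId:instanceId, state:provisioningState}" -o table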

@jayakishorereddy

Even after running the command, it's the same issue:
Provisioning of VM extension 'vmssCSE' has timed out. Extension installation may be taking too long, or extension status could not be obtained.

@andyzhangx
Contributor

@jayakishorereddy could you kubectl-enter <node-name> and then provide /var/log/azure/cluster-provision.log?
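If kubectl-enter is not available, a hedged alternative is to read the file through the kubelet's node log proxy (assumes the kubelet on that node is still healthy and that /var/log/azure is exposed under the node's /logs/ endpoint):

kubectl get --raw "/api/v1/nodes/<NODE_NAME>/proxy/logs/azure/cluster-provision.log"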

@sharkymcdongles

@andyzhangx I tried this and it led to an internal execution error and couldn't complete. The cluster has only one VMSS, and only 4 nodes show this error and are in a failed state. Other nodes cannot mount the disks when pods are moved off the failed nodes, because Azure says the disks are still attached to the failed node: it throws a multi-attach error and times out claiming the disk is still attached to the failed node, even though this doesn't show in the portal or anywhere else, and the node is now deallocated. After deallocating, I am unable to start the node back up, as it fails with an internal execution error.

@sharkymcdongles

Here is the /var/log/azure/cluster-provision.log from one of the failed nodes:

root@aks-e16sv3-79416110-vmss000004:/home/devops# cat /var/log/azure/cluster-provision.log
++ date
++ hostname
+ echo Sun Sep 1 01:18:21 UTC 2019,aks-e16sv3-79416110-vmss000004, startcustomscript
+ AZURE_STACK_ENV=azurestackcloud
+ script_lib=/opt/azure/containers/provision_source.sh
++ seq 1 3600
+ for i in '$(seq 1 3600)'
+ '[' -f /opt/azure/containers/provision_source.sh ']'
+ break
+ source /opt/azure/containers/provision_source.sh
++ ERR_SYSTEMCTL_START_FAIL=4
++ ERR_CLOUD_INIT_TIMEOUT=5
++ ERR_FILE_WATCH_TIMEOUT=6
++ ERR_HOLD_WALINUXAGENT=7
++ ERR_RELEASE_HOLD_WALINUXAGENT=8
++ ERR_APT_INSTALL_TIMEOUT=9
++ ERR_ETCD_DATA_DIR_NOT_FOUND=10
++ ERR_ETCD_RUNNING_TIMEOUT=11
++ ERR_ETCD_DOWNLOAD_TIMEOUT=12
++ ERR_ETCD_VOL_MOUNT_FAIL=13
++ ERR_ETCD_START_TIMEOUT=14
++ ERR_ETCD_CONFIG_FAIL=15
++ ERR_DOCKER_INSTALL_TIMEOUT=20
++ ERR_DOCKER_DOWNLOAD_TIMEOUT=21
++ ERR_DOCKER_KEY_DOWNLOAD_TIMEOUT=22
++ ERR_DOCKER_APT_KEY_TIMEOUT=23
++ ERR_DOCKER_START_FAIL=24
++ ERR_MOBY_APT_LIST_TIMEOUT=25
++ ERR_MS_GPG_KEY_DOWNLOAD_TIMEOUT=26
++ ERR_MOBY_INSTALL_TIMEOUT=27
++ ERR_K8S_RUNNING_TIMEOUT=30
++ ERR_K8S_DOWNLOAD_TIMEOUT=31
++ ERR_KUBECTL_NOT_FOUND=32
++ ERR_IMG_DOWNLOAD_TIMEOUT=33
++ ERR_KUBELET_START_FAIL=34
++ ERR_CONTAINER_IMG_PULL_TIMEOUT=35
++ ERR_CNI_DOWNLOAD_TIMEOUT=41
++ ERR_MS_PROD_DEB_DOWNLOAD_TIMEOUT=42
++ ERR_MS_PROD_DEB_PKG_ADD_FAIL=43
++ ERR_SYSTEMD_INSTALL_FAIL=48
++ ERR_MODPROBE_FAIL=49
++ ERR_OUTBOUND_CONN_FAIL=50
++ ERR_KATA_KEY_DOWNLOAD_TIMEOUT=60
++ ERR_KATA_APT_KEY_TIMEOUT=61
++ ERR_KATA_INSTALL_TIMEOUT=62
++ ERR_CONTAINERD_DOWNLOAD_TIMEOUT=70
++ ERR_CUSTOM_SEARCH_DOMAINS_FAIL=80
++ ERR_GPU_DRIVERS_START_FAIL=84
++ ERR_GPU_DRIVERS_INSTALL_TIMEOUT=85
++ ERR_SGX_DRIVERS_INSTALL_TIMEOUT=90
++ ERR_SGX_DRIVERS_START_FAIL=91
++ ERR_APT_DAILY_TIMEOUT=98
++ ERR_APT_UPDATE_TIMEOUT=99
++ ERR_CSE_PROVISION_SCRIPT_NOT_READY_TIMEOUT=100
++ ERR_APT_DIST_UPGRADE_TIMEOUT=101
++ ERR_SYSCTL_RELOAD=103
++ ERR_CIS_ASSIGN_ROOT_PW=111
++ ERR_CIS_ASSIGN_FILE_PERMISSION=112
++ ERR_PACKER_COPY_FILE=113
++ ERR_CIS_APPLY_PASSWORD_CONFIG=115
++ ERR_AZURE_STACK_GET_ARM_TOKEN=120
++ ERR_AZURE_STACK_GET_NETWORK_CONFIGURATION=121
++ ERR_AZURE_STACK_GET_SUBNET_PREFIX=122
+++ gawk 'match($0, /^(ID_LIKE=(coreos)|ID=(.*))$/, a) { print toupper(a[2] a[3]); exit }'
+++ sort -r /etc/lsb-release /etc/os-release
++ OS=UBUNTU
++ UBUNTU_OS_NAME=UBUNTU
++ RHEL_OS_NAME=RHEL
++ COREOS_OS_NAME=COREOS
++ KUBECTL=/usr/local/bin/kubectl
++ DOCKER=/usr/bin/docker
++ GPU_DV=418.40.04
++ GPU_DEST=/usr/local/nvidia
++ NVIDIA_DOCKER_VERSION=2.0.3
++ DOCKER_VERSION=1.13.1-1
++ NVIDIA_CONTAINER_RUNTIME_VERSION=2.0.0
+ install_script=/opt/azure/containers/provision_installs.sh
+ wait_for_file 3600 1 /opt/azure/containers/provision_installs.sh
+ retries=3600
+ wait_sleep=1
+ filepath=/opt/azure/containers/provision_installs.sh
++ seq 1 3600
+ for i in '$(seq 1 $retries)'
+ '[' -f /opt/azure/containers/provision_installs.sh ']'
+ break
+ source /opt/azure/containers/provision_installs.sh
++ CC_SERVICE_IN_TMP=/opt/azure/containers/cc-proxy.service.in
++ CC_SOCKET_IN_TMP=/opt/azure/containers/cc-proxy.socket.in
++ CNI_CONFIG_DIR=/etc/cni/net.d
++ CNI_BIN_DIR=/opt/cni/bin
++ CNI_DOWNLOADS_DIR=/opt/cni/downloads
++ CONTAINERD_DOWNLOADS_DIR=/opt/containerd/downloads
+++ lsb_release -r -s
++ UBUNTU_RELEASE=16.04
+ config_script=/opt/azure/containers/provision_configs.sh
+ wait_for_file 3600 1 /opt/azure/containers/provision_configs.sh
+ retries=3600
+ wait_sleep=1
+ filepath=/opt/azure/containers/provision_configs.sh
++ seq 1 3600
+ for i in '$(seq 1 $retries)'
+ '[' -f /opt/azure/containers/provision_configs.sh ']'
+ break
+ source /opt/azure/containers/provision_configs.sh
+++ hostname
+++ tail -c 2
++ NODE_INDEX=4
+++ hostname
++ NODE_NAME=aks-e16sv3-79416110-vmss000004
++ [[ UBUNTU == COREOS ]]
+++ hostname -I
+++ cut '-d ' -f1
++ PRIVATE_IP=10.4.3.90
++ ETCD_PEER_URL=https://10.4.3.90:2380
++ ETCD_CLIENT_URL=https://10.4.3.90:2379
+ cis_script=/opt/azure/containers/provision_cis.sh
+ wait_for_file 3600 1 /opt/azure/containers/provision_cis.sh
+ retries=3600
+ wait_sleep=1
+ filepath=/opt/azure/containers/provision_cis.sh
++ seq 1 3600
+ for i in '$(seq 1 $retries)'
+ '[' -f /opt/azure/containers/provision_cis.sh ']'
+ break
+ source /opt/azure/containers/provision_cis.sh
+ [[ azurepubliccloud == \a\z\u\r\e\s\t\a\c\k\c\l\o\u\d ]]
+ CUSTOM_SEARCH_DOMAIN_SCRIPT=/opt/azure/containers/setup-custom-search-domains.sh
+ set +x
+ [[ UBUNTU == COREOS ]]
+ '[' -f /var/run/reboot-required ']'
+ REBOOTREQUIRED=false
+ '[' -f /var/log.vhd/azure/golden-image-install.complete ']'
+ echo 'detected golden image pre-install'
detected golden image pre-install
+ FULL_INSTALL_REQUIRED=false
+ rm -rf /home/packer
+ deluser packer
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
Removing user `packer' ...
Warning: group `packer' has no more members.
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
Done.
+ groupdel packer
groupdel: group 'packer' does not exist
+ [[ UBUNTU == UBUNTU ]]
+ '[' false = true ']'
+ echo 'Golden image; skipping dependencies installation'
Golden image; skipping dependencies installation
+ ensureAuditD
+ [[ false == true ]]
+ apt list --installed
+ grep auditd

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

auditd/xenial-updates,now 1:2.4.5-1ubuntu2.1 amd64 [installed]
+ systemctlDisableAndStop auditd
+ systemctl_stop 100 5 30 auditd
+ retries=100
+ wait_sleep=5
+ timeout=30
+ svcname=auditd
++ seq 1 100
+ for i in '$(seq 1 $retries)'
+ timeout 30 systemctl daemon-reload
+ timeout 30 systemctl stop auditd
+ break
+ retrycmd_if_failure 120 5 25 systemctl disable auditd
+ retries=120
+ wait_sleep=5
+ timeout=25
+ shift
+ shift
+ shift
++ seq 1 120
+ for i in '$(seq 1 $retries)'
+ timeout 25 systemctl disable auditd
Synchronizing state of auditd.service with SysV init with /lib/systemd/systemd-sysv-install...
Executing /lib/systemd/systemd-sysv-install disable auditd
insserv: warning: current start runlevel(s) (empty) of script `auditd' overrides LSB defaults (2 3 4 5).
insserv: warning: current stop runlevel(s) (0 1 2 3 4 5 6) of script `auditd' overrides LSB defaults (0 1 6).
+ break
+ echo Executed '"systemctl' disable 'auditd"' 1 times
Executed "systemctl disable auditd" 1 times
+ [[ -n '' ]]
+ [[ UBUNTU != COREOS ]]
+ installContainerRuntime
+ [[ docker == \d\o\c\k\e\r ]]
+ installMoby
++ dockerd --version
++ grep 'Docker version'
++ cut -d , -f 1
++ cut -d ' ' -f 3
+ CURRENT_VERSION=3.0.6
+ [[ 3.0.6 == \3\.\0\.\6 ]]
+ echo 'dockerd 3.0.6 is already installed, skipping Moby download'
dockerd 3.0.6 is already installed, skipping Moby download
+ installNetworkPlugin
+ [[ azure = \a\z\u\r\e ]]
+ installAzureCNI
++ echo https://acs-mirror.azureedge.net/cni/azure-vnet-cni-linux-amd64-v1.0.25.tgz
++ cut -d / -f 5
+ CNI_TGZ_TMP=azure-vnet-cni-linux-amd64-v1.0.25.tgz
+ [[ ! -f /opt/cni/downloads/azure-vnet-cni-linux-amd64-v1.0.25.tgz ]]
+ downloadAzureCNI
+ mkdir -p /opt/cni/downloads
++ echo https://acs-mirror.azureedge.net/cni/azure-vnet-cni-linux-amd64-v1.0.25.tgz
++ cut -d / -f 5
+ CNI_TGZ_TMP=azure-vnet-cni-linux-amd64-v1.0.25.tgz
+ retrycmd_get_tarball 120 5 /opt/cni/downloads/azure-vnet-cni-linux-amd64-v1.0.25.tgz https://acs-mirror.azureedge.net/cni/azure-vnet-cni-linux-amd64-v1.0.25.tgz
+ tar_retries=120
+ wait_sleep=5
+ tarball=/opt/cni/downloads/azure-vnet-cni-linux-amd64-v1.0.25.tgz
+ url=https://acs-mirror.azureedge.net/cni/azure-vnet-cni-linux-amd64-v1.0.25.tgz
+ echo '120 retries'
120 retries
++ seq 1 120
+ for i in '$(seq 1 $tar_retries)'
+ tar -tzf /opt/cni/downloads/azure-vnet-cni-linux-amd64-v1.0.25.tgz
tar (child): /opt/cni/downloads/azure-vnet-cni-linux-amd64-v1.0.25.tgz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
+ '[' 1 -eq 120 ']'
+ timeout 60 curl -fsSL https://acs-mirror.azureedge.net/cni/azure-vnet-cni-linux-amd64-v1.0.25.tgz -o /opt/cni/downloads/azure-vnet-cni-linux-amd64-v1.0.25.tgz
+ sleep 5
+ for i in '$(seq 1 $tar_retries)'
+ tar -tzf /opt/cni/downloads/azure-vnet-cni-linux-amd64-v1.0.25.tgz
azure-vnet
azure-vnet-ipam
azure-vnet-telemetry
10-azure.conflist
azure-vnet-telemetry.config
+ break
+ mkdir -p /etc/cni/net.d
+ chown -R root:root /etc/cni/net.d
+ chmod 755 /etc/cni/net.d
+ mkdir -p /opt/cni/bin
+ tar -xzf /opt/cni/downloads/azure-vnet-cni-linux-amd64-v1.0.25.tgz -C /opt/cni/bin
+ installCNI
++ echo https://acs-mirror.azureedge.net/cni/cni-plugins-amd64-v0.7.5.tgz
++ cut -d / -f 5
+ CNI_TGZ_TMP=cni-plugins-amd64-v0.7.5.tgz
+ [[ ! -f /opt/cni/downloads/cni-plugins-amd64-v0.7.5.tgz ]]
+ mkdir -p /opt/cni/bin
+ tar -xzf /opt/cni/downloads/cni-plugins-amd64-v0.7.5.tgz -C /opt/cni/bin
+ chown -R root:root /opt/cni/bin
+ chmod -R 755 /opt/cni/bin
+ rm -rf /opt/cni/downloads
+ [[ docker == \k\a\t\a\-\c\o\n\t\a\i\n\e\r\s ]]
+ [[ docker == \c\o\n\t\a\i\n\e\r\d ]]
+ cleanUpContainerd
+ rm -Rf /opt/containerd/downloads
+ [[ false = true ]]
+ cleanUpGPUDrivers
+ rm -Rf /usr/local/nvidia
+ rm -f /etc/apt/sources.list.d/nvidia-docker.list
+ installKubeletAndKubectl
+ [[ ! -f /usr/local/bin/kubectl-1.14.6 ]]
+ [[ docker == \d\o\c\k\e\r ]]
+ extractHyperkube docker
+ CLI_TOOL=docker
+ path=/home/hyperkube-downloads/1.14.6
+ pullContainerImage docker aksrepos.azurecr.io/mirror/hyperkube-amd64:v1.14.6
+ CLI_TOOL=docker
+ DOCKER_IMAGE_URL=aksrepos.azurecr.io/mirror/hyperkube-amd64:v1.14.6
+ [[ -n '' ]]
+ retrycmd_if_failure 60 1 1200 docker pull aksrepos.azurecr.io/mirror/hyperkube-amd64:v1.14.6
+ retries=60
+ wait_sleep=1
+ timeout=1200
+ shift
+ shift
+ shift
++ seq 1 60
+ for i in '$(seq 1 $retries)'
+ timeout 1200 docker pull aksrepos.azurecr.io/mirror/hyperkube-amd64:v1.14.6
v1.14.6: Pulling from mirror/hyperkube-amd64
346aee5ea5bc: Already exists
9c47fde751af: Already exists
be2693a52daf: Already exists
6b69eb11d043: Already exists
0abeb1500767: Already exists
4062d80041b7: Already exists
23b6daf06fc2: Already exists
74b51ee2b296: Pulling fs layer
74b51ee2b296: Verifying Checksum
74b51ee2b296: Download complete
74b51ee2b296: Pull complete
Digest: sha256:db2120fdf4f18c4a8b7a7e7cb14dd29c376c328f9cd1cc1d9375daa0ce37bf20
Status: Downloaded newer image for aksrepos.azurecr.io/mirror/hyperkube-amd64:v1.14.6
aksrepos.azurecr.io/mirror/hyperkube-amd64:v1.14.6
+ break
+ echo Executed '"docker' pull 'aksrepos.azurecr.io/mirror/hyperkube-amd64:v1.14.6"' 1 times
Executed "docker pull aksrepos.azurecr.io/mirror/hyperkube-amd64:v1.14.6" 1 times
+ [[ docker == \d\o\c\k\e\r ]]
+ mkdir -p /home/hyperkube-downloads/1.14.6
+ docker run --rm -v /home/hyperkube-downloads/1.14.6:/home/hyperkube-downloads/1.14.6 aksrepos.azurecr.io/mirror/hyperkube-amd64:v1.14.6 /bin/bash -c 'cp /hyperkube /home/hyperkube-downloads/1.14.6'
+ [[ UBUNTU == COREOS ]]
+ cp /home/hyperkube-downloads/1.14.6/hyperkube /usr/local/bin/kubelet-1.14.6
+ mv /home/hyperkube-downloads/1.14.6/hyperkube /usr/local/bin/kubectl-1.14.6
+ mv /usr/local/bin/kubelet-1.14.6 /usr/local/bin/kubelet
+ mv /usr/local/bin/kubectl-1.14.6 /usr/local/bin/kubectl
+ chmod a+x /usr/local/bin/kubelet /usr/local/bin/kubectl
+ [[ UBUNTU != COREOS ]]
+ ensureRPC
+ systemctlEnableAndStart rpcbind
+ systemctl_restart 100 5 30 rpcbind
+ retries=100
+ wait_sleep=5
+ timeout=30
+ svcname=rpcbind
+ rm -rf /usr/local/bin/kubelet-1.10.12 /usr/local/bin/kubelet-1.10.13 /usr/local/bin/kubelet-1.11.10 /usr/local/bin/kubelet-1.11.10-azs /usr/local/bin/kubelet-1.11.9 /usr/local/bin/kubelet-1.11.9-azs /usr/local/bin/kubelet-1.12.7 /usr/local/bin/kubelet-1.12.7-azs /usr/local/bin/kubelet-1.12.8 /usr/local/bin/kubelet-1.12.8-azs /usr/local/bin/kubelet-1.13.8 /usr/local/bin/kubelet-1.13.8-azs /usr/local/bin/kubelet-1.13.9 /usr/local/bin/kubelet-1.13.9-azs /usr/local/bin/kubelet-1.14.4 /usr/local/bin/kubelet-1.14.4-azs /usr/local/bin/kubelet-1.14.5 /usr/local/bin/kubelet-1.14.5-azs /usr/local/bin/kubelet-1.15.1 /usr/local/bin/kubelet-1.15.1-azs /usr/local/bin/kubelet-1.15.2 /usr/local/bin/kubelet-1.15.2-azs /usr/local/bin/kubectl-1.10.12 /usr/local/bin/kubectl-1.10.13 /usr/local/bin/kubectl-1.11.10 /usr/local/bin/kubectl-1.11.10-azs /usr/local/bin/kubectl-1.11.9 /usr/local/bin/kubectl-1.11.9-azs /usr/local/bin/kubectl-1.12.7 /usr/local/bin/kubectl-1.12.7-azs /usr/local/bin/kubectl-1.12.8 /usr/local/bin/kubectl-1.12.8-azs /usr/local/bin/kubectl-1.13.8 /usr/local/bin/kubectl-1.13.8-azs /usr/local/bin/kubectl-1.13.9 /usr/local/bin/kubectl-1.13.9-azs /usr/local/bin/kubectl-1.14.4 /usr/local/bin/kubectl-1.14.4-azs /usr/local/bin/kubectl-1.14.5 /usr/local/bin/kubectl-1.14.5-azs /usr/local/bin/kubectl-1.15.1 /usr/local/bin/kubectl-1.15.1-azs /usr/local/bin/kubectl-1.15.2 /usr/local/bin/kubectl-1.15.2-azs /home/hyperkube-downloads
++ seq 1 100
+ for i in '$(seq 1 $retries)'
+ timeout 30 systemctl daemon-reload
+ timeout 30 systemctl restart rpcbind
+ break
+ RESTART_STATUS=0
+ systemctl status rpcbind --no-pager -l
+ '[' 0 -ne 0 ']'
+ retrycmd_if_failure 120 5 25 systemctl enable rpcbind
+ retries=120
+ wait_sleep=5
+ timeout=25
+ shift
+ shift
+ shift
++ seq 1 120
+ for i in '$(seq 1 $retries)'
+ timeout 25 systemctl enable rpcbind
Synchronizing state of rpcbind.service with SysV init with /lib/systemd/systemd-sysv-install...
Executing /lib/systemd/systemd-sysv-install enable rpcbind
+ break
+ echo Executed '"systemctl' enable 'rpcbind"' 1 times
Executed "systemctl enable rpcbind" 1 times
+ systemctlEnableAndStart rpc-statd
+ systemctl_restart 100 5 30 rpc-statd
+ retries=100
+ wait_sleep=5
+ timeout=30
+ svcname=rpc-statd
++ seq 1 100
+ for i in '$(seq 1 $retries)'
+ timeout 30 systemctl daemon-reload
+ timeout 30 systemctl restart rpc-statd
+ break
+ RESTART_STATUS=0
+ systemctl status rpc-statd --no-pager -l
+ '[' 0 -ne 0 ']'
+ retrycmd_if_failure 120 5 25 systemctl enable rpc-statd
+ retries=120
+ wait_sleep=5
+ timeout=25
+ shift
+ shift
+ shift
++ seq 1 120
+ for i in '$(seq 1 $retries)'
+ timeout 25 systemctl enable rpc-statd
Created symlink from /etc/systemd/system/nfs-server.service.wants/rpc-statd.service to /lib/systemd/system/rpc-statd.service.
+ break
+ echo Executed '"systemctl' enable 'rpc-statd"' 1 times
Executed "systemctl enable rpc-statd" 1 times
+ createKubeManifestDir
+ KUBEMANIFESTDIR=/etc/kubernetes/manifests
+ mkdir -p /etc/kubernetes/manifests
+ [[ false = true ]]
+ [[ -n '' ]]
+ [[ -n '' ]]
+ [[ -n '' ]]
+ removeEtcd
+ [[ UBUNTU == COREOS ]]
+ rm -rf /usr/bin/etcd
+ '[' -f /opt/azure/containers/setup-custom-search-domains.sh ']'
+ [[ docker == \d\o\c\k\e\r ]]
+ ensureDocker
+ DOCKER_SERVICE_EXEC_START_FILE=/etc/systemd/system/docker.service.d/exec_start.conf
+ wait_for_file 1200 1 /etc/systemd/system/docker.service.d/exec_start.conf
+ retries=1200
+ wait_sleep=1
+ filepath=/etc/systemd/system/docker.service.d/exec_start.conf
++ seq 1 1200
+ for i in '$(seq 1 $retries)'
+ '[' -f /etc/systemd/system/docker.service.d/exec_start.conf ']'
+ break
+ echo 'ExecStartPost=/sbin/iptables -P FORWARD ACCEPT'
+ usermod -aG docker devops
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
sent invalidate(passwd) request, exiting
sent invalidate(group) request, exiting
+ DOCKER_MOUNT_FLAGS_SYSTEMD_FILE=/etc/systemd/system/docker.service.d/clear_mount_propagation_flags.conf
+ [[ UBUNTU != COREOS ]]
+ wait_for_file 1200 1 /etc/systemd/system/docker.service.d/clear_mount_propagation_flags.conf
+ retries=1200
+ wait_sleep=1
+ filepath=/etc/systemd/system/docker.service.d/clear_mount_propagation_flags.conf
++ seq 1 1200
+ for i in '$(seq 1 $retries)'
+ '[' -f /etc/systemd/system/docker.service.d/clear_mount_propagation_flags.conf ']'
+ break
+ DOCKER_JSON_FILE=/etc/docker/daemon.json
+ wait_for_file 1200 1 /etc/docker/daemon.json
+ retries=1200
+ wait_sleep=1
+ filepath=/etc/docker/daemon.json
++ seq 1 1200
+ for i in '$(seq 1 $retries)'
+ '[' -f /etc/docker/daemon.json ']'
+ break
+ systemctlEnableAndStart docker
+ systemctl_restart 100 5 30 docker
+ retries=100
+ wait_sleep=5
+ timeout=30
+ svcname=docker
++ seq 1 100
+ for i in '$(seq 1 $retries)'
+ timeout 30 systemctl daemon-reload
+ timeout 30 systemctl restart docker
+ break
+ RESTART_STATUS=0
+ systemctl status docker --no-pager -l
+ '[' 0 -ne 0 ']'
+ retrycmd_if_failure 120 5 25 systemctl enable docker
+ retries=120
+ wait_sleep=5
+ timeout=25
+ shift
+ shift
+ shift
++ seq 1 120
+ for i in '$(seq 1 $retries)'
+ timeout 25 systemctl enable docker
+ break
+ echo Executed '"systemctl' enable 'docker"' 1 times
Executed "systemctl enable docker" 1 times
+ DOCKER_MONITOR_SYSTEMD_TIMER_FILE=/etc/systemd/system/docker-monitor.timer
+ wait_for_file 1200 1 /etc/systemd/system/docker-monitor.timer
+ retries=1200
+ wait_sleep=1
+ filepath=/etc/systemd/system/docker-monitor.timer
++ seq 1 1200
+ for i in '$(seq 1 $retries)'
+ '[' -f /etc/systemd/system/docker-monitor.timer ']'
+ break
+ DOCKER_MONITOR_SYSTEMD_FILE=/etc/systemd/system/docker-monitor.service
+ wait_for_file 1200 1 /etc/systemd/system/docker-monitor.service
+ retries=1200
+ wait_sleep=1
+ filepath=/etc/systemd/system/docker-monitor.service
++ seq 1 1200
+ for i in '$(seq 1 $retries)'
+ '[' -f /etc/systemd/system/docker-monitor.service ']'
+ break
+ systemctlEnableAndStart docker-monitor.timer
+ systemctl_restart 100 5 30 docker-monitor.timer
+ retries=100
+ wait_sleep=5
+ timeout=30
+ svcname=docker-monitor.timer
++ seq 1 100
+ for i in '$(seq 1 $retries)'
+ timeout 30 systemctl daemon-reload
+ timeout 30 systemctl restart docker-monitor.timer
+ break
+ RESTART_STATUS=0
+ systemctl status docker-monitor.timer --no-pager -l
+ '[' 0 -ne 0 ']'
+ retrycmd_if_failure 120 5 25 systemctl enable docker-monitor.timer
+ retries=120
+ wait_sleep=5
+ timeout=25
+ shift
+ shift
+ shift
++ seq 1 120
+ for i in '$(seq 1 $retries)'
+ timeout 25 systemctl enable docker-monitor.timer
Created symlink from /etc/systemd/system/multi-user.target.wants/docker-monitor.timer to /etc/systemd/system/docker-monitor.timer.
+ break
+ echo Executed '"systemctl' enable 'docker-monitor.timer"' 1 times
Executed "systemctl enable docker-monitor.timer" 1 times
+ configureK8s
+ KUBELET_PRIVATE_KEY_PATH=/etc/kubernetes/certs/client.key
+ touch /etc/kubernetes/certs/client.key
+ chmod 0600 /etc/kubernetes/certs/client.key
+ chown root:root /etc/kubernetes/certs/client.key
+ APISERVER_PUBLIC_KEY_PATH=/etc/kubernetes/certs/apiserver.crt
+ touch /etc/kubernetes/certs/apiserver.crt
+ chmod 0644 /etc/kubernetes/certs/apiserver.crt
+ chown root:root /etc/kubernetes/certs/apiserver.crt
+ AZURE_JSON_PATH=/etc/kubernetes/azure.json
+ touch /etc/kubernetes/azure.json
+ chmod 0600 /etc/kubernetes/azure.json
+ chown root:root /etc/kubernetes/azure.json
+ set +x
+ [[ -n '' ]]
+ configureKubeletServerCert
+ KUBELET_SERVER_PRIVATE_KEY_PATH=/etc/kubernetes/certs/kubeletserver.key
+ KUBELET_SERVER_CERT_PATH=/etc/kubernetes/certs/kubeletserver.crt
+ openssl genrsa -out /etc/kubernetes/certs/kubeletserver.key 2048
Generating RSA private key, 2048 bit long modulus
..................................................+++
...............+++
unable to write 'random state'
e is 65537 (0x10001)
+ openssl req -new -x509 -days 7300 -key /etc/kubernetes/certs/kubeletserver.key -out /etc/kubernetes/certs/kubeletserver.crt -subj /CN=aks-e16sv3-79416110-vmss000004
+ [[ azurepubliccloud == \a\z\u\r\e\s\t\a\c\k\c\l\o\u\d ]]
+ configureCNI
+ retrycmd_if_failure 120 5 25 modprobe br_netfilter
+ retries=120
+ wait_sleep=5
+ timeout=25
+ shift
+ shift
+ shift
++ seq 1 120
+ for i in '$(seq 1 $retries)'
+ timeout 25 modprobe br_netfilter
+ break
+ echo Executed '"modprobe' 'br_netfilter"' 1 times
Executed "modprobe br_netfilter" 1 times
+ echo -n br_netfilter
+ configureCNIIPTables
+ [[ azure = \a\z\u\r\e ]]
+ mv /opt/cni/bin/10-azure.conflist /etc/cni/net.d/
+ chmod 600 /etc/cni/net.d/10-azure.conflist
+ [[ '' == \c\a\l\i\c\o ]]
+ /sbin/ebtables -t nat --list
Bridge table: nat

Bridge chain: PREROUTING, entries: 0, policy: ACCEPT

Bridge chain: OUTPUT, entries: 0, policy: ACCEPT

Bridge chain: POSTROUTING, entries: 0, policy: ACCEPT
+ [[ azure = \c\i\l\i\u\m ]]
+ [[ azurepubliccloud == \a\z\u\r\e\s\t\a\c\k\c\l\o\u\d ]]
+ [[ -n '' ]]
+ [[ docker == \k\a\t\a\-\c\o\n\t\a\i\n\e\r\s ]]
+ [[ docker == \c\o\n\t\a\i\n\e\r\d ]]
+ [[ -n '' ]]
+ '[' false = true ']'
+ ensureKubelet
+ KUBELET_DEFAULT_FILE=/etc/default/kubelet
+ wait_for_file 1200 1 /etc/default/kubelet
+ retries=1200
+ wait_sleep=1
+ filepath=/etc/default/kubelet
++ seq 1 1200
+ for i in '$(seq 1 $retries)'
+ '[' -f /etc/default/kubelet ']'
+ break
+ KUBECONFIG_FILE=/var/lib/kubelet/kubeconfig
+ wait_for_file 1200 1 /var/lib/kubelet/kubeconfig
+ retries=1200
+ wait_sleep=1
+ filepath=/var/lib/kubelet/kubeconfig
++ seq 1 1200
+ for i in '$(seq 1 $retries)'
+ '[' -f /var/lib/kubelet/kubeconfig ']'
+ break
+ KUBELET_RUNTIME_CONFIG_SCRIPT_FILE=/opt/azure/containers/kubelet.sh
+ wait_for_file 1200 1 /opt/azure/containers/kubelet.sh
+ retries=1200
+ wait_sleep=1
+ filepath=/opt/azure/containers/kubelet.sh
++ seq 1 1200
+ for i in '$(seq 1 $retries)'
+ '[' -f /opt/azure/containers/kubelet.sh ']'
+ break
+ systemctlEnableAndStart kubelet
+ systemctl_restart 100 5 30 kubelet
+ retries=100
+ wait_sleep=5
+ timeout=30
+ svcname=kubelet
++ seq 1 100
+ for i in '$(seq 1 $retries)'
+ timeout 30 systemctl daemon-reload
+ timeout 30 systemctl restart kubelet
+ break
+ RESTART_STATUS=0
+ systemctl status kubelet --no-pager -l
+ '[' 0 -ne 0 ']'
+ retrycmd_if_failure 120 5 25 systemctl enable kubelet
+ retries=120
+ wait_sleep=5
+ timeout=25
+ shift
+ shift
+ shift
++ seq 1 120
+ for i in '$(seq 1 $retries)'
+ timeout 25 systemctl enable kubelet
Created symlink from /etc/systemd/system/multi-user.target.wants/kubelet.service to /etc/systemd/system/kubelet.service.
+ break
+ echo Executed '"systemctl' enable 'kubelet"' 1 times
Executed "systemctl enable kubelet" 1 times
+ ensureJournal
+ echo Storage=persistent
+ echo SystemMaxUse=1G
+ echo RuntimeMaxUse=1G
+ echo ForwardToSyslog=yes
+ systemctlEnableAndStart systemd-journald
+ systemctl_restart 100 5 30 systemd-journald
+ retries=100
+ wait_sleep=5
+ timeout=30
+ svcname=systemd-journald
++ seq 1 100
+ for i in '$(seq 1 $retries)'
+ timeout 30 systemctl daemon-reload
+ timeout 30 systemctl restart systemd-journald
+ break
+ RESTART_STATUS=0
+ systemctl status systemd-journald --no-pager -l
+ '[' 0 -ne 0 ']'
+ retrycmd_if_failure 120 5 25 systemctl enable systemd-journald
+ retries=120
+ wait_sleep=5
+ timeout=25
+ shift
+ shift
+ shift
++ seq 1 120
+ for i in '$(seq 1 $retries)'
+ timeout 25 systemctl enable systemd-journald
+ break
+ echo Executed '"systemctl' enable 'systemd-journald"' 1 times
Executed "systemctl enable systemd-journald" 1 times
+ [[ -n '' ]]
+ false
+ echo 'Custom script finished successfully'
Custom script finished successfully
++ date
++ hostname
+ echo Sun Sep 1 01:18:58 UTC 2019,aks-e16sv3-79416110-vmss000004, endcustomscript
+ mkdir -p /opt/azure/containers
+ touch /opt/azure/containers/provision.complete
+ false
+ ps auxfww
+ cleanUpContainerImages
+ '[' true = false ']'
+ false
+ [[ UBUNTU == UBUNTU ]]
+ docker rmi registry:2
++ docker images --format '{{.Repository}}:{{.Tag}}'
++ docker images --format '{{.Repository}}:{{.Tag}}'
+ /usr/lib/apt/apt.systemd.daily
++ grep hyperkube
++ grep cloud-controller-manager
++ grep -v '1.14.6$'
++ grep -v '1.14.6$'
+ aptmarkWALinuxAgent unhold
+ wait_for_apt_locks
+ fuser /var/lib/dpkg/lock /var/lib/apt/lists/lock /var/cache/apt/archives/lock
+ retrycmd_if_failure 120 5 25 apt-mark unhold walinuxagent
+ retries=120
+ wait_sleep=5
+ timeout=25
+ shift
+ shift
+ shift
++ seq 1 120
+ for i in '$(seq 1 $retries)'
+ timeout 25 apt-mark unhold walinuxagent
Error: No such image: registry:2
dpkg: error: dpkg frontend is locked by another process
E: Sub-process dpkg --set-selections returned an error code (2)
E: Executing dpkg failed. Are you root?
+ '[' 1 -eq 120 ']'
+ sleep 5
+ docker rmi k8s.gcr.io/cloud-controller-manager-amd64:v1.15.2 k8s.gcr.io/cloud-controller-manager-amd64:v1.14.5 k8s.gcr.io/cloud-controller-manager-amd64:v1.13.9 k8s.gcr.io/cloud-controller-manager-amd64:v1.15.1 k8s.gcr.io/cloud-controller-manager-amd64:v1.14.4 k8s.gcr.io/cloud-controller-manager-amd64:v1.13.8 k8s.gcr.io/cloud-controller-manager-amd64:v1.11.10 k8s.gcr.io/cloud-controller-manager-amd64:v1.12.8 k8s.gcr.io/cloud-controller-manager-amd64:v1.11.9 k8s.gcr.io/cloud-controller-manager-amd64:v1.12.7 k8s.gcr.io/cloud-controller-manager-amd64:v1.10.13 k8s.gcr.io/cloud-controller-manager-amd64:v1.10.12
+ docker rmi mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.13.9-azs mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.14.5-azs mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.15.2-azs k8s.gcr.io/hyperkube-amd64:v1.15.2 k8s.gcr.io/hyperkube-amd64:v1.14.5 k8s.gcr.io/hyperkube-amd64:v1.13.9 mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.15.1-azs k8s.gcr.io/hyperkube-amd64:v1.15.1 mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.14.4-azs mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.13.8-azs k8s.gcr.io/hyperkube-amd64:v1.14.4 k8s.gcr.io/hyperkube-amd64:v1.13.8 mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.11.10-azs mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.12.8-azs k8s.gcr.io/hyperkube-amd64:v1.11.10 k8s.gcr.io/hyperkube-amd64:v1.12.8 mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.12.7-azs mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.11.9-azs k8s.gcr.io/hyperkube-amd64:v1.11.9 k8s.gcr.io/hyperkube-amd64:v1.12.7 k8s.gcr.io/hyperkube-amd64:v1.10.13 k8s.gcr.io/hyperkube-amd64:v1.10.12
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.13.9-azs
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64@sha256:f365c524fecc8fa9c23d1ad37832be87cb3cddc7317d3f4f11daac185b859128
Deleted: sha256:12a75b3285c0799560e7b2e0c37308e724809b455595a88e9cbd6421e2d3904e
Deleted: sha256:3896616289aa1f15e49b06b4de4f06e4ee645f514612bdecabba9f9eeca8165e
Untagged: k8s.gcr.io/cloud-controller-manager-amd64:v1.15.2
Untagged: k8s.gcr.io/cloud-controller-manager-amd64@sha256:8b1bacb8daa8208f8364098f20f7eea9cae051308008e27d03f26f76f686252d
Deleted: sha256:7964ab09b4bf3d991580cd8283c58785281e287947ac212c328d74fc418a1532
Deleted: sha256:5159ce9566e028134e19ee86124a70f89e89418b02c0277ccb0299bbce1ed47d
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.14.5-azs
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64@sha256:e0e08ec79fadb60bb4a250726f94420f398d912a881d91b8107d5590f3e76fcb
Deleted: sha256:b4bf099e72d564c59d8634223608bcdf18d7587db5b6923c3f10d9d6fbfd61ae
Deleted: sha256:6c4ee17d51e930d21dc8d169c7cfe2b523602ffb78fa1c3f9b91177a6c1e431f
Untagged: k8s.gcr.io/cloud-controller-manager-amd64:v1.14.5
Untagged: k8s.gcr.io/cloud-controller-manager-amd64@sha256:8cddf8de501b24c37157688f970bdbb0244d342fd4999d04af4a2c4034f2b0b6
Deleted: sha256:cfeb7bef08aa5587013552071a8903f34f8479135010da878bf93b182489b72d
Deleted: sha256:dd7697a9d1dc1e47b232c24d1c0b1c5b1039019c9c31005afe22f163649d53c5
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.15.2-azs
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64@sha256:09bb1973445a49947c0995aa13a168b7ce7cbfef9932973fdd270d039fad233b
Deleted: sha256:fb8a776d0fecc4796a7378613dd4120740146d73d5f1748ff9e8d88bd2ac9aa3
Deleted: sha256:d70514d99e4eb947eacb258af2dc41ec894abaa56156374141121d2a79f7c2c6
Untagged: k8s.gcr.io/cloud-controller-manager-amd64:v1.13.9
Untagged: k8s.gcr.io/cloud-controller-manager-amd64@sha256:c6771585b6a8e24f595009265ea5df9726f4df4ca6d752621e9deddd7a10ad22
Deleted: sha256:86cd88ccc6dd31a1181b96d1a7f3d62204404a115473f4315b9809ef04b8f941
Deleted: sha256:9bc93b533725d5a11ed8eb6a921de0d38de60501e4a2a1de3209f2b799c8ae93
Untagged: k8s.gcr.io/hyperkube-amd64:v1.15.2
Untagged: k8s.gcr.io/hyperkube-amd64@sha256:bcc0cb58d0d09a8f8413e2720d815610f33eedf3f67ce45b6e5383a1145185a3
Deleted: sha256:28c8410d1a4d0455b348a19bca8cc4fec4de0cb1944c101ac74e430bbb09d882
Deleted: sha256:8942ec2868becf8b738d0715440972c60fad3d9a853375539ff1558a8746ff50
Untagged: k8s.gcr.io/cloud-controller-manager-amd64:v1.15.1
Untagged: k8s.gcr.io/cloud-controller-manager-amd64@sha256:ee8d8c61971cd05d222498332df18e59d0b1d632b42f981e4978178bcaf7f409
Deleted: sha256:f2abfe3576901304f02e02da9703d9ff4ad0ed97d75917e0064034ef50222a23
Deleted: sha256:4232370b37e94eafa2ed2e68dc0df7a0655f0962d91050c0e0150679ea675a1f
Untagged: k8s.gcr.io/hyperkube-amd64:v1.14.5
Untagged: k8s.gcr.io/hyperkube-amd64@sha256:f30cf9b310d429a8736ebb7b8e3b5175d48691591b4d90244c0f792e5a2bb919
Deleted: sha256:59f7d353c26d6b51cdfe9d54c08942ac9c0565fbf3571aa33395595e05a2f161
Deleted: sha256:eb3618147bfceeb1594edef746c54f04075e08b2fc37c2bb29e808dbd38d7737
Untagged: k8s.gcr.io/cloud-controller-manager-amd64:v1.14.4
Untagged: k8s.gcr.io/cloud-controller-manager-amd64@sha256:a58324e211a31180f7382cf7c9d8f0aeb06c13fe2e1dacf6c181f4ac561f36d2
Deleted: sha256:99454701747947526021f2cb067e6cf7e9492bfa698983b989821ae99bf8f5b6
Deleted: sha256:cc912c032e529cf4640b0f0cfcc9dabfed965548418ca1450d154720dfb5c71d
Untagged: k8s.gcr.io/hyperkube-amd64:v1.13.9
Untagged: k8s.gcr.io/hyperkube-amd64@sha256:be63e4e00b42763aee68be63b834bc5f47daf0c5442c250d7e7085d3458c433e
Deleted: sha256:7e76269e4e737ee089abdcf625b97c8ba4278af185d44548b060f45153b4f297
Deleted: sha256:82a4d8eb176ef697d73dcdc98560ced4cfb35a5e60f8999d5a771669f8b8eba7
Untagged: k8s.gcr.io/cloud-controller-manager-amd64:v1.13.8
Untagged: k8s.gcr.io/cloud-controller-manager-amd64@sha256:ad7b5e95147892c83f84792fc3a8ca9827a4caf7668e43ee866a76f9dd0b89fc
Deleted: sha256:a8cf674a893fe7183d2e230335b6e03e8d8f17cb26bf707b6a0ec6143738d1cd
Deleted: sha256:c6d29fab938a6abcf62386033d0e4a50d72c3d2f4b3f34a0c47192ee605da7c5
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.15.1-azs
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64@sha256:87aab8908a5b462af48b5b1819418ddd59611440d4e89c166b9aab10a7aa1171
Deleted: sha256:a416e99c3fda26420879d3cc047366475e52f8175e7bed5fffa100b4994d6e88
Deleted: sha256:a60091de49efe9ef2d94d57ab57ada7b3c583d48c11abcba0f56ea4c769a6fa3
Untagged: k8s.gcr.io/cloud-controller-manager-amd64:v1.11.10
Untagged: k8s.gcr.io/cloud-controller-manager-amd64@sha256:c83f9a97c3748475c1b151b8f07c2090dfb9482807d909cbc2a0602d88acdaaf
Deleted: sha256:7ce5469873860ac0a4a90d9a055120c4617c1ed2a58a414d3298b23bc0bfbc1e
Deleted: sha256:58562b9e798d172744437e420a76f0a70c820c5e8e2c185a40be4dd644a1b50d
Untagged: k8s.gcr.io/hyperkube-amd64:v1.15.1
Untagged: k8s.gcr.io/hyperkube-amd64@sha256:2b3ca92921ec1b9db8341f262765eb5777c84b6951540b0191d5a9fbe4cc1649
Deleted: sha256:ce850de907910149b2847e725fb220b4ab3ee42810de75eebecc41433add33c2
Deleted: sha256:31eab6d6665da906ea1f532bb30f2ad1f3b15c05d1c6294b9d9dae801824c7f5
Untagged: k8s.gcr.io/cloud-controller-manager-amd64:v1.12.8
Untagged: k8s.gcr.io/cloud-controller-manager-amd64@sha256:d2c9ada27a1e52d57331573ce03eb05a8bff5206174dc5a07cd8a23ca9eaa481
Deleted: sha256:9e54a7127b20eadf5f2f5e412ef44c08aefc276765033a7ff3d1808f6f4a397f
Deleted: sha256:0a150afce0001266935b3d3697f9500134ef957fb85ef2db1526bd1551731b89
Deleted: sha256:0b97b1c81a3200e9eeb87f17a5d25a50791a16fa08fc41eb94ad15f26516ccea
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.14.4-azs
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64@sha256:eed04e64e97773955472880beb6240313f5a878c38661fa5b634920f57c30ae9
Deleted: sha256:3a01131727a8377a088db3d2c37074d74f8e31a74097ae0ce9f8b5595fca73f7
Deleted: sha256:8138ff41aa08732c207912382b7dbfd503e86788f83652b8a7e29ec2adf6a71e
Untagged: k8s.gcr.io/cloud-controller-manager-amd64:v1.11.9
Untagged: k8s.gcr.io/cloud-controller-manager-amd64@sha256:6b49ca52fdb50c380dd9df641c99709aaad07da863db71933c4c826b7c19df1d
Deleted: sha256:aaf7a653e5a8ccbe68825bfc6dcc02da728a7b4e566b3cf59287cff79af9dc5f
Deleted: sha256:2821c78e8b6efc944b5f7b7242d9f2e6403d0c9b1f80a8d7ac21c613d979b37d
Untagged: k8s.gcr.io/cloud-controller-manager-amd64:v1.12.7
Untagged: k8s.gcr.io/cloud-controller-manager-amd64@sha256:c5b5c835a4a435da69f226fc08e27d2cc52e5911acc8215bb2c6372932bee7c0
Deleted: sha256:5374fd4eef01867696de8cbf8775e6f2510c4283b4f172284e02e4e3c59e1405
Deleted: sha256:8c34b118e182f2bec0e42bdbe8f00b33c89b9478f5f0a98c46a81c9963b604c2
Deleted: sha256:adab5d09ba79ecf30d3a5af58394b23a447eda7ffffe16c500ddc5ccb4c0222f
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.13.8-azs
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64@sha256:c9d6a2cbffaf316e7bc6cd9fa21eaa1c6b335958698ae465e3f8116538129698
Deleted: sha256:f6897cac7e8a2230baeafedfe9f1f856cd80ba5d38f1fe1c077117ab58742ad5
Deleted: sha256:d7628aa9c7a144c396f26f5f0c32b98c305d2ae3db439ab02961ced5797b17db
Untagged: k8s.gcr.io/cloud-controller-manager-amd64:v1.10.13
Untagged: k8s.gcr.io/cloud-controller-manager-amd64@sha256:c27f3acfe0e3a53580b6d6ba24d40a4f8ee2c5de5798b7c6792cd79721569b5f
Deleted: sha256:8d365b0b24e5e0da5f40da308d5286eb60f5a5dc450ef3d49f6541ee28da50ce
Deleted: sha256:dc539800a74189063c85bd1189774899837ab7624978dc087a0e54eec7f123e8
Deleted: sha256:683f499823be212bf04cb9540407d8353803c25d0d9eb5f2fdb62786d8b95ead
Untagged: k8s.gcr.io/hyperkube-amd64:v1.14.4
Untagged: k8s.gcr.io/hyperkube-amd64@sha256:c792a3b37af08acb41488fd980d93d943d113a7d9ddc3721310dfca5e2cb876a
Deleted: sha256:8da7f6094b8a601ddc597bbb47bafef1173a99c37cb04a5e6fa6367ee6ac3796
Deleted: sha256:6ab963935333737cf3c26bb103c8109ae947cc25a9ece16321050b6719d879d7
Untagged: k8s.gcr.io/hyperkube-amd64:v1.13.8
Untagged: k8s.gcr.io/hyperkube-amd64@sha256:1b055b93fe15efbc8eda377cd145474ea7e9fbb858a02a688d258bd6b272cc76
Deleted: sha256:dd72e4ec8328f1e62b70957dad8c924ff7798d422ae59a342af37762625f13b6
Deleted: sha256:42590248b0248ffd847e3e5bbf733d917dfa78363b28907ce7287642d47c4b4e
Deleted: sha256:07b16eebd486b6b27f31a8c5a320e66c077a38061e5a8488ca0fb1f8daa81b6a
Deleted: sha256:8744b03b0ae13719a16a1722c008cafc84c87692ec57c04a414e314ef3cfbbbd
Deleted: sha256:3772c7f3857782941c48c820957678d216276b2cd88677a7a8cea99ac870d409
Deleted: sha256:158938159f90e160dc487c87d1c1763cf8249a855ecb8919695a4f2de1f59b64
Deleted: sha256:ecfafacff829a67d2852dca0cbfd920bc7ea2659f0e171b59acb11e6c5a7d51d
Deleted: sha256:1102ac8eec366e40d6a89007334a7154417ce2b14faa4c68659a3ce039fb5e6a
Untagged: k8s.gcr.io/cloud-controller-manager-amd64:v1.10.12
Untagged: k8s.gcr.io/cloud-controller-manager-amd64@sha256:f9054b8a448894837adb1ee0635cbd3c83ff6a4af0c937f72f1e8b6c1ebfc0b6
Deleted: sha256:f640b381ac400ac65d63fe21a0359bdcc016581c2ae440096bae8f4177239b17
Deleted: sha256:1431ee670aaca74a0587982a76835ad1c5a9e7472252fcc4944a4d25ad651d6d
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.11.10-azs
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64@sha256:61a636725b2e46f2a8413b79234dbb4d28454ff5a8c6eecd9f712db14a68cfd5
Deleted: sha256:d55d027865a8a8b8f048758bd716abcb03a100d1a668431e02a558a1169a4d47
Deleted: sha256:9524a1d3552aec0455657b11f1dd190b9c783f00fbe83e731db65a476cb5ea22
+ for i in '$(seq 1 $retries)'
+ timeout 25 apt-mark unhold walinuxagent
Canceled hold on walinuxagent.
+ break
+ echo Executed '"apt-mark' unhold 'walinuxagent"' 2 times
Executed "apt-mark unhold walinuxagent" 2 times
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.12.8-azs
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64@sha256:00f0375a8182774252a1880f174c3418d34f38b20950c91e69c967396704f860
Deleted: sha256:27c8be65e0201f72fd7a01ab33f8569a298f45602b2c78c9a04fb32f3bf6a5c7
Deleted: sha256:65bdb6b17603c264a434ea59d411210d8ef85f1759bd6b0aeec21b959ba138d3
Deleted: sha256:4d8c5c628d607046cad8400f79dd333dce017a5a528ba64cccfa1748c99432cd
Untagged: k8s.gcr.io/hyperkube-amd64:v1.11.10
Untagged: k8s.gcr.io/hyperkube-amd64@sha256:8655691b4c65638134709960d1b0cb6c57c6c61a139057cbecdc858157edc2af
Deleted: sha256:3ec39b12dbf592a7635ac5ec99b5618e07d1cb4461f2ddc65114936e0ee46895
Deleted: sha256:b844e7b7a0e6a7276d93891d7eb4df4b51b6f6c0a592ff94e63fb6c185c8b182
Deleted: sha256:f3f3e725cb44e45d5f535872c077d1d3b45c4e9d67c1352286a278df0fb6aa28
Untagged: k8s.gcr.io/hyperkube-amd64:v1.12.8
Untagged: k8s.gcr.io/hyperkube-amd64@sha256:782d00eed2465f50e8b0f2d7ee6f13115418a97bf0822ef117ee0037f4bc0641
Deleted: sha256:1fa74e02cd1ba5b6c6ce1502dc8753c6f750abc31c1edba18cc3382a4f00ea12
Deleted: sha256:ac225042a5ee033139717cba49c36f798c7ae002c2215b859a899df283fa8dbb
Deleted: sha256:f5d1e8ec2a14b88552471a29e1cb1956afc52218c36ecc29de02361cedc39f7f
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.12.7-azs
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64@sha256:c67baac996e0a9ce5d2fad2daea14cf0ee227d192c43e7411b6ec1933bdd1709
Deleted: sha256:0e8c564447b696bcbc928f915e206f96721296d5551d164eed291e621eaf7f6f
Deleted: sha256:32bd77938b1d0b2b5f6b80e94fb9b4394213b373c3cbf9372dedaff2b0bf1689
Deleted: sha256:cc16cb18779af30d1dfbbb7e03ace74bbaff8542c99def2f3c7df4ef3ede13e2
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64:v1.11.9-azs
Untagged: mcr.microsoft.com/k8s/azurestack/core/hyperkube-amd64@sha256:c26ec4501606b609b01d0ecf89b267343513490413134b71431440a0cafd0611
Deleted: sha256:2033b1dcd62cef45790457c7bcee344eab2d21ff48d94165e146f6437be13519
Deleted: sha256:6dbe23d1fcbb14a59ee4fee2dc9ace447638f89c525aa6bceb9b77073605d988
Deleted: sha256:e5c9926d78953ea4386002552ff644fa6ccc53fcd3524b3bba22ee9dd6ecd932
Untagged: k8s.gcr.io/hyperkube-amd64:v1.11.9
Untagged: k8s.gcr.io/hyperkube-amd64@sha256:f2db543e78a8a1f055d4738916188ed8ea8bb76c0e3d1a02ef189c30afbb1c82
Deleted: sha256:3530e7eb061d6459c8ece5ba3edc33e452c9a51fba7b40cd47a4300cac5826da
Deleted: sha256:2a3e9ce3092f236c8c3c5e435eb3c6f5dbd47c36a6c0754835174988ca75f630
Deleted: sha256:902ddc30d83e18f81ced6d3d8ee7cfdf0ee16be78a075fa620c812544357dd23
Untagged: k8s.gcr.io/hyperkube-amd64:v1.12.7
Untagged: k8s.gcr.io/hyperkube-amd64@sha256:db736d836f8d954178d121c00cfcf7c61ef0d433ca865c57ca5ddc905241fb9f
Deleted: sha256:417c9a1be9d5943b76e439b015b23c9ed5db1bda6684dfd4941cc94d4268e010
Deleted: sha256:7ca13df6ef13df8fe03d827fa23f773b170d7b8ec0061d51a3f8d768afa7248b
Deleted: sha256:1f44d96f7df7804ba535dc589a3578c3800d7cd97cec586ccc46f463f2485e4f
Untagged: k8s.gcr.io/hyperkube-amd64:v1.10.13
Untagged: k8s.gcr.io/hyperkube-amd64@sha256:b835a26ac4bdf6033cd141bbe2b22b040ae1fbd873165fb4194971e3aefafec0
Deleted: sha256:92e8364db5b0e33461a1bf4821c916de0438b37a7c710804e61dd639ee503105
Deleted: sha256:ff64f0875715074ee915ce8075dc6e5352fef5b8146a046062657db0044de981
Deleted: sha256:b0aaad943e5c4f263caf69518393fe424cbba0ff61a28f579c10c7987bc45849
Untagged: k8s.gcr.io/hyperkube-amd64:v1.10.12
Untagged: k8s.gcr.io/hyperkube-amd64@sha256:946e421d443bd9417df02da7c68c70a5a3c6effd86ff0e9993eecd53fc13d1f4
Deleted: sha256:99a8eada1f9cafc2334385fc6178fccaabfa096e254752dd43a64653e882abc2
Deleted: sha256:f4b6fcdf27e3dbbcf657146b5a8cc0ac0770d1df549ffc0081eb735fcc5bf6d3
Deleted: sha256:a4f56f658452f0cf68688a2e774a23b14a85c2046e67d4fb74ecaf5dd99ba016
Deleted: sha256:7bccead2e89c2e22e4069b4de6143c75d4211abd6c99aff562745327daaa2b68
Deleted: sha256:223165ef645b8f698fd045c4680046b38e02022dde34cf3fe53b9dd87d8e1a0e
Deleted: sha256:373025e3a7d9f2465dd96094d77d5ee3fbbad2f83414fa4e4c0379f65b6c8870
Deleted: sha256:47a633707bc5ef6708bc872a783906f695a90241871fcf445a367e5b02c3c41a
Deleted: sha256:f35f2ebb6be076bc21d225ad99c4710e187934ef6e49e1f2f0f5f94248986648

@sharkymcdongles
Copy link

@andyzhangx ^

Please let me know ASAP; our cluster is in a bad state because of this.

@sharkymcdongles
Copy link

I manually deleted the failed nodes and was at least able to get the vmss update command to run through. Disks that were on the drained and deleted nodes still show as attached to them, even though those nodes no longer exist. I was able to fix a few by running:

az disk update --id

@doggvalera
Copy link

Have the same issue with Prometheus. But the solution to delete the PVC doesn't help.

Warning  FailedMount  2m9s (x127 over 4h48m)  kubelet, aks-default-xxxxxx-vmss000004  Unable to mount volumes for pod "prometheus-main-0_monitoring(xxxx)": timeout expired waiting for volumes to attach or mount for pod "monitoring"/"prometheus-main-0". list of unmounted volumes=[prometheus-main-db]. list of unattached volumes=[prometheus-main-db config config-out prometheus-main-rulefiles-0 prometheus-token] 

The PVC itself looks fine: prometheus-main-db-prometheus-main-0 Bound pvc-xxx 20Gi RWO default

@kwikwag
Copy link
Author

kwikwag commented Oct 20, 2019

To update on the issue I was facing: first, I noticed that all the error messages involved a single volume.

 Cannot attach data disk 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' to VM 'aks-nodepool1-xxxxxxxx-vmss_0' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again.

This led me to understand that there was a single culprit disk that I should try to detach manually. I couldn't find this GUID anywhere (looking in az disk list -g MC_xxx and az vmss show -g MC_xxx -n aks-nodepool1-xxxxxxxx-vmss --instance-id 0 --query storageProfile.dataDisks; BTW the command from the docs gave me an empty list until I explicitly queried the instance with --instance-id). However, two disks belonging to Zookeeper (identified by their Azure tags) showed up as Attached in the MC_ resource group. Since all StatefulSets were scaled down (except for one, which wasn't Zookeeper, and whose pods were still stuck on ContainerCreating), I figured detaching them manually would be safe (and might help). That didn't do the trick, but it got me one step forward (finally, something finished successfully!) and set me on the right path. Here's a log of what I did after, with (approximate) times:

  • 15:42 detached the disks with the Azure CLI
  • 16:14 manual VMSS upgrade via the Azure CLI
  • 16:25 restarted the VMSS via the Azure portal
  • 16:28 same-version cluster upgrade
  • 16:31 manual VMSS VM instance upgrade via the Azure portal
  • 16:44 scaled the K8S cluster up from 1 to 2 nodes

Between each step, I tried killing the two remaining StatefulSet-related pods to let them re-attach. Finally, at 16:47, the pods came out of ContainerCreating and I saw Running for the first time in ages... After scaling all StatefulSets back up, everything slowly started going back to normal.

I suspect it was the second Azure-portal upgrade of the VM instance that helped; either that, or scaling back up after scaling down (I started scaling down hoping to drain the original node, but ended up not needing to). One weird thing with respect to the upgrades: after the first and second upgrade, the Azure portal reported the sole instance of the VMSS (Standard_DS3_v2 size) as running the "Latest model", but after things started running again (possibly only after scaling?), "Latest model" showed "No".

I would conclude that a workaround for this issue, at least in my case, might be the following (this is all still voodoo; a scripted sketch of steps 3-4 follows at the end of this comment):

  1. Scale down all StatefulSets to 0 (kubectl -n namespace scale --all=true statefulset --replicas=0 for each namespace)
  2. Scale down to 1 node (az aks scale -g xxx --name xxx-k8s --node-count 1)
  3. Ensure all VMSS disks are detached:
    3.1. List attached volumes with az disk list -g MC_xxx --query "[?diskState=='Attached'].name"
    3.2. Cross-reference the LUNs with az vmss show -g MC_xxx -n aks-nodepool1-xxxxxxxx-vmss --instance-id 0 --query "storageProfile.dataDisks[].{name: name, lun: lun}"
    3.3. Detach them with az vmss disk detach -g MC_xxx --vmss-name aks-nodepool1-xxxxxxxx-vmss --instance-id 0 --lun x (for each LUN)
  4. Update the node (az vmss update-instances -g MC_xxx --name aks-nodepool1-xxxxxxxx-vmss --instance-id 0)
  5. Perform a forced same-version upgrade (az aks upgrade -g xxx --name xxx-k8s --kubernetes-version 1.14.6)
  6. Update the node again (az vmss update-instances -g MC_xxx --name aks-nodepool1-xxxxxxxx-vmss --instance-id 0)
  7. Scale the StatefulSets (and, if needed, the node pool) back up (kubectl -n namespace scale --all=true statefulset --replicas=x)

I'm sure some steps here are extraneous, and I don't know if it'll really work the next time I encounter this problem, but it's worth writing it down in hopes that it will help me or someone else in the future...

This, of course, doesn't solve the issue, as it doesn't explain how we got here in the first place. And, truthfully, having to scale down + back up is very uncomfortable. Better than losing the PVCs, but still not good enough. Would be happy to receive any updates regarding this issue (will upgrading to the newer 1.15 preview version of Kubernetes work?).
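
For reference, here are steps 3-4 above in scripted form. This is only a sketch: the resource group, VMSS name and instance id are the placeholders used above, and the list of LUNs to detach must be verified by hand (detaching a disk that is legitimately in use will break the pod using it).

# placeholders: MC_ resource group and VMSS name as used above
RG=MC_xxx
VMSS=aks-nodepool1-xxxxxxxx-vmss
# 3.1 list disks Azure still considers attached
az disk list -g "$RG" --query "[?diskState=='Attached'].name" -o tsv
# 3.2 cross-reference against the data disks/LUNs on instance 0
az vmss show -g "$RG" -n "$VMSS" --instance-id 0 --query "storageProfile.dataDisks[].{name: name, lun: lun}" -o table
# 3.3 detach each stale LUN, then bring the instance back to the latest model
az vmss disk detach -g "$RG" --vmss-name "$VMSS" --instance-id 0 --lun <lun>
az vmss update-instances -g "$RG" --name "$VMSS" --instance-id 0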

@sharkymcdongles
Copy link

sharkymcdongles commented Oct 20, 2019 via email

@worldspawn
Copy link

Yeah, I'm about to put a bunch of AKS clusters into prod. This is not inspiring any confidence: a support ticket every week for some random failure.

@palma21
Copy link
Member

palma21 commented Oct 20, 2019

There are a few different possible issues in play here. A few of these clusters are being throttled by the VMSS API (see below), which can cause this; you might need to update to the latest k8s patch versions.

I see some of you have opened tickets already, could you share the numbers?
Additionally, feel free to send those/your details to my email directly. Would love to connect with you to make sure your problems are not recurrent and you can confidently go to production.

Jorge.Palma [at] microsoft.com

Details=[{"code":"TooManyRequests","message":"{"operationGroup":"GetVMScaleSetVM30Min","startTime":"2019-10-18T08:32:36.6723441+00:00","endTime":"2019-10-18T08:47:36.6723441+00:00","allowedRequestCount":2500,"measuredRequestCount":2843}","target":"GetVMScaleSetVM30Min"}] InnerError={"internalErrorCode":"TooManyRequestsReceived"})

@worldspawn
Copy link

@palma21 119101823000010
Confirmed this is still occurring on my (to-be) prod cluster.

@andyzhangx
Copy link
Contributor

@CecileRobertMichon have you noticed any useful info in the above cluster-provisioning.log? I could find the error logs below:

Error: No such image: registry:2
dpkg: error: dpkg frontend is locked by another process
E: Sub-process dpkg --set-selections returned an error code (2)
E: Executing dpkg failed. Are you root?

@andyzhangx
Copy link
Contributor

Detaching the problematic PVC disk is always OK; AKS will try to attach it back in a loop when the pod is scheduled onto the new node.
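
A minimal example of that manual detach (a sketch only; the MC_ resource group, VMSS name, instance id and LUN are placeholders to fill in from your own cluster):

# find the LUN of the problematic data disk on the instance
az vmss show -g <MC_resource_group> -n <vmss_name> --instance-id <id> --query "storageProfile.dataDisks[].{name: name, lun: lun}" -o table
# detach it; the attach-detach controller will retry the attach once the pod is rescheduled
az vmss disk detach -g <MC_resource_group> --vmss-name <vmss_name> --instance-id <id> --lun <lun>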

@sharkymcdongles
Copy link

After migrating the entirety of production via velero, things seem to be okay for now in the new 1.15.4 cluster. No idea how long this will last, though. It'd be nice to know whether we can expect this to happen again, or whether a resolution has been found. After this I am tempted to set all VM extensions for my VMSS to autoupdate=false, as I suspect that was the root cause.

@sylus
Copy link

sylus commented Oct 21, 2019

Can we get some sort of statement from Azure on what is going on? I think this issue is happening to enough people to warrant some sort of explanation. I am quite concerned about this; obviously I know you guys are busy ^_^

@sylus
Copy link

sylus commented Oct 21, 2019

It is pretty clear that scale sets are completely broken. We have had this issue on every cluster across different subscriptions, and it even affects VMSS with Windows nodes, not just Linux. I am quite worried that it has been about a week now with no clear answer as to what is going on or what changed.

Obviously I know the Azure folks are doing their best; I'm just a bit concerned :)

@palma21
Copy link
Member

palma21 commented Oct 21, 2019

I provided a preliminary explanation above; without knowing your clusters and looking into them, I can't give a definitive one. Could you reach out (options provided above) with your cluster details/case numbers so I can check whether that is happening in your case as well and provide the needed recommendations/fixes.

@sharkymcdongles
Copy link

I still have both broken clusters provisioned in my subscription if anyone wants to take a look. @palma21

@palma21
Copy link
Member

palma21 commented Oct 21, 2019

I do, please open a ticket and send me the ticket number and I will take a look.

@palma21
Copy link
Member

palma21 commented Nov 27, 2019

Closing now that all fixes from #1278 (comment) are rolled out, and keeping the issue pinned for a week to avoid mixing in any new questions/issues.

@palma21 palma21 closed this as completed Nov 27, 2019
@jluk jluk unpinned this issue Dec 6, 2019
@stromvirvel
Copy link

Doesn't seem to be resolved on my AKS cluster v1.15.5.

Events:
  Type     Reason       Age    From                                       Message
  ----     ------       ----   ----                                       -------
  Normal   Scheduled    2m58s  default-scheduler                          Successfully assigned infra-gitlab/infra-gitlab-gitlab-redis-server-0 to aks-lind2sv3-23489316-vmss000002
  Warning  FailedMount  55s    kubelet, aks-lind2sv3-23489316-vmss000002  Unable to mount volumes for pod "infra-gitlab-gitlab-redis-server-0_infra-gitlab(bc91107a-98e0-4a6e-8806-8ce8841dd8d7)": timeout expired waiting for volumes to attach or mount for pod "infra-gitlab"/"infra-gitlab-gitlab-redis-server-0". list of unmounted volumes=[data]. list of unattached volumes=[data infra-gitlab-gitlab-config infra-gitlab-gitlab-metrics infra-gitlab-gitlab infra-gitlab-gitlab-redis-token-g8pbh]

Is there any follow-up issue since this one was closed?

@andyzhangx
Copy link
Contributor

Doesn't seem to be resolved on my AKS cluster v1.15.5.

Events:
  Type     Reason       Age    From                                       Message
  ----     ------       ----   ----                                       -------
  Normal   Scheduled    2m58s  default-scheduler                          Successfully assigned infra-gitlab/infra-gitlab-gitlab-redis-server-0 to aks-lind2sv3-23489316-vmss000002
  Warning  FailedMount  55s    kubelet, aks-lind2sv3-23489316-vmss000002  Unable to mount volumes for pod "infra-gitlab-gitlab-redis-server-0_infra-gitlab(bc91107a-98e0-4a6e-8806-8ce8841dd8d7)": timeout expired waiting for volumes to attach or mount for pod "infra-gitlab"/"infra-gitlab-gitlab-redis-server-0". list of unmounted volumes=[data]. list of unattached volumes=[data infra-gitlab-gitlab-config infra-gitlab-gitlab-metrics infra-gitlab-gitlab infra-gitlab-gitlab-redis-token-g8pbh]

Is there any follow-up issue since this one was closed?

Please file a support ticket for this issue. Per the events provided, I don't see an AttachVolume.Attach operation happening from the attachdetach-controller; are there other disk attach operations happening in parallel? You may also wait a few more minutes and check what finally happens.

@stromvirvel
Copy link

@andyzhangx
Sorry, the event output might give the impression I didn't even wait a minute for the disk to mount. Actually I waited several minutes and then desperately removed the Pod, in the naive hope it would magically start to work :)

After waiting an hour or so, this is the output:

Events:
  Type     Reason              Age                 From                                       Message
  ----     ------              ----                ----                                       -------
  Warning  FailedAttachVolume  109s (x6 over 66m)  attachdetach-controller                    AttachVolume.Attach failed for volume "pvc-f7ddb9b1-74cf-40bf-a62c-bbd918cc9fc5" : compute.VirtualMachineScaleSetVMsClient#Update: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk 'kubernetes-dynamic-pvc-0406a7c0-3b89-41b8-b147-21c48b5a74b0' to VM 'aks-lind2sv3-23489316-vmss_2' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again. Instructions can be found at https://aka.ms/AzureDiskDetached"
  Warning  FailedMount         40s (x35 over 77m)  kubelet, aks-lind2sv3-23489316-vmss000002  Unable to mount volumes for pod "infra-gitlab-gitlab-redis-server-0_infra-gitlab(bc91107a-98e0-4a6e-8806-8ce8841dd8d7)": timeout expired waiting for volumes to attach or mount for pod "infra-gitlab"/"infra-gitlab-gitlab-redis-server-0". list of unmounted volumes=[data]. list of unattached volumes=[data infra-gitlab-gitlab-config infra-gitlab-gitlab-metrics infra-gitlab-gitlab infra-gitlab-gitlab-redis-token-g8pbh]

The whole cluster currently has only 3 PVCs: one is unmounted (just used by a cronjob), and the other two should each be mounted by a Pod, but both are stuck.

@skinny
Copy link

skinny commented Jan 13, 2020

Coming from 1.14.6, I upgraded two of my (luckily test) clusters today to 1.15.7 and started having these errors. No changes were made to the deployed manifests.


  Type     Reason              Age                From                                      Message
  ----     ------              ----               ----                                      -------
  Normal   Scheduled           4m8s  default-scheduler                         Successfully assigned xxx-yyy-zzz/redis-ha-0 to aks-default-13451451-vmss000003
  Warning  FailedAttachVolume  4m8s  attachdetach-controller                   Multi-Attach error for volume "pvc-1bb2e8f6-230f-11ea-8427-12752f5c025d" Volume is already exclusively attached to one node and can't be attached to another
  Warning  FailedMount         14s   kubelet, aks-default-yyyy-vmss000003  Unable to mount volumes for pod "redis-ha-0_xxx-yyy-zzz(a5c9faca-9a1e-4850-8620-3d6e38c066d0)": timeout expired waiting for volumes to attach or mount for pod "xxx-yyy-zzz"/"redis-ha-0". list of unmounted volumes=[redis-ha-data]. list of unattached volumes=[redis-ha-data config probes redis-ha-token-nhq9r]

@andyzhangx
Copy link
Contributor

aks-lind2sv3-23489316-vmss_2

@stromvirvel could you run the following command? It looks like one of your VMSS instances is in a limbo state:

az vmss update-instances -g <RESOURCE_GROUP_NAME> --name <VMSS_NAME> --instance-id 2

@andyzhangx
Copy link
Contributor

andyzhangx commented Jan 13, 2020

pvc-1bb2e8f6-230f-11ea-8427-12752f5c025d

@skinny it looks like there is a disk detach issue when upgrading from 1.14.6 to 1.15.7. Could you find which VM pvc-1bb2e8f6-230f-11ea-8427-12752f5c025d is attached to, and then detach that disk from that VM manually? On 1.15.7 there is a disk attach/detach self-healing feature: it first checks whether the disk is already attached to another node, and if so, it triggers a dangling error so that the k8s controller detaches the disk first and only then performs the attach volume operation. I'm not sure why that did not happen here; how long have you waited? @skinny
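
If it helps, one way to see where that disk currently sits is to query the managed disk's managedBy property (a sketch only; the MC_ resource group, the disk name prefix and the VMSS/instance values are placeholders):

# the managed disk backing the PV is typically named kubernetes-dynamic-<pv-name>
az disk list -g <MC_resource_group> --query "[?contains(name, 'pvc-1bb2e8f6')].{name: name, state: diskState, attachedTo: managedBy}" -o table
# if it is still attached to a VMSS instance it should not be on, detach it by LUN
az vmss disk detach -g <MC_resource_group> --vmss-name <vmss_name> --instance-id <id> --lun <lun>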

@skinny
Copy link

skinny commented Jan 13, 2020

Surprisingly the disk shows no attachment to a VM in the Azure portal. I did try to run the vmssfix.sh script I found earlier but that (obviously) didn't help either.

I have tried it several times by deleting the pod, scaling the Statefulset and waiting for periods between a few minutes and over an hour but still the same result

@andyzhangx
Copy link
Contributor

Surprisingly the disk shows no attachment to a VM in the Azure portal. I did try to run the vmssfix.sh script I found earlier but that (obviously) didn't help either.

I have tried it several times by deleting the pod, scaling the Statefulset and waiting for periods between a few minutes and over an hour but still the same result

@skinny then could you run the following command to update the aks-default-yyyy-vmss000003 instance?

az vmss update-instances -g <RESOURCE_GROUP_NAME> --name <VMSS_NAME> --instance-id <id>

@skinny
Copy link

skinny commented Jan 13, 2020

Ok, ran it for all four nodes in the cluster :

az vmss update-instances -g xx-yy-zz -n aks-default-12312312-vmss --instance-id 0
az vmss update-instances -g xx-yy-zz -n aks-default-12312312-vmss --instance-id 1
az vmss update-instances -g xx-yy-zz -n aks-default-12312312-vmss --instance-id 2
az vmss update-instances -g xx-yy-zz -n aks-default-12312312-vmss --instance-id 3

No direct change visible, but do I need to redeploy the Pod ?

Current pod events :


Events:
  Type     Reason              Age                From                                      Message
  ----     ------              ----               ----                                      -------
  Normal   Scheduled           11m                default-scheduler                         Successfully assigned euw-xxx-yyy/redis-ha-0 to aks-default-xxxx-vmss000000
  Warning  FailedAttachVolume  11m                attachdetach-controller                   Multi-Attach error for volume "pvc-1bb2e8f6-230f-11ea-8427-12753f5c025d" Volume is already exclusively attached to one node and can't be attached to another
  Warning  FailedMount         3s (x5 over 9m6s)  kubelet, aks-default-xxxx-vmss000000  Unable to mount volumes for pod "redis-ha-0_euw-xxx-yyy(d1b70293-dc12-4398-8062-4aa47be7c3fd)": timeout expired waiting for volumes to attach or mount for pod "euw-xxx-yyy"/"redis-ha-0". list of unmounted volumes=[redis-ha-data]. list of unattached volumes=[redis-ha-data config probes redis-ha-token-nhq9r]

@andyzhangx
Copy link
Contributor

Ok, ran it for all four nodes in the cluster :

az vmss update-instances -g xx-yy-zz -n aks-default-12312312-vmss --instance-id 0
az vmss update-instances -g xx-yy-zz -n aks-default-12312312-vmss --instance-id 1
az vmss update-instances -g xx-yy-zz -n aks-default-12312312-vmss --instance-id 2
az vmss update-instances -g xx-yy-zz -n aks-default-12312312-vmss --instance-id 3

No direct change visible, but do I need to redeploy the Pod ?

Current pod events :


Events:
  Type     Reason              Age                From                                      Message
  ----     ------              ----               ----                                      -------
  Normal   Scheduled           11m                default-scheduler                         Successfully assigned euw-xxx-yyy/redis-ha-0 to aks-default-xxxx-vmss000000
  Warning  FailedAttachVolume  11m                attachdetach-controller                   Multi-Attach error for volume "pvc-1bb2e8f6-230f-11ea-8427-12753f5c025d" Volume is already exclusively attached to one node and can't be attached to another
  Warning  FailedMount         3s (x5 over 9m6s)  kubelet, aks-default-xxxx-vmss000000  Unable to mount volumes for pod "redis-ha-0_euw-xxx-yyy(d1b70293-dc12-4398-8062-4aa47be7c3fd)": timeout expired waiting for volumes to attach or mount for pod "euw-xxx-yyy"/"redis-ha-0". list of unmounted volumes=[redis-ha-data]. list of unattached volumes=[redis-ha-data config probes redis-ha-token-nhq9r]

@skinny you could wait; it will retry the attach. If you want to make it faster, redeploying the pod would be quicker, thanks.
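
For example (using the namespace and pod name from the events above), deleting the pod is enough for the StatefulSet controller to recreate it and trigger a fresh attach attempt:

kubectl -n euw-xxx-yyy delete pod redis-ha-0
kubectl -n euw-xxx-yyy get pod redis-ha-0 -w   # watch it get rescheduled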

@skinny
Copy link

skinny commented Jan 13, 2020

Unfortunately the same timeout error appears some two minutes after redeployment. Will leave it at this for another bit.

Last week I had the same kind of issues on another (1.14.x) cluster which I resolved with a lot of these manual steps. I hoped upgrading to 1.15.7 would finally solve these issues but this time even the manual steps are not helping

@andyzhangx
Copy link
Contributor

andyzhangx commented Jan 13, 2020

Multi-Attach error

@skinny I was a little wrong: this "Multi-Attach error for volume xxx Volume is already exclusively attached to one node and can't be attached to another" error is due to an orphan pod stuck on one of your nodes (not removed correctly); the related code is here:
https://github.com/kubernetes/kubernetes/blob/36e40fb850293076b415ae3d376f5f81dc897105/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L319-L321

Update: it could not even get to the disk attach process because of that issue. Are you trying to run two pods with the same disk PVC? Could you file another new issue and paste your Deployment config? You may change your config a little per this instruction: https://github.com/andyzhangx/demo/blob/master/issues/azuredisk-issues.md#25-multi-attach-error

@skinny
Copy link

skinny commented Jan 13, 2020

@andyzhangx That Multi-Attach error is always followed by a timeout message a couple of minutes later. Is there a way to clean this up ?

@andyzhangx
Copy link
Contributor

@andyzhangx That Multi-Attach error is always followed by a timeout message a couple of minutes later. Is there a way to clean this up ?

@skinny
Update: it could not even get to the disk attach process because of that issue. Are you trying to run two pods with the same disk PVC? Could you file another new issue and paste your Deployment config? You may change your config a little per this instruction: https://github.com/andyzhangx/demo/blob/master/issues/azuredisk-issues.md#25-multi-attach-error

@skinny
Copy link

skinny commented Jan 13, 2020

@andyzhangx thanks, I will delete the deployment for now and let it "rest" for a bit. Tonight I'll deploy the statefulset again and see if it works then.

@andyzhangx
Copy link
Contributor

andyzhangx commented Jan 13, 2020

@andyzhangx thanks, I will delete the deployment for now and let it "rest" for a bit. Tonight I'll deploy the statefulset again and see if it works then.

@skinny I suspect the volumesAttached status on one of your nodes is incorrect; here is a way to check which node has an incorrect volumesAttached entry for that volume:

kubectl get no -o yaml | grep pvc-1bb2e8f6-230f-11ea-8427-12753f5c025d

Update: if you find a volume that should not be in a node's volumesAttached list, restarting kubelet on that node should work.
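
A slightly more targeted check, plus a kubelet restart without SSH via VMSS run-command (a sketch; the node name, resource group, VMSS name and instance id are placeholders):

# show which volumes the node object believes are attached
kubectl get node <node_name> -o jsonpath='{.status.volumesAttached}'
# if a stale entry shows up there, restart kubelet on that instance
az vmss run-command invoke -g <MC_resource_group> -n <vmss_name> --instance-id <id> --command-id RunShellScript --scripts "systemctl restart kubelet"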

@skinny
Copy link

skinny commented Jan 13, 2020

@andyzhangx It's getting weirder by the minute

  • The grep you suggested comes up empty
  • I scale the StatefulSet from 0 to 1 and the PV actually gets attached (the grep shows it under volumesAttached)
  • However, the Multi-Attach error for the volume still occurs, followed by a timeout two minutes later.

@stromvirvel
Copy link

aks-lind2sv3-23489316-vmss_2

@stromvirvel could you run following command, looks like one of your vmss instance is in limbo state:

az vmss update-instances -g <RESOURCE_GROUP_NAME> --name <VMSS_NAME> --instance-id 2

I missed your answer. Anyway, in the meantime I decided to destroy the whole cluster and recreate it from scratch, because there were a lot of disks (PVCs, not OS disks) still shown in the portal, sometimes in an attached and sometimes in a detached state, even though I had entirely removed all PVCs in all namespaces and all PVs.
Anyway, now the cluster deletion has failed and is also somehow stuck in a deleting state 🤷🏼‍♂️.

@andyzhangx
Copy link
Contributor

@andyzhangx It's getting weirder by the minute

  • The grep you suggested comes up empty
  • I scale the StatefulSet from 0 to 1 and the PV actually gets attached (the grep shows it under volumesAttached)
  • However, the Multi-Attach error for the volume still occurs, followed by a timeout two minutes later.

@skinny if it's already attached to one node successfully, why is there a second attach leading to the Multi-Attach error for the volume? Are there two pods using the same PVC?

@skinny
Copy link

skinny commented Jan 14, 2020

@andyzhangx nope, I deleted all deployments/pods and only left the PVC intact. The grep for attached disks comes up empty before I try to use the PVC. Then, for a few moments, the disk actually shows up in the volumesAttached output before disappearing again, and the Multi-Attach error shows up in the events history.

@Vandersteen
Copy link

We are also encountering the issues listed above (AKS 1.14.8 with a VM scale set).

Support request: 120022522001011

@croblesm
Copy link

croblesm commented Apr 1, 2020

I had exactly the same problem, but was able to solve it with the following steps:

  • Deleted the PVC (without force) with no errors
  • Checked the Azure portal; the shared disk was still around
  • Got the name of the node where all the pods were in "ContainerCreating" status
  • Checked the pods; every single pod was in "ContainerCreating" status
  • Deleted all the stuck pods with kubectl delete pods <pod> --grace-period=0 --force
  • Checked the PVC status again; it was still around, as was the shared disk
  • Added an extra node
  • Drained the node where the pods were stuck
  • Uncordoned the node where the pods were stuck
  • Deleted the PV with kubectl delete pv <pv> --grace-period=0 --force
  • Checked the Azure portal; the shared disk was gone

Cheers
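
In command form, the steps above look roughly like this. This is only a sketch with placeholder names, and note that it is destructive: deleting the PV/PVC discards the data on the disk.

# find the node hosting the stuck pods
kubectl get pods --all-namespaces -o wide | grep ContainerCreating
# force-delete the stuck pods
kubectl -n <namespace> delete pods <pod> --grace-period=0 --force
# add a node, then drain and uncordon the affected one
az aks scale -g <resource_group> -n <cluster_name> --node-count <current_count_plus_one>
kubectl drain <node_name> --ignore-daemonsets --delete-local-data
kubectl uncordon <node_name>
# finally remove the PVC/PV so the Azure disk is released
kubectl -n <namespace> delete pvc <pvc_name>
kubectl delete pv <pv_name> --grace-period=0 --force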

@PrabhuMathi
Copy link

PrabhuMathi commented Jun 22, 2020

I had exactly the same problem, but was able to solve it with the following steps:

  • Deleted the PVC (without force) with no errors
  • Checked the Azure portal; the shared disk was still around
  • Got the name of the node where all the pods were in "ContainerCreating" status
  • Checked the pods; every single pod was in "ContainerCreating" status
  • Deleted all the stuck pods with kubectl delete pods <pod> --grace-period=0 --force
  • Checked the PVC status again; it was still around, as was the shared disk
  • Added an extra node
  • Drained the node where the pods were stuck
  • Uncordoned the node where the pods were stuck
  • Deleted the PV with kubectl delete pv <pv> --grace-period=0 --force
  • Checked the Azure portal; the shared disk was gone

Cheers

Hi, this leads to deletion of the data, right? I'm facing the same issue on version 1.16.9 and need to get the data on those disks back. Can you suggest any other option?

@croblesm
Copy link

I had exactly the same problem, but was able to solve it with the following steps:

  • Deleted the PVC (without force) with no errors
  • Checked the Azure portal; the shared disk was still around
  • Got the name of the node where all the pods were in "ContainerCreating" status
  • Checked the pods; every single pod was in "ContainerCreating" status
  • Deleted all the stuck pods with kubectl delete pods <pod> --grace-period=0 --force
  • Checked the PVC status again; it was still around, as was the shared disk
  • Added an extra node
  • Drained the node where the pods were stuck
  • Uncordoned the node where the pods were stuck
  • Deleted the PV with kubectl delete pv <pv> --grace-period=0 --force
  • Checked the Azure portal; the shared disk was gone

Cheers

Hi, this leads to deletion of the data, right? I'm facing the same issue on version 1.16.9 and need to get the data on those disks back. Can you suggest any other option?

Yes, it will delete all the data in your PVC.

Cheers,

@andyzhangx
Copy link
Contributor

Please provide your k8s version and pod events; the original disk attaching issue on AKS was fixed around Dec. 2019.
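
For example, the version and events can be gathered with (namespace and pod name are placeholders):

kubectl version --short
kubectl -n <namespace> describe pod <pod>    # see the Events section at the bottom
kubectl -n <namespace> get events --sort-by=.metadata.creationTimestamp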

@PrabhuMathi
Copy link

Please provide your k8s version and pod events; the original disk attaching issue on AKS was fixed around Dec. 2019.

Currently running in 1.16.9

@ghost ghost locked as resolved and limited conversation to collaborators Jul 25, 2020