This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

Handle stale attach code #1260

Merged
merged 3 commits into master from handle-stale-attach-code on Jun 23, 2017

Conversation

@govint (Contributor) commented May 16, 2017

Removed the code in handle_stale_attach() that resets the KV if the VM is powered off. This code is no longer relevant now that the service handles VM events.
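For context, a minimal sketch of the kind of check handle_stale_attach() performed and that this PR removes. The names VM_POWERED_OFF, reset_vol_meta and cur_vm.runtime.powerState come from snippets quoted later in this thread; find_vm_by_uuid and the "poweredOff" value are assumptions, and this is not the actual diff:

    VM_POWERED_OFF = "poweredOff"   # assumed value of the constant used below

    def handle_stale_attach_sketch(vmdk_path, kv_uuid, find_vm_by_uuid, reset_vol_meta):
        """Reset the volume KV when the owning VM is found powered off.
        find_vm_by_uuid and reset_vol_meta stand in for the service helpers."""
        cur_vm = find_vm_by_uuid(kv_uuid)        # None if the VM is not on this host
        if cur_vm and cur_vm.runtime.powerState == VM_POWERED_OFF:
            return reset_vol_meta(vmdk_path)     # clear the stale "attached" state
        return None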

@@ -1286,6 +1286,7 @@ def config_init(args):
    auth = auth_data.AuthorizationDataManager(db_path)
    err = auth.new_db()
    if err:
        os.remove(db_path)
Contributor

This seems to be the same change as PR #1259?

@govint (Contributor, Author) May 16, 2017

Yes, I made a branch in the same source tree that had this change for #1259, and it pulled in the changes from #1259 as well.

@shaominchen (Contributor) left a comment

Please remove the change in vmdkops_admin.py as it's a duplicate of PR #1259.

    if ret:
        return ret
    else:
        if cur_vm.runtime.powerState != VM_POWERED_OFF:
Contributor

Line 1144 isn't going to return the VM if it is on another host (in essence, if the disk is attached to a VM on another host)?

Contributor Author

Correct, this code can wrongly reset a volume's KV if the volume is accessed across hosts. The ESX service isn't cluster-aware, and the right way to fix this is to have cluster-aware plugins so volumes are used appropriately.
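To make the cross-host point concrete, a minimal pyVmomi sketch, assuming a connection to the local hostd (the credentials and UUID are placeholders):

    from pyVim.connect import SmartConnect, Disconnect

    kv_uuid = "00000000-0000-0000-0000-000000000000"   # placeholder VM UUID from the KV
    si = SmartConnect(host="localhost", user="root", pwd="***")  # local ESX host only
    try:
        # FindByUuid searches only the inventory visible to this connection,
        # i.e. the local host when connected to hostd.
        vm = si.content.searchIndex.FindByUuid(None, kv_uuid, True)
        # vm is None when the VM holding the disk runs on another host, so
        # "VM not found" must not be treated as "safe to reset the KV".
    finally:
        Disconnect(si)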

@pdhamdhere (Contributor) left a comment

Why not remove handle_stale_attach entirely?

@govint (Contributor, Author) commented May 18, 2017 via email

@pdhamdhere (Contributor)

The check for the volume attached to a live VM is needed

Why? There are two places where handle_stale_attach is called, and I don't see a reason why that function should exist. In both cases, we check the KV status and return an error if the volume is attached.

In removeVMDK:

    kv_status_attached, kv_uuid, attach_mode = getStatusAttached(vmdk_path)
    if kv_status_attached:
        # ret = handle_stale_attach(vmdk_path, kv_uuid)
        logging.info("*** removeVMDK: %s is in use, VM uuid = %s (%s)", vmdk_path, kv_uuid, ret)
        return err("Failed to remove volume {0}, in use by VM uuid = {1}.".format(
            vmdk_path, kv_uuid))

In cloneVMDK:

    attached, uuid, attach_as = getStatusAttached(src_vmdk_path)
    if attached:
        # if handle_stale_attach(vmdk_path, uuid):
        return err("Source volume cannot be in use when cloning")

and in a cluster scenario the service doesn't assume the other node is running a version
that handles hostd events

Do you mean some ESX hosts running version 0.13 or earlier and some running 0.14 or later? Firstly, I don't see a reason to optimize code for this corner case. Secondly, I don't understand how keeping handle_stale_attach in the upcoming release helps.

@govint (Contributor, Author) commented May 18, 2017

If the KV says it's attached, we treat it as attached no matter what the actual status of the volume is. What happens if the event handler misses an event for some reason (host crash, service crash)? Do we go with what the KV has in it, or verify the actual status of the VM from the platform? The KV is more or less an indication that we should check the actual status of the VM from the platform.

a) Either leave the function as-is, as a backup for cases where the event handler has missed an event or never gets one; the function then resets the KV so the volume remains usable.

b) Or remove the function at all three call sites (attach, remove, clone) and retain the logging at those sites (see the sketch below).
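A minimal sketch of option (b) at the removeVMDK call site, using names from the snippets quoted earlier in this thread (illustrative only, not the final diff; the helpers are passed in to keep the sketch self-contained):

    import logging

    def remove_vmdk_sketch(vmdk_path, getStatusAttached, err):
        """No handle_stale_attach(): just log the conflict and fail the request."""
        kv_status_attached, kv_uuid, attach_mode = getStatusAttached(vmdk_path)
        if kv_status_attached:
            logging.info("*** removeVMDK: %s is in use, VM uuid = %s", vmdk_path, kv_uuid)
            return err("Failed to remove volume {0}, in use by VM uuid = {1}.".format(
                vmdk_path, kv_uuid))
        return None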

@pdhamdhere (Contributor)

Leaving the function as-is doesn't help, since it only works if the VM using the volume is on the same ESX host.

Good point on not failing the request based on KV status. So we should proceed with remove and clone and return an appropriate error?

@govint (Contributor, Author) commented May 22, 2017

Sorry, missed this over the weekend. Yes, I'd say attempt the removal/clone/detach and handle the failure. I'll post the change along with the tests for that.

@govint (Contributor, Author) commented May 29, 2017

Testing, volume is already in use:

$ docker volume inspect test-attached
[
    {
        "Driver": "vsphere:latest",
        "Labels": {},
        "Mountpoint": "/mnt/vmdk/test-attached",
        "Name": "test-attached",
        "Options": {},
        "Scope": "global",
        "Status": {
            "access": "read-write",
            "attach-as": "independent_persistent",
            "attached to VM": "2",
            "attachedVMDevice": {
                "ControllerPciSlotNumber": "224",
                "Unit": "0"
            },
            "capacity": {
                "allocated": "15MB",
                "size": "100MB"
            },
            "clone-from": "None",
            "created": "Mon May 29 10:06:22 2017",
            "created by VM": "2",
            "datastore": "bigone",
            "diskformat": "thin",
            "fstype": "ext4",
            "status": "attached"
        }
    }
]
1. Clone of an in-use volume on the ESX host:
docker volume create --driver=vsphere --name=test-attached-clone -o clone-from=test-attached
Error response from daemon: create test-attached-clone: VolumeDriver.Create: Failed to clone volume: Unable to access file [bigone] dockvols/11111111-1111-1111-1111-111111111111/test-attached.vmdk since it is locked

ESX vmdk_ops.log
05/29/17 10:13:33 2410210 [2-bigone._DEFAULT.test-attached-clone] [WARNING] Disk /vmfs/volumes/bigone/dockvols/11111111-1111-1111-1111-111111111111/test-attached-clone.vmdk already attached to VM 2

2. Remove a volume in use on the node:
docker volume rm test-attached
Error response from daemon: unable to remove volume: remove test-attached: volume is in use - [ca783cfd2969c425cded3cd07323d936f0d3004fa9d9954250ece119ac50ff08]
3. Attach a volume already attached to the docker host:
$ docker run -it --volume-driver=vsphere -v test-attached:/vol1 --name ub1 ubuntu
root@b260255e3518:/#
root@b260255e3518:/#

@govint (Contributor, Author) commented May 29, 2017

The fix doesn't reset the volume KV just because we don't find the volume on the local node. Using the same volume on multiple nodes is disallowed, and if a volume is in use on one node it's locked and can't be used on another node.

logging.info("*** removeVMDK: %s is in use, volume = %s VM = %s VM-uuid = %s (%s)",
vmdk_path, vol_name, attached_vm_name, kv_uuid, ret)
return err("Failed to remove volume {0}, in use by VM = {1}.".format(vol_name, attached_vm_name))
log_attached_volume(vmdk_path, kv_uuid)
Contributor

Do we really need log_attached_volume here and in other places? I can understand using it on the error path to publish additional information, but it seems unnecessary on the regular code path.

Contributor Author

Yes, log_attached_volume() is needed because this isn't expected in the code paths where it's called.

ret = reset_vol_meta(vmdk_path)
if ret:
    return ret
msg = "Failed to find VM (id {0}), disk {1} is already attached".format(kv_uuid,
Contributor

We should publish the VM name from the KV if available.

Contributor Author

I checked the code again, and while the KV does/can give the VM name, log_attached_volume() actually checks the UUID and, if the VM is found, logs the VM name. If the UUID isn't resolved to a VM, it would be incorrect to print the name from the KV: the VM may have been renamed (in which case we get the current name), or it may have been moved to another node (and the service is local to the ESX host). The proposed changes handle printing the VM name as appropriate.
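Roughly, the resolution order being described; find_vm_name_by_uuid is a hypothetical stand-in for the platform lookup that log_attached_volume() performs:

    def attached_vm_display_name(kv_uuid, kv_vm_name, find_vm_name_by_uuid):
        """Prefer the live name resolved from the UUID; the KV name is only a
        fallback and may be stale (VM renamed, or moved to another host)."""
        current_name = find_vm_name_by_uuid(kv_uuid)   # None if not resolvable here
        if current_name:
            return current_name     # current name, even if the VM was renamed
        return kv_vm_name           # best effort, as discussed below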

Contributor

Agreed on this reasoning. However, renaming the VM is a rare scenario and I think we can still go ahead with printing the VM name we have in the KV.

@govint (Contributor, Author) commented Jun 7, 2017

The Swarm test seems to have timed out on a poll and errored out; otherwise the build is good.

FAIL: swarm_test.go:91: SwarmTestSuite.TestDockerSwarm

swarm_test.go:141:
c.Assert(status, Equals, true, Commentf("Volume %s is still attached", s.volumeName))
... obtained bool = false
... expected bool = true
... Volume swarm_test_volume_1496839264 is still attached

@govint (Contributor, Author) commented Jun 9, 2017

CI failed with the issue in #1372

@govint (Contributor, Author) commented Jun 12, 2017

Depends on #1368 to handle the swarm test error

@pshahzeb (Contributor) left a comment

Apologies for the late review. Let's wrap this up. I have a comment on the behavior; let's address it and merge this.

@govint (Contributor, Author) commented Jun 14, 2017

CI is in progress; will wrap this up once it's complete.

@@ -92,6 +93,9 @@ func (s *BasicTestSuite) TestVolumeLifecycle(c *C) {
status := verification.VerifyAttachedStatus(s.volName1, host, s.esx)
c.Assert(status, Equals, true, Commentf("Volume %s is not attached", s.volName1))

out, err = dockercli.DeleteVolume(host, s.volName1)
c.Assert(err, Not(IsNil), Commentf(out))
Contributor

Nit: We can use NotNil instead of Not(IsNil)

vol_name = vmdk_utils.get_volname_from_vmdk_path(vmdk_path)
logging.info("*** removeVMDK: %s is in use, volume = %s VM = %s VM-uuid = %s",
             vmdk_path, vol_name, attached_vm_name, kv_uuid)
return err("Failed to remove volume {0}, in use by VM = {1}.".format(vol_name, attached_vm_name))
Contributor

AFAICT we are not relying on data in the KV for any operation; this would be an exception.

If I understand correctly, this is a workaround for a DiskLib issue. Rather than adding the workaround here, I would propose following the solution proposed in the original bug: read the KV into memory, delete the file, and if the delete fails, write the KV back to disk.
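A minimal sketch of that flow, treating the KV sidecar as a plain JSON file purely for illustration (the real sidecar format and the DiskLib delete path differ; delete_vmdk is a hypothetical stand-in):

    import json

    def remove_vmdk_with_kv_rollback(vmdk_path, kv_path, delete_vmdk):
        """Read the KV into memory, delete the disk, and restore the KV on
        disk if the delete fails."""
        with open(kv_path) as f:
            saved_kv = json.load(f)          # keep the metadata in memory
        try:
            delete_vmdk(vmdk_path)           # may fail, e.g. disk still attached
        except Exception:
            with open(kv_path, "w") as f:
                json.dump(saved_kv, f)       # write the KV back to disk
            raise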

Contributor Author

This is not feasible, because once the KV is deleted due to the DiskLib issue, a new KV can't be created: the disk can't be opened with the flags needed to create the sidecar. A sidecar can't be created for a disk that is in use (the disk is still attached to some VM). I tried this out and confirmed it.

@govint (Contributor, Author) commented Jun 15, 2017

CI is failing due to #1371.

@pdhamdhere (Contributor)

@govint Please triage CI failure and re-push if needed. This PR has been open for a while.

@govint (Contributor, Author) commented Jun 21, 2017

@pdhamdhere the CI failed with #1371. Can I merge this change, since it's not my change that's causing the issue?

@pdhamdhere (Contributor)

@pdhamdhere the CI failed with #1371. Can I merge this change, since it's not my change that's causing the issue?

The test that was causing #1371 is fixed. We should get a green CI.

@govint (Contributor, Author) commented Jun 22, 2017

Green CI.

@govint merged commit 17fd534 into master on Jun 23, 2017
@govint deleted the handle-stale-attach-code branch on June 23, 2017 at 04:08
@govint (Contributor, Author) commented Jun 23, 2017

Closing.
