Metadata load error from volume KV during access update for the volume. #1371
The datastore1 type is local.
Yeah, it is a local VMFS datastore.
https://ci.vmware.run/vmware/docker-volume-vsphere/411: changing a readOnly volume to a writable volume also fails intermittently.
/CC @pshahzeb
Seen 3 failures so far today, so setting it to P0.
I also see this error stack trace:
File a separate issue for handling this?
@ashahi1 Yes, this is a separate trace for an unhandled exception in the case of invalid volume option size literals.
@shuklanirdesh82
Another instance of the same failure on CI: https://ci.vmware.run/vmware/docker-volume-vsphere/455
Hit another failure after enabling the log to dump kv_str. This is the CI run
Then the test issues an AdminCLI command to change the volume access from "read-write" to "read-only", which invokes a call stack starting at set_vol_opts -> Then the following exception happens during the call to kvESX.load:
From the log I have, somehow the kv_str read from the meta_file is an empty string. My plan is to add more debug logging to see why the kv_str read from the meta_file is empty, and to merge the code with the additional logging ASAP. Another problem I found is that in the "VolumeAccessTestSuite", after we change the volume access type, we don't have code to verify whether that change was successful, and I think we should add those checks. I will file a separate issue to track this.
Got this error in CI - https://ci.vmware.run/vmware/docker-volume-vsphere/520. The root cause may be outside of the service: the side car reads are returning 0 bytes, for whatever reason, after the VMDK is detached from the VM, and then it's fine again. This may be an ESX-side issue, as there is no indication that the service code is doing anything wrong. Immediately after the error is seen, the side car (KV) again has all the data in it. But the update from read-only to read-write has failed by then, and the VMDK is left read-only, which causes the writes from the test to fail. I'd suggest that if the load fails in kv.load(), then loop around a fixed number of retries before giving up.
The root cause for this behavior is not known, but at least when the issue happens there aren't any parallel accesses from the admin CLI and the service on the same volume. All actions are sequential, and the admin CLI sets RW access on the volume before the service takes the next docker request to attach the volume. This may be a quirk of the side-cars used for the KV, and the retry is basically to tide over the zero-byte reads on the KV.
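To make the retry suggestion concrete, here is a minimal sketch (not the project's actual code; the wrapper name load_with_retry and the retry constants are hypothetical) of a bounded retry around a kv.load()-style call that is assumed to return None or an empty dict on a bad read:

```python
import logging
import time

KV_LOAD_RETRIES = 5        # hypothetical bound on retries
KV_LOAD_RETRY_SLEEP = 0.5  # hypothetical delay in seconds between attempts


def load_with_retry(volpath, kv_load):
    """Call kv_load(volpath) and retry a few times if it returns nothing.

    The intent is only to tide over transient zero-byte reads of the
    side-car meta file; a persistent failure is still reported as None.
    """
    for attempt in range(1, KV_LOAD_RETRIES + 1):
        kv_dict = kv_load(volpath)
        if kv_dict:
            return kv_dict
        logging.warning("Empty KV read for %s (attempt %d/%d), retrying",
                        volpath, attempt, KV_LOAD_RETRIES)
        time.sleep(KV_LOAD_RETRY_SLEEP)
    return None
```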
I am able to reproduce this in my local setup. It looks like it happens after a plugin restart, and I am not sure whether the plugin restart causes this issue or not.
From the log, we can see that the meta file is empty, and therefore the "kv_str" read from the meta file is an empty string. As @govint mentioned above, we should add code to retry if the load fails in kv.load().
In kvESX.py::def load(volpath)
XXX: Can open/read throw an exception other than IOError, e.g. OSError? We would miss it and end up calling json.load with an empty kv_str. Can we change it to cc/ @msterin
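For illustration, a more defensive load() could look roughly like the sketch below (this is not the repository's actual kvESX.py; side-car path handling is omitted and the function is assumed to take the meta file path directly): catch both IOError and OSError, and never hand an empty string to json.loads:

```python
import json
import logging


def load(meta_file):
    """Read the KV side-car file and return its decoded contents, or None."""
    try:
        with open(meta_file, "r") as fh:
            kv_str = fh.read()
    except (IOError, OSError):
        # Catching both keeps an unexpected OSError from escaping the handler.
        logging.exception("Failed to read meta file %s", meta_file)
        return None

    if not kv_str:
        # Don't hand an empty string to json.loads - report the bad read instead.
        logging.error("Empty KV string read from meta file %s", meta_file)
        return None

    try:
        return json.loads(kv_str)
    except ValueError:
        logging.exception("Failed to decode KV JSON from meta file %s", meta_file)
        return None
```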
Hit it again in CI. This is what I found:
We hit this consistently after the plugin restart. Not sure if it is related to #1395, although we don't have a parallel volume create here.
Root-caused to the test not waiting for the docker operation to finish before issuing the admin update command. Confirmed that adding
@pdhamdhere Shouldn't the driver take care of this race condition, i.e. shouldn't the admin update command be blocked in this case until the docker operation is finished? I'm not really familiar with the detailed logic in this module, but I think the metadata file should be protected by a R/W lock - any read operation should be blocked while a write operation is in progress. Please correct me if I'm wrong here.
Theoretically, the underlying FS (VMFS in this case) locking should correctly handle R/W to the same file from multiple processes and should not result in a 0-byte file. Another cause could be the way SideCars are handled. However, this is the rarest of races - the admin updates the status of a volume while the KV is being written - so I didn't spend too much time on it. We should fix the test and move on to higher-priority issues. Makes sense?
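For reference, an explicit interprocess lock on the meta file would look roughly like the following sketch, assuming advisory fcntl locks are honored on the filesystem in question (which is not a given for VMFS); this is not something the service currently does, and the path in the usage comment is made up:

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def meta_file_lock(meta_file, exclusive=True):
    """Hold an advisory flock on the meta file for the duration of the block."""
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    fh = open(meta_file, "a")
    try:
        fcntl.flock(fh, mode)
        yield fh
    finally:
        fcntl.flock(fh, fcntl.LOCK_UN)
        fh.close()


# Hypothetical usage: both the admin CLI and the service would wrap every
# KV load/save in the same lock to serialize access across processes.
# with meta_file_lock("/path/to/volume-meta-file"):
#     ...
```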
@lipingxue But why did the write fail on the KV? Plus, a lock is held on every load and save op on the KV, so even if an admin CLI command is issued on the volume while a detach is in progress, they will both sync on that KV lock to prevent simultaneous access. From the log trace, two parallel ops are loading (locked) the KV and then saving (again locked) it. Both ops run for both threads in parallel, but they do sync on the KV lock and are hence sequenced. @lipingxue can you also share the logs of the write failing?
@govint - what KV lock are you talking about?
Both kv load and save take a lock before executing the op.
Which lock? An in-process Python one? That is irrelevant when the admin CLI and the vmdk_ops service race. I am not aware of any interprocess lock for the KV. Did I miss something?
Correct, it's not a file lock: the kvESX.py load() and save() use a lock decorator, so those functions are synchronized within the ESX service and the admin CLI individually.
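A lock decorator of that kind is roughly the following sketch (not the actual kvESX.py code); the point is that a threading.Lock only serializes callers within a single process, so the ESX service and the admin CLI each hold their own, independent lock:

```python
import threading
from functools import wraps

_kv_lock = threading.Lock()  # one lock per process, not per file


def locked(func):
    """Serialize KV load/save calls within this process only."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        with _kv_lock:
            return func(*args, **kwargs)
    return wrapper


@locked
def save(volpath, kv_dict):
    pass  # write the side-car meta file


@locked
def load(volpath):
    pass  # read the side-car meta file
```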
The test case itself has been fixed, but we still need more discussion about this issue.
@shaominchen, @tusharnt - what do we want to do for this issue? There is a race between an admin CLI user and the service user (via docker). One way is to have the admin CLI stop using the service code and instead post requests to the service and get the responses. There are no file locks used here, so different processes can't safely access the meta-data in parallel.
This sounds like a reasonable solution to me. Do we have an estimate for this approach? Depending on the schedule and the estimate of the fix, we can decide whether we want to fix this for GA or not.
This approach will need a new listener in the ESX service (we can't reuse the VMCI listener in the service) to handle requests from the admin CLI. I'd prefer not to do this for GA, as the use case is fairly remote - a race condition between the admin CLI and docker. In fact, I'd prefer not to address this issue unless a customer hits it, in which case something like #1079 may be usable - a more generic API-based service that can be accessed not only locally but across hosts.
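For a sense of scale, a minimal request/response channel between the admin CLI and the service could look like this sketch (a hypothetical Unix-domain-socket listener with a made-up socket path; it is neither the existing VMCI listener nor #1079):

```python
import json
import os
import socket

SOCK_PATH = "/var/run/vmdk_ops_admin.sock"  # hypothetical socket path


def serve_admin_requests(handle_request):
    """Accept admin CLI requests on a Unix socket and return JSON replies."""
    if os.path.exists(SOCK_PATH):
        os.remove(SOCK_PATH)
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(SOCK_PATH)
    server.listen(1)
    while True:
        conn, _ = server.accept()
        try:
            request = json.loads(conn.recv(65536).decode())
            # handle_request runs inside the service process, so it is
            # serialized with docker-driven ops by the same in-process locks.
            reply = handle_request(request)
            conn.sendall(json.dumps(reply).encode())
        finally:
            conn.close()
```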
Closing this issue. The approach was to change the way the admin CLI shares the ESX service code, but I don't see that changing any more, and definitely not for addressing the scenario here.
While updating the volume access from read-write (default) to read-only through the automated volume access test, the update fails.
The access is updated using the admin CLI.
The exact logs for the failure are: