This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

Metadata load error from volume KV during access update for the volume. #1371

Closed
pshahzeb opened this issue Jun 8, 2017 · 28 comments

@pshahzeb
Contributor

pshahzeb commented Jun 8, 2017

While updating a volume's access from read-write (the default) to read-only through the automated volume access test, the update fails.
The access is updated using the admin CLI.

The exact log for the failure is:

06/08/17 00:42:37 73825 [MainThread] [ERROR  ] Failed to decode meta-data for /vmfs/volumes/datastore1/dockvols/11111111-1111-1111-1111-111111111111/vol_access_volume_1496882546.vmdk
Traceback (most recent call last):
  File "/usr/lib/vmware/vmdkops/Python/kvESX.py", line 280, in load
    return json.loads(kv_str)
  File "/build/mts/release/bora-4564106/bora/build/esx/release/vmvisor/sys-boot/lib64/python3.5/json/__init__.py", line 319, in loads
  File "/build/mts/release/bora-4564106/bora/build/esx/release/vmvisor/sys-boot/lib64/python3.5/json/decoder.py", line 339, in decode
  File "/build/mts/release/bora-4564106/bora/build/esx/release/vmvisor/sys-boot/lib64/python3.5/json/decoder.py", line 357, in raw_decode
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
@pshahzeb pshahzeb added this to the 0.16 milestone Jun 8, 2017
@pshahzeb
Contributor Author

pshahzeb commented Jun 8, 2017

The datastore1 type is local.
Can you please confirm, @shuklanirdesh82?

@shuklanirdesh82
Contributor

Yeah, it is a local VMFS datastore.

@shuklanirdesh82
Contributor

https://ci.vmware.run/vmware/docker-volume-vsphere/411

Changing a read-only volume to a writable volume also fails intermittently.

2017/06/08 06:31:32 START:vol_access TestAccessUpdate_R_RW Thu Jun  8 06:31:32 UTC 2017
2017/06/08 06:31:32 Creating volume [vol_access_volume_1496903492@datastore1] with options [ -o access=read-only] on VM [192.168.31.81]
2017/06/08 06:31:37 Writing message_by_host1 to file test.txt on volume [vol_access_volume_1496903492@datastore1] from VM[192.168.31.81]
2017/06/08 06:31:38 Failed to invoke command [docker run  -v vol_access_volume_1496903492@datastore1:/vol1 --name vol_access_container_1496903497 busybox  /bin/sh -c 'echo "message_by_host1" > /vol1/test.txt']: exit status 1
2017/06/08 06:31:38 Writing message_by_host2 to file test.txt on volume [vol_access_volume_1496903492@datastore1] from VM[192.168.31.85]
2017/06/08 06:31:39 Failed to invoke command [docker run  -v vol_access_volume_1496903492@datastore1:/vol1 --name vol_access_container_1496903498 busybox  /bin/sh -c 'echo "message_by_host2" > /vol1/test.txt']: exit status 1
2017/06/08 06:31:39 Updating access to [read-write] for volume [vol_access_volume_1496903492@datastore1] 
2017/06/08 06:31:40 Writing message_by_host1 to file test.txt on volume [vol_access_volume_1496903492@datastore1] from VM[192.168.31.81]
2017/06/08 06:31:42 Failed to invoke command [docker run  -v vol_access_volume_1496903492@datastore1:/vol1 --name vol_access_container_1496903500 busybox  /bin/sh -c 'echo "message_by_host1" > /vol1/test.txt']: exit status 1
2017/06/08 06:31:42 Removing container [vol_access_container_1496903497 vol_access_container_1496903500] on VM [192.168.31.81]
2017/06/08 06:31:42 Removing container [vol_access_container_1496903498] on VM [192.168.31.85]
2017/06/08 06:31:43 Destroying volume [vol_access_volume_1496903492@datastore1]

----------------------------------------------------------------------
FAIL: volume_access_test.go:155: VolumeAccessTestSuite.TestAccessUpdate_R_RW

volume_access_test.go:176:
    c.Assert(err, IsNil, Commentf(out))
... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc4202bc680), Stderr:[]uint8(nil)} ("exit status 1")
... /bin/sh: can't create /vol1/test.txt: Read-only file system

/CC @pshahzeb

@shuklanirdesh82
Contributor

Seen 3 failures so far today, so setting it to P0.

@shuklanirdesh82 shuklanirdesh82 added P0 and removed P1 labels Jun 8, 2017
@ashahi1
Contributor

ashahi1 commented Jun 8, 2017

I also see this error stack trace:

{"Name":"option_volume_26222426471854123","Opts":{"fstype":"ext4","size":"100gbEE"}},"version":"2"}')
06/08/17 06:31:45 74553 [Thread-90] [ERROR  ] Unhandled Exception:
Traceback (most recent call last):
  File "/usr/lib/vmware/vmdkops/bin/vmdk_ops.py", line 1596, in execRequestThread
    opts=opts)
  File "/usr/lib/vmware/vmdkops/bin/vmdk_ops.py", line 899, in executeRequest
    error_info = authorize_check(vm_uuid, datastore_url, cmd, opts, use_default_ds, datastore, vm_datastore)
  File "/usr/lib/vmware/vmdkops/bin/vmdk_ops.py", line 809, in authorize_check
    error_info, tenant_uuid, tenant_name = auth.authorize(vm_uuid, datastore_url, cmd, opts)
  File "/usr/lib/vmware/vmdkops/Python/auth.py", line 454, in authorize
    result = check_privileges_for_command(cmd, opts, tenant_uuid, datastore_url, privileges)
  File "/usr/lib/vmware/vmdkops/Python/auth.py", line 342, in check_privileges_for_command
    if not check_max_volume_size(opts, privileges):
  File "/usr/lib/vmware/vmdkops/Python/auth.py", line 258, in check_max_volume_size
    vol_size_in_MB = convert.convert_to_MB(get_vol_size(opts))
  File "/usr/lib/vmware/vmdkops/Python/convert.py", line 27, in convert_to_MB
    value = int(vol_size_str[:-2])
ValueError: invalid literal for int() with base 10: '100GB'

Should I file a separate issue for handling this?

@pshahzeb
Contributor Author

pshahzeb commented Jun 8, 2017

@ashahi1 Yes, this is a separate trace for an unhandled exception in the case of invalid volume-option size literals.

@pshahzeb
Contributor Author

pshahzeb commented Jun 8, 2017

@shuklanirdesh82
Yes, the reason is the same as mentioned in the description.
Confirmed with the logs of CI build 411.

@ashahi1
Contributor

ashahi1 commented Jun 8, 2017

@pshahzeb Thanks! Filed a separate issue #1375

@ashahi1
Contributor

ashahi1 commented Jun 10, 2017

Another instance of the same failure on CI: https://ci.vmware.run/vmware/docker-volume-vsphere/455

2017/06/10 20:07:45 Destroying volume [vol_access_volume_1ofswas2i1n91@datastore1]

----------------------------------------------------------------------
FAIL: volume_access_test.go:155: VolumeAccessTestSuite.TestAccessUpdate_R_RW

volume_access_test.go:176:
    c.Assert(err, IsNil, Commentf(out))
... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc4202e57e0), Stderr:[]uint8(nil)} ("exit status 1")
... /bin/sh: can't create /vol1/test.txt: Read-only file system

@lipingxue
Contributor

lipingxue commented Jun 13, 2017

Hit another failure after enabling the log to dump kv_str.

This is the CI run

#1371

2017/06/13 07:12:24 START: VolumeAccessTestSuite.TestAccessUpdate
2017/06/13 07:12:24 Creating volume [vol_access_volume_l33x03nwf9uh@datastore1] on VM [192.168.31.123]
2017/06/13 07:12:28 Writing message_by_host1 to file test.txt on volume [vol_access_volume_l33x03nwf9uh@datastore1] from VM[192.168.31.123]
2017/06/13 07:12:29 Reading from file test.txt on volume [vol_access_volume_l33x03nwf9uh@datastore1] from VM[192.168.31.83]
2017/06/13 07:12:31 Writing message_by_host2 to file test.txt on volume [vol_access_volume_l33x03nwf9uh@datastore1] from VM[192.168.31.83]
2017/06/13 07:12:33 Reading from file test.txt on volume [vol_access_volume_l33x03nwf9uh@datastore1] from VM[192.168.31.123]
2017/06/13 07:12:35 Updating access to [read-only] for volume  --> !!! This update to change the volume to read-only does not succeed; afterwards, the volume stays "read-write", so the following write operation to this volume does NOT fail as expected !!!
 [vol_access_volume_l33x03nwf9uh@datastore1] 
2017/06/13 07:12:36 Writing message_by_host1 to file test.txt on volume [vol_access_volume_l33x03nwf9uh@datastore1] from VM[192.168.31.123]
2017/06/13 07:12:37 Removing container [vol_access_container_103q2m5iop6h0 vol_access_container_qoa5dmk31kno vol_access_container_mad3diqfld6z] on VM [192.168.31.123]
2017/06/13 07:12:38 Removing container [vol_access_container_1l0kco5jhjsx3 vol_access_container_f0iszigydzxy] on VM [192.168.31.83]
2017/06/13 07:12:38 Destroying volume [vol_access_volume_l33x03nwf9uh@datastore1]

----------------------------------------------------------------------
FAIL: volume_access_test.go:91: VolumeAccessTestSuite.TestAccessUpdate

volume_access_test.go:120:
    c.Assert(strings.Contains(out, errorWriteVolume), Equals, true, Commentf(out))
... obtained bool = false
... expected bool = true
... 

Then the test issues the admin CLI command to change the volume access from "read-write" to "read-only", which causes the following call stack:

set_vol_opts ->
vmdk_ops.set_vol_opts ->
kv.getAll ->
kvESX.load(vol_path)

Then the following exception happens during the call to kvESX.load:

06/13/17 07:12:36 81054 [MainThread] [WARNING] kv_str from meta file is  
06/13/17 07:12:36 81054 [MainThread] [ERROR  ] Failed to decode meta-data for /vmfs/volumes/datastore1/dockvols/11111111-1111-1111-1111-111111111111/vol_access_volume_l33x03nwf9uh.vmdk
Traceback (most recent call last):
  File "/usr/lib/vmware/vmdkops/Python/kvESX.py", line 280, in load
    return json.loads(kv_str)
  File "/build/mts/release/bora-4564106/bora/build/esx/release/vmvisor/sys-boot/lib64/python3.5/json/__init__.py", line 319, in loads
  File "/build/mts/release/bora-4564106/bora/build/esx/release/vmvisor/sys-boot/lib64/python3.5/json/decoder.py", line 339, in decode
  File "/build/mts/release/bora-4564106/bora/build/esx/release/vmvisor/sys-boot/lib64/python3.5/json/decoder.py", line 357, in raw_decode
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

From the log I have, somehow the kv_str read from the meta file is an empty string.

My plan is to add more debug logging to see why the kv_str read from the meta file is an empty string, and to merge the code with the additional logging ASAP.

Another problem I found is that in "VolumeAccessTestSuite", after we change the volume access type, we don't have code to verify whether the change succeeded; I think we should add those checks. I will file a separate issue to track this.

@govint
Contributor

govint commented Jun 14, 2017

Got this error in CI - https://ci.vmware.run/vmware/docker-volume-vsphere/520. The root cause may be outside of the service: the sidecar reads are returning 0 bytes, for whatever reason, after the VMDK is detached from the VM. This may be an ESX-side issue, as there is no indication that the service code is doing anything wrong.

Immediately after the error is seen, the sidecar (KV) again has all the data in it. But the update from read-only to read-write has failed by then, leaving the VMDK read-only and causing the writes from the test to fail.

I'd suggest that if the load fails in kv.load(), we loop for a fixed number of retries before giving up.
06/14/17 12:26:17 74395 [Ubuntu.14.04-vsanDatastore._DEFAULT.vol_access_volume_l33x03nwf9uh] [INFO ] Disk detached /vmfs/volumes/datastore1/dockvols/_DEFAULT/vol_access_volume_l33x03nwf9uh.vmdk
06/14/17 12:26:17 74395 [Ubuntu.14.04-vsanDatastore._DEFAULT.vol_access_volume_l33x03nwf9uh] [INFO ] executeRequest 'detach' completed with ret=None
06/14/17 12:26:17 81711 [MainThread] [ERROR ] Failed to decode meta-data for /vmfs/volumes/datastore1/dockvols/_DEFAULT/vol_access_volume_l33x03nwf9uh.vmdk

Then it's fine again:
06/14/17 12:26:18 74395 [MainThread] [INFO ] Started new thread : 843516368640 with target <function execRequestThread at 0xc4634f5378> and args (9, 80923, b'{"cmd":"get","details":{"Name":"vol_access_volume_l33x03nwf9uh@datastore1"},"version":"2"}')
06/14/17 12:26:18 74395 [Thread-383] [INFO ] Auth DB /etc/vmware/vmdkops/auth-db is missing, allowing all access
06/14/17 12:26:18 74395 [photon.vsan-vsanDatastore._DEFAULT.vol_access_volume_l33x03nwf9uh] [INFO ] executeRequest 'get' completed with ret={'access': 'read-only', 'created by VM': 'photon.vsan', 'diskformat': 'thin', 'datastore': 'datastore1', 'clone-from': 'None', 'capacity': {'allocated': '13MB', 'size': '100MB'}, 'attach-as': 'independent_persistent', 'fstype': 'ext4', 'status': 'detached', 'created': 'Wed Jun 14 12:25:52 2017'}

@govint
Contributor

govint commented Jun 14, 2017

The root cause of this behavior is not known, but at least when the issue happens there aren't any parallel accesses to the same volume from the admin CLI and the service. All actions are sequential, and the admin CLI sets RW access on the volume before the service takes the next Docker request to attach the volume. This may be a quirk of the sidecars used for the KV, and the retry is basically to tide over the zero-byte reads on the KV.

@lipingxue
Contributor

I am able to reproduce this in my local setup. It looks like it happens after a plugin restart, though I am not sure whether the plugin restart causes this issue or not.

06/14/17 20:20:16 2044632 [MainThread] [INFO   ] Connecting to the local Service Instance as 'dcui'                                                                                                                                                                           
06/14/17 20:20:17 2042668 [MainThread] [INFO   ] Started new thread : 513303369472 with target <function execRequestThread at 0x7781a27488> and args (10, 119056, b'{"cmd":"get","details":{"Name":"BasicTestSuite.TestVolumeLifecycle_volume_4lkyhppi2lni"},"version":"2"}')
06/14/17 20:20:17 2042668 [Thread-143] [INFO   ] Auth DB /etc/vmware/vmdkops/auth-db is missing, allowing all access                                                                                                                                                          
06/14/17 20:20:17 2042668 [ubuntu-VM0.0-sharedVmfs-0._DEFAULT.BasicTestSuite.TestVolumeLifecycle_volume_4lkyhppi2lni] [INFO   ] Load dictionary from sidecar for vol path /vmfs/volumes/sharedVmfs-0/dockvols/11111111-1111-1111-1111-111111111111/BasicTestSuite.TestVolumeLifecycle_volume_4lkyhppi2lni.vmdk
06/14/17 20:20:17 2042668 [ubuntu-VM0.0-sharedVmfs-0._DEFAULT.BasicTestSuite.TestVolumeLifecycle_volume_4lkyhppi2lni] [INFO   ] Meta file b'/vmfs/volumes/sharedVmfs-0/dockvols/11111111-1111-1111-1111-111111111111/BasicTestSuite.TestVolumeLifecycle_volume_4lkyhppi2lni-832b27f2e7734f08.vmfd' is empty      ----> The meta data file is EMPTY!!!!
06/14/17 20:20:17 2042668 [ubuntu-VM0.0-sharedVmfs-0._DEFAULT.BasicTestSuite.TestVolumeLifecycle_volume_4lkyhppi2lni] [WARNING] load: Kv_str  reading from meta file b'/vmfs/volumes/sharedVmfs-0/dockvols/11111111-1111-1111-1111-111111111111/BasicTestSuite.TestVolumeLifecycle_volume_4lkyhppi2lni-832b27f2e7734f08.vmfd' is not in JSON format
06/14/17 20:20:17 2042668 [ubuntu-VM0.0-sharedVmfs-0._DEFAULT.BasicTestSuite.TestVolumeLifecycle_volume_4lkyhppi2lni] [ERROR  ] load:Failed to decode meta-data for /vmfs/volumes/sharedVmfs-0/dockvols/11111111-1111-1111-1111-111111111111/BasicTestSuite.TestVolumeLifecycle_volume_4lkyhppi2lni.vmdk
Traceback (most recent call last):
  File "/usr/lib/vmware/vmdkops/Python/kvESX.py", line 291, in load
    return json.loads(kv_str)
  File "/build/mts/release/bora-4564106/bora/build/esx/release/vmvisor/sys-boot/lib64/python3.5/json/__init__.py", line 319, in loads
  File "/build/mts/release/bora-4564106/bora/build/esx/release/vmvisor/sys-boot/lib64/python3.5/json/decoder.py", line 339, in decode
  File "/build/mts/release/bora-4564106/bora/build/esx/release/vmvisor/sys-boot/lib64/python3.5/json/decoder.py", line 357, in raw_decode
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
06/14/17 20:20:17 2042668 [ubuntu-VM0.0-sharedVmfs-0._DEFAULT.BasicTestSuite.TestVolumeLifecycle_volume_4lkyhppi2lni] [ERROR  ] Failed to get disk details for /vmfs/volumes/sharedVmfs-0/dockvols/11111111-1111-1111-1111-111111111111/BasicTestSuite.TestVolumeLifecycle_volume_4lkyhppi2lni.vmdk ('No
06/14/17 20:20:17 2042668 [ubuntu-VM0.0-sharedVmfs-0._DEFAULT.BasicTestSuite.TestVolumeLifecycle_volume_4lkyhppi2lni] [INFO   ] executeRequest 'get' completed with ret=None

From the log, we can see that the meta file is empty, and therefore the "kv_str" read from the meta file is an empty string.

As @govint mentioned above, we should add code to retry if the load fails in kv.load().
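
A rough sketch of what such a retry around the kv load path could look like; the retry count, sleep interval, and helper name here are made up for illustration and are not the actual kvESX.py code:

    import json
    import logging
    import time

    META_LOAD_RETRIES = 5      # hypothetical: attempts before giving up
    META_LOAD_RETRY_SLEEP = 1  # hypothetical: seconds between attempts

    def load_with_retry(meta_file):
        """Read the sidecar meta file and decode its JSON, retrying on
        empty or undecodable content instead of failing on the first try."""
        for attempt in range(META_LOAD_RETRIES):
            try:
                with open(meta_file, "r") as fh:
                    kv_str = fh.read()
                if kv_str.strip():
                    return json.loads(kv_str)
                logging.warning("Meta file %s is empty (attempt %d/%d), retrying",
                                meta_file, attempt + 1, META_LOAD_RETRIES)
            except (IOError, OSError, ValueError) as err:
                logging.warning("Failed to load %s (attempt %d/%d): %s",
                                meta_file, attempt + 1, META_LOAD_RETRIES, err)
            time.sleep(META_LOAD_RETRY_SLEEP)
        logging.error("Giving up loading meta-data from %s after %d attempts",
                      meta_file, META_LOAD_RETRIES)
        return None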

@pdhamdhere
Contributor

pdhamdhere commented Jun 15, 2017

In kvESX.py::def load(volpath)

        try:
            with open(meta_file, "r") as fh:
                kv_str = fh.read()
            break
        except IOError as open_error:  <<<<<<<< XXX
                .....
                return None

    try:
        return json.loads(kv_str)

XXX: Can open/read throw an exception other than IOError, e.g. OSError? We would miss it and end up calling json.loads with an empty kv_str. Can we change it to except Exception and try to get a repro?
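
For illustration, a minimal sketch of that broader catch in the read step (a hypothetical helper, not the current kvESX.py code), so a failed read is reported instead of an empty kv_str silently reaching json.loads():

    import json
    import logging

    def read_kv_str(meta_file):
        """Read the meta file, catching any exception from open()/read()
        (not only IOError), per the XXX note above."""
        try:
            with open(meta_file, "r") as fh:
                return fh.read()
        except Exception:                  # broader than IOError
            logging.exception("Failed to read meta file %s", meta_file)
            return None

    # usage sketch (path is illustrative):
    kv_str = read_kv_str("/vmfs/volumes/datastore1/dockvols/_DEFAULT/example.vmfd")
    meta = json.loads(kv_str) if kv_str else None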

cc/ @msterin

@lipingxue
Contributor

Hit it again in CI
https://ci.vmware.run/vmware/docker-volume-vsphere/565

This is what I found:

  1. A detach request is issued but not completed.
  2. The plugin restarts (I checked the test; the restart is not triggered by the test).
  3. After the plugin restart, we get the request to change the volume attribute from "read_only" to "read_write". set_vol_opts is called, and then kvESX.save() is called to update the meta-data. This write FAILED to write the new meta-data to the .vmfd file and leaves the .vmfd file empty.

We hit this consistently after the plugin restart. Not sure if it is related to #1395, although we don't have a parallel volume create here.

06/16/17 07:31:10 48336 [MainThread] [INFO   ] Started new thread : 127867760 with target <function execRequestThread at 0x26dc8534> and args (11, 48994, '{"cmd":"detach","details":{"Name":"vol_access_volume_e8ktpto5vl0h@datastore1"},"version":"2"}')
06/16/17 07:31:10 48336 [Thread-113] [INFO   ] Auth DB /etc/vmware/vmdkops/auth-db is missing, allowing all access
06/16/17 07:31:10 48336 [Ubuntu.1404.vmfs-datastore1._DEFAULT.vol_access_volume_e8ktpto5vl0h] [INFO   ] *** detachVMDK: /vmfs/volumes/datastore1/dockvols/_DEFAULT/vol_access_volume_e8ktpto5vl0h.vmdk from Ubuntu.1404.vmfs VM uuid = 564d370b-55b3-385a-4590-919ed4c564ab    --> !!!detachVMDK does not complete, and then a plugin restart happens!!!
06/16/17 07:31:10 49423 [MainThread] [INFO   ] Connecting to the local Service Instance as 'dcui' 
06/16/17 07:31:10 49423 [MainThread] [INFO   ] set_vol_opts: name=vol_access_volume_e8ktpto5vl0h@datastore1 options=access=read-write vol_name=vol_access_volume_e8ktpto5vl0h, datastore=datastore1
06/16/17 07:31:10 49423 [MainThread] [INFO   ] Auth DB /etc/vmware/vmdkops/auth-db is missing, allowing all access
06/16/17 07:31:10 48336 [Ubuntu.1404.vmfs-datastore1._DEFAULT.vol_access_volume_e8ktpto5vl0h] [INFO   ] Load dictionary from sidecar for vol path /vmfs/volumes/datastore1/dockvols/_DEFAULT/vol_access_volume_e8ktpto5vl0h.vmdk
06/16/17 07:31:10 49423 [MainThread] [INFO   ] set_vol_opts: path=/vmfs/volumes/datastore1/dockvols/_DEFAULT vmdk_path=/vmfs/volumes/datastore1/dockvols/_DEFAULT/vol_access_volume_e8ktpto5vl0h.vmdk
06/16/17 07:31:10 49423 [MainThread] [INFO   ] Load dictionary from sidecar for vol path /vmfs/volumes/datastore1/dockvols/_DEFAULT/vol_access_volume_e8ktpto5vl0h.vmdk
06/16/17 07:31:10 48336 [Ubuntu.1404.vmfs-datastore1._DEFAULT.vol_access_volume_e8ktpto5vl0h] [INFO   ] Save the dictionary {u'status': 'detached', u'created': u'Fri Jun 16 07:30:57 2017', u'volOpts': {u'access': u'read-only', u'diskformat': u'thin', u'fstype': u'ext4'}, u'created-by': u'photon.ga.vmfs'} to sidecar for vol path /vmfs/volumes/datastore1/dockvols/_DEFAULT/vol_access_volume_e8ktpto5vl0h.vmdk
06/16/17 07:31:11 48336 [Ubuntu.1404.vmfs-datastore1._DEFAULT.vol_access_volume_e8ktpto5vl0h] [INFO   ] save: write {"status": "detached", "created": "Fri Jun 16 07:30:57 2017", "volOpts": {"access": "read-only", "diskformat": "thin", "fstype": "ext4"}, "created-by": "photon.ga.vmfs"} to meta_file /vmfs/volumes/datastore1/dockvols/_DEFAULT/vol_access_volume_e8ktpto5vl0h-03d9153809d9881f.vmfd     -> !!! this call to save() failed to write the updated meta data .vmfd file !!!
06/16/17 07:31:11 49423 [MainThread] [WARNING] load: Kv_str  reading from meta file /vmfs/volumes/datastore1/dockvols/_DEFAULT/vol_access_volume_e8ktpto5vl0h-03d9153809d9881f.vmfd is not in JSON format
06/16/17 07:31:11 49423 [MainThread] [ERROR  ] load:Failed to decode meta-data for /vmfs/volumes/datastore1/dockvols/_DEFAULT/vol_access_volume_e8ktpto5vl0h.vmdk
Traceback (most recent call last):
  File "/usr/lib/vmware/vmdkops/Python/kvESX.py", line 290, in load
    return json.loads(kv_str)
  File "/build/mts/release/bora-3620759/bora/build/esx/release/vmvisor/sys-boot/lib/python2.7/json/__init__.py", line 338, in loads
  File "/build/mts/release/bora-3620759/bora/build/esx/release/vmvisor/sys-boot/lib/python2.7/json/decoder.py", line 366, in decode
  File "/build/mts/release/bora-3620759/bora/build/esx/release/vmvisor/sys-boot/lib/python2.7/json/decoder.py", line 384, in raw_decode
ValueError: No JSON object could be decoded

@pdhamdhere
Contributor

Root caused to the test not waiting for the Docker operation to finish before issuing the admin update command. Confirmed that with --rm added, ssh.Invoke waits for the detach to finish before returning to the shell. @shuklanirdesh82 will change the test infra to use --rm as appropriate.

@shaominchen
Contributor

@pdhamdhere Shouldn't the driver take care of this race condition, i.e. shouldn't the admin update command be blocked in this case until the Docker operation is finished? I'm not really familiar with the detailed logic in this module, but I think the metadata file should be locked using an R/W lock: any read operation should be blocked while a write operation is in progress. Please correct me if I'm wrong here.
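
For illustration only, a minimal sketch of what an interprocess lock around meta-file access could look like, using fcntl.flock on a hypothetical per-volume .lock file (this is not what the plugin does today):

    import fcntl
    from contextlib import contextmanager

    @contextmanager
    def meta_file_lock(meta_file, shared=False):
        """Hold an advisory lock on <meta_file>.lock: shared for readers,
        exclusive for writers, so readers block while a write is in progress."""
        mode = fcntl.LOCK_SH if shared else fcntl.LOCK_EX
        with open(meta_file + ".lock", "w") as lock_fh:
            fcntl.flock(lock_fh, mode)
            try:
                yield
            finally:
                fcntl.flock(lock_fh, fcntl.LOCK_UN)

    # usage sketch (write_meta/read_meta are hypothetical helpers):
    # with meta_file_lock(meta_file):              # writer, exclusive
    #     write_meta(meta_file, kv_dict)
    # with meta_file_lock(meta_file, shared=True): # reader
    #     kv_dict = read_meta(meta_file)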

@pdhamdhere
Contributor

Theoretically, the underlying FS (VMFS in this case) locking should correctly handle R/W to the same file from multiple processes and should not result in a 0-byte file. Another cause could be the way sidecars are handled. However, this is the rarest of races, where an admin updates the status of a volume while the KV is being written, so I didn't spend too much time on it. We should fix the test and move on to higher-priority issues. Makes sense?

@govint
Contributor

govint commented Jun 17, 2017

@lipingxue But why did the write fail on the KV? Also, a lock is held on every load and save op on the KV; even if an admin CLI command is issued on the volume while a detach is in progress, they will both synchronize on that KV lock to prevent simultaneous access.

From the log trace, two parallel ops are loading (locked) the KV and then saving (again locked) it. Both ops complete for both threads in parallel, but they synchronize on the KV lock and are hence sequenced.

@lipingxue can you also share the logs of the failing write?

@msterin
Contributor

msterin commented Jun 17, 2017

@govint - what KV lock are you talking about?
To me it looks like we still have a big risk, since we have no idea why the write failed, and we exercised a real (albeit rare) use case where the admin just happened to type something while a vmdk op on some docker host was in progress.

@govint
Contributor

govint commented Jun 17, 2017

Both kv load and save take a lock before executing the op.

@msterin
Contributor

msterin commented Jun 17, 2017

Which lock? An in-process Python lock? It is irrelevant when the admin CLI and the vmdk_ops service race. I am not aware of any interprocess lock for the KV. Did I miss something?

@govint
Contributor

govint commented Jun 20, 2017

Correct, it's not a file lock: the kvESX.py load() and save() use a lock decorator, so those functions are synchronized within the ESX service and within the admin CLI individually.
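
A minimal sketch of that kind of lock decorator (names here are illustrative, not the exact kvESX.py code). Note that a threading.Lock like this only serializes calls within a single Python process, so it cannot order the admin CLI process against the vmdk_ops service:

    import threading
    from functools import wraps

    _kv_lock = threading.Lock()   # per-process lock, not shared across processes

    def lock_decorator(func):
        """Serialize load()/save() calls within this one process only."""
        @wraps(func)
        def wrapper(*args, **kwargs):
            with _kv_lock:
                return func(*args, **kwargs)
        return wrapper

    @lock_decorator
    def load(vol_path):
        ...

    @lock_decorator
    def save(vol_path, kv_dict):
        ...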

@shaominchen
Contributor

The test case itself has been fixed, but we still need more discussion about this issue.

@shaominchen shaominchen modified the milestones: 0.17, 0.16 Jun 20, 2017
@tusharnt tusharnt assigned govint and unassigned shuklanirdesh82 Jul 17, 2017
@tusharnt tusharnt modified the milestones: Biweekly sprint - Rafiki, 0.17 Jul 17, 2017
@govint
Contributor

govint commented Jul 24, 2017

@shaominchen, @tusharnt - What do we want to do for this issue? There is a race between an admin CLI user and the service user (via Docker). One way is to have the admin CLI stop using the service code and instead post requests to the service and get the responses. There are no file locks used here, so different processes can't safely access the meta-data in parallel.

@shaominchen
Contributor

shaominchen commented Jul 24, 2017

One way is to have the admin CLI stop using the service code and instead post requests to the service and get the responses.

This sounds like a reasonable solution to me. Do we have an estimate for this approach? Depending on the schedule and the estimate of the fix, we can decide whether we want to fix this for GA or not.

@govint
Contributor

govint commented Jul 25, 2017

This approach would need a new listener in the ESX service (we can't reuse the VMCI listener in the service) to handle requests from the admin CLI. I'd prefer not to do this for GA, as the use case is fairly remote - a race condition between the admin CLI and Docker. In fact, I'd prefer not to address this issue unless a customer hits it, in which case something like #1079 may be usable - a more generic API-based service that can be accessed not only locally but across hosts.
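
Purely as a sketch of that direction (the socket path, handler, and dispatch helper are hypothetical, not an existing endpoint): the admin CLI could post its request to a small listener inside the service over a Unix domain socket, so all meta-data updates are funneled through the one service process:

    import json
    import socketserver

    ADMIN_SOCK = "/var/run/vmdkops-admin.sock"   # hypothetical socket path

    class AdminRequestHandler(socketserver.StreamRequestHandler):
        def handle(self):
            # One newline-terminated JSON request per connection, e.g.
            # {"cmd": "set", "vol": "vol1@datastore1", "options": {"access": "read-only"}}
            req = json.loads(self.rfile.readline().decode())
            resp = dispatch(req)
            self.wfile.write(json.dumps(resp).encode())

    def dispatch(req):
        # In the real service this would route to vmdk_ops.set_vol_opts() and
        # friends, inside the same process that handles Docker requests.
        return {"ok": True}

    if __name__ == "__main__":
        server = socketserver.UnixStreamServer(ADMIN_SOCK, AdminRequestHandler)
        server.serve_forever()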

@govint
Contributor

govint commented Sep 13, 2017

Closing this issue. The approach was to change the way the admin CLI shares the ESX service code, but I don't see that changing any more, and definitely not for addressing the scenario here.

@govint govint closed this as completed Sep 13, 2017