This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

Add lock to init_datastoreCache #1165

Merged
lipingxue merged 6 commits into master from add_lock_to_init_ds_cache.liping on Apr 17, 2017

Conversation

lipingxue
Contributor

init_datastoreCache() and a couple of its callers use the global 'datastores' with no lock, which may cause a race. This PR adds a lock to init_datastoreCache() to avoid the race.
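
For readers skimming the thread, here is a minimal sketch of the pattern the PR describes, based only on the snippets quoted in the review below; threading.Lock stands in for the project's lockManager, and _refresh_datastores() is a hypothetical placeholder for the real ESX datastore enumeration.

    import threading

    # Module-level cache of (name, url, dockvols_path) tuples; None means "not built yet".
    datastores = None
    _cache_lock = threading.Lock()     # stands in for lockManager.get_lock("init_datastoreCache")

    def _refresh_datastores():
        # Hypothetical placeholder for the real enumeration of datastores on the ESX host.
        return [("datastore1", "/vmfs/volumes/uuid-1", "/vmfs/volumes/uuid-1/dockvols")]

    def init_datastoreCache(force=False):
        """(Re)build the cache; force=True ignores any existing cache."""
        global datastores
        with _cache_lock:              # serialize concurrent (re)initializations
            if datastores and not force:
                return
            datastores = _refresh_datastores()

    def get_datastores():
        """Return the cached list, building it on first use."""
        if datastores is None:
            init_datastoreCache()
        return datastores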

@@ -644,7 +645,7 @@ def listVMDK(tenant):
    Each volume name is returned as either `volume@datastore`, or just `volume`
    for volumes on vm_datastore
    """
    vmdk_utils.init_datastoreCache()
    vmdk_utils.init_datastoreCache(force=True)
Contributor

Why do we need to force a cache refresh here?

Contributor Author

I think it is needed to make sure listVMDK returns the correct value even if the datastore name is changed externally. The cache needs to be force-refreshed to pick up the new datastore name.

Contributor

Ah, makes sense

@msterin
Contributor

msterin commented Apr 14, 2017

LGTM

.gitignore Outdated
@@ -1,3 +1,4 @@
.gitignore
Contributor

Is this file to be added?

Contributor Author

I think we have this file in the master branch too.

Contributor

This file is there and controls what git should ignore for all... but I think this specific change to the file does not make much sense, as it removes itself from git control and freezes the old version. My bad, I suggested this. @lipingxue - you may want to revert this file.

Contributor

@govint govint Apr 17, 2017

Ok, we could add .swp to it - but not in this change.

@@ -152,13 +165,13 @@ def get_volumes(tenant_re):
error_info, tenant_name = auth_api.get_tenant_name(sub_dir_name)
if not error_info:
logging.debug("get_volumes: path=%s root=%s sub_dir_name=%s tenant_name=%s",
path, root, sub_dir_name, tenant_name)
path, root, sub_dir_name, tenant_name)
Contributor

Nit: whitespace

if datastore in [i[0] for i in datastores]:
    return True
else:
    init_datastoreCache()
    init_datastoreCache(force=True)
    if datastore in [i[0] for i in datastores]:
Contributor

@govint govint Apr 14, 2017

What happens if one thread is doing this check here, looping through the datastores, while another thread is running init_datastoreCache() and reconstructing the datastores list?

Contributor

@govint govint Apr 14, 2017

All uses of the datastores list must be protected with the lock - whether re-creating the list or looping over it. The code here may cause an exception if another thread just set datastores to [].
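
A small, hypothetical reproduction of the hazard being discussed (illustrative names only, not the project's code): if one thread resets the module-level cache, here to None as the force path in this PR originally did, while another thread is evaluating the membership check, the reader can raise.

    import threading, time

    datastores = [("datastore1", "url1"), ("datastore2", "url2")]

    def reader():
        try:
            for _ in range(200000):
                if "datastore1" in [i[0] for i in datastores]:   # re-reads the global every pass
                    pass
            print("reader finished without hitting the race this run")
        except TypeError as e:
            print("reader hit the race:", e)    # e.g. 'NoneType' object is not iterable

    def resetter():
        global datastores
        time.sleep(0.01)
        datastores = None                       # simulate the force path resetting the cache

    t1, t2 = threading.Thread(target=reader), threading.Thread(target=resetter)
    t1.start(); t2.start(); t1.join(); t2.join()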

Contributor Author

I think we should avoid looping over the global variable "datastores" directly. We should call the function "get_datastores" instead.

Contributor

get_datastores() returns the global; I'm not sure how Python passes globals across modules in this case, between vmdk_utils.py and vmdk_ops.py.

Contributor Author

I think we don't use the global "datastores" in vmdk_ops.py. Am I missing anything here?

Contributor

We call get_datastores() in vmdk_ops.py: get_datastore_names_list()

Contributor

@msterin msterin Apr 16, 2017

@govint get_datastores() really returns a reference to an object, not a global or anything like that. Whatever name it is assigned to later, that name will be bound to the current list-of-datastores object (the one that was bound to the global 'datastores' name in init_cache()). When we use the name, Python increases the list object's refcount and uses the object (not the name). If during this usage a new list object is created and bound to the global 'datastores' name ([] in init_cache() in a different thread), it should not impact the current object-in-use. Due to the GIL (global interpreter lock), an op like binding the 'datastores' name to a new object is atomic - see my comment on flipping the datastores binding to the new list in a single operation. Of course it's still worth testing.
Here is a small illustration of object ids and name binding. id() is a built-in which returns the id of the object bound to the given name:

>>> ds = [1,2,3]
>>> id(ds)
65827064
>>> ds = []
>>> id(ds)
22780784
>>> ds.append(111)
>>> ds
[111]
>>> id(ds)
22780784

BTW - thanks for catching the glitch!
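
To make the point above concrete, here is a small, self-contained illustration (not the project's code; the names are made up for the demo): a thread that has already bound a local name to the current list keeps seeing that object, even after another thread re-binds the global name.

    import threading

    datastores = [("ds%d" % i, "/vmfs/volumes/uuid-%d" % i) for i in range(5)]
    snapshot_taken = threading.Event()
    global_rebound = threading.Event()

    def reader():
        snapshot = datastores          # local name now refers to the original list object
        snapshot_taken.set()
        global_rebound.wait()          # wait until the other thread has re-bound the global
        assert len(snapshot) == 5      # unaffected by the re-bind in the writer thread

    def writer():
        global datastores
        snapshot_taken.wait()
        datastores = []                # re-bind the global name to a new, empty list
        global_rebound.set()

    threads = [threading.Thread(target=f) for f in (reader, writer)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("reader's snapshot still has 5 entries; the global now has", len(datastores))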

@@ -109,21 +122,21 @@ def get_datastores():
    'url' is datastore URL (e.g. '/vmfs/volumes/vsan:572904f8c031435f-3513e0db551fcc82')
    'dockvol-path' is a full path to 'dockvols' folder on datastore
    """
    if datastores == None:
        init_datastoreCache()
        init_datastoreCache()
    return datastores
Contributor

Another thread may be in init_datastoreCache() in parallel and set datastores to None (via the force option). Maybe get_datastores() should return a copy of the datastores list.

Contributor Author

Why? I think if another thread is in init_datastoreCache() and this thread tries to call init_datastoreCache(), this thread should wait, shouldn't it?

Contributor

No: thread A has finished init_datastoreCache() and is about to return "datastores", and thread B calls init_datastoreCache() and sets datastores to None. Thread A will then return None.

Contributor Author

In this case, even having "get_datastores" return a copy of the global variable "datastores" is not right, since the copy of "datastores" can be None too. We should let init_datastoreCache() return "datastores"; then the function "get_datastores" can just return init_datastoreCache().

Contributor

Like I mentioned, all uses of datastores should be done with the lock held. Either hold the lock and create a copy, or, like you mentioned, let init return a copy of the datastores list.
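
A rough sketch of the alternative discussed in this thread, under the same hypothetical names as the sketch near the top: have init_datastoreCache() return the freshly bound list from inside the lock, so get_datastores() callers never observe the transient None.

    import threading

    datastores = None
    _cache_lock = threading.Lock()     # stands in for the project's lockManager

    def _discover_datastores():
        # Hypothetical stand-in for the real ESX datastore enumeration.
        return [("datastore1", "/vmfs/volumes/uuid-1", "/vmfs/volumes/uuid-1/dockvols")]

    def init_datastoreCache(force=False):
        """Rebuild the cache under the lock and return it, so callers never see None."""
        global datastores
        with _cache_lock:
            if datastores is None or force:
                datastores = _discover_datastores()
            return datastores          # returned while still holding the lock

    def get_datastores():
        return init_datastoreCache()   # always a fully built list, never the transient None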

Contributor

@govint govint left a comment

Uses of the datastores list need to be under the lock in vmdk_utils.py, and a copy of the global list must be returned rather than the list itself.

A test would be to modify the code to always do a "forced init", so the list is always trashed and re-created, and then run commands from multiple VMs on a single host.

@lipingxue
Contributor Author

@govint Could you explain a bit more about the test you suggested? I am not quite clear on it.

@govint
Contributor

govint commented Apr 15, 2017

Ok, modify init_datastoreCache() to always do a forced init - meaning datastores will be set to None and re-initialized on each call - and run a test-esx workload on, say, >=2 VMs on the host. Verify there are no exceptions/errors in the ESX service.
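
The real test is test-esx run from two or more VMs against one host, as described above; purely as a local approximation (reusing the hypothetical init_datastoreCache()/get_datastores() from the sketches earlier in this thread), one can hammer forced re-inits and concurrent readers from a few threads and check that nothing raises:

    import threading

    errors = []

    def hammer_init():
        for _ in range(1000):
            try:
                init_datastoreCache(force=True)    # always trash and rebuild the cache
            except Exception as e:
                errors.append(e)

    def hammer_read():
        for _ in range(1000):
            try:
                [d[0] for d in get_datastores()]   # the lookup pattern callers use
            except Exception as e:
                errors.append(e)

    threads = [threading.Thread(target=f) for f in (hammer_init, hammer_read, hammer_read)]
    for t in threads: t.start()
    for t in threads: t.join()
    assert not errors, errors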

@msterin
Contributor

msterin commented Apr 15, 2017

To add to the discussion above: the main gap is that the flip of datastores to a new value is not transactional for anyone not acquiring the lock. It is easily healed by accumulating the list in a local object (local var) and flipping it in one op at the end of init() (datastores = tmp_datastores). This way, once a traversal of the stores has started in a thread, init() in another thread will not impact the object in use. Forcing init defeats the purpose of the cache, though it's good for testing.

If there is further discussion, let's take it to Telegram.
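
Condensed into code, the shape being described above looks roughly like the following (a sketch only: threading.Lock stands in for lockManager, and _discover_datastores() is a hypothetical stand-in for the real enumeration plus the vmdk_ops.get_vol_path() check):

    import logging
    import threading

    datastores = None
    _cache_lock = threading.Lock()          # stands in for lockManager.get_lock("init_datastoreCache")

    def _discover_datastores():
        # Hypothetical stand-in: yields (name, url, dockvols_path, err) per datastore,
        # err playing the role of the vmdk_ops.get_vol_path() error.
        yield ("datastore1", "/vmfs/volumes/uuid-1", "/vmfs/volumes/uuid-1/dockvols", None)

    def init_datastoreCache(force=False):
        global datastores
        with _cache_lock:
            if datastores and not force:
                return                      # cache is valid, nothing to rebuild
            tmp_datastores = []             # accumulate into a local list first
            for name, url, dockvols_path, err in _discover_datastores():
                if err:
                    logging.warning("Ignoring datastore %s: dockvols path can't be created", name)
                    continue
                tmp_datastores.append((name, url, dockvols_path))
            datastores = tmp_datastores     # single re-bind at the end; readers keep the old object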

with lockManager.get_lock("init_datastoreCache"):
    if datastores:
        if force:
            datastores = None
Contributor

@msterin msterin Apr 15, 2017

this is what can cause issues on non-locked consumption, as @govint has pointed out. It should simply say

  if datastores and not force:
    return

and the code down the line should use tmp_datastores (a local name bound to a new list) instead of datastores. The last statement should then simply be datastores = tmp_datastores, which will re-bind the datastores name to the new object. Note that the old object (the list of datastores from the previous init()) may still be in use by other threads at the same time; Python will simply garbage-collect it when it is no longer used.

datastores = []

for datastore in ds_objects:
    dockvols_path, err = vmdk_ops.get_vol_path(datastore.info.name)
Contributor

L89-92 could simply be:

    tmp_ds = []
    for datastore in si.content.rootFolder.childEntity[0].datastoreFolder.childEntity:
        ... same operation, using tmp_ds instead of datastores
    datastores = tmp_ds

@lipingxue lipingxue force-pushed the add_lock_to_init_ds_cache.liping branch from 879a345 to ce6bb77 Compare April 16, 2017 05:12
@lipingxue
Contributor Author

@msterin @govint I have addressed your comments and run the test that Govindan suggested (no exceptions were found in ESX). Please review it.

Contributor

@msterin msterin left a comment

@lipingxue - thanks for making this one happen and accommodating all the fix requests.
@govint - thanks again for pointing out the datastores usage.

The code LGTM.

@lipingxue - Please drop the .gitignore change from the PR (sorry, my bad - see the related comments) and remove the TODO/fix the comment (see below). Also a couple of nits - your call whether to take them or ignore them. Ready to go after that.

    Initializes the datastore cache with the list of datastores accessible from local ESX host.
    Initializes the datastore cache with the list of datastores accessible
    from local ESX host. force=True will force it to ignore current cache
    and force init
    """
    global datastores
Contributor

Nit: this can be moved under 'with lockManager'. Not a big difference, just a more precise scope.

Contributor Author

Done.

for datastore in ds_objects:
    dockvols_path, err = vmdk_ops.get_vol_path(datastore.info.name)
    if err:
        continue
Contributor

I'd suggest adding a logging.error here too, as it is not supposed to happen at all

# tenant_re = "tenant*" : return volumes which belongs to tenant1 or tenant2
# tenant_re = "*" : return all volumes under /vmfs/volumes/datastore1/dockervol
# tenant_re = "tenant*" : return volumes which belong to tenant1 or tenant2
# tenant_re = "*" : return all volumes under /vmfs/volumes/datastore1/dockervol TODO: fix the comment
Contributor

Please fix the comment or drop the TODO before merge.

Contributor Author

Done.

for datastore in ds_objects:
    dockvols_path, err = vmdk_ops.get_vol_path(datastore.info.name)
    if err:
        continue
Contributor

Nit: the whitespace can be removed.

Contributor

Could add a log saying this DS is being ignored as the dockvols path can't be created on it.

Contributor Author

Done.

return res[0] if res else None


def get_datastore_url_from_config_path(config_path):
    """Returns datastore url in config_path """
    # path can be /vmfs/volumes/<datastore_url_name>/...
    # or /vmfs/volumes/datastore_name/...
    # so extract datastore_url_name:
Contributor

Not for this change, but the comment is a little confusing: if the format is the latter, "/vmfs/volumes/datastore_name/...", then this function returns datastore_name and not the URL.
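
To illustrate the point (a hypothetical example, not the project's code): splitting a path of the second form returns whatever sits in the third path component, which is a plain datastore name rather than a URL-style name.

    config_path = "/vmfs/volumes/datastore1/dockvols/tenant1/vol.vmdk"
    datastore_component = config_path.split("/")[3]   # -> 'datastore1', a name, not a URL
    print(datastore_component)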

Contributor

@govint govint left a comment

Only nits. For the testing, can it be done with datastores that have spaces in the datastore name?

@lipingxue lipingxue force-pushed the add_lock_to_init_ds_cache.liping branch from ce6bb77 to 97ae985 Compare April 17, 2017 16:25
Contributor

@msterin msterin left a comment

Please change error -> warning before merge (see comments).

for datastore in ds_objects:
    dockvols_path, err = vmdk_ops.get_vol_path(datastore.info.name)
    if err:
        logging.error(" datastore %s is being ignored as the dockvol path can't be created on it")
Contributor

Use logging.warning - looks like it can happen and will happen.

@lipingxue lipingxue merged commit 5c3f5c1 into master Apr 17, 2017
@msterin msterin deleted the add_lock_to_init_ds_cache.liping branch April 22, 2017 03:20