-
Notifications
You must be signed in to change notification settings - Fork 95
Conversation
@@ -644,7 +645,7 @@ def listVMDK(tenant): | |||
Each volume name is returned as either `volume@datastore`, or just `volume` | |||
for volumes on vm_datastore | |||
""" | |||
vmdk_utils.init_datastoreCache() | |||
vmdk_utils.init_datastoreCache(force=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to force cache refresh here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it need to make sure the listVMDK
return the correct value even if the "datastore" name is changed outside. Cache need to be forced refresh to pickup the new datastore name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, makes sense
LgTM |
.gitignore
Outdated
@@ -1,3 +1,4 @@ | |||
.gitignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this file to be added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have this file in the master branch too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is there and controls what git should ignore for all... but I think this specific changbe to the file does not much sense as it removes itself from .git control and freezes the old version. My bad, I suggested this. @lipingxue - you may want to reverse this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, could add .swp to it - not this change
@@ -152,13 +165,13 @@ def get_volumes(tenant_re): | |||
error_info, tenant_name = auth_api.get_tenant_name(sub_dir_name) | |||
if not error_info: | |||
logging.debug("get_volumes: path=%s root=%s sub_dir_name=%s tenant_name=%s", | |||
path, root, sub_dir_name, tenant_name) | |||
path, root, sub_dir_name, tenant_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit - white space
if datastore in [i[0] for i in datastores]: | ||
return True | ||
else: | ||
init_datastoreCache() | ||
init_datastoreCache(force=True) | ||
if datastore in [i[0] for i in datastores]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens is one thread is doing this check here, looping thru the datastores, while another thread is running init_datastoreCache() and reconstructing the datastores list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All uses of the datastores list must be protected with the lock - whether re-creating the list or looping over it. The code here may cause an exception if another thread just set datastores to [].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should avoid using global variable "datastores" to loop through. We should call function "get_datastores" instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_datastores() returns the global, not sure how Python passes globals across modules in this case between vmdk_utils.py and vmdk_ops.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't use global "datastores" in "vmdk_ops.py". Am I missing anything here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We call get_datastores() in vmdk_ops.py: get_datastore_names_list()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@govint get_datastores() returns a ref to an object really, not a global or anything like that.. Whataver name it is assigned to later , it will bind that name to the current list-of-datastores object (which was bound to global 'datastores' name in init_cache(). , When we use the name, python will increase the list object refcount and use the object (not the name). If during this usage a new list object is created and bound to 'datastores' global name,([]
in init_cache() in a different thread) it should not impact the current object-in-use. Due to GIL (global interpreter lock) an op like 'datastores' name binding to a new object is atomic - see my comment on flipping datastores binding to the new list in a single operation.. .. of course it's still worth testing.
Here is a small illustration of objects ids and name binding. id()
is a built-in which returns an id of an object bound to the current name:
>>> ds = [1,2,3]
>>> id(ds)
65827064
>>> ds = []
>>> id(ds)
22780784
>>> ds.append(111)
>>> ds
[111]
>>> id(ds)
22780784
BTW - thanks for catching the glitch !
@@ -109,21 +122,21 @@ def get_datastores(): | |||
'url' is datastore URL (e.g. '/vmfs/volumes/vsan:572904f8c031435f-3513e0db551fcc82') | |||
'dockvol-path; is a full path to 'dockvols' folder on datastore | |||
""" | |||
if datastores == None: | |||
init_datastoreCache() | |||
init_datastoreCache() | |||
return datastores |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thread may in parallel be in init_datastoreCache() and set datastores to None (by force option). May be get_datastores() should be returning a copy of the datastores list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? I think if another thread is in "init_datastoreCache()", this thread try to call "init_datastoreCache()", they it should wait, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, thread A has finished init_datastoreCache() and is going to return "datastores" and thread B calls init_datastoreCache() and sets datastores to None. Thread A will be returning None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, even "get_datastores" returns a copy of global variable "datastores" is not right. Since now the copy of "datastores" can be None too. We should let the init_datastoreCache() to return "datastores". Then funtion "get_datastores" just return "init_datastoreCache()".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like I mentioned all uses of datastores should be with the lock held. Either hold the lock and create a copy or like you mentioned let init return a copy of datastores list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uses of the datastores list need to be under lock in vmdk_utils.py and a copy of the global list must be returned vs. the list itself.
A test would be to modify the code to always do a "forced init" just so the list is always trashed and re-created and run commands from multiple VMs on a single host.
@govint Could you explain more what is the test you suggested? I am not quite clear. |
Ok, modify init_datastoreCache() to always do a forced init - meaning datastores will be set to None and re-initialized on each call and run a workload of test-esx on say >=2 VMs on the host. Verify no exceptions/errors in the ESX service. |
To add to the discussion above. The main gap is that datastores flip to new value is not transactional for anyone not acquiring the lock. It is easily healed by accumulating the list in a local object (local var) and flipping in one op at the end of init() ( If there is further discussion, let's take it to telegram |
esx_service/utils/vmdk_utils.py
Outdated
with lockManager.get_lock("init_datastoreCache"): | ||
if datastores: | ||
if force: | ||
datastores = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is what can cause issues on a non-locked consumption, as @govint has pointed out
it should simply say
if datastores and not force:
return
and the code down the line should use tmp_datastores (local name bound to a new list) instead of datastores
And the last statement should simply be datastores = tmp_datastores
which will re-bind datastores
name to the new object. Note that the old object (list of datastores from the previous init()) may be in use by other threads at the same time and Python will simply garbage collects it when it's not used anymore
esx_service/utils/vmdk_utils.py
Outdated
datastores = [] | ||
|
||
for datastore in ds_objects: | ||
dockvols_path, err = vmdk_ops.get_vol_path(datastore.info.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
L89 -92 could be simply:
tmp_ds = []
for datastore si.content.rootFolder.childEntity[0].datastoreFolder.childEntity:
... same operation, using tmp_ds instead of datastore
datastores = tmp_ds
879a345
to
ce6bb77
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lipingxue - thanks for making this one happen and accommodating all fix requests.
@govint - thanks again for pointing out to the datastores
usage.
The code is LGTM.
@lipingxue - Please drop .gitignore change from the PR (sorry, my bad - see related comments) and remove TODO/fix the comment (see below). Also a couple of nits - your call if you want to take them or ignore them. Ready to go after that.
.gitignore
Outdated
@@ -1,3 +1,4 @@ | |||
.gitignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is there and controls what git should ignore for all... but I think this specific changbe to the file does not much sense as it removes itself from .git control and freezes the old version. My bad, I suggested this. @lipingxue - you may want to reverse this file.
esx_service/utils/vmdk_utils.py
Outdated
Initializes the datastore cache with the list of datastores accessible from local ESX host. | ||
Initializes the datastore cache with the list of datastores accessible | ||
from local ESX host. force=True will force it to ignore current cache | ||
and force init | ||
""" | ||
global datastores |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this can be moved under 'with lockManager'. Not a big difference , just a more precise scope
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
esx_service/utils/vmdk_utils.py
Outdated
for datastore in ds_objects: | ||
dockvols_path, err = vmdk_ops.get_vol_path(datastore.info.name) | ||
if err: | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest adding a logging.error here too, as it is not supposed to happen at all
esx_service/utils/vmdk_utils.py
Outdated
# tenant_re = "tenant*" : return volumes which belongs to tenant1 or tenant2 | ||
# tenant_re = "*" : return all volumes under /vmfs/volumes/datastore1/dockervol | ||
# tenant_re = "tenant*" : return volumes which belong to tenant1 or tenant2 | ||
# tenant_re = "*" : return all volumes under /vmfs/volumes/datastore1/dockervol TODO: fix the comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please fix the or drop TODO before merge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
esx_service/utils/vmdk_utils.py
Outdated
for datastore in ds_objects: | ||
dockvols_path, err = vmdk_ops.get_vol_path(datastore.info.name) | ||
if err: | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, can remove the white spaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could add a log saying this DS is being ignored as the dockvols path can't be created on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
return res[0] if res else None | ||
|
||
|
||
def get_datastore_url_from_config_path(config_path): | ||
"""Returns datastore url in config_path """ | ||
# path can be /vmfs/volumes/<datastore_url_name>/... | ||
# or /vmfs/volumes/datastore_name/... | ||
# so extract datastore_url_name: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not this change, the comment is confusing a little, if the format is the latter "/vmfs/volumes/datastore_name/ then this function is returning datastore_name and not the URL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only nits. For the testing can it be done with datastores that have spaces in the datastore name.
For init_datastoreCache to refresh by default, using for test. Small fix.
ce6bb77
to
97ae985
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls change error->warning before merge (see comments).
for datastore in ds_objects: | ||
dockvols_path, err = vmdk_ops.get_vol_path(datastore.info.name) | ||
if err: | ||
logging.error(" datastore %s is being ignored as the dockvol path can't be created on it") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logging.warning. Looks like it can happen and will happen
init_datastoreCache and a couple of it's callers use global 'datastores' with no lock, which may cause race. This PR is to add lock to init_datastoreCache() to avoid the race.