This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

Make configuration DB location survive ESXi reboot #1401

Merged: msterin merged 3 commits into master from persist_dblink.msterin on Jun 22, 2017

Contributor

@msterin msterin commented Jun 12, 2017

Fixes #1347

On reboot, all changes in /etc are lost unless they are backed up.
ESXi has a standard backup for /etc/rc.local.d/local.sh: this file is
backed up and run on boot, so to persist the symlink to the shared DB we need to
insert the link creation into local.sh. When we drop the link, we need
to clean up local.sh.

This change does exactly that.
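
The add, replace and remove behavior described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code in this PR: `CONFIG_DB_TAG` and `END_OF_SCRIPT` mirror the constants quoted in the review, while `make_section` and `update_script` are hypothetical names and the section body is shortened to a single line.

```python
CONFIG_DB_TAG = "# -- vSphere Docker Volume Service configuration --"
END_OF_SCRIPT = "exit 0"


def make_section(datastore):
    """Build a tagged section (hypothetical helper, shortened body)."""
    return "{tag}\ndatastore={ds}\n{tag}\n".format(tag=CONFIG_DB_TAG, ds=datastore)


def update_script(text, content=None):
    """Drop any existing tagged section; if content is given, put it before 'exit 0'."""
    out = []
    inside_section = False
    for line in text.splitlines(True):
        stripped = line.strip()
        if stripped == CONFIG_DB_TAG:
            # The same tag opens and closes the section, so toggle on each hit.
            inside_section = not inside_section
            continue
        if inside_section:
            continue  # the old section body is discarded
        if stripped == END_OF_SCRIPT and content is not None:
            out.append(content)
            content = None
        out.append(line)
    if content is not None:
        # No 'exit 0' was found: append the section at the end as a fallback.
        out.append(content)
    return "".join(out)
```

Calling `update_script(text)` without content removes the section, so an add followed by a remove should give back the original script, which is the property the attached unit test is described as checking.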

Tested - unit test attached. Admin_cli tested manually:

Add:

[root@msterin-esx-136:/usr/lib/vmware/vmdkops/bin] ./vmdkops_admin.py config init --datastore=datastore1 --force                                                                                                                               
Warning: this feature is EXPERIMENTAL                                                                                                                                                                                                          
Creating a symlink to /vmfs/volumes/datastore1/dockvols/vmdkops_config.db at /etc/vmware/vmdkops/auth-db                                                                                                                                       
Updating /etc/rc.local.d/local.sh                                                                                                                                                                                                              
[root@msterin-esx-136:/usr/lib/vmware/vmdkops/bin] cat /etc/rc.local.d/local.sh                                                                                                                                                                
#!/bin/sh                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                               
# local configuration options                                                                                                                                                                                                                  
                                                                                                                                                                                                                                               
# Note: modify at your own risk!  If you do/use anything in this                                                                                                                                                                               
# script that is not part of a stable API (relying on files to be in                                                                                                                                                                           
# specific places, specific tools, specific output, etc) there is a                                                                                                                                                                            
# possibility you will end up with a broken system after patching or                                                                                                                                                                           
# upgrading.  Changes are not supported unless under direction of                                                                                                                                                                              
# VMware support.                                                                                                                                                                                                                              
                                                                                                                                                                                                                                               
# Note: This script will not be run when UEFI secure boot is enabled.                                                                                                                                                                          
                                                                                                                                                                                                                                               
# -- vSphere Docker Volume Service configuration --                                                                                                                                                                                            
#                                                                                                                                                                                                                                              
# Please do not edit this section manually. It is managed by vmdkops_admin.py config command.                                                                                                                                                  
# Note: the code relies on local.sh having "exit 0" at the end.                                                                                                                                                                                
#                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                               
datastore=datastore1                                                                                                                                                                                                                           
                                                                                                                                                                                                                                               
slink=/etc/vmware/vmdkops/auth-db                                                                                                                                                                                                              
shared_dbn=/vmfs/volumes/$datastore/dockvols/vmdkops_config.db                                                                                                                                                                                 
                                                                                                                                                                                                                                               
if [ -d $(basename $slink) ] && [ ! -e $slink  ]                                                                                                                                                                                               
then                                                                                                                                                                                                                                           
    ln -s $shared_db $slink                                                                                                                                                                                                                    
fi                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                               
# -- vSphere Docker Volume Service configuration --                                                                                                                                                                                            
exit 0                                                                                                                                                                                                                                         
[root@msterin-esx-136:/usr/lib/vmware/vmdkops/bin]

Remove:

[root@msterin-esx-136:/usr/lib/vmware/vmdkops/bin] ./vmdkops_admin.py config rm --unlink --confirm
Removed link /etc/vmware/vmdkops/auth-db
Updating /etc/rc.local.d/local.sh
[root@msterin-esx-136:/usr/lib/vmware/vmdkops/bin] cat /etc/rc.local.d/local.sh
#!/bin/sh

# local configuration options

# Note: modify at your own risk!  If you do/use anything in this
# script that is not part of a stable API (relying on files to be in
# specific places, specific tools, specific output, etc) there is a
# possibility you will end up with a broken system after patching or
# upgrading.  Changes are not supported unless under direction of
# VMware support.

# Note: This script will not be run when UEFI secure boot is enabled.

exit 0
[root@msterin-esx-136:/usr/lib/vmware/vmdkops/bin]

# This is what we use to identify the our content for DB links..
CONFIG_DB_TAG = "# -- vSphere Docker Volume Service configuration --"

# This is the content tempate for db links. '{}' will be replaced by datastore name
Contributor

nit: template

Contributor

@shaominchen shaominchen left a comment

We should have an e2e test to verify this scenario. Can we file an issue to track the task of adding an e2e test?

"""
Support for adding/removing information about config db link from /etc/rc.local.d/local.sh
Any config stuff we need and configure in /etc/... will be removed on ESX reboot.
Anything we need to persist between the reboots, needs to be confugured here.
Contributor

Nit: confugured => configured

# We need to insert the content before it.
END_OF_SCRIPT = "exit 0"

# This is what we use to identify the our content for DB links..
Contributor

Nit: "the our" => "the"

datastore={}

slink=/etc/vmware/vmdkops/auth-db
shared_dbn=/vmfs/volumes/$datastore/dockvols/vmdkops_config.db
Contributor

shared_dbn => shared_db?
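
For reference, the check the template appears to intend can be written out in Python. This is a sketch under assumptions: it uses one name, `shared_db`, for the link target (the template defines `shared_dbn` but links `$shared_db`), and it tests the parent directory of the link, assuming `dirname` rather than `basename` was meant, since `basename` of the link path yields only `auth-db`. `ensure_link` is a hypothetical helper, not part of this PR.

```python
import os


def ensure_link(slink, shared_db):
    """Create slink -> shared_db if the parent directory exists and slink is absent.

    Returns True when a link was created, False otherwise.
    """
    parent = os.path.dirname(slink)
    # lexists() also reports a dangling symlink, so an existing link is never overwritten.
    if os.path.isdir(parent) and not os.path.lexists(slink):
        os.symlink(shared_db, slink)
        return True
    return False
```

With the mismatched name, the shell version would expand `$shared_db` to an empty string, so the variable defined and the variable used need to agree for the link to point at the shared DB.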


# open file and scan it.
#
# if we reached "exit 0" add the text section, add the rest of the file and be done
Contributor

Nit:

  1. Let's keep the tense consistent: reached => reach (since we are using "find" below)
  2. What does "add the rest of the file and be done" mean?

# requested content just in case.
if add:
sys.stdout.write(content)

Contributor

Nit: remove one empty line

Contributor Author

I want 2 newlines between functions.

import shutil

class TestLocalShInfo(unittest.TestCase):
""" Basic test for saving cofig DB link uising local.sh. The test checks saving
Contributor

Nit: cofig => config; uising => using; fie => file

Contributor Author

Darn, my spellchecker was off. Thanks for the catch. Fixed


# Basic test: add new content. Replace it. Remove it.
# Compare with original content - should be the same.
# Also, on neach step check some pattern in the current file
Contributor

Nit: neach => each?

else:
print("local.sh update/remove test - All good")
os.remove(name)

Contributor

Nit: remove one empty line

Contributor Author

thanks Sam, will do (will fix all nits and file an issue)

Contributor

@pshahzeb pshahzeb left a comment

Looks good.
A couple of nits and a question about the in-place update of the file.

@@ -1343,16 +1349,19 @@ def config_rm(args):
print("Removed link {}".format(link_path))
except Exception as ex:
print(" Failed to remove {}: {}".format(link_path, ex))
print("Updating /etc/rc.local.d/local.sh")
Contributor

"/etc/rc.local.d/local.sh" should this be a const?

Contributor Author

yes, good catch. will fix

# First tag - add the content (if needed) and skip till the next tag
if add:
sys.stdout.write(content)
skip_to_tag = True
Contributor

Just confirming the code flow: how does this piece work when there is already one section with the vdvs tag (datastore 1) in the file and this function is called to add another section with the vdvs tag (datastore 2)? Would the new section be written and the old section removed?

Contributor Author

yes, that's exactly the behavior

Contributor

@govint govint left a comment

  1. Can the ESX service instead use a symbolic link in /usr/lib/vmware/vmdkops instead of a link in /etc/vmware? Given the comments in local.sh regarding its usage.
  2. If needing to still use a link from /etc/vmware can the VIB install a script under /usr/lib/vmware/vmdkops/bin which handles restoring the linkage to the DB and insert an invocation of the script in local.sh - only one line addition to local.sh
  3. Having a script to handle the linkage is easier, any changes to handling links to the DB can be done to the script vs. in local_sh.py.

Contributor Author

msterin commented Jun 13, 2017

@govint

Can the ESX service instead use a symbolic link in /usr/lib/vmware/vmdkops instead of a link in /etc/vmware?

It can, but the result will be the same. The VIB is loaded into the ramdisk, and changes to the ramdisk are discarded on reboot.

If needing to still use a link from /etc/vmware can the VIB install a script under /usr/lib/vmware/vmdkops/bin which handles restoring the linkage to the DB and insert an invocation of the script in local.sh - only one line addition to local.sh

I did not catch that. Can you elaborate? Keep in mind that all locations we install via VIB are just a tar unpacked into the ramdisk, and ramdisk changes are discarded on reboot. There is no system persistence without the state.tgz backup (there are plenty of related KBs).

@msterin msterin force-pushed the persist_dblink.msterin branch 2 times, most recently from c43cd9a to b30ff67 Compare June 14, 2017 01:22
Contributor

@pshahzeb pshahzeb left a comment

LGTM

Contributor Author

msterin commented Jun 14, 2017

please do not merge yet, I need to investigate a potential issue...

@pshahzeb
Contributor

@msterin curious as to what is the potential issue?

Contributor Author

msterin commented Jun 14, 2017

On my test laptop, one of the ESXi reboots failed and it is now stuck booting. I want to investigate before any merge.

Never mind, it was a VSAN-related issue: a delay in starting vsantraced on reboot. The vDVS config worked fine.

@@ -1323,7 +1331,17 @@ def config_rm(args):
info = auth.get_info()
Contributor

Not part of the current change, but the info variable does not seem to be used.

os.remove(auth_data.AUTH_DB_PATH)
except:
pass
return None
Contributor

Shouldn't this return statement be inside the above "except auth_data.DbAccessError as ex:" block? The current indentation seems incorrect, as it will return in any case.

Contributor Author

yes, there is an indent error and it should be returning only for "except auth_data.DbAccessError as ex:". Thanks for the catch
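
To make the effect of that one indentation level concrete, here is a small sketch. It is not the `config_rm` code from this PR: `DbAccessError` is a stand-in for `auth_data.DbAccessError`, and both functions and the `open_db` callable are hypothetical.

```python
class DbAccessError(Exception):
    """Stand-in for auth_data.DbAccessError (assumption for this sketch)."""


def rm_misindented(open_db):
    try:
        open_db()
    except DbAccessError as ex:
        print("Failed to open DB: {}".format(ex))
    return None          # one level too shallow: runs on success as well
    return "continued"   # dead code, kept only to show what is skipped


def rm_fixed(open_db):
    try:
        open_db()
    except DbAccessError as ex:
        print("Failed to open DB: {}".format(ex))
        return None      # bail out only when the DB could not be opened
    return "continued"
```

With the mis-indented version the function returns before doing any further work even when opening the DB succeeds, which matches the reviewer's "it will return in any case".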

@pdhamdhere
Contributor

@msterin Please triage CI failure. This has been open for a while. Let's try to get this in asap.

Contributor Author

msterin commented Jun 20, 2017

@pdhamdhere - looks test-related. The branch was a bit behind - just rebased and resubmitted.
Will triage when CI is done

Contributor Author

msterin commented Jun 20, 2017

test fails in docker volume create, for some reason driver name does not seem to have been passed as it says "vol_access_volume_l33x03nwf9uh@datastore1" includes invalid characters for a local volume name

the failure is unrelated to the fix but concerning. @shuklanirdesh82 - any comment ?

Also @shuklanirdesh82 - there is still garbage in the logs (after cron) , I thought it was fixed ?

2017/06/20 18:48:10 Destroying volume [vol_access_volume_l33x03nwf9uh@datastore1]

----------------------------------------------------------------------
FAIL: volume_access_test.go:76: VolumeAccessTestSuite.TestAccessUpdate

volume_access_test.go:91:
    c.Assert(err, IsNil, Commentf(out))
... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc4201263e0), Stderr:[]uint8(nil)} ("exit status 125")
... docker: Error response from daemon: create vol_access_volume_l33x03nwf9uh@datastore1: "vol_access_volume_l33x03nwf9uh@datastore1" includes invalid characters for a local volume name, only "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed. If you intended to pass a host directory, use absolute path.
See 'docker run --help'.

2017/06/20 18:48:11 START: VolumeAccessTestSuite.TestAccessUpdate_R_RW
2017/06/20 18:48:11 Creating volume [vol_access_volume_f0iszigydzxy@datastore1] with options [ -o access=read-only] on VM [192.168.31.81]
2017/06/20 18:48:18 Writing message_by_host1 to file test.txt on volume [vol_access_volume_f0iszigydzxy@datastore1] from VM[192.168.31.81]
2017/06/20 18:48:23 Writing message_by_host2 to file test.txt on volume [vol_access_volume_f0iszigydzxy@datastore1] from VM[192.168.31.85]
2017/06/20 18:48:24 Destroying volume [vol_access_volume_f0iszigydzxy@datastore1]

----------------------------------------------------------------------
FAIL: volume_access_test.go:139: VolumeAccessTestSuite.TestAccessUpdate_R_RW

volume_access_test.go:154:
    c.Assert(strings.Contains(out, errorWriteVolume), Equals, true, Commentf(out))
... obtained bool = false
... expected bool = true
... docker: Error response from daemon: create vol_access_volume_f0iszigydzxy@datastore1: "vol_access_volume_f0iszigydzxy@datastore1" includes invalid characters for a local volume name, only "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed. If you intended to pass a host directory, use absolute path.
See 'docker run --help'.
'

@pshahzeb
Contributor

@msterin @shuklanirdesh82
Looking into this volume creation failure. Filed #1448 to track it.

Contributor

shuklanirdesh82 commented Jun 21, 2017

#1401 (comment)

test fails in docker volume create, for some reason driver name does not seem to have been passed as it says "vol_access_volume_l33x03nwf9uh@datastore1" includes invalid characters for a local volume name

the failure is unrelated to the fix but concerning. @shuklanirdesh82 - any comment ?

We have seen such failure locally once, I asked @pshahzeb to raise an issue to triage further (#1401 (comment)). The failure is unrelated to your PR.

Also @shuklanirdesh82 - there is still garbage in the logs (after cron) , I thought it was fixed ?

Yeah, it was fixed. I will check again what is causing this. Thanks @msterin for pointing it out.

Contributor

shuklanirdesh82 commented Jun 21, 2017

#1401 (comment)
We have seen such failure locally once, I asked @pshahzeb to raise an issue to triage further (#1401 (comment)). The failure is unrelated to your PR.

After triaging further, I found that it is a regression introduced by this PR.

The E2E test failure is tricky and does not make the root cause easy to find. The failure caused by this PR gives the false impression that volume_access_test is failing; in fact the failure is in basic_test (which fails to surface the root cause).

I had to check out the branch locally to generate the VIB; my analysis follows.

Conclusion: config rm --local --confirm does nothing and leaves the DB as it is.

steps to reproduce:

  1. config init (SingleNode mode)
  2. vmdkops_admin config status
  3. config rm --local --confirm
  4. vmdkops_admin config status <=== here is the failure: config stays as SingleNode
[root@promc-2n-dhcp105-97:~] /usr/lib/vmware/vmdkops/bin/vmdkops_admin.py config init --local
Warning: this feature is EXPERIMENTAL
Creating new DB at /etc/vmware/vmdkops/auth-db
Warning: Local configuration will not survive ESXi reboot. See KB2043564 for details

[root@promc-2n-dhcp105-97:~] /usr/lib/vmware/vmdkops/bin/vmdkops_admin.py status
=== Service: 
Version: 0.14.6732c52-0.0.1
Status: Running
Pid: 925689
Port: 1019
LogConfigFile: /etc/vmware/vmdkops/log_config.json
LogFile: /var/log/vmware/vmdk_ops.log
LogLevel: INFO
=== Authorization Config DB: 
DB_LocalPath: /etc/vmware/vmdkops/auth-db
DB_SharedLocation: N/A
DB_Mode: SingleNode (local DB exists)

[root@promc-2n-dhcp105-97:~] /usr/lib/vmware/vmdkops/bin/vmdkops_admin.py vmgroup ls
Uuid                                  Name      Description                Default_datastore  VM_list  
------------------------------------  --------  -------------------------  -----------------  -------  
11111111-1111-1111-1111-111111111111  _DEFAULT  This is a default vmgroup  _VM_DS                      

[root@promc-2n-dhcp105-97:~] /usr/lib/vmware/vmdkops/bin/vmdkops_admin.py vmgroup create --name=T1 --default-datastore=_VM_DS --vm-list=ubuntu1
vmgroup 'T1' is created. Do not forget to run 'vmgroup vm add' to add vm to vmgroup.
[root@promc-2n-dhcp105-97:~] /usr/lib/vmware/vmdkops/bin/vmdkops_admin.py vmgroup ls
Uuid                                  Name      Description                Default_datastore  VM_list  
------------------------------------  --------  -------------------------  -----------------  -------  
11111111-1111-1111-1111-111111111111  _DEFAULT  This is a default vmgroup  _VM_DS                      
55a13c06-ee8e-4435-ba2a-77894057b208  T1                                   _VM_DS             ubuntu1  

[root@promc-2n-dhcp105-97:~] /usr/lib/vmware/vmdkops/bin/vmdkops_admin.py config rm --local --confirm
[root@promc-2n-dhcp105-97:~] /usr/lib/vmware/vmdkops/bin/vmdkops_admin.py vmgroup ls
Uuid                                  Name      Description                Default_datastore  VM_list  
------------------------------------  --------  -------------------------  -----------------  -------  
11111111-1111-1111-1111-111111111111  _DEFAULT  This is a default vmgroup  _VM_DS                      
55a13c06-ee8e-4435-ba2a-77894057b208  T1                                   _VM_DS             ubuntu1  

[root@promc-2n-dhcp105-97:~] /usr/lib/vmware/vmdkops/bin/vmdkops_admin.py config status
DB_LocalPath: /etc/vmware/vmdkops/auth-db
DB_SharedLocation: N/A
DB_Mode: SingleNode (local DB exists)

[root@promc-2n-dhcp105-97:~] /usr/lib/vmware/vmdkops/bin/vmdkops_admin.py config rm --local --confirm

[root@promc-2n-dhcp105-97:~] /usr/lib/vmware/vmdkops/bin/vmdkops_admin.py config status
DB_LocalPath: /etc/vmware/vmdkops/auth-db
DB_SharedLocation: N/A
DB_Mode: SingleNode (local DB exists)

Contributor Author

msterin commented Jun 21, 2017

Thanks for the analysis.
It sounds like a basic test for "config rm" is either missing or did not catch the issue. Is it on the list for 1.6?

@shuklanirdesh82
Contributor

It sounds like a basic test for "config rm" is either missing or did not catch the issue.

The step is there, but it does not perform any validation (https://github.com/vmware/docker-volume-vsphere/blob/master/tests/e2e/basic_test.go#L197):

	// Remove Config DB
	admincli.ConfigRemove(s.esx)

	misc.LogTestEnd(c.TestName())
}

Is it on the list for 1.6?

I have just created a new issue. (#1450)
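
A validation of the kind missing from that step could inspect the `config status` output after removal. The sketch below is in Python rather than the Go of the e2e suite, and both function names are hypothetical. It relies only on the `Key: value` layout and the `SingleNode` prefix visible in the transcript above; the mode string reported after a successful removal is not shown in this thread, so the check deliberately tests only that the mode is no longer `SingleNode`.

```python
def parse_status(output):
    """Turn 'Key: value' lines into a dict; lines without ':' are ignored."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def local_db_removed(output):
    """True only if DB_Mode is reported and no longer starts with 'SingleNode'."""
    mode = parse_status(output).get("DB_Mode")
    return mode is not None and not mode.startswith("SingleNode")
```

Run against the status output captured after `config rm --local --confirm` in the analysis above, `local_db_removed` would return False, which is the signal the basic test did not check for.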

@pshahzeb
Contributor

Thanks for the analysis, @shuklanirdesh82.
I also looked at why the error message was about illegal characters during volume creation.
After going through the steps you mentioned, I think this is what is happening.

steps to reproduce:

config init (SingleNode mode)
vmdkops_admin config status
config rm --local --confirm
vmdkops_admin config status <=== here is the failure: config stays as SingleNode

Since the cleanup didn't complete, VM2 is still part of a different vmgroup (not the default one).
During the access update test:

  1. Volume v1 is created from VM1.
  2. A message is written from VM1 to a file on volume v1.
  3. The next step is to read the content of the file from VM2.
    • As part of this step, we start a container on VM2 and attach volume v1.
    • But since VM2 is still part of a different vmgroup, it can't see v1.
    • So Docker tries to create a local volume with the name (which is in the format name@datastore) on VM2.
    • The @ character is invalid in a local volume name, hence the error.

This is what I think is happening.
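
The last bullet can be checked against the character classes quoted in the Docker error message. This is a sketch, assuming the pattern must match the whole name with the second class repeated; `is_valid_local_name` is a hypothetical helper, not a Docker API.

```python
import re

# Character classes as quoted in the error message; the trailing '+' and the
# whole-string match are assumptions made for this sketch.
LOCAL_NAME = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9_.-]+")


def is_valid_local_name(name):
    """Return True if name uses only the characters the local driver allows."""
    return LOCAL_NAME.fullmatch(name) is not None
```

A name such as `vol_access_volume_l33x03nwf9uh@datastore1` fails this check because of the `@`, while the same name without the `@datastore1` suffix passes.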

Mark Sterin added 2 commits June 21, 2017 12:02
@msterin msterin force-pushed the persist_dblink.msterin branch 2 times, most recently from a46ba38 to bc15906 Compare June 21, 2017 22:13
@msterin msterin merged commit 3ed8c16 into master Jun 22, 2017
@shuklanirdesh82 shuklanirdesh82 deleted the persist_dblink.msterin branch November 10, 2017 05:31