Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Windows] Improvement in cleanup script #4722

Merged
merged 1 commit into from
May 11, 2023

Conversation

wenyingd
Copy link
Contributor

@wenyingd wenyingd commented Mar 17, 2023

The existing script would reset ovs-vswitchd when stuck in "starting" status. In some corner cases, ovsdb-server/ovs-ovswitchd services may be removed unexpectedly.

This change includes,

  1. add improvement to recover ovsdb-server/ovs-vswitched service if if they do not exist when running the cleanup script.
  2. remove vNICs using Windows VMSwitch API rather than calling ovs-vsctl commands, this can remove the dependency on the running OVS userspace process (ovs-vswitchd).

Fix: #4721

sc.exe failure ovs-vswitchd reset= 0 actions= restart/0/restart/0/restart/0
start-service ovs-vswitchd
$OVS_VERSION=$(Get-Item $OVSInstallDir\driver\OVSExt.sys).VersionInfo.ProductVersion
ovs-vsctl --no-wait set Open_vSwitch . ovs_version=$OVS_VERSION
Copy link
Contributor

@XinShuYang XinShuYang Mar 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to set ovs_version here? Will restarting ovs service remove version info in ovsdb?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data is not lost if only restart ovs-vswitchd process/service. But if we re-create the OVS schema file ( called in line 92 when ovs-vswitchd failed ), this data is lost as we have deleted db.conf . This version is needed by in antrea-agent monitoring CR, and it may cause antrea-agent crash if not exists.

@luolanzone luolanzone added the area/OS/windows Issues or PRs related to the Windows operating system. label Mar 17, 2023
@luolanzone
Copy link
Contributor

@wenyingd @XinShuYang you can use the label area/OS/Windows to track Windows related PRs.

@codecov
Copy link

codecov bot commented Mar 19, 2023

Codecov Report

Merging #4722 (bc07e92) into main (6a7dd77) will decrease coverage by 0.05%.
The diff coverage is n/a.

❗ Current head bc07e92 differs from pull request most recent head faa6b0c. Consider uploading reports for the commit faa6b0c to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4722      +/-   ##
==========================================
- Coverage   71.40%   71.36%   -0.05%     
==========================================
  Files         409      406       -3     
  Lines       61208    63087    +1879     
==========================================
+ Hits        43706    45020    +1314     
- Misses      14547    15089     +542     
- Partials     2955     2978      +23     
Flag Coverage Δ *Carryforward flag
e2e-tests 38.34% <ø> (-0.01%) ⬇️ Carriedforward from eec86e0
integration-tests 33.81% <ø> (-0.01%) ⬇️
kind-e2e-tests 45.89% <ø> (+5.99%) ⬆️
unit-tests 62.72% <ø> (-0.36%) ⬇️

*This pull request uses carry forward flags. Click here to find out more.

see 96 files with indirect coverage changes

XinShuYang
XinShuYang previously approved these changes Mar 21, 2023
Copy link
Contributor

@XinShuYang XinShuYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

}

RemoveNetworkAdapter $OVS_BR_ADAPTER
RemoveNetworkAdapter "antrea-gw0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a commit message to clarify why vNic was removed from Windows VMSwitch and why netadapter such as antrea-gw0 doesn't need to be removed in script any more?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the commit message:

s/other than calling ovs-vsctl commands/rather than calling ovs-vsctl commands

Comment on lines +31 to +34
stop-service ovs-vswitchd
sc.exe delete ovs-vswitchd
stop-service ovsdb-server
sc.exe delete ovsdb-server
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this fail if one of the services has already been stopped or deleted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would not fail if the service is already stopped. For a nonexist service ( deleted ), yes, it may fail, but it does not block the following commands

return
}
$ovsStatus = $(Get-Service ovs-vswitchd).Status
if ("$ovsStatus" -EQ "StartPending") {
if ("$ovsStatus" -ne "Running") {
sc.exe delete ovs-vswitchd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be safer to add stop-service ovs-vswitchd before this line? Or is this guaranteed to succeed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is not necessary, as the current of the status is not "running" and stop-service does not valid impact.

return
}
$ovsStatus = $(Get-Service ovs-vswitchd).Status
if ("$ovsStatus" -EQ "StartPending") {
if ("$ovsStatus" -ne "Running") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the status case-insensitive? I ask because I notice that you use "running" instead later in the script.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Windows is case-insensitive.

if ($vmSwitch -ne $null) {
Write-Host "Remove vNICs"
Remove-VMNetworkAdapter -SwitchName $AntreaHnsNetworkName -ManagementOS -Confirm:$false -ErrorAction SilentlyContinue
$hnsNetwork = Get-HnsNetwork | ? Name -eq $AntreaHnsNetworkName
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice that we use a slightly different command in another script:

hack/windows/Prepare-AntreaAgent.ps1:$AntreaHnsNetwork = Get-HnsNetwork | Where-Object {$_.Name -eq "antrea-hnsnetwork"}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two commands are the same, ? is an alias of Where-Object.
I would update to keep consistent.

Comment on lines 112 to 113
# This might happen after the Windows host is restarted abnormally, in which case some stale configurations block
# ovs-vswitchd running, like the pid file and the misconfigurations in OVSDB.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in which case some stale configurations can prevent ovs-vswitchd from running, like a stale pid file or misconfigurations in OVSDB.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

hack/windows/Clean-AntreaNetwork.ps1 Outdated Show resolved Hide resolved
hack/windows/Clean-AntreaNetwork.ps1 Outdated Show resolved Hide resolved
@@ -6,10 +6,16 @@
OVS installation directory. It is the path argument when using Install-OVS.ps1. The default path is "C:\openvswitch".
.PARAMETER RenewIPConfig
Renew the ipconfig on the host. The default value is $false.
.PARAMETER RemoveOVS
Remove ovsdb-server and ovs-vswitchd services on the host. The default value is $false. If this argument is set as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/on the host/from the host

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

hack/windows/Clean-AntreaNetwork.ps1 Outdated Show resolved Hide resolved
@luolanzone
Copy link
Contributor

luolanzone commented May 8, 2023

Typo in summary and commit messages:
s/when running the cleaup script./when running the cleanup script.

@wenyingd wenyingd force-pushed the issue_4721 branch 2 times, most recently from ece97c5 to faa6b0c Compare May 8, 2023 03:20
Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PR description / commit message:

  • s/The existing script would reset ovs-vswitchd is stucking in "starting" status/The existing script would reset ovs-vswitchd when stuck in "starting" status - I think that's what you meant, not totally sure
  • s/if they are not existing/if they do not exist
  • s/remove the dependency of/remove the dependency on

Otherwise, LGTM

@tnqn tnqn added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label May 9, 2023
The existing script would reset ovs-vswitchd when stuck in "starting" status. In some corner cases, ovsdb-server/ovs-ovswitchd services may be removed unexpectedly.

This change includes,
- add improvement to recover ovsdb-server/ovs-vswitched service if they do not exist when running the cleanup script.
- remove vNICs using Windows VMSwitch API rather than calling ovs-vsctl commands, this can remove the dependency on the running OVS userspace process (ovs-vswitchd).

Signed-off-by: wenyingd <wenyingd@vmware.com>
@wenyingd wenyingd requested a review from XinShuYang May 10, 2023 07:50
@antoninbas
Copy link
Contributor

@wenyingd @XinShuYang is there any Jenkins test that uses this script and that we should run before merging?

@XinShuYang
Copy link
Contributor

@antoninbas No, all windows testbeds use snapshot to clean the environment.

@antoninbas
Copy link
Contributor

/skip-all

@antoninbas antoninbas merged commit 7866681 into antrea-io:main May 11, 2023
@wenyingd wenyingd deleted the issue_4721 branch May 30, 2023 06:48
ceclinux pushed a commit to ceclinux/antrea that referenced this pull request Jun 5, 2023
The existing script would reset ovs-vswitchd when stuck in "starting" status. In some corner cases, ovsdb-server/ovs-ovswitchd services may be removed unexpectedly.

This change includes,
- add improvement to recover ovsdb-server/ovs-vswitched service if they do not exist when running the cleanup script.
- remove vNICs using Windows VMSwitch API rather than calling ovs-vsctl commands, this can remove the dependency on the running OVS userspace process (ovs-vswitchd).

Signed-off-by: wenyingd <wenyingd@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/OS/windows Issues or PRs related to the Windows operating system. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clean-AntreaNetwork.ps1 doesnt recreate ovs-vswitchd if the service is terminated
5 participants