Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Test] TestDowngradeUpgradeClusterOf3 timeout #14540

Closed
chaochn47 opened this issue Sep 30, 2022 · 7 comments · Fixed by #14657
Closed

[Test] TestDowngradeUpgradeClusterOf3 timeout #14540

chaochn47 opened this issue Sep 30, 2022 · 7 comments · Fixed by #14657

Comments

@chaochn47
Copy link
Member

chaochn47 commented Sep 30, 2022

Which github workflows are flaking?

test (linux-amd64-e2e)

Which tests are flaking?

TestDowngradeUpgradeClusterOf3

Github Action link

https://github.com/etcd-io/etcd/actions/runs/3156389973/jobs/5136053128

Reason for failure (if possible)

--- FAIL: TestDowngradeUpgradeClusterOf3 (32.84s)
logger.go:130: 2022-09-30T04:57:21.930Z	INFO	started server.	{"name": "TestDowngradeUpgradeClusterOf3-test-2", "pid": 10915}
logger.go:130: 2022-09-30T04:57:21.930Z	INFO	started server.	{"name": "TestDowngradeUpgradeClusterOf3-test-0", "pid": 10916}
logger.go:130: 2022-09-30T04:57:21.941Z	INFO	started server.	{"name": "TestDowngradeUpgradeClusterOf3-test-1", "pid": 10921}
cluster_downgrade_test.go:57: Cluster created
cluster_downgrade_test.go:59: etcdctl downgrade enable 3.5
cluster_downgrade_test.go:62: Downgrade enabled, validating if cluster is ready for downgrade
cluster_downgrade_test.go:64: failed to find match string: context deadline exceeded
logger.go:130: 2022-09-30T04:57:52.050Z	INFO	closing test cluster...

e2e test cluster logging

/home/runner/work/etcd/etcd/bin/etcd (TestDowngradeUpgradeClusterOf3-test-0) (10916): 
{
	"level": "info",
	"ts": "2022-09-30T04:57:25.934Z",
	"caller": "etcdserver/server.go:2083",
	"msg": "saved snapshot",
	"snapshot-index": 10
}
/home/runner/work/etcd/etcd/bin/etcd (TestDowngradeUpgradeClusterOf3-test-0) (10916): 
{
	"level": "info",
	"ts": "2022-09-30T04:57:25.935Z",
	"caller": "etcdserver/server.go:2113",
	"msg": "compacted Raft logs",
	"compact-index": 1
}
/home/runner/work/etcd/etcd/bin/etcd (TestDowngradeUpgradeClusterOf3-test-2) (10915): 
{
	"level": "error",
	"ts": "2022-09-30T04:57:29.929Z",
	"caller": "version/monitor.go:81",
	"msg": "Cannot downgrade cluster version, version change is not valid",
	"downgrade-version": "3.5.0",
	"cluster-version": "3.5.0",
	"minimal-server-version": "3.6.0-alpha.0",
	"stacktrace": "go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).decideClusterVersion\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:81\ngo.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateClusterVersionIfNeeded\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:54\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorClusterVersions\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2174\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2403"
}
/home/runner/work/etcd/etcd/bin/etcd (TestDowngradeUpgradeClusterOf3-test-2) (10915): 
{
	"level": "warn",
	"ts": "2022-09-30T04:57:30.863Z",
	"caller": "version/monitor.go:200",
	"msg": "remotes server has mismatching etcd version",
	"remote-member-id": "5ac8aa22f1eb4c8f",
	"current-server-version": "3.6.0",
	"target-version": "3.5.0"
}
/home/runner/work/etcd/etcd/bin/etcd (TestDowngradeUpgradeClusterOf3-test-2) (10915): 
{
	"level": "error",
	"ts": "2022-09-30T04:57:33.931Z",
	"caller": "version/monitor.go:81",
	"msg": "Cannot downgrade cluster version, version change is not valid",
	"downgrade-version": "3.5.0",
	"cluster-version": "3.5.0",
	"minimal-server-version": "3.6.0-alpha.0",
	"stacktrace": "go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).decideClusterVersion\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:81\ngo.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateClusterVersionIfNeeded\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:54\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorClusterVersions\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2174\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2403"
}
/home/runner/work/etcd/etcd/bin/etcd (TestDowngradeUpgradeClusterOf3-test-2) (10915): 
{
	"level": "warn",
	"ts": "2022-09-30T04:57:35.864Z",
	"caller": "version/monitor.go:200",
	"msg": "remotes server has mismatching etcd version",
	"remote-member-id": "72e86eb5e356a737",
	"current-server-version": "3.6.0",
	"target-version": "3.5.0"
}
/home/runner/work/etcd/etcd/bin/etcd (TestDowngradeUpgradeClusterOf3-test-2) (10915): 
{
	"level": "error",
	"ts": "2022-09-30T04:57:37.933Z",
	"caller": "version/monitor.go:81",
	"msg": "Cannot downgrade cluster version, version change is not valid",
	"downgrade-version": "3.5.0",
	"cluster-version": "3.5.0",
	"minimal-server-version": "3.6.0-alpha.0",
	"stacktrace": "go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).decideClusterVersion\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:81\ngo.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateClusterVersionIfNeeded\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:54\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorClusterVersions\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2174\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2403"
}
/home/runner/work/etcd/etcd/bin/etcd (TestDowngradeUpgradeClusterOf3-test-2) (10915): 
{
	"level": "warn",
	"ts": "2022-09-30T04:57:40.865Z",
	"caller": "version/monitor.go:200",
	"msg": "remotes server has mismatching etcd version",
	"remote-member-id": "5ac8aa22f1eb4c8f",
	"current-server-version": "3.6.0",
	"target-version": "3.5.0"
}
/home/runner/work/etcd/etcd/bin/etcd (TestDowngradeUpgradeClusterOf3-test-2) (10915): 
{
	"level": "error",
	"ts": "2022-09-30T04:57:41.934Z",
	"caller": "version/monitor.go:81",
	"msg": "Cannot downgrade cluster version, version change is not valid",
	"downgrade-version": "3.5.0",
	"cluster-version": "3.5.0",
	"minimal-server-version": "3.6.0-alpha.0",
	"stacktrace": "go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).decideClusterVersion\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:81\ngo.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateClusterVersionIfNeeded\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:54\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorClusterVersions\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2174\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2403"
}
/home/runner/work/etcd/etcd/bin/etcd (TestDowngradeUpgradeClusterOf3-test-2) (10915): 
{
	"level": "warn",
	"ts": "2022-09-30T04:57:45.866Z",
	"caller": "version/monitor.go:200",
	"msg": "remotes server has mismatching etcd version",
	"remote-member-id": "5ac8aa22f1eb4c8f",
	"current-server-version": "3.6.0",
	"target-version": "3.5.0"
}
/home/runner/work/etcd/etcd/bin/etcd (TestDowngradeUpgradeClusterOf3-test-2) (10915): 
{
	"level": "error",
	"ts": "2022-09-30T04:57:45.935Z",
	"caller": "version/monitor.go:81",
	"msg": "Cannot downgrade cluster version, version change is not valid",
	"downgrade-version": "3.5.0",
	"cluster-version": "3.5.0",
	"minimal-server-version": "3.6.0-alpha.0",
	"stacktrace": "go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).decideClusterVersion\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:81\ngo.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateClusterVersionIfNeeded\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:54\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorClusterVersions\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2174\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2403"
}
/home/runner/work/etcd/etcd/bin/etcd (TestDowngradeUpgradeClusterOf3-test-2) (10915): 
{
	"level": "error",
	"ts": "2022-09-30T04:57:49.936Z",
	"caller": "version/monitor.go:81",
	"msg": "Cannot downgrade cluster version, version change is not valid",
	"downgrade-version": "3.5.0",
	"cluster-version": "3.5.0",
	"minimal-server-version": "3.6.0-alpha.0",
	"stacktrace": "go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).decideClusterVersion\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:81\ngo.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateClusterVersionIfNeeded\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:54\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorClusterVersions\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2174\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2403"
}
/home/runner/work/etcd/etcd/bin/etcd (TestDowngradeUpgradeClusterOf3-test-2) (10915): 
{
	"level": "warn",
	"ts": "2022-09-30T04:57:50.868Z",
	"caller": "version/monitor.go:200",
	"msg": "remotes server has mismatching etcd version",
	"remote-member-id": "ca50e9357181d758",
	"current-server-version": "3.6.0",
	"target-version": "3.5.0"
}
/home/runner/work/etcd/etcd/bin/etcd (TestDowngradeUpgradeClusterOf3-test-0) (10916): 
{
	"level": "info",
	"ts": "2022-09-30T04:57:52.051Z",
	"caller": "osutil/interrupt_unix.go:66",
	"msg": "received signal; shutting down",
	"signal": "terminated"
}

Anything else we need to know?

No response

@chaochn47
Copy link
Member Author

chaochn47 commented Sep 30, 2022

Trying to reproduce

Looking into failures from last 100 runs (excluding coverage and known issues) we get failures in:

4 failures in e2e tests of TestDowngradeUpgradeClusterOf3 (example)

PASSES='build release e2e' MANUAL_VER=v3.5.0 CPU='4' EXPECT_DEBUG='true' COVER='false' RACE='true' ./scripts/test.sh 2>&1 | tee test.log

or

PASSES='build release' MANUAL_VER=v3.5.0 ./scripts/test.sh 2>&1
pushd tests/e2e
go test -v -run TestDowngradeUpgradeClusterOf3 -count=100 
popd

@xiaods
Copy link

xiaods commented Oct 4, 2022

testing on branch main, the case is not reproduce again.
use above cmd:

PASSES='build release' MANUAL_VER=v3.5.0 ./scripts/test.sh 2>&1
pushd tests/e2e
go test -v -run TestDowngradeUpgradeClusterOf3 -count=100 
popd

it works like charm, no error found.

Image_20221004165706

@tjungblu
Copy link
Contributor

tjungblu commented Oct 14, 2022

That's actually an interesting case, checkout the logs:


2022-09-30T05:04:04.9116606Z (test-2) (10915): {"level":"info","ts":"2022-09-30T04:57:21.922Z","logger":"raft","caller":"raft/raft.go:771","msg":"72e86eb5e356a737 became leader at term 2"}
...
2022-09-30T05:04:04.9151220Z (test-2) (10915): {"level":"info","ts":"2022-09-30T04:57:21.928Z","caller":"membership/cluster.go:547","msg":"set initial cluster version","cluster-id":"df14232385b4161a","local-member-id":"72e86eb5e356a737","cluster-version":"3.6"}

...

2022-09-30T05:04:04.9225356Z (test-1) (10921): {"level":"warn","ts":"2022-09-30T04:57:21.935Z","caller":"etcdserver/server.go:2154","msg":"Failed to detect schema version","error":"missing confstate information"}
2022-09-30T05:04:04.9278597Z (test-0) (10916): {"level":"info","ts":"2022-09-30T04:57:21.933Z","caller":"etcdmain/main.go:44","msg":"notifying init daemon"}
2022-09-30T05:04:04.9279983Z (test-0) (10916): {"level":"info","ts":"2022-09-30T04:57:21.935Z","caller":"etcdmain/main.go:50","msg":"successfully notified init daemon"}
2022-09-30T05:04:04.9282584Z (test-1) (10921): {"level":"error","ts":"2022-09-30T04:57:21.935Z","caller":"version/monitor.go:108","msg":"failed to update storage version","cluster-version":"3.6.0","error":"cannot detect storage schema version: missing confstate information","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateStorageVersionIfNeeded\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:108\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorStorageVersion\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2188\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2403"}
2022-09-30T05:04:04.9285120Z (test-1) (10921): {"level":"info","ts":"2022-09-30T04:57:21.935Z","caller":"etcdserver/server.go:505","msg":"initialized peer connections; fast-forwarding election ticks","local-member-id":"5ac8aa22f1eb4c8f","forward-ticks":8,"forward-duration":"800ms","election-ticks":10,"election-timeout":"1s","active-remote-members":2}
2022-09-30T05:04:04.9286975Z (test-0) (10916): {"level":"info","ts":"2022-09-30T04:57:21.944Z","caller":"etcdserver/server.go:505","msg":"initialized peer connections; fast-forwarding election ticks","local-member-id":"ca50e9357181d758","forward-ticks":8,"forward-duration":"800ms","election-ticks":10,"election-timeout":"1s","active-remote-members":2}
2022-09-30T05:04:04.9288471Z (test-0) (10916): {"level":"warn","ts":"2022-09-30T04:57:21.952Z","caller":"etcdserver/server.go:2154","msg":"Failed to detect schema version","error":"missing confstate information"}
2022-09-30T05:04:04.9295885Z (test-2) (10915): {"level":"warn","ts":"2022-09-30T04:57:25.862Z","caller":"version/monitor.go:200","msg":"remotes server has mismatching etcd version","remote-member-id":"72e86eb5e356a737","current-server-version":"3.6.0","target-version":"3.5.0"}
2022-09-30T05:04:04.9297219Z (test-2) (10915): {"level":"info","ts":"2022-09-30T04:57:25.927Z","caller":"etcdserver/server.go:2286","msg":"updating cluster version using v3 API","from":"3.6","to":"3.5"}
2022-09-30T05:04:04.9298597Z (test-2) (10915): {"level":"info","ts":"2022-09-30T04:57:25.928Z","caller":"membership/cluster.go:539","msg":"updated cluster version","cluster-id":"df14232385b4161a","local-member-id":"72e86eb5e356a737","from":"3.6","to":"3.5"}
2022-09-30T05:04:04.9299904Z (test-2) (10915): {"level":"info","ts":"2022-09-30T04:57:25.929Z","caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.5"}
2022-09-30T05:04:04.9301293Z (test-2) (10915): {"level":"info","ts":"2022-09-30T04:57:25.929Z","caller":"apply/apply.go:412","msg":"Cluster version downgrade detected, forcing snapshot","prev-cluster-version":"3.6.0","new-cluster-version":"3.5.0"}
2022-09-30T05:04:04.9302883Z (test-1) (10921): {"level":"info","ts":"2022-09-30T04:57:25.928Z","caller":"membership/cluster.go:539","msg":"updated cluster version","cluster-id":"df14232385b4161a","local-member-id":"5ac8aa22f1eb4c8f","from":"3.6","to":"3.5"}
2022-09-30T05:04:04.9304166Z (test-1) (10921): {"level":"info","ts":"2022-09-30T04:57:25.929Z","caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.5"}
2022-09-30T05:04:04.9305534Z (test-1) (10921): {"level":"info","ts":"2022-09-30T04:57:25.929Z","caller":"apply/apply.go:412","msg":"Cluster version downgrade detected, forcing snapshot","prev-cluster-version":"3.6.0","new-cluster-version":"3.5.0"}
2022-09-30T05:04:04.9307231Z (test-1) (10921): {"level":"info","ts":"2022-09-30T04:57:25.929Z","caller":"etcdserver/server.go:1140","msg":"triggering snapshot","local-member-id":"5ac8aa22f1eb4c8f","local-member-applied-index":10,"local-member-snapshot-index":0,"local-member-snapshot-count":100000,"snapshot-forced":true}
2022-09-30T05:04:04.9309199Z (test-2) (10915): {"level":"info","ts":"2022-09-30T04:57:25.929Z","caller":"etcdserver/server.go:1140","msg":"triggering snapshot","local-member-id":"72e86eb5e356a737","local-member-applied-index":10,"local-member-snapshot-index":0,"local-member-snapshot-count":100000,"snapshot-forced":true}
2022-09-30T05:04:04.9310881Z (test-0) (10916): {"level":"info","ts":"2022-09-30T04:57:25.930Z","caller":"membership/cluster.go:539","msg":"updated cluster version","cluster-id":"df14232385b4161a","local-member-id":"ca50e9357181d758","from":"3.6","to":"3.5"}
2022-09-30T05:04:04.9312252Z (test-0) (10916): {"level":"info","ts":"2022-09-30T04:57:25.930Z","caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.5"}
2022-09-30T05:04:04.9313761Z (test-0) (10916): {"level":"info","ts":"2022-09-30T04:57:25.931Z","caller":"apply/apply.go:412","msg":"Cluster version downgrade detected, forcing snapshot","prev-cluster-version":"3.6.0","new-cluster-version":"3.5.0"}
2022-09-30T05:04:04.9315167Z (test-2) (10915): {"level":"info","ts":"2022-09-30T04:57:25.929Z","caller":"etcdserver/server.go:2301","msg":"cluster version is updated","cluster-version":"3.5"}
2022-09-30T05:04:04.9316738Z (test-2) (10915): {"level":"info","ts":"2022-09-30T04:57:25.929Z","caller":"version/monitor.go:104","msg":"cluster version differs from storage version.","cluster-version":"3.5.0","storage-version":"3.6.0"}
2022-09-30T05:04:04.9318504Z (test-0) (10916): {"level":"info","ts":"2022-09-30T04:57:25.932Z","caller":"etcdserver/server.go:1140","msg":"triggering snapshot","local-member-id":"ca50e9357181d758","local-member-applied-index":10,"local-member-snapshot-index":0,"local-member-snapshot-count":100000,"snapshot-forced":true}
2022-09-30T05:04:04.9320022Z (test-1) (10921): {"level":"info","ts":"2022-09-30T04:57:25.932Z","caller":"etcdserver/server.go:2083","msg":"saved snapshot","snapshot-index":10}
2022-09-30T05:04:04.9321281Z (test-1) (10921): {"level":"info","ts":"2022-09-30T04:57:25.932Z","caller":"etcdserver/server.go:2113","msg":"compacted Raft logs","compact-index":1}
2022-09-30T05:04:04.9322714Z (test-2) (10915): {"level":"info","ts":"2022-09-30T04:57:25.934Z","caller":"schema/migration.go:64","msg":"updated storage version","new-storage-version":"3.5.0"}
2022-09-30T05:04:04.9324183Z (test-2) (10915): {"level":"info","ts":"2022-09-30T04:57:25.934Z","caller":"version/monitor.go:113","msg":"The server is ready to downgrade","target-version":"3.5.0","server-version":"3.6.0-alpha.0"}
2022-09-30T05:04:04.9325515Z (test-2) (10915): {"level":"info","ts":"2022-09-30T04:57:25.934Z","caller":"etcdserver/server.go:2083","msg":"saved snapshot","snapshot-index":10}
2022-09-30T05:04:04.9326761Z (test-2) (10915): {"level":"info","ts":"2022-09-30T04:57:25.934Z","caller":"etcdserver/server.go:2113","msg":"compacted Raft logs","compact-index":1}
2022-09-30T05:04:04.9328002Z (test-0) (10916): {"level":"info","ts":"2022-09-30T04:57:25.934Z","caller":"etcdserver/server.go:2083","msg":"saved snapshot","snapshot-index":10}
2022-09-30T05:04:04.9329322Z (test-0) (10916): {"level":"info","ts":"2022-09-30T04:57:25.935Z","caller":"etcdserver/server.go:2113","msg":"compacted Raft logs","compact-index":1}

<-- 4s

2022-09-30T05:04:04.9332586Z (test-2) (10915): {"level":"error","ts":"2022-09-30T04:57:29.929Z","caller":"version/monitor.go:81","msg":"Cannot downgrade cluster version, version change is not valid","downgrade-version":"3.5.0","cluster-version":"3.5.0","minimal-server-version":"3.6.0-alpha.0","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).decideClusterVersion\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:81\ngo.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateClusterVersionIfNeeded\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:54\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorClusterVersions\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2174\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2403"}
2022-09-30T05:04:04.9335163Z (test-2) (10915): {"level":"warn","ts":"2022-09-30T04:57:30.863Z","caller":"version/monitor.go:200","msg":"remotes server has mismatching etcd version","remote-member-id":"5ac8aa22f1eb4c8f","current-server-version":"3.6.0","target-version":"3.5.0"}
2022-09-30T05:04:04.9338498Z (test-2) (10915): {"level":"error","ts":"2022-09-30T04:57:33.931Z","caller":"version/monitor.go:81","msg":"Cannot downgrade cluster version, version change is not valid","downgrade-version":"3.5.0","cluster-version":"3.5.0","minimal-server-version":"3.6.0-alpha.0","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).decideClusterVersion\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:81\ngo.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateClusterVersionIfNeeded\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:54\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorClusterVersions\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2174\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2403"}
2022-09-30T05:04:04.9341141Z (test-2) (10915): {"level":"warn","ts":"2022-09-30T04:57:35.864Z","caller":"version/monitor.go:200","msg":"remotes server has mismatching etcd version","remote-member-id":"72e86eb5e356a737","current-server-version":"3.6.0","target-version":"3.5.0"}
2022-09-30T05:04:04.9344676Z (test-2) (10915): {"level":"error","ts":"2022-09-30T04:57:37.933Z","caller":"version/monitor.go:81","msg":"Cannot downgrade cluster version, version change is not valid","downgrade-version":"3.5.0","cluster-version":"3.5.0","minimal-server-version":"3.6.0-alpha.0","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).decideClusterVersion\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:81\ngo.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateClusterVersionIfNeeded\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:54\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorClusterVersions\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2174\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2403"}
2022-09-30T05:04:04.9347236Z (test-2) (10915): {"level":"warn","ts":"2022-09-30T04:57:40.865Z","caller":"version/monitor.go:200","msg":"remotes server has mismatching etcd version","remote-member-id":"5ac8aa22f1eb4c8f","current-server-version":"3.6.0","target-version":"3.5.0"}
2022-09-30T05:04:04.9350717Z (test-2) (10915): {"level":"error","ts":"2022-09-30T04:57:41.934Z","caller":"version/monitor.go:81","msg":"Cannot downgrade cluster version, version change is not valid","downgrade-version":"3.5.0","cluster-version":"3.5.0","minimal-server-version":"3.6.0-alpha.0","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).decideClusterVersion\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:81\ngo.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateClusterVersionIfNeeded\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:54\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorClusterVersions\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2174\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2403"}
2022-09-30T05:04:04.9353391Z (test-2) (10915): {"level":"warn","ts":"2022-09-30T04:57:45.866Z","caller":"version/monitor.go:200","msg":"remotes server has mismatching etcd version","remote-member-id":"5ac8aa22f1eb4c8f","current-server-version":"3.6.0","target-version":"3.5.0"}
2022-09-30T05:04:04.9356623Z (test-2) (10915): {"level":"error","ts":"2022-09-30T04:57:45.935Z","caller":"version/monitor.go:81","msg":"Cannot downgrade cluster version, version change is not valid","downgrade-version":"3.5.0","cluster-version":"3.5.0","minimal-server-version":"3.6.0-alpha.0","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).decideClusterVersion\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:81\ngo.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateClusterVersionIfNeeded\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:54\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorClusterVersions\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2174\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2403"}
2022-09-30T05:04:04.9360607Z (test-2) (10915): {"level":"error","ts":"2022-09-30T04:57:49.936Z","caller":"version/monitor.go:81","msg":"Cannot downgrade cluster version, version change is not valid","downgrade-version":"3.5.0","cluster-version":"3.5.0","minimal-server-version":"3.6.0-alpha.0","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).decideClusterVersion\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:81\ngo.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateClusterVersionIfNeeded\n\t/home/runner/work/etcd/etcd/server/etcdserver/version/monitor.go:54\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorClusterVersions\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2174\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1\n\t/home/runner/work/etcd/etcd/server/etcdserver/server.go:2403"}
2022-09-30T05:04:04.9363121Z (test-2) (10915): {"level":"warn","ts":"2022-09-30T04:57:50.868Z","caller":"version/monitor.go:200","msg":"remotes server has mismatching etcd version","remote-member-id":"ca50e9357181d758","current-server-version":"3.6.0","target-version":"3.5.0"}
2022-09-30T05:04:04.9364404Z (test-0) (10916): {"level":"info","ts":"2022-09-30T04:57:52.051Z","caller":"osutil/interrupt_unix.go:66","msg":"received signal; shutting down","signal":"terminated"}

<-- 30s

2022-09-30T05:04:04.9366304Z (test-0) (10916): {"level":"info","ts":"2022-09-30T04:57:52.051Z","caller":"embed/etcd.go:391","msg":"closing etcd server","name":"TestDowngradeUpgradeClusterOf3-test-0","data-dir":"/tmp/TestDowngradeUpgradeClusterOf33326995608/002","advertise-peer-urls":["http://localhost:20001"],"advertise-client-urls":["http://localhost:20000"]}

The raft request to downgrade was successfully propagated to all members:

2022-09-30T05:04:04.9298597Z (test-2) (10915): {"level":"info","ts":"2022-09-30T04:57:25.928Z","caller":"membership/cluster.go:539","msg":"updated cluster version","cluster-id":"df14232385b4161a","local-member-id":"72e86eb5e356a737","from":"3.6","to":"3.5"}
2022-09-30T05:04:04.9302883Z (test-1) (10921): {"level":"info","ts":"2022-09-30T04:57:25.928Z","caller":"membership/cluster.go:539","msg":"updated cluster version","cluster-id":"df14232385b4161a","local-member-id":"5ac8aa22f1eb4c8f","from":"3.6","to":"3.5"}
2022-09-30T05:04:04.9310881Z (test-0) (10916): {"level":"info","ts":"2022-09-30T04:57:25.930Z","caller":"membership/cluster.go:539","msg":"updated cluster version","cluster-id":"df14232385b4161a","local-member-id":"ca50e9357181d758","from":"3.6","to":"3.5"}

Then member and leader 72e86eb5e356a737 was "... ready to downgrade" after taking the snapshot:

2022-09-30T05:04:04.9324183Z (test-2) (10915): {"level":"info","ts":"2022-09-30T04:57:25.934Z","caller":"version/monitor.go:113","msg":"The server is ready to downgrade","target-version":"3.5.0","server-version":"3.6.0-alpha.0"}

The timeout is then caused by the leader being stuck checking whether downgrades are valid:

"Cannot downgrade cluster version, version change is not valid","downgrade-version":"3.5.0","cluster-version":"3.5.0","minimal-server-version":"3.6.0-alpha.0"
"remotes server has mismatching etcd version","remote-member-id":"72e86eb5e356a737","current-server-version":"3.6.0","target-version":"3.5.0"}

While the other members never reach "The server is ready to downgrade" which is what the test expected when it times out.
It seems to be some kind of race between the cluster version and downgrade version - anyone more experienced with the downgrade process that can chime in here? I'll try to see if I can repro this locally somehow...

@serathius
Copy link
Member

serathius commented Oct 14, 2022

Downgrade process it to allow temporary lowering of cluster version.

"Cannot downgrade cluster version, version change is not valid","downgrade-version":"3.5.0","cluster-version":"3.5.0","minimal-server-version":"3.6.0-alpha.0"

Shows that cluster version was already downgraded. Looks like a bug in the decideClusterVersion

func (m *Monitor) decideClusterVersion() *semver.Version {
clusterVersion := m.s.GetClusterVersion()
minimalServerVersion := m.membersMinimalServerVersion()
if clusterVersion == nil {
if minimalServerVersion != nil {
return minimalServerVersion
}
return semver.New(version.MinClusterVersion)
}
if minimalServerVersion == nil {
return nil
}
downgrade := m.s.GetDowngradeInfo()
if downgrade != nil && downgrade.Enabled {
if IsValidVersionChange(clusterVersion, downgrade.GetTargetVersion()) && IsValidVersionChange(minimalServerVersion, downgrade.GetTargetVersion()) {
return downgrade.GetTargetVersion()
}
m.lg.Error("Cannot downgrade cluster version, version change is not valid",
zap.String("downgrade-version", downgrade.TargetVersion),
zap.String("cluster-version", clusterVersion.String()),
zap.String("minimal-server-version", minimalServerVersion.String()),
)
return nil
}
if clusterVersion.LessThan(*minimalServerVersion) && IsValidVersionChange(clusterVersion, minimalServerVersion) {
return minimalServerVersion
}
return nil
}

If downgrade-version equals cluster-version then downgrade is not needed. decideClusterVersion should return nil as no downgrade is needed.

@serathius
Copy link
Member

Still it's strange that downgrade test doesn't proceed as the cluster-version was lowered as expected.

@tjungblu
Copy link
Contributor

tjungblu commented Oct 14, 2022

If downgrade-version equals cluster-version then downgrade is not needed. decideClusterVersion should return nil as no downgrade is needed.

Even in the happy path case of the test passing, it complains about that version difference for a short time before shutting down. I assume it's mostly cosmetic as of now. Should the GetDowngradeInfo get reset to nil once the downgrade is done?

Still it's strange that downgrade test doesn't proceed as the cluster-version was lowered as expected.

Just one member (the leader) printed the the ready message that's required, the other members seem stuck (?)
Could the leader hold up the members from advancing on reaching the "The server is ready to downgrade" statement somehow with that downgrade error?
EDIT: yeah it seems that way when you inject an error at that place.

Looks like totally separate code paths to me right now, so very strange indeed.

@tjungblu
Copy link
Contributor

Added #14587, but I'm highly unsure this is the right way to fix this. That sounds more like a race condition lingering somewhere than this.

fuweid added a commit to fuweid/etcd that referenced this issue Oct 30, 2022
In the TestDowngradeUpgradeCluster case, the brand-new cluster is using
simple-config-changer, which means that entries has been committed
before leader election and these entries will be applied when etcdserver
starts to receive apply-requests. The simple-config-changer will mark
the `confState` dirty and the storage backend precommit hook will update
the `confState`.

For the new cluster, the storage version is nil at the beginning. And
it will be v3.5 if the `confState` record has been committed. And it
will be >v3.5 if the `storageVersion` record has been committed.

When the new cluster is ready, the leader will set init cluster version
with v3.6.x. And then it will trigger the `monitorStorageVersion` to
update the `storageVersion` to v3.6.x. If the `confState` record has
been updated before cluster version update, we will get storageVersion
record.

If the storage backend doesn't commit in time, the
`monitorStorageVersion` won't update the version because of `cannot
detect storage schema version: missing confstate information`.

And then we file the downgrade request before next round of
`monitorStorageVersion`(per 4 second), the cluster version will be
v3.5.0 which is equal to the `UnsafeDetectSchemaVersion`'s result.
And we won't see that `The server is ready to downgrade`.

It is easy to reproduce the issue if you use cpuset or taskset to limit
in two cpus.

So, we should wait for the new cluster's storage ready before downgrade
request.

Fixes: etcd-io#14540

Signed-off-by: Wei Fu <fuweid89@gmail.com>
fuweid added a commit to fuweid/etcd that referenced this issue Nov 2, 2022
In the TestDowngradeUpgradeCluster case, the brand-new cluster is using
simple-config-changer, which means that entries has been committed
before leader election and these entries will be applied when etcdserver
starts to receive apply-requests. The simple-config-changer will mark
the `confState` dirty and the storage backend precommit hook will update
the `confState`.

For the new cluster, the storage version is nil at the beginning. And
it will be v3.5 if the `confState` record has been committed. And it
will be >v3.5 if the `storageVersion` record has been committed.

When the new cluster is ready, the leader will set init cluster version
with v3.6.x. And then it will trigger the `monitorStorageVersion` to
update the `storageVersion` to v3.6.x. If the `confState` record has
been updated before cluster version update, we will get storageVersion
record.

If the storage backend doesn't commit in time, the
`monitorStorageVersion` won't update the version because of `cannot
detect storage schema version: missing confstate information`.

And then we file the downgrade request before next round of
`monitorStorageVersion`(per 4 second), the cluster version will be
v3.5.0 which is equal to the `UnsafeDetectSchemaVersion`'s result.
And we won't see that `The server is ready to downgrade`.

It is easy to reproduce the issue if you use cpuset or taskset to limit
in two cpus.

So, we should wait for the new cluster's storage ready before downgrade
request.

Fixes: etcd-io#14540

Signed-off-by: Wei Fu <fuweid89@gmail.com>
fuweid added a commit to fuweid/etcd that referenced this issue Nov 2, 2022
In the TestDowngradeUpgradeCluster case, the brand-new cluster is using
simple-config-changer, which means that entries has been committed
before leader election and these entries will be applied when etcdserver
starts to receive apply-requests. The simple-config-changer will mark
the `confState` dirty and the storage backend precommit hook will update
the `confState`.

For the new cluster, the storage version is nil at the beginning. And
it will be v3.5 if the `confState` record has been committed. And it
will be >v3.5 if the `storageVersion` record has been committed.

When the new cluster is ready, the leader will set init cluster version
with v3.6.x. And then it will trigger the `monitorStorageVersion` to
update the `storageVersion` to v3.6.x. If the `confState` record has
been updated before cluster version update, we will get storageVersion
record.

If the storage backend doesn't commit in time, the
`monitorStorageVersion` won't update the version because of `cannot
detect storage schema version: missing confstate information`.

And then we file the downgrade request before next round of
`monitorStorageVersion`(per 4 second), the cluster version will be
v3.5.0 which is equal to the `UnsafeDetectSchemaVersion`'s result.
And we won't see that `The server is ready to downgrade`.

It is easy to reproduce the issue if you use cpuset or taskset to limit
in two cpus.

So, we should wait for the new cluster's storage ready before downgrade
request.

Fixes: etcd-io#14540

Signed-off-by: Wei Fu <fuweid89@gmail.com>
fuweid added a commit to fuweid/etcd that referenced this issue Nov 2, 2022
In the TestDowngradeUpgradeCluster case, the brand-new cluster is using
simple-config-changer, which means that entries has been committed
before leader election and these entries will be applied when etcdserver
starts to receive apply-requests. The simple-config-changer will mark
the `confState` dirty and the storage backend precommit hook will update
the `confState`.

For the new cluster, the storage version is nil at the beginning. And
it will be v3.5 if the `confState` record has been committed. And it
will be >v3.5 if the `storageVersion` record has been committed.

When the new cluster is ready, the leader will set init cluster version
with v3.6.x. And then it will trigger the `monitorStorageVersion` to
update the `storageVersion` to v3.6.x. If the `confState` record has
been updated before cluster version update, we will get storageVersion
record.

If the storage backend doesn't commit in time, the
`monitorStorageVersion` won't update the version because of `cannot
detect storage schema version: missing confstate information`.

And then we file the downgrade request before next round of
`monitorStorageVersion`(per 4 second), the cluster version will be
v3.5.0 which is equal to the `UnsafeDetectSchemaVersion`'s result.
And we won't see that `The server is ready to downgrade`.

It is easy to reproduce the issue if you use cpuset or taskset to limit
in two cpus.

So, we should wait for the new cluster's storage ready before downgrade
request.

Fixes: etcd-io#14540

Signed-off-by: Wei Fu <fuweid89@gmail.com>
fuweid added a commit to fuweid/etcd that referenced this issue Nov 2, 2022
In the TestDowngradeUpgradeCluster case, the brand-new cluster is using
simple-config-changer, which means that entries has been committed
before leader election and these entries will be applied when etcdserver
starts to receive apply-requests. The simple-config-changer will mark
the `confState` dirty and the storage backend precommit hook will update
the `confState`.

For the new cluster, the storage version is nil at the beginning. And
it will be v3.5 if the `confState` record has been committed. And it
will be >v3.5 if the `storageVersion` record has been committed.

When the new cluster is ready, the leader will set init cluster version
with v3.6.x. And then it will trigger the `monitorStorageVersion` to
update the `storageVersion` to v3.6.x. If the `confState` record has
been updated before cluster version update, we will get storageVersion
record.

If the storage backend doesn't commit in time, the
`monitorStorageVersion` won't update the version because of `cannot
detect storage schema version: missing confstate information`.

And then we file the downgrade request before next round of
`monitorStorageVersion`(per 4 second), the cluster version will be
v3.5.0 which is equal to the `UnsafeDetectSchemaVersion`'s result.
And we won't see that `The server is ready to downgrade`.

It is easy to reproduce the issue if you use cpuset or taskset to limit
in two cpus.

So, we should wait for the new cluster's storage ready before downgrade
request.

Fixes: etcd-io#14540

Signed-off-by: Wei Fu <fuweid89@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging a pull request may close this issue.

4 participants