test: deflake TestDowngradeUpgradeClusterOf3 timeout #14657
Conversation
I think we can add validation of the storage version when handling downgrade.

Not sure what you mean. I would like to leave checking
Codecov Report
@@ Coverage Diff @@
## main #14657 +/- ##
=======================================
Coverage 75.67% 75.67%
=======================================
Files 457 457
Lines 37299 37299
=======================================
Hits 28225 28225
Misses 7317 7317
Partials 1757 1757
Hi @serathius, thanks for the review! Let me explain. When the new cluster starts, the storage version is not finalized until after leader election, so the client might see three values before the storage version has been finalized: nil at the beginning, v3.5 once the `confState` record has been committed, and >v3.5 once the `storageVersion` record has been committed (see etcd/server/storage/schema/schema.go, lines 93 to 107 at a1018db).

The storage version becomes v3.6.x once `monitorStorageVersion` updates it (see etcd/server/etcdserver/version/monitor.go, lines 106 to 131 at a1018db).

The original test case doesn't wait for the storage version to be ready, so I check the storage version to make sure that everything is ready.
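The three detection states described above can be sketched as a small model. This is a hypothetical simplification, not etcd's actual implementation (the real logic lives in `server/storage/schema/schema.go`); the function name and parameters here are assumptions for illustration only.

```go
package main

import (
	"errors"
	"fmt"
)

// detectSchemaVersion models the three states a brand-new backend can be in:
// no confState yet (detection fails), confState only (treated as v3.5), or a
// committed storageVersion record (>= v3.6).
func detectSchemaVersion(haveConfState bool, storedVersion string) (string, error) {
	if storedVersion != "" {
		// A storageVersion record was committed: return it directly.
		return storedVersion, nil
	}
	if haveConfState {
		// Only the confState record exists: treat the schema as v3.5.
		return "3.5.0", nil
	}
	// Neither record has been committed yet: detection fails.
	return "", errors.New("cannot detect storage schema version: missing confstate information")
}

func main() {
	// Walk through the three states in the order a new cluster passes through them.
	for _, tc := range []struct {
		confState bool
		stored    string
	}{
		{false, ""},     // brand-new backend, nothing committed
		{true, ""},      // confState committed, no storageVersion record
		{true, "3.6.0"}, // storageVersion record committed
	} {
		v, err := detectSchemaVersion(tc.confState, tc.stored)
		fmt.Println(v, err)
	}
}
```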
Neither this test nor a user should start a downgrade before the cluster version has been figured out during cluster bootstrap. We should update the test and wait for the cluster version to be equal to v3.6.

Yes. And we should also wait for the storage version to be equal to v3.6. This commit is meant to deflake the case.
@fuweid good catch, I thought there was a race condition somewhere. Thanks for digging this up.
Force-pushed from fc26bc3 to f96f1b0.
@serathius Sorry for the late reply. The code has been updated. PTAL.
tests/e2e/cluster_downgrade_test.go (outdated)

```go
	for {
		if expect.Server != "" {
			err = e2e.SpawnWithExpects(e2e.CURLPrefixArgs(cfg, member, "GET", e2e.CURLReq{Endpoint: "/version"}), nil, `"etcdserver":"`+expect.Server)
		err := func() error {
```
Please remove the anonymous function. There is no need for it if we immediately call it.
The `func` is used to unify the `time.Sleep` statements. Otherwise there would be several `time.Sleep` calls after each check, like:

```go
if expect.Cluster != result.Cluster {
	time.Sleep(time.Second)
	continue
}
if expect.Server != result.Server {
	time.Sleep(time.Second)
	continue
}
...
```

Just wondering, is there any reason to avoid this usage? 😂 Thanks!
I would say it obfuscates code. It's better to be explicit than implicit. There is nothing wrong with repetition.
If you still want to avoid repetition, it would be better to change the anonymous function to a normal function that takes the `expect` and `result` versions, like `func validateVersion(expect, got version.Version) error`.
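The suggested refactor could look roughly like the sketch below. The `Versions` struct here is a local stand-in for etcd's version type so the example is self-contained; the field set and error messages are assumptions, not the test's actual code.

```go
package main

import "fmt"

// Versions is a stand-in for the version struct the test compares
// (server, cluster, and storage versions reported by a member).
type Versions struct {
	Server  string
	Cluster string
	Storage string
}

// validateVersion compares expected and observed versions and returns an
// error on mismatch, so the caller owns the retry/sleep loop instead of
// sleeping inline after every check.
func validateVersion(expect, got Versions) error {
	if expect.Server != "" && expect.Server != got.Server {
		return fmt.Errorf("expected server version %q, got %q", expect.Server, got.Server)
	}
	if expect.Cluster != "" && expect.Cluster != got.Cluster {
		return fmt.Errorf("expected cluster version %q, got %q", expect.Cluster, got.Cluster)
	}
	if expect.Storage != "" && expect.Storage != got.Storage {
		return fmt.Errorf("expected storage version %q, got %q", expect.Storage, got.Storage)
	}
	return nil
}

func main() {
	expect := Versions{Server: "3.6.0", Cluster: "3.6.0"}
	fmt.Println(validateVersion(expect, Versions{Server: "3.6.0", Cluster: "3.5.0"}))
	fmt.Println(validateVersion(expect, Versions{Server: "3.6.0", Cluster: "3.6.0"}))
}
```

The caller can then write `if err := validateVersion(expect, result); err != nil { time.Sleep(time.Second); continue }` once, with a single sleep.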
Yeah, thanks for the comment. I updated it. PTAL~
LGTM
Force-pushed from f96f1b0 to 5524772.
In the TestDowngradeUpgradeCluster case, the brand-new cluster uses simple-config-changer, which means that entries have been committed before leader election and these entries will be applied when etcdserver starts to receive apply-requests. The simple-config-changer will mark the `confState` dirty, and the storage backend precommit hook will update the `confState`.

For the new cluster, the storage version is nil at the beginning. It will be v3.5 once the `confState` record has been committed, and >v3.5 once the `storageVersion` record has been committed.

When the new cluster is ready, the leader will set the init cluster version to v3.6.x. That triggers `monitorStorageVersion` to update the `storageVersion` to v3.6.x. If the `confState` record has been updated before the cluster version update, we will get a `storageVersion` record. If the storage backend doesn't commit in time, `monitorStorageVersion` won't update the version because of `cannot detect storage schema version: missing confstate information`.

If we then file the downgrade request before the next round of `monitorStorageVersion` (every 4 seconds), the cluster version will be v3.5.0, which is equal to `UnsafeDetectSchemaVersion`'s result, and we won't see `The server is ready to downgrade`.

It is easy to reproduce the issue if you use cpuset or taskset to limit the process to two CPUs.

So, we should wait for the new cluster's storage to be ready before the downgrade request.

Fixes: etcd-io#14540

Signed-off-by: Wei Fu <fuweid89@gmail.com>
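The race in the commit message can be modeled as a toy simulation. Everything below (`backend`, `monitorTick`, and the detection rules) is a hypothetical stand-in for one round of etcd's `monitorStorageVersion`, not etcd's real API; it only illustrates why a tick that fires before the `confState` commit leaves the storage version unset.

```go
package main

import (
	"errors"
	"fmt"
)

// backend models the two records whose commit order causes the flake.
type backend struct {
	confStateCommitted bool
	storageVersion     string // empty until the monitor writes it
}

// detect mirrors the three-state schema detection described above.
func (b *backend) detect() (string, error) {
	if b.storageVersion != "" {
		return b.storageVersion, nil
	}
	if b.confStateCommitted {
		return "3.5.0", nil
	}
	return "", errors.New("cannot detect storage schema version: missing confstate information")
}

// monitorTick stands in for one round of the storage-version monitor: it
// updates the storage version to the cluster version, but only if schema
// detection succeeds.
func monitorTick(b *backend, clusterVersion string) {
	if _, err := b.detect(); err != nil {
		return // detection failed; try again on the next tick
	}
	b.storageVersion = clusterVersion
}

func main() {
	b := &backend{}

	// First tick fires before the precommit hook flushed confState:
	// the monitor cannot update, and a downgrade filed now would flake.
	monitorTick(b, "3.6.0")
	fmt.Printf("after first tick:  storageVersion=%q\n", b.storageVersion)

	// The precommit hook commits confState; the next tick finalizes v3.6.
	b.confStateCommitted = true
	monitorTick(b, "3.6.0")
	fmt.Printf("after second tick: storageVersion=%q\n", b.storageVersion)
}
```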
Force-pushed from 5524772 to 3ddcb3d.
Reopened to trigger the CI. Flaky case: https://github.com/etcd-io/etcd/actions/runs/3378570366/jobs/5608948355
Thanks for looking into this. Downgrades are an important feature for v3.6, so it's great to see fixes here.