
test: deflake TestDowngradeUpgradeClusterOf3 timeout #14657

Merged

Conversation

@fuweid (Member) commented Oct 30, 2022

In the TestDowngradeUpgradeCluster case, the brand-new cluster uses simple-config-changer, which means that entries have been committed before leader election and these entries will be applied when etcdserver starts to receive apply requests. The simple-config-changer marks the `confState` dirty, and the storage backend precommit hook then updates the `confState`.

For the new cluster, the storage version is nil at the beginning. It becomes v3.5 once the `confState` record has been committed, and >v3.5 once the `storageVersion` record has been committed.

When the new cluster is ready, the leader sets the initial cluster version to v3.6.x, which triggers `monitorStorageVersion` to update the `storageVersion` to v3.6.x. If the `confState` record has been updated before the cluster version update, we will get a `storageVersion` record.

If the storage backend doesn't commit in time, `monitorStorageVersion` won't update the version because of `cannot detect storage schema version: missing confstate information`.

If we then file the downgrade request before the next round of `monitorStorageVersion` (every 4 seconds), the cluster version will be v3.5.0, which is equal to the result of `UnsafeDetectSchemaVersion`, and we won't see `The server is ready to downgrade`.

It is easy to reproduce the issue if you use cpuset or taskset to limit the process to two CPUs.

So, we should wait for the new cluster's storage to be ready before filing the downgrade request.
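
As a rough illustration (a sketch only, not the helper actually added in this PR), the wait can be implemented by polling a member's /version endpoint until both the cluster version and the storage version report the expected release; the package name and JSON field names below are assumptions for illustration.

```go
package downgradetest

import (
    "encoding/json"
    "fmt"
    "net/http"
    "strings"
    "time"
)

// memberVersions mirrors the parts of the /version response that matter here;
// the JSON keys are assumptions for illustration.
type memberVersions struct {
    Cluster string `json:"etcdcluster"`
    Storage string `json:"storage"`
}

// waitForStorageVersion polls the /version endpoint until both the cluster
// and storage versions start with the wanted release, or the timeout expires.
func waitForStorageVersion(endpoint, want string, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        resp, err := http.Get(endpoint + "/version")
        if err == nil {
            var v memberVersions
            decodeErr := json.NewDecoder(resp.Body).Decode(&v)
            resp.Body.Close()
            if decodeErr == nil && strings.HasPrefix(v.Cluster, want) && strings.HasPrefix(v.Storage, want) {
                return nil
            }
        }
        time.Sleep(time.Second)
    }
    return fmt.Errorf("cluster/storage version did not reach %s within %s", want, timeout)
}
```

The actual test change goes through the e2e helpers (curl against /version) rather than a raw HTTP client, but the idea is the same: don't file the downgrade request until the storage version has converged.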

Fixes: #14540

Signed-off-by: Wei Fu fuweid89@gmail.com


@fuweid (Member Author) commented Oct 30, 2022

I think we can add validation of the storage version when handling a downgrade request.
If the storage version is not aligned with the cluster version, we should reject the request.
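
As an illustrative sketch of that check (not etcd's actual downgrade handler; the function and its wiring are assumptions), the handler could compare the two versions and reject the request while they are misaligned:

```go
package downgrade

import (
    "fmt"

    "github.com/coreos/go-semver/semver"
)

// validateDowngradeRequest is a hypothetical guard: it rejects a downgrade
// request while the storage schema version still lags behind the cluster
// version, so the caller can retry once the monitor has caught up.
func validateDowngradeRequest(clusterVer, storageVer *semver.Version) error {
    if clusterVer == nil || storageVer == nil {
        return fmt.Errorf("cluster or storage version not yet decided; retry later")
    }
    if storageVer.Major != clusterVer.Major || storageVer.Minor != clusterVer.Minor {
        return fmt.Errorf("storage version %s is not aligned with cluster version %s; retry later", storageVer, clusterVer)
    }
    return nil
}
```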

@serathius (Member)

> If we then file the downgrade request before the next round of `monitorStorageVersion` (every 4 seconds), the cluster version will be v3.5.0, which is equal to the result of `UnsafeDetectSchemaVersion`, and we won't see `The server is ready to downgrade`.

I'm not sure what you mean. I would like to keep checking for the `The server is ready to downgrade` log, as this is currently how the user is informed that they can start downgrading members.

@codecov-commenter commented Oct 30, 2022

Codecov Report

Merging #14657 (e25090f) into main (e25090f) will not change coverage.
The diff coverage is n/a.

❗ Current head e25090f differs from pull request most recent head 3ddcb3d. Consider uploading reports for the commit 3ddcb3d to get more accurate results

```
@@           Coverage Diff           @@
##             main   #14657   +/-   ##
=======================================
  Coverage   75.67%   75.67%
=======================================
  Files         457      457
  Lines       37299    37299
=======================================
  Hits        28225    28225
  Misses       7317     7317
  Partials     1757     1757
```

| Flag | Coverage Δ |
| --- | --- |
| all | `75.67% <0.00%> (ø)` |


@fuweid (Member Author) commented Oct 31, 2022

> > If we then file the downgrade request before the next round of `monitorStorageVersion` (every 4 seconds), the cluster version will be v3.5.0, which is equal to the result of `UnsafeDetectSchemaVersion`, and we won't see `The server is ready to downgrade`.
>
> I'm not sure what you mean. I would like to keep checking for the `The server is ready to downgrade` log, as this is currently how the user is informed that they can start downgrading members.

Hi @serathius, thanks for the review! Let me explain it.

When the new cluster starts and before leader election, the client can only get the `not_decided` cluster version and the unknown storage version.
`monitorClusterVersions` then initializes the cluster version to v3.6.0, which triggers `monitorStorageVersion` to update the storage version.

`monitorStorageVersion` updates the storage version only if the storage version is nil or not aligned with the cluster version.
If the storage backend doesn't commit the `confState` in time, `monitorStorageVersion` won't update the storage version, and it will still be unknown via the /version HTTP API.
Since `monitorStorageVersion` runs periodically, the storage version will become correct eventually.
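
For illustration, the retry loop has roughly this shape (a sketch only; the interval and wiring here are simplifications, not etcd's actual monitor code):

```go
package monitor

import "time"

// runStorageVersionMonitor keeps invoking the update function (think of
// UpdateStorageVersionIfNeeded, shown further below) until the server stops,
// so the storage version converges eventually even if an earlier attempt
// failed because the confState had not been committed yet.
func runStorageVersionMonitor(stop <-chan struct{}, update func()) {
    ticker := time.NewTicker(4 * time.Second) // period as described above
    defer ticker.Stop()
    for {
        select {
        case <-stop:
            return
        case <-ticker.C:
            update()
        }
    }
}
```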

So, the client might see three values before the storage version has been finalized: unknown, v3.5.0, and v3.6.0.
The value v3.5.0 appears after the `confState` has been committed and before `monitorStorageVersion` updates the storage version.
It is an intermediate state.

```go
func UnsafeDetectSchemaVersion(lg *zap.Logger, tx backend.ReadTx) (v semver.Version, err error) {
    vp := UnsafeReadStorageVersion(tx)
    if vp != nil {
        return *vp, nil
    }

    confstate := UnsafeConfStateFromBackend(lg, tx)
    if confstate == nil {
        return v, fmt.Errorf("missing confstate information")
    }

    _, term := UnsafeReadConsistentIndex(tx)
    if term == 0 {
        return v, fmt.Errorf("missing term information")
    }
    return version.V3_5, nil
}
```

The storage version becomes v3.6.0 only if the cluster version is v3.6.0.
If the cluster version has been updated to v3.5.0 by the Downgrade request while the storage version is still v3.5.0, `monitorStorageVersion` won't update the storage version, because the storage version is already equal to the cluster version. So we won't see the log `The server is ready to downgrade`.

```go
func (m *Monitor) UpdateStorageVersionIfNeeded() {
    cv := m.s.GetClusterVersion()
    if cv == nil {
        return
    }
    sv := m.s.GetStorageVersion()

    if sv == nil || sv.Major != cv.Major || sv.Minor != cv.Minor {
        if sv != nil {
            m.lg.Info("cluster version differs from storage version.", zap.String("cluster-version", cv.String()), zap.String("storage-version", sv.String()))
        }
        err := m.s.UpdateStorageVersion(semver.Version{Major: cv.Major, Minor: cv.Minor})
        if err != nil {
            m.lg.Error("failed to update storage version", zap.String("cluster-version", cv.String()), zap.Error(err))
            return
        }
        d := m.s.GetDowngradeInfo()
        if d != nil && d.Enabled {
            m.lg.Info(
                "The server is ready to downgrade",
                zap.String("target-version", d.TargetVersion),
                zap.String("server-version", version.Version),
            )
        }
    }
}
```

The original test case doesn't wait for the storage version to be ready, so I check the storage version to make sure that everything is ready.
I was thinking that it is hard to check the server log to confirm that the downgrade is ready; using the /version API is easier for the operator.
Does it make sense?

@serathius (Member) commented Oct 31, 2022

Neither this test nor the user should start a downgrade before the cluster version has been figured out during cluster bootstrap. We should update the test and wait for the cluster version to equal v3.6.

> I was thinking that it is hard to check the server log to confirm that the downgrade is ready; using the /version API is easier for the operator.

I can agree with that; however, the process should also be visible in the log, so that an administrator who is sshed in knows when they can start replacing members.

@fuweid (Member Author) commented Oct 31, 2022

> Neither this test nor the user should start a downgrade before the cluster version has been figured out during cluster bootstrap. We should update the test and wait for the cluster version to equal v3.6.

Yes. And we should also wait for the storage version to equal v3.6. This commit is meant to deflake the case.

@tjungblu (Contributor)

> the brand-new cluster uses simple-config-changer, which means that entries have been committed before leader election and these entries will be applied when etcdserver starts to receive apply requests.

@fuweid good catch, I thought there was a race condition somewhere. Thanks for digging this up.

@fuweid force-pushed the test-fix-TestDowngradeUpgradeClusterOf3 branch 2 times, most recently from fc26bc3 to f96f1b0 on November 2, 2022 06:59
@fuweid (Member Author) commented Nov 2, 2022

@serathius Sorry for the late reply. The code has been updated. PTAL.

```go
for {
    if expect.Server != "" {
        err = e2e.SpawnWithExpects(e2e.CURLPrefixArgs(cfg, member, "GET", e2e.CURLReq{Endpoint: "/version"}), nil, `"etcdserver":"`+expect.Server)
        err := func() error {
```
Member

Please remove the anonymous function. There is no need for it if we immediately call it.

Member Author

The func is used to unify the time.Sleep statement. Without it, there would be several time.Sleep blocks after each mismatch check, like:

```go
if expect.Cluster != result.Cluster {
    time.Sleep(time.Second)
    continue
}
if expect.Server != result.Server {
    time.Sleep(time.Second)
    continue
}
// ...
```

Just wondering whether there is any reason to avoid this usage? 😂 Thanks!

Member

I would say it obfuscates code. It's better to be explicit than implicit. There is nothing wrong with repetition.

@serathius (Member) commented Nov 2, 2022

If you still want to avoid repetition, it would be better to change the anonymous function into a named function that takes the expected and result versions, like:

```go
func validateVersion(expect, got version.Version) error
```
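
For illustration, such a helper could look roughly like this (the Versions struct and its fields are stand-ins for the real e2e test types, not the actual code in this PR):

```go
package e2etest

import "fmt"

// Versions is a placeholder for the struct returned by the /version check
// in the test; only the fields compared below matter for this sketch.
type Versions struct {
    Cluster string
    Server  string
    Storage string
}

// validateVersion reports the first mismatch between the expected and
// returned versions; empty expected fields are treated as "don't care".
func validateVersion(expect, got Versions) error {
    if expect.Cluster != "" && expect.Cluster != got.Cluster {
        return fmt.Errorf("cluster version mismatch: want %q, got %q", expect.Cluster, got.Cluster)
    }
    if expect.Server != "" && expect.Server != got.Server {
        return fmt.Errorf("server version mismatch: want %q, got %q", expect.Server, got.Server)
    }
    if expect.Storage != "" && expect.Storage != got.Storage {
        return fmt.Errorf("storage version mismatch: want %q, got %q", expect.Storage, got.Storage)
    }
    return nil
}
```

The retry loop can then call the helper and sleep on a non-nil error instead of repeating the comparison blocks.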

Member Author

Yeah. Thanks for the comment. I updated it. PTAL~

@serathius (Member) left a comment

LGTM

@fuweid force-pushed the test-fix-TestDowngradeUpgradeClusterOf3 branch from f96f1b0 to 5524772 on November 2, 2022 13:25
@fuweid force-pushed the test-fix-TestDowngradeUpgradeClusterOf3 branch from 5524772 to 3ddcb3d on November 2, 2022 14:50
@fuweid closed this on Nov 2, 2022
@fuweid reopened this on Nov 2, 2022
@fuweid (Member Author) commented Nov 2, 2022

Reopened to trigger the CI.

Flake case: https://github.com/etcd-io/etcd/actions/runs/3378570366/jobs/5608948355

@serathius merged commit 7ed4eda into etcd-io:main on Nov 2, 2022
@serathius (Member)

Thanks for looking into this. Downgrades are an important feature for v3.6, so it's great to see fixes here.

@fuweid deleted the test-fix-TestDowngradeUpgradeClusterOf3 branch on November 2, 2022 16:21