
Flaky test TestLeasingDeleteRangeContendTxn #15352

Closed
chaochn47 opened this issue Feb 24, 2023 · 2 comments · Fixed by #15401 or #15425
Comments

@chaochn47
Member

chaochn47 commented Feb 24, 2023

Which github workflows are flaking?

Failed in the forked repo's GitHub workflow: https://github.com/chaochn47/etcd/actions/runs/4258007188/jobs/7408759584

The flaky test was already mentioned in the PR comment #14918 (comment) and it can still be reproduced now.

Which tests are flaking?

TestLeasingDeleteRangeContendTxn

Github Action link

No response

Reason for failure (if possible)

No response

Anything else we need to know?

Reproduced on a local machine:

```
cd tests
taskset -c 1 go test -v -failfast -count=100 -run TestLeasingDeleteRangeContendTxn ./integration/clientv3/lease
```
@serathius
Member

Thanks for the report and repro!

@tjungblu
Contributor

tjungblu commented Mar 3, 2023

```
leasing_test.go:1313: #0: expected [key:"key/0" create_revision:1120 mod_revision:1121 version:2 value:"123" ], got [key:"key/0" create_revision:1120 mod_revision:1120 version:1 value:"123" ]
```

Just adding another error message here; in the test run linked above, the result was empty `[]`:

```
leasing_test.go:1313: #0: expected [key:"key/0" create_revision:2296 mod_revision:2296 version:1 value:"123" ], got []
```

tjungblu added a commit to tjungblu/etcd that referenced this issue Mar 3, 2023
Fixes etcd-io#15352.
Depending on the goroutine scheduling, the expected count of 8 might not
have been reached yet. This ensures the routine won't stop earlier than
that.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
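
For illustration, here is a minimal sketch of the pattern that fix describes: a writer goroutine that keeps updating a key and only honors a stop request once it has completed a minimum number of puts. The function and parameter names (`keepUpdating`, `minUpdates`, `stopc`) are made up for this sketch and are not the identifiers used in leasing_test.go.

```go
package leasingexample

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// keepUpdating writes value to key in a loop and only honors a stop request
// once it has completed at least minUpdates successful puts, so a concurrent
// DeleteRange is guaranteed to race with live writers.
func keepUpdating(kv clientv3.KV, key, value string, minUpdates int, stopc <-chan struct{}) error {
	for i := 0; ; i++ {
		if _, err := kv.Put(context.TODO(), key, value); err != nil {
			return err
		}
		if i+1 < minUpdates {
			continue // minimum count not reached yet; ignore any stop request
		}
		select {
		case <-stopc:
			return nil
		default:
		}
	}
}
```
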
fuweid added a commit to fuweid/etcd that referenced this issue Mar 7, 2023
The TestLeasingDeleteRangeContendTxn test exercises RangeDelete while the target resources are being updated. When `txnLeasing` wants a server-side transaction, it needs to ensure that every key's mod revision is less than what it saw. If the compare fails, it retries the server-side transaction until it succeeds. I believe the test case is trying to verify how `txnLeasing` handles this race.
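
As a rough sketch of that compare-and-retry pattern (simplified and illustrative only; `deleteIfUnchanged` is a made-up name, not txnLeasing's actual code), the server-side transaction applies only while the key's mod revision still matches what the client last observed, and otherwise the client re-reads and retries:

```go
package leasingexample

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// deleteIfUnchanged deletes key only if nobody has modified it since the read
// in the same iteration; on a failed compare it re-reads and retries.
// txnLeasing's real comparison is against the revision it cached, but the
// retry-until-success shape is the same.
func deleteIfUnchanged(ctx context.Context, cli *clientv3.Client, key string) error {
	for {
		resp, err := cli.Get(ctx, key)
		if err != nil {
			return err
		}
		var modRev int64
		if len(resp.Kvs) > 0 {
			modRev = resp.Kvs[0].ModRevision
		}
		// Apply the delete only if the key's mod revision is unchanged.
		txnResp, err := cli.Txn(ctx).
			If(clientv3.Compare(clientv3.ModRevision(key), "=", modRev)).
			Then(clientv3.OpDelete(key)).
			Commit()
		if err != nil {
			return err
		}
		if txnResp.Succeeded {
			return nil
		}
		// A concurrent Put bumped the revision; retry the whole transaction.
	}
}
```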

Before patch etcd-io#15401, the resource-updating goroutine kept updating until the RangeDelete finished. The test case is flaky because the two goroutines share one `ctx`, and the grpc-go client won't wait for the response if `ctx` has been canceled.

For example,

| DelLease Goroutine   | PutLease Goroutine         | ETCD Server                    | Key/0 Status |
| --                   | ---                        | --                             | --           |
| deleted              |                            |                                | version = 0  |
|                      | send update(key/0=123) req | received update(key/0=123) req | version = 0  |
| cancel               |                            |                                | version = 0  |
|                      | exit because of cancel     |                                | version = 0  |
| get key/0 by putkv   |                            |                                | version = 0  |
|                      |                            | applied update(key/0=123)      | version = 1  |
| get key/0 by raw-cli |                            |                                | version = 1  |

So `raw-cli` gets `[key/0=123]` while `putkv` gets `[]`. If `putkv` sends two update requests to the etcd server and the last one is canceled before it is applied, the error looks like:

```
expected [key:"key/0" version:2 value:"123" ], got [key:"key/0" version:1 value:"123" ]
```

The resource-updating goroutine should not share the `ctx` with RangeDelete here. I also revert the change currently on the main branch, because with it the resource-updating goroutine only updates 8 times and might exit before `RangeDelete` runs; in that case `txnLeasing` never has to handle the race at all.

Fixes: etcd-io#15352

Signed-off-by: Wei Fu <fuweid89@gmail.com>
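
To make the fix described above concrete, here is a hedged sketch (a made-up helper, not the patch itself) of running the updater with its own lifetime instead of the shared `ctx`, and waiting for its last in-flight request before comparing the leasing cache against a raw read:

```go
package leasingexample

import "context"

// runUpdater starts a goroutine that repeatedly calls update until stop() is
// invoked, and stop() waits for the in-flight request to finish. Because the
// updater never borrows the DeleteRange caller's ctx, canceling that ctx can
// no longer abandon a Put that the server will still apply.
func runUpdater(update func(context.Context) error) (stop func()) {
	stopc := make(chan struct{})
	donec := make(chan struct{})
	go func() {
		defer close(donec)
		for {
			select {
			case <-stopc:
				return
			default:
			}
			// An independent context: unrelated cancels cannot drop the
			// client-side wait for this request.
			if err := update(context.TODO()); err != nil {
				return
			}
		}
	}()
	return func() {
		close(stopc)
		<-donec // the last update has fully completed before the caller reads back
	}
}
```
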
pchan pushed a commit to pchan/etcd that referenced this issue Mar 18, 2023
pchan pushed a commit to pchan/etcd that referenced this issue Mar 18, 2023