tests: Ensure healthy cluster before and after robustness failpoint #15604

Merged (1 commit merged into etcd-io:main on Apr 4, 2023)

Conversation

@jmhbnz (Member) commented Mar 31, 2023

We need a way to verify that the cluster is healthy before and after injecting failpoints in robustness tests, so that we can surface these errors and ensure the watch does not wait indefinitely and cause the robustness suite to fail.

Fixes: #15596
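
For illustration, a minimal sketch in Go of the health verification described above, assuming hypothetical helper names (checkMemberHealth, verifyClusterHealth); this is not the exact code merged in this PR, but it shows the gRPC health-check approach discussed below:

package robustness

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// checkMemberHealth dials a single member endpoint and issues a gRPC health check.
func checkMemberHealth(ctx context.Context, endpoint string) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return fmt.Errorf("creating client for %q: %w", endpoint, err)
	}
	defer cli.Close()

	resp, err := healthpb.NewHealthClient(cli.ActiveConnection()).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		return fmt.Errorf("health check for %q failed: %w", endpoint, err)
	}
	if resp.Status != healthpb.HealthCheckResponse_SERVING {
		return fmt.Errorf("member %q is not serving: %v", endpoint, resp.Status)
	}
	return nil
}

// verifyClusterHealth checks every member endpoint; the robustness test would call it
// both before and after failpoint injection.
func verifyClusterHealth(ctx context.Context, endpoints []string) error {
	for _, ep := range endpoints {
		if err := checkMemberHealth(ctx, ep); err != nil {
			return err
		}
	}
	return nil
}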

@jmhbnz jmhbnz force-pushed the robustness-ensure-healthy-clus branch 3 times, most recently from e4286cf to 5252324 on April 1, 2023 08:59
@jmhbnz jmhbnz marked this pull request as ready for review April 1, 2023 09:16
@jmhbnz jmhbnz changed the title from "Ensure healthy cluster after robustness failpoint" to "Ensure healthy cluster before and after robustness failpoint" on Apr 1, 2023
@jmhbnz jmhbnz force-pushed the robustness-ensure-healthy-clus branch from 5252324 to bd63c0a on April 1, 2023 09:27
@jmhbnz jmhbnz force-pushed the robustness-ensure-healthy-clus branch from bd63c0a to 3f249c9 on April 1, 2023 10:23
@jmhbnz jmhbnz force-pushed the robustness-ensure-healthy-clus branch from 3f249c9 to f292ebd on April 1, 2023 18:23
@jmhbnz jmhbnz changed the title from "Ensure healthy cluster before and after robustness failpoint" to "tests: Ensure healthy cluster before and after robustness failpoint" on Apr 2, 2023
Two resolved review threads on tests/robustness/failpoints.go (one marked outdated).
@jmhbnz jmhbnz force-pushed the robustness-ensure-healthy-clus branch 5 times, most recently from 90c74f8 to e83703b on April 2, 2023 23:11
@serathius (Member) commented:

ping @ahrtr @ptabor

@ahrtr (Member) commented Apr 3, 2023

I don't understand why we need this (thanks for the effort anyway, @jmhbnz).

  • When we first start a cluster, all members must be healthy, because the e2e test framework waits for the message "ready to serve client requests".
  • When the robustness test triggers any failpoint, it also needs to guarantee that the member comes back to a healthy status afterwards. If any failpoint doesn't guarantee this, we should fix the failpoint's implementation.

@serathius (Member) commented:

Please read #15595: we injected the failpoint on one member, but other members crashed. This is unexpected and should be detected by the failpoint code, as we cannot say that failpoint injection succeeded if the cluster was unhealthy before or after.

@serathius (Member) commented Apr 3, 2023

ping @ptabor
This is required to get periodic tests to stop flaking.

@ahrtr (Member) commented Apr 3, 2023

Please read #15595: we injected the failpoint on one member, but other members crashed.

Based on the discussion in #15595, it's because the proxy layer has an issue. Shouldn't the proxy layer be fixed? Whether it's a production or a test environment, if a member crashes unexpectedly it is a critical or major issue and we should fix it. Adding more protection may not be good, because we may regard it as a flaky case and just retry, thereby hiding the real issue.

@serathius (Member) commented Apr 4, 2023

Please read #15595: we injected the failpoint on one member, but other members crashed.

Based on the discussion in #15595, it's because the proxy layer has an issue. Shouldn't the proxy layer be fixed? Whether it's a production or a test environment, if a member crashes unexpectedly it is a critical or major issue and we should fix it. Adding more protection may not be good, because we may regard it as a flaky case and just retry, thereby hiding the real issue.

No, the trigger was the proxy blackholing, but for the robustness tests the problem was that etcd followers crashed and the test didn't notice it. Because the tests do not expect the whole cluster to be down, they:

  • continued to run as normal, as the failpoint only cares about the health of the restarted member
  • hid the follower panic amongst a flood of "member not reachable" errors
  • reported the incorrect error "not enough qps"

This is an unexpected error, so the tests should not retry it but exit immediately, and that is what @jmhbnz implemented: we mark the test as failed with t.Error and cancel all the concurrent processes. I prefer a graceful shutdown here over t.Fatal, as it avoids obscuring the etcd panic and still gives us the report of operations and the db files.

Please ask about the code instead of making an incorrect assumption. The design was also discussed in #15596 (comment).
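
For illustration, a rough sketch of the pattern described above, with the health check and failpoint injection passed in as functions (hypothetical names, not the PR's exact code): on an unexpected failure the test is marked failed with t.Error and the shared context is cancelled so concurrent goroutines wind down gracefully, rather than aborting with t.Fatal.

package robustness

import (
	"context"
	"testing"
)

// injectWithHealthChecks sketches the flow: verify health, inject the failpoint,
// verify health again; on any failure, record it with t.Error and cancel the shared
// context so concurrent traffic and watch goroutines shut down gracefully.
func injectWithHealthChecks(ctx context.Context, t *testing.T, cancel context.CancelFunc,
	verifyHealth func(context.Context) error, inject func(context.Context) error) {
	if err := verifyHealth(ctx); err != nil {
		t.Errorf("cluster unhealthy before failpoint injection: %v", err)
		cancel()
		return
	}
	if err := inject(ctx); err != nil {
		t.Errorf("failpoint injection failed: %v", err)
		cancel()
		return
	}
	if err := verifyHealth(ctx); err != nil {
		t.Errorf("cluster unhealthy after failpoint injection: %v", err)
		cancel()
	}
}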

@ptabor (Contributor) left a comment

LGTM. Thank you.

I think it's better to have even redundant sources of signal and fail the tests early if anything is not going as expected.

defer clusterClient.Close()

// Issue a gRPC health check against the member over the client's active connection.
cli := healthpb.NewHealthClient(clusterClient.ActiveConnection())
resp, err := cli.Check(ctx, &healthpb.HealthCheckRequest{})
A Contributor commented:

Potentially we should have a 'helper' retrier (e.g. 3 attempts) in case of connection flakiness around such semi-gRPC code. We might monitor it for flakes... but intuitively there will be some (even though it's 'localhost' communication).

A Member replied:

If there are flakes, we are sure to discover them in the nightly tests. We can consider it as a follow-up.
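
As a possible follow-up, a minimal sketch of the retrier suggested above (illustrative only; this helper is not part of the PR): retry a flaky call a few times with a short pause between attempts.

package robustness

import (
	"context"
	"time"
)

// retryCheck retries a flaky call up to the given number of attempts,
// pausing briefly between attempts and respecting context cancellation.
func retryCheck(ctx context.Context, attempts int, check func(context.Context) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = check(ctx); err == nil {
			return nil
		}
		if i == attempts-1 {
			break
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(100 * time.Millisecond):
		}
	}
	return err
}

It could wrap the health check shown earlier, for example: retryCheck(ctx, 3, func(ctx context.Context) error { _, err := cli.Check(ctx, &healthpb.HealthCheckRequest{}); return err }).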

@serathius (Member) commented Apr 4, 2023

Oops, I wanted to fix the conflict myself, but the GitHub editor is terrible. Please rebase the PR, and sorry for the mess.
Edit: managed to rebase the PR myself.

@serathius serathius force-pushed the robustness-ensure-healthy-clus branch from f3c61ab to 3682955 on April 4, 2023 11:56
…nts.

Signed-off-by: James Blair <mail@jamesblair.net>
@serathius serathius force-pushed the robustness-ensure-healthy-clus branch from 3682955 to 1227754 on April 4, 2023 11:58
@serathius serathius merged commit f9d1249 into etcd-io:main Apr 4, 2023
Labels: None yet
Successfully merging this pull request may close these issues: Ensure healthy cluster after robustness failpoint.
6 participants