tests: Ensure healthy cluster before and after robustness failpoint #15604

Merged (1 commit merged into etcd-io:main on Apr 4, 2023)

Conversation

@jmhbnz (Member) commented Mar 31, 2023

We need a way to verify that the cluster is healthy before and after injecting failpoints in robustness tests, so that we can surface these errors and ensure the watch does not wait indefinitely and cause the robustness suite to fail.

Fixes: #15596
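
For illustration, a minimal sketch in Go of the health verification described above, assuming hypothetical helper names (checkMemberHealth, verifyClusterHealth); this is not the exact code merged in this PR, but it shows the gRPC health-check approach discussed below:

package robustness

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// checkMemberHealth dials a single member endpoint and issues a gRPC health check.
func checkMemberHealth(ctx context.Context, endpoint string) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return fmt.Errorf("creating client for %q: %w", endpoint, err)
	}
	defer cli.Close()

	resp, err := healthpb.NewHealthClient(cli.ActiveConnection()).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		return fmt.Errorf("health check for %q failed: %w", endpoint, err)
	}
	if resp.Status != healthpb.HealthCheckResponse_SERVING {
		return fmt.Errorf("member %q is not serving: %v", endpoint, resp.Status)
	}
	return nil
}

// verifyClusterHealth checks every member endpoint; the robustness test would call it
// both before and after failpoint injection.
func verifyClusterHealth(ctx context.Context, endpoints []string) error {
	for _, ep := range endpoints {
		if err := checkMemberHealth(ctx, ep); err != nil {
			return err
		}
	}
	return nil
}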

@jmhbnz jmhbnz force-pushed the robustness-ensure-healthy-clus branch 3 times, most recently from e4286cf to 5252324 on April 1, 2023 08:59
@jmhbnz jmhbnz marked this pull request as ready for review April 1, 2023 09:16
@jmhbnz jmhbnz changed the title from "Ensure healthy cluster after robustness failpoint" to "Ensure healthy cluster before and after robustness failpoint" on Apr 1, 2023
@jmhbnz jmhbnz force-pushed the robustness-ensure-healthy-clus branch from 5252324 to bd63c0a on April 1, 2023 09:27
@jmhbnz jmhbnz force-pushed the robustness-ensure-healthy-clus branch from bd63c0a to 3f249c9 on April 1, 2023 10:23
@jmhbnz jmhbnz force-pushed the robustness-ensure-healthy-clus branch from 3f249c9 to f292ebd on April 1, 2023 18:23
@jmhbnz jmhbnz changed the title from "Ensure healthy cluster before and after robustness failpoint" to "tests: Ensure healthy cluster before and after robustness failpoint" on Apr 2, 2023
Two resolved review threads on tests/robustness/failpoints.go (one marked outdated).
@jmhbnz jmhbnz force-pushed the robustness-ensure-healthy-clus branch 5 times, most recently from 90c74f8 to e83703b on April 2, 2023 23:11
@serathius (Member) commented:

ping @ahrtr @ptabor

@ahrtr (Member) commented Apr 3, 2023

I don't understand why we need this (thanks for the effort anyway, @jmhbnz).

  • When we first start a cluster, all members must be healthy, because the e2e test framework waits for the message "ready to serve client requests".
  • When the robustness test triggers any failpoint, it also needs to guarantee that the member comes back to a healthy status afterwards. If any failpoint doesn't guarantee this, we should fix the failpoint's implementation.

@serathius (Member) commented:

Please read #15595: we injected the failpoint on one member, but other members crashed. This is unexpected and should be detected by the failpoint code, as we cannot say that failpoint injection succeeded if the cluster was unhealthy before or after.

@serathius (Member) commented Apr 3, 2023

ping @ptabor
This is required to get periodic tests to stop flaking.

@ahrtr (Member) commented Apr 3, 2023

Please read #15595: we injected the failpoint on one member, but other members crashed.

Based on the discussion in #15595, it's because the proxy layer has an issue. Shouldn't the proxy layer be fixed? Whether it's a production or a test environment, if a member crashes unexpectedly it is a critical or major issue and we should fix it. Adding more protection may not be good, because we may regard it as a flaky case and just retry, thereby hiding the real issue.

@serathius (Member) commented Apr 4, 2023

Please read #15595: we injected the failpoint on one member, but other members crashed.

Based on the discussion in #15595, it's because the proxy layer has an issue. Shouldn't the proxy layer be fixed? Whether it's a production or a test environment, if a member crashes unexpectedly it is a critical or major issue and we should fix it. Adding more protection may not be good, because we may regard it as a flaky case and just retry, thereby hiding the real issue.

No, the trigger was the proxy blackholing, but for the robustness tests the problem was that etcd followers crashed and the test didn't notice it. Because the tests do not expect the whole cluster to be down, they:

  • continued to run as normal, as the failpoint only cares about the health of the restarted member
  • hid the follower panic amongst a flood of "member not reachable" errors
  • reported the incorrect error "not enough qps"

This is an unexpected error, so the tests should not retry it but exit immediately, and that is what @jmhbnz implemented: we mark the test as failed with t.Error and cancel all the concurrent processes. I prefer a graceful shutdown here over t.Fatal, as it avoids obscuring the etcd panic and still gives us the report of operations and the db files.

Please ask about the code instead of making an incorrect assumption. The design was also discussed in #15596 (comment).
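
For illustration, a rough sketch of the pattern described above, with the health check and failpoint injection passed in as functions (hypothetical names, not the PR's exact code): on an unexpected failure the test is marked failed with t.Error and the shared context is cancelled so concurrent goroutines wind down gracefully, rather than aborting with t.Fatal.

package robustness

import (
	"context"
	"testing"
)

// injectWithHealthChecks sketches the flow: verify health, inject the failpoint,
// verify health again; on any failure, record it with t.Error and cancel the shared
// context so concurrent traffic and watch goroutines shut down gracefully.
func injectWithHealthChecks(ctx context.Context, t *testing.T, cancel context.CancelFunc,
	verifyHealth func(context.Context) error, inject func(context.Context) error) {
	if err := verifyHealth(ctx); err != nil {
		t.Errorf("cluster unhealthy before failpoint injection: %v", err)
		cancel()
		return
	}
	if err := inject(ctx); err != nil {
		t.Errorf("failpoint injection failed: %v", err)
		cancel()
		return
	}
	if err := verifyHealth(ctx); err != nil {
		t.Errorf("cluster unhealthy after failpoint injection: %v", err)
		cancel()
	}
}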

@ptabor (Contributor) left a comment

LGTM. Thank you.

I think it's better to have even redundant sources of signal and fail the tests early if anything is not going as expected.

defer clusterClient.Close()

// Issue a gRPC health check against the member over the client's active connection.
cli := healthpb.NewHealthClient(clusterClient.ActiveConnection())
resp, err := cli.Check(ctx, &healthpb.HealthCheckRequest{})
A Contributor commented:

Potentially we should have a 'helper' retrier (e.g. 3 attempts) in case of connection flakiness around such semi-gRPC code. We might monitor it for flakes... but intuitively there will be some (even though it's 'localhost' communication).

A Member replied:

If there are flakes, we are sure to discover them in the nightly tests. We can consider it as a follow-up.
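
As a possible follow-up, a minimal sketch of the retrier suggested above (illustrative only; this helper is not part of the PR): retry a flaky call a few times with a short pause between attempts.

package robustness

import (
	"context"
	"time"
)

// retryCheck retries a flaky call up to the given number of attempts,
// pausing briefly between attempts and respecting context cancellation.
func retryCheck(ctx context.Context, attempts int, check func(context.Context) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = check(ctx); err == nil {
			return nil
		}
		if i == attempts-1 {
			break
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(100 * time.Millisecond):
		}
	}
	return err
}

It could wrap the health check shown earlier, for example: retryCheck(ctx, 3, func(ctx context.Context) error { _, err := cli.Check(ctx, &healthpb.HealthCheckRequest{}); return err }).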

@serathius (Member) commented Apr 4, 2023

Oops, I wanted to fix the conflict myself, but the GitHub editor is terrible. Please rebase the PR, and sorry for the mess.
Edit: managed to rebase the PR myself.

@serathius serathius force-pushed the robustness-ensure-healthy-clus branch from f3c61ab to 3682955 on April 4, 2023 11:56
…nts.

Signed-off-by: James Blair <mail@jamesblair.net>
@serathius serathius force-pushed the robustness-ensure-healthy-clus branch from 3682955 to 1227754 on April 4, 2023 11:58
@serathius serathius merged commit f9d1249 into etcd-io:main Apr 4, 2023
Labels: None yet
Successfully merging this pull request may close these issues: Ensure healthy cluster after robustness failpoint.
6 participants