Failure in NodesDecommissioningTest.test_recommissioning_one_of_decommissioned_nodes
#6721
There is a backtrace on docker-rp-11 for that test.
Another piece of the backtrace that I missed on docker-rp-11.
I guess this is a reactor stall? The actual timeout might be unrelated to the trace.
Yeah, the last redpanda line in the backtrace is here: https://github.com/graphcareful/redpanda/blob/dev/src/v/cluster/controller_backend.cc#L169, upon the printing of a trace-level log. It doesn't look like the call to the logger contained a vector with many elements, either.
Took a second look: the test decommissions nodes 3 and 4, waits until that completes, then recommissions node 3. Looking into the logs for node 4, I observe 6 reactor stalls, each occurring in an important place.
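For readers unfamiliar with the test, the flow described above looks roughly like the sketch below. It assumes rptest's Admin helper and its decommission_broker/recommission_broker methods; the redpanda variable and the node IDs are illustrative, not the exact test code.

```python
from rptest.services.admin import Admin

# Rough, illustrative sketch of the test flow described above; the
# real NodesDecommissioningTest code differs in its details.
admin = Admin(redpanda)          # `redpanda` stands in for the cluster fixture

admin.decommission_broker(3)     # begin decommissioning node 3
admin.decommission_broker(4)     # begin decommissioning node 4

# ... wait for the decommission of both nodes to progress ...

admin.recommission_broker(3)     # bring node 3 back into the cluster
```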
It seems to me like the reactor stall prevented the cluster from executing the logic necessary to remove the node. Retries occur, but more reactor stalls follow. Anyone have an idea why so many reactor stalls are occurring within this test? Some sort of issue with the environment?
Reactor stalls on a debug build aren't necessarily surprising, but if they happen repeatedly in a particular test, that's probably a sign that there really is some pathological loop in there.
True. I'm going to close this ticket, as it seems like there is an environment issue here. Out of 5 release builds, two failed with failures in this test, and the rest failed to successfully run any ducktape tests, reporting Socket and PipeTimeouts for the entire duration. All of the debug builds have more than one reactor stall reported.
https://buildkite.com/redpanda/redpanda/builds/18525#01846f78-61b3-42e0-97f8-dd8e7e48704b is another instance; this one is in the debug build.
Re-opening; it looks like the logs from Andrea's failures are different from all of the other failed runs.
After looking at the logs, it seems there is yet another backtrace correlating to around the time I'd expect the system to be removing the decommissioned node; however, it wasn't automatically decoded:
That last failure link (job 18525) is the one and only failure in this test in the last 30 days. 503s when getting the broker list happen when the node cannot get health information.
This is a test bug. The test is sending /v1/brokers requests to any node, but that includes the decommissioned node. The decommissioned node will always respond with 503 because it cannot join the controller raft group. Other cases in this class pick a survivor node to use for admin API requests; this test should do the same (see the sketch below).
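A minimal sketch of that fix, assuming rptest's Admin helper, its get_brokers(node=...) parameter, and a node_id helper on the cluster fixture; these names are assumptions for illustration, not quotes from the actual patch:

```python
from rptest.services.admin import Admin

# Illustrative fix: send /v1/brokers requests only to a node that is
# not being decommissioned, so the test never hits the 503s that the
# decommissioned node returns.
admin = Admin(redpanda)

decommissioned_ids = {3, 4}
survivor = next(n for n in redpanda.nodes
                if redpanda.node_id(n) not in decommissioned_ids)

# Query the broker list via the surviving node only.
brokers = admin.get_brokers(node=survivor)
```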
This test could end up trying to use a decommed node's admin API to query the status of the cluster, and fail on too many 503s. Fixes redpanda-data#6721
This test could end up trying to use a decommed node's admin API to query the status of the cluster, and fail on too many 503s. Fixes redpanda-data#6721 (cherry picked from commit 6d1f9df)
Seen in https://buildkite.com/redpanda/redpanda/builds/16483#0183c7b1-90d7-435c-9c27-da7fed969939/6-8059
Specifically, PR #6639.