CI Failure (BadLoglines broken promise) in ManyPartitionsTest.test_many_partitions, ManyPartitionsTest.test_many_partitions_compacted #8518

Closed
ballard26 opened this issue Jan 31, 2023 · 8 comments · Fixed by #8519
Labels: area/net (Networking and RPC), ci-failure, kind/bug (Something isn't working), sev/high (loss of availability, pathological performance degradation, recoverable corruption)

@ballard26
Contributor

Child of #7405

https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
https://buildkite.com/redpanda/vtools/builds/4913#0185500b-bdaa-4cb0-b640-31259ecac3be

test_id:    rptest.scale_tests.many_partitions_test.ManyPartitionsTest.test_many_partitions_compacted
status:     FAIL
run time:   29 minutes 39.206 seconds


    <BadLogLines nodes=ip-172-31-39-189(1) example="ERROR 2022-12-26 20:52:41,574 [shard 11] rpc - server.cc:119 - Error[applying protocol] remote address: 172.31.42.51:64603 - seastar::broken_promise (broken promise)">
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 67, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1620, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-39-189(1) example="ERROR 2022-12-26 20:52:41,574 [shard 11] rpc - server.cc:119 - Error[applying protocol] remote address: 172.31.42.51:64603 - seastar::broken_promise (broken promise)">
@ballard26 ballard26 added kind/bug Something isn't working ci-failure labels Jan 31, 2023
@ballard26
Contributor Author

ballard26 commented Jan 31, 2023

I'm not sure of the exact cause of these broken promises yet (I'm trying to reproduce them with debug logging enabled). However, it looks like our internal RPC code can fail to set a promise if a connection closes before we send a reply to a request for an unknown method. And we certainly receive a lot of requests for unknown methods in this test:

ubuntu@ip-172-31-1-10:~$ grep -hro "Received a request for an unknown method [0-9]*" tests/results/latest/ManyPartitionsTest/test_many_partitions/1/RedpandaService-0-139979921641904/ | sort | uniq -c
      2 Received a request for an unknown method 1699244518
     41 Received a request for an unknown method 1702467757
      3 Received a request for an unknown method 1821185049
     23 Received a request for an unknown method 2184789260
     31 Received a request for an unknown method 2190540222
   8548 Received a request for an unknown method 2257645883
      1 Received a request for an unknown method 3105104740
 164929 Received a request for an unknown method 3279383117
  40672 Received a request for an unknown method 3615903018
     87 Received a request for an unknown method 4077993012
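
The same tally can be produced in Python, which may be convenient inside the test harness. This is a minimal sketch, assuming log lines carry the exact message text shown in the output above; the function name `count_unknown_methods` and the sample lines are illustrative, not part of rptest.

```python
import re
from collections import Counter

# Matches the message text seen in the grep output above.
UNKNOWN_METHOD_RE = re.compile(r"Received a request for an unknown method (\d+)")


def count_unknown_methods(lines):
    """Tally unknown-method requests by method ID over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = UNKNOWN_METHOD_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Illustrative input only; these counts are not taken from the CI run.
sample = [
    "Received a request for an unknown method 3279383117",
    "Received a request for an unknown method 3615903018",
    "Received a request for an unknown method 3279383117",
    "some unrelated log line",
]
for method_id, n in sorted(count_unknown_methods(sample).items()):
    print(f"{n:7d} Received a request for an unknown method {method_id}")
```

To run it over real logs, pass an open file (or a chain of files) instead of `sample`. Note that IDs are sorted numerically here, whereas `sort` in the shell pipeline orders them as strings.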

@ballard26
Contributor Author

I was able to reproduce the issue with debug logs enabled for rpc. It looks like my theory above may be the cause. I will get a PR up to fix it and see if the broken promise shows up again.

699073-DEBUG 2023-01-31 01:51:55,304 [shard  4] rpc - rpc_server.cc:159 - Received a request for an unknown method 2257645883 from 172.31.54.197:57316
699074-DEBUG 2023-01-31 01:51:55,304 [shard  4] rpc - rpc_server.cc:159 - Received a request for an unknown method 1699244518 from 172.31.54.197:57316
699801:ERROR 2023-01-31 01:51:55,308 [shard  4] rpc - server.cc:131 - Error[applying protocol] remote address: 172.31.54.197:57316 - seastar::broken_promise (broken promise)
699874-ERROR 2023-01-31 01:51:55,309 [shard  4] rpc - Error dispatching: std::__1::system_error (error system:104, read: Connection reset by peer)
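
In the excerpt above, the broken promise is logged for the same remote address (172.31.54.197:57316) that had just sent requests for unknown methods. A rough sketch of how that correlation could be checked across a whole log follows. It assumes the two message formats shown above; the function name `correlate` and both regular expressions are illustrative.

```python
import re
from collections import defaultdict

# Both patterns are assumptions based on the log lines quoted above.
UNKNOWN_RE = re.compile(r"unknown method (\d+) from (\S+)")
BROKEN_RE = re.compile(r"remote address: (\S+) - seastar::broken_promise")


def correlate(lines):
    """Map each peer that hit a broken promise to the unknown method IDs it sent."""
    unknown_by_peer = defaultdict(list)
    broken_peers = set()
    for line in lines:
        m = UNKNOWN_RE.search(line)
        if m:
            unknown_by_peer[m.group(2)].append(int(m.group(1)))
            continue
        m = BROKEN_RE.search(line)
        if m:
            broken_peers.add(m.group(1))
    return {peer: unknown_by_peer.get(peer, []) for peer in sorted(broken_peers)}


# The first three log lines from the excerpt above, without the grep prefixes.
excerpt = [
    "DEBUG 2023-01-31 01:51:55,304 [shard  4] rpc - rpc_server.cc:159 - Received a request for an unknown method 2257645883 from 172.31.54.197:57316",
    "DEBUG 2023-01-31 01:51:55,304 [shard  4] rpc - rpc_server.cc:159 - Received a request for an unknown method 1699244518 from 172.31.54.197:57316",
    "ERROR 2023-01-31 01:51:55,308 [shard  4] rpc - server.cc:131 - Error[applying protocol] remote address: 172.31.54.197:57316 - seastar::broken_promise (broken promise)",
]
print(correlate(excerpt))
```

A peer that maps to an empty list hit a broken promise without any logged unknown-method request, which would point at a different scenario.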

@dotnwat
Member

dotnwat commented Jan 31, 2023

@bharathv I think this is the same broken promise that we RCA'd on Zoom the other day?

@dotnwat dotnwat added the sev/high loss of availability, pathological performance degradation, recoverable corruption label Jan 31, 2023
@dotnwat
Member

dotnwat commented Jan 31, 2023

sev/high: broken promises are usually serious logic bugs.
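
For readers unfamiliar with the term: a promise is "broken" when it is destroyed without ever being given a value or an exception, so whoever waits on its future receives an error instead. The toy model below illustrates the kind of early-exit path described earlier in this thread. It is not Seastar or Redpanda code: `Promise`, `BrokenPromise`, `handle_request`, and the `fixed` flag are all hypothetical, and the "fixed" branch is only one possible shape of a fix, not the change made in the linked PR.

```python
from concurrent.futures import Future


class BrokenPromise(Exception):
    """Stand-in for seastar::broken_promise in this toy model."""


class Promise:
    """Hypothetical promise: releasing it unfulfilled fails its future."""

    def __init__(self):
        self.future = Future()

    def set_value(self, value):
        self.future.set_result(value)

    def release(self):
        # Models the promise going out of scope.
        if not self.future.done():
            self.future.set_exception(BrokenPromise("broken promise"))


def handle_request(method_id, known_methods, connection_open, fixed):
    """Toy request handler; returns the future tied to the reply promise."""
    promise = Promise()
    try:
        if method_id in known_methods:
            promise.set_value("ok")
        elif connection_open:
            promise.set_value("method not found")
        elif fixed:
            # Resolve the promise even though no reply can be sent.
            promise.set_value("connection closed")
        # With fixed=False this path falls through without fulfilling the promise.
    finally:
        promise.release()
    return promise.future


buggy = handle_request(2257645883, known_methods={1}, connection_open=False, fixed=False)
print(type(buggy.exception()).__name__)

repaired = handle_request(2257645883, known_methods={1}, connection_open=False, fixed=True)
print(repaired.result())
```

The point of the model is that every exit path that owns a promise has to resolve it; a path that forgets is a logic bug regardless of how rarely it is taken.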

@dotnwat
Member

dotnwat commented Jan 31, 2023

@ballard26 FYI, if this is the same broken promise that @bharathv and I worked through, then there are actually 3 separate broken_promise scenarios that can be fixed. They are enumerated here: #8074 (comment)

@piyushredpanda
Contributor

Are all three getting solved by your PR, @ballard26?

@ballard26
Contributor Author

> Are all three getting solved by your PR, @ballard26?

Yep, each of them is fixed in the PR.

@bharathv
Contributor

bharathv commented Feb 1, 2023

> fyi if the same broken promise that @bharathv and i worked through, then there are actually 3 separate broken_promise scenarios that can be fixed. they are enumerated here:

Ah yes, thanks @ballard26 for taking care of this.

@dotnwat dotnwat added area/net Networking and RPC and removed area/redpanda labels Feb 1, 2023
@jcsp jcsp changed the title from "CI Failure broken promise in (ManyPartitionsTest.test_many_partitions, ManyPartitionsTest.test_many_partitions_compacted)" to "CI Failure (BadLoglines broken promise) in ManyPartitionsTest.test_many_partitions, ManyPartitionsTest.test_many_partitions_compacted" on Feb 7, 2023
5 participants