Segfault in SchemaRegistryBasicAuthTest.test_delete_subject, BasicAuthScaleTest.test_many_users #6903
SchemaRegistryBasicAuthTest.test_delete_subject
It looks like we dereferenced a garbage pointer or something like that. Unfortunately there are no other log messages and no backtrace to get more info from, so I'm going to submit a PR to enable debug logging for pandaproxy/schema-registry |
I may have missed something: d5f6e9f |
@NyaliaLui are you handling this? It is a segfault so needs some love. |
Fix redpanda-data#6903 Signed-off-by: Ben Pope <ben@redpanda.com>
@piyushredpanda |
It might be related, but it's not a fix yet. Well, 3 hours of running ~1500 tests on my machine was clear, but on CI maybe 10% failed. |
another one in a feature branch: |
This seems to be most repeatable during the initial connect when using sasl. The crash is usually somewhere around: redpanda/src/v/kafka/client/sasl_client.cc Line 164 in a2517c2
IME the best test for triggering it is: redpanda/tests/rptest/tests/pandaproxy_test.py Line 1461 in a2517c2
|
On the debug CI run for #7242, a different test failed in a way that looks similar to this issue: https://buildkite.com/redpanda/redpanda/builds/18462#01846631-734d-497c-b944-51a1cbf90aaf
Log from docker-rp-4, where the error occurred:
|
Yeah that's the same error. |
I think this might be another instance of this issue |
Another instance here: https://buildkite.com/redpanda/redpanda/builds/18686#0184812d-4a62-411e-aa0a-51fa424f2105 |
Another instance in a feature branch, and very similar to what @abhijat has reported:
docker-rp-3:
|
I assume @BenPope is chasing this? I'll remove other assignees so it's clear. |
FAIL test: BasicAuthScaleTest.test_many_users.num_users=500 (1/47 runs) |
FAIL test: RackAwarePlacementTest.test_replica_placement.rack_layout_str=xxYYzz.num_partitions=50.replication_factor=5.num_topics=2 (1/32 runs) |
Another instance of the failure here |
This is a segfault in SchemaRegistryAutoAuthTest.test_mixed_deletes @BenPope I'm tagging onto this issue because it seems like the general vibe is that there's some sort of SR crash buried that can affect various tests intermittently. This feels like a case where a new stress test may be needed to throw enough traffic at the system to reproduce it more reliably, if we have exhausted other avenues of investigation. |
Here's a segfault in test_post_subjects_subject_versions that I assume is the same underlying issue. FAIL test: SchemaRegistryBasicAuthTest.test_post_subjects_subject_versions (1/55 runs) |
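For a rough sense of how many repeat runs it takes to hit a failure that shows up about once in N runs (like the 1/47 and 1/55 rates reported above), here is a minimal sketch. It assumes each run fails independently with the same probability, which may well not hold on CI. The function name `runs_for_detection` is made up for illustration.

```python
import math


def runs_for_detection(failures: int, runs: int, confidence: float = 0.95) -> int:
    """Estimate how many runs are needed to see at least one failure.

    Assumption: every run fails independently with probability failures/runs.
    """
    if not 0 < failures < runs:
        raise ValueError("need 0 < failures < runs")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    p = failures / runs
    # P(at least one failure in n runs) = 1 - (1 - p)^n >= confidence
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    for failed, total in [(1, 47), (1, 32), (1, 55)]:
        print(f"{failed}/{total}: ~{runs_for_detection(failed, total)} runs")
```

At these rates the estimate comes out well above a typical nightly run count, which is consistent with the suggestion that a dedicated stress test may be needed.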
i'm working on this |
yikes i dunno what's happening here |
Effectively all logs containing instances of this crash show it happening close in time to SASL/SCRAM authentication. For instance, the following is typical. Here a pandaproxy client is authenticating with a broker, which may or may not be the current broker. Notice the reactor stall.
For a very long time we were not able to get any information other than this sanitizer output. Finally, by changing ASAN_OPTIONS to disable the normal ASAN handling of the SEGV, we were able to delegate directly to the normal fatal signal handling and get a core dump to analyze. Below I've added
Below is the backtrace for all the seastar reactor threads. Finally we start to get some hints, like this one, in which it looks like some of the reactor stall handling and AVX2 code (related to SCRAM authentication primitives like SHA-256) are interacting?
Below is the full backtrace. Threads 2-4 are doing kafka::client stuff, and the other threads seem to be asleep.
|
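The ASAN_OPTIONS change mentioned above can be sketched roughly as follows. The exact options used are not recorded in this thread, so the choice of `handle_segv=0` and `disable_coredump=0` is an assumption. Both are standard sanitizer runtime flags: the first stops ASAN from installing its own SIGSEGV handler, the second re-enables core dumps, which ASAN turns off by default on 64-bit. The helper name `asan_options_for_core_dump` is hypothetical.

```python
import os


def asan_options_for_core_dump(existing: str = "") -> str:
    """Merge core-dump-friendly overrides into an ASAN_OPTIONS string.

    Assumption: options are colon-separated key=value pairs and no value
    itself contains a colon.
    """
    overrides = {
        "handle_segv": "0",       # leave SIGSEGV to the process's own handler
        "disable_coredump": "0",  # allow the kernel to write a core file
    }
    merged = {}
    for item in existing.split(":"):
        if not item:
            continue
        key, _, value = item.partition("=")
        merged[key] = value
    merged.update(overrides)
    return ":".join(f"{key}={value}" for key, value in merged.items())


if __name__ == "__main__":
    # Build an environment for launching the binary under test.
    env = dict(os.environ)
    env["ASAN_OPTIONS"] = asan_options_for_core_dump(env.get("ASAN_OPTIONS", ""))
    print(env["ASAN_OPTIONS"])
```

The core file size limit (`ulimit -c`) and the kernel core pattern also need to allow dumps; that part is outside this sketch.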
See https://gitlab.com/gnutls/gnutls/-/issues/1111 and some release notes from Scylla
|
Version & Environment
Redpanda version: (use rpk version): dev
Link - https://buildkite.com/redpanda/redpanda/builds/17108#0184026e-77dd-4367-984f-286288a816d2/6-1398
The crash: