CI failure: SchemaRegistryAutoAuthTest.test_restarts #11141
Initial observations

The problem occurred during a list_offsets request.
Looking at the code, there are only two places in the schema registry where we issue list_offsets requests:
ss::future<server::reply_t>
get_subjects(server::request_t rq, server::reply_t rp) {
parse_accept_header(rq, rp);
auto inc_del{
parse::query_param<std::optional<include_deleted>>(*rq.req, "deleted")
.value_or(include_deleted::no)};
rq.req.reset();
// List-type request: must ensure we see latest writes
co_await rq.service().writer().read_sync(); // <-- problem could be here
auto subjects = co_await rq.service().schema_store().get_subjects(inc_del);
auto json_rslt{json::rjson_serialize(subjects)};
rp.rep->write_body("json", json_rslt);
co_return rp;
}
if (!valid) {
// It's possible that the schema registry doesn't have a newly
// written schema, update and retry.
co_await _api->_sequencer.local().read_sync(); // <-- problem could be here
valid = co_await ss::visit(
slice, [this](const auto& d) { return validate(d); });
} In Conclusion, I believe this failure happened because of a failed Misc. notesWe also see authorization fail early on when creating the _schemas topic
Spoke with @BenPope; server-side schema validation is likely not the culprit. We believe that the principal was somehow lost at some point during runtime because:

- We see that a metadata update was successful.
- Then, immediately after, we notice that the SASL handshake was successful.
- And then, immediately after, the list_offsets request occurs and we get an authorization failure.

Therefore, more investigation is needed with the idea that either the principal was lost or something happened to the client connection between the calls to update metadata, the SASL handshake, and list offsets. Perhaps something was moved into a lambda when it should not have been moved.
Furthermore, from the DT and RP logs, it appears that this failure happens around the time that two events occur:
When a broker is restarted, it re-populates its local instance of the SCRAM credential store by replaying the controller log or after receiving updates from the controller leader. Therefore, it seems like this is a timing issue between the Kafka broker replaying the controller log and schema registry startup resetting ACLs. I'm going to run tests overnight to see if this reproduces the problem. I am assigning myself to this issue.
Thanks to @michael-redpanda, I was able to reliably reproduce this error locally. I also added some logging to the line in the schema registry service that creates ACLs for the ephemeral user:

auto err_vec = co_await _controller->get_security_frontend().local().create_acls(
{security::acl_binding{
security::resource_pattern{
security::resource_type::topic,
model::schema_registry_internal_tp.topic,
security::pattern_type::literal},
security::acl_entry{
principal,
security::acl_host::wildcard_host(),
security::acl_operation::all,
security::acl_permission::allow}}},
5s);
// Added in the below
for (auto& err : err_vec) {
if (err != cluster::errc::success) {
vlog(plog.info, "Failed to create ACLs, User {}, err {} - {}", principal, err, cluster::make_error_code(err).message());
}
}

The results of the test run show:
Points (1) and (2) reveal that this is a timing issue between replaying the controller log and the ListOffsets request. After a restart, Kafka brokers become aware of any preset ACLs by reloading those permissions from the controller log. Therefore, node 3 is reporting authorization failures before it has finished replaying the log.

I see two potential solutions:
IMO, (1) is just a bandaid while (2) is the better solution; however, I am hesitant to implement (2) because it may introduce new startup problems. So I am currently testing to see which approach works.
It is possible that the call to create ACLs fails. For example, there may not be a controller leader yet. This commit will auto-retry the request after a few seconds of sleep. Fixes redpanda-data#11141
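The retry policy this commit describes can be sketched as a simple loop. This is a minimal illustration, not Redpanda's actual code: the function name `retry_create_acls`, its parameters, and the use of a blocking sleep are all assumptions standing in for the real seastar-based implementation.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical sketch: keep attempting the ACL-creation step until it
// succeeds or we exhaust the allowed attempts, sleeping between tries.
// A failed attempt models e.g. "no controller leader yet".
bool retry_create_acls(
  const std::function<bool()>& attempt_create_acls,
  int max_attempts,
  std::chrono::milliseconds sleep_between) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (attempt_create_acls()) {
            return true; // ACLs created successfully
        }
        // Transient failure: wait and try again.
        std::this_thread::sleep_for(sleep_between);
    }
    return false; // give up after max_attempts
}
```

In the real code the sleep would be a non-blocking `co_await ss::sleep(...)` so the reactor is not stalled between attempts.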
FAIL test: SchemaRegistryAutoAuthTest.test_restarts (1/13 runs)
It is possible that the call to create ACLs fails. For example, there may not be a controller leader yet. This commit will auto-retry the request after a small sleep. This commit also puts schema registry startup in the background so the sleeps do not block startup. Fixes redpanda-data#11141
It is possible that the call to create ACLs fails. For example, there may not be a controller leader yet. This commit will auto-retry the request after a small sleep. Fixes redpanda-data#11141
The other half of the problem in redpanda-data#11141 is that it is possible for a REST request to come in before ACLs have replicated to the rest of the cluster. This will lead to authorization failures. The solution is to increase the retry backoff in the internal kafka client.
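The backoff change described above can be illustrated with a capped exponential delay calculation. This is a sketch under assumptions: the function name `retry_backoff` and the doubling-with-cap shape are illustrative, not the internal kafka client's actual retry logic or constants.

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical backoff schedule: double the delay on each retry,
// capping both the shift amount and the resulting delay. A larger
// base delay gives the cluster more time to replicate ACLs before
// the client retries the failed request.
std::chrono::milliseconds retry_backoff(
  int retry,
  std::chrono::milliseconds base,
  std::chrono::milliseconds cap) {
    auto delay = std::chrono::milliseconds(base.count() << std::min(retry, 10));
    return std::min(delay, cap);
}
```

With a base of 100 ms, retries wait 100 ms, 200 ms, 400 ms, and so on, up to the cap; raising the base shifts the whole schedule later.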
When AutoAuth is enabled, the ACL check could fail because: 1. The internal client used a random string as the username. 2. The ACL check compared both principal type and name. Point (1) is a problem because, when the client issues a Kafka request, the Kafka server treats the username as the client's principal. This led to an ACL failure in acl.cc::36 because the credential store had the ephemeral principal but the client had a random string. Point (2) is a problem because comparing principals checks for a matching principal name and type. This led to an ACL failure in acl.cc::36 because the credential store had principal type ephemeral but the client had principal type user. Therefore, this commit forces the internal client to use the ephemeral principal name as the username and modifies the ACL check to only compare principal names. Fixes redpanda-data#11141
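The comparison change in point (2) can be sketched as follows. The types and function names here (`principal_type`, `acl_principal`, `match_strict`, `match_name_only`) are simplified stand-ins, not Redpanda's actual security types.

```cpp
#include <string>

// Simplified model: each principal carries a type and a name.
enum class principal_type { user, ephemeral_user };

struct acl_principal {
    principal_type type;
    std::string name;
};

// Old behavior: both type and name must match, so an ephemeral
// credential in the store never matches the same name presented
// with type `user` by the client.
bool match_strict(const acl_principal& a, const acl_principal& b) {
    return a.type == b.type && a.name == b.name;
}

// Fixed behavior: only the name is compared.
bool match_name_only(const acl_principal& a, const acl_principal& b) {
    return a.name == b.name;
}
```

Under the old check, a store entry of type ephemeral and a client principal of type user with the same name fail to match, which is exactly the ACL failure described above.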
Previously, the security manager was inserting ephemeral credentials into the snapshot that is reloaded on controller log replay. This is a problem because, upon log replay, RP will reload an ephemeral credential that has a username and password but no principal. This leads to authorization failures. Therefore, this commit stops the security manager from inserting ephemeral credentials into a snapshot. Fixes redpanda-data#11141
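The snapshot fix can be sketched as a filter over the credential store. These are simplified stand-in types, not Redpanda's actual snapshot code: `scram_credential` here is reduced to a username plus an ephemeral flag.

```cpp
#include <string>
#include <vector>

// Simplified model of a stored credential.
struct scram_credential {
    std::string username;
    bool ephemeral; // true for auto-generated internal credentials
};

// Build the snapshot, skipping ephemeral credentials so that log
// replay never reloads a username/password with no backing principal.
std::vector<scram_credential>
snapshot_credentials(const std::vector<scram_credential>& store) {
    std::vector<scram_credential> snapshot;
    for (const auto& cred : store) {
        if (cred.ephemeral) {
            continue; // do not persist ephemeral credentials
        }
        snapshot.push_back(cred);
    }
    return snapshot;
}
```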
https://buildkite.com/redpanda/vtools/builds/7874#01887498-d3d1-4c33-937a-06080819570b