CI failure: SchemaRegistryAutoAuthTest.test_restarts #11141

Closed
vshtokman opened this issue Jun 1, 2023 · 9 comments · Fixed by #11563
Labels
area/schema-registry Schema Registry service within Redpanda ci-failure kind/bug Something isn't working sev/medium Bugs that do not meet criteria for high or critical, but are more severe than low.

Comments

@vshtokman
Contributor

vshtokman commented Jun 1, 2023

https://buildkite.com/redpanda/vtools/builds/7874#01887498-d3d1-4c33-937a-06080819570b

Module: rptest.tests.schema_registry_test
Class:  SchemaRegistryAutoAuthTest
Method: test_restarts
test_id:    rptest.tests.schema_registry_test.SchemaRegistryAutoAuthTest.test_restarts
status:     FAIL
run time:   1 minute 52.443 seconds

    AssertionError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 79, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/schema_registry_test.py", line 1372, in test_restarts
    check_connection(n.account.hostname)
  File "/home/ubuntu/redpanda/tests/rptest/tests/schema_registry_test.py", line 1360, in check_connection
    assert result_raw.status_code == requests.codes.ok
AssertionError
@vshtokman vshtokman added kind/bug Something isn't working ci-failure labels Jun 1, 2023
@piyushredpanda piyushredpanda added the area/schema-registry Schema Registry service within Redpanda label Jun 2, 2023
@NyaliaLui
Contributor

NyaliaLui commented Jun 2, 2023

Initial observations

The problem occurred during a get_subjects SR request because authorization failed on the _schemas topic during a list_offsets call.

DEBUG 2023-06-01 09:16:17,529 [shard 0] kafka/client - broker.h:69 - Dispatch to node 2: list_offsets req: {replica_id=-2147483648 isolation_level=0 topics={{name={_schemas} partitions={{partition_index=0 current_leader_epoch=-2147483648 timestamp={timestamp: missing} max_num_offsets=1}}}}}
DEBUG 2023-06-01 09:16:17,530 [shard 0] kafka/client - broker.h:76 - Dispatch from node 2: list_offsets res: {throttle_time_ms=0 topics={{name={_schemas} partitions={{partition_index=0 error_code={ error_code: topic_authorization_failed [29] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-2147483648}}}}}

Looking at the code, there are only two places in the schema registry where we issue list_offsets requests:

  1. At startup in service::fetch_internal_topic() - the logs indicate that this method was successful and we do not see topic_authorization_failed
  2. At any call to seq_writer::read_sync() - so I continued the investigation with read_sync() as the suspect

Observe that there are two places where we call read_sync():

  1. Before executing an SR request in handlers.cc. Relevant to this CI ticket, we see a call to read_sync() before RP executes get_subjects. This is likely the culprit:
ss::future<server::reply_t>
get_subjects(server::request_t rq, server::reply_t rp) {
    parse_accept_header(rq, rp);
    auto inc_del{
      parse::query_param<std::optional<include_deleted>>(*rq.req, "deleted")
        .value_or(include_deleted::no)};
    rq.req.reset();

    // List-type request: must ensure we see latest writes
    co_await rq.service().writer().read_sync(); // <-- problem could be here

    auto subjects = co_await rq.service().schema_store().get_subjects(inc_del);
    auto json_rslt{json::rjson_serialize(subjects)};
    rp.rep->write_body("json", json_rslt);
    co_return rp;
}
  2. There is a call to read_sync() within validation.cc::schema_id_validator::impl. I also consider this suspicious since this ci-failure is new and the validation code was newly added (3 days ago) for server-side schema validation (Server side schema id validation #10650). I will meet with @BenPope to see if this point is relevant.
        if (!valid) {
            // It's possible that the schema registry doesn't have a newly
            // written schema, update and retry.
            co_await _api->_sequencer.local().read_sync(); // <-- problem could be here
            valid = co_await ss::visit(
              slice, [this](const auto& d) { return validate(d); });
        }

In conclusion, I believe this failure happened because of a failed seq_writer::read_sync() early in an SR get_subjects() request; however, more investigation is needed to determine why authorization failed.
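
For context, a minimal, self-contained sketch of why read_sync() ends up issuing a list_offsets request. This is an assumed, conceptual shape (list_offsets_for_schemas and consume_schemas_up_to are illustrative names, not real Redpanda symbols): the writer asks the broker for the latest offset of _schemas and then reads forward to it so that list-type requests such as get_subjects observe the latest writes, which is exactly where an authorization failure on list_offsets would surface.

#include <cstdint>
#include <iostream>

// Illustrative stand-ins; not the real kafka::client or seq_writer types.
struct list_offsets_result {
    bool authorization_failed;
    int64_t end_offset;
};

list_offsets_result list_offsets_for_schemas() {
    // In the failing run this request returned topic_authorization_failed [29].
    return {true, -1};
}

void consume_schemas_up_to(int64_t /*offset*/) {
    // Apply _schemas records to the in-memory store up to the given offset.
}

// Conceptual shape of seq_writer::read_sync(): find the latest offset of
// _schemas (a list_offsets request), then read forward to it.
bool read_sync() {
    auto res = list_offsets_for_schemas();
    if (res.authorization_failed) {
        // This is where the failure in this ticket surfaces.
        std::cerr << "list_offsets on _schemas: topic_authorization_failed\n";
        return false;
    }
    consume_schemas_up_to(res.end_offset);
    return true;
}

int main() { return read_sync() ? 0 : 1; }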

Misc. notes

We also see authorization fail early on, when creating the _schemas topic:

INFO  2023-06-01 09:15:05,352 [shard 0] kafka - 172.31.5.177:49923 failed authorization - connection_context.h:193 - proto: kafka rpc protocol, sasl state: complete, acl op: create, principal: {type ephemeral user name __schema_registry}, resource: {kafka-cluster}

@NyaliaLui
Contributor

NyaliaLui commented Jun 2, 2023

Spoke with @BenPope; server-side schema validation is likely not the culprit.

We believe that somehow the principal was lost at some point during runtime because:

We see that a metadata update was successful

DEBUG 2023-06-01 09:16:17,524 [shard 0] kafka/client - broker.h:69 - Dispatch to node 3: metadata req: {topics={nullopt} allow_auto_topic_creation=true include_cluster_authorized_operations=false include_topic_authorized_operations=false}
DEBUG 2023-06-01 09:16:17,524 [shard 0] kafka/client - broker.h:76 - Dispatch from node 3: metadata res: {throttle_time_ms=0 brokers={{node_id=1 host=ip-172-31-5-177 port=9092 rack={nullopt}}, {node_id=2 host=ip-172-31-9-54 port=9092 rack={nullopt}}, {node_id=3 host=ip-172-31-13-49 port=9092 rack={nullopt}}} cluster_id={redpanda.28289869-a246-4cbc-a0ff-b00d946d891a} controller_id=1 topics={{error_code={ error_code: none [0] } name={_schemas} is_internal=false partitions=[{error_code={ error_code: none [0] } partition_index=0 leader_id=2 leader_epoch=16 replica_nodes={{3}, {2}, {1}} isr_nodes={{3}, {2}, {1}} offline_replicas={}},] topic_authorized_operations=-2147483648}} cluster_authorized_operations=-2147483648}

Then, immediately after, we notice that the sasl handshake was successful

INFO  2023-06-01 09:16:17,525 [shard 0] kafka/client - broker.cc:43 - connected to broker:2 - ip-172-31-9-54:9092
DEBUG 2023-06-01 09:16:17,525 [shard 0] kafka/client - sasl_client.cc:46 - Connecting to broker 2 with authentication: SCRAM-SHA-512:zMCyV2y0vFhSoQ8oKELeEjraw4uffuzp
DEBUG 2023-06-01 09:16:17,525 [shard 0] kafka/client - broker.h:69 - Dispatch to node 2: sasl_handshake req: {mechanism=SCRAM-SHA-512}
DEBUG 2023-06-01 09:16:17,525 [shard 0] kafka/client - broker.h:76 - Dispatch from node 2: sasl_handshake res: {error_code={ error_code: none [0] } mechanisms={SCRAM-SHA-256, SCRAM-SHA-512}}
DEBUG 2023-06-01 09:16:17,525 [shard 0] kafka/client - broker.h:69 - Dispatch to node 2: sasl_authenticate req: {auth_bytes=****}
DEBUG 2023-06-01 09:16:17,525 [shard 0] kafka/client - broker.h:76 - Dispatch from node 2: sasl_authenticate res: {error_code={ error_code: none [0] } error_message={nullopt} auth_bytes=**** session_lifetime_ms=0}
DEBUG 2023-06-01 09:16:17,529 [shard 0] kafka/client - broker.h:69 - Dispatch to node 2: sasl_authenticate req: {auth_bytes=****}
DEBUG 2023-06-01 09:16:17,529 [shard 0] kafka/client - broker.h:76 - Dispatch from node 2: sasl_authenticate res: {error_code={ error_code: none [0] } error_message={nullopt} auth_bytes=**** session_lifetime_ms=0}
TRACE 2023-06-01 09:16:17,529 [shard 0] kafka/client - client.cc:137 - updated metadata

And then, immediately after, the list_offsets request occurs and we get topic_authorization_failed.

DEBUG 2023-06-01 09:16:17,529 [shard 0] kafka/client - broker.h:69 - Dispatch to node 2: list_offsets req: {replica_id=-2147483648 isolation_level=0 topics={{name={_schemas} partitions={{partition_index=0 current_leader_epoch=-2147483648 timestamp={timestamp: missing} max_num_offsets=1}}}}}
DEBUG 2023-06-01 09:16:17,530 [shard 0] kafka/client - broker.h:76 - Dispatch from node 2: list_offsets res: {throttle_time_ms=0 topics={{name={_schemas} partitions={{partition_index=0 error_code={ error_code: topic_authorization_failed [29] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-2147483648}}}}}

Therefore, more investigation is needed with the working theory that either the principal was lost or something happened to the client connection between the metadata update, the SASL handshake, and the list_offsets call. Perhaps something was moved into a lambda when it should not have been moved; a generic sketch of that failure mode is below.
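
To make the "moved into a lambda" hypothesis concrete, here is a generic, self-contained illustration (hypothetical types; this is not the actual kafka/client code): if the authenticated principal were moved into a callback, later requests built from the moved-from field would be sent with an empty principal and fail authorization on the broker.

#include <functional>
#include <iostream>
#include <string>

struct connection {
    std::string principal;

    std::function<void()> make_retry_handler() {
        // Suspected bug pattern: the principal is moved out of the connection
        // into the captured state of a callback...
        return [p = std::move(principal)] { std::cout << "retry as " << p << "\n"; };
    }

    void send_list_offsets() const {
        // ...so a later request is sent with a moved-from (typically empty)
        // principal and the broker reports topic_authorization_failed.
        std::cout << "list_offsets as principal '" << principal << "'\n";
    }
};

int main() {
    connection c{"__schema_registry"};
    auto retry = c.make_retry_handler(); // principal moved away here
    c.send_list_offsets();               // now sent with an empty principal
    retry();
}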

@andijcr
Contributor

andijcr commented Jun 5, 2023

@ztlpn
Contributor

ztlpn commented Jun 14, 2023

@NyaliaLui
Contributor

NyaliaLui commented Jun 14, 2023

The topic_authorization_failed error comes from the ListOffsets handler because the principal did not have "describe" ACLs at the time of the request. This is strange because the ephemeral principal is configured to have "all" ACLs at SchemaRegistry startup.
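
For reference, a minimal sketch of the gate being described (assumed shape; the user and ACL names are illustrative and this is not the actual Redpanda handler): ListOffsets is answered with topic_authorization_failed unless the principal holds the "describe" operation (or "all", which implies it) on the topic, and right after a restart the local ACL store may simply not contain those entries yet.

#include <iostream>
#include <set>
#include <string>
#include <utility>

using acl_key = std::pair<std::string, std::string>; // {principal, operation}

bool authorized(const std::set<acl_key>& acls, const std::string& principal) {
    // "all" implies "describe"; until controller log replay completes, a
    // restarted broker may hold neither entry.
    return acls.count({principal, "describe"}) > 0
           || acls.count({principal, "all"}) > 0;
}

int main() {
    std::set<acl_key> acls; // empty: replay not finished yet
    if (!authorized(acls, "__schema_registry")) {
        std::cout << "list_offsets(_schemas): topic_authorization_failed [29]\n";
    }
}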

Furthermore, from the DT and RP logs, it appears that this failure happens around the time that two events occur:

  1. A leadership transfer for the _schemas topic. For example, a transfer from node A to node B
  2. The new leader, node B, is restarted shortly after becoming the leader

When a broker is restarted, it re-populates its local instance of the SCRAM credential store by replaying the controller log or after receiving updates from the controller leader.

Therefore, it seems like this is a timing issue between the Kafka broker replaying the controller log and SchemaRegistry startup resetting ACLs. I'm going to run tests overnight to see if this reproduces the problem.

I am assigning sev/medium because this is a true RP bug (i.e., it's not sev/low) but the bug does not result in a crash (i.e., not sev/high).

@NyaliaLui NyaliaLui added the sev/medium Bugs that do not meet criteria for high or critical, but are more severe than low. label Jun 14, 2023
@michael-redpanda
Contributor

@NyaliaLui
Contributor

Thanks to @michael-redpanda, I was able to reliably reproduce this error locally with:
task rp:run-ducktape-tests DUCKTAPE_ARGS="tests/rptest/tests/schema_registry_test.py::SchemaRegistryAutoAuthTest.test_restarts --no-teardown" REDPANDA_LOG_LEVEL=trace DUCKTAPE_MAX_PARALLEL=3 DUCKTAPE_REPEAT=100

Adding the --no-teardown argument tells ducktape not to run tearDown methods. This is important for Redpanda because it allows the Redpanda controller log to grow between test runs. More on this later.

I also added some logging to the code in the schema registry service that creates ACLs for the ephemeral user:

    auto err_vec = co_await _controller->get_security_frontend().local().create_acls(
      {security::acl_binding{
        security::resource_pattern{
          security::resource_type::topic,
          model::schema_registry_internal_tp.topic,
          security::pattern_type::literal},
        security::acl_entry{
          principal,
          security::acl_host::wildcard_host(),
          security::acl_operation::all,
          security::acl_permission::allow}}},
      5s);

    // Added below: log any per-binding error returned by create_acls
    for (auto& err : err_vec) {
        if (err != cluster::errc::success) {
            vlog(
              plog.info,
              "Failed to create ACLs, User {}, err {} - {}",
              principal,
              err,
              cluster::make_error_code(err).message());
        }
    }

The results of the test run show:

  1. The topic authorization failure is due to a ListOffsets request sent to node 3 at 20:58:27:
DEBUG 2023-06-15 20:58:27,539 [shard 0] kafka/client - broker.h:69 - Dispatch to node 3: list_offsets req: {replica_id=-2147483648 isolation_level=0 topics={{name={_schemas} partitions={{partition_index=0 current_leader_epoch=-2147483648 timestamp={timestamp: missing} max_num_offsets=1}}}}}
DEBUG 2023-06-15 20:58:27,539 [shard 0] kafka/client - broker.h:76 - Dispatch from node 3: list_offsets res: {throttle_time_ms=0 topics={{name={_schemas} partitions={{partition_index=0 error_code={ error_code: topic_authorization_failed [29] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-2147483648}}}}}
  2. Around the time of this request, node 3 was restarted, and we see that the call to create_acls failed because there was no controller leader yet (cluster::errc:9 is no_leader_controller), probably because controller log replay had not finished:
INFO  2023-06-15 20:57:21,686 [shard 0] pandaproxy - service.cc:220 - Failed to create ACLs, User {type ephemeral user name __schema_registry}, err cluster::errc:9

Points (1) and (2) reveal that this is a timing issue between replaying the controller log and the ListOffsets request. Kafka brokers, after a restart, become aware of any preset ACLs by reloading those permissions from the controller log.

Therefore, node 3 is reporting topic_authorization_failed because the ListOffsets request is sent before node 3 has finished replaying the controller log.

I see two potential solutions:

  1. Have the SR client wait a couple of seconds on topic_authorization_failed before re-issuing the request
  2. Re-order schema registry startup to occur after controller log replay.

IMO, (1) is just a band-aid while (2) is the better solution; however, I am hesitant to implement (2) because it may introduce new startup problems, so I am currently testing to see which approach works. A minimal sketch of option (1) is shown below.
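
As a rough illustration of option (1), a small, self-contained retry-with-sleep sketch (illustrative only; the names and timings are assumptions and this is not the fix that eventually landed in the linked PR): the failing call, whether create_acls or the client request that hit topic_authorization_failed, is simply retried a few times with a short pause so that controller log replay has a chance to catch up.

#include <chrono>
#include <functional>
#include <iostream>
#include <thread>

// Retry an operation that can fail while ACLs / controller leadership are not
// yet in place, sleeping briefly between attempts.
bool retry_with_sleep(const std::function<bool()>& attempt,
                      int max_attempts = 5,
                      std::chrono::seconds backoff = std::chrono::seconds(2)) {
    for (int i = 0; i < max_attempts; ++i) {
        if (attempt()) {
            return true;
        }
        std::cerr << "attempt " << i + 1
                  << " failed (e.g. no_leader_controller), retrying\n";
        std::this_thread::sleep_for(backoff);
    }
    return false;
}

int main() {
    int calls = 0;
    // Stand-in for the failing call; here it succeeds on the third attempt,
    // as if replay had caught up in the meantime.
    bool ok = retry_with_sleep([&] { return ++calls >= 3; });
    return ok ? 0 : 1;
}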

@ztlpn
Contributor

ztlpn commented Jun 16, 2023

NyaliaLui added a commit to NyaliaLui/redpanda that referenced this issue Jun 20, 2023
It is possible that the call to create ACLs fails. For example, there
may not be a controller leader yet. This commit auto-retries the
request after a few seconds of sleep.

Fixes redpanda-data#11141
@travisdowns
Member

FAIL test: SchemaRegistryAutoAuthTest.test_restarts (1/13 runs)
failure at 2023-06-20T13:48:22.957Z: AssertionError()
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/31647#0188d8d7-9ae3-4ff0-96b4-f969146e190b

NyaliaLui added a commit to NyaliaLui/redpanda that referenced this issue Jun 26, 2023
It is possible that the call to create ACLs fails. For example, there
may not be a controller leader yet. This commit auto-retries the
request after a small sleep. It also puts schema registry startup in
the background so the sleeps do not block startup.

Fixes redpanda-data#11141
NyaliaLui added a commit to NyaliaLui/redpanda that referenced this issue Jun 27, 2023
It is possible that the call to create ACLs fails. For example, there
may not be a controller leader yet. This commit auto-retries the
request after a small sleep.

Fixes redpanda-data#11141
NyaliaLui added a commit to NyaliaLui/redpanda that referenced this issue Jun 27, 2023
The other half of the problem in redpanda-data#11141 is that it is possible for a
REST request to come in before ACLs have replicated to the rest of the
cluster. This will lead to authorization failures. The solution is to
increase the retry backoff in the internal kafka client.
NyaliaLui added a commit to NyaliaLui/redpanda that referenced this issue Jun 29, 2023
When AutoAuth is enabled, the ACLs check could fail because:
1. The internal client used a random string as the username
2. The ACLs check compared principal type and name

Point (1) is a problem because when the client issues a Kafka request,
the Kafka server will treat the username as the client's principal. This
led to an ACL failure in acl.cc::36 because the credential store had the
ephemeral principal but the client had a random string.

Point (2) is a problem because comparing principals will check for
matching principal name and type. This led to an ACL failure in
acl.cc::36 because the credential store had principal type ephemeral but
the client had principal type user.

Therefore, this commit forces the internal client to use the ephemeral
principal name as the username and modifies the ACLs check to only
compare principal names.

Fixes redpanda-data#11141
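
To illustrate the principal mismatch this commit message describes, a minimal, self-contained model (hypothetical types; not the real security::acl_principal code in acl.cc): an ACL stored for the ephemeral principal does not match a request arriving with principal type "user" when both type and name are compared, but does match when only names are compared.

#include <cassert>
#include <string>

enum class principal_type { user, ephemeral_user };

struct acl_principal {
    principal_type type;
    std::string name;
};

// Old behaviour (paraphrased): type and name both had to match.
bool matches_strict(const acl_principal& acl, const acl_principal& req) {
    return acl.type == req.type && acl.name == req.name;
}

// New behaviour (paraphrased): compare names only.
bool matches_by_name(const acl_principal& acl, const acl_principal& req) {
    return acl.name == req.name;
}

int main() {
    acl_principal stored{principal_type::ephemeral_user, "__schema_registry"};
    acl_principal incoming{principal_type::user, "__schema_registry"};
    assert(!matches_strict(stored, incoming)); // the failure mode
    assert(matches_by_name(stored, incoming)); // the fix's behaviour
    return 0;
}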
NyaliaLui added a commit to NyaliaLui/redpanda that referenced this issue Jun 29, 2023
Previously, the security manager was inserting ephemeral credentials
into the snapshot that is reloaded on controller log replay. This is a
problem because, upon log replay, RP will reload an ephemeral credential
that has a username and password but no principal. This leads to
authorization failures.

Therefore, this commit stops the security manager from inserting
ephemeral credentials into a snapshot.

Fixes redpanda-data#11141
7 participants