CI failure: SchemaRegistryAutoAuthTest.test_restarts #11141

Closed
vshtokman opened this issue Jun 1, 2023 · 9 comments · Fixed by #11563
Labels
area/schema-registry Schema Registry service within Redpanda ci-failure kind/bug Something isn't working sev/medium Bugs that do not meet criteria for high or critical, but are more severe than low.

Comments

@vshtokman
Contributor

vshtokman commented Jun 1, 2023

https://buildkite.com/redpanda/vtools/builds/7874#01887498-d3d1-4c33-937a-06080819570b

Module: rptest.tests.schema_registry_test
Class:  SchemaRegistryAutoAuthTest
Method: test_restarts
test_id:    rptest.tests.schema_registry_test.SchemaRegistryAutoAuthTest.test_restarts
status:     FAIL
run time:   1 minute 52.443 seconds

    AssertionError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 79, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/schema_registry_test.py", line 1372, in test_restarts
    check_connection(n.account.hostname)
  File "/home/ubuntu/redpanda/tests/rptest/tests/schema_registry_test.py", line 1360, in check_connection
    assert result_raw.status_code == requests.codes.ok
AssertionError
@vshtokman vshtokman added kind/bug Something isn't working ci-failure labels Jun 1, 2023
@piyushredpanda piyushredpanda added the area/schema-registry Schema Registry service within Redpanda label Jun 2, 2023
@NyaliaLui
Contributor

NyaliaLui commented Jun 2, 2023

Initial observations

The problem occurred during a get_subjects SR request because authorization failed on the _schemas topic during a list_offsets call.

DEBUG 2023-06-01 09:16:17,529 [shard 0] kafka/client - broker.h:69 - Dispatch to node 2: list_offsets req: {replica_id=-2147483648 isolation_level=0 topics={{name={_schemas} partitions={{partition_index=0 current_leader_epoch=-2147483648 timestamp={timestamp: missing} max_num_offsets=1}}}}}
DEBUG 2023-06-01 09:16:17,530 [shard 0] kafka/client - broker.h:76 - Dispatch from node 2: list_offsets res: {throttle_time_ms=0 topics={{name={_schemas} partitions={{partition_index=0 error_code={ error_code: topic_authorization_failed [29] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-2147483648}}}}}

Looking at the code, there are only two places in the schema registry where we issue list_offsets requests:

  1. At startup in service::fetch_internal_topic() - the logs indicate that this method was successful and we do not see topic_authorization_failed
  2. At any call to seq_writer::read_sync() - so I continued the investigation with read_sync() as the suspect

Observe that there are two places where we call read_sync():

  1. Before executing an SR request in handlers.cc. Relevant to this CI ticket, we see a call to read_sync() before RP executes get_subjects. This is likely the culprit:
ss::future<server::reply_t>
get_subjects(server::request_t rq, server::reply_t rp) {
    parse_accept_header(rq, rp);
    auto inc_del{
      parse::query_param<std::optional<include_deleted>>(*rq.req, "deleted")
        .value_or(include_deleted::no)};
    rq.req.reset();

    // List-type request: must ensure we see latest writes
    co_await rq.service().writer().read_sync(); // <-- problem could be here

    auto subjects = co_await rq.service().schema_store().get_subjects(inc_del);
    auto json_rslt{json::rjson_serialize(subjects)};
    rp.rep->write_body("json", json_rslt);
    co_return rp;
}
  2. There is a call to read_sync() within validation.cc::schema_id_validator::impl. I also consider this suspicious since this ci-failure is new and the validation code was newly added (3 days ago) for server-side schema validation (Server side schema id validation #10650). I will meet with @BenPope to see if this point is relevant.
        if (!valid) {
            // It's possible that the schema registry doesn't have a newly
            // written schema, update and retry.
            co_await _api->_sequencer.local().read_sync(); // <-- problem could be here
            valid = co_await ss::visit(
              slice, [this](const auto& d) { return validate(d); });
        }

In conclusion, I believe this failure happened because of a failed seq_writer::read_sync() early in an SR get_subjects() request; however, more investigation is needed to determine why authorization failed.
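
For context, a minimal, self-contained sketch of why read_sync() ends up issuing a list_offsets request. This is an assumed, conceptual shape (list_offsets_for_schemas and consume_schemas_up_to are illustrative names, not real Redpanda symbols): the writer asks the broker for the latest offset of _schemas and then reads forward to it so that list-type requests such as get_subjects observe the latest writes, which is exactly where an authorization failure on list_offsets would surface.

#include <cstdint>
#include <iostream>

// Illustrative stand-ins; not the real kafka::client or seq_writer types.
struct list_offsets_result {
    bool authorization_failed;
    int64_t end_offset;
};

list_offsets_result list_offsets_for_schemas() {
    // In the failing run this request returned topic_authorization_failed [29].
    return {true, -1};
}

void consume_schemas_up_to(int64_t /*offset*/) {
    // Apply _schemas records to the in-memory store up to the given offset.
}

// Conceptual shape of seq_writer::read_sync(): find the latest offset of
// _schemas (a list_offsets request), then read forward to it.
bool read_sync() {
    auto res = list_offsets_for_schemas();
    if (res.authorization_failed) {
        // This is where the failure in this ticket surfaces.
        std::cerr << "list_offsets on _schemas: topic_authorization_failed\n";
        return false;
    }
    consume_schemas_up_to(res.end_offset);
    return true;
}

int main() { return read_sync() ? 0 : 1; }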

Misc. notes

We also see authorization fail early on, when creating the _schemas topic:

INFO  2023-06-01 09:15:05,352 [shard 0] kafka - 172.31.5.177:49923 failed authorization - connection_context.h:193 - proto: kafka rpc protocol, sasl state: complete, acl op: create, principal: {type ephemeral user name __schema_registry}, resource: {kafka-cluster}

@NyaliaLui
Contributor

NyaliaLui commented Jun 2, 2023

Spoke with @BenPope; server-side schema validation is likely not the culprit.

We believe that somehow the principal was lost at some point during runtime because:

We see that a metadata update was successful

DEBUG 2023-06-01 09:16:17,524 [shard 0] kafka/client - broker.h:69 - Dispatch to node 3: metadata req: {topics={nullopt} allow_auto_topic_creation=true include_cluster_authorized_operations=false include_topic_authorized_operations=false}
DEBUG 2023-06-01 09:16:17,524 [shard 0] kafka/client - broker.h:76 - Dispatch from node 3: metadata res: {throttle_time_ms=0 brokers={{node_id=1 host=ip-172-31-5-177 port=9092 rack={nullopt}}, {node_id=2 host=ip-172-31-9-54 port=9092 rack={nullopt}}, {node_id=3 host=ip-172-31-13-49 port=9092 rack={nullopt}}} cluster_id={redpanda.28289869-a246-4cbc-a0ff-b00d946d891a} controller_id=1 topics={{error_code={ error_code: none [0] } name={_schemas} is_internal=false partitions=[{error_code={ error_code: none [0] } partition_index=0 leader_id=2 leader_epoch=16 replica_nodes={{3}, {2}, {1}} isr_nodes={{3}, {2}, {1}} offline_replicas={}},] topic_authorized_operations=-2147483648}} cluster_authorized_operations=-2147483648}

Then, immediately after, we notice that the sasl handshake was successful

INFO  2023-06-01 09:16:17,525 [shard 0] kafka/client - broker.cc:43 - connected to broker:2 - ip-172-31-9-54:9092
DEBUG 2023-06-01 09:16:17,525 [shard 0] kafka/client - sasl_client.cc:46 - Connecting to broker 2 with authentication: SCRAM-SHA-512:zMCyV2y0vFhSoQ8oKELeEjraw4uffuzp
DEBUG 2023-06-01 09:16:17,525 [shard 0] kafka/client - broker.h:69 - Dispatch to node 2: sasl_handshake req: {mechanism=SCRAM-SHA-512}
DEBUG 2023-06-01 09:16:17,525 [shard 0] kafka/client - broker.h:76 - Dispatch from node 2: sasl_handshake res: {error_code={ error_code: none [0] } mechanisms={SCRAM-SHA-256, SCRAM-SHA-512}}
DEBUG 2023-06-01 09:16:17,525 [shard 0] kafka/client - broker.h:69 - Dispatch to node 2: sasl_authenticate req: {auth_bytes=****}
DEBUG 2023-06-01 09:16:17,525 [shard 0] kafka/client - broker.h:76 - Dispatch from node 2: sasl_authenticate res: {error_code={ error_code: none [0] } error_message={nullopt} auth_bytes=**** session_lifetime_ms=0}
DEBUG 2023-06-01 09:16:17,529 [shard 0] kafka/client - broker.h:69 - Dispatch to node 2: sasl_authenticate req: {auth_bytes=****}
DEBUG 2023-06-01 09:16:17,529 [shard 0] kafka/client - broker.h:76 - Dispatch from node 2: sasl_authenticate res: {error_code={ error_code: none [0] } error_message={nullopt} auth_bytes=**** session_lifetime_ms=0}
TRACE 2023-06-01 09:16:17,529 [shard 0] kafka/client - client.cc:137 - updated metadata

And then, immediately after, the list_offsets request occurs and we get topic_authorization_failed.

DEBUG 2023-06-01 09:16:17,529 [shard 0] kafka/client - broker.h:69 - Dispatch to node 2: list_offsets req: {replica_id=-2147483648 isolation_level=0 topics={{name={_schemas} partitions={{partition_index=0 current_leader_epoch=-2147483648 timestamp={timestamp: missing} max_num_offsets=1}}}}}
DEBUG 2023-06-01 09:16:17,530 [shard 0] kafka/client - broker.h:76 - Dispatch from node 2: list_offsets res: {throttle_time_ms=0 topics={{name={_schemas} partitions={{partition_index=0 error_code={ error_code: topic_authorization_failed [29] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-2147483648}}}}}

Therefore, more investigation is needed with the working theory that either the principal was lost or something happened to the client connection between the metadata update, the SASL handshake, and the list_offsets call. Perhaps something was moved into a lambda when it should not have been moved; a generic sketch of that failure mode is below.
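
To make the "moved into a lambda" hypothesis concrete, here is a generic, self-contained illustration (hypothetical types; this is not the actual kafka/client code): if the authenticated principal were moved into a callback, later requests built from the moved-from field would be sent with an empty principal and fail authorization on the broker.

#include <functional>
#include <iostream>
#include <string>

struct connection {
    std::string principal;

    std::function<void()> make_retry_handler() {
        // Suspected bug pattern: the principal is moved out of the connection
        // into the captured state of a callback...
        return [p = std::move(principal)] { std::cout << "retry as " << p << "\n"; };
    }

    void send_list_offsets() const {
        // ...so a later request is sent with a moved-from (typically empty)
        // principal and the broker reports topic_authorization_failed.
        std::cout << "list_offsets as principal '" << principal << "'\n";
    }
};

int main() {
    connection c{"__schema_registry"};
    auto retry = c.make_retry_handler(); // principal moved away here
    c.send_list_offsets();               // now sent with an empty principal
    retry();
}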

@andijcr
Contributor

andijcr commented Jun 5, 2023

@ztlpn
Contributor

ztlpn commented Jun 14, 2023

@NyaliaLui
Contributor

NyaliaLui commented Jun 14, 2023

The topic_authorization_failed error comes from the ListOffsets handler because the principal did not have "describe" ACLs at the time of the request. This is strange because the ephemeral principal is configured to have "all" ACLs at SchemaRegistry startup.
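
For reference, a minimal sketch of the gate being described (assumed shape; the user and ACL names are illustrative and this is not the actual Redpanda handler): ListOffsets is answered with topic_authorization_failed unless the principal holds the "describe" operation (or "all", which implies it) on the topic, and right after a restart the local ACL store may simply not contain those entries yet.

#include <iostream>
#include <set>
#include <string>
#include <utility>

using acl_key = std::pair<std::string, std::string>; // {principal, operation}

bool authorized(const std::set<acl_key>& acls, const std::string& principal) {
    // "all" implies "describe"; until controller log replay completes, a
    // restarted broker may hold neither entry.
    return acls.count({principal, "describe"}) > 0
           || acls.count({principal, "all"}) > 0;
}

int main() {
    std::set<acl_key> acls; // empty: replay not finished yet
    if (!authorized(acls, "__schema_registry")) {
        std::cout << "list_offsets(_schemas): topic_authorization_failed [29]\n";
    }
}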

Furthermore, from the DT and RP logs, it appears that this failure happens around the time that two events occur:

  1. A leadership transfer for the _schemas topic. For example, a transfer from node A to node B
  2. The new leader, node B, is restarted shortly after becoming the leader

When a broker is restarted, it re-populates its local instance of the SCRAM credential store by replaying the controller log or after receiving updates from the controller leader.

Therefore, it seems like this is a timing issue between the Kafka broker replaying the controller log and SchemaRegistry startup resetting ACLs. I'm going to run tests overnight to see if this reproduces the problem.

I am assigning sev/medium because this is a true RP bug (i.e., it's not sev/low) but the bug does not result in a crash (i.e., not sev/high).

@NyaliaLui NyaliaLui added the sev/medium Bugs that do not meet criteria for high or critical, but are more severe than low. label Jun 14, 2023
@michael-redpanda
Contributor

@NyaliaLui
Contributor

Thanks to @michael-redpanda, I was able to reliably reproduce this error locally with:
task rp:run-ducktape-tests DUCKTAPE_ARGS="tests/rptest/tests/schema_registry_test.py::SchemaRegistryAutoAuthTest.test_restarts --no-teardown" REDPANDA_LOG_LEVEL=trace DUCKTAPE_MAX_PARALLEL=3 DUCKTAPE_REPEAT=100

Adding the --no-teardown argument tells ducktape not to run tearDown methods. This is important for Redpanda because it allows the Redpanda controller log to grow between test runs. More on this later.

I also added some logging to the code in the schema registry service that creates ACLs for the ephemeral user:

    auto err_vec = co_await _controller->get_security_frontend().local().create_acls(
      {security::acl_binding{
        security::resource_pattern{
          security::resource_type::topic,
          model::schema_registry_internal_tp.topic,
          security::pattern_type::literal},
        security::acl_entry{
          principal,
          security::acl_host::wildcard_host(),
          security::acl_operation::all,
          security::acl_permission::allow}}},
      5s);

    // Added below: log any per-binding error returned by create_acls
    for (auto& err : err_vec) {
        if (err != cluster::errc::success) {
            vlog(
              plog.info,
              "Failed to create ACLs, User {}, err {} - {}",
              principal,
              err,
              cluster::make_error_code(err).message());
        }
    }

The results of the test run show:

  1. The topic authorization failure is due to a ListOffsets request sent to node 3 at 20:58:27:
DEBUG 2023-06-15 20:58:27,539 [shard 0] kafka/client - broker.h:69 - Dispatch to node 3: list_offsets req: {replica_id=-2147483648 isolation_level=0 topics={{name={_schemas} partitions={{partition_index=0 current_leader_epoch=-2147483648 timestamp={timestamp: missing} max_num_offsets=1}}}}}
DEBUG 2023-06-15 20:58:27,539 [shard 0] kafka/client - broker.h:76 - Dispatch from node 3: list_offsets res: {throttle_time_ms=0 topics={{name={_schemas} partitions={{partition_index=0 error_code={ error_code: topic_authorization_failed [29] } old_style_offsets={} timestamp={timestamp: missing} offset=-1 leader_epoch=-2147483648}}}}}
  2. Around the time of this request, node 3 was restarted, and we see that the call to create_acls failed because there was no controller leader yet (cluster::errc:9 is no_leader_controller), probably because controller log replay had not finished:
INFO  2023-06-15 20:57:21,686 [shard 0] pandaproxy - service.cc:220 - Failed to create ACLs, User {type ephemeral user name __schema_registry}, err cluster::errc:9

Points (1) and (2) reveal that this is a timing issue between replaying the controller log and the ListOffsets request. Kafka brokers, after a restart, become aware of any preset ACLs by reloading those permissions from the controller log.

Therefore, node 3 is reporting topic_authorization_failed because the ListOffsets request is sent before node 3 has finished replaying the controller log.

I see two potential solutions:

  1. Have the SR client wait a couple of seconds on topic_authorization_failed before re-issuing the request
  2. Re-order schema registry startup to occur after controller log replay.

IMO, (1) is just a band-aid while (2) is the better solution; however, I am hesitant to implement (2) because it may introduce new startup problems, so I am currently testing to see which approach works. A minimal sketch of option (1) is shown below.
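
As a rough illustration of option (1), a small, self-contained retry-with-sleep sketch (illustrative only; the names and timings are assumptions and this is not the fix that eventually landed in the linked PR): the failing call, whether create_acls or the client request that hit topic_authorization_failed, is simply retried a few times with a short pause so that controller log replay has a chance to catch up.

#include <chrono>
#include <functional>
#include <iostream>
#include <thread>

// Retry an operation that can fail while ACLs / controller leadership are not
// yet in place, sleeping briefly between attempts.
bool retry_with_sleep(const std::function<bool()>& attempt,
                      int max_attempts = 5,
                      std::chrono::seconds backoff = std::chrono::seconds(2)) {
    for (int i = 0; i < max_attempts; ++i) {
        if (attempt()) {
            return true;
        }
        std::cerr << "attempt " << i + 1
                  << " failed (e.g. no_leader_controller), retrying\n";
        std::this_thread::sleep_for(backoff);
    }
    return false;
}

int main() {
    int calls = 0;
    // Stand-in for the failing call; here it succeeds on the third attempt,
    // as if replay had caught up in the meantime.
    bool ok = retry_with_sleep([&] { return ++calls >= 3; });
    return ok ? 0 : 1;
}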

@ztlpn
Contributor

ztlpn commented Jun 16, 2023

NyaliaLui added a commit to NyaliaLui/redpanda that referenced this issue Jun 20, 2023
It is possible that the call to create ACLs fails. For example, there
may not be a controller leader yet. This commit auto-retries the
request after a few seconds of sleep.

Fixes redpanda-data#11141
@travisdowns
Member

FAIL test: SchemaRegistryAutoAuthTest.test_restarts (1/13 runs)
failure at 2023-06-20T13:48:22.957Z: AssertionError()
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/31647#0188d8d7-9ae3-4ff0-96b4-f969146e190b

NyaliaLui added a commit to NyaliaLui/redpanda that referenced this issue Jun 26, 2023
It is possible that the call to create ACLs fails. For example, there
may not be a controller leader yet. This commit auto-retries the
request after a small sleep. It also puts schema registry startup in
the background so the sleeps do not block startup.

Fixes redpanda-data#11141
NyaliaLui added a commit to NyaliaLui/redpanda that referenced this issue Jun 27, 2023
It is possible that the call to create ACLs fails. For example, there
may not be a controller leader yet. This commit auto-retries the
request after a small sleep.

Fixes redpanda-data#11141
NyaliaLui added a commit to NyaliaLui/redpanda that referenced this issue Jun 27, 2023
The other half of the problem in redpanda-data#11141 is that it is possible for a
REST request to come in before ACLs have replicated to the rest of the
cluster. This will lead to authorization failures. The solution is to
increase the retry backoff in the internal kafka client.
NyaliaLui added a commit to NyaliaLui/redpanda that referenced this issue Jun 29, 2023
When AutoAuth is enabled, the ACLs check could fail because:
1. The internal client used a random string as the username
2. The ACLs check compared principal type and name

Point (1) is a problem because when the client issues a Kafka request,
the Kafka server will treat the username as the client's principal. This
led to an ACL failure in acl.cc::36 because the credential store had the
ephemeral principal but the client had a random string.

Point (2) is a problem because comparing principals will check for
matching principal name and type. This led to an ACL failure in
acl.cc::36 because the credential store had principal type ephemeral but
the client had principal type user.

Therefore, this commit forces the internal client to use the ephemeral
principal name as the username and modifies the ACLs check to only
compare principal names.

Fixes redpanda-data#11141
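
To illustrate the principal mismatch this commit message describes, a minimal, self-contained model (hypothetical types; not the real security::acl_principal code in acl.cc): an ACL stored for the ephemeral principal does not match a request arriving with principal type "user" when both type and name are compared, but does match when only names are compared.

#include <cassert>
#include <string>

enum class principal_type { user, ephemeral_user };

struct acl_principal {
    principal_type type;
    std::string name;
};

// Old behaviour (paraphrased): type and name both had to match.
bool matches_strict(const acl_principal& acl, const acl_principal& req) {
    return acl.type == req.type && acl.name == req.name;
}

// New behaviour (paraphrased): compare names only.
bool matches_by_name(const acl_principal& acl, const acl_principal& req) {
    return acl.name == req.name;
}

int main() {
    acl_principal stored{principal_type::ephemeral_user, "__schema_registry"};
    acl_principal incoming{principal_type::user, "__schema_registry"};
    assert(!matches_strict(stored, incoming)); // the failure mode
    assert(matches_by_name(stored, incoming)); // the fix's behaviour
    return 0;
}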
NyaliaLui added a commit to NyaliaLui/redpanda that referenced this issue Jun 29, 2023
Previously, the security manager was inserting ephemeral credentials
into the snapshot that is reloaded on controller log replay. This is a
problem because, upon log replay, RP will reload an ephemeral credential
that has a username and password but no principal. This leads to
authorization failures.

Therefore, this commit stops the security manager from inserting
ephemeral credentials into a snapshot.

Fixes redpanda-data#11141
7 participants