
Basic auth scale tests2 #7111

Closed

Conversation

NyaliaLui
Contributor

Cover letter

Draft of changes proposed in #6781

Backport Required

  • not a bug fix
  • issue does not exist in previous branches
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

  • none

Release notes

  • none

Contributor Author

@NyaliaLui NyaliaLui left a comment


After removing the gate and lock from the sharded client cache, I'm seeing node crashes and connection timeouts from the Python Requests library with tests with 1k and 2k users.

I think we still need a gate and lock for concurrency.

Doesn't the webserver use worker threads to handle requests? If so, then we need to control access to the client cache, which is a sharded service. I'll look into this more. Update: we use the Seastar HTTP server, not Apache, so this concern does not apply.

ss::smp_submit_to_options{context().smp_sg},
[func{std::forward<Func>(func)}](
kafka_client_cache& cache) mutable {
return std::invoke(std::move(func), cache);
Contributor Author


Gate or lock here, or really a gate/lock on the sharded service. The method invoke_on_cache was meant to be the wrapper for the gate and lock.
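
For reference, a minimal sketch of what that wrapper could look like: the helper below, its gate/semaphore parameters, and the shard selection are illustrative assumptions, not the actual sharded_client_cache code.

#include <seastar/core/gate.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/sharded.hh>
#include <functional>

// ss = seastar (Redpanda's seastarx alias). Illustrative only: the gate
// lets shutdown wait for in-flight calls, and a semaphore(1) used as a
// lock serializes access to the cache. gate/lock/cache are assumed to be
// long-lived members that outlive the call.
template<typename Func>
auto invoke_on_cache(
  ss::gate& gate,
  ss::semaphore& lock,
  ss::sharded<kafka_client_cache>& cache,
  ss::shard_id shard,
  Func&& func) {
    return ss::with_gate(
      gate, [&lock, &cache, shard, func{std::forward<Func>(func)}]() mutable {
          return ss::with_semaphore(
            lock, 1, [&cache, shard, func{std::move(func)}]() mutable {
                return cache.invoke_on(
                  shard,
                  [func{std::move(func)}](kafka_client_cache& c) mutable {
                      return std::invoke(std::move(func), c);
                  });
            });
      });
}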

Member


You don't need a gate; kafka_client_cache encapsulates a gate that protects async calls.
You don't need a lock; kafka_client_cache encapsulates a lock that protects against reentrancy of clean_stale_clients.
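
For context, a rough sketch of the encapsulation being described; the member and method names below are assumptions, not quotes from the real kafka_client_cache.

#include <seastar/core/gate.hh>
#include <seastar/core/semaphore.hh>

// Sketch only: the cache owns a gate that protects its async calls and
// a semaphore(1) that keeps clean_stale_clients from running
// reentrantly, so callers do not need their own gate or lock.
class kafka_client_cache {
public:
    ss::future<> clean_stale_clients() {
        return ss::with_gate(_gate, [this] {
            return ss::with_semaphore(
              _clean_lock, 1, [this] { return do_clean(); });
        });
    }

    ss::future<> stop() {
        return _gate.close(); // waits for outstanding async work
    }

private:
    ss::future<> do_clean(); // eviction of stale clients (not shown)
    ss::gate _gate;
    ss::semaphore _clean_lock{1};
};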

Contributor Author

@NyaliaLui NyaliaLui Nov 7, 2022


Without the gate and lock, the test fails with 2k users (and with 1k users in 2 out of 10 runs) with a crash: rptest.services.utils.NodeCrash: <NodeCrash docker-rp-1: redpanda: /home/nyalia/redpanda/vbuild/debug/clang/rp_deps_install/include/seastar/core/iostream.hh:391: seastar::output_stream<char>::~output_stream() [CharType = char]: Assertion `!_in_batch && "Was this stream properly closed?"' failed.

There are also many connection errors due to timeout, likely because the broker crashed:
41ea10>: Failed to establish a new connection: [Errno 111] Connection refused')) raise ConnectionError(e, request=request) requests.exceptions.ConnectionError: HTTPConnectionPool(host='docker-rp-1', port=8082): Max retries exceeded with url: /topics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4834380220>: Failed to establish a new connection: [Errno 111] Connection refused'))

The assert failure Assertion `!_in_batch && "Was this stream properly closed?"' indicates that an output_stream was destroyed or closed while it still held bytes to transport.
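
For illustration, the pattern that avoids this assertion is to flush and close the stream before it goes out of scope. A minimal Seastar sketch (not code from this PR):

#include <seastar/core/do_with.hh>
#include <seastar/core/iostream.hh>

// Sketch only: an output_stream must be closed before destruction;
// destroying it while bytes are still buffered ("in batch") trips the
// assertion quoted above.
ss::future<> write_and_close(ss::output_stream<char> out) {
    return ss::do_with(std::move(out), [](ss::output_stream<char>& os) {
        return os.write("payload", 7)
          .then([&os] { return os.flush(); })
          .finally([&os] { return os.close(); });
    });
}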

Member


OK, so what's happening is that the client is being evicted during do_connect, and its lifetime isn't extended.

Try something like this:

diff --git a/src/v/kafka/client/client.cc b/src/v/kafka/client/client.cc
index 38abfdbd2..931963e46 100644
--- a/src/v/kafka/client/client.cc
+++ b/src/v/kafka/client/client.cc
@@ -63,9 +63,9 @@ ss::future<> client::do_connect(net::unresolved_address addr) {
     return make_broker(unknown_node_id, addr, _config)
       .then([this](shared_broker_t broker) {
           return broker->dispatch(metadata_request{.list_all_topics = true})
-            .then([this, broker](metadata_response res) {
-                return apply(std::move(res));
-            });
+            .then(
+              [this](metadata_response res) { return apply(std::move(res)); })
+            .finally([broker]() {});
       });
 }
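
In other words, capturing broker in the trailing .finally([broker]() {}) keeps the shared_broker_t handle, and therefore the broker's connection and its output_stream, alive until the dispatch/apply chain has completed, even if the client is evicted from the cache in the meantime.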

Contributor Author

@NyaliaLui NyaliaLui Nov 8, 2022


Done.
And this makes sense. I think Noah also mentioned this scenario; I thought I had already accounted for it, but I did not think about the broker!

This commit reduces cross-shard communication by moving the garbage
collection timer into the individual sharded instances. Otherwise we
risk a seastar assert failure on the shared timer. The assert failure
happens when two or more sharded instances evict and then trigger
garbage collection at the same time.
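
A minimal sketch of the per-shard timer idea (the names and the GC helper below are illustrative, not the actual kafka_client_cache members):

#include <seastar/core/gate.hh>
#include <seastar/core/lowres_clock.hh>
#include <seastar/core/timer.hh>
#include <chrono>

// Sketch only: each shard-local instance owns its own timer, so
// eviction and garbage collection never touch a timer on another shard.
class kafka_client_cache {
public:
    void start_gc(std::chrono::milliseconds period) {
        _gc_timer.set_callback([this] {
            // Runs on this shard only; fire-and-forget inside the gate
            // so stop() can wait for a cleanup that is in flight.
            (void)ss::with_gate(_gate, [this] {
                return clean_stale_clients();
            }).handle_exception([](std::exception_ptr) { /* log */ });
        });
        _gc_timer.arm_periodic(period);
    }

    ss::future<> stop() {
        _gc_timer.cancel();
        return _gate.close();
    }

private:
    ss::future<> clean_stale_clients(); // eviction logic (not shown)
    ss::timer<ss::lowres_clock> _gc_timer;
    ss::gate _gate;
};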
Prior to this commit, all sharded state was handled in the
sharded_client_cache wrapper. That wrapper existed on a single core,
which led to concurrency failures since there was cross-core
communication between the wrapper and the sharded services.

This commit reduces cross-core communication by moving concurrency
mechanisms into the sharded service. This commit also serves as the
prerequisite to removing the sharded_client_cache wrapper.

Finally, the auth_ctx_server is the new "frontend" for the kafka client
cache. The ctx server can pass function handles to the sharded instance
directly, so invoke_on_cache is no longer needed.
The wrapper is now obsolete since the auth_ctx_server calls the sharded
client cache directly.
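
A rough sketch of what calling the sharded client cache directly from the ctx server could look like; credential_t, user_shard(), and refresh_client() are hypothetical names standing in for the real routing and cache API.

#include <seastar/core/sharded.hh>

// Sketch only: the handler passes its lambda straight to the sharded
// service and relies on the shard-local cache's own gate and lock.
ss::future<> handle_request(
  ss::sharded<kafka_client_cache>& cache, credential_t user) {
    return cache.invoke_on(
      user_shard(user),
      [user{std::move(user)}](kafka_client_cache& c) mutable {
          return c.refresh_client(std::move(user)); // hypothetical call
      });
}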
@NyaliaLui
Contributor Author

Closing now since #6781 has the updated code

@NyaliaLui NyaliaLui closed this Nov 8, 2022
@NyaliaLui NyaliaLui deleted the basic-auth-scale-tests2 branch November 18, 2022 18:19