Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests/scale: test basic auth with many users #6781

Merged
merged 6 commits into from
Nov 10, 2022

Conversation

NyaliaLui
Copy link
Contributor

@NyaliaLui NyaliaLui commented Oct 14, 2022

Cover letter

This PR adds a ducktape test that issues many concurrent rest requests with unique principals. The intent is to test that the kafka client cache can handle multiple authenticated connections. This test may also serve as a baseline so stakeholders can gauge how much load the Pandaproxy can handle with respect to the number of users.

Closes: #6764

Changes from force-push 7810f91:

  • Move the handler into invoke_on_cache
  • Create a start method in the client cache

Changes from force-push 4d8312d:

  • Rebased to resolve conflicts
  • Account for gate closed exception on all pp handlers
  • In invoke_on_cache capture the function handle by forwarding instead of reference
  • Removed the garbage collection lock from remove_client_if

Changes from force-push ecf90f1:

  • Extend brokers lifetime in client::do_connect
  • Add a semaphore to the cached items
  • Use the semaphore to synchronize request handling and garbage collection
  • Typos
  • Edit DT test to retry on gate closed exceptions

Changes from force-push 1244e18:

  • Remove unused code causing compile errors

Changes from force-push ba92cc3:

  • Remove fetch_or_insert_impl
  • Use mutex, not semaphore.

Changes from force-push 44a8184:

  • Make fetch_or_insert protected so it is not a leaky abstraction
  • Make a wrapper for fetch_or_insert so the client mutex is always used with the client.

Backport Required

  • not a bug fix
  • issue does not exist in previous branches
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

  • none

Release notes

  • Adds a test that issues many concurrent rest requests with unique principals

@NyaliaLui NyaliaLui self-assigned this Oct 14, 2022
@NyaliaLui NyaliaLui force-pushed the basic-auth-scale-tests branch 2 times, most recently from 0926955 to 97a18a8 Compare October 24, 2022 18:02
@NyaliaLui
Copy link
Contributor Author

NyaliaLui commented Oct 27, 2022

After #6953 merged I'm starting to see crashes and Sanitizer failures tests with small numbers of concurrent users (e.g., 10, 20, 30, 40, 50 unique users)

rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-5(1) example="==180==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7fff6d430a80 at pc 0x55c6e8b414e9 bp 0x7faa1f65e390 sp 0x7faa1f65e388">

rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-1(1) example="==177==ERROR: AddressSanitizer: unknown-crash on address 0x7ffc31798640 at pc 0x55fcc027f4e9 bp 0x7fec15486390 sp 0x7fec15486388">

@BenPope any ideas on the above?

And then I'm still dealing with a gate closed exception that seems to occur when two users are submitted to the client cache concurrently

@BenPope
Copy link
Member

BenPope commented Oct 27, 2022

After #6953 merged I'm starting to see crashes and Sanitizer failures tests with small numbers of concurrent users (e.g., 10, 20, 30, 40, 50 unique users)

rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-5(1) example="==180==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7fff6d430a80 at pc 0x55c6e8b414e9 bp 0x7faa1f65e390 sp 0x7faa1f65e388">

rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-1(1) example="==177==ERROR: AddressSanitizer: unknown-crash on address 0x7ffc31798640 at pc 0x55fcc027f4e9 bp 0x7fec15486390 sp 0x7fec15486388">

@BenPope any ideas on the above?

I hope this will fix it.

And then I'm still dealing with a gate closed exception that seems to occur when two users are submitted to the client cache concurrently

I'll take another look after my tests have finished running.

@NyaliaLui NyaliaLui force-pushed the basic-auth-scale-tests branch 5 times, most recently from 826f664 to 89abe1b Compare October 31, 2022 20:46
@NyaliaLui NyaliaLui marked this pull request as ready for review October 31, 2022 20:46
@NyaliaLui
Copy link
Contributor Author

Oddly enough, I'm getting that gate closed exception in CDT but not on local. The test with 1k and 2k users takes approx 2min and 5min to complete. Why don't I add this test to CI instead of CDT instead?

@dotnwat
Copy link
Member

dotnwat commented Oct 31, 2022

Why don't I add this test to CI instead of CDT instead?

yeh, something like this in CI would be good. also, you might try to increase the number of cores used in CI. i think it is 2 by default, and you might need more to make the problematic condition more likely.

@NyaliaLui NyaliaLui mentioned this pull request Nov 7, 2022
6 tasks
This commit reduces cross-shard communication by moving the garbage
collection timer into the individual sharded instances. Otherwise we
risk a seastar assert failure on the shared timer. The assert failure
happens when two or more sharded instances evict and then trigger
garabage collection at the sametime.
Prior to this commit, all sharded state was handled in the
sharded_client_cache wrapper. That wrapper existed on a single core
which led to concurrency failures since there was cross-core
communication betweem the wrapper and the sharded services.

This commit reduces cross-core communication by moving concurrency
mechanisms into the sharded service. This commit also serves as the
pre-req to remove the sharded_client_cache wrapper.

Finally, the auth_ctx_server is the new "frontend" for the kafka client
cache. The ctx server can pass function handles to the sharded instance
directly. So invoke_on_cache is no longer needed.
The wrapper is obsolete now since the auth_ctx_server calls the sharded
client cache directly.
Previously the client (and thus the shared broker) were evicted from the
Proxy client cache while do_connect was still running. This manifested
as an assertion failure on the ss::outputstream. The solution is to
extend the brokers lifetime.
@NyaliaLui NyaliaLui force-pushed the basic-auth-scale-tests branch 2 times, most recently from ecf90f1 to 1244e18 Compare November 8, 2022 21:38
src/v/pandaproxy/kafka_client_cache.cc Outdated Show resolved Hide resolved
src/v/pandaproxy/server.h Outdated Show resolved Hide resolved
Previously, there was a chance for numerous gate closed exceptions
because the client was evicted and client::stop is called before the
first request could complete. This scenario often occurs when there is
heavy load on the cache with many unique users.

This commit addresses the problem by including a mutex to the cached
items. Then the GC process and client dispatch may use the semaphore so
GC must wait for the current request to finish before stopping the
client.
Copy link
Member

@BenPope BenPope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Copy link
Contributor

@graphcareful graphcareful left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice work

@BenPope BenPope added this to the v22.3.1-rc5 milestone Nov 9, 2022
@NyaliaLui
Copy link
Contributor Author

@dotnwat are you around? GH says you requested changes but I don't see them. Or is this a glitch?

@NyaliaLui
Copy link
Contributor Author

@dotnwat are you around? GH says you requested changes but I don't see them. Or is this a glitch?

Seems like a glitch

@NyaliaLui NyaliaLui dismissed dotnwat’s stale review November 10, 2022 01:02

should've been dismissed automatically when I force pushed

@NyaliaLui NyaliaLui merged commit 3b28ada into redpanda-data:dev Nov 10, 2022
@NyaliaLui NyaliaLui deleted the basic-auth-scale-tests branch November 18, 2022 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pp/test: create scale tests with many users
4 participants