chore: event count throttle for squashed commands #4924

kostasrim · 2025-04-11T14:48:42Z

Throttle/preempt flows that use the multi command squasher when the crb (capturing reply builder) size crosses the limit.

Signed-off-by: kostas <kostas@dragonflydb.io>

kostasrim · 2025-04-22T11:39:06Z

src/server/multi_command_squasher.cc

+thread_local size_t MultiCommandSquasher::throttle_size_limit_ =
+    absl::GetFlag(FLAGS_throttle_squashed);
+
+thread_local util::fb2::EventCount MultiCommandSquasher::ec_;

 MultiCommandSquasher::MultiCommandSquasher(absl::Span<StoredCmd> cmds, ConnectionContext* cntx,


This is used not only from async fiber but also directly from the connection. If we preempt, the connection will also "freeze". I guess this is fine, just mentioning it here for completeness.

There are 3 calls of this and all of them should be ok if we preempt from these flows.

kostasrim · 2025-04-22T11:39:19Z

tests/dragonfly/memory_test.py

+        await cl.execute_command("exec")
+
+    # With the current approach this will overshoot
+    #    await client.execute_command("multi")


I wish we handled this case as well

What is the difference? Why does it not handle this case?

kostasrim · 2025-04-22T11:39:50Z

src/server/multi_command_squasher.h

@@ -94,6 +104,9 @@ class MultiCommandSquasher {

  // we increase size in one thread and decrease in another
  static atomic_uint64_t current_reply_size_;
+  static thread_local size_t throttle_size_limit_;
+  // Used to throttle when memory is tight
+  static thread_local util::fb2::EventCount ec_;


We need this to avoid ThisFiber::Yield, ThisFiber::SleepFor in while(true).

since it's thread local, it's more efficient to use NoOpLock together with CondVarAny

@romange nice!

Actually, I added a bug here:

static atomic_uint64_t current_reply_size_;

so current_reply_size is not thread local. So what can happen is:

Core 0 -> starts multi/exec
Core 1 -> starts multi/exec but needs to throttle so it goes to sleep waiting on the thread local cond variable
Core 0 -> is done, notifies the thread local
Core 1 -> the fiber never awakes even though we decremented current_reply_size.

Since current_reply_size is global then so should ec_.

P.s. not very happy with this extra synchronization but we only pay it when we are under memory pressure

it should be thread local.
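The lost wake-up in the Core 0 / Core 1 scenario above comes from a scope mismatch: the counter is shared by all threads while the wait primitive is per thread. A minimal sketch of the matched case, assuming standard-library primitives (`std::mutex`, `std::condition_variable`) in place of `util::fb2::EventCount` and a hypothetical `ReplyBudget` class that is not part of Dragonfly:

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the squasher's reply accounting (not Dragonfly code).
// The counter and the wait primitive have the same scope, so a release on any
// thread can wake a waiter on any other thread.
class ReplyBudget {
 public:
  explicit ReplyBudget(std::size_t limit) : limit_(limit) {}

  // Blocks while the tracked size is at or above the limit, then adds `bytes`.
  void Acquire(std::size_t bytes) {
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [this] { return current_ < limit_; });
    current_ += bytes;
  }

  // Subtracts `bytes` and notifies every waiter, regardless of its thread.
  void Release(std::size_t bytes) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      assert(current_ >= bytes);
      current_ -= bytes;
    }
    cv_.notify_all();
  }

  std::size_t current() const {
    std::lock_guard<std::mutex> lk(mu_);
    return current_;
  }

 private:
  mutable std::mutex mu_;
  std::condition_variable cv_;
  const std::size_t limit_;
  std::size_t current_ = 0;
};

// Replays the scenario from the comment with two OS threads.
inline std::size_t RunTwoThreadDemo() {
  ReplyBudget budget(10);
  budget.Acquire(10);                             // "core 0" fills the budget
  std::thread core1([&] { budget.Acquire(5); });  // "core 1" has to wait
  budget.Release(10);                             // "core 0" finishes and notifies
  core1.join();                                   // returns only if core 1 woke up
  return budget.current();
}
```

If the condition variable were `thread_local` while the counter stayed shared, the `notify_all()` on core 0 would target a different object than the one core 1 waits on. The sketch pays for a mutex on every call, which is the kind of shared synchronization the rest of this thread argues against.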

kostasrim · 2025-04-22T11:40:24Z

@adiholden pinging for an early discussion here

kostasrim · 2025-04-22T11:41:42Z

src/server/multi_command_squasher.cc

@@ -15,6 +16,8 @@
 #include "server/transaction.h"
 #include "server/tx_base.h"

+ABSL_FLAG(size_t, throttle_squashed, 0, "");


@adiholden I will adjust as we said f2f. Looking for some early feedback based on our discussion

adiholden · 2025-04-22T12:01:41Z

src/server/multi_command_squasher.cc

@@ -63,6 +66,10 @@ size_t Size(const facade::CapturingReplyBuilder::Payload& payload) {
 }  // namespace

 atomic_uint64_t MultiCommandSquasher::current_reply_size_ = 0;
+thread_local size_t MultiCommandSquasher::throttle_size_limit_ =
+    absl::GetFlag(FLAGS_throttle_squashed);


As discussed this morning, multiply by thread count. The limit should be per thread, and current_reply_size_ is a global counter.

Yes I know, I even wrote a comment above that I will follow up with this 😄

I wanted to know if you have anything else to add 😄

adiholden · 2025-05-04T06:10:59Z

src/server/multi_command_squasher.cc

@@ -63,6 +66,9 @@ size_t Size(const facade::CapturingReplyBuilder::Payload& payload) {
 }  // namespace

 atomic_uint64_t MultiCommandSquasher::current_reply_size_ = 0;
+thread_local size_t MultiCommandSquasher::throttle_size_limit_ =


I believe we should multiply throttle_squashed by the number of io threads and not by the number of shards

We should not do this at all. The limit should be per thread. In general, it's not well defined to initialize a thread local from another thread local that is initialized to nullptr.

adiholden · 2025-05-04T06:56:25Z

src/server/multi_command_squasher.h

@@ -37,6 +38,15 @@ class MultiCommandSquasher {
    return current_reply_size_.load(std::memory_order_relaxed);
  }

+  static bool IsMultiCommandSquasherOverLimit() {


maybe rename to IsReplySizeOverLimit?

adiholden · 2025-05-04T06:58:03Z

src/server/multi_command_squasher.h

+  // Used to throttle when memory is tight
+  static util::fb2::EventCount ec_;
+
+  static thread_local size_t throttle_size_limit_;


maybe reply_size_limit_ ?

adiholden · 2025-05-04T07:01:45Z

src/server/multi_command_squasher.cc

@@ -15,6 +16,8 @@
 #include "server/transaction.h"
 #include "server/tx_base.h"

+ABSL_FLAG(size_t, throttle_squashed, 0, "");


maybe squashed_reply_size_limit
add flag description
also I think we should have a default limit here maybe 128_MB ?

adiholden · 2025-05-04T07:29:38Z

tests/dragonfly/memory_test.py

+        # At any point we should not cross this limit
+        assert df.rss < 1_500_000_000
+        cl = df.client()
+        await cl.execute_command("multi")


I see that the flow you are testing is the multi/exec flow, which I did not think about. When I suggested this throttling I was thinking about the pipeline flow.
Reviewing the multi/exec flow now, I am not 100% sure about applying this logic there: by the time we await for the size to decrease we have already scheduled the transaction, and I am not sure whether this can lead to a deadlock in some cases.

romange · 2025-05-04T09:12:10Z

src/server/server_state.h

@@ -270,6 +270,10 @@ class ServerState {  // public struct - to allow initialization.

  bool ShouldLogSlowCmd(unsigned latency_usec) const;

+  size_t GetTotalShards() const {


not needed - you can use shard_set->size() everywhere

romange · 2025-05-04T09:19:16Z

src/server/multi_command_squasher.cc

@@ -215,6 +222,9 @@ bool MultiCommandSquasher::ExecuteSquashed(facade::RedisReplyBuilder* rb) {
  if (order_.empty())
    return true;

+  MultiCommandSquasher::ec_.await(


All our current approaches to limiting memory are "per-thread"; this is consistent and works nicely with shared-nothing. What is the reason for not using per-thread limits? In addition, we already have per-thread throttling inside the dragonfly_connection code, see IsPipelineBufferOverLimit. Did you consider piggybacking on this mechanism instead?

As long as current_reply_size_ is global, I don't see how we can do this per thread.

romange · 2025-05-04T09:23:04Z

src/server/multi_command_squasher.cc

  for (auto idx : order_) {
    auto& replies = sharded_[idx].replies;
    CHECK(!replies.empty());

    aborted |= opts_.error_abort && CapturingReplyBuilder::TryExtractError(replies.back());

-    current_reply_size_.fetch_sub(Size(replies.back()), std::memory_order_relaxed);


I said nothing when current_reply_size_ was added. It was a mistake. I do not want anyone to introduce global state in the Dragonfly codebase.

@BorysTheDev FYI

But @romange, we discussed this when current_reply_size_ was added. Because the multi command squasher adds replies from different threads, you said it makes sense.

Would something like this work?
https://github.com/dragonflydb/dragonfly/compare/RemoveAtomic?expand=1

@romange The change in your branch linked above counts the reply size after we have executed all the squashed commands. So it does use the thread-local approach correctly, but what we expose to metrics is not accurate: while the capturing reply builders grow, nothing is exposed until we finish with all the squashed commands. I think applying this logic only delays the throttling on reply size: we will still throttle, not at the moment we reach the threshold but somewhat later. I guess we can make this change.

There is no "correct" solution, because I can show you a scenario where a single central atomic won't work either:
we throttle before we send commands to shards, but maybe there are tons of squashed commands in flight that have not filled their replies yet, so you let the next command pass and only then the reply buffer increases.

I would rather have a less accurate metric than have all our threads contend on atomics and now on a single condvar. This kills performance. I won't be surprised if squashing performance is already worse because the "reply buffer size" atomic is being hammered by multiple threads.
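The trade-off described above (a less accurate metric in exchange for no cross-thread contention) can be sketched as one counter per thread, with the metrics reader summing the slots. This is only an illustration: `PerThreadReplySize`, `PaddedCounter`, and `kMaxThreads` are hypothetical names, and the sketch assumes each thread updates only its own slot.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

constexpr std::size_t kMaxThreads = 8;  // assumption for the sketch

// One cache line per counter so threads do not false-share.
struct alignas(64) PaddedCounter {
  std::atomic<std::size_t> bytes{0};
};

class PerThreadReplySize {
 public:
  // Called only by the thread that owns slot `tid`.
  void Add(std::size_t tid, std::size_t n) {
    assert(tid < kMaxThreads);
    slots_[tid].bytes.fetch_add(n, std::memory_order_relaxed);
  }

  void Sub(std::size_t tid, std::size_t n) {
    assert(tid < kMaxThreads);
    slots_[tid].bytes.fetch_sub(n, std::memory_order_relaxed);
  }

  // The value a per-thread limit would be checked against.
  std::size_t Local(std::size_t tid) const {
    return slots_[tid].bytes.load(std::memory_order_relaxed);
  }

  // Metrics view: a sum of relaxed loads, so it may lag behind in-flight updates.
  std::size_t ApproxTotal() const {
    std::size_t total = 0;
    for (const PaddedCounter& slot : slots_)
      total += slot.bytes.load(std::memory_order_relaxed);
    return total;
  }

 private:
  std::array<PaddedCounter, kMaxThreads> slots_;
};
```

With this layout the throttling decision uses `Local(tid)` and never touches another thread's cache line; only the metrics path pays for reading every slot.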

adiholden · 2025-06-17T19:19:19Z

src/server/multi_command_squasher.cc

+  // This is not true for `multi/exec` which uses `Execute()` but locks ahead before it
+  // calls `ScheduleSingleHop` below.
+  // TODO Investigate what are the side effects for allowing it `lock ahead` mode.
+  if (opts_.is_mult_non_atomic) {


I don't remember if we discussed this. Why do you need to throttle here instead of using the same mechanism we have today for pipeline backpressure, so that it also throttles when squashing_current_reply_size is above the limit?

That's a good comment. We discussed this back then: reply_size was defined in multi_command_squasher, which is part of /server and not facade, so the connection had no access to it.

After Roman's changes however, he moved this to facade_types so now facade has access to it. I moved everything to dragonfly_connection. However, I chose not to throttle on the queue back pressure (within DispatchSingle) and do it a little later within AsyncFiber because:

throttling on reply size should only be relevant when we try to squash the pipeline. I don't want to introduce delays for commands that are executing standalone as I don't believe they will increase RSS that much for the workloads we try to "fix"

throttling before we dispatch async to the queue is not a great idea because the async fiber might sleep between iterations, and in between the dispatch queue will accumulate a bunch of commands. Once the async fiber wakes up, it will try to squash the pipeline, bypassing the protective mechanism we just added (the reply size might already be over the limit, yet we don't know it because we made that decision at dispatch and not at the async fiber level)

adiholden · 2025-06-24T11:59:16Z

src/facade/dragonfly_connection.cc

@@ -2075,6 +2098,16 @@ void Connection::DecrNumConns() {
    --stats_->num_conns_other;
 }

+bool Connection::IsReplySizeOverLimit() const {
+  std::atomic<size_t>& reply_sz = tl_facade_stats->reply_stats.squashing_current_reply_size;


.load(memory_order_relaxed)

Oh, why did you split this into 2 lines?

.load(memory_order_relaxed)

We should synchronize acquire and release semantics, otherwise the load might be an older value in the modification order of the atomic variable.

I split it into two lines because otherwise the expression was too long

adiholden · 2025-06-24T12:01:09Z

src/facade/dragonfly_connection.cc

+  std::atomic<size_t>& reply_sz = tl_facade_stats->reply_stats.squashing_current_reply_size;
+  size_t current = reply_sz.load(std::memory_order_acquire);
+  const bool over_limit = reply_size_limit != 0 && current > 0 && current > reply_size_limit;
+  LOG(INFO) << "current: " << current << "/" << reply_size_limit;


Log info on this will flood the log file

It was accidental, I changed that when I wanted to debug something. will fix

adiholden · 2025-06-24T12:02:01Z

src/facade/dragonfly_connection.cc

+  size_t current = reply_sz.load(std::memory_order_acquire);
+  const bool over_limit = reply_size_limit != 0 && current > 0 && current > reply_size_limit;
+  LOG(INFO) << "current: " << current << "/" << reply_size_limit;
+  VLOG_IF(2, over_limit) << "MultiCommandSquasher overlimit: " << current << "/"


Actually if we are over limit we want to see this in logs, make this a warning and print once a second
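A minimal sketch of the two requests above: the over-limit check (where `current > 0` is already implied by `current > limit` once `limit != 0`) and a gate that lets a warning through at most once a second. `IsOverLimit` and `EverySecond` are hypothetical helpers written with the standard library only; where the glog version in use provides `LOG_EVERY_T`, that macro would likely replace the hand-rolled gate.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// A limit of 0 means "disabled", matching the `reply_size_limit != 0` check in the diff.
inline bool IsOverLimit(std::size_t current, std::size_t limit) {
  return limit != 0 && current > limit;
}

// Allows one event per second; the caller passes the current time so the
// behaviour is easy to test.
class EverySecond {
 public:
  using Clock = std::chrono::steady_clock;

  bool ShouldLog(Clock::time_point now) {
    if (logged_before_ && now - last_ < std::chrono::seconds(1))
      return false;
    logged_before_ = true;
    last_ = now;
    return true;
  }

 private:
  bool logged_before_ = false;
  Clock::time_point last_{};
};
```

In the connection code this would be used roughly as `if (IsOverLimit(current, limit) && gate.ShouldLog(EverySecond::Clock::now())) { /* emit warning */ }`, with one gate per thread so that no synchronization is needed.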

adiholden · 2025-06-24T12:03:33Z

src/facade/dragonfly_connection.cc

@@ -2105,7 +2138,7 @@ void Connection::EnsureMemoryBudget(unsigned tid) {

 Connection::WeakRef::WeakRef(std::shared_ptr<Connection> ptr, unsigned thread_id,
                             uint32_t client_id)
-    : ptr_{ptr}, thread_id_{thread_id}, client_id_{client_id} {
+    : ptr_{std::move(ptr)}, thread_id_{thread_id}, client_id_{client_id} {


Copying a shared_ptr increments the atomic reference count, so we pay for an atomic operation. Moving avoids that by transferring the pointers instead. I just saw the misuse and used move 😄
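The difference between the two initializers can be observed through `use_count()`. The sketch below assumes the member is itself a `std::shared_ptr`; `Conn`, `CopyRef`, and `MoveRef` are hypothetical types, not the real `Connection::WeakRef`. If the member were a `std::weak_ptr`, the move would not help in the same way, because `std::weak_ptr` only offers a constructor taking `const std::shared_ptr&`.

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct Conn {};  // hypothetical stand-in for Connection

// Copies the by-value parameter into the member: one extra atomic increment,
// and three owners exist while the constructor body runs.
struct CopyRef {
  explicit CopyRef(std::shared_ptr<Conn> ptr) : ptr_{ptr} {
    owners_in_ctor = ptr_.use_count();
  }
  std::shared_ptr<Conn> ptr_;
  long owners_in_ctor = 0;
};

// Moves the parameter into the member: ownership is transferred and the
// reference count is left untouched.
struct MoveRef {
  explicit MoveRef(std::shared_ptr<Conn> ptr) : ptr_{std::move(ptr)} {
    owners_in_ctor = ptr_.use_count();
  }
  std::shared_ptr<Conn> ptr_;
  long owners_in_ctor = 0;
};

inline long OwnersDuringCopy() {
  auto conn = std::make_shared<Conn>();
  CopyRef ref(conn);  // caller + parameter + member
  return ref.owners_in_ctor;
}

inline long OwnersDuringMove() {
  auto conn = std::make_shared<Conn>();
  MoveRef ref(conn);  // caller + member; the parameter is left empty
  return ref.owners_in_ctor;
}
```

Both variants end with the same final count once the parameter is destroyed; the move only saves the intermediate increment and decrement.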

adiholden · 2025-06-24T12:03:51Z

src/facade/facade_types.h

@@ -14,6 +14,7 @@

 #include "base/iterator.h"
 #include "facade/op_status.h"
+#include "util/fibers/synchronization.h"


adiholden · 2025-06-24T12:18:15Z

src/facade/dragonfly_connection.cc

+  // We need to first set async_dispatch before we preempt. Otherwise, when the AsyncFiber
+  // wakes up, sync_dispatch might be true, violating the precondition of this flow
+  // (when we block earlier in the body of AsyncFiber at cnd_.wait(noop_lk, [this])...)
+  fb2::NoOpLock noop;


Still, I don't see you using the QueueBackpressure class that exists today. Why not use the same class to throttle on squashed pipeline reply size?

Yes I did that on purpose:

throttling before we dispatch async to the queue is not a great idea because the async fiber might sleep between iterations, and in between the dispatch queue will accumulate a bunch of commands. Once the async fiber wakes up, it will try to squash the pipeline, bypassing the protective mechanism we just added (the reply size might already be over the limit, yet we don't know it because we made that decision at dispatch and not at the async fiber level)

see #4924 (comment)

Can you please describe a scenario in which squashing_current_reply_size can be greater than zero at this point in time?

yes, squashing_current_reply_size is an atomic thread local variable that is shared and can be incremented on different threads. Different connections from thread 1 can schedule a series of squashed pipelines on thread 2 (and will increment the same squashing_current_reply_size). When an async fiber on thread 1 wakes up after its squashed pipeline finished executing (and decremented `squashing_current_reply_size`), it will try to squash another one, but in parallel the same variable got incremented on thread 2 (because of another tx) and potentially crossed the hard limit.

However, your question raised an interesting point. Is it reasonable to expect that squashing_current_reply_size is usually close to 0 even when multiple connections dispatch squashed pipelines ? I would say maybe, because the rate we increment and decrement this variable on separate threads should be close. By the time we decremented it on thread 1 another tx incremented it on thread 2 and the same cycle repeats in such a way that we always have a value for this variable that is close to 0.

ok, you are right, squashing_current_reply_size can be higher than a threshold due to other connections on this thread. In that case I suggest that instead of throttling the connection - we just avoid calling SquashPipeline, and choose the usual route of sending a single command (lines 1644-1647)

Yeah, and that would remove the throttling/extra synchronization altogether +1
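The suggestion the thread converges on (skip squashing when the reply size is over the limit and fall back to single-command dispatch) needs no waiting at all. A minimal sketch of that decision, where `DispatchMode`, `ChooseDispatch`, and the `squash_threshold` parameter are hypothetical and only the over-limit check mirrors the diff:

```cpp
#include <cassert>
#include <cstddef>

enum class DispatchMode { kSquashPipeline, kSingleCommand };

// Decides how the async fiber handles the pending pipeline. No fiber is put to
// sleep: when the reply size is over the limit we take the ordinary
// one-command route and look again on the next iteration.
inline DispatchMode ChooseDispatch(std::size_t pending_cmds,
                                   std::size_t squash_threshold,  // assumed knob
                                   std::size_t reply_size,
                                   std::size_t reply_limit) {
  const bool over_limit = reply_limit != 0 && reply_size > reply_limit;
  const bool worth_squashing = pending_cmds >= squash_threshold;
  return (worth_squashing && !over_limit) ? DispatchMode::kSquashPipeline
                                          : DispatchMode::kSingleCommand;
}
```

Because the check runs inside the async fiber right before it would squash, it sees the reply size at the moment of the decision, which addresses the stale-decision concern raised earlier about checking at dispatch time.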

kostasrim self-assigned this Apr 11, 2025

kostasrim force-pushed the kpr1 branch from 56acf77 to 7f19ba2 Compare April 22, 2025 11:37

kostasrim added 2 commits April 22, 2025 14:37

chore: event count throttle for squashed commands

8f53c94

Signed-off-by: kostas <kostas@dragonflydb.io>

Merge branch 'main' into kpr1

8aa26b2

kostasrim force-pushed the kpr1 branch from 7f19ba2 to 8aa26b2 Compare April 22, 2025 11:37

kostasrim commented Apr 22, 2025

kostasrim changed the title ~~[experiment do not review] chore: reject squashed when crb exceeds limit~~ chore: reject squashed when crb exceeds limit Apr 22, 2025

kostasrim marked this pull request as ready for review April 22, 2025 11:40

kostasrim requested a review from adiholden April 22, 2025 11:40

kostasrim changed the title ~~chore: reject squashed when crb exceeds limit~~ chore: event count throttle for squashed commands Apr 22, 2025

kostasrim commented Apr 22, 2025

adiholden reviewed Apr 22, 2025

fixes

2faa2a0

kostasrim requested review from romange and adiholden April 30, 2025 11:42

adiholden reviewed May 4, 2025

romange reviewed May 4, 2025

kostasrim added 2 commits May 6, 2025 14:01

Merge branch 'main' into kpr1

8e3134e

comments

44c607a

kostasrim mentioned this pull request Jun 6, 2025

Throttle pipeline connections if capture reply builder is above threshold #5239

Open

Merge branch 'main' into kpr1

d8ad66a

tmp

0446feb

adiholden reviewed Jun 17, 2025

kostasrim added 3 commits June 19, 2025 16:18

comments

2a77f81

Merge branch 'main' into kpr1

3a47352

remove unused

77856c2

adiholden reviewed Jun 24, 2025
