A84: PID LB policy #430

s-matyukevich · 2024-04-26T19:50:11Z

Follow up on #383 and #423

This proposal was implemented and tested with a few real apps in a production environment; here are the results for one of the apps:

A80-pid.md

markdroth

This looks very interesting -- and the graphs definitely show impressive results!

My main concern here is the comment about making the WRR policy extensible as a public API. I think that needs some careful thought.

I'd like @ejona86 and @dfawley to review as well.

Please let me know if you have any questions. Thanks!

markdroth · 2024-07-12T16:05:19Z

A80-pid.md

@@ -0,0 +1,290 @@
+A68: PID LB policy.


This says A68, and the filename says A80, and both of those numbers are already taken. :)

Looks like the next available number is A84, so let's use that.

markdroth · 2024-07-12T18:35:07Z

A80-pid.md

+message PIDLbConfig {
+  // Configuration for the WRR load balancer as defined in [gRFC A58][A58].
+  // The PID balancer is an extension of WRR and all settings applicable to WRR also apply to PID identically.
+  WeightedRoundRobinLbConfig wrr_config = 1;


I don't think we want to include the WRR config here. Instead, let's just duplicate the fields from that proto in this one, so that the two are independent.

markdroth · 2024-07-12T18:36:51Z

A80-pid.md

+
+  // Threshold beyond which the balancer starts considering the ErrorUtilizationPenalty.
+  // This helps avoid oscillations in cases where the server experiences a very high and spiky error rate.
+  // We avoid eliminating the error_utilization_penalty entirely to prevent redirecting all traffic to an instance 


If this is the only thing that you need the error utilization penalty for, an alternative would be to set the penalty to zero and instead use outlier detection (see gRFC A50) to avoid sending traffic to such a backend. Then we presumably wouldn't need this knob.

Good point, we can use outlier detection with zero utilization penalty for servers with very high and spiky error rates. I can remove this knob.

markdroth · 2024-07-12T18:46:52Z

A80-pid.md

+}
+```
+
+The proposal is to make `wrrCallbacks` public. This has a number of significant benefits. Besides PID, there are other cases where one might need to extend `wrr`. For example, Spotify [demonstrates](https://www.youtube.com/watch?v=8E5zVdEfwi0) a gRPC load balancer to reduce cross-zone traffic – this can be implemented nicely in terms of `wrr` weights. We are also considering the same and incorporating things like latency into our load balancing decisions. Existing ORCA extension points don't cover these use cases. We leverage ORCA for custom server utilization metrics, but we also need the ability to combine server and client metrics to generate the resulting weight. The alternative is to write our own balancer with custom EDF scheduler and handle details related to subchannel management and interactions with resolvers. With this new API, use cases like this can be covered naturally, users have full control over the end-to-end definition of weights.


I am hesitant to make this a public API. This seems like a valuable internal thing to allow us to reuse LB policy implementations, but making it public implies a stability commitment that I'm not super comfortable with.

It's worth noting that the example of using latency would require that this API surface be even broader, because we'd need a way for this to trigger the LB policy to measure the latency on each RPC. (We do have that mechanism internally, but we'd need to expose it via this API in addition to the existing mechanism as part of the LB policy API itself.)

I'd like to hear thoughts from @ejona86 and @dfawley on this.

I expected this answer but decided to give it a try anyway. Making this interface private works for us as well: the next time we need to extend PID or WRR, we'll come back with more gRFCs. I assume that it is much easier to make a private interface public than the reverse, so maybe we can revisit this decision in the future when we have more data and better-defined use cases where such an interface might be useful.

For now I can remove this paragraph and replace it with a note that the interface will be private. @markdroth does this work for you?

markdroth · 2024-07-12T18:53:02Z

A80-pid.md

+
+### Moving Average Window for Load Reporting
+
+As outlined in the previous section, smoothing the utilization measurements in server load reports is essential for the `pid` balancer to achieve convergence on spiky workloads. To address this, we propose integrating a moving average window mechanism into the `MetricRecorder` component, as described in [gRFC A51][A51]. This involves adding a `MovingAverageWindowSize` parameter to the component. Instead of storing a single value per metric, `MetricRecorder` will now maintain the last MovingAverageWindowSize reported values in a circular buffer. The process is detailed in the following pseudo-code:


It's not clear to me that this is something we need to build directly into gRPC APIs. Can't the application do this smoothing itself before it reports the data to the gRPC MetricRecorder?

Yes, that's what we are doing now. However, this is not a trivial amount of code, and PID almost certainly requires it. If we provide the balancer but not the smoothing implementation, the UX won't be ideal: users will most likely start using it without any smoothing and soon conclude that it doesn't work.
My plan was to document that smoothing is required for PID and then say "Use property X of the MetricRecorder to configure smoothing". IMO this is a lot better than saying "implement smoothing yourself".

If this is really critical to the behavior of the PID policy, wouldn't it be better to build this directly into the PID policy, so that it's not possible to have a misconfiguration between the client and the server? In other words, why not do this smoothing on the client side?

We considered this but decided to implement it on the server for the following reasons:

1. How much we should smooth is mostly a server property: it depends on how spiky the workload is, which in turn depends on how expensive individual requests are and how they are distributed in time. It is easier to reason about this from the perspective of the service owner than from that of the client.

2. Because of the first point, it doesn't really make sense to have different clients with different smoothing parameters.

3. It protects the server from misconfigured clients: if some clients don't use enough smoothing, this could easily result in oscillations, which could hurt the server.

4. Most importantly, smoothing on the server is more accurate, as it takes the actual load into account and is not delayed in time. If we do smoothing on the client, the result might be less accurate, as the client only has data sampled at random points. If there is a big CPU spike between two consecutive requests sent by a particular client, client-side smoothing will miss it, so different clients may end up with different views of the server load. If we do smoothing on the server we don't have such problems, as we can sample CPU as frequently as every few milliseconds and use the monotonically increasing cgroup CPU counter, so we never miss any CPU spikes or drops.
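
For illustration, the server-side moving average window discussed in this thread could look roughly like the Go sketch below. The quoted proposal only says that the last `MovingAverageWindowSize` values are kept in a circular buffer; the `movingAverage` type, its fields, and its methods are hypothetical names chosen for this example and are not part of any gRPC API.

```go
package main

import "fmt"

// movingAverage keeps the last len(buf) samples in a circular buffer and
// maintains their running sum so the mean can be read in O(1).
// Hypothetical helper for illustration; not part of gRPC's MetricRecorder.
type movingAverage struct {
	buf   []float64
	next  int     // index that the next sample will overwrite
	count int     // number of valid samples, at most len(buf)
	sum   float64 // sum of the valid samples
}

// newMovingAverage creates a window of the given size (at least 1).
func newMovingAverage(size int) *movingAverage {
	if size < 1 {
		size = 1
	}
	return &movingAverage{buf: make([]float64, size)}
}

// add records a sample, evicting the oldest one once the window is full,
// and returns the mean of the samples currently in the window.
func (m *movingAverage) add(v float64) float64 {
	if m.count == len(m.buf) {
		m.sum -= m.buf[m.next] // drop the sample being overwritten
	} else {
		m.count++
	}
	m.buf[m.next] = v
	m.sum += v
	m.next = (m.next + 1) % len(m.buf)
	return m.sum / float64(m.count)
}

func main() {
	avg := newMovingAverage(3)
	for _, u := range []float64{0.2, 0.4, 0.9, 0.3} {
		fmt.Printf("reported=%.2f smoothed=%.2f\n", u, avg.add(u))
	}
}
```

A server would feed each raw utilization sample through `add` and report the returned value. A long-running process might periodically recompute the sum from the buffer to limit floating-point drift; that is left out here for brevity.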

Bullets 1-3: these are resolved by this being configurable via xds/service config. The service is still in control. We do agree the service should be in control, but that's a bit different than which side calculates the average.

Bullet 4: We agree it is very helpful to have the higher sampling when producing the utilization. That would matter a lot for something like memory utilization, which can be measured at any instant. CPU utilization however is always an average over some time period (cpu seconds used / time period). There is no instantaneous value (other than a weak boolean (running/not running) per core). WRR assumes the cpu utilization is roughly the same time period as rps, so I'd hope server-side is already averaging over at least a second-ish. We'd be very interested if this is wildly inaccurate. It seems load for short periods should be covered.

For longer time periods, an exponential moving average on the client side would seem sufficient (updated on each weight recalculation). The biggest concern would be too-infrequent utilization updates, but even with server-side smoothing PD won't be able to function in such a case, as there is no feedback loop.

Part of the concern about server-side smoothing is that the smoothing period matters to the D behavior. Having it server-side would make it harder to guarantee that all the knobs are self-consistent and harder to change the smoothing period.

Aside: Understanding the server utilization monitoring period is essential for monitoring spikiness. If the PD oscillates at a rate faster than the server's utilization monitoring frequency, then you won't see the oscillations even though they are happening. This is a risk I've seen in multiple utilization-based LB schemes that prioritize fast updates.

(I've done further review/consideration since this conversation so I'll have additional comments on this that I'll post separately. Specifically I'm considering slower/larger RPCs and thus longer weight_expiration_period cases, which will have interplay here.)
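
As a point of reference for the client-side alternative mentioned above, an exponential moving average that is updated on each weight recalculation can be sketched in a few lines of Go. The `ema` type and the `alpha` smoothing factor are assumptions made for this example; the thread does not specify how the average would be parameterized.

```go
package main

import "fmt"

// ema is an exponential moving average of utilization reports.
// alpha in (0, 1] is a hypothetical smoothing factor: larger values track
// new reports more closely, smaller values smooth more aggressively.
type ema struct {
	alpha  float64
	value  float64
	seeded bool
}

// update folds one utilization report into the average and returns the
// smoothed value. The first report seeds the average directly.
func (e *ema) update(utilization float64) float64 {
	if !e.seeded {
		e.value = utilization
		e.seeded = true
		return e.value
	}
	e.value = e.alpha*utilization + (1-e.alpha)*e.value
	return e.value
}

func main() {
	s := &ema{alpha: 0.5}
	for _, u := range []float64{0.4, 0.8, 0.8} {
		fmt.Printf("reported=%.2f smoothed=%.2f\n", u, s.update(u))
	}
}
```

Unlike a fixed window, this keeps only one value per backend, which fits a policy that already stores per-subchannel state; the trade-off is that the effective smoothing period depends on how often `update` is called.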

Trying to brainstorm new ideas here: what do you think about changing the structure of load reports a bit, so that they additionally and optionally report the timestamp at the time of measurement and the CPU time as a monotonically increasing counter? Basically, we take the same interface that the kernel provides via the getrusage syscall and have the server report the raw values returned by this syscall without any aggregation. Aggregation can then be done on the client, and we can use any type of smoothing we want and still get perfect accuracy. The PID balancer should try to use this new load report data, and if it is not present, it falls back to the default CPU utilization.

I am not 100% sure what would be the best way to handle AppUtilization in this case, but we have a couple of options:

- Make the new load report field more generic (e.g. call it UtilizationCounter) and make it take precedence over both CpuUtilization and AppUtilization.

- Make the existing AppUtilization field take precedence over the new field. In this case we can also either apply or skip client-side smoothing for AppUtilization.

ejona86

Mark, Doug, Craig, and I discussed this on Friday. These are essentially our notes from the meeting with some minor extra review items from me. I tried to say "we" for the meeting and "I" for the other things. But I'm also going to follow this with a review just from myself.

ejona86 · 2024-09-13T22:13:01Z

A80-pid.md

+  // Controls the convergence speed of the PID controller. Higher values accelerate convergence but may induce oscillations,
+  // especially if server load changes more rapidly than the PID controller can adjust. Oscillations might also occur due to
+  // significant delays in load report propagation or extremely spiky server load. To mitigate spiky loads, server owners should
+  // employ a moving average to smooth the load reporting. Default is 0.1.


It would be very hard for us to change these values later. We're thinking it'd be better for each service to define these. That'd mean we wouldn't define defaults for PD and require they be specified in the service config.

We'd still provide a suggestion. But we'd have a way to change that suggestion over time.
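
One way to enforce "no defaults" for the gains is to reject a config in which they are absent. The Go sketch below assumes a JSON service config with `proportionalGain` and `derivativeGain` keys; the `pidConfig` type, the key names, and `parsePIDConfig` are hypothetical and only illustrate the validation, not the final schema.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// pidConfig uses pointer fields so that an absent gain can be told apart
// from an explicit zero. Type and JSON key names are assumptions.
type pidConfig struct {
	ProportionalGain *float64 `json:"proportionalGain"`
	DerivativeGain   *float64 `json:"derivativeGain"`
}

// parsePIDConfig decodes the config and fails if either gain is missing,
// so the policy never has to pick a built-in default.
func parsePIDConfig(raw []byte) (*pidConfig, error) {
	var cfg pidConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, err
	}
	if cfg.ProportionalGain == nil || cfg.DerivativeGain == nil {
		return nil, errors.New("pid: proportionalGain and derivativeGain must be set explicitly")
	}
	return &cfg, nil
}

func main() {
	_, err := parsePIDConfig([]byte(`{"proportionalGain": 0.1}`))
	fmt.Println("missing derivativeGain:", err)
}
```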

ejona86 · 2024-09-13T22:13:11Z

A80-pid.md

+
+### Moving Average Window for Load Reporting
+
+As outlined in the previous section, smoothing the utilization measurements in server load reports is essential for the `pid` balancer to achieve convergence on spiky workloads. To address this, we propose integrating a moving average window mechanism into the `MetricRecorder` component, as described in [gRFC A51][A51]. This involves adding a `MovingAverageWindowSize` parameter to the component. Instead of storing a single value per metric, `MetricRecorder` will now maintain the last MovingAverageWindowSize reported values in a circular buffer. The process is detailed in the following pseudo-code:


Bullets 1-3: these are resolved by this being configurable via xds/service config. The service is still in control. We do agree the service should be in control, but that's a bit different than which side calculates the average.

Bullet 4: We agree it is very helpful to have the higher sampling when producing the utilization. That would matter a lot for something like memory utilization, which can be measured at any instant. CPU utilization however is always an average over some time period (cpu seconds used / time period). There is no instantaneous value (other than a weak boolean (running/not running) per core). WRR assumes the cpu utilization is roughly the same time period as rps, so I'd hope server-side is already averaging over at least a second-ish. We'd be very interested if this is wildly inaccurate. It seems load for short periods should be covered.

For longer time periods, an exponential moving average on the client side would seem sufficient (updated on each weight recalculation). The biggest concern would be too-infrequent utilization updates, but even with server-side smoothing PD won't be able to function in such a case, as there is no feedback loop.

Part of the concern about server-side smoothing is that the smoothing period matters to the D behavior. Having it server-side would make it harder to guarantee that all the knobs are self-consistent and harder to change the smoothing period.

Aside: Understanding the server utilization monitoring period is essential for monitoring spikiness. If the PD oscillates at a rate faster than the server's utilization monitoring frequency, then you won't see the oscillations even though they are happening. This is a risk I've seen in multiple utilization-based LB schemes that prioritize fast updates.

(I've done further review/consideration since this conversation so I'll have additional comments on this that I'll post separately. Specifically I'm considering slower/larger RPCs and thus longer weight_expiration_period cases, which will have interplay here.)

ejona86 · 2024-09-13T22:13:29Z

A80-pid.md

+
+## Abstract
+
+This document proposes a design for a new load balancing policy called pid. The term pid stands for [Proportional–integral–derivative controller](https://en.wikipedia.org/wiki/Proportional%E2%80%93integral%E2%80%93derivative_controller). This policy builds upon the [A58: weighted_round_robin LB policy (WRR)][A58] and requires direct load reporting from backends to clients. Similar to wrr, it utilizes client-side weighted round robin load balancing. However, unlike wrr, it does not determine weights deterministically. Instead, it employs a feedback loop with the pid controller to adjust the weights in a manner that allows the load on all backends to converge to the same value. The policy supports either per-call or periodic out-of-band load reporting as per [gRFC A51][A51].


I'd appreciate making it more clear earlier this is a PD controller, not full PID.

ejona86 · 2024-09-14T00:22:09Z

A80-pid.md

+
+The `pid` LB policy config will be as follows.
+
+```textproto


This is "protobuf", as it is the schema. Textproto is the textual representation of a message.

ejona86 · 2024-09-14T00:23:25Z

A80-pid.md

+```
+
+Here is how `pid` balancer implements `wrrCallbacks` interface.
+```


It'd be great to define this as ```go, just to have some level of syntax highlighting.

ejona86 · 2024-09-14T00:31:46Z

A80-pid.md

+
+The `update` method is expected to be called on a regular basis, with `samplingInterval` being the duration since the last update. The return value is the control signal which, if applied to the system, should minimize the control error. In the next section, we'll discuss how this control signal is converted to `wrr` weight.
+
+The `proportionalGain` and `derivativeGain` parameters are taken from the LB config. `proportionalGain` should be additionally scaled by the `WeightUpdatePeriod` value. This is necessary because derivative error is calculated like `controlErrorDerivative = (this.controlError - previousError) / samplingInterval.Seconds()` and dividing by a very small `samplingInterval` value makes the result too big. `WeightUpdatePeriod` is roughly equal to `samplingInterval` as we will be updating the PID state once per `WeightUpdatePeriod`.


s-matyukevich added 13 commits April 25, 2024 11:30

first draft

9ded058

chatGPT review

f1a4661

remove unused proposal

7e14d72

review comments

ff1e037

review comments

684ed20

split section

530beba

public wrrCallbacks interface

0d6f76e

change moving average pseudocode

71cb41d

fix formatting

1786b29

fix formatting

f1d8921

fix formatting

12110b6

fix formatting

dbc7adc

add link to the discussion

a1fe191

s-matyukevich commented Apr 29, 2024

s-matyukevich mentioned this pull request May 6, 2024

A68: Random subsetting with rendezvous hashing LB policy #423

Merged

markdroth changed the title ~~PID LB policy~~ A83: PID LB policy Jul 12, 2024

markdroth changed the title ~~A83: PID LB policy~~ A84: PID LB policy Jul 12, 2024

markdroth reviewed Jul 12, 2024

ejona86 reviewed Sep 15, 2024


