RFC: Hash-based routing #1222

Open
wants to merge 12 commits into
base: main
212 changes: 212 additions & 0 deletions toc/rfc/rfc-draft-hash-based-routing.md
@@ -0,0 +1,212 @@
# Meta

[meta]: #meta

- Name: Implementing a Hash-Based Load Balancing Algorithm for CF Routing
- Start Date: 2025-04-07
- Author(s): b1tamara, Soha-Albaghdady
- Status: Draft <!-- Acceptable values: Draft, Approved, On Hold, Superseded -->
- RFC Pull Request: https://github.com/cloudfoundry/community/pull/1222

## Summary

Cloud Foundry uses round-robin and least-connection algorithms for load balancing between Gorouters and backends. While
effective in many scenarios, these algorithms may not be ideal for certain use cases. Therefore, this RFC proposes
introducing hash-based routing on a per-route basis.
The hash-based load balancing algorithm uses the hash of a request header to make routing decisions. It distributes
users, rather than individual requests, across instances, which improves load balancing in specific scenarios.

## Motivation

Cloud Foundry offers two load balancing algorithms to manage request distribution between Gorouters and backends. The
round-robin algorithm ensures the number of requests is distributed equally across all available backends, and the
least-connection algorithm tries to keep the number of active requests equal across all backends. A recent enhancement
allows these load balancing algorithms to be configured on the application route level.

However, these existing algorithms are not ideal for scenarios that require routing based on specific identifiers.

An example scenario: users from different tenants send requests to application instances that establish connections to
tenant-specific databases.

![](rfc-draft-hash-based-routing/problem.drawio.png)

With the current load balancing algorithms, requests from each tenant eventually reach every application instance, and
each instance then creates connection pools to every customer database. As a result, tenants and instances can form a
full mesh, which leads to too many open connections to the customer databases and degrades performance. This
limitation highlights a gap in achieving efficient load distribution, particularly when backend services rely on
limited or memory-intensive resources. Hash-based routing can close this gap. In short, hash-based routing distributes
requests to application instances by using a stable hash derived from request identifiers, such as headers.

## Proposal

We propose introducing hash-based routing as a load balancing algorithm for use on a per-route basis to address the
issues described in the earlier scenario.

The approach leverages an HTTP header that is sent with each incoming request and contains the specific identifier.
Gorouter uses this identifier to compute a hash value, which serves as the basis for routing decisions.

In the previously mentioned scenario, the tenant ID acts as the identifier included in the header and serves as the
basis for hash calculation. This hash value determines the appropriate application instance for each request. All
requests with the same identifier are consistently routed to the same instance, and are routed to another instance only
when that instance is saturated. Consequently, the load balancing algorithm directs requests for a single tenant to a
particular application instance, so that the instance can minimize database connection overhead and optimize
connection pooling, which improves efficiency and system performance.

### Requirements

#### Only Application Per-Route Load Balancing

Hash-based load balancing solves a particular load pattern, rather than serving as a general-purpose load balancing
algorithm. Consequently, it will be configured exclusively as a per-route option for applications and will not be
offered as a global setting.

#### Minimal rehashing over all Gorouter VMs
Member

During a deploy, app instances will be changing all over the place. For large deployments this could take hours. Though now that I write this, I suppose the idea is that this is for very rare cases. This would never be on for all apps. So it is (hopefully) not going to be a big performance hit.

I would be interested in seeing performance metrics around this once it is written, to see the cost during a deployment.


Rehashing should be minimized, especially when the number of application instances changes over time.

When a new application instance (e.g. app_instance3) is added, Gorouter updates the mapping so that only part of the
hashes move to the new instance.

| Hash | Application instance(s) before | Application instance(s) after a new instance is added |
|-------|--------------------------------|-------------------------------------------------------|
| Hash1 | app_instance1 | app_instance1 |
| Hash2 | app_instance1 | app_instance3 |
| Hash3 | app_instance2 | app_instance2 |
| ... | ... | ... |
| HashN | app_instance2 | app_instance3 |

When the application is scaled down, Gorouter updates the mapping immediately after the route update, so that only the
hashes associated with the removed instance (here app_instance3) are remapped:

| Hash | Application instance(s) before | Application instance(s) after app_instance3 is removed |
|-------|--------------------------------|--------------------------------------------------------|
| Hash1 | app_instance1 | app_instance1 |
| Hash2 | app_instance3 | app_instance1 |
| Hash3 | app_instance2 | app_instance2 |
| ... | ... | ... |
| HashN | app_instance3 | app_instance2 |


#### Considering a balance factor

Before routing a request, Gorouter must evaluate the current load on each application instance against a balance
factor. The load is measured by the number of in-flight requests. For example, with a balance factor of 150, no
application instance should exceed 150% of the average number of in-flight requests across all application instances.
Requests that would exceed this limit must be routed to other application instances that are not overloaded.

Example:

| Application instance | Current request count | Current request count / Average number of in-flight requests |
|----------------------|-----------------------|--------------------------------------------------------------|
| app_instance1 | 10 | 20% |
| app_instance2 | 50 | 100% |
| app_instance3 | 90 | 180% |

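With an average of 50 in-flight requests and a balance factor of 150, the limit is 75 in-flight requests per instance.
app_instance3 currently handles 90 requests (180% of the average) and therefore exceeds the balance factor. As a
result, new requests that hash to app_instance3 must be routed to other application instances.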
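
The arithmetic of this example can be written down as a small Go sketch. The function name `overloaded` and the map-based input are hypothetical, and treating a balance factor of 0 as "overload check disabled" follows the review discussion rather than the RFC text.

```go
package main

import "fmt"

// overloaded reports, per instance, whether its in-flight request count
// exceeds balanceFactor percent of the average across all instances.
// A balanceFactor <= 0 disables the check (assumption of this sketch).
func overloaded(inFlight map[string]int, balanceFactor int) map[string]bool {
	result := make(map[string]bool, len(inFlight))
	if len(inFlight) == 0 || balanceFactor <= 0 {
		return result
	}
	total := 0
	for _, n := range inFlight {
		total += n
	}
	for name, n := range inFlight {
		// n > (total/len) * factor/100, rearranged to avoid division.
		result[name] = n*100*len(inFlight) > total*balanceFactor
	}
	return result
}

func main() {
	inFlight := map[string]int{
		"app_instance1": 10,
		"app_instance2": 50,
		"app_instance3": 90,
	}
	result := overloaded(inFlight, 150)
	for _, name := range []string{"app_instance1", "app_instance2", "app_instance3"} {
		fmt.Println(name, result[name])
	}
	// Only app_instance3 is above the limit of 75 in-flight requests.
}
```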

#### Deterministic handling of overflow traffic to the next application instance

An application instance is considered overloaded when its current request load exceeds the balance factor. Overflow
traffic should always be directed to the same next instance rather than to a random one.

One possible representation of deterministic handling is a ring:
Contributor

This is similar to round-robin, but the "starting point" is always the hash's target instance, right?


![](rfc-draft-hash-based-routing/HashRing.drawio.png)
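
A minimal Go sketch of the ring walk follows. It simplifies the ring to an ordered slice of instance names, and the function name `pick` as well as the fallback when every instance is overloaded are assumptions, not behaviour defined by this RFC.

```go
package main

import "fmt"

// pick starts at the position the hash maps to and walks the ring in a
// fixed direction until it finds an instance that is not overloaded.
// start must be a non-negative index, e.g. hash % len(ring).
func pick(ring []string, start int, isOverloaded func(string) bool) string {
	n := len(ring)
	if n == 0 {
		return ""
	}
	for step := 0; step < n; step++ {
		candidate := ring[(start+step)%n]
		if !isOverloaded(candidate) {
			return candidate
		}
	}
	// Assumption: if every instance is overloaded, keep the hash's target.
	return ring[start%n]
}

func main() {
	ring := []string{"app_instance1", "app_instance2", "app_instance3"}
	busy := map[string]bool{"app_instance3": true}
	isOverloaded := func(name string) bool { return busy[name] }

	// A hash that maps to app_instance3 overflows to the next instance on
	// the ring, which wraps around to app_instance1.
	fmt.Println(pick(ring, 2, isOverloaded))
}
```

Because the walk always starts at the hash's target and always proceeds in the same direction, the overflow instance for a given identifier is the same on every request as long as the set of overloaded instances does not change.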

### Required Changes

#### Gorouter

- Gorouter MUST be extended to take a specific identifier via the request header
- Gorouter MUST implement hash calculation, based on the provided header
- Gorouter SHOULD store the mapping between computed hash values and application instances locally to avoid
expensive recalculations for each incoming request
- Gorouters SHOULD NOT implement a distributed shared cache
- Gorouter MUST assess the current number of in-flight requests across all application instances mapped to a
particular route to consider overload situations
- Gorouter MUST update its local hash table following the registration or deregistration of an endpoint, ensuring
minimal rehashing
- Gorouter SHOULD NOT incur any performance hit when no apps use hash-based routing

For a detailed understanding of the workflows on Gorouter's side, please refer to the [activity diagrams](#diagrams).

#### Cloud Controller

- The `loadbalancing` property of
  the [route object](https://v3-apidocs.cloudfoundry.org/version/3.190.0/index.html#the-route-options-object) MUST be
  updated to include `hash` as an acceptable value
- The [route options object](https://v3-apidocs.cloudfoundry.org/version/3.190.0/index.html#the-route-options-object)
  MUST include two new properties, `hash_header` and `hash_balance`, to configure a request header as the hashing key
  and the balance factor
- Cloud Controller MUST implement validation of the following requirements:
  - The `hash_header` property is mandatory when load balancing is set to hash
  - The `hash_balance` property is optional when load balancing is set to hash. Leaving out `hash_balance` means the
    load situation will not be considered

Member

Given that "hash_balance" is optional, but also "values should be greater than 110", how would someone unset their "hash_balance" property?

Member

I think the pattern we (somewhat) agreed on is that providing `"property": null` will unset the property. It seems to be not stated in the RFC, so maybe it was only implemented that way or my memory is failing me.

Author

I might have accidentally removed the statement that `hash_balance: 0` unsets the property, in which case the overflow situation is not considered. I will add it here.

  - To account for overload situations, `hash_balance` values should be greater than 110. During the implementation
    phase, the values will be evaluated to identify the best fit for the recommended range

Contributor @peanball, Jun 30, 2025

Some points on the value of hash_balance:

  1. We should have a unit (i.e. 100%, not just 100), if it's a percentage.
  2. Do we want to keep it a percentage, or could it be a factor?
  3. Should it be absolute or relative?

i.e. 100% vs. 1.00.

The % might be misleading/hard to understand, because the actual determination of "overload" depends on the number of available instances if I understood correctly.

Thinking of it as "100% of in-flight requests" is wrong, but "100% balance of the part of requests against N instances" is quite complicated to think about.

A factor for "allowed imbalance of requests" of e.g. 1.2 could be easier to understand. But that's my view and might not represent the majority.

We could also consider an "overcommit" or "imbalance" factor or percentage. "120%" hash_balance would then be hash_imbalance (or hash_rebalance?) of 20% or 0.2.

Member

This is strongly inspired by this: https://docs.haproxy.org/2.8/configuration.html#4.2-hash-balance-factor.

I don't really get your explanation but this is how it works: It is only relative to the average load across all instances. Essentially avg. load = 100% and when you specify 120% you allow instances to have up to 120% of the average load but not more.

Contributor

That's great. Maybe we can link the source of inspiration then?

The HAProxy definition is also all over the place. A "factor" is an operand in a multiplication. They also don't use the "%" sign to indicate that it's not a factor of 125 but 125%, so 1.25. I would have the same comments on the HAProxy description to be honest.

For consistency we can keep it the same, but then also link to the place that we're consistent with (i.e. the HAProxy docs).

Contributor

I share your thought and do not think there is a need to be consistent with HAProxy. But for me, both solutions are fine; most important is a good (visual) explanation in the docs and CLI help!

Author

I am fine with both solutions too. I would keep this thread open to hear other opinions.

Member

I'd not link the HAProxy doc, it seems a bit arbitrary. I also don't have a strong preference for 125 vs. 1.25.

Contributor

My preference is actually on the "overcommit" / allowed imbalance, so 25% "more", not 125% of the whole.

HAProxy docs would only make sense to link if we stick with exactly their semantics. From what I read, we don't necessarily want that, so the link to HAProxy doesn't make sense. Fully agree.

  - For load balancing algorithms other than hash, the `hash_balance` and `hash_header` properties MUST NOT be set

An example manifest with these properties:

```yaml
version: 1
applications:
  - name: test
    routes:
      - route: test.example.com
        options:
          loadbalancing: hash
          hash_header: tenant-id
          hash_balance: 125
      - route: anothertest.example.com
        options:
          loadbalancing: least-connection
```

The decision to introduce plain keys was influenced by the following points:

- Simple to use
- It allows for easy addition of more load-balancing-related properties if new requirements arise in the future
- It complies with
the [RFC #0027 that introduced per-route options](https://github.com/cloudfoundry/community/blob/main/toc/rfc/rfc-0027-generic-per-route-features.md#proposal),
which states that the map must use strings as keys and can use numbers, strings, and the literals true and false as
values
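
The validation requirements listed for Cloud Controller can be summarised in a short sketch. It is written in Go only to illustrate the rules and does not reflect Cloud Controller code; the `RouteOptions` struct and the `validate` function are hypothetical. The recommendation that `hash_balance` be greater than 110 is deliberately not enforced, since the RFC leaves the recommended range open until the implementation phase.

```go
package main

import (
	"errors"
	"fmt"
)

// RouteOptions mirrors the proposed plain keys. The struct is hypothetical
// and exists only for this example.
type RouteOptions struct {
	LoadBalancing string
	HashHeader    string // empty means the property is not set
	HashBalance   *int   // nil means the property is not set
}

// validate applies the rules listed in this section.
func validate(o RouteOptions) error {
	if o.LoadBalancing == "hash" {
		if o.HashHeader == "" {
			return errors.New("hash_header is required when loadbalancing is hash")
		}
		// hash_balance is optional for hash-based routing.
		return nil
	}
	if o.HashHeader != "" || o.HashBalance != nil {
		return errors.New("hash_header and hash_balance must not be set unless loadbalancing is hash")
	}
	return nil
}

func main() {
	balance := 125
	fmt.Println(validate(RouteOptions{LoadBalancing: "hash", HashHeader: "tenant-id", HashBalance: &balance}))
	fmt.Println(validate(RouteOptions{LoadBalancing: "hash"}))
	fmt.Println(validate(RouteOptions{LoadBalancing: "least-connection", HashBalance: &balance}))
}
```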

### Components Where No Changes Are Required

#### CF CLI

The [current implementation of route option in the CF CLI](https://github.com/cloudfoundry/cli/blob/main/resources/options_resource.go)
supports the use of `--option KEY=VALUE`, where the key and value are sent directly to CC for validation. Consequently,
the `create-route`, `update-route`, and `map-route` commands require no modifications, as they already accept the
proposed properties.
Example:

```bash
cf create-route MY-APP example.com -n test -o loadbalancing=hash -o hash_header=tenant-id -o hash_balance=125
cf update-route MY-APP example.com -n test -o loadbalancing=hash -o hash_header=tenant-id -o hash_balance=125
cf update-route MY-APP example.com -n test -o loadbalancing=hash -o hash_header=tenant-id
cf update-route MY-APP example.com -n test -o loadbalancing=hash -o hash_balance=125
cf map-route MY-APP example.com -n test -o loadbalancing=hash -o hash_header=tenant-id -o hash_balance=125
```

#### Route-Emitter

The options are raw JSON and will be passed directly to the Gorouter without any modifications.

#### Route-Registrar

In the scope of this RFC, it is not planned to implement hash-based routing in route-registrar for platform-routes.

### Diagrams

#### An activity diagram for routing decision for an incoming request

![](rfc-draft-hash-based-routing/ActivityDiagram.drawio.png)

#### A simplified activity diagram for Gorouter's endpoint registration process

![](rfc-draft-hash-based-routing/EndpointRegistration.drawio.png)