# Meta

[meta]: #meta

- Name: Implementing a Hash-Based Load Balancing Algorithm for CF Routing
- Start Date: 2025-04-07
- Author(s): b1tamara, Soha-Albaghdady
- Status: Draft <!-- Acceptable values: Draft, Approved, On Hold, Superseded -->
- RFC Pull Request: https://github.com/cloudfoundry/community/pull/1222

## Summary

Cloud Foundry uses round-robin and least-connection algorithms for load balancing between Gorouters and backends. While
effective in many scenarios, these algorithms are not ideal for certain use cases. This RFC therefore proposes to
introduce hash-based routing on a per-route basis.
The hash-based load balancing algorithm uses the hash of a request header to make routing decisions, focusing on
distributing users across instances rather than individual requests, thereby improving load balancing in specific
scenarios.

## Motivation

Cloud Foundry offers two load balancing algorithms to manage request distribution between Gorouters and backends. The
round-robin algorithm distributes requests equally across all available backends, and the least-connection algorithm
tries to keep the number of active requests equal across all backends. A recent enhancement allows these load balancing
algorithms to be configured at the application route level.

However, the existing algorithms are not ideal for scenarios that require routing based on specific identifiers.

An example scenario: users from different tenants send requests to application instances that establish connections to
tenant-specific databases.



With the current load balancing algorithms, each tenant eventually creates a connection to each application instance,
which then creates connection pools to every customer database. As a result, all tenants might form a full mesh,
leading to too many open connections to the customer databases and degraded performance. This limitation highlights a
gap in achieving efficient load distribution, particularly when dealing with limited or memory-intensive resources in
backend services, and can be addressed through hash-based routing. In short, hash-based routing is an algorithm that
distributes requests to application instances using a stable hash derived from request identifiers, such as headers.

## Proposal

We propose introducing hash-based routing as a load balancing algorithm on a per-route basis to address the issues
described in the scenario above.

The approach leverages an HTTP header that is associated with each incoming request and contains a specific
identifier. This identifier is used to compute a hash value, which serves as the basis for routing decisions.

In the scenario above, the tenant ID acts as the identifier included in the header and serves as the basis for hash
calculation. The hash value determines the appropriate application instance for each request, ensuring that all
requests with this identifier are consistently routed to the same instance, or are routed to another instance when that
instance is saturated. Consequently, the load balancing algorithm directs requests for a single tenant to a particular
application instance, so that instance can minimize database connection overhead and optimize connection pooling,
enhancing efficiency and system performance.
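
For illustration, here is a minimal Go sketch of how such a hash could be computed from the configured request header. The choice of FNV-1a and all function names are assumptions of this sketch; the RFC does not prescribe a specific hash function:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"net/http"
)

// hashKeyFromRequest reads the configured header from the request and
// computes a stable 64-bit hash from its value. FNV-1a is an assumption
// of this sketch, not a requirement of the RFC.
func hashKeyFromRequest(r *http.Request, hashHeader string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(r.Header.Get(hashHeader)))
	return h.Sum64()
}

func main() {
	req, _ := http.NewRequest("GET", "http://test.example.com", nil)
	req.Header.Set("tenant-id", "tenant-42")
	fmt.Printf("hash for tenant-42: %d\n", hashKeyFromRequest(req, "tenant-id"))
}
```

Because the hash depends only on the header value, every Gorouter computes the same key for a given tenant without any coordination.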

### Requirements

#### Only Application Per-Route Load Balancing

Hash-based load balancing addresses a particular load pattern rather than serving as a general-purpose load balancing
algorithm. Consequently, it will be configured exclusively as a per-route option for applications and will not be
offered as a global setting.

#### Minimal rehashing over all Gorouter VMs

Rehashing should be minimized, especially when the number of application instances changes over time.

For the scenario when a new application instance (e.g. app_instance3) is added, Gorouter updates the mapping so that it
maps part of the hashes to the new instance:

| Hash  | Application instance(s) before | Application instance(s) after a new instance added |
|-------|--------------------------------|----------------------------------------------------|
| Hash1 | app_instance1                  | app_instance1                                      |
| Hash2 | app_instance1                  | app_instance3                                      |
| Hash3 | app_instance2                  | app_instance2                                      |
| ...   | ...                            | ...                                                |
| HashN | app_instance2                  | app_instance3                                      |

For the scenario when the application is scaled down, Gorouter updates the mapping immediately after the routes update,
so that it remaps the hashes associated with app_instance3:

| Hash  | Application instance(s) before | Application instance(s) after app_instance3 removed |
|-------|--------------------------------|-----------------------------------------------------|
| Hash1 | app_instance1                  | app_instance1                                       |
| Hash2 | app_instance3                  | app_instance1                                       |
| Hash3 | app_instance2                  | app_instance2                                       |
| ...   | ...                            | ...                                                 |
| HashN | app_instance3                  | app_instance2                                       |
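
One data structure with this property is a consistent-hash ring, where each instance owns an arc of the hash space and adding or removing an instance remaps only the hashes on the affected arc. The following Go sketch illustrates the idea under that assumption; it is not the proposed implementation, and production rings typically place several virtual nodes per instance for a more even split:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// ring is a minimal consistent-hash ring: each instance is placed at a
// point on the ring, and a request hash is owned by the first instance
// point at or after it (wrapping around).
type ring struct {
	points []uint64
	owner  map[uint64]string
}

func (r *ring) add(instance string) {
	p := hash64(instance)
	r.points = append(r.points, p)
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	r.owner[p] = instance
}

func (r *ring) remove(instance string) {
	p := hash64(instance)
	delete(r.owner, p)
	for i, pt := range r.points {
		if pt == p {
			r.points = append(r.points[:i], r.points[i+1:]...)
			break
		}
	}
}

func (r *ring) lookup(h uint64) string {
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around to the first point on the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	r := &ring{owner: map[uint64]string{}}
	r.add("app_instance1")
	r.add("app_instance2")
	tenants := []string{"tenant-1", "tenant-2", "tenant-3", "tenant-4"}
	for _, t := range tenants {
		fmt.Println(t, "->", r.lookup(hash64(t)))
	}
	// Adding a third instance remaps only the hashes on its arc;
	// all other tenants keep their previous instance.
	r.add("app_instance3")
	for _, t := range tenants {
		fmt.Println(t, "->", r.lookup(hash64(t)))
	}
}
```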

#### Considering a balance factor

Before routing a request, the current load on each application instance must be evaluated against a balance factor.
Load is measured by the number of in-flight requests. For example, with a balance factor of 150, no application
instance should exceed 150% of the average number of in-flight requests across all application instances.
Consequently, requests to an overloaded instance must be distributed to other application instances.

Example:

| Application instance | Current request count | Current request count / average in-flight requests |
|----------------------|-----------------------|-----------------------------------------------------|
| app_instance1        | 10                    | 20%                                                 |
| app_instance2        | 50                    | 100%                                                |
| app_instance3        | 90                    | 180%                                                |

With an average of 50 in-flight requests and a balance factor of 150, the current request count of app_instance3 (180%
of the average) exceeds the threshold. As a result, new requests to app_instance3 must be distributed to different
application instances.
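
To make the arithmetic concrete, here is a small Go sketch of this check. Function names are hypothetical, and treating an unset `hash_balance` as 0 (disabling the check) is an assumption that matches the Cloud Controller rules below:

```go
package main

import "fmt"

// overloaded reports whether an instance's in-flight request count exceeds
// balanceFactor percent of the average in-flight count across all instances.
func overloaded(inFlight map[string]int, instance string, balanceFactor int) bool {
	if balanceFactor == 0 || len(inFlight) == 0 {
		return false // hash_balance unset: overload is not considered
	}
	total := 0
	for _, n := range inFlight {
		total += n
	}
	avg := float64(total) / float64(len(inFlight))
	return float64(inFlight[instance]) > avg*float64(balanceFactor)/100.0
}

func main() {
	inFlight := map[string]int{
		"app_instance1": 10,
		"app_instance2": 50,
		"app_instance3": 90,
	}
	// The average is 50; with a balance factor of 150 the threshold is 75,
	// so app_instance3 (90 in-flight requests) is overloaded.
	for _, inst := range []string{"app_instance1", "app_instance2", "app_instance3"} {
		fmt.Println(inst, "overloaded:", overloaded(inFlight, inst, 150))
	}
}
```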

#### Deterministic handling of overflow traffic to the next application instance

An application instance is considered overloaded when its current request load exceeds the balance factor. Overflow
traffic should always be directed to the same next instance rather than to a random one.

A possible representation of deterministic handling is a ring:


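
Under the same ring model sketched earlier, deterministic overflow can be realized by walking clockwise from the owning instance to the next instance point. A hedged Go sketch, where the instance points and the overload predicate are assumptions of the example:

```go
package main

import (
	"fmt"
	"sort"
)

// nextInstance walks clockwise on the ring from the instance that owns
// hash h and returns the first instance that is not overloaded, so
// overflow from a given instance always lands on the same neighbor.
func nextInstance(points []uint64, owner map[uint64]string, h uint64, overloaded func(string) bool) string {
	if len(points) == 0 {
		return ""
	}
	sort.Slice(points, func(i, j int) bool { return points[i] < points[j] })
	start := sort.Search(len(points), func(i int) bool { return points[i] >= h }) % len(points)
	for i := 0; i < len(points); i++ {
		inst := owner[points[(start+i)%len(points)]]
		if !overloaded(inst) {
			return inst
		}
	}
	return owner[points[start]] // every instance overloaded: keep the owner
}

func main() {
	points := []uint64{100, 200, 300}
	owner := map[uint64]string{100: "app_instance1", 200: "app_instance2", 300: "app_instance3"}
	isOverloaded := func(inst string) bool { return inst == "app_instance2" }
	// Hash 150 is owned by app_instance2; because that instance is
	// overloaded, the request deterministically overflows to the next
	// point on the ring, app_instance3.
	fmt.Println(nextInstance(points, owner, 150, isOverloaded))
}
```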

### Required Changes

#### Gorouter

- Gorouter MUST be extended to take a specific identifier via the request header
- Gorouter MUST implement hash calculation based on the provided header
- Gorouter SHOULD store the mapping between computed hash values and application instances locally to avoid
  expensive recalculations for each incoming request
- Gorouters SHOULD NOT implement a distributed shared cache
- Gorouter MUST assess the current number of in-flight requests across all application instances mapped to a
  particular route to detect overload situations
- Gorouter MUST update its local hash table following the registration or deregistration of an endpoint, ensuring
  minimal rehashing
- Gorouter SHOULD NOT incur any performance hit when no apps use hash routing

For a detailed understanding of the workflows on Gorouter's side, please refer to the [activity diagrams](#diagrams).

#### Cloud Controller

- The `loadbalancing` property of
  the [route options object](https://v3-apidocs.cloudfoundry.org/version/3.190.0/index.html#the-route-options-object) MUST be
  updated to include `hash` as an acceptable value
- The [route options object](https://v3-apidocs.cloudfoundry.org/version/3.190.0/index.html#the-route-options-object)
  MUST include two new properties, `hash_header` and `hash_balance`, to configure a request header as the hashing key
  and the balance factor
- Cloud Controller MUST validate the following requirements:
  - The `hash_header` property is mandatory when load balancing is set to `hash`
  - The `hash_balance` property is optional when load balancing is set to `hash`. Leaving out `hash_balance` means the
    load situation will not be considered
  - To account for overload situations, `hash_balance` values should be greater than 110. During the implementation
    phase, the values will be evaluated to identify the best fit for the recommended range
  - For load balancing algorithms other than `hash`, the `hash_balance` and `hash_header` properties MUST NOT be set
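
These validation rules could be expressed roughly as in the following Go sketch. Cloud Controller is not written in Go, so this only illustrates the rules; struct and function names are assumptions:

```go
package main

import (
	"errors"
	"fmt"
)

// RouteOptions mirrors the proposed per-route options; the field names
// follow the manifest keys below, the type itself is hypothetical.
type RouteOptions struct {
	LoadBalancing string `json:"loadbalancing"`
	HashHeader    string `json:"hash_header,omitempty"`
	HashBalance   int    `json:"hash_balance,omitempty"`
}

func validate(o RouteOptions) error {
	if o.LoadBalancing == "hash" {
		if o.HashHeader == "" {
			return errors.New("hash_header is mandatory when loadbalancing is hash")
		}
		// hash_balance is optional; when set, it should exceed 110.
		if o.HashBalance != 0 && o.HashBalance <= 110 {
			return errors.New("hash_balance should be greater than 110")
		}
		return nil
	}
	if o.HashHeader != "" || o.HashBalance != 0 {
		return errors.New("hash_header and hash_balance require loadbalancing: hash")
	}
	return nil
}

func main() {
	fmt.Println(validate(RouteOptions{LoadBalancing: "hash", HashHeader: "tenant-id", HashBalance: 125}))
	fmt.Println(validate(RouteOptions{LoadBalancing: "least-connection", HashBalance: 125}))
}
```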

An example manifest with these properties:

```yaml
version: 1
applications:
- name: test
  routes:
  - route: test.example.com
    options:
      loadbalancing: hash
      hash_header: tenant-id
      hash_balance: 125
  - route: anothertest.example.com
    options:
      loadbalancing: least-connection
```

The decision to introduce plain keys was influenced by the following points:

- They are simple to use
- They allow for easy addition of more load-balancing-related properties if new requirements arise in the future
- They comply with
  the [RFC #0027 that introduced per-route options](https://github.com/cloudfoundry/community/blob/main/toc/rfc/rfc-0027-generic-per-route-features.md#proposal),
  which states that the map must use strings as keys and can use numbers, strings, and the literals `true` and `false`
  as values

### Components Where No Changes Are Required

#### CF CLI

The [current implementation of route options in the CF CLI](https://github.com/cloudfoundry/cli/blob/main/resources/options_resource.go)
supports the use of `--option KEY=VALUE`, where the key and value are sent directly to CC for validation. Consequently,
the `create-route`, `update-route`, and `map-route` commands require no modifications, as they already accept the
proposed properties.

Example:

```bash
cf create-route MY-APP example.com -n test -o loadbalancing=hash -o hash_header=tenant-id -o hash_balance=125
cf update-route MY-APP example.com -n test -o loadbalancing=hash -o hash_header=tenant-id -o hash_balance=125
cf update-route MY-APP example.com -n test -o loadbalancing=hash -o hash_header=tenant-id
cf update-route MY-APP example.com -n test -o loadbalancing=hash -o hash_balance=125
cf map-route MY-APP example.com -n test -o loadbalancing=hash -o hash_header=tenant-id -o hash_balance=125
```

#### Route-Emitter

The options are raw JSON and will be passed directly to the Gorouter without any modifications.
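
Since the options travel as raw JSON, the receiving side only needs to decode them on route registration. A minimal sketch, where the struct and field names are assumptions mirroring the manifest keys:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// routeOptions is a hypothetical Gorouter-side view of the per-route
// options that route-emitter forwards unchanged as raw JSON.
type routeOptions struct {
	LoadBalancing string `json:"loadbalancing"`
	HashHeader    string `json:"hash_header"`
	HashBalance   int    `json:"hash_balance"`
}

func main() {
	raw := []byte(`{"loadbalancing":"hash","hash_header":"tenant-id","hash_balance":125}`)
	var opts routeOptions
	if err := json.Unmarshal(raw, &opts); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", opts)
}
```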

#### Route-Registrar

Implementing hash-based routing in route-registrar for platform routes is not planned within the scope of this RFC.

### Diagrams

#### An activity diagram for the routing decision for an incoming request

![An activity diagram for routing decision]

#### A simplified activity diagram for Gorouter's endpoint registration process

![A simplified activity diagram for the endpoint registration]