Skip to content

RFC: Hash-based routing #1222

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

b1tamara
Copy link

@b1tamara b1tamara commented Jun 26, 2025

Cloud Foundry uses round-robin and least-connection algorithms for load balancing between Gorouters and backends. Still, they are unsuitable for specific use cases, prompting a proposal to introduce hash-based routing on a per-route basis.

🚀 Link for easy viewing.

@beyhan beyhan requested review from a team, rkoster, beyhan, stephanme, ameowlia and ChrisMcGowan and removed request for a team June 26, 2025 08:44
@beyhan beyhan added the rfc CFF community RFC label Jun 26, 2025
@b1tamara
Copy link
Author

Hi @beyhan, thanks for picking this up. We are still doing a team-internal review first and keeping this PR in "Draft" state. We will ping again once it's ready for the big stage.

b1tamara and others added 3 commits June 27, 2025 12:39
Co-authored-by: Maximilian Moehl <44866320+maxmoehl@users.noreply.github.com>
@b1tamara b1tamara force-pushed the rfc-hash-based-routing branch from 2479883 to d996092 Compare June 30, 2025 08:20
The application instance is considered overloaded when the current request load of this application exceeds the balance
factor. Overflow traffic should always be directed to the same next instance rather than to a random one.

A possible presentation of deterministic handling can be a ring like:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is similar to round-robin, but the "starting point" is always the hash's target instance, right?

Copy link
Contributor

@peanball peanball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of the first part, before "Cloud Controller". Further review coming.

- The `hash_header` property is mandatory when load balancing is set to hash
- The `hash_balance` property is optional when load balancing is set to hash. Leaving out `hash_balance` means the
load situation will not be considered
- To account for overload situations, `hash_balance` values should be greater than 110. During the implementation
Copy link
Contributor

@peanball peanball Jun 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some points on the value of hash_balance:

  1. We should have a unit (i.e. 100%, not just 100), if it's percentage.
  2. Do we want to keep it a percentage, or could it be a factor?
  3. Should it be absolute or relative?

i.e. 100% vs. 1.00.

The % might be misleading/hard to understand, because the actual determination of "overload" depends on the number of available instances if I understood correctly.

Thinking of it as "100% of in-flight requests" is wrong, but "100% balance of the part of requests against N instances" is quite complicated to think about.

A factor for "allowed imbalance of requests" of e.g. 1.2 could be easier to understand. But that's my view and might not represent the majority.

We could also consider an "overcommit" or "imbalance" factor or percentage. "120%" hash_balance would then be hash_imbalance (or hash_rebalance?) of 20% or 0.2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is strongly inspired by this: https://docs.haproxy.org/2.8/configuration.html#4.2-hash-balance-factor.

I don't really get your explanation but this is how it works: It is only relative to the average load across all instances. Essentially avg. load = 100% and when you specify 120% you allow instances to have up to 120% of the average load but not more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's great. Maybe we can link the source of inspiration then?

The HAProxy definition is also all over the place. A "factor" is an operand in a multiplication. They also don't use the "%" sign to indicate that it's not a factor of 125 but 125%, so 1.25. I would have the same comments to the HAProxy description to be honest.

For consistency we can keep it the same, but then also like to the place that we're consistent with (i.e. the HAProxy docs).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I share your thought and do not think there is a need to be consistent with HAProxy. But for me, both solutions are fine, most important is a good (visual) explanation in the docs and CLI-help!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with both solutions too. I would keep this thread open to hear other opinions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd not link the HAProxy doc, it seems a bit arbitrary. I also don't have a strong preference for 125 vs. 1.25.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference is actually on the "overcommit" / allowed imbalance, so 25% "more", not 125% of the whole.

HAProxy docs would only make sense to link if we stick with exactly their semantics. From what I read, we don't necessarily want that, so the link to HAProxy doesn't make sense. Fully agree.

b1tamara and others added 3 commits June 30, 2025 13:22
Copy link
Contributor

@plowin plowin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx!
Some minor comments.

General remark (to the TOC): from a reviewers perspective, the newline-format is not the best choice as it would be simpler to make a proposal for a full sentence.

- The `hash_header` property is mandatory when load balancing is set to hash
- The `hash_balance` property is optional when load balancing is set to hash. Leaving out `hash_balance` means the
load situation will not be considered
- To account for overload situations, `hash_balance` values should be greater than 110. During the implementation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I share your thought and do not think there is a need to be consistent with HAProxy. But for me, both solutions are fine, most important is a good (visual) explanation in the docs and CLI-help!

@b1tamara b1tamara marked this pull request as ready for review July 3, 2025 15:01
and the balance factor
- It MUST implement the validation of the following requirements:
- The `hash_header` property is mandatory when load balancing is set to hash
- The `hash_balance` property is optional when load balancing is set to hash. Leaving out `hash_balance` means the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that "hash_balance" is optional, but also "values should be greater than 110", how would someone unset their "hash_balance" property?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the pattern we (somewhat) agreed on is that providing "property": null will unset the property. It seems to be not stated in the RFC so maybe it was only implemented that way or my memory is failing me.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have removed accidentally that hash_balance: 0 will unset the property, and in this case, the overflow situation will not be considered. I will add it here.

algorithm. Consequently, it will be configured exclusively as a per-route option for applications and will not be
offered as a global setting.

#### Minimal rehashing over all Gorouter VMs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During a deploy app instances will be changing all over the place. For large deployments this could take hours. Though now that I write this, I suppose the idea is that this is for very rare cases. This would never be on for all apps. So it is (hopefully) not going to be a big performance hit.

I would be interested in seeing a performance metrics around this once it is written to see the cost during a deployment.

Co-authored-by: Amelia Downs <amelia.downs@broadcom.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rfc CFF community RFC
Projects
Status: Inbox
Development

Successfully merging this pull request may close these issues.

6 participants