Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support fail_traffic_on_panic without zone aware routing #17659

Closed
hochuenw-dd opened this issue Aug 10, 2021 · 6 comments
Closed

Support fail_traffic_on_panic without zone aware routing #17659

hochuenw-dd opened this issue Aug 10, 2021 · 6 comments
Labels
enhancement Feature requests. Not bugs or questions. stale stalebot believes this issue/PR has not been touched recently

Comments

@hochuenw-dd
Copy link

hochuenw-dd commented Aug 10, 2021

Hello folks,
I'm trying to implement client-side throttling using envoy. My use case is to prevent further overwhelming the upstream cluster when it's degraded/down. To do this, apart from the admission control (which is awesome but current doesn't support per cluster/vhost), I can see 2 potential solutions using outlier detection.

  • outlier detection + disabled panic mode
  • outlier detection + enabled panic mode + enabled fail_traffic_on_panic

In our environment, we have to disable panic mode, because to enable fail_traffic_on_panic, it needs to enable zone aware routing. To enable zone aware routing, according to this doc, it needs users to define static local cluster in the bootstrap file, and it also needs change in the upstream service. These are hard to do in our environment.

But I think a potential concern with panic mode disabled approach is that envoy would return 503 only after every upstream host is down. Let's say 80% upstream cluster is down, if we disable panic mode, the traffic may further overwhelm the rest 20% of the traffic and bring the whole upstream service down which is the opposite of what we want to do here. Instead, enabled panic mode + fail_traffic_on_panic don't have this issue because it can start failing all requests when x% of traffic is down.

I don't know whether this is just a rare case in reality. I guess most likely in reality 100% service would go down, so panic mode disabled should work fine. But I feel this may happen when the upstream cluster does a partial rollout (say 80% on v2, 20% on v1)

Is my understanding correct? If so, is there a way to use fail_traffic_on_panic without zone aware routing?

Thanks!!

@hochuenw-dd hochuenw-dd added enhancement Feature requests. Not bugs or questions. triage Issue requires triage labels Aug 10, 2021
@tonya11en
Copy link
Member

Can you link to the docs you're referencing? From my understanding, you don't need to have zone-aware routing enabled for this since it's a common LB config.

That being said, I really think what you want here is admission control-- this is exactly the scenario it's mean for, but it's just missing the per-cluster state tracking. The idea is based on client-side throttling from that Google SRE book.

Are you trying to accomplish this without any code changes? If so, I think you may be limited to what you mention in the description. Otherwise, you'll have a much better experience by adding this functionality to the admission control filter.

@hochuenw-dd
Copy link
Author

hochuenw-dd commented Aug 11, 2021

@tonya11en Thanks for the reply!

Can you link to the docs you're referencing? From my understanding, you don't need to have zone-aware routing enabled for this since it's a common LB config.

To disable panic mode (set healthy_panic_threshold to 0), users don't need zone-aware routing. But that may cause the issue I described above (clients may overwhelm servers unless ALL servers are down).
To enable panic mode and fail requests immediately when it's in panic mode (set fail_traffic_on_panic to true), I think users need to have zone-aware routing since it's a zone-aware lb config (#8024). But I feel like technically zone-aware routing shouldn't be a requirement. That's why I created this issue asking if we can extend it to the common lb config. If this can be added, clients wouldn't overwhelm servers because envoy starts returning 503 directly when only x% of the servers are down (vs 100% when we disable panic mode).

That being said, I really think what you want here is admission control-- this is exactly the scenario it's mean for, but it's just missing the per-cluster state tracking. The idea is based on client-side throttling from that Google SRE book.

Are you trying to accomplish this without any code changes? If so, I think you may be limited to what you mention in the description. Otherwise, you'll have a much better experience by adding this functionality to the admission control filter.

I agree admission control is better for our use case if it supports per-cluster state tracking. We are trying to achieve the goal in a very short term (by the end of August), so for now contributing to the admission control filter is not an option for me given my knowledge to envoy codebase. But I may look into this later.

@ggreenway ggreenway removed the triage Issue requires triage label Aug 11, 2021
@hochuenw-dd
Copy link
Author

Hi @csssuf, I see you added the fail_traffic_on_panic feature. Is my above understanding correct (users have to turn on zone-aware routing if they want to use fail_traffic_on_panic)? Does this feature request sounds reasonable to you? Thanks

@github-actions
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Sep 13, 2021
@github-actions
Copy link

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@gongcon
Copy link

gongcon commented Apr 12, 2024

Can you reopen this issue? We use locality weighted lb instead of zone aware lb. Per CommonLbConfig doc, only one of zone_aware_lb_config, locality_weighted_lb_config may be set. We are out of luck to use fail_traffic_on_panic.

Question is why zone aware lb is required for fail_traffic_on_panic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests. Not bugs or questions. stale stalebot believes this issue/PR has not been touched recently
Projects
None yet
Development

No branches or pull requests

4 participants