Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grpc/outlier: inspect response grpc status code to determine host success rate #860

Closed
junr03 opened this issue Apr 28, 2017 · 10 comments
Closed
Labels
enhancement Feature requests. Not bugs or questions. help wanted Needs help! no stalebot Disables stalebot from closing an issue

Comments

@junr03
Copy link
Member

junr03 commented Apr 28, 2017

Right now we base success rate outlier detection only on HTTP status codes, which excludes request failures based on grpc status codes.

The router filter should look for grpc status codes and feed a failure to the outlier detector in cases when the HTTP status code is a success.

Related to #721

@junr03 junr03 added the enhancement Feature requests. Not bugs or questions. label Apr 28, 2017
@junr03 junr03 added this to the 1.4.0 milestone Apr 28, 2017
@junr03 junr03 self-assigned this Apr 28, 2017
@mattklein123
Copy link
Member

cc @louiscryan

@junr03 junr03 changed the title Inspect response grpc status code to determine host success rate grpc/outlier: inspect response grpc status code to determine host success rate Apr 28, 2017
@mattklein123 mattklein123 modified the milestones: 1.4.0, 1.5.0 Jul 21, 2017
@mattklein123 mattklein123 assigned mattklein123 and unassigned junr03 Oct 6, 2017
@mattklein123 mattklein123 modified the milestones: 1.5.0, 1.6.0 Nov 3, 2017
@mattklein123 mattklein123 removed this from the 1.6.0 milestone Jan 9, 2018
@mattklein123 mattklein123 removed their assignment Mar 11, 2018
@stale
Copy link

stale bot commented Jun 20, 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jun 20, 2018
@mattklein123 mattklein123 added the help wanted Needs help! label Jun 20, 2018
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Jun 20, 2018
@louiscryan
Copy link

cc @lizan

I don't have time to pick this up but for context there is a semi-canonical mapping from GRPC to HTTP status codes that might be useful

https://github.com/grpc/grpc/blob/master/doc/http-grpc-status-mapping.md

@lizan
Copy link
Member

lizan commented Jun 20, 2018

Yes, envoy has implemented a reverse mapping as well.
https://github.com/envoyproxy/envoy/blob/master/source/common/grpc/status.cc#L28

@wenbozhu
Copy link

As we are working on the fix , it seems that the grpc trailer status needs be handled explicitly (possibly as a new concept) in OutlierDetector
for both monitoring (so we don't overwrite the initial ok status) and outlier detection purposes.

Extending the non-HTTP Result enum doesn't seem right either.

Any thoughts on this? I'd like to file a new issue to add explicit trailer status support in OutlierDetector.

@mattklein123
Copy link
Member

@wenbozhu can you describe in more detail what the issue is? Unfortunately the outlier detection code and the way we do mappings is incredibly complicated. cc @cpakulski.

@wenbozhu
Copy link

Currently, grpc status in response headers is not checked in OutlierDetector.

We are fixing this behavior with #7942
I.e. when grpc status is present, it will be mapped to http status, which will be set to OutlierDetector.

We believe this is the minimum fix required to make OutlierDetector work for grpc services.

However, we also need define how OutlierDetector should handle the trailer status (when it's an error). We believe we should not delay recording the initial status till the trailer is received or the stream is aborted.

For monitoring, we want to differentiate between immediate failures and "partial" failures as indicated by an error trailer status.

For ejection etc, I think we need more discussion, e.g. whether a new API needs be introduced or we should just treat trailer errors the same way as HTTP error status. Even with the latter approach, there will be extra complexity to handle the extra status for a single request.

===

Also as I understand, HTTP doesn't allow status code (header) to be overwritten in a trailer, so the trailer status support will be unique to grpc requests.

@mattklein123
Copy link
Member

Thanks for the detail. I will review #7942 as I agree this is very complicated and it took us a while to get through the last round of changes in this area that @cpakulski was doing.

Even with the latter approach, there will be extra complexity to handle the extra status for a single request.

Agreed we need to think about this carefully, double counting has caused us issues in the past. We may need to split out gRPC into its own thing. Let's discuss in the other PR and I'm hoping @cpakulski can weigh in also.

@cpakulski
Copy link
Contributor

cpakulski commented Aug 23, 2019 via email

@lizan
Copy link
Member

lizan commented Sep 12, 2019

Fixed in #7942

@lizan lizan closed this as completed Sep 12, 2019
jpsim pushed a commit that referenced this issue Nov 28, 2022
Description: move factory registration out of the engine file. This will eventually allow third party builds to replace the extensions installed by the build system. The directory name and location is not necessarily final; this is just the first step in separating this out to its own target.
Risk Level: low
Testing: existing CI

Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: JP Simard <jp@jpsim.com>
jpsim pushed a commit that referenced this issue Nov 29, 2022
Description: move factory registration out of the engine file. This will eventually allow third party builds to replace the extensions installed by the build system. The directory name and location is not necessarily final; this is just the first step in separating this out to its own target.
Risk Level: low
Testing: existing CI

Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: JP Simard <jp@jpsim.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests. Not bugs or questions. help wanted Needs help! no stalebot Disables stalebot from closing an issue
Projects
None yet
Development

No branches or pull requests

7 participants