
Add retry policies for gRPC error codes #721

Closed
twoism opened this issue Apr 7, 2017 · 22 comments


twoism commented Apr 7, 2017

From the documentation on retries, the x-envoy-retry-on header can be configured for handling HTTP status codes like 5XX and 4XX. This works great for HTTP services. However, with gRPC, every response (if the server is running properly) comes back as a 200 OK; the actual error code is carried within the gRPC response itself.

Would it be feasible to create a retry policy for an x-envoy-retry-grpc-on header that respects a list of gRPC error codes? There are a few codes that could be deemed retriable (CANCELLED, DEADLINE_EXCEEDED, RESOURCE_EXHAUSTED), but I am hesitant to make assumptions about implementation details within a service by grouping them together, which is why I think a list may work best. I'm open to ideas here.

Sample Header

x-envoy-retry-grpc-on: CANCELLED, DEADLINE_EXCEEDED, RESOURCE_EXHAUSTED

The existing retry header would then be configured as a fallback for when a service is unreachable and the HTTP status codes have more meaning.
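
As a rough sketch (hypothetical endpoint and route, and header values written as proposed here; the syntax Envoy ultimately adopts may differ), a caller routed through Envoy might attach both headers like this:

```python
# Illustrative only: hypothetical Envoy listener address and route; the
# proposed header-value syntax may not match what Envoy finally implements.
import requests

resp = requests.post(
    "http://envoy.local:8080/package.UserService/GetUser",  # hypothetical route
    headers={
        # Proposed: retry when the gRPC error code is one of these.
        "x-envoy-retry-grpc-on": "CANCELLED,DEADLINE_EXCEEDED,RESOURCE_EXHAUSTED",
        # Existing fallback: retry on transport-level / HTTP failures.
        "x-envoy-retry-on": "connect-failure,5xx",
        "x-envoy-max-retries": "3",
    },
    data=b"",  # a real gRPC call would carry a length-prefixed protobuf frame
)
print(resp.status_code, resp.headers.get("grpc-status"))
```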

@mattklein123 mattklein123 added the enhancement Feature requests. Not bugs or questions. label Apr 8, 2017
@mattklein123 mattklein123 added this to the 1.3.0 milestone Apr 8, 2017

mattklein123 commented Apr 8, 2017

This makes sense to me. @louiscryan @lizan @fengli79 @ctiller do you have any thoughts/comments on this?


fengli79 commented Apr 8, 2017

Makes sense to me. We also need to clarify whether these headers will be stripped by Envoy or propagated to the next hop in the multi-layer proxying case. I expect them to be propagated, since the retry will happen at a place closer to the backend.
A per-URL config in Envoy would also be a good complement.
Also, it would probably be better to formalize this in gRPC so that all proxies can have consistent behavior.


ctiller commented Apr 8, 2017 via email

@mattklein123

@alyssawilk is going to take this one. @alyssawilk feel free to follow up with me if you need any extra context.


alyssawilk commented May 25, 2017 via email


htuch commented May 26, 2017

@alyssawilk Can we just not do retries unless the user configures the router filter as both? I think it's fair to expect a config change to opt into new behavior. I agree we don't want to break existing users by requiring both for existing behavior; that would be a breaking change according to https://github.com/lyft/envoy/blob/master/CONTRIBUTING.md#breaking-change-policy. We can document this requirement as well.

See also https://github.com/lyft/envoy-api/issues/63 for related discussion in v2 API.


alyssawilk commented May 26, 2017 via email


louiscryan commented May 26, 2017 via email

@alyssawilk

I suspect one can implement retries at the gRPC client fairly easily, but by punting to the client you end up adding the latency of the (usually longer-haul) proxy-to-client RTT. It would be nice to avoid that if there's a clean way to refactor.

@mattklein123

@alyssawilk I was envisioning a much simpler v1 implementation of this. Basically, we only handle gRPC errors where the error is part of a header-only response (immediate failure). Essentially, don't handle the trailers case at all. I think this will cover a large portion of the cases that people actually want to retry on and be very simple to implement (essentially identical to the HTTP variants, just looking at different headers). Then very little needs to change.

I agree that to do full retry semantics we would need to buffer, handle response trailers, etc., but IMO we can skip that for now, and if a client really wants that I think the client can do it (as long as we very clearly document what is implemented). Thoughts?
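
To sketch the headers-only idea (in Python rather than Envoy's C++, with hypothetical names; this is not the actual implementation), the check would look something like:

```python
# Minimal sketch of the headers-only retry decision described above.
# Names are hypothetical; this is not Envoy's actual code.

# Standard gRPC status code numbers.
RETRIABLE_GRPC_STATUSES = {
    1,   # CANCELLED
    4,   # DEADLINE_EXCEEDED
    8,   # RESOURCE_EXHAUSTED
    14,  # UNAVAILABLE
}

def should_retry(response_headers: dict, headers_only: bool) -> bool:
    """Retry only when the gRPC status arrives in a headers-only response.

    Errors delivered in trailers after a normal response are deliberately
    ignored in this simplified v1.
    """
    if not headers_only:
        return False
    status = response_headers.get("grpc-status")
    return status is not None and int(status) in RETRIABLE_GRPC_STATUSES

# An immediate RESOURCE_EXHAUSTED failure is retriable:
assert should_retry({":status": "200", "grpc-status": "8"}, headers_only=True)
# A normal response whose error arrives later in trailers is not handled:
assert not should_retry({":status": "200"}, headers_only=False)
```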


alyssawilk commented Jun 8, 2017 via email

@mattklein123

Hm, that would certainly be a lot easier, but my intuition is that most gRPC errors (especially timeouts etc.) would be in trailers rather than headers.

In Lyft's case, I think almost all of them would be header only responses FWIW (even timeout would be header only if app does not respond to unary request during deadline). cc @twoism and @rodaine for further comment for Lyft.


alyssawilk commented Jun 8, 2017 via email

@markdroth

Sorry for not chiming in on this sooner -- someone tried to tag me above with the wrong username, so I only saw this on Thursday. I spent some time on Thursday and Friday talking with other folks on the gRPC team about this, so that I could give you a clear picture of what we're planning to do for our retry implementation.

To avoid confusion, I'm going to use the terms defined in the gRPC wire protocol spec. It's also worth understanding that our API exposes Response-Headers as initial metadata and Trailers as trailing metadata, and the difference between them is semantically significant to applications.

Our retry design states that retries are only possible until the client receives Response-Headers from the server. This is because the client will return the initial metadata to the application as soon as it receives Response-Headers, and once we have given that metadata to the application, we cannot retry the RPC, because we might get different metadata when we retry. As a result, any RPC in which the server sends Response-Headers separately from Trailers is not retryable; only RPCs in which the server sends a Trailers-Only response are retryable.

That having been said, we are somewhat unsatisfied with this approach, since it basically eliminates the ability to send Response-Headers on a failure and still have the RPC be retryable -- users can send Response-Headers or have their RPCs be retryable, but not both. We would ideally like to avoid forcing applications to make this choice. So we are considering supporting a response in which the server sends both Response-Headers and Trailers but includes some special state in Response-Headers that tells the client that the request has failed and it should wait for the Trailers before returning any data to the application. We are currently gathering information about real-world use-cases before we decide whether or not to implement this, but even if we don't do it now, we could still decide to do it later.

In terms of current status, it's worth noting that gRPC's C-core implementation currently never sends out Trailers-Only responses. I have a PR open to fix that (grpc/grpc#10906), but it exposed a bug: Currently, when receiving a Trailers-Only response, we are returning the resulting metadata as initial metadata instead of trailing metadata. I will need to fix that before merging the aforementioned PR.

Taking a step back, one question I have here is whether it's necessary for Envoy to have its own retry implementation or whether it's possible for it to just use the one that we're adding to gRPC. I can imagine that there may be reasons to more closely integrate this into Envoy, and if so I'm not opposed to that. But I just want to make sure that we're not needlessly reinventing a wheel here.

Anyway, I hope the above info answers your questions. If I've missed something, or if you have any other questions or concerns, please don't hesitate to ask.


mattklein123 commented Jun 12, 2017

In terms of current status, it's worth noting that gRPC's C-core implementation currently never sends out Trailers-Only responses

@markdroth the Go impl does send trailers-only, right? Because I've definitely seen that here at Lyft.

Taking a step back, one question I have here is whether it's necessary for Envoy to have its own retry implementation or whether it's possible for it to just use the one that we're adding to gRPC

I'm happy to be swayed that we shouldn't do it, but I think there are 2 compelling reasons to add this support directly to envoy even for the trailers-only case.

  1. For clients that are not using the actual gRPC client. We do this at Lyft for example since the Python clients don't work for us (gevent). I think other companies deal with similar issues.
  2. For edge/high latency requests where we would like Envoy to retry before going all the way back to high latency client.

So IMO even supporting trailers-only in Envoy, as @alyssawilk's PR does, is pretty compelling. Let me know what you think about this.

@mattklein123

So we are considering supporting a response in which the server sends both Response-Headers and Trailers but includes some special state in Response-Headers that tells the client that the request has failed and it should wait for the Trailers before returning any data to the application

P.S., this is pretty cool, and we could easily implement this also in envoy if needed.

@htuch htuch closed this as completed in 7828e38 Jun 12, 2017

markdroth commented Jun 12, 2017

@markdroth the Go impl does send trailers-only right? Because I've definitely seen that here at Lyft.

Yes, the gRPC Go implementation does send Trailers-Only, as long as the application has not explicitly chosen to send initial metadata. I believe that the Java implementation does the same thing.

For clients that are not using the actual gRPC client. We do this at Lyft for example since the Python clients don't work for us (gevent). I think other companies deal with similar issues.

Out of curiosity, what do you use instead? Did you write your own python gRPC client?

For edge/high latency requests where we would like Envoy to retry before going all the way back to high latency client.

Does Envoy use the gRPC C-core implementation for its outgoing client connections? If so, it should be possible to configure retries in Envoy differently from retries in the original client. In fact, I would expect this to be a common use case for any L7 proxy, especially in the edge case.

Speaking of which, another thing to keep in mind here is that the gRPC retry mechanism is going to be configured via the gRPC service config. In a case where a service is getting some traffic from local clients (which use the service config) and other traffic from external clients via Envoy as an edge proxy, it may be convenient to have a single mechanism to configure how many retries clients should attempt, rather than having to do it in two places. (On the other hand, there may also be reasons why the service owner might want to configure them differently. Although they could do that even if you were using the gRPC retry mechanism simply by having two different server names pointing to the same set of backends, each with their own service config.)
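
For reference, a retry policy in the gRPC service config (as the gRPC retry design proposes it) looks roughly like the following sketch from a Python client; the service name, numbers, and channel target are made up, and the retry gRFC is the authoritative source for the schema.

```python
# Rough sketch of a gRPC service-config retry policy, applied from a Python
# client via channel options. Service name, target, and numbers are
# hypothetical; see the gRPC retry design for the authoritative schema.
import json
import grpc

service_config = {
    "methodConfig": [{
        "name": [{"service": "package.UserService"}],  # hypothetical service
        "retryPolicy": {
            "maxAttempts": 3,
            "initialBackoff": "0.1s",
            "maxBackoff": "1s",
            "backoffMultiplier": 2,
            "retryableStatusCodes": ["UNAVAILABLE", "RESOURCE_EXHAUSTED"],
        },
    }]
}

channel = grpc.insecure_channel(
    "backend.local:50051",  # hypothetical target
    options=[
        ("grpc.service_config", json.dumps(service_config)),
        ("grpc.enable_retries", 1),
    ],
)
```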

Anyway, I'm sure there are a lot of arguments in either direction, and I'm not necessarily trying to push you to use the gRPC retry mechanism. I just wanted to make sure that you had considered that as an alternative, and it sounds like you have.

So IMO even supporting trailers-only in Envoy, as @alyssawilk's PR does, is pretty compelling. Let me know what you think about this.

That sounds reasonable to me, at least for now. It could always be changed if/when we implement a way to return initial metadata in a retryable way.

So we are considering supporting a response in which the server sends both Response-Headers and Trailers but includes some special state in Response-Headers that tells the client that the request has failed and it should wait for the Trailers before returning any data to the application

P.S., this is pretty cool, and we could easily implement this also in envoy if needed.

If we decide to go forward with this, we'll publish a gRFC for it. I'll try to remember to loop you folks in if/when that happens.

@mattklein123

Out of curiosity, what do you use instead? Did you write your own python gRPC client?

We use the Envoy HTTP/1.1 bridge filter along with machine-generated clients: https://lyft.github.io/envoy/docs/configuration/http_filters/grpc_http1_bridge_filter.html. This will probably be replaced with the grpc-web work that @fengli79 is doing, which I think still has the same general retry problem.

Does Envoy use the gRPC C-core implementation for its outgoing client connections?

No (unfortunately). This is a very long conversation that I'm happy to have out of band, but we have gone back and forth on this and have gotten nowhere in figuring out how to integrate gRPC core into envoy. At this point a bunch of gRPC is being reimplemented inside envoy (by Google).

@markdroth

Ah, okay, I didn't realize that you weren't actually using our implementation. That makes the decision about whether or not to use our retry code pretty clear. :)

@Daniel-B-Smith

What is the current state of this? We (Datadog) are currently looking into putting Envoy on our edge and would really like to do gRPC retries inside our edge with Envoy but only for certain gRPC status codes.

@alyssawilk

This was fixed a while back, for the case where the status code is immediately returned in headers.

Hopefully the docs checked in with the associated PR will be helpful in setting it up:
https://www.envoyproxy.io/docs/envoy/latest/configuration/http_filters/router_filter#x-envoy-retry-grpc-on
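
For anyone wiring this up from a gRPC client that talks through Envoy, a minimal sketch of attaching the header as call metadata (hypothetical method path and listener address; check the docs above for the exact set of supported values):

```python
# Sketch only: hypothetical method path and Envoy listener address; the
# supported header values are documented at the link above.
import grpc

channel = grpc.insecure_channel("envoy.local:8080")  # Envoy listener
get_user = channel.unary_unary("/package.UserService/GetUser")  # hypothetical

response_bytes = get_user(
    b"",  # serialized request; real code would pass an encoded protobuf
    metadata=(
        ("x-envoy-retry-grpc-on", "cancelled,deadline-exceeded,resource-exhausted"),
        ("x-envoy-max-retries", "2"),
    ),
)
```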

@Daniel-B-Smith

Thanks for the quick response! Yep, that's exactly what I was looking for.
