
Add retry policies for gRPC error codes #721

Closed
twoism opened this issue Apr 7, 2017 · 22 comments


twoism commented Apr 7, 2017

From the documentation on retries, the x-envoy-retry-on header can be configured for handling HTTP status codes like 5XX and 4XX. This works great for HTTP services. However, with gRPC, every response (if the server is running properly) comes back as a 200 OK; the actual error code is carried within the gRPC response itself.

Would it be feasible to create a retry policy for an x-envoy-retry-grpc-on header that respects a list of gRPC error codes? There are a few codes that could be deemed retriable (CANCELLED, DEADLINE_EXCEEDED, RESOURCE_EXHAUSTED), but I am hesitant to make assumptions about implementation details within a service by grouping them together, which is why I think a list may work best. I'm open to ideas here.

Sample Header

x-envoy-retry-grpc-on: CANCELLED, DEADLINE_EXCEEDED, RESOURCE_EXHAUSTED

The existing retry header would then be configured as a fallback for when a service is unreachable and the HTTP status codes have more meaning.
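
As a rough sketch (hypothetical endpoint and route, and header values written as proposed here; the syntax Envoy ultimately adopts may differ), a caller routed through Envoy might attach both headers like this:

```python
# Illustrative only: hypothetical Envoy listener address and route; the
# proposed header-value syntax may not match what Envoy finally implements.
import requests

resp = requests.post(
    "http://envoy.local:8080/package.UserService/GetUser",  # hypothetical route
    headers={
        # Proposed: retry when the gRPC error code is one of these.
        "x-envoy-retry-grpc-on": "CANCELLED,DEADLINE_EXCEEDED,RESOURCE_EXHAUSTED",
        # Existing fallback: retry on transport-level / HTTP failures.
        "x-envoy-retry-on": "connect-failure,5xx",
        "x-envoy-max-retries": "3",
    },
    data=b"",  # a real gRPC call would carry a length-prefixed protobuf frame
)
print(resp.status_code, resp.headers.get("grpc-status"))
```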

@mattklein123 mattklein123 added the enhancement Feature requests. Not bugs or questions. label Apr 8, 2017
@mattklein123 mattklein123 added this to the 1.3.0 milestone Apr 8, 2017

mattklein123 commented Apr 8, 2017

This makes sense to me. @louiscryan @lizan @fengli79 @ctiller do you have any thoughts/comments on this?


fengli79 commented Apr 8, 2017

Makes sense to me. We also need to clarify whether these headers will be stripped by Envoy or propagated to the next hop in the multi-layer proxying case. I expect them to be propagated, since the retry will happen at a place closer to the backend.
A per-URL config in Envoy would also be a good complement.
Also, it would probably be better to formalize this in gRPC so that all proxies can have consistent behavior.


ctiller commented Apr 8, 2017 via email

@mattklein123

@alyssawilk is going to take this one. @alyssawilk feel free to follow up with me if you need any extra context.


alyssawilk commented May 25, 2017 via email


htuch commented May 26, 2017

@alyssawilk Can we just not do retries unless the user configures the router filter as both? I think it's fair to expect a config change to opt into new behavior. I agree we don't want to break existing users by requiring both for existing behavior; that would be a breaking change according to https://github.com/lyft/envoy/blob/master/CONTRIBUTING.md#breaking-change-policy. We can document this requirement as well.

See also https://github.com/lyft/envoy-api/issues/63 for related discussion in v2 API.


alyssawilk commented May 26, 2017 via email


louiscryan commented May 26, 2017 via email

@alyssawilk

I suspect one can implement retries at the gRPC client fairly easily, but by punting to the client you end up adding the latency of the (usually longer-haul) proxy-to-client RTT. It would be nice to avoid that if there's a clean way to refactor.

@mattklein123

@alyssawilk I was envisioning a much simpler v1 implementation of this. Basically, we only handle gRPC errors where the error is part of a header-only response (immediate failure). Essentially, don't handle the trailers case at all. I think this will cover a large portion of the cases that people actually want to retry on and be very simple to implement (essentially identical to the HTTP variants, just looking at different headers). Then very little needs to change.

I agree that to do full retry semantics we would need to buffer, handle response trailers, etc., but IMO we can skip that for now, and if a client really wants that I think the client can do it (as long as we very clearly document what is implemented). Thoughts?
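
To sketch the headers-only idea (in Python rather than Envoy's C++, with hypothetical names; this is not the actual implementation), the check would look something like:

```python
# Minimal sketch of the headers-only retry decision described above.
# Names are hypothetical; this is not Envoy's actual code.

# Standard gRPC status code numbers.
RETRIABLE_GRPC_STATUSES = {
    1,   # CANCELLED
    4,   # DEADLINE_EXCEEDED
    8,   # RESOURCE_EXHAUSTED
    14,  # UNAVAILABLE
}

def should_retry(response_headers: dict, headers_only: bool) -> bool:
    """Retry only when the gRPC status arrives in a headers-only response.

    Errors delivered in trailers after a normal response are deliberately
    ignored in this simplified v1.
    """
    if not headers_only:
        return False
    status = response_headers.get("grpc-status")
    return status is not None and int(status) in RETRIABLE_GRPC_STATUSES

# An immediate RESOURCE_EXHAUSTED failure is retriable:
assert should_retry({":status": "200", "grpc-status": "8"}, headers_only=True)
# A normal response whose error arrives later in trailers is not handled:
assert not should_retry({":status": "200"}, headers_only=False)
```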


alyssawilk commented Jun 8, 2017 via email

@mattklein123

Hm, that would certainly be a lot easier, but my intuition is that most gRPC errors (especially timeouts etc.) would be in trailers rather than headers.

In Lyft's case, I think almost all of them would be header only responses FWIW (even timeout would be header only if app does not respond to unary request during deadline). cc @twoism and @rodaine for further comment for Lyft.


alyssawilk commented Jun 8, 2017 via email

@markdroth

Sorry for not chiming in on this sooner -- someone tried to tag me above with the wrong username, so I only saw this on Thursday. I spent some time on Thursday and Friday talking with other folks on the gRPC team about this, so that I could give you a clear picture of what we're planning to do for our retry implementation.

To avoid confusion, I'm going to use the terms defined in the gRPC wire protocol spec. It's also worth understanding that our API exposes Response-Headers as initial metadata and Trailers as trailing metadata, and the difference between them is semantically significant to applications.

Our retry design states that retries are only possible until the client receives Response-Headers from the server. This is because the client will return the initial metadata to the application as soon as it receives Response-Headers, and once we have given that metadata to the application, we cannot retry the RPC, because we might get different metadata when we retry. As a result, any RPC in which the server sends Response-Headers separately from Trailers is not retryable; only RPCs in which the server sends a Trailers-Only response are retryable.

That having been said, we are somewhat unsatisfied with this approach, since it basically eliminates the ability to send Response-Headers on a failure and still have the RPC be retryable -- users can send Response-Headers or have their RPCs be retryable, but not both. We would ideally like to avoid forcing applications to make this choice. So we are considering supporting a response in which the server sends both Response-Headers and Trailers but includes some special state in Response-Headers that tells the client that the request has failed and it should wait for the Trailers before returning any data to the application. We are currently gathering information about real-world use-cases before we decide whether or not to implement this, but even if we don't do it now, we could still decide to do it later.

In terms of current status, it's worth noting that gRPC's C-core implementation currently never sends out Trailers-Only responses. I have a PR open to fix that (grpc/grpc#10906), but it exposed a bug: Currently, when receiving a Trailers-Only response, we are returning the resulting metadata as initial metadata instead of trailing metadata. I will need to fix that before merging the aforementioned PR.

Taking a step back, one question I have here is whether it's necessary for Envoy to have its own retry implementation or whether it's possible for it to just use the one that we're adding to gRPC. I can imagine that there may be reasons to more closely integrate this into Envoy, and if so I'm not opposed to that. But I just want to make sure that we're not needlessly reinventing a wheel here.

Anyway, I hope the above info answers your questions. If I've missed something, or if you have any other questions or concerns, please don't hesitate to ask.


mattklein123 commented Jun 12, 2017

In terms of current status, it's worth noting that gRPC's C-core implementation currently never sends out Trailers-Only responses

@markdroth the Go impl does send trailers-only, right? Because I've definitely seen that here at Lyft.

Taking a step back, one question I have here is whether it's necessary for Envoy to have its own retry implementation or whether it's possible for it to just use the one that we're adding to gRPC

I'm happy to be swayed that we shouldn't do it, but I think there are 2 compelling reasons to add this support directly to envoy even for the trailers-only case.

  1. For clients that are not using the actual gRPC client. We do this at Lyft for example since the Python clients don't work for us (gevent). I think other companies deal with similar issues.
  2. For edge/high latency requests where we would like Envoy to retry before going all the way back to high latency client.

So IMO even supporting trailers-only in Envoy, as @alyssawilk's PR does, is pretty compelling. Let me know what you think about this.

@mattklein123

So we are considering supporting a response in which the server sends both Response-Headers and Trailers but includes some special state in Response-Headers that tells the client that the request has failed and it should wait for the Trailers before returning any data to the application

P.S., this is pretty cool, and we could easily implement this also in envoy if needed.

@htuch htuch closed this as completed in 7828e38 Jun 12, 2017

markdroth commented Jun 12, 2017

@markdroth the Go impl does send trailers-only right? Because I've definitely seen that here at Lyft.

Yes, the gRPC Go implementation does send Trailers-Only, as long as the application has not explicitly chosen to send initial metadata. I believe that the Java implementation does the same thing.

For clients that are not using the actual gRPC client. We do this at Lyft for example since the Python clients don't work for us (gevent). I think other companies deal with similar issues.

Out of curiosity, what do you use instead? Did you write your own python gRPC client?

For edge/high latency requests where we would like Envoy to retry before going all the way back to high latency client.

Does Envoy use the gRPC C-core implementation for its outgoing client connections? If so, it should be possible to configure retries in Envoy differently from retries in the original client. In fact, I would expect this to be a common use case for any L7 proxy, especially in the edge case.

Speaking of which, another thing to keep in mind here is that the gRPC retry mechanism is going to be configured via the gRPC service config. In a case where a service is getting some traffic from local clients (which use the service config) and other traffic from external clients via Envoy as an edge proxy, it may be convenient to have a single mechanism to configure how many retries clients should attempt, rather than having to do it in two places. (On the other hand, there may also be reasons why the service owner might want to configure them differently. Although they could do that even if you were using the gRPC retry mechanism simply by having two different server names pointing to the same set of backends, each with their own service config.)
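
For reference, a retry policy in the gRPC service config (as the gRPC retry design proposes it) looks roughly like the following sketch from a Python client; the service name, numbers, and channel target are made up, and the retry gRFC is the authoritative source for the schema.

```python
# Rough sketch of a gRPC service-config retry policy, applied from a Python
# client via channel options. Service name, target, and numbers are
# hypothetical; see the gRPC retry design for the authoritative schema.
import json
import grpc

service_config = {
    "methodConfig": [{
        "name": [{"service": "package.UserService"}],  # hypothetical service
        "retryPolicy": {
            "maxAttempts": 3,
            "initialBackoff": "0.1s",
            "maxBackoff": "1s",
            "backoffMultiplier": 2,
            "retryableStatusCodes": ["UNAVAILABLE", "RESOURCE_EXHAUSTED"],
        },
    }]
}

channel = grpc.insecure_channel(
    "backend.local:50051",  # hypothetical target
    options=[
        ("grpc.service_config", json.dumps(service_config)),
        ("grpc.enable_retries", 1),
    ],
)
```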

Anyway, I'm sure there are a lot of arguments in either direction, and I'm not necessarily trying to push you to use the gRPC retry mechanism. I just wanted to make sure that you had considered that as an alternative, and it sounds like you have.

So IMO even supporting trailers-only in Envoy, as @alyssawilk's PR does, is pretty compelling. Let me know what you think about this.

That sounds reasonable to me, at least for now. It could always be changed if/when we implement a way to return initial metadata in a retryable way.

So we are considering supporting a response in which the server sends both Response-Headers and Trailers but includes some special state in Response-Headers that tells the client that the request has failed and it should wait for the Trailers before returning any data to the application

P.S., this is pretty cool, and we could easily implement this also in envoy if needed.

If we decide to go forward with this, we'll publish a gRFC for it. I'll try to remember to loop you folks in if/when that happens.

@mattklein123

Out of curiosity, what do you use instead? Did you write your own python gRPC client?

We use the Envoy HTTP/1.1 bridge filter along with machine-generated clients: https://lyft.github.io/envoy/docs/configuration/http_filters/grpc_http1_bridge_filter.html. This will probably be replaced with the grpc-web work that @fengli79 is doing, which I think still has the same general retry problem.

Does Envoy use the gRPC C-core implementation for its outgoing client connections?

No (unfortunately). This is a very long conversation that I'm happy to have out of band, but we have gone back and forth on this and have gotten nowhere in figuring out how to integrate gRPC core into envoy. At this point a bunch of gRPC is being reimplemented inside envoy (by Google).

@markdroth

Ah, okay, I didn't realize that you weren't actually using our implementation. That makes the decision about whether or not to use our retry code pretty clear. :)

@Daniel-B-Smith

What is the current state of this? We (Datadog) are currently looking into putting Envoy on our edge and would really like to do gRPC retries inside our edge with Envoy but only for certain gRPC status codes.

@alyssawilk

This was fixed a while back, for the case where the status code is immediately returned in headers.

Hopefully the docs checked in with the associated PR will be helpful in setting it up:
https://www.envoyproxy.io/docs/envoy/latest/configuration/http_filters/router_filter#x-envoy-retry-grpc-on
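
For anyone wiring this up from a gRPC client that talks through Envoy, a minimal sketch of attaching the header as call metadata (hypothetical method path and listener address; check the docs above for the exact set of supported values):

```python
# Sketch only: hypothetical method path and Envoy listener address; the
# supported header values are documented at the link above.
import grpc

channel = grpc.insecure_channel("envoy.local:8080")  # Envoy listener
get_user = channel.unary_unary("/package.UserService/GetUser")  # hypothetical

response_bytes = get_user(
    b"",  # serialized request; real code would pass an encoded protobuf
    metadata=(
        ("x-envoy-retry-grpc-on", "cancelled,deadline-exceeded,resource-exhausted"),
        ("x-envoy-max-retries", "2"),
    ),
)
```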

@Daniel-B-Smith

Thanks for the quick response! Yep, that's exactly what I was looking for.
