Add retry policies for gRPC error codes #721
Comments
This makes sense to me. @louiscryan @lizan @fengli79 @ctiller do you have any thoughts/comments on this? |
Makes sense to me. We also need to clarify whether these headers will be eliminated by Envoy or propagated to the next hop in a multi-layer proxying case. I'm expecting them to be propagated, as the retry will happen in a place closer to the backend. A per-URL config in Envoy would also be a good complement. |
Probably worth thinking through how this interacts with
https://github.com/grpc/proposal/blob/master/A6-client-retries.md.
@markroth probably has some thoughts.
|
@alyssawilk is going to take this one. @alyssawilk feel free to follow up with me if you need any extra context. |
Hey Matt, can I get some guidance on this one?
Having spent some time figuring out the code base, I believe that in order to retry based on trailers we can't simply extend the current retry logic based on Http::StreamDecoder callbacks to enable buffering the response until we've gotten trailers. As such, I think we need to extend the router filter to be a StreamFilter rather than just a StreamDecoderFilter.
Then the question becomes: do we allow it to be either a decoder or both encoder/decoder? I'm not a fan of changes which break (all) existing configurations, but if you allow both old and new styles the code gets somewhat complicated. For one example, because the retry logic is spread between the UpstreamRequest's StreamDecoder functions and the new StreamEncoderFilter, one would have to clean up the retry state depending on which mode the filter was in to ensure it wasn't hanging around.
I'm happy to go either route, but as I'm working to get the integration tests working it'd be helpful to know if I should go the messy route and test both filter modes, or just update all the checked-in configs and hope there aren't too many annoying breakages.
thanks!
Alyssa
|
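For context on the filter types Alyssa mentions, here is a simplified sketch of the interfaces involved. These are illustrative stand-ins, not Envoy's exact class definitions — the real ones live in include/envoy/http/filter.h and carry callbacks, richer status enums, and different header map types:

```cpp
// Simplified stand-in types; Envoy's real equivalents are richer.
class HeaderMap {};
class Buffer {};
enum class FilterHeadersStatus { Continue, StopIteration };
enum class FilterDataStatus { Continue, StopIterationAndBuffer };
enum class FilterTrailersStatus { Continue, StopIteration };

// Sees only the request path (client -> upstream). Today's router filter
// is one of these, which is why it never observes response trailers.
class StreamDecoderFilter {
public:
  virtual ~StreamDecoderFilter() = default;
  virtual FilterHeadersStatus decodeHeaders(HeaderMap& headers, bool end_stream) = 0;
  virtual FilterDataStatus decodeData(Buffer& data, bool end_stream) = 0;
  virtual FilterTrailersStatus decodeTrailers(HeaderMap& trailers) = 0;
};

// Sees the response path (upstream -> client), where a gRPC grpc-status
// trailer would surface.
class StreamEncoderFilter {
public:
  virtual ~StreamEncoderFilter() = default;
  virtual FilterHeadersStatus encodeHeaders(HeaderMap& headers, bool end_stream) = 0;
  virtual FilterDataStatus encodeData(Buffer& data, bool end_stream) = 0;
  virtual FilterTrailersStatus encodeTrailers(HeaderMap& trailers) = 0;
};

// A StreamFilter participates in both directions; extending the router to
// implement this is the change under discussion.
class StreamFilter : public StreamDecoderFilter, public StreamEncoderFilter {};
```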
@alyssawilk Can we just not do retries unless the user configures the router filter as both? I think it's fair to expect a config change to opt into new behavior. I agree we don't want to break existing users by requiring both for existing behavior; that would be a breaking change according to https://github.com/lyft/envoy/blob/master/CONTRIBUTING.md#breaking-change-policy. We can document this requirement as well. See also https://github.com/lyft/envoy-api/issues/63 for related discussion in the v2 API. |
tl;dr I think I'm going to sit on this until Matt's back from vacation and stick it on the agenda of the weekly sync.
@htuch, we totally can support dual modes in the router, it just makes the code a bit more fragile. I think the cleanest route would be if we end up opting to not explicitly configure encode/decode and rather have the filters do the right thing. We'll see how that discussion goes. Unfortunately there's the other problem, which is how far we extend rewinding state for replayability.
Currently all the replay state is handled in the StreamDecoder bits of the router, so before any headers/body/trailers are handed off to the ConnectionManagerImpl to start processing filters. We can't (currently) do that for gRPC, because for gRPC we can't determine if we're retrying until trailers are processed, at which point we've already passed headers and body data on.
My original plan was to extend the router Filter to be an EncoderFilter, pause header processing, and trigger buffering the body until it has inspected trailers. The problem is that if the router does that as a filter, this implies the headers and body have already been passed to the ConnectionManagerImpl, and any prior filters, which are not guaranteed to handle retries gracefully. The options I can think of are: we make sure any prior filters (and the ConnectionManagerImpl) are willing to replay; we only support gRPC retries best-effort (e.g. if the router is the first filter); or we instead handle this earlier in the StreamDecoder path.
The latter seems easiest but strikes me as an architectural hack, because then we're secretly treating the router's StreamDecoder functions as Filter functions (stop processing and buffer). The existing retry code is best-effort, but only supporting features based on document-only requirements of filter ordering seems like asking for trouble. And making everything retry-aware is a pretty large architectural change. Hopefully there's some fourth option which hasn't occurred to me due to my newness to the code base, and given there's no hurry on this feature I'm inclined to get Matt's take once he's back.
|
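To make the buffering problem concrete, here is a hypothetical sketch of the trailer-driven retry Alyssa describes. This is not Envoy's router code — the types, method bodies, and replayUpstreamRequest helper are all invented for illustration; it only shows why headers and body must be withheld until trailers reveal grpc-status:

```cpp
#include <map>
#include <string>

// Minimal stand-ins so the sketch is self-contained.
struct HeaderMap {
  std::map<std::string, std::string> values;
  std::string get(const std::string& name) const {
    auto it = values.find(name);
    return it == values.end() ? "" : it->second;
  }
};
struct Buffer {};
enum class FilterHeadersStatus { Continue, StopIteration };
enum class FilterDataStatus { Continue, StopIterationAndBuffer };
enum class FilterTrailersStatus { Continue, StopIteration };

// Hypothetical trailer-driven retry in the encode path.
class TrailerRetrySketch {
public:
  FilterHeadersStatus encodeHeaders(HeaderMap&, bool end_stream) {
    if (!end_stream) {
      // grpc-status will only arrive in trailers, so the response headers
      // can't be released yet -- otherwise earlier filters and the client
      // see a response we might abandon and replay.
      return FilterHeadersStatus::StopIteration;
    }
    return FilterHeadersStatus::Continue;
  }

  FilterDataStatus encodeData(Buffer&, bool) {
    // The body must be buffered for the same reason; memory cost is the
    // obvious downside of trailer-based retries.
    return FilterDataStatus::StopIterationAndBuffer;
  }

  FilterTrailersStatus encodeTrailers(HeaderMap& trailers) {
    // A real policy would match specific retriable codes, not any non-OK.
    if (trailers.get("grpc-status") != "0" && retries_left_ > 0) {
      --retries_left_;
      replayUpstreamRequest();
      return FilterTrailersStatus::StopIteration;  // Swallow failed attempt.
    }
    return FilterTrailersStatus::Continue;  // Release buffered response.
  }

private:
  void replayUpstreamRequest() { /* hypothetical: re-send buffered request */ }
  int retries_left_ = 1;
};
```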
An alternative would be to punt to gRPC itself. If the goal is to allow operators to inject retry behavior where there was none previously, then it might be reasonable to inject a signal in the response trailers passed back to the gRPC client. We'd need to standardize how that works, of course, but it could be useful behavior outside the context of Envoy.
-louis (from phone)
|
I suspect one can implement retries at the gRPC client fairly easily, but by punting to the client you end up adding the latency of the (usually longer-haul) proxy-to-client RTT. It would be nice to avoid that if there's a clean way to refactor. |
@alyssawilk I was envisioning a much simpler v1 implementation of this. Basically, we only handle gRPC errors where the gRPC error is part of a header-only response (immediate failure). Essentially, don't handle the trailers case at all. I think this will cover a large portion of the cases that people actually want to retry on and be very simple to implement (essentially identical to the HTTP variants, just looking at different headers). Then very little needs to change. I agree that to do full retry semantics, we would need to buffer, handle response trailers, etc., but IMO we can skip that for now, and if a client really wants that I think the client can do it (as long as we very clearly document what is implemented). Thoughts? |
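A hedged sketch of the header-only v1 check Matt proposes — not the actual Envoy implementation, and the function name and map-based header representation are invented for illustration. The point is that on a headers-only gRPC response, grpc-status appears in the response headers, so the existing header-driven retry machinery can decide before any body or trailers exist:

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical v1 retry check: only headers-only gRPC responses qualify.
bool shouldRetryGrpc(const std::map<std::string, std::string>& response_headers,
                     bool headers_only_response,
                     const std::set<int>& retriable_codes) {
  if (!headers_only_response) {
    return false;  // Trailer-delivered statuses are out of scope for v1.
  }
  const auto it = response_headers.find("grpc-status");
  if (it == response_headers.end()) {
    return false;  // Not a gRPC error response.
  }
  const int code = std::stoi(it->second);  // 0 == OK in gRPC status codes.
  return code != 0 && retriable_codes.count(code) > 0;
}
```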
Hm, that would certainly be a lot easier, but my intuition is that most
gRPC errors (especially timeouts etc.) would be in trailers rather than
headers. Maybe @ctiller can comment on whether a header-only solution would be
a useful step?
|
In Lyft's case, I think almost all of them would be header-only responses FWIW (even a timeout would be header-only if the app does not respond to a unary request within the deadline). cc @twoism and @rodaine for further comment for Lyft. |
Ah, if it'll be useful to you folks I'll go ahead and do it, especially as
it'd be a nice clean first step if we opt to do trailers later down the
road.
|
Sorry for not chiming in sooner -- someone tried to tag me above with the wrong username, so I only saw this on Thursday. I spent some time on Thursday and Friday talking with other folks on the gRPC team about this, so that I could give you a clear picture of what we're planning to do for our retry implementation.
To avoid confusion, I'm going to use the terms defined in the gRPC wire protocol spec. It's also worth understanding that our API exposes Response-Headers as initial metadata and Trailers as trailing metadata, and the difference between them is semantically significant to applications.
Our retry design states that retries are only possible until the client receives Response-Headers from the server. This is because the client will return the initial metadata to the application as soon as it receives Response-Headers, and once we have given that metadata to the application, we cannot retry the RPC, because we might get different metadata when we retry. As a result, any RPC in which the server sends Response-Headers separately from Trailers is not retryable; only RPCs in which the server sends a Trailers-Only response are retryable.
That having been said, we are somewhat unsatisfied with this approach, since it basically eliminates the ability to send Response-Headers on a failure and still have the RPC be retryable -- users can send Response-Headers or have their RPCs be retryable, but not both. We would ideally like to avoid forcing applications to make this choice. So we are considering supporting a response in which the server sends both Response-Headers and Trailers but includes some special state in Response-Headers that tells the client that the request has failed and it should wait for the Trailers before returning any data to the application. We are currently gathering information about real-world use cases before we decide whether or not to implement this, but even if we don't do it now, we could still decide to do it later.
In terms of current status, it's worth noting that gRPC's C-core implementation currently never sends out Trailers-Only responses. I have a PR open to fix that (grpc/grpc#10906), but it exposed a bug: currently, when receiving a Trailers-Only response, we are returning the resulting metadata as initial metadata instead of trailing metadata. I will need to fix that before merging the aforementioned PR.
Taking a step back, one question I have here is whether it's necessary for Envoy to have its own retry implementation or whether it's possible for it to just use the one that we're adding to gRPC. I can imagine that there may be reasons to more closely integrate this into Envoy, and if so I'm not opposed to that. But I just want to make sure that we're not needlessly reinventing a wheel here.
Anyway, I hope the above info answers your questions. If I've missed something, or if you have any other questions or concerns, please don't hesitate to ask. |
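To illustrate the Trailers-Only vs. Response-Headers distinction in wire terms (a simplified example based on the gRPC HTTP/2 wire protocol spec; the specific status values are illustrative):

```
Trailers-Only response: a single HEADERS frame with END_STREAM set.
Retryable, since no initial metadata was ever delivered to the application.

  HEADERS (END_STREAM)
    :status: 200
    content-type: application/grpc
    grpc-status: 14          <- UNAVAILABLE
    grpc-message: connection reset

Normal response: Response-Headers, body, then Trailers. Once the first
HEADERS frame is surfaced to the application as initial metadata, the RPC
is no longer retryable.

  HEADERS
    :status: 200
    content-type: application/grpc
  DATA (length-prefixed messages)
  HEADERS (END_STREAM)
    grpc-status: 0
```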
@markdroth the Go impl does send trailers-only right? Because I've definitely seen that here at Lyft.
I'm happy to be swayed that we shouldn't do it, but I think there are 2 compelling reasons to add this support directly to envoy even for the trailers-only case.
So IMO even supporting trailers-only in Envoy as @alyssawilk's PR does is pretty compelling. Let me know what you think about this. |
P.S., this is pretty cool, and we could easily implement this also in envoy if needed. |
Yes, the gRPC Go implementation does send Trailers-Only, as long as the application has not explicitly chosen to send initial metadata. I believe that the Java implementation does the same thing.
Out of curiosity, what do you use instead? Did you write your own python gRPC client?
Does Envoy use the gRPC C-core implementation for its outgoing client connections? If so, it should be possible to configure retries in Envoy differently from retries in the original client. In fact, I would expect this to be a common use case for any L7 proxy, especially in the edge case.
Speaking of which, another thing to keep in mind here is that the gRPC retry mechanism is going to be configured via the gRPC service config. In a case where a service is getting some traffic from local clients (which use the service config) and other traffic from external clients via Envoy as an edge proxy, it may be convenient to have a single mechanism to configure how many retries clients should attempt, rather than having to do it in two places. (On the other hand, there may also be reasons why the service owner might want to configure them differently. Although they could do that even if you were using the gRPC retry mechanism, simply by having two different server names pointing to the same set of backends, each with their own service config.)
Anyway, I'm sure there are a lot of arguments in either direction, and I'm not necessarily trying to push you to use the gRPC retry mechanism. I just wanted to make sure that you had considered that as an alternative, and it sounds like you have.
That sounds reasonable to me, at least for now. It could always be changed if/when we implement a way to return initial metadata in a retryable way.
If we decide to go forward with this, we'll publish a gRFC for it. I'll try to remember to loop you folks in if/when that happens. |
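The service-config retry mechanism mentioned above is the one specified in the A6 client-retries proposal linked earlier in the thread. A representative retryPolicy stanza might look like the following — the field names follow the A6 proposal, and the service name is a made-up placeholder; verify against the current spec before relying on it:

```json
{
  "methodConfig": [{
    "name": [{ "service": "example.EchoService" }],
    "retryPolicy": {
      "maxAttempts": 4,
      "initialBackoff": "0.1s",
      "maxBackoff": "1s",
      "backoffMultiplier": 2,
      "retryableStatusCodes": ["UNAVAILABLE", "RESOURCE_EXHAUSTED"]
    }
  }]
}
```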
We use the Envoy HTTP/1.1 bridge filter along with machine-generated clients: https://lyft.github.io/envoy/docs/configuration/http_filters/grpc_http1_bridge_filter.html. This will probably be replaced with grpc-web, which @fengli79 is working on, and which I think still has the same general retry problem. |
No (unfortunately). This is a very long conversation that I'm happy to have out of band, but we have gone back and forth on this and have gotten nowhere in figuring out how to integrate gRPC core into envoy. At this point a bunch of gRPC is being reimplemented inside envoy (by Google). |
Ah, okay, I didn't realize that you weren't actually using our implementation. That makes the decision about whether or not to use our retry code pretty clear. :) |
What is the current state of this? We (Datadog) are currently looking into putting Envoy on our edge and would really like to do gRPC retries inside our edge with Envoy but only for certain gRPC status codes. |
This was fixed a while back, for the case where the status code is immediately returned in headers. Hopefully the docs checked in with the associated PR will be helpful in setting it up: |
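For reference, the mechanism that shipped is driven by the x-envoy-retry-grpc-on request header (with an equivalent route-level retry policy). A usage sketch — the code names below match Envoy's documented values, but check the docs for your version:

```
x-envoy-retry-grpc-on: cancelled,deadline-exceeded,resource-exhausted,unavailable
x-envoy-max-retries: 3
```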
Thanks for the quick response! Yep, that's exactly what I was looking for. |
From the documentation on retries, the `x-envoy-retry-on` header can be configured for handling `HTTP` status codes like `5XX` and `4XX`. This works great for `HTTP` services. However, with `gRPC` all `HTTP` status codes (if the server is running properly) will return a `200 OK`. The actual error code is found within the `gRPC` response.
Would it be feasible to create a retry policy for a `x-envoy-retry-grpc-on` header that respects a list of `gRPC` error codes? There are a few codes that could be deemed as retriable (`CANCELLED`, `DEADLINE_EXCEEDED`, `RESOURCE_EXHAUSTED`), but I am hesitant to make assumptions about implementation details within a service by grouping them together, which is why I think a list may work best. I'm open to ideas here.
Sample Header
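(The original sample header was lost in extraction; given the header name and codes proposed above, it presumably looked something like the following:)

```
x-envoy-retry-grpc-on: cancelled,deadline-exceeded,resource-exhausted
```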
The existing retry header would then be configured as a fallback for when a service is unreachable and the `HTTP` status codes have more meaning.