Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Outlier Detection: use gRPC status code for detecting failures #7942

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions docs/root/intro/arch_overview/upstream/outlier.rst
Original file line number Diff line number Diff line change
Expand Up @@ -145,6 +145,13 @@ Most configuration items, namely
types of errors, but :ref:`outlier_detection.enforcing_success_rate<envoy_api_field_cluster.OutlierDetection.enforcing_success_rate>` applies
to externally originated errors only and :ref:`outlier_detection.enforcing_local_origin_success_rate<envoy_api_field_cluster.OutlierDetection.enforcing_local_origin_success_rate>` applies to locally originated errors only.

.. _arch_overview_outlier_detection_grpc:

gRPC
----------------------

For gRPC requests, the outlier detection will use the HTTP status mapped from the `grpc-status <https://github.com/grpc/grpc/blob/master/doc/PROTOCOL-HTTP2.md#responses>`_ response header. This behavior is guarded by the runtime feature `envoy.reloadable_features.outlier_detection_support_for_grpc_status` which defaults to true.


.. _arch_overview_outlier_detection_logging:

Expand Down
1 change: 1 addition & 0 deletions docs/root/intro/version_history.rst
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@ Version history
* listeners: added :ref:`continue_on_listener_filters_timeout <envoy_api_field_Listener.continue_on_listener_filters_timeout>` to configure whether a listener will still create a connection when listener filters time out.
* listeners: added :ref:`HTTP inspector listener filter <config_listener_filters_http_inspector>`.
* lua: extended `httpCall()` and `respond()` APIs to accept headers with entry values that can be a string or table of strings.
* outlier_detector: added :ref:`support for the grpc-status response header <arch_overview_outlier_detection_grpc>` by mapping it to HTTP status. Guarded by envoy.reloadable_features.outlier_detection_support_for_grpc_status which defaults to true.
* performance: new buffer implementation enabled by default (to disable add "--use-libevent-buffers 1" to the command-line arguments when starting Envoy).
* performance: stats symbol table implementation (disabled by default; to test it, add "--use-fake-symbol-table 0" to the command-line arguments when starting Envoy).
* rbac: added support for DNS SAN as :ref:`principal_name <envoy_api_field_config.rbac.v2.Principal.Authenticated.principal_name>`.
Expand Down
31 changes: 24 additions & 7 deletions source/common/router/router.cc
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@
#include "common/router/config_impl.h"
#include "common/router/debug_config.h"
#include "common/router/retry_state_impl.h"
#include "common/runtime/runtime_impl.h"
#include "common/tracing/http_tracer_impl.h"

#include "extensions/filters/http/well_known_names.h"
Expand Down Expand Up @@ -931,15 +932,14 @@ Filter::streamResetReasonToResponseFlag(Http::StreamResetReason reset_reason) {
NOT_REACHED_GCOVR_EXCL_LINE;
}

void Filter::handleNon5xxResponseHeaders(const Http::HeaderMap& headers,
UpstreamRequest& upstream_request, bool end_stream) {
void Filter::handleNon5xxResponseHeaders(absl::optional<Grpc::Status::GrpcStatus> grpc_status,
UpstreamRequest& upstream_request, bool end_stream,
uint64_t grpc_to_http_status) {
// We need to defer gRPC success until after we have processed grpc-status in
// the trailers.
if (grpc_request_) {
if (end_stream) {
absl::optional<Grpc::Status::GrpcStatus> grpc_status = Grpc::Common::getGrpcStatus(headers);
if (grpc_status &&
!Http::CodeUtility::is5xx(Grpc::Utility::grpcToHttpStatus(grpc_status.value()))) {
if (grpc_status && !Http::CodeUtility::is5xx(grpc_to_http_status)) {
upstream_request.upstream_host_->stats().rq_success_.inc();
} else {
upstream_request.upstream_host_->stats().rq_error_.inc();
Expand Down Expand Up @@ -1002,8 +1002,25 @@ void Filter::onUpstreamHeaders(uint64_t response_code, Http::HeaderMapPtr&& head
ENVOY_STREAM_LOG(debug, "upstream headers complete: end_stream={}", *callbacks_, end_stream);

modify_headers_(*headers);
// When grpc-status appears in response headers, convert grpc-status to HTTP status code
// for outlier detection. This does not currently change any stats or logging and does not
// handle the case when an error grpc-status is sent as a trailer.
absl::optional<Grpc::Status::GrpcStatus> grpc_status;
htuch marked this conversation as resolved.
Show resolved Hide resolved
uint64_t grpc_to_http_status = 0;
if (grpc_request_) {
grpc_status = Grpc::Common::getGrpcStatus(*headers);
if (grpc_status.has_value()) {
grpc_to_http_status = Grpc::Utility::grpcToHttpStatus(grpc_status.value());
}
}

upstream_request.upstream_host_->outlierDetector().putHttpResponseCode(response_code);
if (grpc_status.has_value() &&
Runtime::runtimeFeatureEnabled(
"envoy.reloadable_features.outlier_detection_support_for_grpc_status")) {
upstream_request.upstream_host_->outlierDetector().putHttpResponseCode(grpc_to_http_status);
} else {
upstream_request.upstream_host_->outlierDetector().putHttpResponseCode(response_code);
}

if (headers->EnvoyImmediateHealthCheckFail() != nullptr) {
upstream_request.upstream_host_->healthChecker().setUnhealthy();
Expand Down Expand Up @@ -1090,7 +1107,7 @@ void Filter::onUpstreamHeaders(uint64_t response_code, Http::HeaderMapPtr&& head
upstream_request.upstream_host_->canary();
chargeUpstreamCode(response_code, *headers, upstream_request.upstream_host_, false);
if (!Http::CodeUtility::is5xx(response_code)) {
handleNon5xxResponseHeaders(*headers, upstream_request, end_stream);
handleNon5xxResponseHeaders(grpc_status, upstream_request, end_stream, grpc_to_http_status);
}

// Append routing cookies
Expand Down
5 changes: 3 additions & 2 deletions source/common/router/router.h
Original file line number Diff line number Diff line change
Expand Up @@ -535,8 +535,9 @@ class Filter : Logger::Loggable<Logger::Id::router>,
void doRetry();
// Called immediately after a non-5xx header is received from upstream, performs stats accounting
// and handle difference between gRPC and non-gRPC requests.
void handleNon5xxResponseHeaders(const Http::HeaderMap& headers,
UpstreamRequest& upstream_request, bool end_stream);
void handleNon5xxResponseHeaders(absl::optional<Grpc::Status::GrpcStatus> grpc_status,
UpstreamRequest& upstream_request, bool end_stream,
uint64_t grpc_to_http_status);
TimeSource& timeSource() { return config_.timeSource(); }
Http::Context& httpContext() { return config_.http_context_; }

Expand Down
1 change: 1 addition & 0 deletions source/common/runtime/runtime_features.cc
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ constexpr const char* runtime_features[] = {
"envoy.reloadable_features.test_feature_true",
"envoy.reloadable_features.buffer_filter_populate_content_length",
"envoy.reloadable_features.trusted_forwarded_proto",
"envoy.reloadable_features.outlier_detection_support_for_grpc_status",
};

// This is a list of configuration fields which are disallowed by default in Envoy
Expand Down
1 change: 1 addition & 0 deletions test/common/router/BUILD
Original file line number Diff line number Diff line change
Expand Up @@ -247,6 +247,7 @@ envoy_cc_test(
"//test/mocks/upstream:upstream_mocks",
"//test/test_common:environment_lib",
"//test/test_common:simulated_time_system_lib",
"//test/test_common:test_runtime_lib",
"//test/test_common:utility_lib",
],
)
Expand Down
179 changes: 176 additions & 3 deletions test/common/router/router_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@
#include "test/test_common/environment.h"
#include "test/test_common/printers.h"
#include "test/test_common/simulated_time_system.h"
#include "test/test_common/test_runtime.h"
#include "test/test_common/utility.h"

#include "gmock/gmock.h"
Expand Down Expand Up @@ -1177,7 +1178,10 @@ TEST_F(RouterTest, GrpcOkTrailersOnly) {
}

// Validate gRPC AlreadyExists response stats are sane when response is trailers only.
TEST_F(RouterTest, GrpcAlreadyExistsTrailersOnly) {
TEST_F(RouterTest, GrpcAlreadyExistsTrailersOnlyRuntimeGuard) {
TestScopedRuntime scoped_runtime;
Runtime::LoaderSingleton::getExisting()->mergeValues(
{{"envoy.reloadable_features.outlier_detection_support_for_grpc_status", "false"}});
NiceMock<Http::MockStreamEncoder> encoder1;
Http::StreamDecoder* response_decoder = nullptr;
EXPECT_CALL(cm_.conn_pool_, newStream(_, _))
Expand All @@ -1200,8 +1204,94 @@ TEST_F(RouterTest, GrpcAlreadyExistsTrailersOnly) {
EXPECT_TRUE(verifyHostUpstreamStats(1, 0));
}

// Validate gRPC AlreadyExists response stats are sane when response is trailers only.
TEST_F(RouterTest, GrpcAlreadyExistsTrailersOnly) {
TestScopedRuntime scoped_runtime;
Runtime::LoaderSingleton::getExisting()->mergeValues(
{{"envoy.reloadable_features.outlier_detection_support_for_grpc_status", "true"}});
NiceMock<Http::MockStreamEncoder> encoder1;
Http::StreamDecoder* response_decoder = nullptr;
EXPECT_CALL(cm_.conn_pool_, newStream(_, _))
.WillOnce(Invoke([&](Http::StreamDecoder& decoder, Http::ConnectionPool::Callbacks& callbacks)
-> Http::ConnectionPool::Cancellable* {
response_decoder = &decoder;
callbacks.onPoolReady(encoder1, cm_.conn_pool_.host_);
return nullptr;
}));
expectResponseTimerCreate();

Http::TestHeaderMapImpl headers{{"content-type", "application/grpc"}, {"grpc-timeout", "20S"}};
HttpTestUtility::addDefaultHeaders(headers);
router_.decodeHeaders(headers, true);

Http::HeaderMapPtr response_headers(
new Http::TestHeaderMapImpl{{":status", "200"}, {"grpc-status", "6"}});
EXPECT_CALL(cm_.conn_pool_.host_->outlier_detector_, putHttpResponseCode(409));
response_decoder->decodeHeaders(std::move(response_headers), true);
EXPECT_TRUE(verifyHostUpstreamStats(1, 0));
}

// Validate gRPC Unavailable response stats are sane when response is trailers only.
TEST_F(RouterTest, GrpcOutlierDetectionUnavailableStatusCodeRuntimeGuard) {
TestScopedRuntime scoped_runtime;
Runtime::LoaderSingleton::getExisting()->mergeValues(
{{"envoy.reloadable_features.outlier_detection_support_for_grpc_status", "false"}});
NiceMock<Http::MockStreamEncoder> encoder1;
Http::StreamDecoder* response_decoder = nullptr;
EXPECT_CALL(cm_.conn_pool_, newStream(_, _))
.WillOnce(Invoke([&](Http::StreamDecoder& decoder, Http::ConnectionPool::Callbacks& callbacks)
-> Http::ConnectionPool::Cancellable* {
response_decoder = &decoder;
callbacks.onPoolReady(encoder1, cm_.conn_pool_.host_);
return nullptr;
}));
expectResponseTimerCreate();

Http::TestHeaderMapImpl headers{{"content-type", "application/grpc"}, {"grpc-timeout", "20S"}};
HttpTestUtility::addDefaultHeaders(headers);
router_.decodeHeaders(headers, true);

Http::HeaderMapPtr response_headers(
new Http::TestHeaderMapImpl{{":status", "200"}, {"grpc-status", "14"}});
// Outlier detector will use the gRPC response status code.
EXPECT_CALL(cm_.conn_pool_.host_->outlier_detector_, putHttpResponseCode(200));
response_decoder->decodeHeaders(std::move(response_headers), true);
EXPECT_TRUE(verifyHostUpstreamStats(0, 1));
}

// Validate gRPC Unavailable response stats are sane when response is trailers only.
TEST_F(RouterTest, GrpcOutlierDetectionUnavailableStatusCode) {
TestScopedRuntime scoped_runtime;
Runtime::LoaderSingleton::getExisting()->mergeValues(
{{"envoy.reloadable_features.outlier_detection_support_for_grpc_status", "true"}});
NiceMock<Http::MockStreamEncoder> encoder1;
Http::StreamDecoder* response_decoder = nullptr;
EXPECT_CALL(cm_.conn_pool_, newStream(_, _))
.WillOnce(Invoke([&](Http::StreamDecoder& decoder, Http::ConnectionPool::Callbacks& callbacks)
-> Http::ConnectionPool::Cancellable* {
response_decoder = &decoder;
callbacks.onPoolReady(encoder1, cm_.conn_pool_.host_);
return nullptr;
}));
expectResponseTimerCreate();

Http::TestHeaderMapImpl headers{{"content-type", "application/grpc"}, {"grpc-timeout", "20S"}};
HttpTestUtility::addDefaultHeaders(headers);
router_.decodeHeaders(headers, true);

Http::HeaderMapPtr response_headers(
new Http::TestHeaderMapImpl{{":status", "200"}, {"grpc-status", "14"}});
// Outlier detector will use the gRPC response status code.
EXPECT_CALL(cm_.conn_pool_.host_->outlier_detector_, putHttpResponseCode(503));
response_decoder->decodeHeaders(std::move(response_headers), true);
EXPECT_TRUE(verifyHostUpstreamStats(0, 1));
}
ZhouyihaiDing marked this conversation as resolved.
Show resolved Hide resolved

// Validate gRPC Internal response stats are sane when response is trailers only.
TEST_F(RouterTest, GrpcInternalTrailersOnly) {
TEST_F(RouterTest, GrpcInternalTrailersOnlyRuntimeGuard) {
TestScopedRuntime scoped_runtime;
Runtime::LoaderSingleton::getExisting()->mergeValues(
{{"envoy.reloadable_features.outlier_detection_support_for_grpc_status", "false"}});
NiceMock<Http::MockStreamEncoder> encoder1;
Http::StreamDecoder* response_decoder = nullptr;
EXPECT_CALL(cm_.conn_pool_, newStream(_, _))
Expand All @@ -1224,6 +1314,33 @@ TEST_F(RouterTest, GrpcInternalTrailersOnly) {
EXPECT_TRUE(verifyHostUpstreamStats(0, 1));
}

// Validate gRPC Internal response stats are sane when response is trailers only.
TEST_F(RouterTest, GrpcInternalTrailersOnly) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

TestScopedRuntime scoped_runtime;
Runtime::LoaderSingleton::getExisting()->mergeValues(
{{"envoy.reloadable_features.outlier_detection_support_for_grpc_status", "true"}});
NiceMock<Http::MockStreamEncoder> encoder1;
Http::StreamDecoder* response_decoder = nullptr;
EXPECT_CALL(cm_.conn_pool_, newStream(_, _))
.WillOnce(Invoke([&](Http::StreamDecoder& decoder, Http::ConnectionPool::Callbacks& callbacks)
-> Http::ConnectionPool::Cancellable* {
response_decoder = &decoder;
callbacks.onPoolReady(encoder1, cm_.conn_pool_.host_);
return nullptr;
}));
expectResponseTimerCreate();

Http::TestHeaderMapImpl headers{{"content-type", "application/grpc"}, {"grpc-timeout", "20S"}};
HttpTestUtility::addDefaultHeaders(headers);
router_.decodeHeaders(headers, true);

Http::HeaderMapPtr response_headers(
new Http::TestHeaderMapImpl{{":status", "200"}, {"grpc-status", "13"}});
EXPECT_CALL(cm_.conn_pool_.host_->outlier_detector_, putHttpResponseCode(500));
response_decoder->decodeHeaders(std::move(response_headers), true);
EXPECT_TRUE(verifyHostUpstreamStats(0, 1));
}

// Validate gRPC response stats are sane when response is ended in a DATA
// frame.
TEST_F(RouterTest, GrpcDataEndStream) {
Expand Down Expand Up @@ -2755,7 +2872,11 @@ TEST_F(RouterTest, RetryUpstream5xxNotComplete) {
.value());
}

TEST_F(RouterTest, RetryUpstreamGrpcCancelled) {
// Validate gRPC Cancelled response stats are sane when retry is taking effect.
TEST_F(RouterTest, RetryUpstreamGrpcCancelledRuntimeGuard) {
TestScopedRuntime scoped_runtime;
Runtime::LoaderSingleton::getExisting()->mergeValues(
{{"envoy.reloadable_features.outlier_detection_support_for_grpc_status", "false"}});
NiceMock<Http::MockStreamEncoder> encoder1;
Http::StreamDecoder* response_decoder = nullptr;
EXPECT_CALL(cm_.conn_pool_, newStream(_, _))
Expand Down Expand Up @@ -2803,6 +2924,58 @@ TEST_F(RouterTest, RetryUpstreamGrpcCancelled) {
EXPECT_TRUE(verifyHostUpstreamStats(1, 1));
}

// Validate gRPC Cancelled response stats are sane when retry is taking effect.
TEST_F(RouterTest, RetryUpstreamGrpcCancelled) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you generally add comments to all these TEST_F everywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

TestScopedRuntime scoped_runtime;
Runtime::LoaderSingleton::getExisting()->mergeValues(
{{"envoy.reloadable_features.outlier_detection_support_for_grpc_status", "true"}});
NiceMock<Http::MockStreamEncoder> encoder1;
Http::StreamDecoder* response_decoder = nullptr;
EXPECT_CALL(cm_.conn_pool_, newStream(_, _))
.WillOnce(Invoke([&](Http::StreamDecoder& decoder, Http::ConnectionPool::Callbacks& callbacks)
-> Http::ConnectionPool::Cancellable* {
response_decoder = &decoder;
callbacks.onPoolReady(encoder1, cm_.conn_pool_.host_);
return nullptr;
}));
expectResponseTimerCreate();

Http::TestHeaderMapImpl headers{{"x-envoy-retry-grpc-on", "cancelled"},
{"x-envoy-internal", "true"},
{"content-type", "application/grpc"},
{"grpc-timeout", "20S"}};
HttpTestUtility::addDefaultHeaders(headers);
router_.decodeHeaders(headers, true);

// gRPC with status "cancelled" (1)
router_.retry_state_->expectHeadersRetry();
Http::HeaderMapPtr response_headers1(
new Http::TestHeaderMapImpl{{":status", "200"}, {"grpc-status", "1"}});
EXPECT_CALL(cm_.conn_pool_.host_->outlier_detector_, putHttpResponseCode(499));
response_decoder->decodeHeaders(std::move(response_headers1), true);
EXPECT_TRUE(verifyHostUpstreamStats(0, 1));

// We expect the grpc-status to result in a retried request.
EXPECT_CALL(encoder1.stream_, resetStream(_)).Times(0);
NiceMock<Http::MockStreamEncoder> encoder2;
EXPECT_CALL(cm_.conn_pool_, newStream(_, _))
.WillOnce(Invoke([&](Http::StreamDecoder& decoder, Http::ConnectionPool::Callbacks& callbacks)
-> Http::ConnectionPool::Cancellable* {
response_decoder = &decoder;
callbacks.onPoolReady(encoder2, cm_.conn_pool_.host_);
return nullptr;
}));
router_.retry_state_->callback_();

// Normal response.
EXPECT_CALL(*router_.retry_state_, shouldRetryHeaders(_, _)).WillOnce(Return(RetryStatus::No));
Http::HeaderMapPtr response_headers(
new Http::TestHeaderMapImpl{{":status", "200"}, {"grpc-status", "0"}});
EXPECT_CALL(cm_.conn_pool_.host_->outlier_detector_, putHttpResponseCode(200));
response_decoder->decodeHeaders(std::move(response_headers), true);
EXPECT_TRUE(verifyHostUpstreamStats(1, 1));
}

// Verifies that the initial host is select with max host count of one, but during retries
// RetryPolicy will be consulted.
TEST_F(RouterTest, RetryRespsectsMaxHostSelectionCount) {
Expand Down