diff --git a/A66-otel-stats.md b/A66-otel-stats.md
index 676a450be..21e5e4f7e 100644
--- a/A66-otel-stats.md
+++ b/A66-otel-stats.md
@@ -1,6 +1,7 @@
 # OpenTelemetry Metrics

-* Author: Yash Tibrewal (@yashykt), Zach Reyes (@zasweq), Vindhya Ningegowda (@DNVindhya), Xuan Wang (@XuanWang-Amos)
+* Author: Yash Tibrewal (@yashykt), Zach Reyes (@zasweq), Vindhya Ningegowda
+ (@DNVindhya), Xuan Wang (@XuanWang-Amos)
 * Approver: Mark Roth (@markdroth)
 * Status: Final
 * Implemented in:
@@ -406,22 +407,22 @@ the data.
 * **grpc.client.attempt.started**
The total number of RPC attempts started, including those that have not completed.
*Attributes*: grpc.method, grpc.target
- *Type*: Counter
+ *Type*: Counter (integer)
*Unit*: `{attempt}`
* **grpc.client.attempt.duration**
End-to-end time taken to complete an RPC attempt including the time it takes to pick a subchannel.
*Attributes*: grpc.method, grpc.target, grpc.status
- *Type*: Histogram (Latency Buckets)
+ *Type*: Histogram (floating-point)(Latency Buckets)
*Unit*: `s`
* **grpc.client.attempt.sent_total_compressed_message_size**
Total bytes (compressed but not encrypted) sent across all request messages (metadata excluded) per RPC attempt; does not include grpc or transport framing bytes.
Attributes: grpc.method, grpc.target, grpc.status
- Type: Histogram (Size Buckets)
+ Type: Histogram (integer)(Size Buckets)
*Unit*: `By`
* **grpc.client.attempt.rcvd_total_compressed_message_size**
Total bytes (compressed but not encrypted) received across all response messages (metadata excluded) per RPC attempt; does not include grpc or transport framing bytes.
*Attributes*: grpc.method, grpc.target, grpc.status
- *Type*: Histogram (Size Buckets)
+ *Type*: Histogram (integer)(Size Buckets)
*Unit*: `By`

 #### Client Per-Call Instruments
@@ -432,7 +433,7 @@ the data.
 End timestamp - Before the status of the RPC is delivered to the application.
If the implementation uses an interceptor then the exact start and end timestamps would depend on the ordering of the interceptors. Non-interceptor implementations should record the timestamps as close as possible to the top of the gRPC stack, i.e., payload serialization should be included in the measurement.
*Attributes*: grpc.method, grpc.target, grpc.status
- *Type*: Histogram (Latency Buckets)
+ *Type*: Histogram (floating-point)(Latency Buckets)
*Unit*: `s`

 #### Server Instruments
@@ -440,24 +441,24 @@ the data.
 * **grpc.server.call.started**
The total number of RPCs started, including those that have not completed.
*Attributes*: grpc.method
- *Type*: counter
+ *Type*: counter (integer)
*Unit*: {call}
* **grpc.server.call.sent_total_compressed_message_size**
Total bytes (compressed but not encrypted) sent across all response messages (metadata excluded) per RPC; does not include grpc or transport framing bytes.
*Attributes*: grpc.method, grpc.status
- *Type*: Histogram (Size Buckets)
+ *Type*: Histogram (integer)(Size Buckets)
*Unit*: `By`
* **grpc.server.call.rcvd_total_compressed_message_size**
Total bytes (compressed but not encrypted) received across all request messages (metadata excluded) per RPC; does not include grpc or transport framing bytes.
*Attributes*: grpc.method, grpc.status
- *Type*: Histogram (Size Buckets)
+ *Type*: Histogram (integer)(Size Buckets)
*Unit*: `By`
* **grpc.server.call.duration**
This metric aims to measure the end2end time an RPC takes from the server transport’s (HTTP2/ inproc) perspective.
Start timestamp - After the transport knows that it's got a new stream. For HTTP2, this would be after the first header frame for the stream has been received and decoded. Whether the timestamp is recorded before or after HPACK is left to the implementation.
End timestamp - Ends at the first point where the transport considers the stream done. For HTTP2, this would be when scheduling a trailing header with END_STREAM to be written, or RST_STREAM, or a connection abort. Note that this wouldn’t necessarily mean that the bytes have also been immediately scheduled to be written by TCP.
*Attributes*: grpc.method, grpc.status
- *Type*: Histogram (Latency Buckets)
+ *Type*: Histogram (floating-point)(Latency Buckets)
*Unit*: `s`

 ### Migration from OpenCensus
diff --git a/A78-grpc-metrics-wrr-pf-xds.md b/A78-grpc-metrics-wrr-pf-xds.md
index 56a92486b..ff85ab37e 100644
--- a/A78-grpc-metrics-wrr-pf-xds.md
+++ b/A78-grpc-metrics-wrr-pf-xds.md
@@ -1,34 +1,35 @@
-A78: gRPC OTel Metrics for WRR, Pick First, and XdsClient
-----
-* Author(s): @markdroth
-* Approver: @ejona86, @dfawley
-* Status: {Draft, In Review, Ready for Implementation, Implemented}
-* Implemented in:
-* Last updated: 2024-04-08
-* Discussion at: https://groups.google.com/g/grpc-io/c/A2Mqz8OMDys
+## A78: gRPC OTel Metrics for WRR, Pick First, and XdsClient
+
+* Author(s): @markdroth
+* Approver: @ejona86, @dfawley
+* Status: {Draft, In Review, Ready for Implementation, Implemented}
+* Implemented in:
+* Last updated: 2024-04-08
+* Discussion at: https://groups.google.com/g/grpc-io/c/A2Mqz8OMDys

 ## Abstract

 This document proposes some new metrics that will be added in gRPC for the
-Weighted Round Robin (WRR) and Pick First LB policies and for the XdsClient.
-It also adds a new optional label for the existing per-call metrics.
+Weighted Round Robin (WRR) and Pick First LB policies and for the XdsClient. It
+also adds a new optional label for the existing per-call metrics.

 ## Background

-gRPC recently added a set of basic per-call metrics, defined in [A66].
-[A79] is building upon that by providing a framework for non-per-call
-metrics. The metrics described in this document will be the first
-metrics added using that new non-per-call metric framework.
-
-### Related Proposals:
-* [A66]: OpenTelemetry Metrics
-* [A79]: gRPC Non-Per-Call Metrics Framework (pending)
-* [A58]: Weighted Round Robin LB Policy
-* [A62]: Pick First: Sticky TRANSIENT_FAILURE and Address Order Randomization
-* [A27]: xDS-Based Global Load Balancing
-* [A28]: xDS Traffic Splitting and Routing
-* [A71]: xDS Fallback
-* [A57]: XdsClient Failure Mode Behavior
+gRPC recently added a set of basic per-call metrics, defined in [A66]. [A79] is
+building upon that by providing a framework for non-per-call metrics. The
+metrics described in this document will be the first metrics added using that
+new non-per-call metric framework.
+
+### Related Proposals:
+
+* [A66]: OpenTelemetry Metrics
+* [A79]: gRPC Non-Per-Call Metrics Framework (pending)
+* [A58]: Weighted Round Robin LB Policy
+* [A62]: Pick First: Sticky TRANSIENT_FAILURE and Address Order Randomization
+* [A27]: xDS-Based Global Load Balancing
+* [A28]: xDS Traffic Splitting and Routing
+* [A71]: xDS Fallback
+* [A57]: XdsClient Failure Mode Behavior

 [A66]: A66-otel-stats.md
 [A79]: https://github.com/grpc/proposal/pull/421
@@ -45,129 +46,162 @@ This document proposes changes to the following gRPC components.

 ### Optional xDS Locality Label

-When xDS is used, it is desirable for some metrics to include an optional
-label indicating which xDS locality the metrics are associated with.
-We want to provide this optional label for the metrics in both the
-existing per-call metrics defined in [A66] and in the new metrics for
-the WRR LB policy, described below.
+When xDS is used, it is desirable for some metrics to include an optional label
+indicating which xDS locality the metrics are associated with. We want to
+provide this optional label for the metrics in both the existing per-call
+metrics defined in [A66] and in the new metrics for the WRR LB policy, described
+below.

-If locality information is available, the value of this label will be of
-the form `{region="${REGION}", zone="${ZONE}", sub_zone="${SUB_ZONE}"}`,
-where `${REGION}`, `${ZONE}`, and `${SUB_ZONE}` are replaced with the
-actual values. If no locality information is available, the label will
-be set to the empty string.
+If locality information is available, the value of this label will be of the
+form `{region="${REGION}", zone="${ZONE}", sub_zone="${SUB_ZONE}"}`, where
+`${REGION}`, `${ZONE}`, and `${SUB_ZONE}` are replaced with the actual values.
+If no locality information is available, the label will be set to the empty
+string.

 #### Per-Call Metrics

-To support the locality label in the per-call metrics, we will provide
-a mechanism for LB picker to add optional labels to the call attempt
-tracer. We will then use this mechanism in the `xds_cluster_impl`
-policy's picker to set the locality label. It will get the locality
-label from the wrapped subchannel that it is already creating for load
-reporting purposes, when that subchannel is returned by the child picker.
+To support the locality label in the per-call metrics, we will provide a
+mechanism for LB picker to add optional labels to the call attempt tracer. We
+will then use this mechanism in the `xds_cluster_impl` policy's picker to set
+the locality label. It will get the locality label from the wrapped subchannel
+that it is already creating for load reporting purposes, when that subchannel is
+returned by the child picker.

-This label will be available on the following per-call metrics:
-- `grpc.client.attempt.duration`
-- `grpc.client.attempt.sent_total_compressed_message_size`
-- `grpc.client.attempt.rcvd_total_compressed_message_size`
+This label will be available on the following per-call metrics: -
+`grpc.client.attempt.duration` -
+`grpc.client.attempt.sent_total_compressed_message_size` -
+`grpc.client.attempt.rcvd_total_compressed_message_size`

 #### Weighted Target LB Policy

 To support the locality label in the WRR metrics, we will extend the
-`weighted_target` LB policy (see [A28]) to define a resolver attribute
-that indicates the name of its child. This attribute will be passed down
-to each of its children with the appropriate value, so that any LB policy
-that sits underneath the `weighted_target` policy will be able to use it.
+`weighted_target` LB policy (see [A28]) to define a resolver attribute that
+indicates the name of its child. This attribute will be passed down to each of
+its children with the appropriate value, so that any LB policy that sits
+underneath the `weighted_target` policy will be able to use it.

 ### Weighted Round Robin LB Policy

-The `weighted_round_robin` LB policy is described in [A58]. We propose to
-add the following metrics to it.
+The `weighted_round_robin` LB policy is described in [A58]. We propose to add
+the following metrics to it.

 WRR metrics will have the following labels:

-| Name | Disposition | Description |
-| ----------- | ----------- | ----------- |
-| grpc.target | required | Indicates the target of the gRPC channel in which WRR is used. (Same as the attribute defined in [A66].) |
-| grpc.lb.locality | optional | The locality to which the traffic is being sent. This will be set to the resolver attribute passed down from the `weighted_target` policy, or the empty string if the resolver attribute is unset. |
+| Name | Disposition | Description |
+| ---------------- | ----------- | ------------------------------------------- |
+| grpc.target | required | Indicates the target of the gRPC channel in |
+: : : which WRR is used. (Same as the attribute :
+: : : defined in [A66].) :
+| grpc.lb.locality | optional | The locality to which the traffic is being |
+: : : sent. This will be set to the resolver :
+: : : attribute passed down from the :
+: : : `weighted_target` policy, or the empty :
+: : : string if the resolver attribute is unset. :

 The following metrics will be exported:

-| Name | Type | Unit | Labels | Description |
-| ------------- | ----- | ----- | ------- | ----------- |
-| grpc.lb.wrr.rr_fallback | Counter | {update} | grpc.target, grpc.lb.locality | Number of scheduler updates in which there were not enough endpoints with valid weight, which caused the WRR policy to fall back to RR behavior. |
-| grpc.lb.wrr.endpoint_weight_not_yet_usable | Counter | {endpoint} | grpc.target, grpc.lb.locality | Number of endpoints from each scheduler update that don't yet have usable weight information (i.e., either the load report has not yet been received, or it is within the blackout period). |
-| grpc.lb.wrr.endpoint_weight_stale | Counter | {endpoint} | grpc.target, grpc.lb.locality | Number of endpoints from each scheduler update whose latest weight is older than the expiration period. |
-| grpc.lb.wrr.endpoint_weights | Histogram | {weight} | grpc.target, grpc.lb.locality | The histogram buckets will be endpoint weight ranges. Each bucket will be a counter that is incremented once for every endpoint whose weight is within that range. Note that endpoints without usable weights will have weight 0. |
+Name | Type | Unit | Labels | Description
+------------------------------------------ | ------------------------- | ---------- | ----------------------------- | -----------
+grpc.lb.wrr.rr_fallback | Counter(integer) | {update} | grpc.target, grpc.lb.locality | Number of scheduler updates in which there were not enough endpoints with valid weight, which caused the WRR policy to fall back to RR behavior.
+grpc.lb.wrr.endpoint_weight_not_yet_usable | Counter(integer) | {endpoint} | grpc.target, grpc.lb.locality | Number of endpoints from each scheduler update that don't yet have usable weight information (i.e., either the load report has not yet been received, or it is within the blackout period).
+grpc.lb.wrr.endpoint_weight_stale | Counter(integer) | {endpoint} | grpc.target, grpc.lb.locality | Number of endpoints from each scheduler update whose latest weight is older than the expiration period.
+grpc.lb.wrr.endpoint_weights | Histogram(floating-point) | {weight} | grpc.target, grpc.lb.locality | The histogram buckets will be endpoint weight ranges. Each bucket will be a counter that is incremented once for every endpoint whose weight is within that range. Note that endpoints without usable weights will have weight 0.

 ### Pick First LB Policy

-The Pick First LB policy predates the gRFC process but was updated in
-[A62]. We propose to add the following metrics to it.
+The Pick First LB policy predates the gRFC process but was updated in [A62]. We
+propose to add the following metrics to it.

 Pick First metrics will have the following labels:

-| Name | Disposition | Description |
-| ----------- | ----------- | ----------- |
-| grpc.target | required | Indicates the target of the gRPC channel in which PF is used. (Same as the attribute defined in [A66].) |
+| Name | Disposition | Description |
+| ----------- | ----------- | ------------------------------------------------ |
+| grpc.target | required | Indicates the target of the gRPC channel in |
+: : : which PF is used. (Same as the attribute defined :
+: : : in [A66].) :

 The following metrics will be exported:

-| Name | Type | Unit | Labels | Description |
-| ------------- | ----- | ----- | ------- | ----------- |
-| grpc.lb.pick_first.disconnections | Counter | {disconnection} | grpc.target | Number of times the selected subchannel becomes disconnected. |
-| grpc.lb.pick_first.connection_attempts_succeeded | Counter | {attempt} | grpc.target | Number of successful connection attempts. |
-| grpc.lb.pick_first.connection_attempts_failed | Counter | {attempt} | grpc.target | Number of failed connection attempts. |
+Name | Type | Unit | Labels | Description
+------------------------------------------------ | ---------------- | --------------- | ----------- | -----------
+grpc.lb.pick_first.disconnections | Counter(integer) | {disconnection} | grpc.target | Number of times the selected subchannel becomes disconnected.
+grpc.lb.pick_first.connection_attempts_succeeded | Counter(integer) | {attempt} | grpc.target | Number of successful connection attempts.
+grpc.lb.pick_first.connection_attempts_failed | Counter(integer) | {attempt} | grpc.target | Number of failed connection attempts.

 ### XdsClient

-The XdsClient component was originally described in [A27]. Note that in
-[A71], we are moving from a single global XdsClient instance to a
-separate global XdsClient instance for each channel target. The
-proposed metric schema here reflects that change.
+The XdsClient component was originally described in [A27]. Note that in [A71],
+we are moving from a single global XdsClient instance to a separate global
+XdsClient instance for each channel target. The proposed metric schema here
+reflects that change.

 XdsClient metrics will have the following labels:

-| Name | Disposition | Description |
-| ----------- | ----------- | ----------- |
-| grpc.target | required | For clients, indicates the target of the gRPC channel in which the XdsClient is used (i.e., the same as the attribute defined in [A66]). For servers, will be the string "#server". |
-| grpc.xds.server | required | The target URI of the xDS server with which the XdsClient is communicating. |
-| grpc.xds.authority | required | The xDS authority. The value will be "#old" for old-style non-xdstp resource names. |
-| grpc.xds.cache_state | required | Indicates the cache state of an xDS resource. The value will be one of:<br>  • "requested": The resource has been requested from the xDS server but has not yet been received.<br>  • "does_not_exist": The server has indicated that the resource does not exist.<br>  • "acked": The resource has been received and is valid.<br>  • "nacked": The resource was received but was not valid.<br>  • "nacked_but_cached": There is a version of the resource cached, but the most recent update of the resource was invalid.<br> |
-| grpc.xds.resource_type | required | Indicates an xDS resource type, such as "envoy.config.listener.v3.Listener". |
+| Name | Disposition | Description |
+| ---------------------- | ----------- | ------------------------------------ |
+| grpc.target | required | For clients, indicates the target of |
+: : : the gRPC channel in which the :
+: : : XdsClient is used (i.e., the same as :
+: : : the attribute defined in [A66]). For :
+: : : servers, will be the string :
+: : : "#server". :
+| grpc.xds.server | required | The target URI of the xDS server |
+: : : with which the XdsClient is :
+: : : communicating. :
+| grpc.xds.authority | required | The xDS authority. The value will be |
+: : : "#old" for old-style non-xdstp :
+: : : resource names. :
+| grpc.xds.cache_state | required | Indicates the cache state of an xDS |
+: : : resource. The value will be one of\: :
+: : : <br>  • "requested"\: The resource :
+: : : has been requested from the xDS :
+: : : server but has not yet been :
+: : : received.<br>  • "does_not_exist"\: The :
+: : : server has indicated that the :
+: : : resource does not :
+: : : exist.<br>  • "acked"\: The resource has :
+: : : been received and is :
+: : : valid.<br>  • "nacked"\: The resource :
+: : : was received but was not :
+: : : valid.<br>  • "nacked_but_cached"\: :
+: : : There is a version of the resource :
+: : : cached, but the most recent update :
+: : : of the resource was invalid.<br> :
+| grpc.xds.resource_type | required | Indicates an xDS resource type, such |
+: : : as :
+: : : "envoy.config.listener.v3.Listener". :

 The following metrics will be exported:

-| Name | Type | Unit | Labels | Description |
-| ------------- | ----- | ----- | ------- | ----------- |
-| grpc.xds_client.connected | Gauge | {bool} | grpc.target, grpc.xds.server | Whether or not the xDS client currently has a working ADS stream to the xDS server. For a given server, this will be set to 1 when the stream is initially created. It will be set to 0 when we have a connectivity failure or when the ADS stream fails without seeing a response message, as per [A57]. Once set to 0, it will be reset to 1 when we receive the first response on an ADS stream. |
-| grpc.xds_client.server_failure | Counter | {failure} | grpc.target, grpc.xds.server | A counter of xDS servers going from healthy to unhealthy. A server goes unhealthy when we have a connectivity failure or when the ADS stream fails without seeing a response message, as per gRFC A57. |
-| grpc.xds_client.resource_updates_valid | Counter | {resource} | grpc.target, grpc.xds.server, grpc.xds.resource_type | A counter of resources received that were considered valid. The counter will be incremented even for resources that have not changed. |
-| grpc.xds_client.resource_updates_invalid | Counter | {resource} | grpc.target, grpc.xds.server, grpc.xds.resource_type | A counter of resources received that were considered invalid. |
-| grpc.xds_client.resources | Gauge | {resource} | grpc.target, grpc.xds.authority, grpc.xds.cache_state, grpc.xds.resource_type | Number of xDS resources. |
+Name | Type | Unit | Labels | Description
+---------------------------------------- | ---------------- | ---------- | ----------------------------------------------------------------------------- | -----------
+grpc.xds_client.connected | Gauge(integer) | {bool} | grpc.target, grpc.xds.server | Whether or not the xDS client currently has a working ADS stream to the xDS server. For a given server, this will be set to 1 when the stream is initially created. It will be set to 0 when we have a connectivity failure or when the ADS stream fails without seeing a response message, as per [A57]. Once set to 0, it will be reset to 1 when we receive the first response on an ADS stream.
+grpc.xds_client.server_failure | Counter(integer) | {failure} | grpc.target, grpc.xds.server | A counter of xDS servers going from healthy to unhealthy. A server goes unhealthy when we have a connectivity failure or when the ADS stream fails without seeing a response message, as per gRFC A57.
+grpc.xds_client.resource_updates_valid | Counter(integer) | {resource} | grpc.target, grpc.xds.server, grpc.xds.resource_type | A counter of resources received that were considered valid. The counter will be incremented even for resources that have not changed.
+grpc.xds_client.resource_updates_invalid | Counter(integer) | {resource} | grpc.target, grpc.xds.server, grpc.xds.resource_type | A counter of resources received that were considered invalid.
+grpc.xds_client.resources | Gauge(integer) | {resource} | grpc.target, grpc.xds.authority, grpc.xds.cache_state, grpc.xds.resource_type | Number of xDS resources.

 ### Metric Stability

-All metrics added in this proposal will start as experimental and
-therefore off by default. The long term goal will be to
-de-experimentalize them and have them be on by default, but the exact
-criteria for that change are TBD.
+All metrics added in this proposal will start as experimental and therefore off
+by default. The long term goal will be to de-experimentalize them and have them
+be on by default, but the exact criteria for that change are TBD.

 ### Temporary environment variable protection

-This proposal does not include any features enabled via external I/O, so
-it does not need environment variable protection.
+This proposal does not include any features enabled via external I/O, so it does
+not need environment variable protection.

 ## Rationale

-The metrics defined here are generally a trade-off between the usefulness
-of the metric and the cost of reporting it. As an example, for the
-WRR metrics, we considered exporting a histogram of how old the last
-backend load report was for each RPC sent to a particular backend, but
-that would have been extremely expensive. So instead, we are reporting
-the number of stale weights on each scheduler update.
+The metrics defined here are generally a trade-off between the usefulness of the
+metric and the cost of reporting it. As an example, for the WRR metrics, we
+considered exporting a histogram of how old the last backend load report was for
+each RPC sent to a particular backend, but that would have been extremely
+expensive. So instead, we are reporting the number of stale weights on each
+scheduler update.

 ## Implementation

-Will be implemented in C-core by @markdroth, in Java by @dnvindhya, and
-in Go by @zasweq.
+Will be implemented in C-core by @markdroth, in Java by @dnvindhya, and in Go by
+@zasweq.
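
Reviewer note on the A78 XdsClient metrics: the descriptions of `grpc.xds_client.connected` and `grpc.xds_client.server_failure` in the hunk above describe a small state machine. The Python sketch below models one possible reading of it, as a reading aid only. The `XdsServerHealth` class and its method names are hypothetical and are not part of gRPC or of this proposal. It also assumes that the failure counter is incremented only on a healthy-to-unhealthy transition, which is an interpretation of "going from healthy to unhealthy".

```python
from dataclasses import dataclass


@dataclass
class XdsServerHealth:
    """Hypothetical per-server model of the two ADS health metrics (not a gRPC API)."""

    connected: int = 0           # value that grpc.xds_client.connected would report
    server_failures: int = 0     # value that grpc.xds_client.server_failure would report
    _stream_created_before: bool = False
    _saw_response: bool = False  # has the current ADS stream seen a response?

    def on_stream_created(self) -> None:
        # Only the initial stream creation sets the gauge to 1. A stream created
        # after a failure must see a response before the gauge returns to 1.
        if not self._stream_created_before:
            self._stream_created_before = True
            self.connected = 1
        self._saw_response = False

    def on_response(self) -> None:
        # The first response on an ADS stream resets the gauge to 1.
        self._saw_response = True
        self.connected = 1

    def on_connectivity_failure(self) -> None:
        self._mark_unhealthy()

    def on_stream_closed(self) -> None:
        # A stream that ends without seeing any response counts as a failure.
        if not self._saw_response:
            self._mark_unhealthy()

    def _mark_unhealthy(self) -> None:
        # Assumption: count only healthy -> unhealthy transitions.
        if self.connected == 1:
            self.server_failures += 1
        self.connected = 0
```

Under this reading, a stream that closes after at least one response leaves both values unchanged, and repeated failed reconnects while the gauge is already 0 do not inflate the counter.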