Added the docs for all the Grafana dashboards. #21795
@@ -0,0 +1,3 @@
```release-note:feature
docs: added the docs for the grafana dashboards
```
@@ -0,0 +1,91 @@
---
layout: docs
page_title: Dashboard for Consul dataplane metrics
description: >-
  This Grafana dashboard provides Consul dataplane metrics on Kubernetes deployments. Learn about the Grafana queries that produce the metrics and visualizations in this dashboard.
---

# Consul dataplane monitoring dashboard

This page provides reference information about the Grafana dashboard configuration included in [this GitHub repository](https://github.com/YasminLorinKaygalak/GrafanaDemo/tree/main). The Consul dataplane dashboard provides a comprehensive view of service health, performance, and resource utilization within the Consul service mesh.

You can monitor key metrics at both the cluster and service levels with this dashboard. It can help you ensure service reliability and performance.

![Preview of the Consul dataplane dashboard](../../../../public/img/grafana/consul-dataplane-dashboard.png)
> Reviewer: This should be an absolute path.
## Consul dataplane metrics

The Consul dataplane dashboard provides the following information about service mesh operations.

### Live service count

- **Grafana query:** `sum(envoy_server_live{app=~"$service"})`
- **Description:** Displays the total number of live Envoy proxies currently running in the service mesh. It helps track the overall availability of services and identify any outages or other widespread issues in the service mesh.
> Reviewer: To meet the request from @missylbytes to make the Grafana query easy to copy and paste, I'd suggest making these formatting changes to each of the sections.

> Reviewer: This code block should also specify the language to enable syntax highlighting. Alternatively you can place the …
### Total request success rate

- **Grafana query:** `sum(irate(envoy_cluster_upstream_rq_xx{...}[10m]))`
- **Description:** Tracks the percentage of successful requests across the service mesh. It excludes 4xx and 5xx response codes to focus on operational success. Use it to monitor the overall reliability of your services.
> Reviewer: I think the mention of error response codes can be omitted because, by definition, a metric that tracks successful responses should only apply to successful HTTP response codes.
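The arithmetic behind this panel can be sketched in Python. The function and input names below are hypothetical and only illustrate how a success percentage is derived from per-class response counts; this is not the dashboard's actual implementation.

```python
def success_rate(counts_by_class):
    """Return the percentage of requests that were not 4xx or 5xx errors.

    counts_by_class maps an HTTP response class ("2xx", "3xx", ...) to a
    request count, loosely mirroring the envoy_cluster_upstream_rq_xx
    series grouped by its response-code-class label.
    """
    total = sum(counts_by_class.values())
    if total == 0:
        return 100.0  # no traffic: report full success rather than divide by zero
    failures = counts_by_class.get("4xx", 0) + counts_by_class.get("5xx", 0)
    return 100.0 * (total - failures) / total

print(success_rate({"2xx": 950, "3xx": 20, "4xx": 20, "5xx": 10}))  # → 97.0
```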
### Total failed requests

- **Grafana query:** `sum(increase(envoy_cluster_upstream_rq_xx{...}[10m]))`
- **Description:** This pie chart shows the total number of failed requests within the service mesh, categorized by service. It provides a visual breakdown of where failures are occurring, allowing operators to focus on problematic services.
### Requests per second

- **Grafana query:** `sum(rate(envoy_http_downstream_rq_total{...}[5m]))`
- **Description:** This metric shows the rate of incoming HTTP requests per second to the selected services. It helps operators understand the current load on services and how much traffic they are processing.
### Unhealthy clusters

- **Grafana query:** `(sum(envoy_cluster_membership_healthy{...}) - sum(envoy_cluster_membership_total{...}))`
- **Description:** This metric tracks the number of unhealthy clusters in the mesh, helping operators identify services that are experiencing issues and need attention to ensure operational health.
### Heap size

- **Grafana query:** `sum(envoy_server_memory_heap_size{app=~"$service"})`
- **Description:** This metric displays the total memory heap size of the Envoy proxies. Monitoring heap size is essential to detect memory issues and ensure that services are operating efficiently.
> Reviewer: I think this last part should be omitted because the heap size is really dependent on how the proxy is configured, the number of connections it is processing, etc. Proxies that are processing high volumes of traffic may have larger heap sizes than proxies that are fronting lower-traffic services. The mere presence of a large heap size alone is not indicative of a problem. Operators would need to evaluate other metrics to determine if the heap allocation is unusual given the traffic load and configuration profile of a given proxy.
### Allocated memory

- **Grafana query:** `sum(envoy_server_memory_allocated{app=~"$service"})`
- **Description:** This metric shows the amount of memory allocated by the Envoy proxies. It helps operators monitor the resource usage of services to prevent memory overuse and optimize performance.
### Avg uptime per node

- **Grafana query:** `avg(envoy_server_uptime{app=~"$service"})`
- **Description:** This metric calculates the average uptime of Envoy proxies across all nodes. It helps operators monitor the stability of services and detect potential issues with service restarts or crashes.
### Cluster state

- **Grafana query:** `(sum(envoy_cluster_membership_total{...}) - sum(envoy_cluster_membership_healthy{...})) == bool 0`
- **Description:** This metric indicates whether all clusters are healthy. It provides a quick overview of the cluster state to ensure that there are no issues affecting service performance.
> Reviewer: As far as I can tell, this tracks the number of members in a given cluster. This number is going to vary per logical service based on the number of provisioned upstream instances. I don't know that it makes sense to track this other than to see how many service instances are online at a given time.
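PromQL's `== bool` modifier turns a comparison into a 0/1 value instead of filtering series, which is what lets this panel render as a simple health flag. A hypothetical Python analogue of that arithmetic (the function name is illustrative, not part of the dashboard):

```python
def cluster_state_flag(total_members, healthy_members):
    """Mimic `(total - healthy) == bool 0`: return 1 when every cluster
    member is healthy, else 0 (PromQL's `bool` modifier yields 0 or 1)."""
    return 1 if (total_members - healthy_members) == 0 else 0

print(cluster_state_flag(5, 5))  # → 1 (all members healthy)
print(cluster_state_flag(5, 4))  # → 0 (at least one unhealthy member)
```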
### CPU throttled seconds by namespace

- **Grafana query:** `rate(container_cpu_cfs_throttled_seconds_total{namespace=~"$namespace"}[5m])`
- **Description:** This metric tracks the number of seconds during which CPU usage was throttled. Monitoring CPU throttling helps operators identify when services are exceeding their allocated CPU limits and may need optimization.
### Memory usage by pod limits

- **Grafana query:** `100 * max(container_memory_working_set_bytes{namespace=~"$namespace"} / kube_pod_container_resource_limits{resource="memory"})`
- **Description:** This metric shows memory usage as a percentage of the memory limit set for each pod. It helps operators ensure that services are staying within their allocated memory limits to avoid performance degradation.
### CPU usage by pod limits

- **Grafana query:** `100 * max(container_cpu_usage{namespace=~"$namespace"} / kube_pod_container_resource_limits{resource="cpu"})`
- **Description:** This metric displays CPU usage as a percentage of the CPU limit set for each pod. Monitoring CPU usage helps operators optimize service performance and prevent CPU exhaustion.
### Total active upstream connections

- **Grafana query:** `sum(envoy_cluster_upstream_cx_active{app=~"$service"})`
- **Description:** This metric tracks the total number of active upstream connections to other services in the mesh. It provides insight into service dependencies and network load.
### Total active downstream connections

- **Grafana query:** `sum(envoy_http_downstream_cx_active{app=~"$service"})`
- **Description:** This metric tracks the total number of active downstream connections from services to clients. It helps operators monitor service load and ensure that services are able to handle the traffic effectively.
@@ -0,0 +1,80 @@
---
layout: docs
page_title: Dashboard for Consul k8s control plane metrics
description: >-
  This documentation provides an overview of the Consul K8s Dashboard
---

# Consul k8s monitoring (Control Plane) dashboard
### Number of Consul servers

- **Grafana query:** `consul_consul_server_0_consul_members_servers{pod="consul-server-0"}`
- **Description:** Displays the number of Consul servers currently active. This metric provides insight into the cluster's health and the number of Consul nodes running in the environment.
> Reviewer: This only retrieves the metric from the pod named …
### Number of connected Consul dataplanes

- **Grafana query:** `count(consul_dataplane_envoy_connected)`
- **Description:** Tracks the number of connected Consul dataplanes. This metric helps operators understand how many Envoy sidecars are actively connected to the mesh.
### CPU usage in seconds (Consul servers)

- **Grafana query:** `rate(container_cpu_usage_seconds_total{container="consul", pod=~"consul-server-.*"}[5m])`
- **Description:** This metric shows the CPU usage of the Consul servers over time, helping operators monitor resource consumption.
### Memory usage (Consul servers)

- **Grafana query:** `container_memory_working_set_bytes{container="consul", pod=~"consul-server-.*"}`
- **Description:** Displays the memory usage of the Consul servers. This metric helps ensure that the servers have sufficient memory resources for proper operation.
### Disk read/write total per 5 minutes (Consul servers)

- **Grafana query:** `sum(rate(container_fs_writes_bytes_total{pod=~"consul-server-.*", container="consul"}[5m])) by (pod, device)`
- **Grafana query:** `sum(rate(container_fs_reads_bytes_total{pod=~"consul-server-.*", container="consul"}[5m])) by (pod, device)`
- **Description:** Monitors disk read and write operations over 5-minute intervals for Consul servers. This helps identify potential disk bottlenecks or issues.
### Received bytes total per 5 minutes (Consul servers)

- **Grafana query:** `sum(rate(container_network_receive_bytes_total{pod=~"consul-server-.*"}[5m])) by (pod)`
- **Description:** Tracks the total network bytes received by Consul servers within a 5-minute window. This metric helps assess the network load on Consul nodes.
### Memory limit (Consul servers)

- **Grafana query:** `kube_pod_container_resource_limits{resource="memory", pod="consul-server-0"}`
- **Description:** Displays the memory limit for Consul servers. This metric ensures that memory usage stays within the defined limits for each Consul server.
### CPU limit in seconds (Consul servers)

- **Grafana query:** `kube_pod_container_resource_limits{resource="cpu", pod="consul-server-0"}`
- **Description:** Displays the CPU limit for Consul servers. Monitoring CPU limits helps operators ensure that the services are not constrained by resource limitations.
### Disk usage (Consul servers)

- **Grafana query:** `sum(container_fs_usage_bytes{}) by (pod)`
- **Grafana query:** `sum(container_fs_usage_bytes{pod="consul-server-0"})`
- **Description:** Shows the amount of filesystem storage used by Consul servers. This metric helps operators track disk usage and plan for capacity.
### CPU usage in seconds (Connect injector)

- **Grafana query:** `rate(container_cpu_usage_seconds_total{pod=~".*-connect-injector-.*", container="sidecar-injector"}[5m])`
- **Description:** Tracks the CPU usage of the Connect Injector, which is responsible for injecting Envoy sidecars. Monitoring this helps ensure that Connect Injector has adequate CPU resources.
### CPU limit in seconds (Connect injector)

- **Grafana query:** `max(kube_pod_container_resource_limits{resource="cpu", container="sidecar-injector"})`
- **Description:** Displays the CPU limit for the Connect Injector. Monitoring the CPU limits ensures that Connect Injector is not constrained by resource limitations.
### Memory usage (Connect injector)

- **Grafana query:** `container_memory_working_set_bytes{pod=~".*-connect-injector-.*", container="sidecar-injector"}`
- **Description:** Tracks the memory usage of the Connect Injector. Monitoring this helps ensure the Connect Injector has sufficient memory resources.
### Memory limit (Connect injector)

- **Grafana query:** `max(kube_pod_container_resource_limits{resource="memory", container="sidecar-injector"})`
- **Description:** Displays the memory limit for the Connect Injector, helping to monitor if the service is nearing its resource limits.
@@ -0,0 +1,96 @@
---
layout: docs
page_title: Dashboard for Consul server metrics
description: >-
  This documentation provides an overview of the Consul Server Dashboard
---

# Consul server monitoring dashboard

### Raft commit time

- **Grafana query:** `consul_raft_commitTime`
- **Description:** This metric measures the time it takes to commit Raft log entries. Stable values are expected for a healthy cluster. High values can indicate issues with resources such as memory, CPU, or disk space.
### Raft commits per 5 minutes

- **Grafana query:** `rate(consul_raft_apply[5m])`
- **Description:** This metric tracks the rate of Raft log commits emitted by the leader, showing how quickly changes are being applied across the cluster.
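As a simplified illustration of what `rate(...[5m])` reports here (ignoring Prometheus's extrapolation and counter-reset handling), the per-second rate over a window is roughly the counter's increase divided by the window length. This Python sketch uses hypothetical sample values, not real Consul data:

```python
def simple_rate(samples, window_seconds):
    """Approximate PromQL rate(): counter increase over the window divided
    by its length, ignoring extrapolation and counter resets."""
    if len(samples) < 2:
        return 0.0
    return (samples[-1] - samples[0]) / window_seconds

# A counter that grew from 100 to 400 commits over a 5-minute (300 s) window
print(simple_rate([100, 180, 260, 400], 300))  # → 1.0 commits per second
```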
### Last contacted leader

- **Grafana query:** `consul_raft_leader_lastContact != 0`
- **Description:** Measures the duration since the last contact with the Raft leader. Spikes in this metric can indicate network issues or an unavailable leader, which may affect cluster stability.
### Election events

- **Grafana query:** `rate(consul_raft_state_candidate[1m])`, `rate(consul_raft_state_leader[1m])`
- **Description:** Tracks Raft state transitions, indicating leadership elections. Frequent transitions might suggest cluster instability and require investigation.
### Autopilot health

- **Grafana query:** `consul_autopilot_healthy`
- **Description:** A boolean metric that shows a value of 1 when Autopilot is healthy and 0 when issues are detected. Ensures that the cluster has sufficient resources and an operational leader.
### DNS queries per 5 minutes

- **Grafana query:** `rate(consul_dns_domain_query_count[5m])`
- **Description:** This metric tracks the rate of DNS queries per node, bucketed into 5-minute intervals. It helps monitor the query load on Consul's DNS service.
### DNS domain query time

- **Grafana query:** `consul_dns_domain_query`
- **Description:** Measures the time spent handling DNS domain queries. Spikes in this metric may indicate high contention in the catalog or too many concurrent queries.
### DNS reverse query time

- **Grafana query:** `consul_dns_ptr_query`
- **Description:** Tracks the time spent processing reverse DNS queries. Spikes in query time may indicate performance bottlenecks or increased workload.
### KV applies per 5 minutes

- **Grafana query:** `rate(consul_kvs_apply_count[5m])`
- **Description:** This metric tracks the rate of Key-Value store applies over 5-minute intervals, indicating the operational load on Consul's KV store.
### KV apply time

- **Grafana query:** `consul_kvs_apply`
- **Description:** Measures the time taken to apply updates to the Key-Value store. Spikes in this metric might suggest resource contention or client overload.
### Transaction apply time

- **Grafana query:** `consul_txn_apply`
- **Description:** Tracks the time spent applying transaction operations in Consul, providing insights into potential bottlenecks in transactional workloads.
### ACL resolves per 5 minutes

- **Grafana query:** `rate(consul_acl_ResolveToken_count[5m])`
- **Description:** This metric tracks the rate of ACL token resolutions per 5-minute intervals. It provides insights into the activity related to ACL tokens within the cluster.
### ACL resolve token time

- **Grafana query:** `consul_acl_ResolveToken`
- **Description:** Measures the time taken to resolve ACL tokens into their associated policies. Spikes in this metric might indicate resource issues or configuration problems.
### ACL updates per 5 minutes

- **Grafana query:** `rate(consul_acl_apply_count[5m])`
- **Description:** Tracks the rate of ACL updates per 5-minute intervals. This metric helps monitor changes in ACL configurations over time.
### ACL apply time

- **Grafana query:** `consul_acl_apply`
- **Description:** Measures the time spent applying ACL changes. Spikes in apply time might suggest resource constraints or high operational load.
### Catalog operations per 5 minutes

- **Grafana query:** `rate(consul_catalog_register_count[5m])`, `rate(consul_catalog_deregister_count[5m])`
- **Description:** Tracks the rate of register and deregister operations in the Consul catalog, providing insights into the churn of services within the cluster.
### Catalog operation time

- **Grafana query:** `consul_catalog_register`, `consul_catalog_deregister`
- **Description:** Measures the time taken to complete catalog register or deregister operations. Spikes in this metric can indicate performance issues within the catalog.
> Reviewer: Spikes in these values just mean that a large number of services were registered or deregistered. It does not necessarily mean that there is a performance issue.
@@ -0,0 +1,115 @@
---
layout: docs
page_title: Service Mesh Observability - Dashboards
description: >-
  This documentation provides an overview of several dashboards designed for monitoring and managing services within a Consul-managed Envoy service mesh. Learn how to enable access logs and configure key performance and operational metrics to ensure the reliability and performance of services in the service mesh.
---

# Dashboards for service mesh observability

This topic describes the configuration and usage of dashboards for monitoring and managing services within a Consul-managed Envoy service mesh. These dashboards provide critical insights into the health, performance, and resource utilization of services. The dashboards described here are essential tools for ensuring the stability, efficiency, and reliability of your service mesh environment.
## Dashboards overview

The repository includes the following dashboards:

- **Consul service-to-service dashboard**: Provides a detailed view of service-to-service communications, monitoring key metrics like access logs, HTTP requests, error counts, response code distributions, and request success rates. The dashboard includes customizable filters for focusing on specific services and namespaces.

- **Consul service dashboard**: Tracks key metrics for Envoy proxies at the cluster and service levels, ensuring the performance and reliability of individual services within the mesh.

- **Consul dataplane dashboard**: Offers a comprehensive overview of service health and performance, including request success rates, resource utilization (CPU and memory), active connections, and cluster health. It helps operators maintain service reliability and optimize resource usage.
- **Consul k8s dashboard**: Focuses on monitoring the health and resource usage of the Consul control plane within a Kubernetes environment, ensuring the stability of the control plane.

- **Consul server dashboard**: Provides detailed monitoring of Consul servers, tracking key metrics like server health, CPU and memory usage, disk I/O, and network performance. This dashboard is critical for ensuring the stability and performance of Consul servers within the service mesh.

## Enabling observability

Add the following configurations to your Consul Helm chart to enable the observability tools in [the sample repo](https://github.com/YasminLorinKaygalak/GrafanaDemo/tree/main).
<CodeTabs tabs={[ "Kubernetes YAML"]}>
> Reviewer: Code tabs are unnecessary since there aren't other tabs. … could be used if you want to highlight specific lines in the example configuration. For the configuration: are all of these values required?

> Reply: So one thing about these docs is we really can only enable Prometheus in our Helm chart. So to actually see the dashboards on Grafana, the user needs to deploy their own Grafana. I feel like that may be more of a tutorial thing? But we can for sure only include the values that apply to enabling Prometheus.

> Reviewer: See the above comment about this section: https://github.com/hashicorp/consul/pull/21795/files#r1791854125
```yaml
global:
  logLevel: trace
  name: consul
  datacenter: dc1
  tls:
    enabled: true
    enableAutoEncrypt: true
    httpsOnly: false
  acls:
    manageSystemACLs: true
  metrics:
    enabled: true
    provider: "prometheus"
    enableAgentMetrics: true
    agentMetricsRetentionTime: "10m"

prometheus:
  enabled: true

server:
  logLevel: trace
  replicas: 1
  annotations: |
    "prometheus.io/scheme": "https"
    "prometheus.io/port": "8501"

ui:
  enabled: true
  service:
    type: NodePort
  metrics:
    enabled: true
    provider: "prometheus"
    baseURL: http://prometheus-server.consul

connectInject:
  enabled: true
  metrics:
    defaultEnabled: true
  apiGateway:
    managedGatewayClass:
      serviceType: LoadBalancer
```
</CodeTabs>
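Assuming the values above are saved to a local file, the chart can then be installed or upgraded with the standard Helm workflow. The `values.yaml` filename and `consul` release name below are illustrative, not prescribed by this repository:

```shell
# Add the HashiCorp Helm repository and install/upgrade Consul with the
# observability settings above (release and file names are illustrative).
helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo update
helm upgrade --install consul hashicorp/consul \
  --namespace consul --create-namespace \
  --values values.yaml
```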
## Enable access logs

Access log configuration is defined globally in the [`proxy-defaults`](/consul/docs/connect/config-entries/proxy-defaults#accesslogs) configuration entry.

The following example is a minimal configuration for enabling access logs:
<CodeTabs tabs={[ "HCL", "Kubernetes YAML", "JSON" ]}>

```hcl
Kind = "proxy-defaults"
Name = "global"
AccessLogs {
  Enabled = true
}
```

```yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  name: global
spec:
  accessLogs:
    enabled: true
```

```json
{
  "Kind": "proxy-defaults",
  "Name": "global",
  "AccessLogs": {
    "Enabled": true
  }
}
```

</CodeTabs>
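Once saved to disk, a configuration entry like the one above can be applied with either tool, depending on the platform; the filenames below are illustrative:

```shell
# Kubernetes: apply the ProxyDefaults custom resource
kubectl apply -f proxy-defaults.yaml

# VMs/CLI: write the HCL configuration entry with the Consul CLI
consul config write proxy-defaults.hcl
```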
> Reviewer: Can this be modified to reference code that exists in the Consul repo?

> Reply: Yes, the PR just merged; we will update it.