# Added the docs for all the grafana dashboards. #21795

**Status:** Open. Wants to merge 33 commits into base: main.

## Commits
- `1f08879` Added the docs for all the grafana dashboards. (Sep 28, 2024)
- `0b0a4aa` Seperated the dashboards (YasminLorinKaygalak, Oct 3, 2024)
- `40e36df` Delete grafana-dashboards.mdx (YasminLorinKaygalak, Oct 3, 2024)
- `95af0be` Seperated the dashboards and added descriptions (YasminLorinKaygalak, Oct 3, 2024)
- `71bbf44` Seperated the dashboards and added descriptions (YasminLorinKaygalak, Oct 3, 2024)
- `3d9e426` Seperated the dashboards and added descriptions (YasminLorinKaygalak, Oct 3, 2024)
- `f4ec240` Seperated the dashboards and added descriptions (YasminLorinKaygalak, Oct 3, 2024)
- `1d10678` Typo edit (YasminLorinKaygalak, Oct 4, 2024)
- `5c695c8` added changelog for docs PR (YasminLorinKaygalak, Oct 4, 2024)
- `b3b54f9` Update website/content/docs/connect/observability/grafanadashboards/c… (YasminLorinKaygalak, Oct 7, 2024)
- `412521d` Update website/content/docs/connect/observability/grafanadashboards/i… (YasminLorinKaygalak, Oct 7, 2024)
- `7f9dcce` Update website/content/docs/connect/observability/grafanadashboards/c… (YasminLorinKaygalak, Oct 7, 2024)
- `50b62f4` Update website/content/docs/connect/observability/grafanadashboards/c… (YasminLorinKaygalak, Oct 7, 2024)
- `00bca69` Update website/content/docs/connect/observability/grafanadashboards/c… (YasminLorinKaygalak, Oct 7, 2024)
- `e0d401c` Update website/content/docs/connect/observability/grafanadashboards/c… (YasminLorinKaygalak, Oct 7, 2024)
- `c44a4a0` Update website/content/docs/connect/observability/grafanadashboards/c… (YasminLorinKaygalak, Oct 7, 2024)
- `0781b55` Update website/content/docs/connect/observability/grafanadashboards/c… (YasminLorinKaygalak, Oct 8, 2024)
- `e55756d` Testing the revised dashboard (YasminLorinKaygalak, Oct 9, 2024)
- `299a67c` Update website/content/docs/connect/observability/grafanadashboards/i… (YasminLorinKaygalak, Oct 9, 2024)
- `0fc0cf5` Update website/content/docs/connect/observability/grafanadashboards/c… (YasminLorinKaygalak, Oct 9, 2024)
- `f15688d` Update website/content/docs/connect/observability/grafanadashboards/c… (YasminLorinKaygalak, Oct 9, 2024)
- `d4fa4a0` Adding the PR revised docs (YasminLorinKaygalak, Oct 9, 2024)
- `18237bf` Adding the PR revised docs consul k8s (YasminLorinKaygalak, Oct 9, 2024)
- `c661a67` Adding the PR revised docs service dashboard (YasminLorinKaygalak, Oct 9, 2024)
- `f870516` Adding the PR revised docs consul server dashboard (YasminLorinKaygalak, Oct 9, 2024)
- `9417c1a` Adding the PR revised docs consul server dashboard insertions (YasminLorinKaygalak, Oct 9, 2024)
- `dbecc92` Adding the PR revised docs consul k8s docs edit (YasminLorinKaygalak, Oct 9, 2024)
- `ff03d17` Adding the PR revised docs service dashboard (YasminLorinKaygalak, Oct 9, 2024)
- `d8e2cbe` Added the final edits for the PR feedback (YasminLorinKaygalak, Oct 9, 2024)
- `d1cd934` Adding consul dataplane dashboard screenshoots (YasminLorinKaygalak, Oct 9, 2024)
- `6974e83` Minor edit in service to service dashboard (YasminLorinKaygalak, Oct 9, 2024)
- `eeffe3f` Minor edit in overview (YasminLorinKaygalak, Oct 9, 2024)
- `2284483` Minor edit in overview page (YasminLorinKaygalak, Oct 9, 2024)
3 changes: 3 additions & 0 deletions .changelog/21795.txt
@@ -0,0 +1,3 @@
```release-note:feature
docs: added the docs for the grafana dashboards
```
@@ -0,0 +1,91 @@
---
layout: docs
page_title: Dashboard for Consul dataplane metrics
description: >-
  This Grafana dashboard provides Consul dataplane metrics on Kubernetes deployments. Learn about the Grafana queries that produce the metrics and visualizations in this dashboard.
---

# Consul dataplane monitoring dashboard

This page provides reference information about the Grafana dashboard configuration included in [this GitHub repository](https://github.com/YasminLorinKaygalak/GrafanaDemo/tree/main). The Consul dataplane dashboard provides a comprehensive view of service health, performance, and resource utilization within the Consul service mesh.

**Member comment:**

Can this be modified to reference code that exists in the Consul repo?


**Contributor comment:**

Yes, the PR just merged; we will update it.


You can monitor key metrics at both the cluster and service levels with this dashboard. It can help you ensure service reliability and performance.

![Preview of the Consul dataplane dashboard](../../../../public/img/grafana/consul-dataplane-dashboard.png)

**Member comment:**

Suggested change
![Preview of the Consul dataplane dashboard](../../../../public/img/grafana/consul-dataplane-dashboard.png)
![Preview of the Consul dataplane dashboard](/public/img/grafana/consul-dataplane-dashboard.png)

This should be an absolute path.


## Consul dataplane metrics

The Consul dataplane dashboard provides the following information about service mesh operations.

### Live service count

- **Grafana query:** `sum(envoy_server_live{app=~"$service"})`
- **Description:** Displays the total number of live Envoy proxies currently running in the service mesh. It helps track the overall availability of services and identify any outages or other widespread issues in the service mesh.

**Contributor comment on lines +22 to +23:**

Suggested change
- **Grafana query:** `sum(envoy_server_live{app=~"$service"})`
- **Description:** Displays the total number of live Envoy proxies currently running in the service mesh. It helps track the overall availability of services and identify any outages or other widespread issues in the service mesh.
**Description:** Displays the total number of live Envoy proxies currently running in the service mesh. It helps track the overall availability of services and identify any outages or other widespread issues in the service mesh.
<CodeBlockConfig heading="Grafana query">
```
sum(envoy_server_live{app=~"$service"})
```
</CodeBlockConfig>

To meet the request from @missylbytes to make the Grafana query easy to copy-and-paste, I'd suggest making these formatting changes to each of the sections:

  1. Remove unordered list
  2. Move Description above the Grafana query
  3. Use the component with the heading set to "Grafana query" to render the code block


**Member comment:**

This code block should also specify the language to enable syntax highlighting.

<CodeBlockConfig heading="Grafana query" language="promql">

Alternatively you can place the promql directly after the ``` that signifies the start of the code block.


### Total request success rate

- **Grafana query:** `sum(irate(envoy_cluster_upstream_rq_xx{...}[10m]))`
- **Description:** Tracks the percentage of successful requests across the service mesh. It excludes 4xx and 5xx response codes to focus on operational success. Use it to monitor the overall reliability of your services.

**Member comment:**

Suggested change
- **Description:** Tracks the percentage of successful requests across the service mesh. It excludes 4xx and 5xx response codes to focus on operational success. Use it to monitor the overall reliability of your services.
- **Description:** Tracks the percentage of successful requests across the service mesh. Use it to monitor the overall reliability of your services.

I think the mention of error response codes can be omitted because by definition a metric that tracks successful responses should only apply to successful HTTP response codes.
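
As an illustrative sketch only (the panel's actual selector is elided as `{...}` above), a success-rate expression could combine the standard `envoy_response_code_class` label with the `$service` variable used elsewhere on this page:

```promql
# Hypothetical example: ratio of non-4xx/5xx upstream requests to all upstream
# requests over a 10 minute window. The real dashboard query may differ.
sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!~"4|5", app=~"$service"}[10m]))
/
sum(irate(envoy_cluster_upstream_rq_xx{app=~"$service"}[10m]))
```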


### Total failed requests

- **Grafana query:** `sum(increase(envoy_cluster_upstream_rq_xx{...}[10m]))`
- **Description:** This pie chart shows the total number of failed requests within the service mesh, categorized by service. It provides a visual breakdown of where failures are occurring, allowing operators to focus on problematic services.

**Member comment:**

Suggested change
- **Description:** This pie chart shows the total number of failed requests within the service mesh, categorized by service. It provides a visual breakdown of where failures are occurring, allowing operators to focus on problematic services.
- **Description:** This pie chart shows the total number of failed requests within the service mesh, categorized by service. It provides a visual breakdown of where failures are occurring, allowing operators to identify problematic services.
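
As an illustrative sketch only (the selector is elided as `{...}` above), a failed-request breakdown could reuse the label conventions of the other panels:

```promql
# Hypothetical example: 4xx/5xx upstream responses over a 10 minute window,
# broken down by service. The real dashboard query may differ.
sum by (app) (increase(envoy_cluster_upstream_rq_xx{envoy_response_code_class=~"4|5", app=~"$service"}[10m]))
```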


### Requests per second

- **Grafana query:** `sum(rate(envoy_http_downstream_rq_total{...}[5m]))`
- **Description:** This metric shows the rate of incoming HTTP requests per second to the selected services. It helps operators understand the current load on services and how much traffic they are processing.

**Member comment:**

Suggested change
- **Description:** This metric shows the rate of incoming HTTP requests per second to the selected services. It helps operators understand the current load on services and how much traffic they are processing.
- **Description:** This metric shows the rate of incoming HTTP requests per second to the selected services over a 5 minute period. It helps operators understand the current load on services and how much traffic they are processing.


**Member comment:**

Suggested change
- **Description:** This metric shows the rate of incoming HTTP requests per second to the selected services. It helps operators understand the current load on services and how much traffic they are processing.
- **Description:** This metric shows the rate of incoming HTTP requests per second to the selected services. It helps with understanding the current load on services and how much traffic they are processing.
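
As an illustrative sketch only (the panel's selector is elided as `{...}` above), the request-rate expression might look like the following, assuming the `$service` variable used by the other panels:

```promql
# Hypothetical example: per-second rate of downstream HTTP requests over a
# 5 minute window for the selected services.
sum(rate(envoy_http_downstream_rq_total{app=~"$service"}[5m]))
```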


### Unhealthy clusters

- **Grafana query:** `(sum(envoy_cluster_membership_healthy{...}) - sum(envoy_cluster_membership_total{...}))`
- **Description:** This metric tracks the number of unhealthy clusters in the mesh, helping operators identify services that are experiencing issues and need attention to ensure operational health.

**Member comment:**

Suggested change
- **Description:** This metric tracks the number of unhealthy clusters in the mesh, helping operators identify services that are experiencing issues and need attention to ensure operational health.
- **Description:** This metric tracks the number of unhealthy clusters in the mesh, helping to identify services that are experiencing issues and need attention to ensure operational health.


### Heap size

- **Grafana query:** `SUM(envoy_server_memory_heap_size{app=~"$service"})`
- **Description:** This metric displays the total memory heap size of the Envoy proxies. Monitoring heap size is essential to detect memory issues and ensure that services are operating efficiently.

**Member comment:**

Suggested change
- **Description:** This metric displays the total memory heap size of the Envoy proxies. Monitoring heap size is essential to detect memory issues and ensure that services are operating efficiently.
- **Description:** This metric displays the total memory heap size of the Envoy proxies.

I think this last part should be omitted because the heap size is really dependent on how the proxy is configured, the number of connections it is processing, etc. Proxies that are processing high volumes of traffic may have larger heap sizes than proxies that are fronting lower-traffic services. The mere presence of a large heap size alone is not indicative of a problem. Operators would need to evaluate other metrics to determine if the heap allocation is unusual given the traffic load and configuration profile of a given proxy.
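
As a hedged illustration of that point, heap size could be read alongside a traffic metric that already appears on this page, for example heap bytes per active downstream connection:

```promql
# Illustrative context query, not part of the dashboard: approximate heap bytes
# per active downstream connection for the selected services.
sum(envoy_server_memory_heap_size{app=~"$service"})
/
sum(envoy_http_downstream_cx_active{app=~"$service"})
```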


### Allocated memory

- **Grafana query:** `SUM(envoy_server_memory_allocated{app=~"$service"})`
- **Description:** This metric shows the amount of memory allocated by the Envoy proxies. It helps operators monitor the resource usage of services to prevent memory overuse and optimize performance.

**Member comment:**

Suggested change
- **Description:** This metric shows the amount of memory allocated by the Envoy proxies. It helps operators monitor the resource usage of services to prevent memory overuse and optimize performance.
- **Description:** This metric shows the amount of memory allocated by the Envoy proxies.


### Avg uptime per node

- **Grafana query:** `avg(envoy_server_uptime{app=~"$service"})`
- **Description:** This metric calculates the average uptime of Envoy proxies across all nodes. It helps operators monitor the stability of services and detect potential issues with service restarts or crashes.

**Member comment:**

Suggested change
- **Description:** This metric calculates the average uptime of Envoy proxies across all nodes. It helps operators monitor the stability of services and detect potential issues with service restarts or crashes.
- **Description:** This metric calculates the average uptime of Envoy proxies across all nodes. Use it to monitor the overall stability of services and detect potential issues with service restarts or crashes.


### Cluster state

- **Grafana query:** `(sum(envoy_cluster_membership_total{...}) - sum(envoy_cluster_membership_healthy{...})) == bool 0`
- **Description:** This metric indicates whether all clusters are healthy. It provides a quick overview of the cluster state to ensure that there are no issues affecting service performance.

**Member comment:**

Suggested change
- **Description:** This metric indicates whether all clusters are healthy. It provides a quick overview of the cluster state to ensure that there are no issues affecting service performance.
- **Description:** This metric indicates whether all clusters are healthy. It provides a quick overview of the cluster state to ensure that there are no issues affecting service performance.

As far as I can tell, this tracks the number of members in a given cluster. This number is going to vary per logical service based on the number of provisioned upstream instances. I don't know that it makes sense to track this other than to see how many service instances are online at a given time.
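
If the goal is simply to see how many instances of each logical service are online, an illustrative per-upstream breakdown could look like this, assuming the standard `envoy_cluster_name` label from Envoy's Prometheus stats:

```promql
# Hypothetical example: healthy member count per upstream cluster.
sum by (envoy_cluster_name) (envoy_cluster_membership_healthy{app=~"$service"})
```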


### CPU throttled seconds by namespace

- **Grafana query:** `rate(container_cpu_cfs_throttled_seconds_total{namespace=~"$namespace"}[5m])`
- **Description:** This metric tracks the number of seconds during which CPU usage was throttled. Monitoring CPU throttling helps operators identify when services are exceeding their allocated CPU limits and may need optimization.

### Memory usage by pod limits

- **Grafana query:** `100 * max(container_memory_working_set_bytes{namespace=~"$namespace"}
/ kube_pod_container_resource_limits{resource="memory"})`
- **Description:** This metric shows memory usage as a percentage of the memory limit set for each pod. It helps operators ensure that services are staying within their allocated memory limits to avoid performance degradation.

### CPU usage by pod limits

- **Grafana query:** `100 * max(container_cpu_usage{namespace=~"$namespace"} / kube_pod_container_resource_limits{resource="cpu"})`
- **Description:** This metric displays CPU usage as a percentage of the CPU limit set for each pod. Monitoring CPU usage helps operators optimize service performance and prevent CPU exhaustion.

### Total active upstream connections

- **Grafana query:** `sum(envoy_cluster_upstream_cx_active{app=~"$service"})`
- **Description:** This metric tracks the total number of active upstream connections to other services in the mesh. It provides insight into service dependencies and network load.

### Total active downstream connections

- **Grafana query:** `sum(envoy_http_downstream_cx_active{app=~"$service"})`
- **Description:** This metric tracks the total number of active downstream connections from services to clients. It helps operators monitor service load and ensure that services are able to handle the traffic effectively.

**Member comment:**

Suggested change
- **Description:** This metric tracks the total number of active downstream connections from services to clients. It helps operators monitor service load and ensure that services are able to handle the traffic effectively.
- **Description:** This metric tracks the total number of active downstream connections to a given service. It helps operators monitor service load and ensure that services are able to handle the traffic effectively.



@@ -0,0 +1,80 @@
---
layout: docs
page_title: Dashboard for Consul k8s control plane metrics
description: >-
This documentation provides an overview of the Consul K8s Dashboard

**Member comment:**

Suggested change
This documentation provides an overview of the Consul K8s Dashboard
This documentation provides an overview of the Consul Kubernetes Dashboard

---

# Consul k8s monitoring (Control Plane) dashboard

**Member comment:**

Suggested change
# Consul k8s monitoring (Control Plane) dashboard
# Consul Kubernetes monitoring (Control Plane) dashboard


### Number of Consul servers

- **Grafana query:** `consul_consul_server_0_consul_members_servers{pod="consul-server-0"}`
- **Description:** Displays the number of Consul servers currently active. This metric provides insight into the cluster's health and the number of Consul nodes running in the environment.

**Member comment on lines +12 to +13:**

Suggested change
- **Grafana query:** `consul_consul_server_0_consul_members_servers{pod="consul-server-0"}`
- **Description:** Displays the number of Consul servers currently active. This metric provides insight into the cluster's health and the number of Consul nodes running in the environment.
- **Grafana query:** `consul_consul_server_0_consul_members_servers{pod="consul-server-0"}`
- **Description:** Displays the number of Consul servers currently active. This metric provides insight into the cluster's health and the number of Consul nodes running in the environment.

This only retrieves the metric from the pod named consul-server-0. Is it possible to modify this so that the value is retrieved from any available server, or the Raft leader?
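
One possible direction, assuming every server pod exports the gauge under its own per-pod metric name, is to match on the metric name and aggregate so the panel does not depend on `consul-server-0` being up. This is a hypothetical sketch, not a confirmed query:

```promql
# Hypothetical: take the max across whichever server pods are currently reporting.
max({__name__=~"consul_consul_server_.*_consul_members_servers"})
```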


### Number of connected Consul dataplanes

- **Grafana query:** `count(consul_dataplane_envoy_connected)`
- **Description:** Tracks the number of connected Consul dataplanes. This metric helps operators understand how many Envoy sidecars are actively connected to the mesh.

### CPU usage in seconds (Consul servers)

- **Grafana query:** `rate(container_cpu_usage_seconds_total{container="consul", pod=~"consul-server-.*"}[5m])`
- **Description:** This metric shows the CPU usage of the Consul servers over time, helping operators monitor resource consumption.

**Member comment:**

Suggested change
- **Description:** This metric shows the CPU usage of the Consul servers over time, helping operators monitor resource consumption.
- **Description:** This metric shows the CPU usage of the Consul servers over time, helping monitor resource consumption.


### Memory usage (Consul servers)

- **Grafana query:** `container_memory_working_set_bytes{container="consul", pod=~"consul-server-.*"}`
- **Description:** Displays the memory usage of the Consul servers. This metric helps ensure that the servers have sufficient memory resources for proper operation.

### Disk read/write total per 5 minutes (Consul servers)

- **Grafana query:** `sum(rate(container_fs_writes_bytes_total{pod=~"consul-server-.*",
container="consul"}[5m])) by (pod, device)`
- **Grafana query:** `sum(rate(container_fs_reads_bytes_total{pod=~"consul-server-.*", container="consul"}[5m])) by (pod, device)`
- **Description:** Monitors disk read and write operations over 5-minute intervals for Consul servers. This helps identify potential disk bottlenecks or issues.

**Member comment:**

Suggested change
- **Description:** Monitors disk read and write operations over 5-minute intervals for Consul servers. This helps identify potential disk bottlenecks or issues.
- **Description:** Monitors disk read and write operations over 5-minute intervals for Consul servers. Use this metric to identify potential disk I/O bottlenecks or throughput issues.


### Received bytes total per 5 minutes (Consul servers)

- **Grafana query:** `sum(rate(container_network_receive_bytes_total{pod=~"consul-server-.*"}[5m])) by (pod)`
- **Description:** Tracks the total network bytes received by Consul servers within a 5-minute window. This metric helps assess the network load on Consul nodes.

### Memory limit (Consul servers)

- **Grafana query:** `kube_pod_container_resource_limits{resource="memory", pod="consul-server-0"}`
- **Description:** Displays the memory limit for Consul servers. This metric ensures that memory usage stays within the defined limits for each Consul server.

### CPU limit in seconds (Consul servers)

- **Grafana query:** `kube_pod_container_resource_limits{resource="cpu", pod="consul-server-0"}`
- **Description:** Displays the CPU limit for Consul servers. Monitoring CPU limits helps operators ensure that the services are not constrained by resource limitations.

### Disk usage (Consul servers)

- **Grafana query:** `sum(container_fs_usage_bytes{}) by (pod)`
- **Grafana query:** `sum(container_fs_usage_bytes{pod="consul-server-0"})`
- **Description:** Shows the amount of filesystem storage used by Consul servers. This metric helps operators track disk usage and plan for capacity.

### CPU usage in seconds (Connect injector)

- **Grafana query:** `rate(container_cpu_usage_seconds_total{pod=~".*-connect-injector-.*",
container="sidecar-injector"}[5m])`
- **Description:** Tracks the CPU usage of the Connect Injector, which is responsible for injecting Envoy sidecars. Monitoring this helps ensure that Connect Injector has adequate CPU resources.

**Member comment:**

Suggested change
- **Description:** Tracks the CPU usage of the Connect Injector, which is responsible for injecting Envoy sidecars. Monitoring this helps ensure that Connect Injector has adequate CPU resources.
- **Description:** Tracks the CPU usage of the Connect Injector, which is responsible for injecting Envoy sidecars and other operations within the mesh. Monitoring this helps ensure that Connect Injector has adequate CPU resources.

The connect-injector process also acts as the controller for API Gateway.


### CPU limit in seconds (Connect injector)

- **Grafana query:** `max(kube_pod_container_resource_limits{resource="cpu", container="sidecar-injector"})`
- **Description:** Displays the CPU limit for the Connect Injector. Monitoring the CPU limits ensures that Connect Injector is not constrained by resource limitations.

### Memory usage (Connect injector)

- **Grafana query:** `container_memory_working_set_bytes{pod=~".*-connect-injector-.*",
container="sidecar-injector"}`
- **Description:** Tracks the memory usage of the Connect Injector. Monitoring this helps ensure the Connect Injector has sufficient memory resources.

### Memory limit (Connect injector)

- **Grafana query:** `max(kube_pod_container_resource_limits{resource="memory", container="sidecar-injector"})`
- **Description:** Displays the memory limit for the Connect Injector, helping to monitor if the service is nearing its resource limits.


@@ -0,0 +1,96 @@
---
layout: docs
page_title: Dashboard for Consul server metrics
description: >-
This documentation provides an overview of the Consul Server Dashboard
---

# Consul server monitoring dashboard

### Raft commit time

- **Grafana query:** `consul_raft_commitTime`
- **Description:** This metric measures the time it takes to commit Raft log entries. Stable values are expected for a healthy cluster. High values can indicate issues with resources such as memory, CPU, or disk space.

### Raft commits per 5 minutes

- **Grafana query:** `rate(consul_raft_apply[5m])`
- **Description:** This metric tracks the rate of Raft log commits emitted by the leader, showing how quickly changes are being applied across the cluster.

### Last contacted leader

- **Grafana query:** `consul_raft_leader_lastContact != 0`
- **Description:** Measures the duration since the last contact with the Raft leader. Spikes in this metric can indicate network issues or an unavailable leader, which may affect cluster stability.

### Election events

- **Grafana query:** `rate(consul_raft_state_candidate[1m])`, `rate(consul_raft_state_leader[1m])`
- **Description:** Tracks Raft state transitions, indicating leadership elections. Frequent transitions might suggest cluster instability and require investigation.

### Autopilot health

- **Grafana query:** `consul_autopilot_healthy`
- **Description:** A boolean metric that shows a value of 1 when Autopilot is healthy and 0 when issues are detected. Use it to confirm that the cluster has sufficient resources and an operational leader.

### DNS queries per 5 minutes

- **Grafana query:** `rate(consul_dns_domain_query_count[5m])`
- **Description:** This metric tracks the rate of DNS queries per node, bucketed into 5-minute intervals. It helps monitor the query load on Consul’s DNS service.

### DNS domain query time

- **Grafana query:** `consul_dns_domain_query`
- **Description:** Measures the time spent handling DNS domain queries. Spikes in this metric may indicate high contention in the catalog or too many concurrent queries.

### DNS reverse query time

- **Grafana query:** `consul_dns_ptr_query`
- **Description:** Tracks the time spent processing reverse DNS queries. Spikes in query time may indicate performance bottlenecks or increased workload.

### KV applies per 5 minutes

- **Grafana query:** `rate(consul_kvs_apply_count[5m])`
- **Description:** This metric tracks the rate of Key-Value store applies over 5-minute intervals, indicating the operational load on Consul’s KV store.

### KV apply time

- **Grafana query:** `consul_kvs_apply`
- **Description:** Measures the time taken to apply updates to the Key-Value store. Spikes in this metric might suggest resource contention or client overload.

### Transaction apply time

- **Grafana query:** `consul_txn_apply`
- **Description:** Tracks the time spent applying transaction operations in Consul, providing insights into potential bottlenecks in transactional workloads.

**Member comment:**

Suggested change
- **Description:** Tracks the time spent applying transaction operations in Consul, providing insights into potential bottlenecks in transactional workloads.
- **Description:** Tracks the time spent applying transaction operations in Consul, providing insights into potential bottlenecks in transaction operations.


### ACL resolves per 5 minutes

- **Grafana query:** `rate(consul_acl_ResolveToken_count[5m])`
- **Description:** This metric tracks the rate of ACL token resolutions per 5-minute intervals. It provides insights into the activity related to ACL tokens within the cluster.

### ACL resolve token time

- **Grafana query:** `consul_acl_ResolveToken`
- **Description:** Measures the time taken to resolve ACL tokens into their associated policies. Spikes in this metric might indicate resource issues or configuration problems.

### ACL updates per 5 minutes

- **Grafana query:** `rate(consul_acl_apply_count[5m])`
- **Description:** Tracks the rate of ACL updates per 5-minute intervals. This metric helps monitor changes in ACL configurations over time.

### ACL apply time

- **Grafana query:** `consul_acl_apply`
- **Description:** Measures the time spent applying ACL changes. Spikes in apply time might suggest resource constraints or high operational load.

### Catalog operations per 5 minutes

- **Grafana query:** `rate(consul_catalog_register_count[5m])`, `rate(consul_catalog_deregister_count[5m])`
- **Description:** Tracks the rate of register and deregister operations in the Consul catalog, providing insights into the churn of services within the cluster.

### Catalog operation time

- **Grafana query:** `consul_catalog_register`, `consul_catalog_deregister`
- **Description:** Measures the time taken to complete catalog register or deregister operations. Spikes in this metric can indicate performance issues within the catalog.

**Member comment:**

Suggested change
- **Description:** Measures the time taken to complete catalog register or deregister operations. Spikes in this metric can indicate performance issues within the catalog.
- **Description:** Measures the time taken to complete catalog register or deregister operations.

Spikes in these values just mean that a large number of services were registered or deregistered. They do not necessarily mean that there is a performance issue.




@@ -0,0 +1,115 @@
---
layout: docs
page_title: Service Mesh Observability - Dashboards
description: >-
This documentation provides an overview of several dashboards designed for monitoring and managing services within a Consul-managed Envoy service mesh. Learn how to enable access logs and configure key performance and operational metrics to ensure the reliability and performance of services in the service mesh.
---

# Dashboards for service mesh observability

This topic describes the configuration and usage of dashboards for monitoring and managing services within a Consul-managed Envoy service mesh. These dashboards provide critical insights into the health, performance, and resource utilization of services. The dashboards described here are essential tools for ensuring the stability, efficiency, and reliability of your service mesh environment.

## Dashboards overview

The repository includes the following dashboards:

- **Consul service-to-service dashboard**: Provides a detailed view of service-to-service communications, monitoring key metrics like access logs, HTTP requests, error counts, response code distributions, and request success rates. The dashboard includes customizable filters for focusing on specific services and namespaces.

- **Consul service dashboard**: Tracks key metrics for Envoy proxies at the cluster and service levels, ensuring the performance and reliability of individual services within the mesh.

- **Consul dataPlane dashboard**: Offers a comprehensive overview of service health and performance, including request success rates, resource utilization (CPU and memory), active connections, and cluster health. It helps operators maintain service reliability and optimize resource usage.

**Contributor comment:**

Suggested change
- **Consul dataPlane dashboard**: Offers a comprehensive overview of service health and performance, including request success rates, resource utilization (CPU and memory), active connections, and cluster health. It helps operators maintain service reliability and optimize resource usage.
- **Consul dataplane dashboard**: Offers a comprehensive overview of service health and performance, including request success rates, resource utilization (CPU and memory), active connections, and cluster health. It helps operators maintain service reliability and optimize resource usage.


- **Consul k8s dashboard**: Focuses on monitoring the health and resource usage of the Consul control plane within a Kubernetes environment, ensuring the stability of the control plane.

- **Consul server dashboard**: Provides detailed monitoring of Consul servers, tracking key metrics like server health, CPU and memory usage, disk I/O, and network performance. This dashboard is critical for ensuring the stability and performance of Consul servers within the service mesh.

## Enabling observability

Add the following configurations to your Consul Helm chart to enable the observability tools in [the sample repo](https://github.com/YasminLorinKaygalak/GrafanaDemo/tree/main).

<CodeTabs tabs={[ "Kubernetes YAML"]}>


**Contributor comment on lines +30 to +31:**

Suggested change
<CodeTabs tabs={[ "Kubernetes YAML"]}>

Code tabs are unnecessary since there aren't other tabs. `<CodeBlockConfig>` could be used if you want to highlight specific lines in the example configuration.

For the configuration - are all of these values required?


**Contributor comment:**

One thing about these docs is that we can really only enable Prometheus in our Helm chart. So to actually see the dashboards in Grafana, the user needs to deploy their own Grafana. I feel like that may be more of a tutorial thing? But we can for sure only include the values that apply to enabling Prometheus.


**Contributor comment:**

```yaml
global:
  logLevel: trace
  name: consul
  datacenter: dc1
  tls:
    enabled: true
    enableAutoEncrypt: true
    httpsOnly: false
  acls:
    manageSystemACLs: true
  metrics:
    enabled: true
    provider: "prometheus"
    enableAgentMetrics: true
    agentMetricsRetentionTime: "10m"

prometheus:
  enabled: true

server:
  logLevel: trace
  replicas: 1
  annotations: |
    "prometheus.io/scheme": "https"
    "prometheus.io/port": "8501"

ui:
  enabled: true
  service:
    type: NodePort
  metrics:
    enabled: true
    provider: "prometheus"
    baseURL: http://prometheus-server.consul

connectInject:
  enabled: true
  metrics:
    defaultEnabled: true
  apiGateway:
    managedGatewayClass:
      serviceType: LoadBalancer
```

</CodeTabs>

## Enable access logs

Access logs configurations are defined globally in the [`proxy-defaults`](/consul/docs/connect/config-entries/proxy-defaults#accesslogs) configuration entry.

The following example is a minimal configuration for enabling access logs:

<CodeTabs tabs={[ "HCL", "Kubernetes YAML", "JSON" ]}>

```hcl
Kind = "proxy-defaults"
Name = "global"
AccessLogs {
  Enabled = true
}
```

```yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  name: global
spec:
  accessLogs:
    enabled: true
```

```json
{
  "Kind": "proxy-defaults",
  "Name": "global",
  "AccessLogs": {
    "Enabled": true
  }
}
```

</CodeTabs>