Added the docs for all the grafana dashboards. #21795

Open. Wants to merge 33 commits into base: main

Conversation

Contributor

@YasminLorinKaygalak YasminLorinKaygalak commented Oct 1, 2024

Description

NET-11158
Added the docs for the Grafana dashboards.

PR Checklist

  • external facing docs updated
  • appropriate backport labels added
  • not a security concern

hashicorp-cla-app bot commented Oct 1, 2024

CLA assistant check

Thank you for your submission! We require that all contributors sign our Contributor License Agreement ("CLA") before we can accept the contribution. Read and sign the agreement

Learn more about why HashiCorp requires a CLA and what the CLA includes


1 out of 2 committers have signed the CLA.

  • YasminLorinKaygalak
  • Lorin Lorin Kaygalak

Lorin Lorin Kaygalak seems not to be a GitHub user.
You need a GitHub account to be able to sign the CLA.
If you have already a GitHub account, please add the email address used for this commit to your account.

Have you signed the CLA already but the status is still pending? Recheck it.

@github-actions github-actions bot added the type/docs Documentation needs to be created/updated/clarified label Oct 1, 2024
@missylbytes missylbytes requested review from missylbytes, a team and wangxinyi7 and removed request for a team October 3, 2024 14:11
Contributor

@boruszak boruszak left a comment

@YasminLorinKaygalak Here is a preliminary review that outlines the repeated problems to correct.

For each of the reference pages, please implement the following three changes to each of the metrics and their descriptions:

  1. Sentence case in headings
  2. Line break between heading and list
  3. Grafana query instead of Metric

Then, remove the colons from the headings and ensure that there are sentences between each heading. Feel free to use the suggestions in this review as templates for each of the pages.

Don't worry about rewriting all of the descriptions at this time. Let's get these repeated formatting issues fixed first!
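
The repeated fixes requested above (sentence case, no trailing colons, a blank line between a heading and its list) are mechanical enough to check with a script. The following is a minimal sketch and not part of this PR. It assumes the pages use `#`-style Markdown headings; the `lint_headings` function and the `PROPER_NOUNS` allowlist are hypothetical names, and the allowlist would need to be extended for real pages. It does not check that a sentence appears between headings.

```python
import re

# Hypothetical allowlist: words that keep their capital letter in sentence case.
PROPER_NOUNS = {"Consul", "Grafana", "Envoy", "Kubernetes", "Prometheus", "CPU", "API"}

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def lint_headings(text):
    """Return (line_number, message) pairs for the repeated formatting problems."""
    problems = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        match = HEADING_RE.match(line)
        if not match:
            continue
        title = match.group(2)
        if title.endswith(":"):
            problems.append((i + 1, "heading ends with a colon"))
        # Sentence case: after the first word, only allowlisted words are capitalized.
        for word in title.rstrip(":").split()[1:]:
            bare = word.strip("()")
            if bare[:1].isupper() and bare not in PROPER_NOUNS:
                problems.append((i + 1, f"'{bare}' should be lowercase"))
        # A list item directly under a heading needs a blank line in between.
        if i + 1 < len(lines) and lines[i + 1].lstrip().startswith(("- ", "* ")):
            problems.append((i + 1, "no blank line between heading and list"))
    return problems
```

Running `lint_headings` on the text of each reference page would give a list of line numbers to fix by hand before re-requesting review.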


# Consul DataPlane Dashboard

The **Consul DataPlane Dashboard** provides a comprehensive view of the service health, performance, and resource utilization within the Consul service mesh. It enables operators to monitor key metrics at both the cluster and service levels, helping ensure service reliability and performance.
Contributor

Suggested change
The **Consul DataPlane Dashboard** provides a comprehensive view of the service health, performance, and resource utilization within the Consul service mesh. It enables operators to monitor key metrics at both the cluster and service levels, helping ensure service reliability and performance.
This page provides reference information about the Grafana dashboard configuration included in [this GitHub repository](https://github.com/YasminLorinKaygalak/GrafanaDemo/tree/main). The Consul dataplane dashboard provides a comprehensive view of the service health, performance, and resource utilization within the Consul service mesh.
You can monitor key metrics at both the cluster and service levels with this dashboard. It can help you ensure service reliability and performance.

Besides including a link to what I think is the repo this doc references, these suggestions:

  1. Make style guide edits for capitalization and formatting
  2. Follow our desired formatting for the beginning of reference pages
  3. Speak directly to the reader ("you") instead of referring to them as an "operator" or "user"

Contributor Author

@YasminLorinKaygalak YasminLorinKaygalak Oct 9, 2024

Need to fix this link once the PR is complete for the dashboards.


## Enabling Observability

The following script is the configuration needed to enable the observability tools.
Contributor

Suggested change
The following script is the configuration needed to enable the observability tools.
Add the following configurations to your Consul Helm chart to enable the observability tools in [the sample repo](https://github.com/YasminLorinKaygalak/GrafanaDemo/tree/main).

Comment on lines +30 to +31
<CodeTabs tabs={[ "Kubernetes YAML"]}>

Contributor

Suggested change
<CodeTabs tabs={[ "Kubernetes YAML"]}>

Code tabs are unnecessary since there aren't other tabs. `<CodeBlockConfig>` could be used if you want to highlight specific lines in the example configuration.

For the configuration - are all of these values required?

Contributor

One thing about these docs is that we can really only enable Prometheus in our Helm chart. To actually see the dashboards in Grafana, the user needs to deploy their own Grafana. I feel like that may be more of a tutorial thing? But we can for sure include only the values that apply to enabling Prometheus.

@missylbytes missylbytes added the backport/1.20 Changes are backported to 1.20 label Oct 4, 2024
YasminLorinKaygalak and others added 6 commits October 4, 2024 14:41
…onsuldataplanedashboard.mdx

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
…ndex.mdx

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
…onsuldataplanedashboard.mdx

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
…onsuldataplanedashboard.mdx

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
…onsuldataplanedashboard.mdx

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

- **Consul Server Dashboard**: Provides detailed monitoring of Consul servers, tracking key metrics like server health, CPU and memory usage, disk I/O, and network performance. This dashboard is critical for ensuring the stability and performance of Consul servers within the service mesh.

## Enabling Observability
Contributor

@missylbytes missylbytes Oct 8, 2024

Suggested change
## Enabling Observability
## Enabling Prometheus
The Helm chart provides configuration to enable a demo Prometheus server. https://developer.hashicorp.com/consul/docs/k8s/helm#prometheus

Contributor

@boruszak I think the above may be all that we can say here? Basically we can install a demo Prometheus server, but it is really on the user to deploy Prometheus/Loki/Grafana and upload our dashboards into Grafana.

@boruszak boruszak added the pr/no-changelog PR does not need a corresponding .changelog entry label Oct 8, 2024
…onsuldataplanedashboard.mdx

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
Contributor

missylbytes commented Oct 9, 2024

Do we care about ordering these alphabetically in the sidebar? @boruszak
image


- **Consul service dashboard**: Tracks key metrics for Envoy proxies at the cluster and service levels, ensuring the performance and reliability of individual services within the mesh.

- **Consul dataPlane dashboard**: Offers a comprehensive overview of service health and performance, including request success rates, resource utilization (CPU and memory), active connections, and cluster health. It helps operators maintain service reliability and optimize resource usage.
Contributor

Suggested change
- **Consul dataPlane dashboard**: Offers a comprehensive overview of service health and performance, including request success rates, resource utilization (CPU and memory), active connections, and cluster health. It helps operators maintain service reliability and optimize resource usage.
- **Consul dataplane dashboard**: Offers a comprehensive overview of service health and performance, including request success rates, resource utilization (CPU and memory), active connections, and cluster health. It helps operators maintain service reliability and optimize resource usage.

Contributor

missylbytes commented Oct 9, 2024

Also I don't know what the standard is for this, but is there a way to make these look a bit more readable?
image
i.e. are we allowed to put this as the ``` multiline-type formatting like in the following? This is probably a @boruszak question.
image

Comment on lines +22 to +23
- **Grafana query:** `sum(envoy_server_live{app=~"$service"})`
- **Description:** Displays the total number of live Envoy proxies currently running in the service mesh. It helps track the overall availability of services and identify any outages or other widespread issues in the service mesh.
Contributor

Suggested change
- **Grafana query:** `sum(envoy_server_live{app=~"$service"})`
- **Description:** Displays the total number of live Envoy proxies currently running in the service mesh. It helps track the overall availability of services and identify any outages or other widespread issues in the service mesh.
**Description:** Displays the total number of live Envoy proxies currently running in the service mesh. It helps track the overall availability of services and identify any outages or other widespread issues in the service mesh.
<CodeBlockConfig heading="Grafana query">
```
sum(envoy_server_live{app=~"$service"})
```
</CodeBlockConfig>

To meet the request from @missylbytes to make the Grafana query easy to copy-and-paste, I'd suggest making these formatting changes to each of the sections:

  1. Remove unordered list
  2. Move Description above the Grafana query
  3. Use the `<CodeBlockConfig>` component with the heading set to "Grafana query" to render the code block

Member

This code block should also specify the language to enable syntax highlighting.

<CodeBlockConfig heading="Grafana query" language="promql">

Alternatively, you can place `promql` directly after the ``` that signifies the start of the code block.
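
Because the same change applies to every metric on every page, the conversion from the two-bullet layout to the suggested layout could be scripted. The following is a minimal sketch under stated assumptions, not something from this PR: `convert` and `to_code_block` are hypothetical names, the blank lines inside the component are an assumption about what MDX needs, and the sketch uses the alternative mentioned above of placing `promql` after the opening fence. Entries that list more than one query on the same line do not match the pattern and are left unchanged.

```python
import re

FENCE = "`" * 3  # built at runtime so this sketch can itself sit inside a fenced block

# Matches the two-bullet layout used in the reviewed pages.
PAIR_RE = re.compile(
    r"^- \*\*Grafana query:\*\* `(?P<query>[^`]+)`\n"
    r"- \*\*Description:\*\* (?P<desc>.+)$",
    re.MULTILINE,
)


def to_code_block(match):
    """Render one query/description pair in the layout suggested in the review."""
    return "\n".join([
        f"**Description:** {match.group('desc')}",
        "",
        '<CodeBlockConfig heading="Grafana query">',
        "",
        FENCE + "promql",
        match.group("query"),
        FENCE,
        "",
        "</CodeBlockConfig>",
    ])


def convert(text):
    """Rewrite every matching query/description pair in a page's text."""
    return PAIR_RE.sub(to_code_block, text)
```

The output should still be reviewed by hand, since descriptions may need the wording changes requested elsewhere in this review.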


You can monitor key metrics at both the cluster and service levels with this dashboard. It can help you ensure service reliability and performance.

![Preview of the Consul dataplane dashboard](../../../../public/img/grafana/consul-dataplane-dashboard.png)
Member

Suggested change
![Preview of the Consul dataplane dashboard](../../../../public/img/grafana/consul-dataplane-dashboard.png)
![Preview of the Consul dataplane dashboard](/public/img/grafana/consul-dataplane-dashboard.png)

This should be an absolute path.


# Consul dataplane monitoring dashboard

This page provides reference information about the Grafana dashboard configuration included in [this GitHub repository](https://github.com/YasminLorinKaygalak/GrafanaDemo/tree/main). The Consul dataplane dashboard provides a comprehensive view of the service health, performance, and resource utilization within the Consul service mesh.
Member

Can this be modified to reference code that exists in the Consul repo?

Contributor

Yes, the PR just merged; we will update it.

layout: docs
page_title: Dashboard for Consul k8s control plane metrics
description: >-
This documentation provides an overview of the Consul K8s Dashboard
Member

Suggested change
This documentation provides an overview of the Consul K8s Dashboard
This documentation provides an overview of the Consul Kubernetes Dashboard

This documentation provides an overview of the Consul K8s Dashboard
---

# Consul k8s monitoring (Control Plane) dashboard
Member

Suggested change
# Consul k8s monitoring (Control Plane) dashboard
# Consul Kubernetes monitoring (Control Plane) dashboard


- **Grafana query:** `rate(container_cpu_usage_seconds_total{pod=~".*-connect-injector-.*",
container="sidecar-injector"}[5m])`
- **Description:** Tracks the CPU usage of the Connect Injector, which is responsible for injecting Envoy sidecars. Monitoring this helps ensure that Connect Injector has adequate CPU resources.
Member

Suggested change
- **Description:** Tracks the CPU usage of the Connect Injector, which is responsible for injecting Envoy sidecars. Monitoring this helps ensure that Connect Injector has adequate CPU resources.
- **Description:** Tracks the CPU usage of the Connect Injector, which is responsible for injecting Envoy sidecars and other operations within the mesh. Monitoring this helps ensure that Connect Injector has adequate CPU resources.

The connect-injector process also acts as the controller for API Gateway.

### Transaction apply time

- **Grafana query:** `consul_txn_apply`
- **Description:** Tracks the time spent applying transaction operations in Consul, providing insights into potential bottlenecks in transactional workloads.
Member

Suggested change
- **Description:** Tracks the time spent applying transaction operations in Consul, providing insights into potential bottlenecks in transactional workloads.
- **Description:** Tracks the time spent applying transaction operations in Consul, providing insights into potential bottlenecks in transaction operations.

### Catalog operation time

- **Grafana query:** `consul_catalog_register`, `consul_catalog_deregister`
- **Description:** Measures the time taken to complete catalog register or deregister operations. Spikes in this metric can indicate performance issues within the catalog.
Member

Suggested change
- **Description:** Measures the time taken to complete catalog register or deregister operations. Spikes in this metric can indicate performance issues within the catalog.
- **Description:** Measures the time taken to complete catalog register or deregister operations.

Spikes in these values just mean that a large number of services were registered, or deregistered. It does not necessarily mean that there is a performance issue.

### Total logs

- **Grafana query:** `sum(count_over_time(({container="consul-dataplane",namespace=~"$namespace"})[$__interval]))`
- **Description:** This metric counts the total number of log lines produced by Consul DataPlane containers. It provides an overview of the volume of logs being generated for a specific namespace.
Member

Suggested change
- **Description:** This metric counts the total number of log lines produced by Consul DataPlane containers. It provides an overview of the volume of logs being generated for a specific namespace.
- **Description:** This metric counts the total number of log lines produced by Consul dataplane containers. It provides an overview of the volume of logs being generated for a specific namespace.

Comment on lines +42 to +45
- `p50`: `histogram_quantile(0.50, sum by(le) (rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace=~"$namespace", local_cluster=~"$service"}[5m])))`
- `p75`: `histogram_quantile(0.75, sum by(le) (rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace=~"$namespace", local_cluster=~"$service"}[5m])))`
- `p90`: `histogram_quantile(0.90, sum by(le) (rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace=~"$namespace", local_cluster=~"$service"}[5m])))`
- `p99.9`: `histogram_quantile(0.999, sum by(le) (rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace=~"$namespace", local_cluster=~"$service"}[5m])))`
Member

Try rendering this in a CodeTabs block. It might display a little better than multiple code stanzas in an unordered list.
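
The four queries above differ only in the quantile argument, which is what makes them a good fit for tabs. As a rough illustration (a sketch, not part of this PR), the queries can be generated from one template and then placed in whichever layout is chosen. `TEMPLATE`, `QUANTILES`, and `percentile_queries` are hypothetical names; the query text is copied from the lines under review.

```python
# Template copied from the queries under review; only the quantile changes.
TEMPLATE = (
    'histogram_quantile({q}, sum by(le) (rate('
    'envoy_cluster_upstream_rq_time_bucket{{kubernetes_namespace=~"$namespace", '
    'local_cluster=~"$service"}}[5m])))'
)

# Tab label -> quantile literal, in the order the page lists them.
QUANTILES = {"p50": "0.50", "p75": "0.75", "p90": "0.90", "p99.9": "0.999"}


def percentile_queries():
    """Return a label -> PromQL mapping for the four latency percentiles."""
    return {label: TEMPLATE.format(q=q) for label, q in QUANTILES.items()}
```

Each key in the returned mapping could serve as a tab label, with its value as the body of that tab's code block.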

Labels
  • backport/1.20: Changes are backported to 1.20
  • pr/do-not-merge: PR cannot be merged in its current form.
  • pr/no-changelog: PR does not need a corresponding .changelog entry
  • type/docs: Documentation needs to be created/updated/clarified
4 participants