Prometheus and grafana improvements based on load testing experience #501

Merged
11 changes: 6 additions & 5 deletions build/Makefile
@@ -389,23 +389,24 @@ pprof-web:

# setup prometheus in the current cluster; by default Persistent Volume Claims are requested.
setup-prometheus: PVC ?= true
setup-prometheus: PV_SIZE ?= 64Gi
setup-prometheus: SCRAPE_INTERVAL=30s
setup-prometheus:
$(DOCKER_RUN) \
helm upgrade --install --wait prom stable/prometheus --namespace metrics \
--set alertmanager.enabled=false,pushgateway.enabled=false \
--set kubeStateMetrics.enabled=false,nodeExporter.enabled=false \
--set pushgateway.enabled=false \
--set server.global.scrape_interval=30s,server.persistentVolume.enabled=$(PVC)
--set server.global.scrape_interval=$(SCRAPE_INTERVAL),server.persistentVolume.enabled=$(PVC),server.persistentVolume.size=$(PV_SIZE) \
-f $(mount_path)/build/prometheus.yaml

# setup grafana in the current cluster with datasource and dashboards ready for use with agones
# by default Persistent Volume Claims are requested.
setup-grafana: PVC ?= true
setup-grafana: PV_SIZE ?= 64Gi
setup-grafana: PASSWORD ?= admin
setup-grafana:
$(DOCKER_RUN) kubectl apply -f $(mount_path)/build/grafana/
$(DOCKER_RUN) \
helm upgrade --install --wait grafana stable/grafana --namespace metrics \
--set persistence.enabled=$(PVC) \
--set persistence.enabled=$(PVC),persistence.size=$(PV_SIZE) \
--set adminPassword=$(PASSWORD) -f $(mount_path)/build/grafana.yaml

# generate a changelog using github-changelog-generator
13 changes: 13 additions & 0 deletions build/gke-test-cluster/cluster.yml.jinja
@@ -47,6 +47,19 @@ resources:
stable.agones.dev/agones-system: "true"
taints:
- key: stable.agones.dev/agones-system
- name: "agones-metrics"
initialNodeCount: 1
config:
machineType: n1-standard-4
oauthScopes:
- https://www.googleapis.com/auth/compute
- https://www.googleapis.com/auth/devstorage.read_only
- https://www.googleapis.com/auth/logging.write
- https://www.googleapis.com/auth/monitoring
labels:
stable.agones.dev/agones-metrics: "true"
taints:
- key: stable.agones.dev/agones-metrics
value: "true"
effect: "NO_EXECUTE"
masterAuth:
15 changes: 14 additions & 1 deletion build/grafana.yaml
@@ -1,5 +1,18 @@
service:
port: 3000
port: 3000
tolerations:
- key: "stable.agones.dev/agones-metrics"
operator: "Equal"
value: "true"
effect: "NoExecute"
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: stable.agones.dev/agones-metrics
operator: Exists
sidecar:
dashboards:
enabled: true
120 changes: 120 additions & 0 deletions build/prometheus.yaml
@@ -0,0 +1,120 @@
alertmanager:
enabled: false
nodeExporter:
enabled: false
kubeStateMetrics:
enabled: false
pushgateway:
enabled: false
server:
resources:
requests:
memory: 4Gi
cpu: 2
tolerations:
- key: "stable.agones.dev/agones-metrics"
operator: "Equal"
value: "true"
effect: "NoExecute"
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: stable.agones.dev/agones-metrics
operator: Exists
serverFiles:
prometheus.yml:
rule_files:
- /etc/config/rules
- /etc/config/alerts

scrape_configs:
- job_name: prometheus
static_configs:
- targets:
- localhost:9090

# A scrape configuration for running Prometheus on a Kubernetes cluster.
# This uses separate scrape configs for cluster components (i.e. API server, node)
# and services to allow each to use different authentication configs.
#
# Kubernetes labels will be added as Prometheus labels on metrics via the
# `labelmap` relabeling action.

# Scrape config for API servers.
#
# Kubernetes exposes API servers as endpoints to the default/kubernetes
# service so this uses `endpoints` role and uses relabelling to only keep
# the endpoints associated with the default/kubernetes service using the
# default named port `https`. This works for single API server deployments as
# well as HA API server deployments.
- job_name: 'kubernetes-apiservers'

kubernetes_sd_configs:
- role: endpoints

# Default to scraping over https. If required, just disable this or change to
# `http`.
scheme: https

# This TLS & bearer token file config is used to connect to the actual scrape
# endpoints for cluster components. This is separate to discovery auth
# configuration because discovery & scraping are two separate concerns in
# Prometheus. The discovery auth config is automatic if Prometheus runs inside
# the cluster. Otherwise, more config options have to be provided within the
# <kubernetes_sd_config>.
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
# If your node certificates are self-signed or use a different CA to the
# master CA, then disable certificate verification below. Note that
# certificate verification is an integral part of a secure infrastructure
# so this should only be disabled in a controlled environment. You can
# disable certificate verification by uncommenting the line below.
#
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

# Keep only the default/kubernetes service endpoints for the https port. This
# will add targets for each API server which Kubernetes adds an endpoint to
# the default/kubernetes service.
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https

# Example scrape config for pods
#
# The relabeling allows the actual pod scrape endpoint to be configured via the
# following annotations:
#
# * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: Scrape the pod on the indicated port instead of the default of `9102`.
- job_name: 'kubernetes-pods'

kubernetes_sd_configs:
- role: pod

relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
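
To see what the `__address__` rewrite in the `kubernetes-pods` job does, the sketch below reproduces it with `sed`. It relies on two properties of Prometheus relabeling: the `source_labels` values are joined with `;` by default, and the regex is anchored at both ends. The pattern is translated from RE2 to POSIX ERE (`\d` becomes `[0-9]`, and the non-capturing group becomes a capturing one, so `$2` becomes `\3`). The helper name `rewrite_address` and the sample addresses are assumptions made for illustration and are not part of this PR.

```shell
#!/bin/sh
# Illustration only: mimics the `replace` relabel rule for __address__.
# rewrite_address is a hypothetical helper, not part of Prometheus or this PR.
rewrite_address() {
  addr="$1"   # value of __address__, e.g. 10.0.0.5:8080
  port="$2"   # value of the prometheus.io/port annotation (may be empty)
  # Prometheus joins source_labels with ';' before matching.
  joined="${addr};${port}"
  # ERE translation of ([^:]+)(?::\d+)?;(\d+) with replacement $1:$2.
  result=$(printf '%s\n' "$joined" | sed -n -E 's/^([^:]+)(:[0-9]+)?;([0-9]+)$/\1:\3/p')
  # When the regex does not match, the replace action leaves the label unchanged.
  if [ -n "$result" ]; then
    printf '%s\n' "$result"
  else
    printf '%s\n' "$addr"
  fi
}

rewrite_address "10.0.0.5:8080" "9102"   # annotation port replaces the discovered port
rewrite_address "10.0.0.5" "9102"        # no discovered port, annotation port is appended
rewrite_address "10.0.0.5:8080" ""       # no annotation, address is kept as-is
```

The third call shows why pods without a `prometheus.io/port` annotation are still scraped on their discovered port: the joined value ends in `;`, the anchored regex fails to match, and the address is left alone.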
19 changes: 16 additions & 3 deletions site/content/en/docs/Guides/metrics.md
@@ -118,15 +118,28 @@ Prometheus is an open source monitoring solution, we will use it to store Agones
Let's install Prometheus using the [helm stable](https://github.com/helm/charts/tree/master/stable/prometheus) repository.

```bash
helm install --wait --name prom stable/prometheus --namespace metrics \
--set pushgateway.enabled=false \
--set kubeStateMetrics.enabled=false,nodeExporter.enabled=false
helm upgrade --install --wait prom stable/prometheus --namespace metrics \
--set server.global.scrape_interval=30s \
--set server.persistentVolume.enabled=true \
--set server.persistentVolume.size=64Gi \
-f ./build/prometheus.yaml
```

> You can also run our {{< ghlink href="/build/Makefile" branch="master" >}}Makefile{{< /ghlink >}} target `make setup-prometheus`
or `make kind-setup-prometheus` and `make minikube-setup-prometheus` for {{< ghlink href="/build/README.md#running-a-test-kind-cluster" branch="master" >}}Kind{{< /ghlink >}}
and {{< ghlink href="/build/README.md#running-a-test-minikube-cluster" branch="master" >}}Minikube{{< /ghlink >}}.

For resiliency it is recommended to run Prometheus on a dedicated node which is separate from nodes where Game Servers are scheduled. If you use `make setup-prometheus` to set up Prometheus, it will schedule Prometheus pods on nodes tainted with `stable.agones.dev/agones-metrics=true:NoExecute` and labeled with `stable.agones.dev/agones-metrics=true` if available.
Member

@kuqd remind me what's our standard here, for stating how we want to do this? Point to the yaml in GitHub?

Not sure if this should be a feature shortcode here to hide this until 0.8.0?

Collaborator @cyriltovena, Jan 25, 2019

can you update line 113 with the new command using the yaml and here just explain the same way with a link to your yaml config.

Collaborator

new command being

helm upgrade --install --wait prom stable/prometheus --namespace metrics \
    --set alertmanager.enabled=false,pushgateway.enabled=false \
    --set kubeStateMetrics.enabled=false,nodeExporter.enabled=false \
    --set server.global.scrape_interval=30s,server.persistentVolume.enabled=true,server.persistentVolume.size=64Gi \
    -f ./build/prometheus.yaml

Member

Awesome. I think once this is there, this is good for approval 👍

Contributor Author

done


As an example, to set up a dedicated node pool for Prometheus on GKE, run the following command before installing Prometheus. Alternatively, you can taint and label nodes manually.

```bash
gcloud container node-pools create agones-metrics --cluster=... --zone=... \
--node-taints stable.agones.dev/agones-metrics=true:NoExecute \
--node-labels stable.agones.dev/agones-metrics=true \
--num-nodes=1
```
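
For the manual alternative, a minimal sketch using `kubectl taint` and `kubectl label` could look like the following. The helper name `metrics_node_commands` and the node name `my-metrics-node` are placeholders introduced for illustration. The function only prints the commands, so they can be reviewed before being run against a cluster.

```shell
#!/bin/sh
# Taint and label key used by build/prometheus.yaml and build/grafana.yaml.
METRICS_KEY="stable.agones.dev/agones-metrics"

# metrics_node_commands is a hypothetical helper: it prints (does not run)
# the kubectl commands that dedicate a node to metrics workloads.
metrics_node_commands() {
  node="$1"
  printf 'kubectl taint nodes %s %s=true:NoExecute\n' "$node" "$METRICS_KEY"
  printf 'kubectl label nodes %s %s=true\n' "$node" "$METRICS_KEY"
}

# Placeholder node name; list real ones with `kubectl get nodes`.
metrics_node_commands "my-metrics-node"
```

Piping the output to `sh` would apply the taint and label, assuming `kubectl` is pointed at the intended cluster.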

By default we will disable the push gateway (we don't need it for Agones) and other exporters.

The helm [chart](https://github.com/helm/charts/tree/master/stable/prometheus) supports [nodeSelector](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector), [affinity](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity) and [tolerations](https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/); you can use them to schedule Prometheus deployments on isolated nodes and keep the game server workload homogeneous.