Prometheus metrics: Use ServiceMonitor instead of deprecated annotation mechanism #2290

zifter · 2021-10-04T22:43:11Z

What type of PR is this?

/kind feature

What this PR does / Why we need it:
It allows Prometheus metrics to be collected with prometheus-operator out of the box.

Which issue(s) this PR fixes:

Closes #2262

Special notes for your reviewer:

I created the Service agones-allocator-service, which partly duplicates agones-allocator. However, agones-allocator can be exposed via a load balancer, and I want to avoid exposing the metrics port, so I added the agones.allocator.http2.port variable;
I added an installation of the whole prometheus-stack without refactoring the existing Prometheus and Grafana installation. I think it's better to have two Prometheus environments with different scraping mechanisms: annotations and ServiceMonitor;
The agones-ping service does not expose metrics.
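Below is a minimal sketch of the separate in-cluster Service described in the first note. It assumes the allocator serves its health handlers and `/metrics` on port 8080, as noted later in the review. The Service name and all labels are hypothetical, not the chart's actual values. Because the Service is `ClusterIP`, it is reachable only inside the cluster. Exposing the public `agones-allocator` Service through a load balancer therefore never exposes the metrics port.

```yaml
# Hypothetical in-cluster Service for allocator metrics (sketch only).
# Name and labels are assumptions, not the Agones chart's real values.
apiVersion: v1
kind: Service
metadata:
  name: agones-allocator-metrics-service   # hypothetical name
  namespace: agones-system
  labels:
    app: agones-allocator-metrics          # hypothetical label a ServiceMonitor can select on
spec:
  type: ClusterIP                          # internal only; no cloud load balancer is created
  selector:
    app: agones-allocator                  # assumed label on the allocator pods
  ports:
    - name: http                           # named port, so scrape configs can refer to it by name
      port: 8080                           # health handlers and /metrics
      targetPort: 8080
      protocol: TCP
```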

google-cla · 2021-10-04T22:43:15Z

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.

What to do if you already signed the CLA

Individual signers

It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.

Corporate signers

Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
The email used to register you as an authorized contributor must also be attached to your GitHub account.

ℹ️ Googlers: Go here for more info.

agones-bot · 2021-10-04T23:01:31Z

Build Failed 😱

Build Id: 16edd22f-321f-4009-81f1-54bd46fae67b

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot · 2021-10-04T23:16:25Z

Build Failed 😱

Build Id: 83ec6eb0-5533-494a-a4f7-cf126be1fb34

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

markmandel · 2021-10-04T23:29:14Z

To get past testing, you will need to run make gen-install to regenerate the index.yaml 👍🏻

google-cla · 2021-10-04T23:39:58Z

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

agones-bot · 2021-10-04T23:53:23Z

Build Failed 😱

Build Id: cbfc03c4-d1a3-45a9-b18a-4b58534c960d

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot · 2021-10-04T23:57:53Z

Build Failed 😱

Build Id: fde7d232-06ba-4cff-860f-524fbac1d56c

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot · 2021-10-05T00:01:10Z

Build Failed 😱

Build Id: 74a24e33-269a-4bd5-8363-55bc861d7213

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot · 2021-10-05T00:39:10Z

Build Failed 😱

Build Id: c3e79aa5-308b-4c25-99cb-7b5a60e9ba31

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot · 2021-10-05T08:44:17Z

Build Failed 😱

Build Id: 0e9ec9c2-b528-46bb-9180-df9b0cbfa27c

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot · 2021-10-05T10:17:56Z

Build Failed 😱

Build Id: 373ad062-67ad-400d-ad61-f59d0675a2c9

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

zifter · 2021-10-05T10:20:38Z

@markmandel, can you help me?
I have no idea why the tests failed. Maybe something went wrong with a previous release?

Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress

I'm not sure whether it's related to my changes, because I see the same failure in PR #2288.

roberthbailey · 2021-10-05T15:57:40Z

I've rolled back the helm install. It should be ready for e2e testing again.

agones-bot · 2021-10-05T20:26:52Z

Build Failed 😱

Build Id: 3e7d5be4-f175-4e35-adc9-c6aa8a5aa6a6

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot · 2021-10-05T22:42:11Z

Build Succeeded 👏

Build Id: 39ddf0b1-a491-4189-aaf8-86598600d2dc

The following development artifacts have been built, and will exist for the next 30 days:

image: gcr.io/agones-images/agones-controller:1.18.0-5282967
image: gcr.io/agones-images/agones-ping:1.18.0-5282967
Linux C++ SDK (build): agonessdk-1.18.0-5282967-linux-arch_64.tar.gz
SDK Server: agonessdk-server-1.18.0-5282967.zip

A preview of the website (the last 30 builds are retained):

https://5282967-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/2290/head:pr_2290 && git checkout pr_2290
helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.18.0-5282967

zifter · 2021-10-06T08:41:27Z

Oh, the tests are definitely unstable :(

zifter · 2021-10-12T09:58:59Z

Hi @aLekSer @roberthbailey!
Could you review these changes, please?

roberthbailey · 2021-10-12T14:49:39Z

Hi @zifter, I (and maybe others) have been waiting until after the feature freeze (which ends with the release cut scheduled for today) to review new PRs, so you should get some feedback in the next day or two.

zifter · 2021-10-14T21:26:18Z

There are some problems in the CI build that are not related to my changes.

agones-bot · 2021-10-15T02:52:33Z

Build Succeeded 👏

Build Id: 6d5cb97d-5cf2-4ed8-a942-e4e5ff5615ca

The following development artifacts have been built, and will exist for the next 30 days:

image: gcr.io/agones-images/agones-controller:1.19.0-fb02d60
image: gcr.io/agones-images/agones-ping:1.19.0-fb02d60
Linux C++ SDK (build): agonessdk-1.19.0-fb02d60-linux-arch_64.tar.gz
SDK Server: agonessdk-server-1.19.0-fb02d60.zip

A preview of the website (the last 30 builds are retained):

https://fb02d60-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/2290/head:pr_2290 && git checkout pr_2290
helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.19.0-fb02d60

roberthbailey · 2021-10-18T21:31:30Z

install/helm/agones/values.yaml

@@ -163,6 +166,13 @@ agones:
        port: 443
        portName: grpc
        targetPort: 8443
+    serviceInternal:


nit: I wonder whether this name should convey that the service is used for scraping metrics. "Internal service" is a bit vague and could be mistaken for a way to allocate game servers from within the cluster.

zifter:

I don't want to tie the name of this service to metrics. Yes, at the moment it's used only for metrics scraping, but I expect it to be used for other features in the future. In that case the name would be confusing, and changing it later would create a backward-compatibility problem.
I'd like to name it as the counterpart of the existing service section, which defines the service available outside the cluster.
But, of course, I don't insist.
How would you prefer to do it?

roberthbailey · 2021-10-19T17:24:01Z:

From what I can tell, the allocator service doesn't expose much on port 8080: just the health handlers (for liveness/readiness probes) and the metrics endpoint. So you can't actually use this service for anything other than scraping metrics; you can't, for instance, call it to allocate a game server. So I don't see a problem with making it obvious that this is strictly an internal metrics-gathering service.

One other question occurred to me as I was thinking about this: should Prometheus be scraping all allocator pods instead of using a service to pull metrics from one pod at a time (and likely a different pod each time a new request is made)? If there were only one pod behind the service, a service would give a stable name for finding it. But when there are multiple pods and each has different stats, it seems we should pull from all of them to get things like total aggregated allocations (the sum of allocations from all pods in the deployment).

zifter · 2021-10-20T21:16:08Z:

Agree!
I will rename it to serviceMetrics, OK?

The ServiceMonitor selects the Service by its labels, and Prometheus then scrapes every pod endpoint behind that Service.
So don't worry: all replicas of the allocator/controller will be scraped :)
For more information on how it works, refer to this doc:
https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/design.md#servicemonitor
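The discovery flow described above can be sketched as a prometheus-operator `ServiceMonitor`. This sketch rests on several assumptions:

- The metadata names and labels are hypothetical.
- The `release` label follows the common kube-prometheus-stack default, where Prometheus only selects ServiceMonitors carrying its Helm release name.

The operator turns the selected Service's endpoints into one scrape target per pod endpoint. Each allocator replica is therefore scraped separately, and totals can be computed in PromQL with `sum(...)`.

```yaml
# Hypothetical ServiceMonitor for the allocator metrics Service (sketch only).
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: agones-allocator-monitor        # hypothetical name
  namespace: agones-system
  labels:
    release: prom-stack                 # assumption: must match the Prometheus serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - agones-system                   # look for Services only in this namespace
  selector:
    matchLabels:
      app: agones-allocator-metrics     # hypothetical label on the metrics Service
  endpoints:
    - port: http                        # the Service's named port, not a number
      path: /metrics
      interval: 30s
```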

roberthbailey · 2021-10-19T04:52:56Z

I was going to update your branch but it looks like you need to resolve a conflict.

agones-bot · 2021-10-19T08:20:20Z

Build Succeeded 👏

Build Id: 2d2fb5be-0829-4076-9972-8f50674de3c3

The following development artifacts have been built, and will exist for the next 30 days:

image: gcr.io/agones-images/agones-controller:1.19.0-d8b3139
image: gcr.io/agones-images/agones-ping:1.19.0-d8b3139
Linux C++ SDK (build): agonessdk-1.19.0-d8b3139-linux-arch_64.tar.gz
SDK Server: agonessdk-server-1.19.0-d8b3139.zip

A preview of the website (the last 30 builds are retained):

https://d8b3139-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/2290/head:pr_2290 && git checkout pr_2290
helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.19.0-d8b3139

zifter

fix some review notes

agones-bot · 2021-10-19T09:13:52Z

Build Succeeded 👏

Build Id: 5089ea86-61ae-4224-94c2-9d45cb43459d

The following development artifacts have been built, and will exist for the next 30 days:

image: gcr.io/agones-images/agones-controller:1.19.0-f7e9dfb
image: gcr.io/agones-images/agones-ping:1.19.0-f7e9dfb
Linux C++ SDK (build): agonessdk-1.19.0-f7e9dfb-linux-arch_64.tar.gz
SDK Server: agonessdk-server-1.19.0-f7e9dfb.zip

A preview of the website (the last 30 builds are retained):

https://f7e9dfb-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/2290/head:pr_2290 && git checkout pr_2290
helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.19.0-f7e9dfb

agones-bot · 2021-10-19T09:21:43Z

Build Succeeded 👏

Build Id: 57cc7617-30a8-4652-a135-7b36f43f547b

The following development artifacts have been built, and will exist for the next 30 days:

image: gcr.io/agones-images/agones-controller:1.19.0-a11b710
image: gcr.io/agones-images/agones-ping:1.19.0-a11b710
Linux C++ SDK (build): agonessdk-1.19.0-a11b710-linux-arch_64.tar.gz
SDK Server: agonessdk-server-1.19.0-a11b710.zip

A preview of the website (the last 30 builds are retained):

https://a11b710-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/2290/head:pr_2290 && git checkout pr_2290
helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.19.0-a11b710

roberthbailey · 2021-10-19T17:17:39Z

install/helm/agones/templates/service/allocation.yaml

@@ -63,7 +63,54 @@ spec:
 {{ toYaml .Values.agones.allocator.service.loadBalancerSourceRanges | indent 4 }}
  {{- end }}
 {{- end }}
-
+{{- if .Values.agones.allocator.serviceInternal.enabled }}


I don't see .Values.agones.allocator.serviceInternal.name in the values file (there is an http.enabled, though). I'm also wondering whether this should be on or off by default. There isn't much overhead in having an unused internal service in k8s (no new cloud resources need to be created), so I'm OK with leaving it on by default.

zifter:

Yes, that was an untested change made after the previous review notes, sorry.
I agree with your point, and I think it's better to remove the enabled flag for that service altogether.
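Here is a hypothetical `values.yaml` fragment showing the outcome agreed on here. The metrics Service has no `enabled` flag, so the chart always renders it. `serviceMetrics` is the name settled on earlier in the review. The keys beneath it are assumptions that mirror the `port`/`portName`/`targetPort` keys visible in the `service` section of the diff; they are not the chart's confirmed schema.

```yaml
# Hypothetical values.yaml fragment (sketch only; not the chart's confirmed schema).
agones:
  allocator:
    serviceMetrics:                          # no `enabled` key: the Service is always created
      name: agones-allocator-metrics-service # hypothetical default name
      http:
        port: 8080
        portName: http
        targetPort: 8080
```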


roberthbailey · 2021-10-19T17:28:35Z

Just a few more comments - this is getting close!

agones-bot · 2021-10-20T21:01:52Z

Build Failed 😱

Build Id: f06e5e98-ec4f-4dc4-a124-a2136dc369cc

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

zifter

Review notes fixed


agones-bot · 2021-10-20T21:53:10Z

Build Succeeded 👏

Build Id: 1ab3d785-48fd-4600-bfaa-e6a74d8f1dba

The following development artifacts have been built, and will exist for the next 30 days:

image: gcr.io/agones-images/agones-controller:1.19.0-f3c4feb
image: gcr.io/agones-images/agones-ping:1.19.0-f3c4feb
Linux C++ SDK (build): agonessdk-1.19.0-f3c4feb-linux-arch_64.tar.gz
SDK Server: agonessdk-server-1.19.0-f3c4feb.zip

A preview of the website (the last 30 builds are retained):

https://f3c4feb-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/2290/head:pr_2290 && git checkout pr_2290
helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.19.0-f3c4feb

roberthbailey

Thanks for sticking with this change over the many review cycles!

google-oss-robot · 2021-10-20T22:32:52Z

New changes are detected. LGTM label has been removed.

google-oss-robot · 2021-10-20T22:32:56Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: roberthbailey, zifter

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [roberthbailey]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

agones-bot · 2021-10-20T23:00:10Z

Build Succeeded 👏

Build Id: be1a445e-d671-4f13-90b7-a4d2780d61b5

The following development artifacts have been built, and will exist for the next 30 days:

image: gcr.io/agones-images/agones-controller:1.19.0-5235751
image: gcr.io/agones-images/agones-ping:1.19.0-5235751
Linux C++ SDK (build): agonessdk-1.19.0-5235751-linux-arch_64.tar.gz
SDK Server: agonessdk-server-1.19.0-5235751.zip

A preview of the website (the last 30 builds are retained):

https://5235751-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/2290/head:pr_2290 && git checkout pr_2290
helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.19.0-5235751

google-cla bot added the cla: no label Oct 4, 2021

google-oss-robot added the size/L label Oct 4, 2021

google-oss-robot requested review from aLekSer and roberthbailey October 4, 2021 22:43

zifter changed the title ~~Prometheus metrics: Use ServiceMonitor instead of deprecated annotation mechanism #2262~~ Prometheus metrics: Use ServiceMonitor instead of deprecated annotation mechanism Oct 4, 2021

google-cla bot added cla: yes and removed cla: no labels Oct 4, 2021

roberthbailey added the feature-freeze-do-not-merge Only eligible to be merged once we are out of feature freeze (next full release) label Oct 5, 2021

Prometheus metrics: Allow to use ServiceMonitor

eb47c66

Merge remote-tracking branch 'origin/main' into feature/2262-service-monitor

5282967

roberthbailey removed the feature-freeze-do-not-merge Only eligible to be merged once we are out of feature freeze (next full release) label Oct 12, 2021

Merge branch 'main' into feature/2262-service-monitor

fb02d60

roberthbailey reviewed Oct 18, 2021

merge

d8b3139

Fix review notes

f7e9dfb

zifter commented Oct 19, 2021

Remove unnecessary header

a11b710

roberthbailey reviewed Oct 19, 2021

zifter commented Oct 20, 2021

Fix review notes

f3c4feb

roberthbailey approved these changes Oct 20, 2021

google-oss-robot assigned roberthbailey Oct 20, 2021

google-oss-robot added the lgtm label Oct 20, 2021

Merge branch 'main' into feature/2262-service-monitor

5235751

google-oss-robot removed the lgtm label Oct 20, 2021

google-oss-robot added the approved label Oct 20, 2021

roberthbailey merged commit 80e202d into googleforgames:main Oct 20, 2021

roberthbailey added this to the 1.19.0 milestone Nov 1, 2021

SaitejaTamma added the kind/feature New features for Agones label Nov 16, 2021
