Prometheus metrics: Use ServiceMonitor instead of deprecated annotation mechanism #2290

Merged
merged 11 commits into from
Oct 20, 2021

Conversation

zifter
Contributor

@zifter zifter commented Oct 4, 2021

What type of PR is this?

Uncomment only one /kind <> line, press enter to put that in a new line, and remove leading whitespace from that line:

/kind breaking
/kind bug
/kind cleanup
/kind documentation

/kind feature

/kind hotfix

What this PR does / Why we need it:
It allows Prometheus metrics to be collected with prometheus-operator out of the box.

Which issue(s) this PR fixes:

Closes #2262

Special notes for your reviewer:

  1. I created the Service agones-allocator-service, which partly duplicates agones-allocator. However, agones-allocator can be exposed via a load balancer, and I want to avoid exposing the metrics port there. As a result, I added the agones.allocator.http2.port variable;
  2. I added installation of the whole prometheus-stack without refactoring the old Prometheus and Grafana installation. I think it's better to have two Prometheus environments with different scraping mechanisms: annotations and ServiceMonitor;
  3. The agones-ping service does not have metrics.
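
A minimal sketch of the idea in note 1, assuming a separate cluster-internal Service placed in front of the allocator's metrics port so that the externally exposed Service never carries it. The Service name, labels, selector, and port number below are illustrative assumptions, not the values rendered by this PR's Helm chart:

```yaml
# Hypothetical example: every name, label, and port here is an assumption.
apiVersion: v1
kind: Service
metadata:
  name: agones-allocator-metrics-example   # assumed name
  namespace: agones-system
  labels:
    app: agones-allocator-metrics          # assumed label, used for discovery
spec:
  type: ClusterIP          # reachable only from inside the cluster
  selector:
    app: agones-allocator  # assumed pod label of the allocator pods
  ports:
    - name: http           # named port, so a monitor can refer to it by name
      port: 8080           # assumed metrics/health port
      targetPort: 8080
```

Keeping this Service as ClusterIP means the metrics port stays internal even when the main allocator Service is of type LoadBalancer.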

@google-cla

google-cla bot commented Oct 4, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@zifter zifter changed the title Prometheus metrics: Use ServiceMonitor instead of deprecated annotation mechanism #2262 Prometheus metrics: Use ServiceMonitor instead of deprecated annotation mechanism Oct 4, 2021
@agones-bot
Collaborator

Build Failed 😱

Build Id: 16edd22f-321f-4009-81f1-54bd46fae67b

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😱

Build Id: 83ec6eb0-5533-494a-a4f7-cf126be1fb34

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel
Member

To get past testing, you will need to run make gen-install to regenerate the index.yaml 👍🏻

@google-cla

google-cla bot commented Oct 4, 2021

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot added cla: yes and removed cla: no labels Oct 4, 2021
@agones-bot
Collaborator

Build Failed 😱

Build Id: cbfc03c4-d1a3-45a9-b18a-4b58534c960d

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😱

Build Id: fde7d232-06ba-4cff-860f-524fbac1d56c

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😱

Build Id: 74a24e33-269a-4bd5-8363-55bc861d7213

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😱

Build Id: c3e79aa5-308b-4c25-99cb-7b5a60e9ba31

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😱

Build Id: 0e9ec9c2-b528-46bb-9180-df9b0cbfa27c

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😱

Build Id: 373ad062-67ad-400d-ad61-f59d0675a2c9

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@zifter
Contributor Author

zifter commented Oct 5, 2021

@markmandel can you help me?
I have no idea why the tests failed. Maybe something is wrong with the previous release?

Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress

I'm not sure whether it's connected with my changes, because I see the same failure in PR #2288.

@roberthbailey
Member

I've rolled back the helm install. It should be ready for e2e testing again.

@roberthbailey roberthbailey added the feature-freeze-do-not-merge Only eligible to be merged once we are out of feature freeze (next full release) label Oct 5, 2021
@agones-bot
Collaborator

Build Failed 😱

Build Id: 3e7d5be4-f175-4e35-adc9-c6aa8a5aa6a6

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 39ddf0b1-a491-4189-aaf8-86598600d2dc

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/2290/head:pr_2290 && git checkout pr_2290
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.18.0-5282967

@zifter
Contributor Author

zifter commented Oct 6, 2021

Oh, the tests are definitely unstable :(

@zifter
Contributor Author

zifter commented Oct 12, 2021

Hi @aLekSer @roberthbailey!
Could you review these changes, please?

@roberthbailey
Member

Hi @zifter - I (and maybe others) was waiting until after the feature freeze (which ends with the release cut scheduled for today) to review new PRs. So you should get some feedback in the next day or two.

@roberthbailey roberthbailey removed the feature-freeze-do-not-merge Only eligible to be merged once we are out of feature freeze (next full release) label Oct 12, 2021
@zifter
Contributor Author

zifter commented Oct 14, 2021

There are some problems in the CI build that are not connected with my changes.

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 6d5cb97d-5cf2-4ed8-a942-e4e5ff5615ca

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/2290/head:pr_2290 && git checkout pr_2290
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.19.0-fb02d60

@@ -163,6 +166,13 @@ agones:
port: 443
portName: grpc
targetPort: 8443
serviceInternal:
Member

nit: I wonder if this name should convey that this is used for scraping metrics? "internal service" is a bit vague, and might be confused as a way to allocate game servers from within the cluster.

Contributor Author

I don't want to tie the name of this service to metrics. Yes, at the moment it is used only for metrics scraping, but I suppose it will be used for other features in the future. In that case the name would be confusing, and changing it would become a backward-compatibility problem.
I'd like to name it as the counterpart of the service that is available outside the cluster (the service section).
But, of course, I don't insist.
How would you prefer to name it?

Member

From what I can tell, the allocator service doesn't expose much on port 8080 - just the health handlers (for liveness / readiness probes) and the metrics endpoint. So you can't actually use this service to do anything other than scrape metrics -- you can't, for instance, call this service to allocate a game server. So I don't see a problem making it obvious that this is strictly an internal metric gathering service.

One other question that occurred to me as I was thinking about this - should prometheus be scraping all allocator pods instead of using a service to pull metrics from one pod at a time (and likely different pods each time a new request is made)? If there was only one pod behind the service then using a service gives a stable name to find the pod, but when there are multiple pods and each one will have different stats, it seems like we should pull from all of them to get things like total aggregated allocations (which is the sum of allocations from all pods in the deployment).

Contributor Author

Agreed!
I will rename it to serviceMetrics, OK?

ServiceMonitor will scrape metrics from all pods behind the Service that it discovers by label.
So, don't worry, all replicas of the allocator/controller will be scraped :)
For more information on how it works, refer to this doc:
https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/design.md#servicemonitor
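
As a rough illustration of the point above, a ServiceMonitor for prometheus-operator might look like the sketch below. The resource name, namespace, label, port name, and scrape interval are assumptions chosen for the example and are not taken from this PR. The operator selects Services by label, and Prometheus then discovers the endpoints behind each selected Service, so every backing pod becomes its own scrape target rather than one pod being picked per request:

```yaml
# Hypothetical example: names, labels, and interval are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: agones-allocator-monitor-example   # assumed name
  namespace: agones-system
spec:
  selector:
    matchLabels:
      app: agones-allocator-metrics        # assumed label on the metrics Service
  endpoints:
    - port: http        # must match the *name* of a port on the selected Service
      path: /metrics    # assumed metrics path
      interval: 30s     # assumed scrape interval
```

Because each replica is scraped separately, totals such as allocations across the deployment can be computed by summing over the per-pod series in Prometheus.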

@roberthbailey
Member

I was going to update your branch but it looks like you need to resolve a conflict.

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 2d2fb5be-0829-4076-9972-8f50674de3c3

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/2290/head:pr_2290 && git checkout pr_2290
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.19.0-d8b3139

Contributor Author

@zifter zifter left a comment

fix some review notes

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 5089ea86-61ae-4224-94c2-9d45cb43459d

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/2290/head:pr_2290 && git checkout pr_2290
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.19.0-f7e9dfb

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 57cc7617-30a8-4652-a135-7b36f43f547b

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/2290/head:pr_2290 && git checkout pr_2290
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.19.0-a11b710

@@ -63,7 +63,54 @@ spec:
{{ toYaml .Values.agones.allocator.service.loadBalancerSourceRanges | indent 4 }}
{{- end }}
{{- end }}

{{- if .Values.agones.allocator.serviceInternal.enabled }}
Member

I don't see .Values.agones.allocator.serviceInternal.name in the values file (there is an http.enabled though). I'm wondering if this should be on or off by default. There isn't much overhead of having an unused internal service in k8s (no new cloud resources need to be created) so I'm ok with leaving it on by default.

Contributor Author

Yes, that was an untested change made after the previous review notes, sorry.
I agree with your point, and I think it's better to remove the enabled flag for that service altogether.

@roberthbailey
Member

Just a few more comments - this is getting close!

@agones-bot
Collaborator

Build Failed 😱

Build Id: f06e5e98-ec4f-4dc4-a124-a2136dc369cc

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

Contributor Author

@zifter zifter left a comment

Review notes fixed

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 1ab3d785-48fd-4600-bfaa-e6a74d8f1dba

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/2290/head:pr_2290 && git checkout pr_2290
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.19.0-f3c4feb

Member

@roberthbailey roberthbailey left a comment

Thanks for sticking with this change over the many review cycles!

@google-oss-robot

New changes are detected. LGTM label has been removed.

@google-oss-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: roberthbailey, zifter

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: be1a445e-d671-4f13-90b7-a4d2780d61b5

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/2290/head:pr_2290 && git checkout pr_2290
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.19.0-5235751

@roberthbailey roberthbailey merged commit 80e202d into googleforgames:main Oct 20, 2021
@roberthbailey roberthbailey added this to the 1.19.0 milestone Nov 1, 2021
@SaitejaTamma SaitejaTamma added the kind/feature New features for Agones label Nov 16, 2021
Successfully merging this pull request may close these issues.

Prometheus metrics: Use ServiceMonitor instead of deprecated annotation mechanism
6 participants