
Avoid dupe labels in prom metrics #2194

Merged: 8 commits into google:master on Jun 21, 2019

Conversation

@blakebarnett (Contributor)

Fixes #2181

Blake added 2 commits March 7, 2019 13:54
Since the # of containers shouldn't be massive on a single machine, this is probably fine for memory allocation.
@googlebot (Collaborator)

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@k8s-ci-robot (Collaborator)

Hi @blakebarnett. Thanks for your PR.

I'm waiting for a google or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@blakebarnett (Contributor, Author)

I signed it!

@googlebot (Collaborator)

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@dashpole (Collaborator) commented on Mar 7, 2019:

/ok-to-test

@dashpole (Collaborator) commented on Mar 7, 2019:

The following files are not properly formatted:
metrics/prometheus.go

sl := sanitizeLabelName(l)
for _, x := range labels {
	if sl != x {
		duplicate = true
Collaborator:

I think this will stay permanently true, and we will skip all subsequent labels

Contributor Author:

oops, fixing

		break
	}
}
if duplicate != true {
Collaborator:

s/duplicate != true/!duplicate

@@ -1155,8 +1155,19 @@ func (c *PrometheusCollector) collectContainersInfo(ch chan<- prometheus.Metric)
values := make([]string, 0, len(rawLabels))
labels := make([]string, 0, len(rawLabels))
containerLabels := c.containerLabelsFunc(cont)
duplicate := false
Collaborator:

maybe just declare duplicate inside the for l := range rawLabels block? Then you don't need to reset it to false each loop.

Contributor Author:

Yeah, confused myself by doing this a different way before submitting the PR. Thanks :)
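For reference, folding the suggestions so far together (declare duplicate inside the loop, compare with == and break, test !duplicate) gives a loop shaped roughly like the sketch below. This is only an illustration against a hypothetical standalone helper, not the merged code, and it deliberately leaves the values append where it was, since that question comes up in the next thread:

// Hypothetical standalone helper sketching the corrected check: keep only the
// first occurrence of each sanitized label name. sanitize stands in for
// sanitizeLabelName from metrics/prometheus.go.
func dedupeLabels(rawLabels map[string]struct{}, containerLabels map[string]string,
	sanitize func(string) string) (labels, values []string) {
	for l := range rawLabels {
		duplicate := false // scoped to this iteration, so no reset is needed
		sl := sanitize(l)
		for _, x := range labels {
			if sl == x { // a label with the same sanitized name is already present
				duplicate = true
				break
			}
		}
		if !duplicate {
			labels = append(labels, sl)
		}
		// Whether this append belongs inside the if above is the subject of
		// the next review thread.
		values = append(values, containerLabels[l])
	}
	return labels, values
}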

	}
}
if !duplicate {
	labels = append(labels, sl)
Collaborator:

can we end up with fewer labels than values? Should we move the values = append(values, containerLabels[l]) statement inside here as well?

Contributor Author (@blakebarnett), Mar 8, 2019:

I was wondering that after the test failure, but this shouldn't change that behavior, right? It will only exclude a label if a duplicate label already exists, and the value will still get set for that label.

Collaborator:

I would be surprised if prometheus didn't yell at us if we tried to, for example, use a description with 3 labels, but then provide 4 label values when creating the metric. What you are implicitly doing here is just using the first occurrence of a given sanitized label, and ignoring subsequent ones.
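For illustration, this is the kind of mismatch the Prometheus client refuses: a minimal standalone sketch (the metric name, help text, and label names are made up) where a Desc declares three variable labels but four label values are supplied. NewConstMetric returns an error, and MustNewConstMetric with the same arguments would panic instead:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// The Desc declares three variable labels...
	desc := prometheus.NewDesc("container_example_metric", "example help text",
		[]string{"a", "b", "c"}, nil)

	// ...but four label values are supplied, so the client reports the mismatch.
	_, err := prometheus.NewConstMetric(desc, prometheus.GaugeValue, 1, "1", "2", "3", "4")
	fmt.Println(err)
}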

Contributor Author:

Hmm, yeah I see. It could happen if someone provides multiple permutations of a label that all normalize to the same thing. Should we just throw them out in that case? I can't think of a great default behavior there.

Contributor Author:

^ on the same container, it should hopefully be a rare edge-case...

Collaborator:

I think just picking the first is a fine behavior.
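For context on how two raw labels can collide: sanitizeLabelName replaces characters that are not valid in a Prometheus label name with underscores, so distinct raw keys can collapse to one sanitized name. The sketch below approximates that behavior; the regular expression here is an assumption, not the exact implementation:

package main

import (
	"fmt"
	"regexp"
)

// Rough approximation of sanitizeLabelName: replace anything that is not a
// legal Prometheus label-name character with "_".
var invalidLabelChars = regexp.MustCompile(`[^a-zA-Z0-9_]`)

func sanitizeLabelName(name string) string {
	return invalidLabelChars.ReplaceAllString(name, "_")
}

func main() {
	// Two different raw labels (e.g. from container/pod annotations)...
	fmt.Println(sanitizeLabelName("example.com/team")) // example_com_team
	fmt.Println(sanitizeLabelName("example.com.team")) // example_com_team
	// ...collapse to the same sanitized label name, which is the collision
	// the duplicate check guards against.
}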

Contributor Author:

No, #2181 was node-wide: two separate containers (separate pods) with annotations that normalize to the same thing caused the panic.

Collaborator:

Ok, I get the difference now. I think we would still always get more values than labels, since if the label isn't present, we still add it with the value "".

Contributor Author:

That's true. In fact, when running without --store_container_labels=false, I noticed that every label present on any container on the host showed up with an empty value on all container metrics in Prometheus. That was what made me look into the cgroup whitelisting initially, and then I noticed this crash behavior.

Collaborator:

Yeah... Prometheus requires that all metric streams in a given scrape have the same set of labels. So our workaround is just to add empty values for all labels we don't have.
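A standalone sketch of that workaround from end to end (the container names and labels are invented, and sanitizeLabelName is approximated as above; the real logic lives in collectContainersInfo): gather the union of raw label keys across all containers, then emit the full label set for every container, using "" when a container lacks a label and skipping duplicates after sanitization:

package main

import (
	"fmt"
	"regexp"
	"sort"
)

var invalidLabelChars = regexp.MustCompile(`[^a-zA-Z0-9_]`)

// Approximation of sanitizeLabelName.
func sanitizeLabelName(name string) string {
	return invalidLabelChars.ReplaceAllString(name, "_")
}

func sortedKeys(m map[string]struct{}) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Invented per-container labels; in cAdvisor these come from containerLabelsFunc.
	containers := map[string]map[string]string{
		"pod-a": {"app.name": "frontend"},
		"pod-b": {"app/name": "backend", "team": "infra"},
	}

	// Union of raw label keys across every container, so all metric streams
	// in one scrape share the same label set.
	rawLabels := map[string]struct{}{}
	for _, cl := range containers {
		for l := range cl {
			rawLabels[l] = struct{}{}
		}
	}

	for name, containerLabels := range containers {
		labels := make([]string, 0, len(rawLabels))
		values := make([]string, 0, len(rawLabels))
		for _, l := range sortedKeys(rawLabels) {
			sl := sanitizeLabelName(l)
			duplicate := false
			for _, x := range labels {
				if sl == x {
					duplicate = true
					break
				}
			}
			if !duplicate {
				labels = append(labels, sl)
				// "" when this container does not carry the raw label.
				values = append(values, containerLabels[l])
			}
		}
		fmt.Println(name, labels, values)
	}
}

Keys are sorted here only to make the output deterministic; in the real collector, map iteration order decides which of two colliding raw labels wins, which is the "first occurrence" behavior accepted above.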

@blakebarnett (Contributor, Author) commented on Mar 14, 2019:

We noticed something similar to this: #2183. With --store_container_labels=false, the first container in the cgroup has all the kubernetes-provided label values; the rest don't. I can't figure out whether this is a regression of #1704 or something new...

This is without any changes and without this PR applied; I discovered it while validating this change.

@dashpole (Collaborator)

Feel free to close if you are no longer working on this. IIRC, this re-introduces #1704 in its current form. Let me know if that is not the case.

@dashpole self-assigned this on Jun 21, 2019
@blakebarnett (Contributor, Author)

Sorry, lost track of this one. This should be fine to go in. The issue I mentioned above about store_container_labels=false seems to be the bigger problem for us; I'll open another issue/PR for it.

@dashpole (Collaborator) left a review:

lgtm

@dashpole merged commit e8b24bf into google:master on Jun 21, 2019
@nightah (Contributor) commented on Oct 15, 2019:

@dashpole is this likely to make it in a release for Docker anytime soon?
I am experiencing the same issue that is exhibited in #2181 and have rolled back to v0.32.0 in the interim.

Successfully merging this pull request may close these issues: Panic in v0.33.0