kube-state-metrics API scraping timeout #995

Closed
zhengl7 opened this issue Dec 9, 2019 · 10 comments · May be fixed by #2510
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@zhengl7

zhengl7 commented Dec 9, 2019

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

/kind feature

What happened:
We are using kube-state-metrics (1.6.0) from prometheus-operator. Our cluster periodically creates hundreds of pods and deletes them within a 15-minute window to absorb bursting traffic. When this happens, the kube-state-metrics target in Prometheus often hits the scrape timeout (default 10 seconds). We suspect this is due to the mutex locking mechanism used for MetricsStore: when new entries are being written into the map heavily, reads are blocked for an extended time.
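
For illustration, here is a minimal sketch of the pattern I suspect; the type and field names are mine, not the actual kube-state-metrics code. Every object add takes the write lock and every scrape takes the read lock, so the two contend under heavy churn:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// store is a stand-in for the suspected MetricsStore pattern: a map guarded
// by a sync.RWMutex. All names here are illustrative, not the real types.
type store struct {
	mu      sync.RWMutex
	metrics map[string]string
}

// Add simulates the informer writing metrics for a new object (write lock).
func (s *store) Add(key, series string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	time.Sleep(time.Millisecond) // pretend generating the metric takes a moment
	s.metrics[key] = series
}

// Scrape simulates the /metrics handler reading the whole map (read lock).
// While a writer holds the lock or is queued for it, this read has to wait.
func (s *store) Scrape() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.metrics)
}

func main() {
	s := &store{metrics: map[string]string{}}

	// Burst of writes, as when a few hundred pods are created at once.
	go func() {
		for i := 0; i < 500; i++ {
			s.Add(fmt.Sprintf("pod-%d", i), "kube_pod_info{...} 1")
		}
	}()

	time.Sleep(50 * time.Millisecond) // let the burst get going
	start := time.Now()
	n := s.Scrape()
	fmt.Printf("scraped %d series after waiting %v\n", n, time.Since(start))
}
```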

What you expected to happen:
More consistent API response times, so that Prometheus scrapes do not time out.

How to reproduce it (as minimally and precisely as possible):
Our cluster normally has about 30k metrics in total, with roughly 25k coming from the pod collectors (there appear to be 27 pod_info_ metrics, and we have close to 1000 pods). When the bursting traffic comes, the cluster creates a few hundred pods, pushing the total to around 60k metrics, most of the new ones being pod_info metrics.

Response size doesn't seem to be the issue: with 60k metrics the total response is roughly 10MB, and when the scrape call doesn't time out it responds in 100-200ms.

Single kube-state-metrics pod.

Anything else we need to know?:
We tried modifying the kube-state-metrics code to use a sync.Map for MetricsStore in the hope of better concurrency. It relieves the problem a little, but when the flood comes in it can still take seconds for kube-state-metrics to respond: sync.Map works best for map entries that change infrequently, whereas in our case a ton of new entries are constantly being created.

I think padding the gap in Prometheus with the previous sample should be fine. I'm not sure whether it makes sense for kube-state-metrics to provide this behavior itself, e.g. keeping a copy of the previous response and simply serving that copy if reading from the collectors takes too long.
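
A rough sketch of what I mean, purely illustrative; buildMetrics, the deadline, and the handler shape are all hypothetical and not existing kube-state-metrics code:

```go
package main

import (
	"log"
	"net/http"
	"sync"
	"time"
)

// cachedHandler remembers the last fully rendered /metrics payload and falls
// back to it when a fresh render does not finish before a deadline.
type cachedHandler struct {
	mu           sync.Mutex
	lastResponse []byte
	buildMetrics func() []byte // hypothetical stand-in for reading all collectors
	deadline     time.Duration
}

func (h *cachedHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	fresh := make(chan []byte, 1) // buffered so the render goroutine never leaks
	go func() { fresh <- h.buildMetrics() }()

	select {
	case body := <-fresh:
		h.mu.Lock()
		h.lastResponse = body // remember the latest good payload
		h.mu.Unlock()
		w.Write(body)
	case <-time.After(h.deadline):
		// Rendering is taking too long; serve the previous payload instead
		// of letting Prometheus hit its scrape timeout.
		h.mu.Lock()
		stale := h.lastResponse
		h.mu.Unlock()
		w.Write(stale)
	}
}

func main() {
	h := &cachedHandler{
		buildMetrics: func() []byte { return []byte("kube_pod_info{...} 1\n") },
		deadline:     5 * time.Second,
	}
	http.Handle("/metrics", h)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The trade-off is that a slow scrape returns slightly stale samples instead of failing outright.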

Environment:

  • Kubernetes version (use kubectl version):
    1.13
  • Kube-state-metrics image version
    1.6.0
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 9, 2019
@lilic
Member

lilic commented Dec 11, 2019

Thanks for the issue.

Our cluster periodically creates hundreds of pods and deletes them within a 15 minutes window for bursting traffic.

Seems like your use case is very specific. Have you tried disabling the pod metrics resource in kube-state-metrics? Does it make a difference?

@brancz
Member

brancz commented Dec 11, 2019

Or potentially configure kube-state-metrics to not watch the namespaces with the problematic workload.

@zhengl7
Author

zhengl7 commented Dec 11, 2019

@lilic Thanks for the response. Yeah, I agree our use case is special. I stood up a local kube-state-metrics with some timing instrumentation, and it's the pod collector that takes the majority of the time, so I'm pretty sure disabling it would fix things, but I don't think that's an option for us.

@brancz Thanks for the info; that might be something we can look at if there are no other solutions. The namespaces can also be dynamic, so it wouldn't be ideal.

I'm now leaning toward exploring completely lock-free maps to enhance kube-state-metrics, as sync.Map still locks on new entry creation. If things work well I can open a pull request for that.
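
For reference, a sync.Map variant of the illustrative store from the earlier sketch looks roughly like this (again, not the real MetricsStore); the comment on Add notes where the remaining locking comes from:

```go
package main

import (
	"fmt"
	"sync"
)

// syncMapStore is the same illustrative store as before, backed by a
// sync.Map instead of a mutex-guarded map.
type syncMapStore struct {
	metrics sync.Map // key string -> series string
}

// Add stores metrics for an object. sync.Map avoids locking for loads and
// for updates of existing keys, but storing a previously unseen key still
// goes through its internal mutex, which is why a constant stream of
// brand-new entries (new pods) still contends.
func (s *syncMapStore) Add(key, series string) {
	s.metrics.Store(key, series)
}

// Scrape walks the whole map without taking a store-wide read lock.
func (s *syncMapStore) Scrape() int {
	n := 0
	s.metrics.Range(func(_, _ any) bool {
		n++
		return true
	})
	return n
}

func main() {
	s := &syncMapStore{}
	for i := 0; i < 5; i++ {
		s.Add(fmt.Sprintf("pod-%d", i), "kube_pod_info{...} 1")
	}
	fmt.Println("series:", s.Scrape())
}
```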

@lilic
Member

lilic commented Dec 12, 2019

@zhengl7 note that we have benchmark tests in place and some of those things were done for optimisation purposes, so if the performance decreases we won't accept the PR. But we are happy to see any PRs that solve bugs :)

@zhengl7
Author

zhengl7 commented Dec 12, 2019

@lilic Absolutely. I just got a version that solves our problem: on the same cluster it reduces the scrape call duration from tens of seconds to sub-second during those peak-traffic minutes. I'll try running the test scripts locally.

@zhengl7
Author

zhengl7 commented Jan 18, 2020

@lilic Just submitted the PR for the proposed change. The fix ends up being just using sync.Map, and the timeouts are completely gone.

I ran make test-benchmark-compare and it seems there is some degradation; I'm not sure whether it's acceptable or not.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 17, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 17, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Pokom added a commit to grafana/kube-state-metrics that referenced this issue Jun 5, 2024
There are a few documented scenarios where `kube-state-metrics` will
lock up (kubernetes#995, kubernetes#1028). I believe a much simpler solution to ensure
`kube-state-metrics` doesn't lock up and require a restart to serve
`/metrics` requests is to add default read and write timeouts and to
allow them to be configurable. At Grafana, we've experienced a few
scenarios where `kube-state-metrics` running in larger clusters falls
behind and starts getting scraped multiple times. When this occurs,
`kube-state-metrics` becomes completely unresponsive and requires a
reboot. This is somewhat easily reproducible (I'll provide a script in
an issue) and causes other critical workloads (KEDA, VPA) to fail in
weird ways.

Adds two flags:
- `server-read-timeout`
- `server-write-timeout`

Updates the metrics http server to set the `ReadTimeout` and
`WriteTimeout` to the configured values.
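
Roughly, the wiring looks like the sketch below. The flag names are the ones described in the commit; the address, defaults, and handler are illustrative placeholders:

```go
package main

import (
	"flag"
	"log"
	"net/http"
	"time"
)

func main() {
	// Flag names mirror those in the commit message; defaults are assumed.
	readTimeout := flag.Duration("server-read-timeout", 60*time.Second,
		"maximum duration for reading the entire request")
	writeTimeout := flag.Duration("server-write-timeout", 60*time.Second,
		"maximum duration before timing out writes of the response")
	flag.Parse()

	mux := http.NewServeMux()
	mux.Handle("/metrics", metricsHandler()) // placeholder for the real handler

	srv := &http.Server{
		Addr:         ":8080",
		Handler:      mux,
		ReadTimeout:  *readTimeout,  // bounds how long a slow client can hold a request open
		WriteTimeout: *writeTimeout, // bounds how long serving /metrics may take
	}
	log.Fatal(srv.ListenAndServe())
}

// metricsHandler stands in for the real /metrics handler.
func metricsHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("kube_pod_info{...} 1\n"))
	})
}
```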