
OOM keda operator and metricServer #4687

Closed
yuvalweber opened this issue Jun 14, 2023 · 11 comments
Labels
bug Something isn't working

Comments

@yuvalweber
Contributor

Report

For some reason, after deploying only one ScaledObject in my cluster (a very large cluster), the keda-operator started crashing due to OOM (before that it was using only about 20Mi).
I am using the default KEDA spec, which means a memory request of 100Mi and a limit of 1000Mi.
Because of the OOM I raised the limit to 2Gi, and now the pod survives at around 600Mi of memory.
After that the metrics server started crashing due to OOM as well; once I raised its limit too, it managed to run, but it also jumped to a similar amount of memory.
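
For reference, the change described above amounts to a memory override on the operator container along these lines (a minimal sketch using the values from this report; the exact field layout depends on how KEDA is installed, e.g. raw manifests vs. Helm values):

```yaml
# Sketch of the resources override described above; 100Mi/1000Mi are the stock
# defaults mentioned in the report, 2Gi is the raised limit.
resources:
  requests:
    memory: 100Mi
  limits:
    memory: 2Gi
```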

My question is: how can I investigate what is causing this memory burst? Even with debug logs I can't see anything that seems related.

Expected Behavior

Memory consumption shouldn't jump to roughly 30 times its previous level because of a single ScaledObject.

Actual Behavior

Memory usage jumps to a much larger amount.

Steps to Reproduce the Problem

  1. Have a large cluster
  2. Deploy a ScaledObject with a Prometheus scaler (an illustrative example follows this list)
  3. Observe the memory burst
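
For illustration only (the author's actual manifest is attached later in this thread as keda-scaled-object.log), a ScaledObject with a Prometheus trigger generally looks like the sketch below; every name, namespace, and address is a placeholder:

```yaml
# Illustrative ScaledObject with a Prometheus trigger; names and URLs are placeholders,
# not taken from the attached manifest.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-scaledobject
  namespace: example-ns
spec:
  scaleTargetRef:
    name: example-deployment
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(rate(http_requests_total[2m]))
        threshold: "100"
```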

Logs from KEDA operator

example

KEDA Version

2.8.1

Kubernetes Version

1.23

Platform

Amazon Web Services

Scaler Details

Prometheus

Anything else?

No response

@yuvalweber added the bug label on Jun 14, 2023
@JorTurFer
Member

Could you share the logs please?

@yuvalweber
Contributor Author

@JorTurFer
Member

I see that you are registering 106 custom CAs, is that correct?
Could you share the KEDA operator/metrics server deployment YAML? Are you using Helm?

@yuvalweber
Contributor Author

How can I check this CA thing? I don't see that I'm registering 106 CAs.

These are the deployments:
keda-operator-deplopyment.log
keda-metrics-server-deployment.log

@JorTurFer
Member

oh f**k,
I was wrong, those CAs are internal, ignore my previous comment xD

@JorTurFer
Member

Could you share the ScaledObject that you are deploying?

@yuvalweber
Contributor Author

Of course
keda-scaled-object.log

@zroubalik
Member

I see you are using 2.8.1; could you please update to a newer version? I recall there were some critical issues fixed since then.

@yuvalweber
Contributor Author

Hey, I can't try this right now, but I have a memory profile.
Maybe you can help me understand it?
keda_memory_map.pdf

@yuvalweber
Contributor Author

Found out what the problem was.
It turned out that because I was using an older version of KEDA, I was also using an older version of controller-runtime (0.12.3 instead of 0.15.0).
In KEDA version 2.11.0 there was an upgrade to a newer version of controller-runtime.
That version of controller-runtime no longer uses createStructuredListWatch, which has problems with caching, and many settings regarding the caching of the informers the controller uses were changed.

When I looked at the differences between the requests going to the Kubernetes API server, I could see that version 2.8.1 queries "/api/v1/secrets?limit=500" (all secrets, cluster-wide),
while version 2.11.0 queries the namespaced path "/api/v1/namespaces/<namespace_name>/secrets?limit=500",
which returns far fewer secrets and doesn't fill all the memory we gave KEDA.

I'm just adding the explanation here because I thought it would be useful to other people as well.
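
As a rough illustration of that behavioral difference (not KEDA's actual code), newer controller-runtime versions let the manager's cache be scoped to specific namespaces, so informers issue namespaced list/watch requests instead of cluster-wide ones. The sketch below assumes controller-runtime v0.16+ where cache.Options.DefaultNamespaces exists; the namespace name is a placeholder:

```go
// Minimal sketch (not KEDA's actual setup): scope the controller-runtime cache to a
// single namespace so cached informers list/watch namespaced resources rather than
// the whole cluster. Assumes controller-runtime v0.16+ (cache.Options.DefaultNamespaces).
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			// Only objects in this namespace are cached, so list/watch calls hit
			// /api/v1/namespaces/<namespace_name>/secrets instead of /api/v1/secrets.
			DefaultNamespaces: map[string]cache.Config{
				"keda": {}, // placeholder namespace
			},
		},
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```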

@zroubalik
Member

@yuvalweber Thanks, appreciate that!
