10x memory usage in 1.13.2 compared to 1.12.x #2436

Closed
alam0rt opened this issue Jun 22, 2023 · 15 comments

@alam0rt

alam0rt commented Jun 22, 2023

Deploying 1.13.2 revealed that memory usage, both at startup and after running for some time, has drastically increased.

At startup, we saw memory usage climb to around 450Mi and then settle at about 400Mi.

This appears to increase with the number of nodes in the cluster.

[Image: memory usage graph]
The red bar is OOM kills; pre-spike is running 1.12.0, post-spike is 1.13.2.

1.12.0

aws-cni-pqd9g                           6m           63Mi
aws-cni-q6x2x                           7m           40Mi
aws-cni-q7sbw                           8m           63Mi
aws-cni-q8msl                           10m          57Mi
aws-cni-rfrll                           5m           54Mi
aws-cni-s2qwn                           7m           55Mi
aws-cni-s9ndl                           6m           64Mi
aws-cni-s9pvj                           6m           54Mi
aws-cni-sg9g7                           6m           61Mi
aws-cni-srvvz                           6m           40Mi
aws-cni-szqg8                           6m           52Mi
aws-cni-v8gvc                           7m           64Mi
aws-cni-vjhff                           6m           40Mi
aws-cni-w56kk                           6m           52Mi
aws-cni-w9mb8                           6m           53Mi
aws-cni-wmxwf                           8m           40Mi
aws-cni-wtpgg                           6m           53Mi
aws-cni-wxrtp                           6m           54Mi

1.13.2

aws-cni-wbk75                            7m           413Mi
aws-cni-wg857                            8m           434Mi
aws-cni-wgkrj                            8m           422Mi
aws-cni-wkqnm                            11m          433Mi
aws-cni-wlzlw                            8m           428Mi
aws-cni-x22hb                            7m           424Mi
aws-cni-x8r4n                            6m           447Mi
aws-cni-xk2hd                            7m           426Mi
aws-cni-xw4ts                            5m           432Mi
aws-cni-zt5wr                            6m           446Mi
aws-cni-zxch8                            8m           453Mi

What happened:
Updated to 1.13.2

Attach logs

What you expected to happen:

For aws-cni not to use 10 times the memory.

How to reproduce it (as minimally and precisely as possible):

Deploy 1.13.2

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.25.10
  • CNI Version: 1.13.2
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
@alam0rt alam0rt added the bug label Jun 22, 2023
@alam0rt
Author

alam0rt commented Jun 22, 2023

Mind you, in a smaller cluster (19 nodes), memory usage is much more reasonable (still a bit higher than before):

aws-cni-4rzv9                            5m           48Mi
aws-cni-6hk89                            5m           50Mi
aws-cni-6pdcv                            5m           49Mi
aws-cni-745lg                            6m           52Mi
aws-cni-8wp7v                            5m           51Mi
aws-cni-9jwsv                            5m           49Mi
aws-cni-9lnhs                            6m           50Mi
aws-cni-d5ssz                            6m           66Mi
aws-cni-d8lhr                            6m           53Mi
aws-cni-dbbth                            5m           55Mi
aws-cni-f79l4                            7m           50Mi
aws-cni-flt6s                            6m           63Mi
aws-cni-hqckg                            5m           49Mi
aws-cni-khr8f                            8m           53Mi
aws-cni-mcxzq                            5m           50Mi
aws-cni-tzwnn                            7m           49Mi
aws-cni-v4x49                            6m           53Mi
aws-cni-vfqsn                            5m           69Mi
aws-cni-wtq6n                            5m           50Mi

Whereas in the main example (in the description) we are at about 100 nodes, which is why I suspect node count is correlated with the memory increase in the latest release.

@jdn5126
Contributor

jdn5126 commented Jun 22, 2023

@alam0rt v1.13.1 is not a release version. Is there a typo in the version?

@jdn5126
Contributor

jdn5126 commented Jun 22, 2023

Also, I see that your pod names are aws-cni-xxxx rather than aws-node-xxxx. Are you sure that you are using the VPC CNI from this repo? And how are you deploying it?

@alam0rt alam0rt changed the title 10x memory usage in 0.13.1 compared to 0.12.x 10x memory usage in 1.13.1 compared to 1.12.x Jun 22, 2023
@alam0rt
Author

alam0rt commented Jun 22, 2023

Also, I see that your pod names are aws-cni-xxxx rather than aws-node-xxxx. Are you sure that you are using the VPC CNI from this repo? And how are you deploying it?

We have some Ruby that pulls down config/master/aws-k8s-cni.yaml and slightly modifies the resource to inject annotations, limits, etc., for example:

container["name"] = "aws-cni"

We also build Dockerfile.release and Dockerfile.init ourselves and COPY in a custom binary that we use to format logs. I will test without our extra steps.

@alam0rt v1.13.1 is not a release version. Is there a typo in the version?

My bad, fixed the typo.

@alam0rt
Author

alam0rt commented Jun 22, 2023

Tested using an unmodified (but self-built) aws-cni container and saw the same behaviour. I can probably get a pprof going.
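
For reference, a generic sketch of how a Go daemon can expose heap profiles via net/http/pprof; this is not taken from ipamd itself, and whether aws-node already exposes such an endpoint, as well as the address/port below, are assumptions:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// With this listener running, a heap profile can be pulled with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Println(http.ListenAndServe("localhost:6060", nil))
}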

@alam0rt alam0rt changed the title 10x memory usage in 1.13.1 compared to 1.12.x 10x memory usage in 1.13.2 compared to 1.12.x Jun 22, 2023
@alam0rt
Author

alam0rt commented Jun 23, 2023

[Image: heap profile]
Looks like it's caching every pod in the cluster, which explains why we saw much higher usage in our larger clusters.

@adammw and I suspect it's due to the caching client:

func CreateCachedKubeClient(rawK8SClient client.Client, mapper meta.RESTMapper) (client.Client, error) {
	restCfg, err := getRestConfig()
	if err != nil {
		return nil, err
	}
	restCfg.Burst = 100
	vpcCniScheme := runtime.NewScheme()
	clientgoscheme.AddToScheme(vpcCniScheme)
	eniconfigscheme.AddToScheme(vpcCniScheme)
	stopChan := ctrl.SetupSignalHandler()
	cache, err := cache.New(restCfg, cache.Options{Scheme: vpcCniScheme, Mapper: mapper})
	if err != nil {
		return nil, err
	}
	go func() {
		cache.Start(stopChan)
	}()
	cache.WaitForCacheSync(stopChan)
	cachedK8SClient := client.NewDelegatingClientInput{
		CacheReader: cache,
		Client:      rawK8SClient,
	}
	returnedCachedK8SClient, err := client.NewDelegatingClient(cachedK8SClient)
	if err != nil {
		return nil, err
	}
	return returnedCachedK8SClient, nil
}

I'm going to try adding a selector to scope the cache to the local node only. I'll probably use the MY_NODE_NAME env var from the downward API, e.g.:

config.SelectorsByObject = map[client.Object]ObjectSelector{&corev1.Pod{}: {
    Field: fields.Set{"spec.nodeName": nodeName}.AsSelector(),
}}
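
For reference, a minimal sketch of that idea, assuming controller-runtime v0.14.x (where cache.Options still exposes SelectorsByObject); the package, function name, and MY_NODE_NAME wiring below are illustrative, not the actual change in #2439:

// Minimal sketch only: assumes controller-runtime v0.14.x (SelectorsByObject
// was replaced in later releases) and that MY_NODE_NAME is injected via the
// downward API (fieldRef: spec.nodeName) in the aws-node DaemonSet.
package example

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// newNodeScopedCache builds a cache that only lists/watches Pods scheduled on
// the local node, so cache memory no longer scales with cluster-wide pod count.
func newNodeScopedCache(restCfg *rest.Config, scheme *runtime.Scheme, mapper meta.RESTMapper) (cache.Cache, error) {
	nodeName := os.Getenv("MY_NODE_NAME")

	return cache.New(restCfg, cache.Options{
		Scheme: scheme,
		Mapper: mapper,
		// Restrict the Pod informer to this node only.
		SelectorsByObject: cache.SelectorsByObject{
			&corev1.Pod{}: {
				Field: fields.OneTermEqualSelector("spec.nodeName", nodeName),
			},
		},
	})
}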

@alam0rt
Author

alam0rt commented Jun 23, 2023

#2439

This seems to do the trick.

@jdn5126
Contributor

jdn5126 commented Jun 26, 2023

Thanks for the excellent debugging, @alam0rt and @adammw! Sorry, I'm just getting back from vacation, but my assumption was that the k8s client cache or the EC2 metadata cache are the only components in IPAMD where an issue could cause memory to scale with the number of nodes/pods. There is also a big delta in client-go and controller-runtime versions between these two releases.

I'd like to do some further digging and testing on #2439 this week and then I can approve.

@diranged

We just ran into this rolling 1.13.2 out to our largest cluster... we then tried 1.13.1 and 1.13.0, and all three releases have the same memory issue. Do we think that #2439 is likely to be shipped in the next few days? Until then, do we think it's safe to run v1.13.2 with a much higher memory requirement?

@jdn5126
Contributor

jdn5126 commented Jun 28, 2023

@diranged we are planning to do a release as soon as #2439 is merged. If you have strict memory requirements, then it may be better to stick with v1.12.6 until this is available.

@diranged

@jdn5126 We tried bumping up our limits to 1Gi and we still got OOMs... so we are on 1.12.6 until this is out. Thanks.

@sam-som

sam-som commented Jul 17, 2023

Looks like the PR was merged and released:
#2463

Does anyone know when this version will be available in eksctl?

eksctl utils describe-addon-versions --kubernetes-version 1.27 --name vpc-cni | grep AddonVersion
			"AddonVersions": [
					"AddonVersion": "v1.13.2-eksbuild.1",
					"AddonVersion": "v1.13.0-eksbuild.1",
					"AddonVersion": "v1.12.6-eksbuild.2",
					"AddonVersion": "v1.12.6-eksbuild.1",
					"AddonVersion": "v1.12.5-eksbuild.2",

@jdn5126
Contributor

jdn5126 commented Jul 17, 2023

Hi @sam-som, that pipeline is in progress and should be completed by the end of this week.

@jdn5126
Contributor

jdn5126 commented Jul 18, 2023

Closing as fixed by #2463 and released in v1.13.3

@jdn5126 jdn5126 closed this as completed Jul 18, 2023
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue, feel free to do so.
