10x memory usage in 1.13.2 compared to 1.12.x #2436

Closed
alam0rt opened this issue Jun 22, 2023 · 15 comments

@alam0rt

alam0rt commented Jun 22, 2023

Deploying 1.13.2 revealed that memory usage, both at startup and after running for some time, has drastically increased.

At startup, we saw memory usage climb to around 450Mi and then settle at about 400Mi.

This appears to increase with the number of nodes in the cluster.

[Image: memory usage graph]
The red bar is OOM kills; pre-spike is running 1.12.0, post-spike is 1.13.2.

1.12.0

aws-cni-pqd9g                           6m           63Mi
aws-cni-q6x2x                           7m           40Mi
aws-cni-q7sbw                           8m           63Mi
aws-cni-q8msl                           10m          57Mi
aws-cni-rfrll                           5m           54Mi
aws-cni-s2qwn                           7m           55Mi
aws-cni-s9ndl                           6m           64Mi
aws-cni-s9pvj                           6m           54Mi
aws-cni-sg9g7                           6m           61Mi
aws-cni-srvvz                           6m           40Mi
aws-cni-szqg8                           6m           52Mi
aws-cni-v8gvc                           7m           64Mi
aws-cni-vjhff                           6m           40Mi
aws-cni-w56kk                           6m           52Mi
aws-cni-w9mb8                           6m           53Mi
aws-cni-wmxwf                           8m           40Mi
aws-cni-wtpgg                           6m           53Mi
aws-cni-wxrtp                           6m           54Mi

1.13.2

aws-cni-wbk75                            7m           413Mi
aws-cni-wg857                            8m           434Mi
aws-cni-wgkrj                            8m           422Mi
aws-cni-wkqnm                            11m          433Mi
aws-cni-wlzlw                            8m           428Mi
aws-cni-x22hb                            7m           424Mi
aws-cni-x8r4n                            6m           447Mi
aws-cni-xk2hd                            7m           426Mi
aws-cni-xw4ts                            5m           432Mi
aws-cni-zt5wr                            6m           446Mi
aws-cni-zxch8                            8m           453Mi

What happened:
Updated to 1.13.2

Attach logs

What you expected to happen:

For aws-cni not to use 10 times the memory.

How to reproduce it (as minimally and precisely as possible):

Deploy 1.13.2

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.25.10
  • CNI Version: 1.13.2
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
@alam0rt alam0rt added the bug label Jun 22, 2023
@alam0rt
Author

alam0rt commented Jun 22, 2023

Mind you, in a smaller cluster (19 nodes), memory usage is much more reasonable (still a bit higher than before):

aws-cni-4rzv9                            5m           48Mi
aws-cni-6hk89                            5m           50Mi
aws-cni-6pdcv                            5m           49Mi
aws-cni-745lg                            6m           52Mi
aws-cni-8wp7v                            5m           51Mi
aws-cni-9jwsv                            5m           49Mi
aws-cni-9lnhs                            6m           50Mi
aws-cni-d5ssz                            6m           66Mi
aws-cni-d8lhr                            6m           53Mi
aws-cni-dbbth                            5m           55Mi
aws-cni-f79l4                            7m           50Mi
aws-cni-flt6s                            6m           63Mi
aws-cni-hqckg                            5m           49Mi
aws-cni-khr8f                            8m           53Mi
aws-cni-mcxzq                            5m           50Mi
aws-cni-tzwnn                            7m           49Mi
aws-cni-v4x49                            6m           53Mi
aws-cni-vfqsn                            5m           69Mi
aws-cni-wtq6n                            5m           50Mi

Whereas in the main example (in the description) we are at about 100 nodes, which is why I suspect node count is correlated with the memory increase in the latest release.

@jdn5126
Contributor

jdn5126 commented Jun 22, 2023

@alam0rt v1.13.1 is not a release version. Is there a typo in the version?

@jdn5126
Contributor

jdn5126 commented Jun 22, 2023

Also, I see that your pod names are aws-cni-xxxx rather than aws-node-xxxx. Are you sure that you are using the VPC CNI from this repo? And how are you deploying it?

@alam0rt alam0rt changed the title 10x memory usage in 0.13.1 compared to 0.12.x 10x memory usage in 1.13.1 compared to 1.12.x Jun 22, 2023
@alam0rt
Author

alam0rt commented Jun 22, 2023

Also, I see that your pod names are aws-cni-xxxx rather than aws-node-xxxx. Are you sure that you are using the VPC CNI from this repo? And how are you deploying it?

We have some Ruby that pulls down config/master/aws-k8s-cni.yaml and slightly modifies the resource to inject annotations, limits, etc., for example:

container["name"] = "aws-cni"

We also build Dockerfile.release and Dockerfile.init ourselves and COPY in a custom binary that we use to format logs. I will test without our extra steps.

@alam0rt v1.13.1 is not a release version. Is there a typo in the version?

My bad, fixed the typo.

@alam0rt
Author

alam0rt commented Jun 22, 2023

Tested using an unmodified (but self-built) aws-cni container and saw the same behaviour. I can probably get a pprof going.
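
For reference, a generic sketch of how a Go daemon can expose heap profiles via net/http/pprof; this is not taken from ipamd itself, and whether aws-node already exposes such an endpoint, as well as the address/port below, are assumptions:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// With this listener running, a heap profile can be pulled with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Println(http.ListenAndServe("localhost:6060", nil))
}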

@alam0rt alam0rt changed the title 10x memory usage in 1.13.1 compared to 1.12.x 10x memory usage in 1.13.2 compared to 1.12.x Jun 22, 2023
@alam0rt
Author

alam0rt commented Jun 23, 2023

[Image: heap profile]
Looks like it's caching every pod in the cluster, which explains why we saw much higher usage in our larger clusters.

@adammw and I suspect it's due to the caching client:

func CreateCachedKubeClient(rawK8SClient client.Client, mapper meta.RESTMapper) (client.Client, error) {
	restCfg, err := getRestConfig()
	if err != nil {
		return nil, err
	}
	restCfg.Burst = 100
	vpcCniScheme := runtime.NewScheme()
	clientgoscheme.AddToScheme(vpcCniScheme)
	eniconfigscheme.AddToScheme(vpcCniScheme)
	stopChan := ctrl.SetupSignalHandler()
	cache, err := cache.New(restCfg, cache.Options{Scheme: vpcCniScheme, Mapper: mapper})
	if err != nil {
		return nil, err
	}
	go func() {
		cache.Start(stopChan)
	}()
	cache.WaitForCacheSync(stopChan)
	cachedK8SClient := client.NewDelegatingClientInput{
		CacheReader: cache,
		Client:      rawK8SClient,
	}
	returnedCachedK8SClient, err := client.NewDelegatingClient(cachedK8SClient)
	if err != nil {
		return nil, err
	}
	return returnedCachedK8SClient, nil
}

I'm going to try adding a selector to scope the cache to the local node only. I'll probably use the MY_NODE_NAME env var from the downward API, e.g.:

config.SelectorsByObject = map[client.Object]ObjectSelector{&corev1.Pod{}: {
    Field: fields.Set{"spec.nodeName": nodeName}.AsSelector(),
}}
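
For reference, a minimal sketch of that idea, assuming controller-runtime v0.14.x (where cache.Options still exposes SelectorsByObject); the package, function name, and MY_NODE_NAME wiring below are illustrative, not the actual change in #2439:

// Minimal sketch only: assumes controller-runtime v0.14.x (SelectorsByObject
// was replaced in later releases) and that MY_NODE_NAME is injected via the
// downward API (fieldRef: spec.nodeName) in the aws-node DaemonSet.
package example

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// newNodeScopedCache builds a cache that only lists/watches Pods scheduled on
// the local node, so cache memory no longer scales with cluster-wide pod count.
func newNodeScopedCache(restCfg *rest.Config, scheme *runtime.Scheme, mapper meta.RESTMapper) (cache.Cache, error) {
	nodeName := os.Getenv("MY_NODE_NAME")

	return cache.New(restCfg, cache.Options{
		Scheme: scheme,
		Mapper: mapper,
		// Restrict the Pod informer to this node only.
		SelectorsByObject: cache.SelectorsByObject{
			&corev1.Pod{}: {
				Field: fields.OneTermEqualSelector("spec.nodeName", nodeName),
			},
		},
	})
}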

@alam0rt
Author

alam0rt commented Jun 23, 2023

#2439

This seems to do the trick.

@jdn5126
Contributor

jdn5126 commented Jun 26, 2023

Thanks for the excellent debugging, @alam0rt and @adammw! Sorry, I'm just getting back from vacation, but my assumption was that the k8s client cache or the EC2 metadata cache are the only components in IPAMD where an issue could cause memory to scale with the number of nodes/pods. There is also a big delta in client-go and controller-runtime versions between these two releases.

I'd like to do some further digging and testing on #2439 this week and then I can approve.

@diranged

We just ran into this rolling 1.13.2 out to our largest cluster... we then tried 1.13.1 and 1.13.0, and all three releases have the same memory issue. Do we think that #2439 is likely to be shipped in the next few days? Until then, do we think it's safe to run v1.13.2 with a much higher memory requirement?

@jdn5126
Contributor

jdn5126 commented Jun 28, 2023

@diranged we are planning to do a release as soon as #2439 is merged. If you have strict memory requirements, then it may be better to stick with v1.12.6 until this is available.

@diranged

@jdn5126 We tried bumping up our limits to 1Gi and we still got OOMs... so we are on 1.12.6 until this is out. Thanks.

@sam-som

sam-som commented Jul 17, 2023

Looks like the PR was merged and released:
#2463

Does anyone know when this version will be available in eksctl?

eksctl utils describe-addon-versions --kubernetes-version 1.27 --name vpc-cni | grep AddonVersion
			"AddonVersions": [
					"AddonVersion": "v1.13.2-eksbuild.1",
					"AddonVersion": "v1.13.0-eksbuild.1",
					"AddonVersion": "v1.12.6-eksbuild.2",
					"AddonVersion": "v1.12.6-eksbuild.1",
					"AddonVersion": "v1.12.5-eksbuild.2",

@jdn5126
Contributor

jdn5126 commented Jul 17, 2023

Hi @sam-som, that pipeline is in progress and should be completed by the end of this week.

@jdn5126
Contributor

jdn5126 commented Jul 18, 2023

Closing as fixed by #2463 and released in v1.13.3

@jdn5126 jdn5126 closed this as completed Jul 18, 2023
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue, feel free to do so.
