ClusterClass: cache SSA dry-run requests #8146

Closed
sbueringer opened this issue Feb 21, 2023 · 4 comments · Fixed by #8207 or #8243
Assignees: sbueringer
Labels: area/clusterclass, kind/feature, triage/accepted
Milestone: v1.4

Comments

@sbueringer (Member) commented Feb 21, 2023

We recently merged "ClusterClass: run dry-run on original and modified object".

This roughly doubled the number of SSA dry-run calls the Cluster topology controller makes to the API server.

The goal of this issue is to implement a cache to reduce the number of these calls.

The rough idea: for a given pair of original and modified objects, we don't have to repeat the SSA dry-run call once we know that there is no diff.

We cache the "no diff" result under a combined key consisting of the resourceVersion of the original object and a hash of the modified object.

To guard against changes to the defaulting logic (e.g. through infra provider updates), which influence the result of the SSA dry-run, we will only cache the result for 10 minutes. Rough sketches of the cache and of how a reconciler could consult it are included below.
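To illustrate the idea, here is a minimal sketch of such a cache, assuming illustrative names (ssaCache, requestIdentifier) rather than the actual implementation, built on apimachinery's Expiring cache:

```go
// A minimal sketch, not the actual implementation: cache "no diff" SSA dry-run results
// under a key derived from the original's resourceVersion and a hash of the modified
// object, with a 10 minute TTL. Names like ssaCache and requestIdentifier are illustrative.
package ssa

import (
	"crypto/sha256"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/util/cache"
)

// ttl guards against changes to defaulting/mutating logic (e.g. after a provider upgrade)
// that would change the dry-run result: cached entries are only trusted for 10 minutes.
const ttl = 10 * time.Minute

// ssaCache remembers original/modified pairs for which the SSA dry run produced no diff.
type ssaCache struct {
	store *cache.Expiring
}

func newSSACache() *ssaCache {
	return &ssaCache{store: cache.NewExpiring()}
}

// requestIdentifier combines the resourceVersion of the original object with a hash of the
// modified object, so a change to either side results in a new cache key (and thus a miss).
func requestIdentifier(original, modified *unstructured.Unstructured) (string, error) {
	modifiedJSON, err := modified.MarshalJSON()
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%s.%x", original.GetResourceVersion(), sha256.Sum256(modifiedJSON)), nil
}

// Has reports whether a "no diff" result is cached (and not yet expired) for this identifier.
func (c *ssaCache) Has(id string) bool {
	_, ok := c.store.Get(id)
	return ok
}

// Add records that the dry run for this identifier produced no diff.
func (c *ssaCache) Add(id string) {
	c.store.Set(id, struct{}{}, ttl)
}
```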

/kind feature
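
And a hypothetical helper sketching how a reconciler could consult that cache around the dry-run call; it builds on the sketch above, and the field owner name and the spec-only diff check are simplifying assumptions:

```go
// A hypothetical helper illustrating how a reconciler could consult the cache around the
// server-side apply dry run. It reuses ssaCache and requestIdentifier from the sketch above;
// the field owner name and the spec-only diff check are simplifying assumptions.
package ssa

import (
	"context"

	apiequality "k8s.io/apimachinery/pkg/api/equality"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func dryRunHasDiff(ctx context.Context, c client.Client, dryRunCache *ssaCache, original, modified *unstructured.Unstructured) (bool, error) {
	id, err := requestIdentifier(original, modified)
	if err != nil {
		return false, err
	}
	// Cache hit: we already know this original/modified pair produces no diff,
	// so the round trip to the API server can be skipped entirely.
	if dryRunCache.Has(id) {
		return false, nil
	}

	// Cache miss: send the SSA dry-run request to the API server.
	dryRunObj := modified.DeepCopy()
	if err := c.Patch(ctx, dryRunObj, client.Apply,
		client.DryRunAll,
		client.FieldOwner("capi-topology"),
		client.ForceOwnership,
	); err != nil {
		return false, err
	}

	// Simplified diff check: compare only the spec of the original object with the spec the
	// API server would persist. The real comparison has to be more careful than this.
	hasDiff := !apiequality.Semantic.DeepEqual(original.Object["spec"], dryRunObj.Object["spec"])

	// Only "no diff" results are cached; they expire after the TTL defined above.
	if !hasDiff {
		dryRunCache.Add(id)
	}
	return hasDiff, nil
}
```

With something like this, repeated reconciles of an unchanged Cluster only pay for one dry-run call per object until either object changes or the cache entry expires.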

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 21, 2023
@sbueringer sbueringer self-assigned this Feb 21, 2023
@sbueringer sbueringer added this to the v1.4 milestone Feb 21, 2023
@sbueringer (Member, Author) commented

cc @fabriziopandini

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 21, 2023
@sbueringer (Member, Author) commented

Moving a comment over from #8207:

I took a look at the data from before/after this PR merged:

Cluster topology controller:

  • 10.96.0.1:443 cluster.x-k8s.io clusters
    • normal: 10
    • dryrun: 984 => 304
  • 10.96.0.1:443 cluster.x-k8s.io machinedeployments
    • normal: 18 => 19
    • dryrun: 958 => 178
  • 10.96.0.1:443 cluster.x-k8s.io machinehealthchecks:
    • normal: 18
    • dryrun: 1630 => 260
  • 10.96.0.1:443 controlplane.cluster.x-k8s.io kubeadmcontrolplanes:
    • normal: 25
    • dryrun: 964 => 588
  • 10.96.0.1:443 infrastructure.cluster.x-k8s.io dockerclusters
    • normal: 11 => 10
    • dryrun: 964 => 62
  • 10.96.0.1:443 bootstrap.cluster.x-k8s.io kubeadmconfigtemplates
    • normal: 15
    • dryrun: 958 => 44
  • 10.96.0.1:443 infrastructure.cluster.x-k8s.io dockermachinetemplates
    • normal: 37
    • dryrun: 1922 => 110

MD controller:

  • 10.96.0.1:443 cluster.x-k8s.io machinesets: 892 => 753
    => TODO: the reduction doesn't look good enough

MS controller:

  • 10.96.0.1:443 cluster.x-k8s.io machines: 1650 => 570
  • 10.96.0.1:443 infrastructure.cluster.x-k8s.io dockermachines: 1613 => 205
  • 10.96.0.1:443 bootstrap.cluster.x-k8s.io kubeadmconfigs: 1613 => 151

Machine controller: (various workload clusters)

  • 172.18.0.10: v1/nodes 526 => 18
  • 172.18.0.3: v1/nodes 977 => 147
  • 172.18.0.4: v1/nodes 2075 => 83
  • 172.18.0.5: v1/nodes 729 => 73
  • 172.18.0.6: v1/nodes 353 => 128
  • 172.18.0.7: v1/nodes 601 => 43
  • 172.18.0.8: v1/nodes 135 => 41

=> Before: 19663 => After: 3892

KCP controller:

  • 10.96.0.1:443 cluster.x-k8s.io machines 4404 => 4131
  • 10.96.0.1:443 bootstrap.cluster.x-k8s.io kubeadmconfigs: 4355 => 4089
  • 10.96.0.1:443 infrastructure.cluster.x-k8s.io dockermachines 4355 => 4089
    => TODO: the reduction doesn't look good enough

tl;dr: ~5x reduction of calls for everything except the KCP and MD controllers. I'll take a look at those; maybe the caching doesn't work there for some reason.

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Mar 7, 2023
@k8s-ci-robot (Contributor) commented

@sbueringer: Reopened this issue.

In response to this:


/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer (Member, Author) commented Mar 8, 2023

#8243 further improves the caching drastically. In this case I ran e2e-full on #8243 and on #8241 (baseline).
The absolute numbers are similar to, but not directly comparable with, the results above, because the e2e-full job skips the quickstart tests that were included in the periodic job I used above.

Cluster topology controller:

  • cluster.x-k8s.io clusters (dryrun) 302 => 302
  • cluster.x-k8s.io clusters 10 => 10
  • cluster.x-k8s.io machinedeployments (dryrun) 180 => 178
  • cluster.x-k8s.io machinedeployments 19 => 20
  • cluster.x-k8s.io machinehealthchecks (dryrun) 240 => 246
  • cluster.x-k8s.io machinehealthchecks 18 => 18
  • controlplane.cluster.x-k8s.io kubeadmcontrolplanes (dryrun) 586 => 598
  • controlplane.cluster.x-k8s.io kubeadmcontrolplanes 24 => 24
  • infrastructure.cluster.x-k8s.io dockerclusters (dryrun) 68 => 68
  • infrastructure.cluster.x-k8s.io dockerclusters 11 => 10
  • bootstrap.cluster.x-k8s.io kubeadmconfigtemplates (dryrun) 42 => 42
  • bootstrap.cluster.x-k8s.io kubeadmconfigtemplates 15 => 15
  • infrastructure.cluster.x-k8s.io dockermachinetemplates (dryrun) 110 => 108
  • infrastructure.cluster.x-k8s.io dockermachinetemplates 37 => 37

MD controller:

  • cluster.x-k8s.io machinesets 736 => 417

MS controller:

  • cluster.x-k8s.io machines 596 => 401
  • bootstrap.cluster.x-k8s.io kubeadmconfigs 142 => 145
  • infrastructure.cluster.x-k8s.io dockermachines 201 => 202

KCP controller:

  • cluster.x-k8s.io machines 4409 => 497
  • bootstrap.cluster.x-k8s.io kubeadmconfigs 4431 => 153
  • infrastructure.cluster.x-k8s.io dockermachines 4431 => 197

Improvements:

  • KCP controller: Machines/KubeadmConfig/DockerMachine: 13271 => 847
  • MD controller: MachineSets: 736 => 417
  • MS controller: Machines: 596 => 401

Overall, across both PRs: Before: 32934 => After: 4225 (~87% fewer calls / ~8x improvement)

P.S. It's nice to see (also for future experiments) that the call counts stay almost constant for the call types that are not affected.

@killianmuldoon killianmuldoon added the area/clusterclass Issues or PRs related to clusterclass label May 4, 2023