Self-hosted e2e test is flaky #8316

sbueringer · 2023-03-20T04:13:16Z

Which jobs are flaking?

periodic-cluster-api-e2e-release-1-4
(also main & mink8s)

Which tests are flaking?

When testing Cluster API working on self-hosted clusters using ClusterClass with a HA control plane [ClusterClass]
When testing Cluster API working on self-hosted clusters using ClusterClass [ClusterClass]

Since when has it been flaking?

Note sure. Possibly related to the merge of #7680 on 10th March

20/03/2023, 02:19:05 periodic-cluster-api-e2e-main
19/03/2023, 22:19:07 periodic-cluster-api-e2e-main
19/03/2023, 04:15:02 periodic-cluster-api-e2e-main
18/03/2023, 06:13:02 periodic-cluster-api-e2e-main
17/03/2023, 16:11:01 periodic-cluster-api-e2e-main

Testgrid link

No response

Reason for failure (if possible)

Failed to run clusterctl move
Expected success, but got an error:
    <errors.aggregate | len:1, cap:1>: 
    action failed after 10 attempts: error patching the managed fields: error patching managed fields "addons.cluster.x-k8s.io/v1beta1, Kind=ClusterResourceSetBinding" self-hosted-eguve1/self-hosted-2wpo27: clusterresourcesetbindings.addons.cluster.x-k8s.io "self-hosted-2wpo27" not found
    [
        <*errors.withStack | 0xc000597ea8>{
            error: <*errors.withMessage | 0xc000f3ffe0>{
                cause: <*errors.withStack | 0xc000597e48>{
                    error: <*errors.withMessage | 0xc000f3ffc0>{
                        cause: <*errors.withStack | 0xc000597e18>{
                            error: <*errors.withMessage | 0xc000f3ffa0>{
                                cause: <*errors.StatusError | 0xc001e8f220>{
                                    ErrStatus: {
                                        TypeMeta: {Kind: ..., APIVersion: ...},
                                        ListMeta: {
                                            SelfLink: ...,
                                            ResourceVersion: ...,
                                            Continue: ...,
                                            RemainingItemCount: ...,
                                        },
                                        Status: "Failure",
                                        Message: "clusterresourcesetbindings.addons.cluster.x-k8s.io \"self-hosted-2wpo27

Anything else we need to know?

In some cases the initial clusterctl move already fails, in other cases the clusterctl move in cleanup in AfterEach fails. Basically clusterctl move seems flaky

Label(s) to be applied

/kind flake
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

The text was updated successfully, but these errors were encountered:

sbueringer · 2023-03-20T04:16:06Z

cc @killianmuldoon @fabriziopandini

As often as this fails, I would consider clusterctl move with ClusterClass and ClusterResourceSet broken at the moment

sbueringer · 2023-03-20T04:47:01Z

Looked into it. As far as I can tell the following is happening:

We recently changed the ownerRefs of ClusterResourceSetBinding:
- previously: ClusterResourceSetBinding was owned by ClusterResourceSet and Cluster
- now: ClusterResourceSetBinding is owned only by ClusterResourceSet
The consequence of that is that clusterctl move doesn't deploy Cluster and ClusterResourceSetBinding in a specific order. Examples:
- Successful clusterctl move: Cluster is deployed before ClusterResourceSetBinding: https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1637533595570540544/artifacts/clusters/bootstrap/logs/self-hosted-q5wtr0/clusterctl-move.log
- Failed clusterctl move: ClusterResourceSetBinding is deployed before Cluster: https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1637533595570540544/artifacts/clusters/self-hosted-y9pfpk/logs/self-hosted-q5wtr0/clusterctl-move.log
The problem is that if ClusterResourceSetBinding is created before Cluster the ClusterResourceSetBinding controller will just delete it:
- "msg":"deleting ClusterResourceSetBinding because the owner Cluster no longer exists"

By dropping the ownerRef to Cluster we basically dropped an ordering constraint.

Notes:

ClusterResourceSetBinding controller is not paused as it doesn't check for pause on Cluster (which also would be impossible if the Cluster does not even exist)

sbueringer · 2023-03-20T04:54:09Z

Possible solutions:

Revert ✨ Introduce ClusterName field to ClusterResourceSetBinding #7680 for now to give us more time
Pause ClusterResourceSetBinding (controller already has a predicate)
Re-introduce the ownerRef from ClusterResourceSetBinding to Cluster
Deploy the entire Cluster,... hierarchy before the ClusterResourceSet hierarchy
?

@fabriziopandini @ykakarap @killianmuldoon Can one of you take this over? I don't think I will have the time to work on this until I'm on PTO.

furkatgofurov7 · 2023-03-20T09:40:29Z

Hey, thanks for opening the issue (I could not make a time for it recently).

Looks like, we are affected by this in main branch as well, particularly capi-e2e-mink8s-main job is failing now constantly with the same issue:

capi-e2e: [It] When testing Cluster API working on single-node self-hosted clusters using ClusterClass [ClusterClass] Should pivot the bootstrap cluster to a self-hosted cluster expand_less | 6m27s
-- | --
{Failed to run clusterctl move Expected success, but got an error:     <errors.aggregate \| len:1, cap:1>:      action failed after 10 attempts: error patching the managed fields: error patching managed fields "addons.cluster.x-k8s.io/v1beta1, 
    [
        <*errors.withStack | 0xc002464678>{
            error: <*errors.withMessage | 0xc003002060>{
                cause: <*errors.withStack | 0xc002464648>{
                    error: <*errors.withMessage | 0xc003002040>{
                        cause: <*errors.withStack | 0xc002464618>{
                            error: <*errors.withMessage | 0xc003002020>{
                                cause: <*errors.StatusError | 0xc001f1c8c0>{
                                    ErrStatus: {
                                        TypeMeta: {Kind: ..., APIVersion: ...},
                                        ListMeta: {
                                            SelfLink: ...,
                                            ResourceVersion: ...,
                                            Continue: ...,
                                            RemainingItemCount: ...,
                                        },
                                        Status: "Failure",
                                        Message: "clusterresourcesetbindings.addons.cluster.x-k8s.io \"self-hosted-s7ahse\" not found",
                                        Reason: "NotFound",
                                        Details: {
                                            Name: ...,
                                            Group: ...,
                                            Kind: ...,
                                            UID: ...,
                                            Causes: ...,
                                            RetryAfterSeconds: ...,
                                        },
                                        Code: 404,
                                    },
                                },
                                msg: "error patching managed fields \"addons.cluster.x-k8s.io/v1beta1, Kind=ClusterResourceSetBinding\" self-hosted-neg7qg/self-hosted-s7ahse",

sbueringer · 2023-03-20T11:28:15Z

Yup, same issue in a bunch of places. I would expect the failure in e2e / mink8s across main and release-1.4

Fabrizio is looking into a fix

fabriziopandini · 2023-03-20T14:51:13Z

/triage accepted
/assign

k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 20, 2023

k8s-ci-robot assigned fabriziopandini Mar 20, 2023

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 20, 2023

fabriziopandini mentioned this issue Mar 20, 2023

🐛 Add soft ownership from clusters to ClusterResourceSetBinding #8318

Merged

k8s-ci-robot closed this as completed in #8318 Mar 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Self-hosted e2e test is flaky #8316

Self-hosted e2e test is flaky #8316

sbueringer commented Mar 20, 2023 •

edited

Loading

sbueringer commented Mar 20, 2023

sbueringer commented Mar 20, 2023 •

edited

Loading

sbueringer commented Mar 20, 2023 •

edited

Loading

furkatgofurov7 commented Mar 20, 2023

sbueringer commented Mar 20, 2023 •

edited

Loading

fabriziopandini commented Mar 20, 2023

Self-hosted e2e test is flaky #8316

Self-hosted e2e test is flaky #8316

Comments

sbueringer commented Mar 20, 2023 • edited Loading

Which jobs are flaking?

Which tests are flaking?

Since when has it been flaking?

Testgrid link

Reason for failure (if possible)

Anything else we need to know?

Label(s) to be applied

sbueringer commented Mar 20, 2023

sbueringer commented Mar 20, 2023 • edited Loading

sbueringer commented Mar 20, 2023 • edited Loading

furkatgofurov7 commented Mar 20, 2023

sbueringer commented Mar 20, 2023 • edited Loading

fabriziopandini commented Mar 20, 2023

sbueringer commented Mar 20, 2023 •

edited

Loading

sbueringer commented Mar 20, 2023 •

edited

Loading

sbueringer commented Mar 20, 2023 •

edited

Loading

sbueringer commented Mar 20, 2023 •

edited

Loading