
🐛 Bootstrap machine only if it conforms to the version skew policy #6044

Conversation

@dlipovetsky (Contributor) commented Feb 2, 2022

What this PR does / why we need it:
Make the Machine controller enforce the Kubernetes version skew policy before reconciling the bootstrap configuration (which effectively bootstraps the Machine).

TODO

  • Handle absence of Status.Version field in ControlPlaneRef.
  • Additional tests in machine_controller_test.go

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #6040
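In short, the policy being enforced can be restated compactly. The sketch below is illustrative only, derived from the doc comments on isVersionAllowed quoted further down in the review; the function names, the semver-based helpers, and the standalone main are assumptions, not code from this PR.

```go
package main

import (
	"fmt"

	"github.com/blang/semver"
)

// allowed restates the skew rule described in this PR's doc comments:
//   - control plane Machine: actual CP major.minor <= machine <= desired CP major.minor
//   - worker Machine:        machine major.minor <= actual CP major.minor
// Illustrative only; not the PR's implementation.
func allowed(machine, cpActual, cpDesired semver.Version, isControlPlaneMachine bool) bool {
	if isControlPlaneMachine {
		return !minorLess(machine, cpActual) && !minorLess(cpDesired, machine)
	}
	return !minorLess(cpActual, machine)
}

// minorLess reports whether a's major.minor is strictly less than b's, ignoring the patch version.
func minorLess(a, b semver.Version) bool {
	return a.Major < b.Major || (a.Major == b.Major && a.Minor < b.Minor)
}

func main() {
	cpActual := semver.MustParse("1.22.5")
	cpDesired := semver.MustParse("1.23.3")
	worker := semver.MustParse("1.23.3")
	// A worker whose major.minor is newer than the running control plane violates the policy.
	fmt.Println(allowed(worker, cpActual, cpDesired, false)) // false
}
```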

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 2, 2022
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign neolit123 after the PR has been reviewed.
You can assign the PR to them by writing /assign @neolit123 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot (Contributor)

@dlipovetsky: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-e2e-main 29cc271 link true /test pull-cluster-api-e2e-main
pull-cluster-api-e2e-informing-ipv6-main 29cc271 link false /test pull-cluster-api-e2e-informing-ipv6-main
pull-cluster-api-e2e-informing-main 29cc271 link false /test pull-cluster-api-e2e-informing-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dlipovetsky (Contributor, Author)

The e2e tests are failing, because the code doesn't handle the absence of the Status.Version field of the Control Plane resource.

The Get API in the contract package makes this difficult: the "not found" error is unexported. While it could be exported, I think the Get method should instead return a boolean (i.e., return (string, bool, error) rather than just an error), since it is normal for optional fields to be absent. The path of the absent field is included in the error, but that same path is already readable using the Path method. I'll submit a separate PR for that.
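A minimal sketch of the signature change proposed above, assuming the accessor keeps an internal field path; the type name, its field, and the error wrapping are stand-ins, not the actual contract package code:

```go
package contract

import (
	"strings"

	"github.com/pkg/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// String is a stand-in for the contract package's string-field accessor.
type String struct {
	path []string
}

// Get returns the field value and whether the field was present, instead of
// signaling absence through an unexported "not found" error.
func (s *String) Get(obj *unstructured.Unstructured) (string, bool, error) {
	value, found, err := unstructured.NestedString(obj.UnstructuredContent(), s.path...)
	if err != nil {
		return "", false, errors.Wrapf(err, "failed to read field .%s", strings.Join(s.path, "."))
	}
	return value, found, nil
}
```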

@enxebre (Member) commented Feb 2, 2022

Thanks @dlipovetsky. Given that topology-controller-managed upgrades should take care of this, I'm wondering whether introducing this rigidity in the Machine controller might have hard-to-predict, undesired impacts for some use cases where I might want to deliberately upgrade a machine beyond the allowed skew and intervene manually if required; e.g., this would prevent testing/predicting/reproducing issues in a skewed-cluster scenario.
Any thoughts?

@joejulian (Contributor)

Topology-controller-managed upgrades have a ways to go and are still experimental. We have an immediate need to allow declarative updates, and this would provide that ability. A user can apply resources to upgrade their cluster, and this would prevent unsupported version skew: the control plane upgrades first, and the workers are not allowed to upgrade until the control plane upgrade is done.

Further, this will prevent users from inadvertently making their situation worse by upgrading a set of worker machines such that they're not supported by the control plane.

Do you have a suggestion, @enxebre, on how you would disable this protection for development purposes? If I were trying to break my cluster I suppose I would just comment out that protection and run it externally.

@dlipovetsky (Contributor, Author)

Thanks @dlipovetsky. Given that topology-controller-managed upgrades should take care of this, I'm wondering whether introducing this rigidity in the Machine controller might have hard-to-predict, undesired impacts for some use cases where I might want to deliberately upgrade a machine beyond the allowed skew and intervene manually if required; e.g., this would prevent testing/predicting/reproducing issues in a skewed-cluster scenario. Any thoughts?

I share your concerns.

Given that topology-controller-managed upgrades should take care of this

I want correct behavior even when I do not use ClusterClass.

I'm wondering whether introducing this rigidity in the Machine controller might have hard-to-predict, undesired impacts

This is a good point. However, I have not yet been able to think of a scenario where the version check presents a problem, especially since the check is skipped when either Machine.Spec.Version or Cluster.Spec.ControlPlaneRef is undefined.

I might want to deliberately upgrade a machine beyond the allowed skew and intervene manually if required; e.g., this would prevent testing/predicting/reproducing issues in a skewed-cluster scenario.

We could make the version check either opt-in or opt-out, e.g., using an annotation.

@neolit123 (Member)

Make the Machine controller enforce the Kubernetes version skew policy before reconciling the bootstrap configuration (which effectively bootstraps the Machine).

what has made sense to me in the past, and what I saw already being the case in kubeadm after joining the project, is the following:

  • error out on users if they are violating version skew policies.
  • give them the power to override the errors if needed for testing, etc.

unclear to me how this can be done in CAPI, but it feels like the best UX.

@dlipovetsky (Contributor, Author)

Make the Machine controller enforce the Kubernetes version skew policy before reconciling the bootstrap configuration (which effectively bootstraps the Machine).

what has made sense to me in the past, and what I saw already being the case in kubeadm after joining the project, is the following:

  • error out on users if they are violating version skew policies.
  • give them the power to override the errors if needed for testing, etc.

Thanks for the feedback! I like this approach.

unclear to me how this can be done in CAPI

I think an annotation on the Machine could work. For example, machine.cluster.x-k8s.io/exclude-version-skew-check, following a similar annotation that allows the user to skip node drain on delete. WDYT?
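A minimal sketch of what the opt-out could look like, assuming the annotation name proposed above (which does not exist in CAPI) and a helper wired into the Machine controller; both the constant and the function are hypothetical:

```go
package machine

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// ExcludeVersionSkewCheckAnnotation is the annotation name proposed in this
// thread; it is hypothetical and not defined by CAPI.
const ExcludeVersionSkewCheckAnnotation = "machine.cluster.x-k8s.io/exclude-version-skew-check"

// skipVersionSkewCheck reports whether the user asked to bypass the skew
// check, mirroring how the node-drain skip annotation is consulted.
func skipVersionSkewCheck(machine *clusterv1.Machine) bool {
	_, ok := machine.GetAnnotations()[ExcludeVersionSkewCheckAnnotation]
	return ok
}
```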

@fabriziopandini (Member)

@dlipovetsky thanks for tackling this problem!
This helps to address the discussion on #5222 as well.

A few considerations on the UX:

  • The proposed implementation makes the Machine reconciliation fail in case a wrong version is applied; I agree this should be the last safeguard, and we should make users aware of why the Machine is not provisioning.
  • Thinking about a better UX, IMO a user would prefer to be informed about the problem while creating a Machine or, given that Machines are often controlled by higher-level abstractions, when creating KCP, MD/MS, or MP (vs. accepting resources with a wrong version and then failing at reconcile time); I agree this is a much bigger change, but on the bright side we now have in place all the bits required for creating "smarter" webhooks, like the one implemented for ClusterClass.

One last consideration: I'm coming around to the idea that surfacing some information about the ControlPlane version on the Cluster could simplify a couple of use cases (I should open an issue and start tracking them).

return true, nil
}

mv, err := semver.ParseTolerant(*machine.Spec.Version)
Member:

Suggested change
- mv, err := semver.ParseTolerant(*machine.Spec.Version)
+ machineVersion, err := semver.ParseTolerant(*machine.Spec.Version)

Comment on lines +528 to +536
controlPlaneActualVersionStr, err := contract.ControlPlane().StatusVersion().Get(controlPlane)
if err != nil {
	return false, errors.Wrap(err, "failed to read control plane actual version")
}
// controlPlaneActualVersionStr is not nil, because Get did not return an error.
controlPlaneActualVersion, err := semver.ParseTolerant(*controlPlaneActualVersionStr)
if err != nil {
	return false, errors.Wrap(err, "failed to parse control plane actual version")
}
Member:

status.version was an optional field that was added later; not sure if it's in the current contract (cc @fabriziopandini), although spec.version should be there. If the current version in status is not available, can we fall back to only looking at the desired one instead?

Member:

If only .spec.version exists and .status.version doesn't, in most cases this means that the first control plane machine isn't up yet. In the topology reconciler we're interpreting this as "control plane is provisioning".
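A rough sketch of that fallback, reading the fields with unstructured helpers so an absent status.version can be detected without relying on the contract package's unexported "not found" error; this is an assumption about how it could be handled, not code from the PR:

```go
package machine

import (
	"github.com/pkg/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// controlPlaneComparableVersion returns status.version when it is set;
// otherwise it falls back to spec.version, treating the control plane as
// still provisioning.
func controlPlaneComparableVersion(controlPlane *unstructured.Unstructured) (string, error) {
	actual, found, err := unstructured.NestedString(controlPlane.UnstructuredContent(), "status", "version")
	if err != nil {
		return "", errors.Wrap(err, "failed to read control plane status.version")
	}
	if found {
		return actual, nil
	}
	// status.version is not set yet: the first control plane machine is likely
	// not up, so compare against the desired version instead.
	desired, found, err := unstructured.NestedString(controlPlane.UnstructuredContent(), "spec", "version")
	if err != nil {
		return "", errors.Wrap(err, "failed to read control plane spec.version")
	}
	if !found {
		return "", errors.New("control plane spec.version is not set")
	}
	return desired, nil
}
```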

Contributor Author:

Thanks for the additional context. I have a TODO to handle the absence of Status.Version.


if util.IsControlPlaneMachine(machine) {
	if version.Compare(mv, controlPlaneDesiredVersion, version.IgnorePatchVersion()) == 1 {
		return false, errors.Errorf("machine major.minor version (%d.%d) must be less than or equal to"+
Member:

Suggested change
- return false, errors.Errorf("machine major.minor version (%d.%d) must be less than or equal to"+
+ return false, errors.Errorf("control plane machine major.minor version (%d.%d) must be less than or equal to"+

Comment on lines +258 to +264
ok, err := r.isVersionAllowed(ctx, m, cluster)
if err != nil {
	return ctrl.Result{}, errors.Wrap(err, "failed to check machine version")
}
if !ok {
	return ctrl.Result{}, errors.New("machine version is not allowed")
}
Member:

I don't think this is right; wouldn't this code block reconciliation of a Machine as soon as the control plane bumps the version?

Member:

In general, we should have checks in a validation webhook; potentially this one could carry a client that looks at the control plane reference.

@sbueringer (Member) commented Feb 4, 2022:

Webhook on which resource? MachineDeployment, MachinePool, MachineSet, Machine or all of them?

I assume only on create, which would address the concern that this check should not run once a Machine already exists (to avoid blocking reconciliation of existing Machines while a control plane upgrade is in progress).

Contributor:

ha I just commented something similar here: #6040 (comment)

IMO the validation should be in the webhook of every object that can't change before the ControlPlane does but that has a Version in spec that can be edited by the user: Machine, MachinePool, MachineDeployment, and MachineSet.

I was thinking we'd want to validate updates too. If a user tries to upgrade a MachinePool or MachineDeployment before upgrading the ControlPlane, that's breaking. What's your concern with blocking reconciliation of existing machines while a control plane upgrade is in progress, @sbueringer? Wouldn't the new version be reflected in the control plane spec if the control plane has already been updated? Also, we wouldn't block reconciliation, just updates to the Machine/MD/MP/MS version.

Member:

For Machines, we need to make sure not to block Machine objects that are being created as part of the reconciliation of machine-based control planes; otherwise the control plane is never going to be rolled out.

Contributor:

Good point. I think we can exclude control plane machines from this validation based on the control plane label?

Member:

sgtm!
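A minimal sketch of the create-time validation discussed above, assuming the webhook has already resolved the control plane's version via the Cluster's controlPlaneRef; the function name, its wiring, and the literal control-plane label value are illustrative, not part of this PR:

```go
package webhooks

import (
	"fmt"

	"github.com/blang/semver"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// validateWorkerMachineVersionSkew rejects a new worker Machine whose
// major.minor is newer than the control plane's. Control plane Machines are
// excluded via the control plane label so that control plane rollouts are
// not blocked.
func validateWorkerMachineVersionSkew(m *clusterv1.Machine, controlPlaneVersion string) error {
	if _, isControlPlane := m.Labels["cluster.x-k8s.io/control-plane"]; isControlPlane {
		return nil
	}
	if m.Spec.Version == nil {
		return nil
	}
	mv, err := semver.ParseTolerant(*m.Spec.Version)
	if err != nil {
		return fmt.Errorf("failed to parse machine version %q: %w", *m.Spec.Version, err)
	}
	cpv, err := semver.ParseTolerant(controlPlaneVersion)
	if err != nil {
		return fmt.Errorf("failed to parse control plane version %q: %w", controlPlaneVersion, err)
	}
	// A worker's major.minor must not be newer than the control plane's.
	if mv.Major > cpv.Major || (mv.Major == cpv.Major && mv.Minor > cpv.Minor) {
		return fmt.Errorf("machine version %s must not have a newer major.minor than control plane version %s",
			*m.Spec.Version, controlPlaneVersion)
	}
	return nil
}
```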

@sbueringer (Member)

One last consideration: I'm coming around to the idea that surfacing some information about the ControlPlane version on the Cluster could simplify a couple of use cases (I should open an issue and start tracking them).

I think this issue covers that: #5341

// control plane actual major.minor version, and less than or equal to the control plane desired major.minor version.
// If the Machine is not part of the control plane, then its major.minor version must be less than or equal to the control
// plane actual major.minor version.
func (r *Reconciler) isVersionAllowed(ctx context.Context, machine *clusterv1.Machine, cluster *clusterv1.Cluster) (bool, error) {
@sbueringer (Member) commented Feb 4, 2022:

I wonder if it would make sense to align some of the logic with what we are doing in the topology reconciler when deciding if we want to trigger MD rollouts (computeMachineDeploymentVersion).

E.g. when the control plane is currently upgrading (ControlPlaneContract.IsUpgrading) we are not triggering new MD upgrades.

Most of the logic does not apply, but I wonder if there are some common parts.
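One plausible reading of that alignment, sketched below under the assumption that "upgrading" means the observed status.version has not yet caught up with the desired spec.version; this approximates the topology reconciler's behavior rather than reusing ControlPlaneContract.IsUpgrading directly:

```go
package machine

import (
	"github.com/pkg/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// controlPlaneIsUpgrading reports whether the control plane's desired version
// differs from its observed version. A missing status.version is treated as
// "provisioning" rather than "upgrading".
func controlPlaneIsUpgrading(controlPlane *unstructured.Unstructured) (bool, error) {
	desired, found, err := unstructured.NestedString(controlPlane.UnstructuredContent(), "spec", "version")
	if err != nil {
		return false, errors.Wrap(err, "failed to read control plane spec.version")
	}
	if !found {
		return false, errors.New("control plane spec.version is not set")
	}
	observed, found, err := unstructured.NestedString(controlPlane.UnstructuredContent(), "status", "version")
	if err != nil {
		return false, errors.Wrap(err, "failed to read control plane status.version")
	}
	if !found {
		return false, nil
	}
	return desired != observed, nil
}
```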

Comment on lines +503 to +504
// If the Machine is part of the control plane, then its major.minor version must be greater than or equal to the
// control plane actual major.minor version, and less than or equal to the control plane desired major.minor version.
@sbueringer (Member) commented Feb 4, 2022:

I could be wrong, but could it be that it's possible to downgrade with KCP today?
(I only found that it's forbidden via ClusterClass.)

If I'm correct we should probably start blocking downgrades in KCP? (separately from this PR)

Contributor Author:

If I'm correct we should probably start blocking downgrades in KCP? (separately from this PR)

Probably.

The current version skew policy does not mention downgrade. The most recent discussion took place in 2019-2020: kubernetes/website#12327.
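If downgrade blocking were added to KCP's update webhook, a version comparison along these lines would be the core of it; this is a hedged sketch under that assumption, not existing KCP behavior, and the function name is made up:

```go
package webhooks

import (
	"fmt"

	"github.com/blang/semver"
)

// validateNoDowngrade rejects an update that lowers the control plane version.
func validateNoDowngrade(oldVersion, newVersion string) error {
	oldV, err := semver.ParseTolerant(oldVersion)
	if err != nil {
		return fmt.Errorf("failed to parse current version %q: %w", oldVersion, err)
	}
	newV, err := semver.ParseTolerant(newVersion)
	if err != nil {
		return fmt.Errorf("failed to parse desired version %q: %w", newVersion, err)
	}
	if newV.LT(oldV) {
		return fmt.Errorf("version %s is a downgrade from %s and is not allowed", newVersion, oldVersion)
	}
	return nil
}
```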

@k8s-ci-robot (Contributor)

@dlipovetsky: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 29, 2022
@vincepri (Member) commented Jun 7, 2022

@dlipovetsky Any updates on this PR?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 5, 2022
@vincepri (Member)

Closing for now

/close

@k8s-ci-robot (Contributor)

@vincepri: Closed this PR.

In response to this:

Closing for now

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
10 participants