🐛 Bootstrap machine only if it conforms to the version skew policy #6044
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing

@dlipovetsky: The following tests failed, say

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
The e2e tests are failing because the code doesn't handle the absence of the `Status.Version` field.
Thanks @dlipovetsky. Given that topology-controller-managed upgrades should take care of this, I'm wondering if introducing this rigidity in the machine controller might have hard-to-predict, undesired impact for some use cases where I might want to deliberately upgrade a machine beyond the allowed skew and intervene manually if required, e.g. this would prevent testing/predicting/reproducing issues in a skewed-cluster scenario.
Topology-controller-managed upgrades have a way to go and are still experimental. We have an immediate need to allow declarative updates, and this would provide that ability. A user can apply resources to upgrade their cluster, and this would prevent unsupported version skew: upgrading the control plane and disallowing the workers to upgrade until the control plane is upgraded. Further, this will prevent users from inadvertently making their situation worse by upgrading a set of worker machines such that they're not supported by the control plane. Do you have a suggestion, @enxebre, on how you would disable this protection for development purposes? If I were trying to break my cluster, I suppose I would just comment out that protection and run it externally.
I share your concerns.
I want correct behavior even when I do not use ClusterClass.
This is a good point. However, I have not yet been able to think of a scenario where the version check presents a problem, especially since the check is skipped when either the
We could make the version check either opt-in or opt-out, e.g., using an annotation.
What has made sense to me in the past, and what I saw already being the case in kubeadm after joining the project, is the following:
It's unclear to me how this can be done in CAPI, but it feels like the best UX.
Thanks for the feedback! I like this approach.
I think an annotation on the Machine could work. For example,
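One possible shape for the opt-out is sketched below. This is only an illustration: the annotation name `machine.cluster.x-k8s.io/skip-version-skew-check` is hypothetical, not an agreed API, and the check is reduced to a plain map lookup rather than the real controller plumbing.

```go
package main

import "fmt"

// skipVersionSkewCheckAnnotation is a hypothetical annotation name; the real
// name would need to be agreed on and documented in the API conventions.
const skipVersionSkewCheckAnnotation = "machine.cluster.x-k8s.io/skip-version-skew-check"

// shouldSkipVersionCheck reports whether a Machine's annotations opt it out of
// the version skew check (the opt-out variant discussed above).
func shouldSkipVersionCheck(annotations map[string]string) bool {
	_, ok := annotations[skipVersionSkewCheckAnnotation]
	return ok
}

func main() {
	optedOut := map[string]string{skipVersionSkewCheckAnnotation: ""}
	fmt.Println(shouldSkipVersionCheck(optedOut)) // true: skew check skipped
	fmt.Println(shouldSkipVersionCheck(nil))      // false: skew check runs
}
```

An opt-in variant would simply invert the default: run the check only when the annotation is present.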
@dlipovetsky thanks for tackling this problem! A few considerations on the UX:
One last consideration: I'm getting to the idea that surfacing some information about the ControlPlane version on the Cluster could simplify a couple of use cases (I should open an issue and start to track them).
```go
	return true, nil
}

mv, err := semver.ParseTolerant(*machine.Spec.Version)
```
Suggested change:

```diff
-mv, err := semver.ParseTolerant(*machine.Spec.Version)
+machineVersion, err := semver.ParseTolerant(*machine.Spec.Version)
```
```go
controlPlaneActualVersionStr, err := contract.ControlPlane().StatusVersion().Get(controlPlane)
if err != nil {
	return false, errors.Wrap(err, "failed to read control plane actual version")
}
// controlPlaneActualVersionStr is not nil, because Get did not return an error.
controlPlaneActualVersion, err := semver.ParseTolerant(*controlPlaneActualVersionStr)
if err != nil {
	return false, errors.Wrap(err, "failed to parse control plane actual version")
}
```
`status.version` was an optional field that was added later; not sure if it's in the current contract (cc @fabriziopandini), although `spec.version` should be there. If the current version in status is not available, can we fall back to only looking at the desired one instead?
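The fallback suggested here could look roughly like the sketch below. It uses plain string pointers in place of the real contract accessors, and the function name `effectiveControlPlaneVersion` is made up for illustration.

```go
package main

import "fmt"

// effectiveControlPlaneVersion returns the version the skew check should
// compare against: the observed version from status when it is set, otherwise
// the desired version from spec (e.g. while the first control plane machine is
// still coming up). Pointers stand in for the contract's optional fields.
func effectiveControlPlaneVersion(statusVersion, specVersion *string) (string, error) {
	if statusVersion != nil {
		return *statusVersion, nil
	}
	if specVersion != nil {
		return *specVersion, nil
	}
	return "", fmt.Errorf("control plane has neither status.version nor spec.version")
}

func main() {
	spec := "v1.23.0"
	status := "v1.22.4"
	v, _ := effectiveControlPlaneVersion(&status, &spec)
	fmt.Println(v) // v1.22.4: prefer the observed version
	v, _ = effectiveControlPlaneVersion(nil, &spec)
	fmt.Println(v) // v1.23.0: fall back to the desired version
}
```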
Both fields are required for "implementations using version" (https://cluster-api.sigs.k8s.io/developer/architecture/controllers/control-plane.html#required-status-fields-for-implementations-using-version)
If only `.spec.version` exists and `.status.version` doesn't, in most cases this means that the first control plane machine isn't up yet. In the topology reconcile we're interpreting this as "control plane is provisioning".
Thanks for the additional context. I have a TODO to handle the absence of Status.Version.
```go
if util.IsControlPlaneMachine(machine) {
	if version.Compare(mv, controlPlaneDesiredVersion, version.IgnorePatchVersion()) == 1 {
		return false, errors.Errorf("machine major.minor version (%d.%d) must be less than or equal to"+
```
Suggested change:

```diff
-return false, errors.Errorf("machine major.minor version (%d.%d) must be less than or equal to"+
+return false, errors.Errorf("control plane machine major.minor version (%d.%d) must be less than or equal to"+
```
```go
ok, err := r.isVersionAllowed(ctx, m, cluster)
if err != nil {
	return ctrl.Result{}, errors.Wrap(err, "failed to check machine version")
}
if !ok {
	// err is nil here, so wrapping it would return a nil error; create a new one.
	return ctrl.Result{}, errors.New("machine version is not allowed")
}
```
I don't think this is right, wouldn't this code block reconciliation of a Machine as soon as the control plane bumps the version?
In general, we should have checks in a validation webhook; potentially this one could carry a client that looks at the control plane reference.
Webhook on which resource? MachineDeployment, MachinePool, MachineSet, Machine or all of them?
I assume only on create, which would solve the issue that this check should not be run once a Machine already exists (to avoid blocking reconciliation of existing machines because a control plane upgrade is in progress)
Ha, I just commented something similar here: #6040 (comment)

IMO the validation should be in the webhook of every object that can't change before the ControlPlane does but that has a version in spec that can be edited by the user: Machine, MachinePool, MachineDeployment, and MachineSet.

I was thinking we'd want to validate updates too. If a user tries to upgrade a MachinePool or MachineDeployment before upgrading the ControlPlane, that's breaking. What's your concern with blocking reconciliation of existing machines because a control plane upgrade is in progress, @sbueringer? Wouldn't the new version be reflected in the control plane spec if the control plane has already been updated? Also, we wouldn't block reconciliation, just updating the Machine/MD/MP/MS version.
For Machines, we need to make sure to not block Machine objects that are being created as part of the reconciliation of machine-based control planes, otherwise the control plane is never going to be rolled out
good point, I think we can exclude control plane machines from this validation based on the control plane label?
sgtm!
I think this issue covers that: #5341
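The webhook behavior the thread converges on — validate on create and on version change, but exclude control plane Machines by label — can be sketched as a pure decision function. This is only an illustration of the discussed policy, not actual webhook code; the `cluster.x-k8s.io/control-plane` label is the standard Cluster API control plane marker.

```go
package main

import "fmt"

// controlPlaneLabel marks control plane Machines in Cluster API.
const controlPlaneLabel = "cluster.x-k8s.io/control-plane"

// needsSkewValidation sketches the webhook decision discussed above: run the
// skew validation on create and on version change, but skip Machines carrying
// the control plane label so machine-based control planes can still roll out.
func needsSkewValidation(labels map[string]string, oldVersion, newVersion string, isUpdate bool) bool {
	if _, isControlPlane := labels[controlPlaneLabel]; isControlPlane {
		return false // control plane rollout must not be blocked
	}
	if !isUpdate {
		return true // always validate on create
	}
	return oldVersion != newVersion // on update, only when the version changes
}

func main() {
	cpLabels := map[string]string{controlPlaneLabel: ""}
	fmt.Println(needsSkewValidation(nil, "", "v1.23.0", false))              // true: worker create
	fmt.Println(needsSkewValidation(nil, "v1.22.0", "v1.22.0", true))        // false: no-op update
	fmt.Println(needsSkewValidation(cpLabels, "", "v1.23.0", false))         // false: control plane machine
}
```

The same predicate would apply to MachineDeployment, MachineSet, and MachinePool webhooks, minus the control plane exclusion.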
```go
// control plane actual major.minor version, and less than or equal to the control plane desired major.minor version.
// If the Machine is not part of the control plane, then its major.minor version must be less than or equal to the control
// plane actual major.minor version.
func (r *Reconciler) isVersionAllowed(ctx context.Context, machine *clusterv1.Machine, cluster *clusterv1.Cluster) (bool, error) {
```
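The rule in the doc comment can be sketched with plain major/minor pairs — a simplified stand-in for the semver-based implementation, ignoring patch versions as the real check does with `version.IgnorePatchVersion()`:

```go
package main

import "fmt"

type majorMinor struct{ major, minor int }

// cmp compares two versions, ignoring patch versions: -1, 0, or 1.
func cmp(a, b majorMinor) int {
	if a.major != b.major {
		if a.major < b.major {
			return -1
		}
		return 1
	}
	if a.minor < b.minor {
		return -1
	}
	if a.minor > b.minor {
		return 1
	}
	return 0
}

// versionAllowed applies the rule from the doc comment: a control plane
// Machine must satisfy actual <= machine <= desired; a worker Machine must
// satisfy machine <= actual.
func versionAllowed(machine, cpActual, cpDesired majorMinor, isControlPlane bool) bool {
	if isControlPlane {
		return cmp(machine, cpActual) >= 0 && cmp(machine, cpDesired) <= 0
	}
	return cmp(machine, cpActual) <= 0
}

func main() {
	actual := majorMinor{1, 22}
	desired := majorMinor{1, 23}
	fmt.Println(versionAllowed(majorMinor{1, 23}, actual, desired, true))  // true: CP machine mid-upgrade
	fmt.Println(versionAllowed(majorMinor{1, 23}, actual, desired, false)) // false: worker ahead of CP
	fmt.Println(versionAllowed(majorMinor{1, 22}, actual, desired, false)) // true: worker matches CP
}
```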
I wonder if it would make sense to align some of the logic with what we are doing in the topology reconciler when deciding if we want to trigger MD rollouts (`computeMachineDeploymentVersion`).
E.g. when the control plane is currently upgrading (`ControlPlaneContract.IsUpgrading`) we are not triggering new MD upgrades.
Most of the logic does not apply, but I wonder if there are some common parts.
```go
// If the Machine is part of the control plane, then its major.minor version must be greater than or equal to the
// control plane actual major.minor version, and less than or equal to the control plane desired major.minor version.
```
I could be wrong, but could it be that it's possible to downgrade with KCP today? (I only found something saying it's forbidden via ClusterClass.) If I'm correct, we should probably start blocking downgrades in KCP (separately from this PR).
> If I'm correct we should probably start blocking downgrades in KCP? (separately from this PR)

Probably.
The current version skew policy does not mention downgrade. The most recent discussion took place in 2019-2020: kubernetes/website#12327.
@dlipovetsky: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@dlipovetsky Any updates on this PR?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
Closing for now

/close
@vincepri: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What this PR does / why we need it:
Make the Machine controller enforce the Kubernetes version skew policy before reconciling the bootstrap configuration, effectively bootstrapping the Machine.
TODO:
- Handle the absence of the `Status.Version` field in ControlPlaneRef
- `machine_controller_test.go`
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):

Fixes #6040