
🐛 Bootstrap machine only if it conforms to the version skew policy #6044

Conversation

@dlipovetsky (Contributor) commented Feb 2, 2022

What this PR does / why we need it:
Make the Machine controller enforce the Kubernetes version skew policy before reconciling the bootstrap configuration (which effectively bootstraps the Machine).

TODO

  • Handle absence of Status.Version field in ControlPlaneRef.
  • Additional tests in machine_controller_test.go

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #6040
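In short, the policy being enforced can be restated compactly. The sketch below is illustrative only, derived from the doc comments on isVersionAllowed quoted further down in the review; the function names, the semver-based helpers, and the standalone main are assumptions, not code from this PR.

```go
package main

import (
	"fmt"

	"github.com/blang/semver"
)

// allowed restates the skew rule described in this PR's doc comments:
//   - control plane Machine: actual CP major.minor <= machine <= desired CP major.minor
//   - worker Machine:        machine major.minor <= actual CP major.minor
// Illustrative only; not the PR's implementation.
func allowed(machine, cpActual, cpDesired semver.Version, isControlPlaneMachine bool) bool {
	if isControlPlaneMachine {
		return !minorLess(machine, cpActual) && !minorLess(cpDesired, machine)
	}
	return !minorLess(cpActual, machine)
}

// minorLess reports whether a's major.minor is strictly less than b's, ignoring the patch version.
func minorLess(a, b semver.Version) bool {
	return a.Major < b.Major || (a.Major == b.Major && a.Minor < b.Minor)
}

func main() {
	cpActual := semver.MustParse("1.22.5")
	cpDesired := semver.MustParse("1.23.3")
	worker := semver.MustParse("1.23.3")
	// A worker whose major.minor is newer than the running control plane violates the policy.
	fmt.Println(allowed(worker, cpActual, cpDesired, false)) // false
}
```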

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 2, 2022
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign neolit123 after the PR has been reviewed.
You can assign the PR to them by writing /assign @neolit123 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot (Contributor)

@dlipovetsky: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-e2e-main 29cc271 link true /test pull-cluster-api-e2e-main
pull-cluster-api-e2e-informing-ipv6-main 29cc271 link false /test pull-cluster-api-e2e-informing-ipv6-main
pull-cluster-api-e2e-informing-main 29cc271 link false /test pull-cluster-api-e2e-informing-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dlipovetsky (Contributor, Author)

The e2e tests are failing, because the code doesn't handle the absence of the Status.Version field of the Control Plane resource.

The Get API in the contract package makes this difficult: the "not found" error is unexported. While it could be exported, I think the Get method should instead return a boolean (i.e., return (string, bool, error) rather than just an error), since it is normal for optional fields to be absent. The path of the absent field is included in the error, but that same path is already readable using the Path method. I'll submit a separate PR for that.
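A minimal sketch of the signature change proposed above, assuming the accessor keeps an internal field path; the type name, its field, and the error wrapping are stand-ins, not the actual contract package code:

```go
package contract

import (
	"strings"

	"github.com/pkg/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// String is a stand-in for the contract package's string-field accessor.
type String struct {
	path []string
}

// Get returns the field value and whether the field was present, instead of
// signaling absence through an unexported "not found" error.
func (s *String) Get(obj *unstructured.Unstructured) (string, bool, error) {
	value, found, err := unstructured.NestedString(obj.UnstructuredContent(), s.path...)
	if err != nil {
		return "", false, errors.Wrapf(err, "failed to read field .%s", strings.Join(s.path, "."))
	}
	return value, found, nil
}
```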

@enxebre (Member) commented Feb 2, 2022

Thanks @dlipovetsky. Given that topology-controller-managed upgrades should take care of this, I'm wondering whether introducing this rigidity in the Machine controller might have hard-to-predict, undesired impacts for some use cases where I might want to deliberately upgrade a machine beyond the allowed skew and intervene manually if required; e.g., this would prevent testing/predicting/reproducing issues in a skewed-cluster scenario.
Any thoughts?

@joejulian (Contributor)

Topology-controller-managed upgrades have a ways to go and are still experimental. We have an immediate need to allow declarative updates, and this would provide that ability. A user can apply resources to upgrade their cluster, and this would prevent unsupported version skew: the control plane upgrades first, and the workers are not allowed to upgrade until the control plane upgrade is done.

Further, this will prevent users from inadvertently making their situation worse by upgrading a set of worker machines such that they're not supported by the control plane.

Do you have a suggestion, @enxebre, on how you would disable this protection for development purposes? If I were trying to break my cluster I suppose I would just comment out that protection and run it externally.

@dlipovetsky (Contributor, Author)

Thanks @dlipovetsky. Given that topology-controller-managed upgrades should take care of this, I'm wondering whether introducing this rigidity in the Machine controller might have hard-to-predict, undesired impacts for some use cases where I might want to deliberately upgrade a machine beyond the allowed skew and intervene manually if required; e.g., this would prevent testing/predicting/reproducing issues in a skewed-cluster scenario. Any thoughts?

I share your concerns.

Given that topology-controller-managed upgrades should take care of this

I want correct behavior even when I do not use ClusterClass.

I'm wondering whether introducing this rigidity in the Machine controller might have hard-to-predict, undesired impacts

This is a good point. However, I have not yet been able to think of a scenario where the version check presents a problem, especially since the check is skipped when either Machine.Spec.Version or Cluster.Spec.ControlPlaneRef is undefined.

I might want to deliberately upgrade a machine beyond the allowed skew and intervene manually if required; e.g., this would prevent testing/predicting/reproducing issues in a skewed-cluster scenario.

We could make the version check either opt-in or opt-out, e.g., using an annotation.

@neolit123 (Member)

Make the Machine controller enforce the Kubernetes version skew policy before reconciling the bootstrap configuration (which effectively bootstraps the Machine).

what has made sense to me in the past, and what I saw already being the case in kubeadm after joining the project, is the following:

  • error out on users if they are violating version skew policies.
  • give them the power to override the errors if needed for testing, etc.

unclear to me how this can be done in CAPI, but it feels like the best UX.

@dlipovetsky (Contributor, Author)

Make the Machine controller enforce the Kubernetes version skew policy before reconciling the bootstrap configuration (which effectively bootstraps the Machine).

what has made sense to me in the past, and what I saw already being the case in kubeadm after joining the project, is the following:

  • error out on users if they are violating version skew policies.
  • give them the power to override the errors if needed for testing, etc.

Thanks for the feedback! I like this approach.

unclear to me how this can be done in CAPI

I think an annotation on the Machine could work. For example, machine.cluster.x-k8s.io/exclude-version-skew-check, following a similar annotation that allows the user to skip node drain on delete. WDYT?
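A minimal sketch of what the opt-out could look like, assuming the annotation name proposed above (which does not exist in CAPI) and a helper wired into the Machine controller; both the constant and the function are hypothetical:

```go
package machine

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// ExcludeVersionSkewCheckAnnotation is the annotation name proposed in this
// thread; it is hypothetical and not defined by CAPI.
const ExcludeVersionSkewCheckAnnotation = "machine.cluster.x-k8s.io/exclude-version-skew-check"

// skipVersionSkewCheck reports whether the user asked to bypass the skew
// check, mirroring how the node-drain skip annotation is consulted.
func skipVersionSkewCheck(machine *clusterv1.Machine) bool {
	_, ok := machine.GetAnnotations()[ExcludeVersionSkewCheckAnnotation]
	return ok
}
```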

@fabriziopandini (Member)

@dlipovetsky thanks for tackling this problem!
This helps to address the discussion on #5222 as well.

A few considerations on the UX:

  • The proposed implementation makes the Machine reconciliation fail in case a wrong version is applied; I agree this should be the last safeguard, and we should make users aware of why the Machine is not provisioning.
  • Thinking about a better UX, IMO a user would prefer to be informed about the problem while creating a Machine or, given that Machines are often controlled by higher-level abstractions, when creating KCP, MD/MS, or MP (vs. accepting resources with a wrong version and then failing at reconcile time); I agree this is a much bigger change, but on the bright side we now have in place all the bits required for creating "smarter" webhooks, like the one implemented for ClusterClass.

One last consideration: I'm coming around to the idea that surfacing some information about the ControlPlane version on the Cluster could simplify a couple of use cases (I should open an issue and start tracking them).

return true, nil
}

mv, err := semver.ParseTolerant(*machine.Spec.Version)
Member:

Suggested change
- mv, err := semver.ParseTolerant(*machine.Spec.Version)
+ machineVersion, err := semver.ParseTolerant(*machine.Spec.Version)

Comment on lines +528 to +536
controlPlaneActualVersionStr, err := contract.ControlPlane().StatusVersion().Get(controlPlane)
if err != nil {
	return false, errors.Wrap(err, "failed to read control plane actual version")
}
// controlPlaneActualVersionStr is not nil, because Get did not return an error.
controlPlaneActualVersion, err := semver.ParseTolerant(*controlPlaneActualVersionStr)
if err != nil {
	return false, errors.Wrap(err, "failed to parse control plane actual version")
}
Member:

status.version was an optional field that was added later; not sure if it's in the current contract (cc @fabriziopandini), although spec.version should be there. If the current version in status is not available, can we fall back to only looking at the desired one instead?

Member:

If only .spec.version exists and .status.version doesn't, in most cases this means that the first control plane machine isn't up yet. In the topology reconciler we're interpreting this as "control plane is provisioning".
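A rough sketch of that fallback, reading the fields with unstructured helpers so an absent status.version can be detected without relying on the contract package's unexported "not found" error; this is an assumption about how it could be handled, not code from the PR:

```go
package machine

import (
	"github.com/pkg/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// controlPlaneComparableVersion returns status.version when it is set;
// otherwise it falls back to spec.version, treating the control plane as
// still provisioning.
func controlPlaneComparableVersion(controlPlane *unstructured.Unstructured) (string, error) {
	actual, found, err := unstructured.NestedString(controlPlane.UnstructuredContent(), "status", "version")
	if err != nil {
		return "", errors.Wrap(err, "failed to read control plane status.version")
	}
	if found {
		return actual, nil
	}
	// status.version is not set yet: the first control plane machine is likely
	// not up, so compare against the desired version instead.
	desired, found, err := unstructured.NestedString(controlPlane.UnstructuredContent(), "spec", "version")
	if err != nil {
		return "", errors.Wrap(err, "failed to read control plane spec.version")
	}
	if !found {
		return "", errors.New("control plane spec.version is not set")
	}
	return desired, nil
}
```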

Contributor Author:

Thanks for the additional context. I have a TODO to handle the absence of Status.Version.


if util.IsControlPlaneMachine(machine) {
	if version.Compare(mv, controlPlaneDesiredVersion, version.IgnorePatchVersion()) == 1 {
		return false, errors.Errorf("machine major.minor version (%d.%d) must be less than or equal to"+
Member:

Suggested change
- return false, errors.Errorf("machine major.minor version (%d.%d) must be less than or equal to"+
+ return false, errors.Errorf("control plane machine major.minor version (%d.%d) must be less than or equal to"+

Comment on lines +258 to +264
ok, err := r.isVersionAllowed(ctx, m, cluster)
if err != nil {
	return ctrl.Result{}, errors.Wrap(err, "failed to check machine version")
}
if !ok {
	return ctrl.Result{}, errors.New("machine version is not allowed")
}
Member:

I don't think this is right; wouldn't this code block reconciliation of a Machine as soon as the control plane bumps the version?

Member:

In general, we should have checks in a validation webhook; potentially this one could carry a client that looks at the control plane reference.

@sbueringer (Member) commented Feb 4, 2022:

Webhook on which resource? MachineDeployment, MachinePool, MachineSet, Machine or all of them?

I assume only on create, which would address the concern that this check should not run once a Machine already exists (to avoid blocking reconciliation of existing Machines while a control plane upgrade is in progress).

Contributor:

ha I just commented something similar here: #6040 (comment)

IMO the validation should be in the webhook of every object that can't change before the ControlPlane does but that has a Version in spec that can be edited by the user: Machine, MachinePool, MachineDeployment, and MachineSet.

I was thinking we'd want to validate updates too. If a user tries to upgrade a MachinePool or MachineDeployment before upgrading the ControlPlane, that's breaking. What's your concern with blocking reconciliation of existing machines while a control plane upgrade is in progress, @sbueringer? Wouldn't the new version be reflected in the control plane spec if the control plane has already been updated? Also, we wouldn't block reconciliation, just updates to the Machine/MD/MP/MS version.

Member:

For Machines, we need to make sure not to block Machine objects that are being created as part of the reconciliation of machine-based control planes; otherwise the control plane is never going to be rolled out.

Contributor:

Good point. I think we can exclude control plane machines from this validation based on the control plane label?

Member:

sgtm!
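A minimal sketch of the create-time validation discussed above, assuming the webhook has already resolved the control plane's version via the Cluster's controlPlaneRef; the function name, its wiring, and the literal control-plane label value are illustrative, not part of this PR:

```go
package webhooks

import (
	"fmt"

	"github.com/blang/semver"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// validateWorkerMachineVersionSkew rejects a new worker Machine whose
// major.minor is newer than the control plane's. Control plane Machines are
// excluded via the control plane label so that control plane rollouts are
// not blocked.
func validateWorkerMachineVersionSkew(m *clusterv1.Machine, controlPlaneVersion string) error {
	if _, isControlPlane := m.Labels["cluster.x-k8s.io/control-plane"]; isControlPlane {
		return nil
	}
	if m.Spec.Version == nil {
		return nil
	}
	mv, err := semver.ParseTolerant(*m.Spec.Version)
	if err != nil {
		return fmt.Errorf("failed to parse machine version %q: %w", *m.Spec.Version, err)
	}
	cpv, err := semver.ParseTolerant(controlPlaneVersion)
	if err != nil {
		return fmt.Errorf("failed to parse control plane version %q: %w", controlPlaneVersion, err)
	}
	// A worker's major.minor must not be newer than the control plane's.
	if mv.Major > cpv.Major || (mv.Major == cpv.Major && mv.Minor > cpv.Minor) {
		return fmt.Errorf("machine version %s must not have a newer major.minor than control plane version %s",
			*m.Spec.Version, controlPlaneVersion)
	}
	return nil
}
```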

@sbueringer (Member)

One last consideration: I'm coming around to the idea that surfacing some information about the ControlPlane version on the Cluster could simplify a couple of use cases (I should open an issue and start tracking them).

I think this issue covers that: #5341

// control plane actual major.minor version, and less than or equal to the control plane desired major.minor version.
// If the Machine is not part of the control plane, then its major.minor version must be less than or equal to the control
// plane actual major.minor version.
func (r *Reconciler) isVersionAllowed(ctx context.Context, machine *clusterv1.Machine, cluster *clusterv1.Cluster) (bool, error) {
@sbueringer (Member) commented Feb 4, 2022:

I wonder if it would make sense to align some of the logic with what we are doing in the topology reconciler when deciding if we want to trigger MD rollouts (computeMachineDeploymentVersion).

E.g. when the control plane is currently upgrading (ControlPlaneContract.IsUpgrading) we are not triggering new MD upgrades.

Most of the logic does not apply, but I wonder if there are some common parts.
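One plausible reading of that alignment, sketched below under the assumption that "upgrading" means the observed status.version has not yet caught up with the desired spec.version; this approximates the topology reconciler's behavior rather than reusing ControlPlaneContract.IsUpgrading directly:

```go
package machine

import (
	"github.com/pkg/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// controlPlaneIsUpgrading reports whether the control plane's desired version
// differs from its observed version. A missing status.version is treated as
// "provisioning" rather than "upgrading".
func controlPlaneIsUpgrading(controlPlane *unstructured.Unstructured) (bool, error) {
	desired, found, err := unstructured.NestedString(controlPlane.UnstructuredContent(), "spec", "version")
	if err != nil {
		return false, errors.Wrap(err, "failed to read control plane spec.version")
	}
	if !found {
		return false, errors.New("control plane spec.version is not set")
	}
	observed, found, err := unstructured.NestedString(controlPlane.UnstructuredContent(), "status", "version")
	if err != nil {
		return false, errors.Wrap(err, "failed to read control plane status.version")
	}
	if !found {
		return false, nil
	}
	return desired != observed, nil
}
```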

Comment on lines +503 to +504
// If the Machine is part of the control plane, then its major.minor version must be greater than or equal to the
// control plane actual major.minor version, and less than or equal to the control plane desired major.minor version.
@sbueringer (Member) commented Feb 4, 2022:

I could be wrong, but could it be that it's possible to downgrade with KCP today?
(I only found that it's forbidden via ClusterClass.)

If I'm correct we should probably start blocking downgrades in KCP? (separately from this PR)

Contributor Author:

If I'm correct we should probably start blocking downgrades in KCP? (separately from this PR)

Probably.

The current version skew policy does not mention downgrade. The most recent discussion took place in 2019-2020: kubernetes/website#12327.
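If downgrade blocking were added to KCP's update webhook, a version comparison along these lines would be the core of it; this is a hedged sketch under that assumption, not existing KCP behavior, and the function name is made up:

```go
package webhooks

import (
	"fmt"

	"github.com/blang/semver"
)

// validateNoDowngrade rejects an update that lowers the control plane version.
func validateNoDowngrade(oldVersion, newVersion string) error {
	oldV, err := semver.ParseTolerant(oldVersion)
	if err != nil {
		return fmt.Errorf("failed to parse current version %q: %w", oldVersion, err)
	}
	newV, err := semver.ParseTolerant(newVersion)
	if err != nil {
		return fmt.Errorf("failed to parse desired version %q: %w", newVersion, err)
	}
	if newV.LT(oldV) {
		return fmt.Errorf("version %s is a downgrade from %s and is not allowed", newVersion, oldVersion)
	}
	return nil
}
```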

@k8s-ci-robot (Contributor)

@dlipovetsky: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 29, 2022
@vincepri (Member) commented Jun 7, 2022

@dlipovetsky Any updates on this PR?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 5, 2022
@vincepri (Member)

Closing for now

/close

@k8s-ci-robot (Contributor)

@vincepri: Closed this PR.

In response to this:

Closing for now

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
10 participants