Skip to content
This repository has been archived by the owner on Jul 30, 2021. It is now read-only.

*: add backup file option to recover subcommand #528

Merged
merged 3 commits into from
Jun 2, 2017
Merged

*: add backup file option to recover subcommand #528

merged 3 commits into from
Jun 2, 2017

Conversation

xiang90
Copy link
Contributor

@xiang90 xiang90 commented May 18, 2017

/cc @diegs

This is still WIP. But if you want to take a early view. Do it :P. I will keep this updated with my progress, and ping you again when it is finished and tested.

@k8s-ci-robot k8s-ci-robot requested a review from diegs May 18, 2017 23:34
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 18, 2017
Copy link
Contributor

@diegs diegs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet!

@@ -46,8 +48,10 @@ func init() {
cmdRecover.Flags().StringVar(&recoverOpts.etcdCertificatePath, "etcd-certificate-path", "", "Path to an existing certificate that will be used for TLS-enabled communication between the apiserver and etcd. Must be used in conjunction with --etcd-ca-path and --etcd-private-key-path, and must have etcd configured to use TLS with matching secrets.")
cmdRecover.Flags().StringVar(&recoverOpts.etcdPrivateKeyPath, "etcd-private-key-path", "", "Path to an existing private key that will be used for TLS-enabled communication between the apiserver and etcd. Must be used in conjunction with --etcd-ca-path and --etcd-certificate-path, and must have etcd configured to use TLS with matching secrets.")
cmdRecover.Flags().StringVar(&recoverOpts.etcdServers, "etcd-servers", "", "List of etcd server URLs including host:port, comma separated.")
cmdRecover.Flags().StringVar(&recoverOpts.etcdServers, "etcd-backup-file", "", "Path to the etcd backup file.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

&recoverOpts.etcdBackupFile?

@@ -66,6 +70,22 @@ func runCmdRecover(cmd *cobra.Command, args []string) error {
return err
}
backend = recovery.NewEtcdBackend(etcdClient, recoverOpts.etcdPrefix)
case recoverOpts.etcdBackupFile != "":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just move this above the switch, set the etcdServer, and then let it fall througH?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought about it. but the logging will be duplicated. it will look like

Attempting recovery using etcd backup file at ...
Attempting recovery using etcd cluster at ...

but, yea, i will try to clean up code duplication here in another way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a big deal.

}{
// TODO: this already exists in bootkube/cmd.
// do not duplicate this!
Image: "quay.io/coreos/etcd:v3.1.6",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be ok with factoring out an images package.

if err != nil {
return err
}
defer recovery.CleanRecoveryEtcd(recoverOpts.podManifestPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ignoring error returned by function.


const assetPathRecoveryEtcd = "recovery-etcd.yaml"

func StartRecoveryEtcd(p, backupPath string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would put "Backup" somewhere in the name to be clearer.

@@ -31,6 +37,16 @@ func NewEtcdBackend(client *clientv3.Client, pathPrefix string) Backend {
}
}

// NewSelfHostedEtcdBackend constructs a new etcdBackend for the given client and pathPrefix, and backup file.
func NewSelfHostedEtcdBackend(client *clientv3.Client, pathPrefix, backupPath string) Backend {
return &etcdBackend{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this violates the SRP.

Why not create a new type, etcdBackupBackend, which inherits etcdBackend, calls etcdBackend.Read(), and then adds the bootEtcd asset to the results?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do.

@@ -66,6 +70,22 @@ func runCmdRecover(cmd *cobra.Command, args []string) error {
return err
}
backend = recovery.NewEtcdBackend(etcdClient, recoverOpts.etcdPrefix)
case recoverOpts.etcdBackupFile != "":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a big deal.

}()

bootkube.UserOutput("Waiting for etcd server to start...\n")
// TODO: better way to detect the startup...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a poll loop in pkg/util/etcdutil/migrate.go#createBootstrapEtcdService that we can factor out

--data-dir=/var/etcd/data && \
etcdctl \
--endpoints=http://localhost:32379 \
del /registry/ThirdPartyResourceData/etcd.coreos.com/clusters/kube-system/kube-etcd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is slick, but we won't know if it fails. Might have to use the client instead.

Copy link
Contributor Author

@xiang90 xiang90 May 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no where in the code we can do this correctly from my understanding.

we have to delete the TPR after we recovered the boot etcd, but before API server starts to use it.

Maybe you can try to find a better way to do this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original understanding was that you would:
(1) Start temporary etcd from backup, creating a datadir and extracting the manifests
(2) The bootstrap etcd would talk to the datadir you created in (1)

Thus, I thought you'd be able to delete the TPR from the running etcd in (1). However, it looks like (2) also starts up directly from the backup? In that case then I agree that there's no place in code to remove the TPR, so I suppose this should work. I'd say let's go ahead and try it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, it looks like (2) also starts up directly from the backup?

Yes. It was not possible to reuse the same datadir since (2) needs Pod.IP, which is hard to know when we create datadir for (1). But as we start to use a well-known service IP for boot-etcd. It is possible now.

I still choose to use the backup file approach since we can save one flag for bootkube recovery. Or user needs to specify another etcd data dir which will persist after bootkube recover for the bootkube start command.

But it is up to you to decide which we should go.

Copy link
Contributor

@diegs diegs May 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are recovering to a known asset-dir, it would be possible to put the etcd data dir in there (and use the service IP).

If the approach you are using appears to work right now, then I'd say let's stick with it to get something in and add a TODO/open an issue to explore the slightly cleaner approach.

@xiang90 xiang90 changed the title WIP *: add backup file option to recover subcommand *: add backup file option to recover subcommand May 24, 2017
@xiang90
Copy link
Contributor Author

xiang90 commented May 24, 2017

@diegs This works now. Can you take a look? I am trying to extend bootkube-test-recovery to include self hosted case.

if you want to give this a try, you can do:

  1. SELF_HOST_ETCD=true ./bootkube-up

  2. wait the control plane to be up

  3. ssh into c1 and copy /var/etcd/kube-etcd-0000/member/snap/db to local machine

  4. recreate c1

  5. copy db file back to c1 as /home/core/db

  6. sudo ./bootkube recover --asset-dir=/home/core/recovered --etcd-backup-file=/home/core/db --kubeconfig=/home/core/kubeconfig

  7. sudo /home/core/bootkube start --asset-dir=/home/core/recovered

@xiang90
Copy link
Contributor Author

xiang90 commented May 25, 2017

@diegs I added a test script.

Copy link
Contributor

@diegs diegs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Just some minor nits.

cmdRecover.Flags().StringVar(&recoverOpts.etcdPrefix, "etcd-prefix", "/registry", "Path prefix to Kubernetes cluster data in etcd.")
cmdRecover.Flags().StringVar(&recoverOpts.kubeConfigPath, "kubeconfig", "", "Path to kubeconfig for communicating with the cluster.")
cmdRecover.Flags().StringVar(&recoverOpts.podManifestPath, "pod-manifest-path", "/etc/kubernetes/manifests", "The location where the kubelet is configured to look for static pod manifests. (Only need to be set when recovering from a etcd backup file)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add validation that this flag is non-empty iff etcd-backup-file is non-empty (yes, even though it has a default value).

case recoverOpts.etcdBackupFile != "":
bootkube.UserOutput("Attempting recovery using etcd backup file at %q...\n", recoverOpts.etcdBackupFile)

err := recovery.StartRecoveryEtcdForBackup(recoverOpts.podManifestPath, recoverOpts.etcdBackupFile)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: scope errs inside if statements (here and 2 places below)

@xiang90
Copy link
Contributor Author

xiang90 commented May 31, 2017

@diegs Thanks for the review. All fixed.

@diegs
Copy link
Contributor

diegs commented May 31, 2017

@xiang90 awesome! One last thing: can you please update the README? Then I think we're good to merge.

@xiang90
Copy link
Contributor Author

xiang90 commented May 31, 2017

@diegs

One last thing: can you please update the README? Then I think we're good to merge.

Sure. I will do it late today or tomorrow (need to attend coreos fest).

@xiang90
Copy link
Contributor Author

xiang90 commented Jun 1, 2017

@diegs readme updated.

Copy link
Contributor

@diegs diegs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but please address my README comment before merging.

@@ -78,7 +78,13 @@ Recover from a running apiserver (i.e. if the scheduler pods are all down):
bootkube recover --asset-dir=recovered --kubeconfig=/etc/kubernetes/kubeconfig
```

For a complete recovery example please see the [hack/multi-node/bootkube-test-recovery](hack/multi-node/bootkube-test-recovery) script.
Recover from an etcd backup when self hosted etcd is enabled:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update line 61 to add the new supported case, and add a section just like lines 72/78 with a command line example. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the command line example is already added two line below, right? or you want something different? i will update line 61.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, sorry totally misread the diff. That SG, thanks!

@xiang90
Copy link
Contributor Author

xiang90 commented Jun 1, 2017

@diegs

CI failed with:

    run.go:147: creating cluster: error calling NewMachines: <nil> ssh journalctl failed: dial tcp 104.196.152.95:22: getsockopt: connection refused

I do not think this is relevant with my PR. So I am going to merge this PR.

@diegs
Copy link
Contributor

diegs commented Jun 1, 2017

rktbot run tests

@xiang90 xiang90 merged commit 775e3ea into kubernetes-retired:master Jun 2, 2017
@xiang90 xiang90 deleted the t branch June 2, 2017 01:56
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants