This repository has been archived by the owner on Mar 28, 2020. It is now read-only.

Allow EtcdRestore to restore cluster without an existing EtcdCluster #2047

Conversation

furkanmustafa

furkanmustafa commented Feb 6, 2019

Example scenario:

  • My backups are available in object storage, but my etcd cluster no longer exists. I want to restore a backup into a new cluster.

Changes:

  • EtcdRestore.spec.etcdCluster is now ClusterSpec instead of EtcdClusterRef
  • added a Name property to ClusterSpec, to make it compatible with EtcdClusterRef
  • etcd-restore-operator now checks whether the EtcdCluster resource exists; see the sketch below
    • If it exists, the current behavior applies (snapshot the spec, delete the old cluster, create a new one)
    • If not, it uses EtcdRestore.spec.etcdCluster as the cluster spec
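
A minimal sketch of the new decision path, assuming standard generated-clientset conventions; the method and helper names here (handleRestore, restoreExistingCluster, createClusterFromSpec) are illustrative, not the actual identifiers in this PR:

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// handleRestore picks between the two paths based on whether the
// referenced EtcdCluster still exists.
func (r *Restore) handleRestore(er *api.EtcdRestore) error {
	ec, err := r.etcdCRCli.EtcdV1beta2().
		EtcdClusters(er.Namespace).
		Get(er.Spec.EtcdCluster.Name, metav1.GetOptions{})
	switch {
	case err == nil:
		// EtcdCluster exists: keep the current behavior
		// (snapshot its spec, delete it, create a new one).
		return r.restoreExistingCluster(er, ec)
	case apierrors.IsNotFound(err):
		// EtcdCluster is gone: create one from the ClusterSpec
		// now embedded in EtcdRestore.spec.etcdCluster.
		return r.createClusterFromSpec(er)
	default:
		return err
	}
}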

Example EtcdRestore YAML:

apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdRestore"
metadata:
  name: new-etcd-cluster-from-backup
  namespace: etcd-op
spec:
  etcdCluster:
    name: new-etcd-cluster-from-backup
    size: 5
    version: "3.2.13"
    # pod:
    #   persistentVolumeClaimSpec:
    #     storageClassName: standard
    #     accessModes:
    #     - ReadWriteOnce
    #     resources:
    #       requests:
    #         storage: 1Gi
  backupStorageType: S3
  storageType: S3
  s3:
    path: etcd-backups/20190129.backup
    awsSecret: minio-credentials
    endpoint: http://minio.etcd-op.svc.cluster.local:9000
    forcePathStyle: true

I'm not very experienced with Go, so please let me know if any part needs fixing or whether a different approach would be better.

Thanks

@etcd-bot
Collaborator

etcd-bot commented Feb 6, 2019

Can one of the admins verify this patch?

2 similar comments

@hexfusion
Member

@etcd-bot ok to test

@hexfusion
Member

@etcd-bot retest this please

@hexfusion hexfusion modified the milestone: v0.9.5 Apr 3, 2019
@alaypatel07
Collaborator

alaypatel07 commented Apr 3, 2019

@furkanmustafa @hexfusion I do not feel that EtcdCluster should be created by EtcdRestore like this, for the following reasons:

  1. Embedding the EtcdCluster CR in the EtcdRestore CR is not exactly a Kube-native mechanism. When an EtcdCluster CR is created via an EtcdRestore, the same data is stored twice in the Kube API: one copy in the EtcdRestore CR and the other in the EtcdCluster CR. Once the EtcdCluster CR is created, the data in the EtcdRestore CR that represents it becomes a shadow copy and is rendered useless.

  2. I think that writing a shell script that automates creating a blank EtcdCluster CR and then creating an EtcdRestore CR pointing at that blank EtcdCluster CR would be a better way of achieving the same thing without polluting the Kube API with shadow copies of the same data; see the sketch below.
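
For illustration, the two CRs such a script would create might look like this (the names are made up; the EtcdRestore shape follows the current API, as in the example above):

# A blank EtcdCluster that the restore operator can snapshot, delete, and recreate.
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: restored-cluster
spec:
  size: 5
  version: "3.2.13"
---
# An EtcdRestore that points at the blank cluster by name.
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdRestore"
metadata:
  name: restored-cluster
spec:
  etcdCluster:
    name: restored-cluster
  backupStorageType: S3
  s3:
    path: etcd-backups/20190129.backup
    awsSecret: minio-credentials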

However, this leads to a broader topic that I have been wondering about for a while. I think the EtcdRestore Operator is unique in terms of functionality: upon a single successful run, the EtcdRestore CR goes into a Completed state and all further reconciliations of this CR are ignored, much like the Jobs resource. The CR, however, still stays in the system, and reconciliations are still queued uselessly on the controller. I wonder if there is any value in making the EtcdRestore Operator part of the original EtcdCluster Operator. That would handle the case this PR is aimed at, as well as the case of an EtcdCluster CR losing quorum. It might be worth creating a new issue for this discussion if there's enough interest.
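
For context, the Job-like behavior described above boils down to an early return in the sync handler; roughly (the processItem name and Succeeded status field are illustrative, not the operator's actual code):

// Once a restore has succeeded, every later reconciliation of the
// same CR becomes a no-op, much like a completed Job.
func (r *Restore) processItem(er *api.EtcdRestore) error {
	if er.Status.Succeeded { // hypothetical status field
		return nil // already restored once; ignore further events
	}
	return r.handleRestore(er)
}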

@furkanmustafa
Author

@alaypatel07 Thanks for the comments. I think what you say makes sense.

One option is, as you said, making EtcdRestore CRs temporary, like a Job. Then I think it wouldn't be an issue.

There is another option that makes more sense to me, which is having restore-from-backup: {backupName} in the EtcdCluster CR. It renders EtcdRestore a bit useless, but it also makes the usage very clear for the user.
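
For concreteness, the second option might look something like this; restore-from-backup is a hypothetical field that does not exist in the current API:

apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: new-etcd-cluster-from-backup
spec:
  size: 5
  version: "3.2.13"
  restore-from-backup: etcd-backups/20190129.backup  # hypothetical field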

What do you think about the second option?

@alaypatel07
Collaborator

@furkanmustafa The approach you described in the second option is very similar to what I was referring to as making EtcdRestore a part of EtcdCluster in my earlier comment.

And I totally agree with you: it makes the usage very clear for the user. It also lets the EtcdCluster Operator handle the failure scenario gracefully, i.e., if the majority of pods die, it can automatically restore from backup. This is even more effective with the periodic backup feature recently added to the backup operator, and it removes the extra admin intervention of deploying the restore operator.

Having said that, there are significant challenges to implementing it. First, it will significantly increase the complexity of the EtcdCluster Operator's reconciliation logic, since we would be moving the entire restore functionality into the EtcdCluster Operator. My second concern is that an operator will always try to reconcile the cluster to the state specified in the spec. Hence, the EtcdCluster operator would somehow need to determine that it successfully restored the EtcdCluster from a backup during the last reconciliation, skip the restore in the current reconciliation, and still reconcile the other parameters of the spec, like size. While it is a solvable problem, it does need some discussion around the best way to achieve it; see the sketch below.
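
One conceivable shape for that solution is to record the completed restore in status and compare it against the spec on every reconciliation; a rough sketch with hypothetical field names (RestoreFromBackup, RestoredFrom):

// maybeRestore restores at most once per requested backup, then lets
// the rest of the reconciliation (size, version, ...) proceed as usual.
func (c *Cluster) maybeRestore() error {
	want := c.cluster.Spec.RestoreFromBackup // hypothetical spec field
	if want == "" || c.cluster.Status.RestoredFrom == want {
		return nil // nothing requested, or already restored from this backup
	}
	if err := c.restoreFromBackup(want); err != nil {
		return err
	}
	c.cluster.Status.RestoredFrom = want // hypothetical status field
	return c.updateCRStatus()            // persist so the guard holds next time
}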

Since the approach in this PR should be refined and carefully thought out, do you mind closing it in favour of opening an issue stating the requirement of creating a cluster from an existing backup? We can discuss the implementation options in that thread and maybe follow up with an implementation proposal.

@hasbro17
Contributor

Sorry for the late review, but I partly agree with Alay that this warrants more discussion in the form of a proposal before we decide how to support restoring without an existing EtcdCluster object.

I say partly because moving the EtcdBackup/EtcdRestore API back into the EtcdCluster CR would revert to an older architecture of the etcd operator that was redesigned for good reasons.
See #1626 for more details.
In short, the redesign was primarily meant to reduce the complexity of the EtcdCluster API and decouple it from the specifics of the backup/restore implementation for different mediums (S3, Azure, ...).

Also see the older revisions of our docs and EtcdCluster API for more context on how we used to handle backup and restore from a single EtcdCluster CR.

https://github.com/coreos/etcd-operator/blob/0fa2f7db2bec622fe9fce077bd68079df332bb75/doc/user/spec_examples.md#three-members-cluster-that-restores-from-previous-pv-backup

// Restore defines the policy to restore cluster from existing backup if not nil.
// It's not allowed if restore policy is set and backup policy not.
//
// Restore is a cluster initialization configuration. It cannot be updated.
Restore *RestorePolicy `json:"restore,omitempty"`

func (c *Cluster) startSeedMember(recoverFromBackup bool) error {
	m := &etcdutil.Member{
		Name:         etcdutil.CreateMemberName(c.cluster.Name, c.memberCounter),
		Namespace:    c.cluster.Namespace,
		SecurePeer:   c.isSecurePeer(),
		SecureClient: c.isSecureClient(),
	}
	ms := etcdutil.NewMemberSet(m)
	if err := c.createPod(ms, m, "new", recoverFromBackup); err != nil {
		return fmt.Errorf("failed to create seed member (%s): %v", m.Name, err)
	}
	c.memberCounter++
	c.members = ms
	c.logger.Infof("cluster created with seed member (%s)", m.Name)
	return nil
}

So we should discuss this in a proposal since this entails API changes, but I think most likely we'll want to do this by keeping the EtcdRestore API decoupled from the EtcdCluster API.
Going to close this for now.
