This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

vFile: handle swarm node promotion and demotion #1868

Merged 7 commits on Oct 31, 2017


luomiao
Contributor

@luomiao luomiao commented Sep 7, 2017

Resolves #1732

When a node is promoted from worker to manager, the helper thread joins the ETCD cluster according to the swarm information.
Conversely, when the node is demoted from manager to worker, the helper thread should stop the watcher, delete the node from the ETCD member list, and clean up the ETCD data directory.

This is required because, as roles change, the swarm cluster may eventually run out of its original managers, and the ETCD cluster would run out of members along with them.

Manually tested with a 4-node swarm cluster by promoting and demoting one of the nodes multiple times, using etcdctl to verify that the ETCD service is in the correct state after each role change.

msterin
msterin previously requested changes Sep 7, 2017
Contributor

@msterin msterin left a comment

Overall it looks good to my (already untrained) eye, but we should start adding automated testing IN the PRs.
A couple of minor comments are also inside

_, err := exec.Command("/bin/etcd", cmd...).Output()
// leaveEtcdCluster function is called when a manager is demoted
func (e *EtcdKVS) leaveEtcdCluster() error {
nodeAddr := e.nodeAddr
Contributor

not an error, just curious - why not use e.nodeAddr where needed, why the extra vars ?

Contributor Author

I was following a rule of thumb to avoid repeated accesses to a field inside a struct...
But it may not be applicable here.
I can switch to using e.nodeAddr directly.

).Error("Failed to remove member for ETCD ")
return err
}
// the same peerAddr can only join at once. no need to continue.
Contributor

print info ?

}
}

e.etcdStopService()
Contributor

pls log info

Contributor

+1

Contributor Author

etcdStopService already has log info inside.


// etcdStartService function starts an ETCD process
func (e *EtcdKVS) etcdStartService(lines []string) {
cmd := exec.Command("/bin/etcd", lines...)
Contributor

We should use systemd to manage services. Running daemons ourselves means we are in charge of resource allocation and of restarting them on failure. If we do not have a tracking issue for this, please open one.

Contributor Author

Agree.
Issue created: #1873

Contributor

@lipingxue lipingxue left a comment

Overall looks good, only have some comments/questions.

).Error("Failed to list member for ETCD")
return err
}

Contributor

Please add a comment explaining what "peerAddr" is; an example would be helpful.
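As a rough illustration of the kind of example being asked for: in ETCD, a peer address is typically the URL a member advertises to the other members, and 2380 is ETCD's default peer port. The sketch below assumes peerAddr is built from the node's IP address in that form; the `member` type and the `peerAddrFor` and `findMember` helpers are hypothetical and not taken from the PR.

```go
package main

import (
	"fmt"
	"net"
)

// etcdPeerPort is ETCD's default peer port; the PR may use a different one.
const etcdPeerPort = "2380"

// member is a simplified, hypothetical view of an ETCD member list entry.
type member struct {
	ID       uint64
	PeerAddr string
}

// peerAddrFor builds a peer URL such as "http://192.168.1.10:2380"
// from a node address (assumed format).
func peerAddrFor(nodeAddr string) string {
	return "http://" + net.JoinHostPort(nodeAddr, etcdPeerPort)
}

// findMember returns the member whose peer address matches peerAddr.
// A peer address appears at most once, so the first match is enough.
func findMember(members []member, peerAddr string) (member, bool) {
	for _, m := range members {
		if m.PeerAddr == peerAddr {
			return m, true
		}
	}
	return member{}, false
}

func main() {
	members := []member{
		{ID: 1, PeerAddr: peerAddrFor("192.168.1.10")},
		{ID: 2, PeerAddr: peerAddrFor("192.168.1.11")},
	}
	m, ok := findMember(members, peerAddrFor("192.168.1.11"))
	fmt.Println(m.ID, m.PeerAddr, ok)
}
```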

}
}

e.etcdStopService()
Contributor

+1

@@ -74,6 +77,9 @@ type EtcdKVS struct {
dockerOps *dockerops.DockerOps
nodeID string
nodeAddr string
isManager bool
Contributor

Please add comments for those three newly added fields.

go e.etcdWatcher(cli)
go e.serviceAndVolumeGC(cli)
Contributor

Why don't we need to call e.serviceAndVolumeGC here?

Contributor Author

serviceAndVolumeGC has been renamed to etcdHelper, which now also contains the role check.
It has been moved out of checkLocalEtcd and now runs after joinETCD/startETCD, so that the joinETCD function can be reused by etcdHelper itself.
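A minimal sketch of what such a role check might look like, assuming the helper remembers the last known role and compares it with the role swarm currently reports. The `roleHelper` type and its injected `join`/`leave` functions are hypothetical stand-ins, not the PR's actual etcdHelper.

```go
package main

import "fmt"

// roleHelper is a hypothetical stand-in for the helper's state. join and
// leave represent the real join/leave routines, injected so the sketch
// runs on its own.
type roleHelper struct {
	isManager bool
	join      func() error
	leave     func() error
}

// checkRole reacts to the role currently reported by swarm and records it.
func (h *roleHelper) checkRole(nowManager bool) error {
	switch {
	case nowManager && !h.isManager:
		// promoted from worker to manager: join the ETCD cluster
		if err := h.join(); err != nil {
			return fmt.Errorf("join after promotion: %w", err)
		}
	case !nowManager && h.isManager:
		// demoted from manager to worker: leave the ETCD cluster
		if err := h.leave(); err != nil {
			return fmt.Errorf("leave after demotion: %w", err)
		}
	}
	h.isManager = nowManager
	return nil
}

func main() {
	joins, leaves := 0, 0
	h := &roleHelper{
		join:  func() error { joins++; return nil },
		leave: func() error { leaves++; return nil },
	}
	// Simulated sequence of observed roles: worker, manager, manager, worker.
	for _, role := range []bool{false, true, true, false} {
		if err := h.checkRole(role); err != nil {
			fmt.Println("role check failed:", err)
			return
		}
	}
	fmt.Println("joins:", joins, "leaves:", leaves)
}
```

Because the stored role is only updated after a successful join or leave, a failed transition would be retried on the next check; that retry behavior is a design choice of this sketch, not something the conversation confirms.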

}
} else {
if e.isManager {
err = e.leaveEtcdCluster()
Contributor

Add a comment saying "demoted from manager to worker, leave the ETCD cluster".

@luomiao
Contributor Author

luomiao commented Sep 7, 2017

@msterin @lipingxue
Addressed your comments.
Please review again.

@msterin Yes, we should start adding tests. This is a high-priority item for the next month.
We first need to resolve the testbed problem.

@msterin
Contributor

msterin commented Sep 7, 2017

I am not sure what the "testbed problem" is, so I assume it is something preventing you from writing and committing automated tests. In that case, IMO it should be the top priority and the top work item: enable automated testing before doing any feature work that is not automatically tested.

@luomiao
Contributor Author

luomiao commented Oct 26, 2017

@lipingxue
I added an e2e test for this PR.
The new test changes the roles of the manager and worker in the swarm cluster and runs a volume lifecycle test before and after the role change.
There are also some new updates to resolve code conflicts with the master branch.
Please review the new changes accordingly. Thank you!

Contributor

@lipingxue lipingxue left a comment

Overall looks good, and I have a few comments.

// limitations under the License.

// This test suite includes test cases to verify basic functionality
// before upgrade for upgrade test
Contributor

This should be "before and after for promote/demote test", a copy paste issue.

Contributor Author

Thanks for catching that! Will update.


var _ = Suite(&VFileDemotePromoteTestSuite{})

// All VMs are created in a shared datastore
Contributor

The test looks good overall. Can we add an enhancement?

  1. Before promote/demote, after creating and attaching the 1st volume, write some data to the volume.
  2. After promote/demote, read the data back from the 1st volume to make sure the data written in step 1 still exists.

Contributor Author

The read/write tests are already covered by the advanced_vfile_test.
Also, the role change code only affects ETCD; it affects neither the file server nor the internal volumes.
So I think this test should focus on the role change only?


out, err = dockercli.DeleteVolume(s.worker1, s.volName2)
c.Assert(err, IsNil, Commentf(out))

Contributor

I think the code after this line is just there to reset the testbed, right? It is not part of the test itself. If so, please add a comment here.

Contributor Author

I am a little confused here. Below this line we reset the testbed's swarm roles back to their initial state, so that the following tests are not affected.
If we don't put it here, where should we put this reset code?

Contributor

You can keep it here; just add a one-line comment saying that the following code resets the testbed.

Miao Luo added 6 commits October 30, 2017 14:24
When a node is promoted from worker to manager, the helper thread
will join ETCD cluster according to swarm information; On the other
hand, when the node is demoted from manager to worker, the helper
thread should stop the watcher, delete itself from ETCD member list,
and clean up the ETCD data directory.
Contributor

@lipingxue lipingxue left a comment

LGTM

@luomiao luomiao dismissed msterin’s stale review October 31, 2017 07:19

CI test for vFile has been added so this request has been cleared.

@luomiao luomiao merged commit 3aa760a into vmware-archive:master Oct 31, 2017
shuklanirdesh82 pushed a commit to shuklanirdesh82/vsphere-storage-for-docker that referenced this pull request Nov 2, 2017