This repository has been archived by the owner on Sep 19, 2022. It is now read-only.

use init container for worker pod to wait master pod ready #187

Merged
merged 1 commit into from
Jul 25, 2019

Conversation

zlcnju
Contributor

@zlcnju zlcnju commented Jul 8, 2019

Use an init container in the worker pod to wait until the master pod is ready.

Fixes #186.

@k8s-ci-robot

Hi @zlcnju. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@zlcnju
Contributor Author

zlcnju commented Jul 8, 2019

This adds an init container that uses the busybox image by default and waits for the master to be running by checking that the master's DNS address resolves. The init container can be configured in the PyTorch ConfigMap. If agreed, I will add the ConfigMap configuration in the kubeflow project.

@zlcnju zlcnju changed the title use init container for worker pod to wait master pod ready[186](https… use init container for worker pod to wait master pod ready [186](https… Jul 8, 2019
@zlcnju zlcnju changed the title use init container for worker pod to wait master pod ready [186](https… use init container for worker pod to wait master pod ready fix [186](https://github.com/kubeflow/pytorch-operator/issues/186) Jul 8, 2019
@zlcnju zlcnju changed the title use init container for worker pod to wait master pod ready fix [186](https://github.com/kubeflow/pytorch-operator/issues/186) use init container for worker pod to wait master pod ready Jul 8, 2019
- name: init-pytorch
  image: busybox
  command: ['sh', '-c', 'until nslookup {{.MasterAddr}}; do echo waiting for master; sleep 2; done;']`
}


Defining the default init container template as a const string would be better; there is no need to add a function here.

var initContainerTemplate string

func init() {
	bytes, err := ioutil.ReadFile("/etc/config/initContainer.yaml")


Why read the init container template from a file at a hard-coded path? Would it be better to add a new command-line argument that accepts the init container template file?
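One way this suggestion could look, as a rough sketch only: a command-line flag supplies the template file path, and a built-in default is used when the flag is empty. The flag name `init-container-template-file` and the helper `loadInitContainerTemplate` are assumptions made for illustration, not the names used in the merged change.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// defaultInitContainerTemplate is used when no template file is supplied.
const defaultInitContainerTemplate = `- name: init-pytorch
  image: busybox
  command: ['sh', '-c', 'until nslookup {{.MasterAddr}}; do echo waiting for master; sleep 2; done;']
`

// loadInitContainerTemplate returns the contents of path, or the built-in
// default when path is empty.
func loadInitContainerTemplate(path string) (string, error) {
	if path == "" {
		return defaultInitContainerTemplate, nil
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return "", fmt.Errorf("read init container template %q: %w", path, err)
	}
	return string(data), nil
}

func main() {
	// Hypothetical flag name; the real operator may use a different one.
	path := flag.String("init-container-template-file", "",
		"path to the init container template file (empty uses the built-in default)")
	flag.Parse()

	tmpl, err := loadInitContainerTemplate(*path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(tmpl)
}
```

Passing the path as an argument keeps the default behavior in the binary while letting a deployment mount its own template, for example from a ConfigMap, at any location.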

Contributor Author

Modified.

@johnugeorge
Member

/ok-to-test

var initContainerTemplate string

func init() {
	bytes, err := ioutil.ReadFile("/etc/config/initContainer.yaml")
Member

Do we need to expose InitContainerTemplate to users? Is there a particular use case?

Contributor Author

At the very least, the image used by the init container needs to be configurable in a private cloud. Exposing InitContainerTemplate also covers other use cases we have not thought of yet.

@coveralls

coveralls commented Jul 9, 2019

Coverage Status

Coverage remained the same at 85.217% when pulling a762f6d on zlcnju:master into 6aa39a4 on kubeflow:master.

@johnugeorge
Member

/cc @gaocegege
/cc @richardsliu


var initContainerTemplate = `
- name: init-pytorch
  image: busybox
Member

Can we add a tag to the image and set imagePullPolicy to IfNotPresent?

Contributor Author

Modified.

@zlcnju
Contributor Author

zlcnju commented Jul 15, 2019

@johnugeorge @gaocegege any other suggestions?

@johnugeorge
Member

/lgtm

Member

@gaocegege gaocegege left a comment


/lgtm

@gaocegege
Member

@zlcnju
Thanks for your contribution! 🎉 👍

@tossmilestone Thanks for your review!

@richardsliu
Contributor

/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: richardsliu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 6c75b0c into kubeflow:master Jul 25, 2019
@chansonzhang

May I ask which release will include this fix?
If I use the 0.6 release, will I hit the problem described in issue #186?

@johnugeorge
Member

Can you use the latest 0.7 release?

@chansonzhang

> Can you use the latest 0.7 release?

I am using the 0.7.0 image from gcr.io. Does this image contain the fix?

@zlcnju
Contributor Author

zlcnju commented Nov 13, 2019

> > Can you use the latest 0.7 release?
>
> I am using the 0.7.0 image from gcr.io. Does this image contain the fix?

You can check the pods created by the PyTorch job. If the init container exists, the fix is included.

@chansonzhang

> > > Can you use the latest 0.7 release?
> >
> > I am using the 0.7.0 image from gcr.io. Does this image contain the fix?
>
> You can check the pods created by the PyTorch job. If the init container exists, the fix is included.

It does contain the fix. Thank you very much.


Successfully merging this pull request may close these issues.

gang schedule bug
8 participants