Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add JAX controller #2194

Merged
merged 2 commits into from
Sep 20, 2024
Merged

Conversation

sandipanpanda
Copy link
Contributor

@sandipanpanda sandipanpanda commented Aug 5, 2024

What this PR does / why we need it:
Implement JAX controller

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Ref: #1619 #2145

/area gsoc

@sandipanpanda
Copy link
Contributor Author

@coveralls
Copy link

coveralls commented Aug 5, 2024

Pull Request Test Coverage Report for Build 10960403998

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 10937611143: 0.0%
Covered Lines: 66
Relevant Lines: 66

💛 - Coveralls

@shravan-achar
Copy link

shravan-achar commented Aug 14, 2024

@sandipanpanda - Is this PR ready for review?

@tenzen-y
Copy link
Member

@sandipanpanda - Is this PR ready for review?

Actually, no. @sandipanpanda continues trying some implementations.
Regarding the remaining implementations, you can find the weekly sync documentation.

@sandipanpanda
Copy link
Contributor Author

Adding jaxjob webhook test, examples and finishing up some work on test_e2e_jaxjob.py remain. Can you please share your input if the current implementation up until now is in the correct direction?

@andreyvelich
Copy link
Member

@sandipanpanda If this PR is ready for review, please review the WIP from the PR title.

@sandipanpanda
Copy link
Contributor Author

sandipanpanda commented Aug 16, 2024

I am still working on jaxjob webhook test, examples, example image for generate_container() in test_e2e_jaxjob.py. If that needs to be part of this PR, this is still WIP.

@sandipanpanda sandipanpanda force-pushed the jax-controller branch 3 times, most recently from a5e554b to 4c266aa Compare August 28, 2024 20:03
@sandipanpanda sandipanpanda changed the title [WIP] Add JAX controller Add JAX controller Aug 28, 2024
examples/jax/cpu-demo/train.py Show resolved Hide resolved
examples/jax/cpu-demo/train.py Outdated Show resolved Hide resolved
examples/jax/cpu-demo/demo.yaml Outdated Show resolved Hide resolved
examples/jax/cpu-demo/demo.yaml Outdated Show resolved Hide resolved
examples/jax/cpu-demo/demo.yaml Outdated Show resolved Hide resolved
pkg/controller.v1/jax/jaxjob_controller.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/jaxjob_controller.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/jaxjob_controller.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/jaxjob_controller.go Show resolved Hide resolved
pkg/controller.v1/jax/jaxjob_controller.go Outdated Show resolved Hide resolved
@tenzen-y
Copy link
Member

@sandipanpanda Could you address this error? We need to pass the appropriate arguments to e2e testing.

DEBUG kubernetes.client.rest:rest.py:235 response body: FATAL Flags parsing error:
flag --job_name=None: Flag --job_name must have a value other than None.
flag --sub_domain=None: Flag --sub_domain must have a value other than None.
flag --coordinator_port=None: Flag --coordinator_port must have a value other than None.
Pass --helpshort or --helpfull to see help on flags.

DEBUG kubernetes.client.rest:rest.py:235 response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Success","details":{"name":"jaxjob-cpu-ci-test","group":"kubeflow.org","kind":"jaxjobs","uid":"f7402a7c-588a-4b9d-903d-969ba0d4c7e2"}}

https://github.com/kubeflow/training-operator/actions/runs/10753814852/job/29823692795?pr=2194#step:9:5832

@sandipanpanda sandipanpanda force-pushed the jax-controller branch 2 times, most recently from a3303b7 to f957d69 Compare September 10, 2024 20:17
@sandipanpanda
Copy link
Contributor Author

cc @tenzen-y PTAL

spec:
containers:
- name: jax-worker
image: sandipanify/jaxgoogle
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
image: sandipanify/jaxgoogle
image: kubeflow/jaxjob-simple:latest

examples/jax/cpu-demo/Dockerfile Show resolved Hide resolved
Comment on lines +9 to +12
libgoogle-glog-dev \
libgflags-dev \
libprotobuf-dev \
protobuf-compiler \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Library versions as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sandipanpanda Missing unit testing.

pkg/controller.v1/jax/jaxjob_controller_test.go Outdated Show resolved Hide resolved
JAXJOB_PLURAL = "jaxjobs"
JAXJOB_CONTAINER = "jax"
JAXJOB_REPLICA_TYPES = REPLICA_TYPE_WORKER.lower()
JAXJOB_BASE_IMAGE = "docker.io/sandipanify/jaxgloo:latest"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
JAXJOB_BASE_IMAGE = "docker.io/sandipanify/jaxgloo:latest"
JAXJOB_BASE_IMAGE = "kubeflow/jaxjob-simple:latest"

def generate_container() -> V1Container:
return V1Container(
name=CONTAINER_NAME,
image="docker.io/sandipanify/jaxgoogle:latest",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this PR is merged and image is pushed into DockerHub, could you update this image in the followup PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can do that.

@sandipanpanda sandipanpanda force-pushed the jax-controller branch 3 times, most recently from 9effb4a to 7c0563d Compare September 18, 2024 15:27
Add JAX controller, controller tests, webhook validations, examples,
e2e tests for JAXJob

Extend the Training Operator Python SDK to simplify the creation and
management of JAXJob resources.

Signed-off-by: Sandipan Panda <samparksandipan@gmail.com>
pkg/controller.v1/jax/envvar_test.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/envvar_test.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/envvar.go Show resolved Hide resolved
pkg/controller.v1/jax/envvar_test.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/envvar_test.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/envvar_test.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/envvar_test.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/envvar_test.go Outdated Show resolved Hide resolved
pkg/controller.v1/jax/envvar_test.go Outdated Show resolved Hide resolved
Signed-off-by: Sandipan Panda <samparksandipan@gmail.com>
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for a lot of work!
/lgtm
/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 8285aff into kubeflow:master Sep 20, 2024
40 checks passed
@sandipanpanda
Copy link
Contributor Author

Thank you for your unwavering guidance and support throughout this project!

@sandipanpanda sandipanpanda deleted the jax-controller branch September 20, 2024 15:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants