
Support Local Execution of Training Jobs #2231

Open
franciscojavierarceo opened this issue Aug 21, 2024 · 9 comments

Comments

@franciscojavierarceo
Contributor

franciscojavierarceo commented Aug 21, 2024

What you would like to be added?

The Kubeflow Pipelines v2 API supports running and testing pipelines locally without the need for Kubernetes. Ideally, the TrainingClient could also be extended to run locally for both the v1 and forthcoming v2 API.

This is particularly appealing to Data Scientists who may not be familiar with Kubernetes, or who aim to develop and test their training jobs locally for a faster feedback loop.

As a point of comparison, this is what makes Ray's library so easy for data scientists to get started with: their code just works, without their having to think much about Kubernetes.

Why is this needed?

Providing a great developer experience for Data Scientists is extremely valuable for growing adoption and catering to our end users.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

@andreyvelich
Member

andreyvelich commented Aug 28, 2024

Thank you for creating this @franciscojavierarceo.
I think this is a great idea!

Could you explain how KFP runs pipelines locally? Do I need a Docker runtime in my local environment, and do I need a local Kind cluster running?

/area sdk

@andreyvelich
Member

/remove-label lifecycle/needs-triage

@franciscojavierarceo
Contributor Author

franciscojavierarceo commented Aug 28, 2024

They allow for a local SubprocessRunner and a DockerRunner.

The Docker container approach is pretty straightforward (code below), but I actually prefer the subprocess approach, even though the KFP docs recommend the DockerRunner.

I understand why they recommend the Docker-based approach, but the SubprocessRunner is just easier for data scientists: you can pass in a list of packages for the virtual environment that will be created to run the pipeline locally. I think that's probably the lowest-friction way for Data Scientists to get started with Training on Kubeflow (especially those unfamiliar with k8s).

I think the Docker approach or the venv approach is probably all we would need as a start. Pipelines has to handle complex DAG orchestration, whereas Training only needs to execute the train_func, which makes local testing much simpler. We'd have to figure out how best to align a local run with the configuration parameters (e.g., num_workers, resources_per_worker, etc.), but that can be thought through in a spec.
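As a rough sketch of what that could look like (everything here is hypothetical and not part of the current SDK, including the `run_local` name), a subprocess-style local runner could execute the training code with the local interpreter, once per configured worker, with no Kubernetes or Docker involved:

```python
import subprocess
import sys
import tempfile


def run_local(train_code: str, num_workers: int = 1) -> list:
    """Hypothetical local runner sketch: execute training code in fresh
    Python subprocesses, one per "worker", mirroring the idea behind
    KFP's SubprocessRunner (no Kubernetes or Docker required)."""
    exit_codes = []
    for _ in range(num_workers):
        # Write the training code to a temp script and run it.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(train_code)
            path = f.name
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, text=True)
        print(proc.stdout, end="")
        exit_codes.append(proc.returncode)
    return exit_codes
```

In a fuller sketch, a per-run virtual environment (`python -m venv` plus a `pip install` of the user-listed packages) would replace the bare interpreter call, matching KFP's venv behavior.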

Glad to hear you're supportive of this! I'll talk with folks on the team to investigate creating a spec on the implementation. 👍

Kubeflow Pipelines' Docker runner implementation:

# Excerpt from https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/local/docker_task_handler.py
from typing import Any, Dict, List

import docker


def run_docker_container(
    client: 'docker.DockerClient',
    image: str,
    command: List[str],
    volumes: Dict[str, Any],
) -> int:
    image = add_latest_tag_if_not_present(image=image)
    image_exists = any(
        image in existing_image.tags for existing_image in client.images.list())
    if image_exists:
        print(f'Found image {image!r}\n')
    else:
        print(f'Pulling image {image!r}')
        repository, tag = image.split(':')
        client.images.pull(repository=repository, tag=tag)
        print('Image pull complete\n')
    container = client.containers.run(
        image=image,
        command=command,
        detach=True,
        stdout=True,
        stderr=True,
        volumes=volumes,
    )
    for line in container.logs(stream=True):
        # the inner logs should already have trailing \n
        # we do not need to add another
        print(line.decode(), end='')
    return container.wait()['StatusCode']
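
For comparison, a SubprocessRunner-style analogue of the Docker handler above (a hypothetical sketch using only the standard library, not KFP code) can stream logs line by line and return an exit code the same way:

```python
import subprocess
from typing import List


def run_subprocess(command: List[str]) -> int:
    """Run a command locally, streaming its output line by line,
    and return its exit code -- the subprocess counterpart to
    running a container and tailing its logs."""
    proc = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        text=True,
    )
    assert proc.stdout is not None
    for line in proc.stdout:
        # lines already include trailing newlines
        print(line, end="")
    return proc.wait()
```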

@tenzen-y
Member

My root question is: how will we orchestrate multiple Nodes (machines) and multiple Roles (networking and storage) without Kubernetes?
Or will you focus on a single Node and a single Role?

@franciscojavierarceo
Contributor Author

My root question is: how will we orchestrate multiple Nodes (machines) and multiple Roles (networking and storage) without Kubernetes?

We would throw an exception in local mode.

Or will you focus on a single Node and a single Role?

Yes
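
A minimal sketch of that up-front check (the `validate_local_config` name and parameters are hypothetical, chosen only for illustration):

```python
def validate_local_config(num_workers: int = 1) -> None:
    """Hypothetical validation for a local mode: multi-Node / multi-Role
    setups require Kubernetes, so local mode rejects them immediately
    rather than silently running a degraded job."""
    if num_workers > 1:
        raise NotImplementedError(
            "Local mode supports a single Node only; "
            f"got num_workers={num_workers}. Run on Kubernetes instead."
        )
```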

@tenzen-y
Member

Hmm, then the next question is: will you support a single Docker container on a single machine, or multiple Docker containers on a single machine?

@franciscojavierarceo
Contributor Author

single Docker container in a single machine

We'd probably outline the details about this in a tech spec that we would share with the community before doing the implementation.

@tenzen-y
Member

single Docker container in a single machine

We'd probably outline the details about this in a tech spec that we would share with the community before doing the implementation.

That makes sense. I would recommend sharing an outline of this feature, including the scope of support, in the community meeting or in this issue. The actual design often does not align with existing specifications; by sharing the outline before the detailed design, we can avoid situations where the design turns out not to be implementable given the existing specifications.

@franciscojavierarceo
Contributor Author

Agreed!
