
Support Local Execution of Training Jobs #2231

Open
franciscojavierarceo opened this issue Aug 21, 2024 · 9 comments

Comments

@franciscojavierarceo
Contributor

franciscojavierarceo commented Aug 21, 2024

What you would like to be added?

The Kubeflow Pipelines v2 API supports running and testing pipelines locally without the need for Kubernetes. Ideally, the TrainingClient could also be extended to run locally for both the v1 and forthcoming v2 API.

This is particularly appealing to Data Scientists who may not be familiar with Kubernetes, or who aim to develop and test their training jobs locally for a faster feedback loop.

As a point of comparison, this is what makes Ray's library so easy for data scientists to get started with: their code just works, without their having to think much about Kubernetes.

Why is this needed?

Providing a great developer experience for Data Scientists is extremely valuable for growing adoption and catering to our end users.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

@andreyvelich
Member

andreyvelich commented Aug 28, 2024

Thank you for creating this @franciscojavierarceo.
I think this is a great idea!

Could you explain how KFP runs pipelines locally? Do I need a Docker runtime in my local environment, and do I need a local Kind cluster running?

/area sdk

@andreyvelich
Member

/remove-label lifecycle/needs-triage

@franciscojavierarceo
Contributor Author

franciscojavierarceo commented Aug 28, 2024

They allow for a local SubprocessRunner and a DockerRunner.

The Docker container approach is pretty straightforward (code below), but I actually prefer the subprocess approach, even though the KFP docs recommend the DockerRunner.

I understand why they recommend the Docker-based approach, but the SubprocessRunner is just easier for data scientists: you can pass in a list of packages for the virtual environment that will be created to run the pipeline locally. I think that's probably the lowest-friction way for Data Scientists to get started with Training on Kubeflow (especially those unfamiliar with k8s).

I think the Docker approach or the venv approach is probably all we would need as a start. Pipelines has to handle complex DAG orchestration, whereas Training only needs to execute the train_func, which makes local testing much simpler. We'd have to figure out how best to align a local run with the configuration parameters (e.g., num_workers, resources_per_worker, etc.), but that can be thought through in a spec.
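As a rough sketch of what that could look like (everything here is hypothetical and not part of the current SDK, including the `run_local` name), a subprocess-style local runner could execute the training code with the local interpreter, once per configured worker, with no Kubernetes or Docker involved:

```python
import subprocess
import sys
import tempfile


def run_local(train_code: str, num_workers: int = 1) -> list:
    """Hypothetical local runner sketch: execute training code in fresh
    Python subprocesses, one per "worker", mirroring the idea behind
    KFP's SubprocessRunner (no Kubernetes or Docker required)."""
    exit_codes = []
    for _ in range(num_workers):
        # Write the training code to a temp script and run it.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(train_code)
            path = f.name
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, text=True)
        print(proc.stdout, end="")
        exit_codes.append(proc.returncode)
    return exit_codes
```

In a fuller sketch, a per-run virtual environment (`python -m venv` plus a `pip install` of the user-listed packages) would replace the bare interpreter call, matching KFP's venv behavior.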

Glad to hear you're supportive of this! I'll talk with folks on the team to investigate creating a spec on the implementation. 👍

Kubeflow Pipelines' Docker runner implementation:

# Excerpt from https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/local/docker_task_handler.py
from typing import Any, Dict, List

import docker


def run_docker_container(
    client: 'docker.DockerClient',
    image: str,
    command: List[str],
    volumes: Dict[str, Any],
) -> int:
    image = add_latest_tag_if_not_present(image=image)
    image_exists = any(
        image in existing_image.tags for existing_image in client.images.list())
    if image_exists:
        print(f'Found image {image!r}\n')
    else:
        print(f'Pulling image {image!r}')
        repository, tag = image.split(':')
        client.images.pull(repository=repository, tag=tag)
        print('Image pull complete\n')
    container = client.containers.run(
        image=image,
        command=command,
        detach=True,
        stdout=True,
        stderr=True,
        volumes=volumes,
    )
    for line in container.logs(stream=True):
        # the inner logs should already have trailing \n
        # we do not need to add another
        print(line.decode(), end='')
    return container.wait()['StatusCode']
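
For comparison, a SubprocessRunner-style analogue of the Docker handler above (a hypothetical sketch using only the standard library, not KFP code) can stream logs line by line and return an exit code the same way:

```python
import subprocess
from typing import List


def run_subprocess(command: List[str]) -> int:
    """Run a command locally, streaming its output line by line,
    and return its exit code -- the subprocess counterpart to
    running a container and tailing its logs."""
    proc = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        text=True,
    )
    assert proc.stdout is not None
    for line in proc.stdout:
        # lines already include trailing newlines
        print(line, end="")
    return proc.wait()
```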

@tenzen-y
Member

My root question is: how will we orchestrate multiple Nodes (machines) and multiple Roles (networking and storage) without Kubernetes?
Or will you focus on a single Node and a single Role?

@franciscojavierarceo
Contributor Author

My root question is: how will we orchestrate multiple Nodes (machines) and multiple Roles (networking and storage) without Kubernetes?

We would throw an exception in local mode.

Or will you focus on a single Node and a single Role?

Yes
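
A minimal sketch of that up-front check (the `validate_local_config` name and parameters are hypothetical, chosen only for illustration):

```python
def validate_local_config(num_workers: int = 1) -> None:
    """Hypothetical validation for a local mode: multi-Node / multi-Role
    setups require Kubernetes, so local mode rejects them immediately
    rather than silently running a degraded job."""
    if num_workers > 1:
        raise NotImplementedError(
            "Local mode supports a single Node only; "
            f"got num_workers={num_workers}. Run on Kubernetes instead."
        )
```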

@tenzen-y
Member

Hmm, then the next question is: will you support a single Docker container on a single machine, or multiple Docker containers on a single machine?

@franciscojavierarceo
Contributor Author

single Docker container in a single machine

We'd probably outline the details about this in a tech spec that we would share with the community before doing the implementation.

@tenzen-y
Member

single Docker container in a single machine

We'd probably outline the details about this in a tech spec that we would share with the community before doing the implementation.

That makes sense. I would recommend sharing an outline of this feature, including the scope of support, in the community meeting or in this issue. The actual design often does not align with existing specifications; by sharing the outline before the detailed design, we can avoid situations where the design turns out not to be implementable given the existing specifications.

@franciscojavierarceo
Contributor Author

Agreed!
