update KFP docs #189

Merged: 4 commits, May 27, 2024
50 changes: 22 additions & 28 deletions README.md
@@ -24,7 +24,6 @@ The goal is to offer high-level APIs for developers to quickly get started in wo

## 📝 Table of Contents
- [About](#about)
- [Setup](#setup)
- [Getting Started](#getting_started)
- [How to Contribute](#contribute_steps)
- [Acknowledgments](#acknowledgement)
@@ -40,7 +39,7 @@ Eventually, Data Prep Kit will offer consistent APIs and configurations across t
1. Python runtime
2. Ray runtime (local and distributed)
3. Spark runtime (local and distributed)
4. [No-code pipelines with KFP](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (local and distributed, wrapping Ray)
4. [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (local and distributed, wrapping Ray)

The current matrix for the combination of modules and supported runtimes is shown in the table below.
Contributors are welcome to add new modules as well as add runtime support for existing modules!
@@ -66,7 +65,7 @@ Features of the toolkit:
- It offers a growing set of module implementations across multiple runtimes, targeting laptop-scale to datacenter-scale processing.
- It provides a growing set of sample pipelines developed for real enterprise use cases.
- It provides the [Data processing library](data-processing-lib/ray) to enable contribution of new custom modules targeting new use cases.
- It uses [Kube Flow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/)-based [workflow automation](kfp/doc/simple_transform_pipeline.md) for no-code data prep.
- It uses [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/)-based [workflow automation](kfp/doc/simple_transform_pipeline.md).

Data modalities supported:

@@ -97,7 +96,8 @@ A general purpose [SQL-based filter transform](transforms/universal/filter) enab
For a new module to be added, a user can pick the right design based on the processing to be applied. More details [here](transforms).

#### Scaling of Transforms
To enable processing of large data volumes leveraging multi-node clusters, [Ray](https://docs.ray.io/en/latest/index.html) and [Spark](https://spark.apache.org) wrappers are provided, to readily scale out the Python implementations.
To enable processing of large data volumes leveraging multi-node clusters, [Ray](https://docs.ray.io/en/latest/index.html)
or [Spark](https://spark.apache.org) wrappers are provided to readily scale out the Python implementations.
A generalized workflow is shown [here](doc/data-processing.md).
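To make the scaling pattern concrete, here is a minimal sketch using plain Ray; the function and file names are illustrative and this is not the toolkit's actual wrapper API:

```python
import ray

ray.init()  # starts a local Ray cluster; pass an address for a remote one

@ray.remote
def apply_transform(file_path: str) -> str:
    # placeholder for real per-file transform logic
    return f"processed {file_path}"

# the same Python logic fans out across however many workers Ray provides
files = ["part-0.parquet", "part-1.parquet", "part-2.parquet"]
print(ray.get([apply_transform.remote(f) for f in files]))
```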

#### Bring Your Own Transform
@@ -107,29 +107,26 @@ More details on the data processing library are [here](data-processing-lib/doc/o
#### Automation
The toolkit also supports transform execution automation based on
[Kubeflow pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (KFP),
tested on [Kind clusters](https://kind.sigs.k8s.io/). The KFP implementation is based on the [KubeRay Operator](https://docs.ray.io/en/master/cluster/kubernetes/getting-started.html)
tested on a locally deployed [Kind cluster](https://kind.sigs.k8s.io/) and external OpenShift clusters. Automation is
provided to create a Kind cluster and deploy all required components on it.
The KFP implementation is based on the [KubeRay Operator](https://docs.ray.io/en/master/cluster/kubernetes/getting-started.html)
for creating and managing the Ray cluster and [KubeRay API server](https://github.com/ray-project/kuberay/tree/master/apiserver)
to interact with the KubeRay operator. An additional [framework](kfp/kfp_support_lib) along with several
[kfp components](kfp/kfp_ray_components) is used to simplify the pipeline implementation.
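As an illustration only, a KFP v1 pipeline wrapping a transform in a single container step might look like the sketch below; the image name and argument are assumptions, not the toolkit's real components:

```python
import kfp.dsl as dsl
import kfp.compiler as compiler

@dsl.pipeline(name="transform-pipeline", description="Run one transform on a Ray cluster")
def transform_pipeline(data_s3_config: str = ""):
    # a single containerized step; a real pipeline would use the kfp_ray_components
    dsl.ContainerOp(
        name="execute-transform",
        image="example.org/kfp-ray-component:latest",  # hypothetical image
        arguments=["--data_s3_config", data_s3_config],
    )

if __name__ == "__main__":
    compiler.Compiler().compile(transform_pipeline, "transform_pipeline.yaml")
```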

## &#x1F680; Getting Started <a name = "getting_started"></a>

## &#x2699; Setup <a name = "setup"></a>

We tried the project on different hardware/software configurations (see [Apple/Mac considerations](doc/mac.md)).
We recommend using a laptop with at least 16GB of memory and 8 CPU cores for development without KFP,
and at least 32GB and preferably 16 CPU cores if you plan to run KFP on Kind.

### Prerequisites

* Python 3.10 or 3.11
* Docker/Podman

Two important tools will also be installed using the steps below:
* [pre-commit](https://pre-commit.com/)
* [twine](https://twine.readthedocs.io/en/stable/)
There are various entry points that you can choose based on the use case. Each entry point has its own prerequisites and setup steps.
The common parts are:
#### Prerequisites
- Python 3.10 or 3.11
- Docker/Podman

### Installation Steps
Two important development tools will also be installed using the steps below:
- [pre-commit](https://pre-commit.com/)
- [twine](https://twine.readthedocs.io/en/stable/)

#### Installation Steps
```shell
pip install pre-commit
pip install twine
@@ -138,11 +135,7 @@ git clone git@github.com:IBM/data-prep-kit.git
cd data-prep-kit
pre-commit install
```

## &#x1F680; Getting Started <a name = "getting_started"></a>

There are various entry points that you can choose based on the use case. Below are a few demos to get you started.

Below are a few demos to get you started.
### Build Your Own Transforms
Follow the documentation [here](data-processing-lib/doc/overview.md) to build your own transform
and run it in either the python or Ray runtimes.
@@ -152,11 +145,12 @@ Get started by running the "noop" transform that performs an identity operation
[tutorial](data-processing-lib/doc/simplest-transform-tutorial.md) and associated
[noop implementation](transforms/universal/noop).
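For intuition, a noop-style transform is essentially an identity function over a table. The sketch below is illustrative and does not use the library's actual base classes:

```python
import pyarrow as pa

class NoopTransform:
    """Identity transform: returns the input table unchanged."""
    def transform(self, table: pa.Table) -> pa.Table:
        return table

table = pa.table({"text": ["hello", "world"]})
assert NoopTransform().transform(table).equals(table)
```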

### Run a Data Pipeline on Local Ray
Get started by building a data pipeline with our [example pipeline](./examples/) that can run on a laptop. To test this pipeline, you can download this repo as a zip file and get started.
### Run a Jupyter notebook on a local Ray cluster
Get started by building a Jupyter notebook that executes a sequence of transforms with our [example pipeline](./examples/)
that can run on your machine. This implementation can also be extended to connect to a remote Ray cluster.
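A first notebook cell typically just brings up (or connects to) Ray; `ray.init()` with no arguments starts a local cluster, and an address such as `ray://<head-node>:10001` (placeholder) would target a remote one:

```python
import ray

ray.init()                      # local cluster; use address="ray://..." for remote
print(ray.cluster_resources())  # sanity check: CPUs/memory visible to Ray
ray.shutdown()
```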

### Automate a Pipeline
The data preprocessing can be automated by running transformers as a KubeFlow pipeline (KFP).
The data preprocessing can be automated by running transformers as a Kubeflow pipeline (KFP).
See this simple transform pipeline [tutorial](kfp/doc/simple_transform_pipeline.md). See the [multi-step pipeline](kfp/doc/multi_transform_pipeline.md)
if you want to combine several data transformation steps.
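Once a pipeline is compiled, it can also be submitted programmatically. A hedged sketch, assuming the KFP endpoint is reachable at a placeholder URL and the compiled file name is illustrative:

```python
import kfp

client = kfp.Client(host="http://localhost:8080")  # placeholder endpoint
client.create_run_from_pipeline_package(
    "transform_pipeline.yaml",            # compiled pipeline file (assumed name)
    arguments={"data_s3_config": "..."},  # pipeline parameters
)
```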

6 changes: 6 additions & 0 deletions kfp/README.md
@@ -0,0 +1,6 @@
# Automation with Kubeflow Pipelines

- [Set up a Kubernetes cluster for KFP execution](./doc/setup.md)
- [Simple Transform pipeline tutorial](./doc/simple_transform_pipeline.md)
- [Executing several transforms](./doc/multi_transform_pipeline.md)
- [Clean up the cluster](./doc/setup.md#cleanup)
2 changes: 1 addition & 1 deletion kfp/doc/multi_transform_pipeline.md
@@ -24,7 +24,7 @@ In the list of its input parameters, we also see `data_s3_config`. Now, we have
![param list](param_list2.png)


**Note** An example super pipeline that combines several transforms, `doc_id`, `ededup`, and `fdedup`, can be found in [superworkflow_dedups_sample_wf.py](../transform_workflows/superworkflows/superworkflow_dedups_sample_wf.py).
**Note** An example super pipeline that combines several transforms, `doc_id`, `ededup`, and `fdedup`, can be found in [superworkflow_dedups_sample_wf.py](../superworkflows/v1/superworkflow_dedups_sample_wf.py).

![super pipeline](super_pipeline.png)

126 changes: 126 additions & 0 deletions kfp/doc/setup.md
@@ -0,0 +1,126 @@
# Set up a Kubernetes cluster for KFP execution

## 📝 Table of Contents
- [Supported platforms for a Kind deployment](#kind_platforms)
- [Preinstalled software components](#preinstalled)
- [A Kind deployment](#kind)
- [An existing cluster](#existing_cluster)
- [Installation steps](#installation)
- [Installation on an existing Kubernetes cluster](#installation_existing)
- [Clean up the cluster](#cleanup")

The project provides instructions and deployment automation to run all components in an all-inclusive fashion on a
single machine, using a [Kind cluster](https://kind.sigs.k8s.io/) and local data storage ([MinIO](https://min.io/)).
However, this topology is not suitable for processing medium and large datasets, so we recommend using a Kind cluster
only for local testing and debugging, not for production loads. For production loads, use a real Kubernetes or
OpenShift cluster.

Running a Kind Kubernetes cluster with Kubeflow Pipelines (KFP) and MinIO requires significant
memory. We recommend deploying it on machines with at least 32 GB of RAM and 8-9 CPU cores. RHEL requires
more resources, e.g., 64 GB of RAM and 32 CPU cores.

## Supported platforms for a Kind deployment <a name = "kind_platforms"></a>
Executing KFP, MinIO, and Ray on a single Kind cluster pushes the system to its load limits. Therefore, although we are
working on extending support for additional platforms, not all platforms/configurations are currently supported.

| Operating System | Container Agent | Support | Comments |
|:----------------:|:---------------:|:-------:|:--------:|
| RHEL 7           | any             | -       | Kind [doesn't support](https://github.com/kubernetes-sigs/kind/issues/3311) RHEL 7 |
| RHEL 8           |                 |         | |
| RHEL 9.4         | Docker          | Yes     | |
| RHEL 9.4         | Podman          | No      | Issues with Ray job executions |
| Ubuntu 24.04     | Docker          | Yes     | |
| Ubuntu 24.04     | Podman          |         | |
| Windows WSL2     | Docker          | Yes     | |
| Windows WSL2     | Podman          |         | |
| MacOS amd64      | Docker          | Yes     | |
| MacOS amd64      | Podman          |         | |
| MacOS arm64      | Docker          |         | |
| MacOS arm64      | Podman          | No      | Issues with Ray job executions |

## Preinstalled software components <a name = "preinstalled"></a>

Depending on whether a Kind cluster or an existing Kubernetes cluster is used, different software packages need to be preinstalled.

### Kind deployment <a name = "kind"></a>
The following programs should be manually installed:

- [Helm](https://helm.sh/docs/intro/install/) 3.10.0 or greater must be installed and configured on your machine.
- [Kind](https://kind.sigs.k8s.io/docs/user/quick-start/#installation) tool for running local Kubernetes clusters 0.14.0 or newer must be installed on your machine.
- [Kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) 1.26 or newer must be installed on your machine.
- [MinIO Client (mc)](https://min.io/docs/minio/kubernetes/upstream/index.html) must be installed on your machine. Please
choose your OS and proceed according to "(Optional) Install the MinIO Client". You only have to install the `mc` client.
- [git client](https://git-scm.com/downloads); we use the git client to clone the installation repository.
- [lsof](https://www.ionos.com/digitalguide/server/configuration/linux-lsof/); it is usually part of Linux or MacOS distributions.
- A container agent such as [Docker](https://www.docker.com/) or [Podman](https://podman-desktop.io/)
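A quick way to verify that the tools are on your PATH (the flags below are the common ones; exact output varies by version):

```shell
helm version
kind version
kubectl version --client
mc --version
git --version
docker --version   # or: podman --version
lsof -v
```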

### Existing Kubernetes cluster <a name = "existing_cluster"></a>
Deployment on an existing cluster requires less preinstalled software.
Only the following programs should be manually installed:

- [Helm](https://helm.sh/docs/intro/install/) 3.10.0 or greater must be installed and configured on your machine.
- [Kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) 1.26 or newer must be installed on your machine, and be
able to connect to the external cluster.
- Deployment of the test data requires the [MinIO Client (mc)](https://min.io/docs/minio/kubernetes/upstream/index.html). Please
choose your OS and proceed according to "(Optional) Install the MinIO Client". Only the `mc` client should be installed.

## Installation steps <a name = "installation"></a>

You can create a Kind cluster with all required software installed using the following command:

```shell
make setup
```
from this main package directory or from the `kind` directory.
If you do not want to upload the testing data into the locally deployed MinIO and want to reduce the memory footprint, set:
```bash
export POPULATE_TEST_DATA=0
```

### Installation on an existing Kubernetes cluster <a name = "installation_existing"></a>
Alternatively, you can deploy the pipeline to an existing Kubernetes cluster.

In order to execute data transformers on the remote Kubernetes cluster, the following packages should be installed on the cluster:

- [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (KFP). Currently, we use the
upstream Argo-based KFP v1.
- [KubeRay](https://docs.ray.io/en/latest/cluster/kubernetes/index.html) controller and
[KubeRay API Server](https://ray-project.github.io/kuberay/components/apiserver/)

You can install the software from their repositories, or you can use our installation scripts.

If your local kubectl is configured to connect to the external cluster, do the following:
```bash
export EXTERNAL_CLUSTER=1
make setup
```

- In addition, you should configure external access to the KFP UI (`svc/ml-pipeline-ui` in the `kubeflow` ns) and the Ray
Server API (`svc/kuberay-apiserver-service` in the `kuberay` ns). Depending on your cluster and its deployment, these can be
LoadBalancer services, Ingresses, or Routes; one option for local access is port-forwarding, as sketched below.
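A hedged sketch of the port-forwarding option; the local ports are arbitrary choices, and the service ports may differ in your deployment:

```shell
# forward the KFP UI and the KubeRay API server to localhost
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
kubectl port-forward -n kuberay svc/kuberay-apiserver-service 8888:8888
```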

- Optionally, you can upload the test data into the [MinIO](https://min.io/) Object Store, deployed as part of KFP. In
order to do this, please provide external access to MinIO (`svc/minio-service` in the `kubeflow` ns) and execute the
following commands:
```shell
export MINIO_SERVER=<Minio external URL>
kubectl apply -f kind/hack/s3_secret.yaml
kind/hack/populate_minio.sh
```

## Clean up the cluster <a name = "cleanup"></a>
If you use an external Kubernetes cluster, set the `EXTERNAL_CLUSTER` environment variable.

```shell
export EXTERNAL_CLUSTER=1
```
Now, you can clean up the external or Kind Kubernetes cluster by running the following command:

```shell
make clean
```

**Note** that this command has to run from the project's `kind` subdirectory; from the root directory, the command is:
```shell
make -C kind clean
```