update KFP docs #189

Merged: 4 commits, May 27, 2024
50 changes: 22 additions & 28 deletions README.md
@@ -24,7 +24,6 @@ The goal is to offer high-level APIs for developers to quickly get started in wo

## 📝 Table of Contents
- [About](#about)
- [Setup](#setup)
- [Getting Started](#getting_started)
- [How to Contribute](#contribute_steps)
- [Acknowledgments](#acknowledgement)
@@ -40,7 +39,7 @@ Eventually, Data Prep Kit will offer consistent APIs and configurations across t
1. Python runtime
2. Ray runtime (local and distributed)
3. Spark runtime (local and distributed)
4. [No-code pipelines with KFP](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (local and distributed, wrapping Ray)
4. [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (local and distributed, wrapping Ray)

The current matrix for the combination of modules and supported runtimes is shown in the table below.
Contributors are welcome to add new modules as well as add runtime support for existing modules!
@@ -66,7 +65,7 @@ Features of the toolkit:
- It offers a growing set of module implementations across multiple runtimes, targeting laptop-scale to datacenter-scale processing.
- It provides a growing set of sample pipelines developed for real enterprise use cases.
- It provides the [Data processing library](data-processing-lib/ray) to enable contribution of new custom modules targeting new use cases.
- It uses [Kube Flow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/)-based [workflow automation](kfp/doc/simple_transform_pipeline.md) for no-code data prep.
- It uses [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/)-based [workflow automation](kfp/doc/simple_transform_pipeline.md).

Data modalities supported:

@@ -97,7 +96,8 @@ A general purpose [SQL-based filter transform](transforms/universal/filter) enab
For a new module to be added, a user can pick the right design based on the processing to be applied. More details [here](transforms).

#### Scaling of Transforms
To enable processing of large data volumes leveraging multi-node clusters, [Ray](https://docs.ray.io/en/latest/index.html) and [Spark](https://spark.apache.org) wrappers are provided, to readily scale out the Python implementations.
To enable processing of large data volumes leveraging multi-node clusters, [Ray](https://docs.ray.io/en/latest/index.html)
or [Spark](https://spark.apache.org) wrappers are provided to readily scale out the Python implementations.
A generalized workflow is shown [here](doc/data-processing.md).
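To make the scaling pattern concrete, here is a minimal sketch using plain Ray; the function and file names are illustrative and this is not the toolkit's actual wrapper API:

```python
import ray

ray.init()  # starts a local Ray cluster; pass an address for a remote one

@ray.remote
def apply_transform(file_path: str) -> str:
    # placeholder for real per-file transform logic
    return f"processed {file_path}"

# the same Python logic fans out across however many workers Ray provides
files = ["part-0.parquet", "part-1.parquet", "part-2.parquet"]
print(ray.get([apply_transform.remote(f) for f in files]))
```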

#### Bring Your Own Transform
@@ -107,29 +107,26 @@ More details on the data processing library are [here](data-processing-lib/doc/o
#### Automation
The toolkit also supports transform execution automation based on
[Kubeflow pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (KFP),
tested on [Kind clusters](https://kind.sigs.k8s.io/). The KFP implementation is based on the [KubeRay Operator](https://docs.ray.io/en/master/cluster/kubernetes/getting-started.html)
tested on a locally deployed [Kind cluster](https://kind.sigs.k8s.io/) and external OpenShift clusters. Automation is
provided to create a Kind cluster and deploy all required components on it.
The KFP implementation is based on the [KubeRay Operator](https://docs.ray.io/en/master/cluster/kubernetes/getting-started.html)
for creating and managing the Ray cluster and [KubeRay API server](https://github.com/ray-project/kuberay/tree/master/apiserver)
to interact with the KubeRay operator. An additional [framework](kfp/kfp_support_lib) along with several
[kfp components](kfp/kfp_ray_components) is used to simplify the pipeline implementation.
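As an illustration only, a KFP v1 pipeline wrapping a transform in a single container step might look like the sketch below; the image name and argument are assumptions, not the toolkit's real components:

```python
import kfp.dsl as dsl
import kfp.compiler as compiler

@dsl.pipeline(name="transform-pipeline", description="Run one transform on a Ray cluster")
def transform_pipeline(data_s3_config: str = ""):
    # a single containerized step; a real pipeline would use the kfp_ray_components
    dsl.ContainerOp(
        name="execute-transform",
        image="example.org/kfp-ray-component:latest",  # hypothetical image
        arguments=["--data_s3_config", data_s3_config],
    )

if __name__ == "__main__":
    compiler.Compiler().compile(transform_pipeline, "transform_pipeline.yaml")
```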

## &#x1F680; Getting Started <a name = "getting_started"></a>

## &#x2699; Setup <a name = "setup"></a>

We tried the project on different hardware/software configurations (see [Apple/Mac considerations](doc/mac.md)).
We recommend using a laptop with at least 16GB of memory and 8 CPU cores for development without KFP,
and at least 32GB and preferably 16 CPU cores if you plan to run KFP on Kind.

### Prerequisites

* Python 3.10 or 3.11
* Docker/Podman

Two important tools will also be installed using the steps below:
* [pre-commit](https://pre-commit.com/)
* [twine](https://twine.readthedocs.io/en/stable/)
There are various entry points that you can choose based on the use case. Each entry point has its own prerequisites and setup steps.
The common parts are:
#### Prerequisites
- Python 3.10 or 3.11
- Docker/Podman

### Installation Steps
Two important development tools will also be installed using the steps below:
- [pre-commit](https://pre-commit.com/)
- [twine](https://twine.readthedocs.io/en/stable/)

#### Installation Steps
```shell
pip install pre-commit
pip install twine
@@ -138,11 +135,7 @@ git clone git@github.com:IBM/data-prep-kit.git
cd data-prep-kit
pre-commit install
```

## &#x1F680; Getting Started <a name = "getting_started"></a>

There are various entry points that you can choose based on the use case. Below are a few demos to get you started.

Below are a few demos to get you started.
### Build Your Own Transforms
Follow the documentation [here](data-processing-lib/doc/overview.md) to build your own transform
and run it in either the python or Ray runtimes.
@@ -152,11 +145,12 @@ Get started by running the "noop" transform that performs an identity operation
[tutorial](data-processing-lib/doc/simplest-transform-tutorial.md) and associated
[noop implementation](transforms/universal/noop).
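For intuition, a noop-style transform is essentially an identity function over a table. The sketch below is illustrative and does not use the library's actual base classes:

```python
import pyarrow as pa

class NoopTransform:
    """Identity transform: returns the input table unchanged."""
    def transform(self, table: pa.Table) -> pa.Table:
        return table

table = pa.table({"text": ["hello", "world"]})
assert NoopTransform().transform(table).equals(table)
```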

### Run a Data Pipeline on Local Ray
Get started by building a data pipeline with our [example pipeline](./examples/) that can run on a laptop. To test this pipeline, you can download this repo as a zip file and get started.
### Run a Jupyter notebook on a local Ray cluster
Get started by building a Jupyter notebook that executes a sequence of transforms with our [example pipeline](./examples/)
that can run on your machine. This implementation can also be extended to connect to a remote Ray cluster.
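A first notebook cell typically just brings up (or connects to) Ray; `ray.init()` with no arguments starts a local cluster, and an address such as `ray://<head-node>:10001` (placeholder) would target a remote one:

```python
import ray

ray.init()                      # local cluster; use address="ray://..." for remote
print(ray.cluster_resources())  # sanity check: CPUs/memory visible to Ray
ray.shutdown()
```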

### Automate a Pipeline
The data preprocessing can be automated by running transformers as a KubeFlow pipeline (KFP).
The data preprocessing can be automated by running transformers as a Kubeflow pipeline (KFP).
See this simple transform pipeline [tutorial](kfp/doc/simple_transform_pipeline.md). See the [multi-step pipeline](kfp/doc/multi_transform_pipeline.md)
if you want to combine several data transformation steps.
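Once a pipeline is compiled, it can also be submitted programmatically. A hedged sketch, assuming the KFP endpoint is reachable at a placeholder URL and the compiled file name is illustrative:

```python
import kfp

client = kfp.Client(host="http://localhost:8080")  # placeholder endpoint
client.create_run_from_pipeline_package(
    "transform_pipeline.yaml",            # compiled pipeline file (assumed name)
    arguments={"data_s3_config": "..."},  # pipeline parameters
)
```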

6 changes: 6 additions & 0 deletions kfp/README.md
@@ -0,0 +1,6 @@
# Automation with Kubeflow Pipelines

- [Set up a Kubernetes cluster for KFP execution](./doc/setup.md)
- [Simple Transform pipeline tutorial](./doc/simple_transform_pipeline.md)
- [Executing several transforms](./doc/multi_transform_pipeline.md)
- [Clean up the cluster](./doc/setup.md#cleanup)
2 changes: 1 addition & 1 deletion kfp/doc/multi_transform_pipeline.md
@@ -24,7 +24,7 @@ In the list of its input parameters, we also see `data_s3_config`. Now, we have
![param list](param_list2.png)


**Note** An example super pipeline that combines several transforms, `doc_id`, `ededup`, and `fdedup`, can be found in [superworkflow_dedups_sample_wf.py](../transform_workflows/superworkflows/superworkflow_dedups_sample_wf.py).
**Note** An example super pipeline that combines several transforms, `doc_id`, `ededup`, and `fdedup`, can be found in [superworkflow_dedups_sample_wf.py](../superworkflows/v1/superworkflow_dedups_sample_wf.py).

![super pipeline](super_pipeline.png)

126 changes: 126 additions & 0 deletions kfp/doc/setup.md
@@ -0,0 +1,126 @@
# Set up a Kubernetes cluster for KFP execution

## 📝 Table of Contents
- [Supported platforms for a Kind deployment](#kind_platforms)
- [Preinstalled software components](#preinstalled)
- [A Kind deployment](#kind)
- [An existing cluster](#existing_cluster)
- [Installation steps](#installation)
- [Installation on an existing Kubernetes cluster](#installation_existing)
- [Clean up the cluster](#cleanup")

The project provides instructions and deployment automation to run all components in an all-inclusive fashion on a
single machine, using a [Kind cluster](https://kind.sigs.k8s.io/) and local data storage ([MinIO](https://min.io/)).
However, this topology is not suitable for processing medium and large datasets, so we recommend using a Kind cluster
only for local testing and debugging, not for production loads. For production loads, use a real Kubernetes or
OpenShift cluster.

Running a Kind Kubernetes cluster with Kubeflow Pipelines (KFP) and MinIO requires significant
memory. We recommend deploying it on machines with at least 32 GB of RAM and 8-9 CPU cores. RHEL requires
more resources, e.g., 64 GB of RAM and 32 CPU cores.

## Supported platforms for a Kind deployment <a name = "kind_platforms"></a>
Executing KFP, MinIO, and Ray on a single Kind cluster pushes the system to its load limits. Therefore, although we are
working on extending support for additional platforms, not all platforms/configurations are currently supported.

| Operating System | Container Agent | Support | Comments |
|:----------------:|:---------------:|:-------:|:--------:|
| RHEL 7           | any             | -       | Kind [doesn't support](https://github.com/kubernetes-sigs/kind/issues/3311) RHEL 7 |
| RHEL 8           |                 |         | |
| RHEL 9.4         | Docker          | Yes     | |
| RHEL 9.4         | Podman          | No      | Issues with Ray job executions |
| Ubuntu 24.04     | Docker          | Yes     | |
| Ubuntu 24.04     | Podman          |         | |
| Windows WSL2     | Docker          | Yes     | |
| Windows WSL2     | Podman          |         | |
| MacOS amd64      | Docker          | Yes     | |
| MacOS amd64      | Podman          |         | |
| MacOS arm64      | Docker          |         | |
| MacOS arm64      | Podman          | No      | Issues with Ray job executions |

## Preinstalled software components <a name = "preinstalled"></a>

Depending on whether a Kind cluster or an existing Kubernetes cluster is used, different software packages need to be preinstalled.

### Kind deployment <a name = "kind"></a>
The following programs should be manually installed:

- [Helm](https://helm.sh/docs/intro/install/) 3.10.0 or greater must be installed and configured on your machine.
- [Kind](https://kind.sigs.k8s.io/docs/user/quick-start/#installation) tool for running local Kubernetes clusters 0.14.0 or newer must be installed on your machine.
- [Kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) 1.26 or newer must be installed on your machine.
- [MinIO Client (mc)](https://min.io/docs/minio/kubernetes/upstream/index.html) must be installed on your machine. Please
choose your OS and proceed according to "(Optional) Install the MinIO Client". You only have to install the `mc` client.
- [git client](https://git-scm.com/downloads); we use the git client to clone the installation repository.
- [lsof](https://www.ionos.com/digitalguide/server/configuration/linux-lsof/); it is usually part of Linux or MacOS distributions.
- A container agent such as [Docker](https://www.docker.com/) or [Podman](https://podman-desktop.io/)
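A quick way to verify that the tools are on your PATH (the flags below are the common ones; exact output varies by version):

```shell
helm version
kind version
kubectl version --client
mc --version
git --version
docker --version   # or: podman --version
lsof -v
```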

### Existing Kubernetes cluster <a name = "existing_cluster"></a>
Deployment on an existing cluster requires less preinstalled software.
Only the following programs should be manually installed:

- [Helm](https://helm.sh/docs/intro/install/) 3.10.0 or greater must be installed and configured on your machine.
- [Kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) 1.26 or newer must be installed on your machine, and be
able to connect to the external cluster.
- Deployment of the test data requires the [MinIO Client (mc)](https://min.io/docs/minio/kubernetes/upstream/index.html). Please
choose your OS and proceed according to "(Optional) Install the MinIO Client". Only the `mc` client should be installed.

## Installation steps <a name = "installation"></a>

You can create a Kind cluster with all required software installed using the following command:

```shell
make setup
```
from this main package directory or from the `kind` directory.
If you do not want to upload the testing data into the locally deployed MinIO and want to reduce the memory footprint, set:
```bash
export POPULATE_TEST_DATA=0
```

### Installation on an existing Kubernetes cluster <a name = "installation_existing"></a>
Alternatively, you can deploy the pipeline to an existing Kubernetes cluster.

In order to execute data transformers on the remote Kubernetes cluster, the following packages should be installed on the cluster:

- [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (KFP). Currently, we use the
upstream Argo-based KFP v1.
- [KubeRay](https://docs.ray.io/en/latest/cluster/kubernetes/index.html) controller and
[KubeRay API Server](https://ray-project.github.io/kuberay/components/apiserver/)

You can install the software from their repositories, or you can use our installation scripts.

If your local kubectl is configured to connect to the external cluster, do the following:
```bash
export EXTERNAL_CLUSTER=1
make setup
```

- In addition, you should configure external access to the KFP UI (`svc/ml-pipeline-ui` in the `kubeflow` ns) and the Ray
Server API (`svc/kuberay-apiserver-service` in the `kuberay` ns). Depending on your cluster and its deployment, these can be
LoadBalancer services, Ingresses, or Routes; one option for local access is port-forwarding, as sketched below.
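A hedged sketch of the port-forwarding option; the local ports are arbitrary choices, and the service ports may differ in your deployment:

```shell
# forward the KFP UI and the KubeRay API server to localhost
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
kubectl port-forward -n kuberay svc/kuberay-apiserver-service 8888:8888
```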

- Optionally, you can upload the test data into the [MinIO](https://min.io/) Object Store, deployed as part of KFP. In
order to do this, please provide external access to MinIO (`svc/minio-service` in the `kubeflow` ns) and execute the
following commands:
```shell
export MINIO_SERVER=<Minio external URL>
kubectl apply -f kind/hack/s3_secret.yaml
kind/hack/populate_minio.sh
```

## Clean up the cluster <a name = "cleanup"></a>
If you use an external Kubernetes cluster, set the `EXTERNAL_CLUSTER` environment variable.

```shell
export EXTERNAL_CLUSTER=1
```
Now, you can clean up the external or Kind Kubernetes cluster by running the following command:

```shell
make clean
```

**Note** that this command has to run from the project's `kind` subdirectory; from the root directory, the command is:
```shell
make -C kind clean
```