# Implement MedPerf Model Hello World #42
130 changes: 130 additions & 0 deletions medperf/data_preparator/README.md
# MedPerf's Data Preparator MLCube Template
This is a Hello World implementation, following the structure and conventions MedPerf uses to process and transform raw datasets.

## Purpose:
At the time of writing, Data Preparators are in charge of standardizing the input data format models expect to receive. Additionally, they provide tools for testing the integrity of the data and for extracting useful insights from it.

## How to run:
This template was built to work out of the box. Follow these steps:

1. Clone the repository (the URL below assumes the `mlcube_examples` repository under the MLCommons GitHub organization)
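```bash
# Assumed URL: the MLCommons mlcube_examples repository
git clone https://github.com/mlcommons/mlcube_examples.git
```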
2. cd to the repository
```bash
cd mlcube_examples
```
3. Install mlcube and mlcube-docker

```bash
pip install mlcube mlcube-docker
```
4. cd to current example's `mlcube` folder

```bash
cd medperf/data_preparator/mlcube
```
5. execute the `prepare` task with mlcube
```bash
mlcube run --task=prepare
```
6. check resulting data
```bash
ls workspace/data
```
7. execute the `sanity_check` task
```bash
mlcube run --task=sanity_check
```
8. execute the `statistics` task
```bash
mlcube run --task=statistics
```
9. check the resulting statistics
```bash
cat workspace/statistics.yaml
```
That's it! You just built and ran a hello-world data preparator MLCube!

## Contents

MLCubes usually share a similar folder structure and files. Here's a brief description of the role of each relevant file:

1. __`mlcube/mlcube.yaml`__:

The `mlcube.yaml` file contains metadata about your data preparation procedure, including its interface. For MedPerf, we require three tasks: `prepare`, `sanity_check` and `statistics`. The tasks and their inputs/outputs are described in the file:

```yml
tasks:
prepare:
# This task is in charge of transforming the input data into the format
# expected by the model cubes.
parameters:
inputs: {
data_path: names/, # Required. Value must point to a directory containing the raw data inside workspace
labels_path: labels/, # Required. Value must point to a directory containing labels for the data
parameters_file: parameters.yaml # Required. Value must be `parameters.yaml`
}
outputs: {
output_path: data/ # Required. Indicates where to store the transformed data. Must contain transformed data and labels
}
sanity_check:
# This task ensures that the previously transformed data was transformed correctly.
    # It runs a set of tests that check the quality of the data. The rigor of those
    # tests is determined by the cube author.
parameters:
inputs: {
data_path: data/, # Required. Value should be the output of the prepare task
parameters_file: parameters.yaml # Required. Value must be `parameters.yaml`
}
statistics:
# This task computes statistics on the prepared dataset. Its purpose is to get a high-level
# idea of what is contained inside the data, without providing any specifics of any single entry
parameters:
inputs: {
data_path: data/, # Required. Value should be the output of the prepare task
parameters_file: parameters.yaml # Required. Value must be `parameters.yaml`
}
outputs: {
output_path: {
type: file, default: statistics.yaml # Required. Value must be `statistics.yaml`
}
}
```

2. __`mlcube/workspace/parameters.yaml`__:

This file provides ways to parameterize the data preparation process. You can set any key-value pairs that should be easily modifiable in order to adjust your MLCube's behavior. This file is mandatory, but can be left blank if parameterization is not needed, as is the case in this example.
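As a purely hypothetical sketch, a preparator that filters and formats entries might expose knobs like these (the keys below are illustrative and not part of this example, which ships a blank `parameters.yaml`):

```yml
# Hypothetical parameters.yaml contents; this example leaves the file blank
accepted_greetings: ["Hello", "Howdy"] # hypothetical filter setting
max_entries: 1000                      # hypothetical cap on processed rows
```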

3. __`project`__:

Contains the actual implementation of the MLCube. This includes all project-specific code, the `Dockerfile` for building Docker containers of the project, and the requirements for running the code.
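In this example, the folder contains the entrypoint plus one script per task (`sanity_check.py` and `statistics.py` are the scripts invoked by `mlcube.py` through subprocesses):

```
project
├── Dockerfile
├── mlcube.py          # CLI entrypoint that dispatches MLCube tasks
├── prepare.py         # implements the prepare task
├── sanity_check.py    # implements the sanity_check task
├── statistics.py      # implements the statistics task
└── requirements.txt
```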

4. __`project/mlcube.py`__:

MLCube expects an entrypoint to the project in order to run the code and the specified tasks. It expects this entrypoint to behave like a CLI, in which each MLCube task (e.g. `prepare`) is executed as a subcommand, and each input/output parameter is passed as a CLI argument. An example of the expected interface is:
```bash
python3 project/mlcube.py prepare --data_path=<DATA_PATH> --labels_path=<LABELS_PATH> --parameters_file=<PARAMETERS_FILE> --output_path=<OUTPUT_PATH>
```
`mlcube.py` provides such an interface for this toy example. As long as you follow this CLI interface, you can implement it however you want. We provide an example that requires minimal modification to the original project code by running each project task through subprocesses. The other two tasks follow the same pattern:
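```bash
python3 project/mlcube.py sanity_check --data_path=<DATA_PATH> --parameters_file=<PARAMETERS_FILE>
python3 project/mlcube.py statistics --data_path=<DATA_PATH> --parameters_file=<PARAMETERS_FILE> --output_path=<OUTPUT_PATH>
```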

## How to modify
If you want to adjust this template for your own use-case, then the following list serves as a step-by-step guide:
1. Remove demo artifacts from `/mlcube/workspace`:
- `/mlcube/workspace/data`
- `/mlcube/workspace/labels`
- `/mlcube/workspace/names`
- `/mlcube/workspace/statistics.yaml`
2. Move your original code into the `/project` folder (removing everything except `mlcube.py`)
3. Adjust your code and the `/project/mlcube.py` file so that commands point to the respective code and receive the expected arguments
4. Modify `/project/requirements.txt` so that it contains all code dependencies for your project
5. The default `/project/Dockerfile` should suffice, but feel free to adjust it to fit your needs, as long as it keeps an entrypoint pointing to `mlcube.py`
6. Inside `/mlcube/workspace` add the input folders for preparing data.
7. Inside `/mlcube/workspace/additional_files` add any files that are required for model execution (e.g. model weights)
8. Adjust `/mlcube/mlcube.yaml` so that:
1. metadata such as `name`, `description`, `authors` and `image_name` are correctly assigned.
2. `data_path`, `labels_path` and other IO parameters point to the location where you expect data to be inside the `workspace` directory.
3. `parameters_file` should NOT be modified in any way.
   4. Add any other required parameters that point to `additional_files` (e.g. `model_weights`). Naming can be arbitrary, but all files referenced from now on should be contained inside `additional_files` (see the sketch after this list).
5. `output_path`s should NOT be modified in any way.
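As a hedged sketch of step 8.4, a cube that needs extra files might extend the `prepare` task's inputs like the following (`model_weights` and `weights.pt` are hypothetical names, not part of this example):

```yml
# Hypothetical extension of the prepare task inputs in mlcube/mlcube.yaml
inputs: {
  data_path: names/,
  labels_path: labels/,
  parameters_file: parameters.yaml,
  model_weights: additional_files/weights.pt # hypothetical extra input
}
```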

## Requirements are negotiable
The required fields in the MLCube task interface show what MedPerf currently assumes. Since we are in alpha, this is a great time to raise concerns or requests about these requirements! Now is the best time for us to make changes.
51 changes: 51 additions & 0 deletions medperf/data_preparator/mlcube/mlcube.yaml
name: Hello World Medperf Data Preparator Cube
description: MLCommons demonstration MLCube for building data preparators for MedPerf
authors:
- {name: "MLCommons Medical Working Group"}

platform:
accelerator_count: 0

docker:
# Image name.
image: mlcommons/medical-data-prep-hello-world
# Docker build context relative to $MLCUBE_ROOT. Default is `build`.
build_context: "../project"
# Docker file name within docker build context, default is `Dockerfile`.
build_file: "Dockerfile"

tasks:
prepare:
# This task is in charge of transforming the input data into the format
# expected by the model cubes.
parameters:
inputs: {
data_path: names/, # Required. Value must point to a directory containing the raw data inside workspace
labels_path: labels/, # Required. Value must point to a directory containing labels for the data
parameters_file: parameters.yaml # Required. Value must be `parameters.yaml`
}
outputs: {
output_path: data/ # Required. Indicates where to store the transformed data. Must contain transformed data and labels
}
sanity_check:
# This task ensures that the previously transformed data was transformed correctly.
    # It runs a set of tests that check the quality of the data. The rigor of those
    # tests is determined by the cube author.
parameters:
inputs: {
data_path: data/, # Required. Value should be the output of the prepare task
parameters_file: parameters.yaml # Required. Value must be `parameters.yaml`
}
statistics:
# This task computes statistics on the prepared dataset. Its purpose is to get a high-level
# idea of what is contained inside the data, without providing any specifics of any single entry
parameters:
inputs: {
data_path: data/, # Required. Value should be the output of the prepare task
parameters_file: parameters.yaml # Required. Value must be `parameters.yaml`
}
outputs: {
output_path: {
type: file, default: statistics.yaml # Required. Value must be `statistics.yaml`
}
}
13 changes: 13 additions & 0 deletions medperf/data_preparator/mlcube/workspace/labels/labels.csv
id,greeting
0,"Hello, Adam Smith"
1,"Hello, John Smith"
2,"Hello, Michael Stevens"
3,"Howdy, Adam Smith"
4,"Howdy, John Smith"
5,"Howdy, Michael Stevens"
6,"Greetings, Adam Smith"
7,"Greetings, John Smith"
8,"Greetings, Michael Stevens"
9,"Bonjour, Adam Smith"
10,"Bonjour, John Smith"
11,"Bonjour, Michael Stevens"
3 changes: 3 additions & 0 deletions medperf/data_preparator/mlcube/workspace/names/names.txt
Adam Smith Miller
John Smith Jones
Michael M. Stevens Taylor
Empty file.
29 changes: 29 additions & 0 deletions medperf/data_preparator/project/Dockerfile
FROM ubuntu:18.04
LABEL maintainer="MLPerf MLBox Working Group"

RUN apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common \
python3-dev \
curl && \
rm -rf /var/lib/apt/lists/*

RUN add-apt-repository ppa:deadsnakes/ppa -y && apt-get update

RUN apt-get install python3 -y

RUN apt-get install python3-pip -y

COPY ./requirements.txt project/requirements.txt

RUN pip3 install --upgrade pip

RUN pip3 install --no-cache-dir -r project/requirements.txt

ENV LANG C.UTF-8

COPY . /project

WORKDIR /project

ENTRYPOINT ["python3", "mlcube.py"]
81 changes: 81 additions & 0 deletions medperf/data_preparator/project/mlcube.py
# MLCube Entrypoint
#
# This script shows how you can bridge your app with an MLCube interface.
# MLCubes expect the entrypoint to behave like a CLI, where tasks are
# commands, and input/output parameters are passed as command-line arguments.
# You can provide that interface to MLCube in any way you prefer.
# Here, we show a way that requires minimal intrusion to the original code,
# by running the application through subprocesses.

import shlex
import subprocess

import typer

app = typer.Typer()

def exec_python(cmd: str) -> None:
    """Execute a python script as a subprocess

    Args:
        cmd (str): command to run, as it would be written in the terminal
    """
    # shlex.split respects quoting, so arguments containing spaces stay intact
    split_cmd = shlex.split(cmd)
    process = subprocess.Popen(split_cmd, cwd=".")
    process.wait()

@app.command("prepare")
def prepare(
data_path: str = typer.Option(..., "--data_path"),
labels_path: str = typer.Option(..., "--labels_path"),
params_file: str = typer.Option(..., "--parameters_file"),
out_path: str = typer.Option(..., "--output_path")
):
"""Prepare task command. This is what gets executed when we run:
`mlcube run --task=prepare`

Args:
data_path (str): Location of the data to transform. Required for Medperf Data Preparation MLCubes.
labels_path (str): Location of the labels. Required for Medperf Data Preparation MLCubes
params_file (str): Location of the parameters.yaml file. Required for Medperf Data Preparation MLCubes.
out_path (str): Location to store transformed data. Required for Medperf Data Preparation MLCubes.
"""
cmd = f"python3 prepare.py --names_path={data_path} --labels_path={labels_path} --out={out_path}"
exec_python(cmd)

@app.command("sanity_check")
def sanity_check(
data_path: str = typer.Option(..., "--data_path"),
params_file: str = typer.Option(..., "--parameters_file")
):
"""Sanity check task command. This is what gets executed when we run:
`mlcube run --task=sanity_check`

Args:
data_path (str): Location of the prepared data. Required for Medperf Data Preparation MLCubes.
params_file (str): Location of the parameters.yaml file. Required for Medperf Data Preparation MLCubes.
"""
cmd = f"python3 sanity_check.py --data_path={data_path}"
exec_python(cmd)

@app.command("statistics")
def statistics(
data_path: str = typer.Option(..., "--data_path"),
params_file: str = typer.Option(..., "--parameters_file"),
output_path: str = typer.Option(..., "--output_path")
):
"""Computes statistics about the data. This statistics are uploaded
to the Medperf platform under the data owner's approval. Include
every statistic you consider useful for determining the nature of the
data, but keep in mind that we want to keep the data as private as
possible.

Args:
data_path (str): Location of the prepared data. Required for Medperf Data Preparation MLCubes.
params_file (str): Location of the parameters.yaml file. Required for Medperf Data Preparation MLCubes.
output_path (str): File to store the statistics. Must be statistics.yaml. Required for Medperf Data Preparation MLCubes.
"""
cmd = f"python3 statistics.py --data_path={data_path} --out_file={output_path}"
exec_python(cmd)

if __name__ == "__main__":
app()
60 changes: 60 additions & 0 deletions medperf/data_preparator/project/prepare.py
import os
import shutil
import argparse
import pandas as pd

def prepare(names: pd.DataFrame):
"""Takes a list of names and formats them into [First Name, Last Name]

Args:
names (pd.DataFrame): DataFrame containing the names to be prepared
"""
names["First Name"] = names["Name"].str.split().str[0]
names["Last Name"] = names["Name"].str.split().str[-2]
names.drop("Name", axis="columns", inplace=True)

return names

def get_names_df(files, column_name):
    # Look for a supported file format (csv, tsv or txt) in the given directory
    names_files = os.listdir(files)
csv_files = [file for file in names_files if file.endswith(".csv")]
tsv_files = [file for file in names_files if file.endswith(".tsv")]
txt_files = [file for file in names_files if file.endswith(".txt")]

if len(csv_files):
filepath = os.path.join(files, csv_files[0])
df = pd.read_csv(filepath, usecols=[column_name])
return df
if len(tsv_files):
filepath = os.path.join(files, tsv_files[0])
df = pd.read_csv(filepath, usecols=[column_name], sep='\t')
return df
if len(txt_files):
filepath = os.path.join(files, txt_files[0])
with open(filepath, "r") as f:
names = f.readlines()

df = pd.DataFrame(data=names, columns=[column_name])
return df

if __name__ == '__main__':
parser = argparse.ArgumentParser("Medperf Data Preparator Example")
parser.add_argument("--names_path", dest="names", type=str, help="path containing raw names")
parser.add_argument("--labels_path", dest="labels", type=str, help="path containing labels")
parser.add_argument("--out", dest="out" , type=str, help="path to store prepared data")

args = parser.parse_args()

# One of the intended use-cases of the data preparator cube
# is to accept multiple data formats depending on the task needs
names_df = get_names_df(args.names, "Name")
prepared_names = prepare(names_df)

# add the labels to the output folder. In this case we're going to assume
# the labels will always follow the same format
in_labels = os.path.join(args.labels, "labels.csv")
out_labels = os.path.join(args.out, "labels.csv")
shutil.copyfile(in_labels, out_labels)

out_file = os.path.join(args.out, "names.csv")
prepared_names.to_csv(out_file, index=False)
3 changes: 3 additions & 0 deletions medperf/data_preparator/project/requirements.txt
pyYAML
typer
pandas