# Implement MedPerf Model Hello World #42
130 changes: 130 additions & 0 deletions medperf/data_preparator/README.md
# MedPerf's Data Preparator MLCube Template
This is a Hello World implementation, following the structure and conventions MedPerf uses to process and transform raw datasets.

## Purpose:
At the time of writing, Data Preparators are in charge of standardizing the input data format models expect to receive. Additionally, they provide tools for testing the integrity of the data and for extracting useful insights from it.

## How to run:
This template was built to work out of the box. Follow these steps:

1. Clone the repository (the URL below assumes the `mlcube_examples` repository under the MLCommons GitHub organization)
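```bash
# Assumed URL: the MLCommons mlcube_examples repository
git clone https://github.com/mlcommons/mlcube_examples.git
```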
2. cd to the repository
```bash
cd mlcube_examples
```
3. Install mlcube and mlcube-docker

```bash
pip install mlcube mlcube-docker
```
4. cd to current example's `mlcube` folder

```bash
cd medperf/data_preparator/mlcube
```
5. execute the `prepare` task with mlcube
```bash
mlcube run --task=prepare
```
6. check resulting data
```bash
ls workspace/data
```
7. execute the `sanity_check` task
```bash
mlcube run --task=sanity_check
```
8. execute the `statistics` task
```bash
mlcube run --task=statistics
```
9. check the resulting statistics
```bash
cat workspace/statistics.yaml
```
That's it! You just built and ran a hello-world data preparator MLCube!

## Contents

MLCubes usually share a similar folder structure and files. Here's a brief description of the role of each relevant file:

1. __`mlcube/mlcube.yaml`__:

The `mlcube.yaml` file contains metadata about your data preparation procedure, including its interface. For MedPerf, we require three tasks: `prepare`, `sanity_check` and `statistics`. The tasks and their inputs/outputs are described in the file:

```yml
tasks:
prepare:
# This task is in charge of transforming the input data into the format
# expected by the model cubes.
parameters:
inputs: {
data_path: names/, # Required. Value must point to a directory containing the raw data inside workspace
labels_path: labels/, # Required. Value must point to a directory containing labels for the data
parameters_file: parameters.yaml # Required. Value must be `parameters.yaml`
}
outputs: {
output_path: data/ # Required. Indicates where to store the transformed data. Must contain transformed data and labels
}
sanity_check:
# This task ensures that the previously transformed data was transformed correctly.
    # It runs a set of tests that check the quality of the data. The rigor of those
    # tests is determined by the cube author.
parameters:
inputs: {
data_path: data/, # Required. Value should be the output of the prepare task
parameters_file: parameters.yaml # Required. Value must be `parameters.yaml`
}
statistics:
# This task computes statistics on the prepared dataset. Its purpose is to get a high-level
# idea of what is contained inside the data, without providing any specifics of any single entry
parameters:
inputs: {
data_path: data/, # Required. Value should be the output of the prepare task
parameters_file: parameters.yaml # Required. Value must be `parameters.yaml`
}
outputs: {
output_path: {
type: file, default: statistics.yaml # Required. Value must be `statistics.yaml`
}
}
```

2. __`mlcube/workspace/parameters.yaml`__:

This file provides ways to parameterize the data preparation process. You can set any key-value pairs that should be easily modifiable in order to adjust your MLCube's behavior. This file is mandatory, but can be left blank if parameterization is not needed, as is the case in this example.
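As a purely hypothetical sketch, a preparator that filters and formats entries might expose knobs like these (the keys below are illustrative and not part of this example, which ships a blank `parameters.yaml`):

```yml
# Hypothetical parameters.yaml contents; this example leaves the file blank
accepted_greetings: ["Hello", "Howdy"] # hypothetical filter setting
max_entries: 1000                      # hypothetical cap on processed rows
```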

3. __`project`__:

Contains the actual implementation of the MLCube. This includes all project-specific code, the `Dockerfile` for building Docker containers of the project, and the requirements for running the code.
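In this example, the folder contains the entrypoint plus one script per task (`sanity_check.py` and `statistics.py` are the scripts invoked by `mlcube.py` through subprocesses):

```
project
├── Dockerfile
├── mlcube.py          # CLI entrypoint that dispatches MLCube tasks
├── prepare.py         # implements the prepare task
├── sanity_check.py    # implements the sanity_check task
├── statistics.py      # implements the statistics task
└── requirements.txt
```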

4. __`project/mlcube.py`__:

MLCube expects an entrypoint to the project in order to run the code and the specified tasks. It expects this entrypoint to behave like a CLI, in which each MLCube task (e.g. `prepare`) is executed as a subcommand, and each input/output parameter is passed as a CLI argument. An example of the expected interface is:
```bash
python3 project/mlcube.py prepare --data_path=<DATA_PATH> --labels_path=<LABELS_PATH> --parameters_file=<PARAMETERS_FILE> --output_path=<OUTPUT_PATH>
```
`mlcube.py` provides such an interface for this toy example. As long as you follow this CLI interface, you can implement it however you want. We provide an example that requires minimal modification to the original project code by running each project task through subprocesses. The other two tasks follow the same pattern:
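```bash
python3 project/mlcube.py sanity_check --data_path=<DATA_PATH> --parameters_file=<PARAMETERS_FILE>
python3 project/mlcube.py statistics --data_path=<DATA_PATH> --parameters_file=<PARAMETERS_FILE> --output_path=<OUTPUT_PATH>
```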

## How to modify
If you want to adjust this template for your own use-case, then the following list serves as a step-by-step guide:
1. Remove demo artifacts from `/mlcube/workspace`:
- `/mlcube/workspace/data`
- `/mlcube/workspace/labels`
- `/mlcube/workspace/names`
- `/mlcube/workspace/statistics.yaml`
2. Move your original code into the `/project` folder (removing everything except `mlcube.py`)
3. Adjust your code and the `/project/mlcube.py` file so that commands point to the respective code and receive the expected arguments
4. Modify `/project/requirements.txt` so that it contains all code dependencies for your project
5. The default `/project/Dockerfile` should suffice, but feel free to adjust it to fit your needs, as long as it keeps an entrypoint pointing to `mlcube.py`
6. Inside `/mlcube/workspace` add the input folders for preparing data.
7. Inside `/mlcube/workspace/additional_files` add any files that are required for model execution (e.g. model weights)
8. Adjust `/mlcube/mlcube.yaml` so that:
1. metadata such as `name`, `description`, `authors` and `image_name` are correctly assigned.
2. `data_path`, `labels_path` and other IO parameters point to the location where you expect data to be inside the `workspace` directory.
3. `parameters_file` should NOT be modified in any way.
   4. Add any other required parameters that point to `additional_files` (e.g. `model_weights`). Naming can be arbitrary, but all files referenced from now on should be contained inside `additional_files` (see the sketch after this list).
5. `output_path`s should NOT be modified in any way.
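As a hedged sketch of step 8.4, a cube that needs extra files might extend the `prepare` task's inputs like the following (`model_weights` and `weights.pt` are hypothetical names, not part of this example):

```yml
# Hypothetical extension of the prepare task inputs in mlcube/mlcube.yaml
inputs: {
  data_path: names/,
  labels_path: labels/,
  parameters_file: parameters.yaml,
  model_weights: additional_files/weights.pt # hypothetical extra input
}
```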

## Requirements are negotiable
The required fields in the MLCube task interface show what MedPerf currently assumes. Since we are in alpha, this is a great time to raise concerns or requests about these requirements! Now is the best time for us to make changes.
51 changes: 51 additions & 0 deletions medperf/data_preparator/mlcube/mlcube.yaml
name: Hello World Medperf Data Preparator Cube
description: MLCommons demonstration MLCube for building data preparators for MedPerf
authors:
- {name: "MLCommons Medical Working Group"}

platform:
accelerator_count: 0

docker:
# Image name.
image: mlcommons/medical-data-prep-hello-world
# Docker build context relative to $MLCUBE_ROOT. Default is `build`.
build_context: "../project"
# Docker file name within docker build context, default is `Dockerfile`.
build_file: "Dockerfile"

tasks:
prepare:
# This task is in charge of transforming the input data into the format
# expected by the model cubes.
parameters:
inputs: {
data_path: names/, # Required. Value must point to a directory containing the raw data inside workspace
labels_path: labels/, # Required. Value must point to a directory containing labels for the data
parameters_file: parameters.yaml # Required. Value must be `parameters.yaml`
}
outputs: {
output_path: data/ # Required. Indicates where to store the transformed data. Must contain transformed data and labels
}
sanity_check:
# This task ensures that the previously transformed data was transformed correctly.
    # It runs a set of tests that check the quality of the data. The rigor of those
    # tests is determined by the cube author.
parameters:
inputs: {
data_path: data/, # Required. Value should be the output of the prepare task
parameters_file: parameters.yaml # Required. Value must be `parameters.yaml`
}
statistics:
# This task computes statistics on the prepared dataset. Its purpose is to get a high-level
# idea of what is contained inside the data, without providing any specifics of any single entry
parameters:
inputs: {
data_path: data/, # Required. Value should be the output of the prepare task
parameters_file: parameters.yaml # Required. Value must be `parameters.yaml`
}
outputs: {
output_path: {
type: file, default: statistics.yaml # Required. Value must be `statistics.yaml`
}
}
13 changes: 13 additions & 0 deletions medperf/data_preparator/mlcube/workspace/labels/labels.csv
id,greeting
0,"Hello, Adam Smith"
1,"Hello, John Smith"
2,"Hello, Michael Stevens"
3,"Howdy, Adam Smith"
4,"Howdy, John Smith"
5,"Howdy, Michael Stevens"
6,"Greetings, Adam Smith"
7,"Greetings, John Smith"
8,"Greetings, Michael Stevens"
9,"Bonjour, Adam Smith"
10,"Bonjour, John Smith"
11,"Bonjour, Michael Stevens"
3 changes: 3 additions & 0 deletions medperf/data_preparator/mlcube/workspace/names/names.txt
Adam Smith Miller
John Smith Jones
Michael M. Stevens Taylor
Empty file.
29 changes: 29 additions & 0 deletions medperf/data_preparator/project/Dockerfile
FROM ubuntu:18.04
LABEL maintainer="MLPerf MLBox Working Group"

RUN apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common \
python3-dev \
curl && \
rm -rf /var/lib/apt/lists/*

RUN add-apt-repository ppa:deadsnakes/ppa -y && apt-get update

RUN apt-get install python3 -y

RUN apt-get install python3-pip -y

COPY ./requirements.txt project/requirements.txt

RUN pip3 install --upgrade pip

RUN pip3 install --no-cache-dir -r project/requirements.txt

ENV LANG C.UTF-8

COPY . /project

WORKDIR /project

ENTRYPOINT ["python3", "mlcube.py"]
81 changes: 81 additions & 0 deletions medperf/data_preparator/project/mlcube.py
# MLCube Entrypoint
#
# This script shows how you can bridge your app with an MLCube interface.
# MLCubes expect the entrypoint to behave like a CLI, where tasks are
# commands, and input/output parameters are passed as command-line arguments.
# You can provide that interface to MLCube in any way you prefer.
# Here, we show a way that requires minimal intrusion to the original code,
# by running the application through subprocesses.

import shlex
import subprocess

import typer

app = typer.Typer()

def exec_python(cmd: str) -> None:
    """Execute a python script as a subprocess

    Args:
        cmd (str): command to run, as it would be written in the terminal
    """
    # shlex.split respects quoting, so arguments containing spaces stay intact
    split_cmd = shlex.split(cmd)
    process = subprocess.Popen(split_cmd, cwd=".")
    process.wait()

@app.command("prepare")
def prepare(
data_path: str = typer.Option(..., "--data_path"),
labels_path: str = typer.Option(..., "--labels_path"),
params_file: str = typer.Option(..., "--parameters_file"),
out_path: str = typer.Option(..., "--output_path")
):
"""Prepare task command. This is what gets executed when we run:
`mlcube run --task=prepare`

Args:
data_path (str): Location of the data to transform. Required for Medperf Data Preparation MLCubes.
labels_path (str): Location of the labels. Required for Medperf Data Preparation MLCubes
params_file (str): Location of the parameters.yaml file. Required for Medperf Data Preparation MLCubes.
out_path (str): Location to store transformed data. Required for Medperf Data Preparation MLCubes.
"""
cmd = f"python3 prepare.py --names_path={data_path} --labels_path={labels_path} --out={out_path}"
exec_python(cmd)

@app.command("sanity_check")
def sanity_check(
data_path: str = typer.Option(..., "--data_path"),
params_file: str = typer.Option(..., "--parameters_file")
):
"""Sanity check task command. This is what gets executed when we run:
`mlcube run --task=sanity_check`

Args:
data_path (str): Location of the prepared data. Required for Medperf Data Preparation MLCubes.
params_file (str): Location of the parameters.yaml file. Required for Medperf Data Preparation MLCubes.
"""
cmd = f"python3 sanity_check.py --data_path={data_path}"
exec_python(cmd)

@app.command("statistics")
def statistics(
data_path: str = typer.Option(..., "--data_path"),
params_file: str = typer.Option(..., "--parameters_file"),
output_path: str = typer.Option(..., "--output_path")
):
"""Computes statistics about the data. This statistics are uploaded
to the Medperf platform under the data owner's approval. Include
every statistic you consider useful for determining the nature of the
data, but keep in mind that we want to keep the data as private as
possible.

Args:
data_path (str): Location of the prepared data. Required for Medperf Data Preparation MLCubes.
params_file (str): Location of the parameters.yaml file. Required for Medperf Data Preparation MLCubes.
output_path (str): File to store the statistics. Must be statistics.yaml. Required for Medperf Data Preparation MLCubes.
"""
cmd = f"python3 statistics.py --data_path={data_path} --out_file={output_path}"
exec_python(cmd)

if __name__ == "__main__":
app()
60 changes: 60 additions & 0 deletions medperf/data_preparator/project/prepare.py
import os
import shutil
import argparse
import pandas as pd

def prepare(names: pd.DataFrame):
"""Takes a list of names and formats them into [First Name, Last Name]

Args:
names (pd.DataFrame): DataFrame containing the names to be prepared
"""
names["First Name"] = names["Name"].str.split().str[0]
names["Last Name"] = names["Name"].str.split().str[-2]
names.drop("Name", axis="columns", inplace=True)

return names

def get_names_df(files, column_name):
    # Look for a supported file format (csv, tsv or txt) in the given directory
    names_files = os.listdir(files)
csv_files = [file for file in names_files if file.endswith(".csv")]
tsv_files = [file for file in names_files if file.endswith(".tsv")]
txt_files = [file for file in names_files if file.endswith(".txt")]

if len(csv_files):
filepath = os.path.join(files, csv_files[0])
df = pd.read_csv(filepath, usecols=[column_name])
return df
if len(tsv_files):
filepath = os.path.join(files, tsv_files[0])
df = pd.read_csv(filepath, usecols=[column_name], sep='\t')
return df
if len(txt_files):
filepath = os.path.join(files, txt_files[0])
with open(filepath, "r") as f:
names = f.readlines()

df = pd.DataFrame(data=names, columns=[column_name])
return df

if __name__ == '__main__':
parser = argparse.ArgumentParser("Medperf Data Preparator Example")
parser.add_argument("--names_path", dest="names", type=str, help="path containing raw names")
parser.add_argument("--labels_path", dest="labels", type=str, help="path containing labels")
parser.add_argument("--out", dest="out" , type=str, help="path to store prepared data")

args = parser.parse_args()

# One of the intended use-cases of the data preparator cube
# is to accept multiple data formats depending on the task needs
names_df = get_names_df(args.names, "Name")
prepared_names = prepare(names_df)

# add the labels to the output folder. In this case we're going to assume
# the labels will always follow the same format
in_labels = os.path.join(args.labels, "labels.csv")
out_labels = os.path.join(args.out, "labels.csv")
shutil.copyfile(in_labels, out_labels)

out_file = os.path.join(args.out, "names.csv")
prepared_names.to_csv(out_file, index=False)
3 changes: 3 additions & 0 deletions medperf/data_preparator/project/requirements.txt
pyYAML
typer
pandas