-
Thanks for letting me pick your brain on this. Before looking at the details, I was wondering if we could also discuss a practical way of creating new "nodes" and automatically parsing the configuration file. I tried to make a toy example as follows; the idea is to end up with the simplest API possible. In my example the layout is the following:

```
├── main.py
├── model_config.yml
└── model_manager
    ├── __init__.py
    ├── builder.py
    └── registrer.py
```

The main (user-defined) script looks like this:

```python
# main.py
from pathlib import Path
import numpy as np
from rich.pretty import pprint
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin
from sklearn.pipeline import Pipeline
from model_manager.builder import build_pipeline_from
from model_manager.registrer import register_node
# tasks can be defined elsewhere, of course, but for this example we define them here
@register_node("standard_scaler")
class StandardScalerNode(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.mean_ = None
        self.std_ = None

    def fit(self, X, y=None):
        self.mean_ = np.mean(X, axis=0)
        self.std_ = np.std(X, axis=0)
        return self

    def transform(self, X):
        return (X - self.mean_) / self.std_

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)

@register_node("pca_reduction")
class DummyPCANode(BaseEstimator, TransformerMixin):
def __init__(self, n_components):
self.n_components = n_components
def fit(self, X, y=None):
return self
def transform(self, X):
return X
@register_node("dummy_regressor")
class DummyRegressorNode(BaseEstimator, RegressorMixin):
def __init__(self, coef):
self.coef = np.array(coef)
def fit(self, X, y):
return self
def predict(self, X):
return X @ self.coef
# create pipeline from config
config_file_path = Path("model_config.yml")
pipeline: Pipeline = build_pipeline_from(config_file_path)
pprint(pipeline)
```

We define three tasks (nodes) using the `register_node` decorator. The decorator itself is defined in `model_manager/registrer.py`:

```python
from typing import Type

NODE_REGISTRY: dict[str, Type] = {}
def register_node(name: str):
    def decorator(cls):
        NODE_REGISTRY[name] = cls
        return cls
    return decorator
```

Whenever a class decorated with `register_node` is imported (either in the main script or through the user package), it is added to `NODE_REGISTRY` as a side effect of the class definition.
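For instance (the `IdentityNode` below is a hypothetical extra node, not part of the toy example), simply defining a decorated class is enough to make it available to the builder:

```python
from sklearn.base import BaseEstimator, TransformerMixin

from model_manager.registrer import NODE_REGISTRY, register_node


@register_node("identity")
class IdentityNode(BaseEstimator, TransformerMixin):
    """Hypothetical no-op node, used only to illustrate registration."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


# registration happens at class-definition time, no explicit call needed
print("identity" in NODE_REGISTRY)  # True
```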
The function that builds the pipeline simply reads the configuration file, loops over the steps and looks in the node registry to build the pipeline:

```python
from pathlib import Path

import yaml
from sklearn.pipeline import Pipeline

from model_manager.registrer import NODE_REGISTRY

def build_pipeline_from(config: Path) -> Pipeline:
    """Build a scikit-learn pipeline from a YAML configuration file.

    Args:
        config: Path to a YAML file whose keys are node names and whose values
            are dictionaries containing the node type and its parameters.

    Returns:
        Pipeline: A scikit-learn Pipeline object constructed from the configuration.
    """
    with open(config) as f:
        config = yaml.safe_load(f)
    steps = []
    for name, cfg in config.items():
        node_type = cfg.pop("type")
        cls = NODE_REGISTRY[node_type]
        instance = cls(**cfg)
        steps.append((name, instance))
    return Pipeline(steps)
```

For the main script shown above, I use this configuration file:

```yaml
scaler:
  type: standard_scaler
reduction:
  type: pca_reduction
  n_components: 2
regressor:
  type: dummy_regressor
  coef: [1.0, 2.0]
```
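For reference (assuming the node classes from `main.py` are in scope), the builder should produce the same object as writing the pipeline out by hand:

```python
from sklearn.pipeline import Pipeline

# hand-written equivalent of what build_pipeline_from builds from model_config.yml
pipeline = Pipeline([
    ("scaler", StandardScalerNode()),
    ("reduction", DummyPCANode(n_components=2)),
    ("regressor", DummyRegressorNode(coef=[1.0, 2.0])),
])
```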
Summary:
I think it would be nice to:
I uploaded the toy manager if you want to try it out:
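As a quick smoke test with random data (this continues the end of `main.py`, so the toy nodes are already registered and `pipeline` already exists), the built pipeline can be fitted and used directly:

```python
import numpy as np  # already imported at the top of main.py

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))  # two features, matching coef: [1.0, 2.0]
y = rng.normal(size=10)

pipeline.fit(X, y)
print(pipeline.predict(X).shape)  # (10,)
```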
-
Thanks a lot for this insight! It could be very nice, but I think we should first converge on some "manual" examples to decide whether pipelines on plaid objects natively make sense.
-
For the moment, I make a deepcopy of the dataset so that each node of the pipeline outputs a clean copy of the modified dataset and leaves its input dataset unmodified. This is a safe mechanism, but quite expensive (a sketch of this copy-per-node behaviour is shown after the pros/cons below). In particular, applying the pipeline twice to the same dataset does not return different results. An alternative could be:
Pros:
Cons:
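To make the current copy-per-node behaviour concrete, here is a minimal sketch; the dictionary "dataset" and the `CopyingNode` class are placeholders, not the actual plaid types:

```python
import copy


class CopyingNode:
    """Toy stand-in for a pipeline node that never mutates its input dataset."""

    def transform(self, dataset):
        # work on a deep copy so the caller's dataset is left untouched;
        # safe, but the copy is paid again at every node of the pipeline
        dataset = copy.deepcopy(dataset)
        dataset["scaled"] = True  # placeholder for the real transformation
        return dataset


original = {"scaled": False}
result = CopyingNode().transform(original)
print(original)  # {'scaled': False} -> input unchanged
print(result)    # {'scaled': True}
```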
-
A relatively clean attempt is available here: https://github.com/PLAID-lib/plaid/tree/pipefunc_tests/examples/pipelines