-
Thanks for letting me pick your brain on this. Before looking at the details, I was wondering if we could also discuss a practical way of creating new "nodes" and automatically parsing the configuration file. I tried to make a toy example as follows; the idea is to end up with the simplest API possible. In my example the layout is the following:

```
├── main.py
├── model_config.yml
└── model_manager
    ├── __init__.py
    ├── builder.py
    └── registrer.py
```

The main (user-defined) script looks like this:

```python
# main.py
from pathlib import Path
import numpy as np
from rich.pretty import pprint
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin
from sklearn.pipeline import Pipeline
from model_manager.builder import build_pipeline_from
from model_manager.registrer import register_node
# tasks can be defined elsewhere, of course, but for this example we define them here
@register_node("standard_scaler")
class StandardScalerNode(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.mean_ = None
        self.std_ = None

    def fit(self, X, y=None):
        self.mean_ = np.mean(X, axis=0)
        self.std_ = np.std(X, axis=0)
        return self

    def transform(self, X):
        return (X - self.mean_) / self.std_

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)

@register_node("pca_reduction")
class DummyPCANode(BaseEstimator, TransformerMixin):
def __init__(self, n_components):
self.n_components = n_components
def fit(self, X, y=None):
return self
def transform(self, X):
return X
@register_node("dummy_regressor")
class DummyRegressorNode(BaseEstimator, RegressorMixin):
def __init__(self, coef):
self.coef = np.array(coef)
def fit(self, X, y):
return self
def predict(self, X):
return X @ self.coef
# create pipeline from config
config_file_path = Path("model_config.yml")
pipeline: Pipeline = build_pipeline_from(config_file_path)
pprint(pipeline)
```

We define three tasks (nodes) using the `register_node` decorator. The decorator itself is defined in `model_manager/registrer.py`:

```python
from typing import Type

NODE_REGISTRY: dict[str, Type] = {}
def register_node(name: str):
    def decorator(cls):
        NODE_REGISTRY[name] = cls
        return cls
    return decorator
```

Whenever a class decorated with `register_node` is imported (either in the main script or through the user package), it is added to `NODE_REGISTRY` as a side effect of the class definition.
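For instance (the `IdentityNode` below is a hypothetical extra node, not part of the toy example), simply defining a decorated class is enough to make it available to the builder:

```python
from sklearn.base import BaseEstimator, TransformerMixin

from model_manager.registrer import NODE_REGISTRY, register_node


@register_node("identity")
class IdentityNode(BaseEstimator, TransformerMixin):
    """Hypothetical no-op node, used only to illustrate registration."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


# registration happens at class-definition time, no explicit call needed
print("identity" in NODE_REGISTRY)  # True
```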
The function that builds the pipeline simply reads the configuration file, loops over the steps and looks in the node registry to build the pipeline:

```python
from pathlib import Path

import yaml
from sklearn.pipeline import Pipeline

from model_manager.registrer import NODE_REGISTRY

def build_pipeline_from(config: Path) -> Pipeline:
    """Build a scikit-learn pipeline from a YAML configuration file.

    Args:
        config: Path to a YAML file whose keys are node names and whose values
            are dictionaries containing the node type and its parameters.

    Returns:
        Pipeline: A scikit-learn Pipeline object constructed from the configuration.
    """
    with open(config) as f:
        config = yaml.safe_load(f)
    steps = []
    for name, cfg in config.items():
        node_type = cfg.pop("type")
        cls = NODE_REGISTRY[node_type]
        instance = cls(**cfg)
        steps.append((name, instance))
    return Pipeline(steps)
```

For the main script shown above, I use this configuration file:

```yaml
scaler:
  type: standard_scaler
reduction:
  type: pca_reduction
  n_components: 2
regressor:
  type: dummy_regressor
  coef: [1.0, 2.0]
```
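For reference (assuming the node classes from `main.py` are in scope), the builder should produce the same object as writing the pipeline out by hand:

```python
from sklearn.pipeline import Pipeline

# hand-written equivalent of what build_pipeline_from builds from model_config.yml
pipeline = Pipeline([
    ("scaler", StandardScalerNode()),
    ("reduction", DummyPCANode(n_components=2)),
    ("regressor", DummyRegressorNode(coef=[1.0, 2.0])),
])
```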
Summary:
I think it would be nice to:
I uploaded the toy manager if you want to try it out:
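As a quick smoke test with random data (this continues the end of `main.py`, so the toy nodes are already registered and `pipeline` already exists), the built pipeline can be fitted and used directly:

```python
import numpy as np  # already imported at the top of main.py

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))  # two features, matching coef: [1.0, 2.0]
y = rng.normal(size=10)

pipeline.fit(X, y)
print(pipeline.predict(X).shape)  # (10,)
```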
-
Thanks a lot for this insight! It could be very nice, but I think we should first converge on some "manual" examples to decide whether pipelines on plaid objects natively make sense.
-
For the moment, I make a deepcopy of the dataset so that each node of the pipeline outputs a clean copy of the modified dataset and leaves its input dataset unmodified. This is a safe mechanism, but quite expensive (a sketch of this copy-per-node behaviour is shown after the pros/cons below). In particular, applying the pipeline twice to the same dataset does not return different results. An alternative could be:
Pros:
Cons:
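To make the current copy-per-node behaviour concrete, here is a minimal sketch; the dictionary "dataset" and the `CopyingNode` class are placeholders, not the actual plaid types:

```python
import copy


class CopyingNode:
    """Toy stand-in for a pipeline node that never mutates its input dataset."""

    def transform(self, dataset):
        # work on a deep copy so the caller's dataset is left untouched;
        # safe, but the copy is paid again at every node of the pipeline
        dataset = copy.deepcopy(dataset)
        dataset["scaled"] = True  # placeholder for the real transformation
        return dataset


original = {"scaled": False}
result = CopyingNode().transform(original)
print(original)  # {'scaled': False} -> input unchanged
print(result)    # {'scaled': True}
```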
-
A relatively clean attempt is available here: https://github.com/PLAID-lib/plaid/tree/pipefunc_tests/examples/pipelines