Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separated pandas and numpy implementations of sklearn. #21803

Merged
merged 9 commits into from
Jun 13, 2022
6 changes: 0 additions & 6 deletions sdks/python/apache_beam/ml/inference/sklearn_inference.py
Original file line number Diff line number Diff line change
Expand Up @@ -61,9 +61,6 @@ class SklearnModelHandlerNumpy(ModelHandler[numpy.ndarray,
BaseEstimator]):
""" Implementation of the ModelHandler interface for scikit-learn
using numpy arrays as input.

NOTE: This API and its implementation are under development and
do not provide backward compatibility guarantees.
"""
def __init__(
self,
Expand Down Expand Up @@ -94,9 +91,6 @@ class SklearnModelHandlerPandas(ModelHandler[pandas.DataFrame,
BaseEstimator]):
""" Implementation of the ModelHandler interface for scikit-learn that
supports pandas dataframes.

NOTE: This API and its implementation are under development and
do not provide backward compatibility guarantees.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @robertwb wanted to keep the implementations with named inputs marked experimental

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. We're moving to marking those things that are experimental at a more fine-grained level, and the named inputs should fall into that class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I can leave this note here, but I don't really see support for pandas dataframes as something separate. Sklearn users will want it as much as numpy array support, since sklearn models are built on top of have named inputs I'm not really understanding why we would modify our support for numpy but not sklearn arrays.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a question of what is likely to change. I am 99% we'll want to change the way we handle dataframes, not so sure about numpy. We could call it safe and mark both (as long as there's still enough meat in the "non-experimental" portions).

Copy link
Member

@TheNeuralBit TheNeuralBit Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more specific on why this might change - now that the batching DoFn infrastructure is in, I'd like to make the pandas sklearn implementation leverage it. We'd move to a model where the element type is Beam Row (with schema), and the batch type is a pandas DataFrame. As opposed to the current model where the batch type is a list of single element dataframes.

Once we do that we could pass data from the DataFrame API (under the hood a PCollection[pd.DataFrame]) directly to RunInference, without having to unbatch it and then batch it back up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. I agree with this change. Let's do that as part of a separate PR.

"""
def __init__(
self,
Expand Down