Separated pandas and numpy implementations of sklearn. #21803
Merged: TheNeuralBit merged 9 commits into apache:master from ryanthompson591:update_sklearn_handlers on Jun 13, 2022
cd098c2 separated pandas and numpy implementations (ryanthompson591)
aec5d2d separated pandas and numpy implementations (ryanthompson591)
6487bbd merged and fixed (ryanthompson591)
efd3c63 merged (ryanthompson591)
aff4d40 Update sdks/python/apache_beam/ml/inference/sklearn_inference.py (ryanthompson591)
64a55fb fixed unit test (ryanthompson591)
aff1d8a merged to fix conflicts (ryanthompson591)
4e45015 merged to fix conflicts (ryanthompson591)
23650ba fixed import order (ryanthompson591)
I think @robertwb wanted to keep the implementations with named inputs marked experimental
+1. We're moving to marking those things that are experimental at a more fine-grained level, and the named inputs should fall into that class.
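As a rough illustration of what marking things experimental "at a more fine-grained level" could look like, here is a minimal sketch using only the Python standard library. The `experimental` decorator and both function names are hypothetical and are not Beam's actual annotation API; the point is only that the named-input (pandas) entry point carries the marker while the numpy one does not.

```python
import functools
import warnings


def experimental(reason):
    """Hypothetical marker: warn whenever the decorated function is called."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is experimental: {reason}",
                FutureWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


def run_numpy_inference(batch):
    # Hypothetical stand-in for the numpy path, left unmarked.
    return [sum(row) for row in batch]


@experimental("handling of named inputs may change")
def run_pandas_inference(batch):
    # Hypothetical stand-in for the named-input path, marked on its own.
    return [sum(row.values()) for row in batch]
```

Calling `run_pandas_inference` emits a `FutureWarning`, while `run_numpy_inference` stays silent, so callers can see exactly which part of the surface is still subject to change.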
OK. I can leave this note here, but I don't really see support for pandas dataframes as something separate. Sklearn users will want it as much as numpy array support, since sklearn models can have named inputs. I don't really understand why we would modify our support for numpy but not sklearn arrays.
It's a question of what is likely to change. I am 99% sure we'll want to change the way we handle dataframes; I'm not so sure about numpy. We could play it safe and mark both (as long as there's still enough meat in the "non-experimental" portions).
To be more specific on why this might change: now that the batching DoFn infrastructure is in, I'd like to make the pandas sklearn implementation leverage it. We'd move to a model where the element type is a Beam Row (with schema) and the batch type is a pandas DataFrame, as opposed to the current model where the batch type is a list of single-element DataFrames.
Once we do that, we could pass data from the DataFrame API (under the hood a `PCollection[pd.DataFrame]`) directly to RunInference, without having to unbatch it and then batch it back up.
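The difference between the two batch shapes described in this comment can be sketched with plain pandas, without any Beam code. This is only an illustration under assumptions: `toy_predict` is a hypothetical stand-in for a model's predict call, and the two helper functions are invented names, not part of RunInference.

```python
import pandas as pd


def toy_predict(frame):
    # Hypothetical stand-in for model.predict on a DataFrame with named columns.
    return (frame["a"] + frame["b"]).tolist()


def predict_from_list_of_frames(batch):
    # Current model as described: the batch is a list of single-row DataFrames,
    # so they are concatenated into one frame before predicting.
    combined = pd.concat(batch, ignore_index=True)
    return toy_predict(combined)


def predict_from_frame(batch):
    # Proposed model as described: the batch already is one DataFrame.
    return toy_predict(batch)


rows = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
list_batch = [pd.DataFrame([row]) for row in rows]
frame_batch = pd.DataFrame(rows)

list_result = predict_from_list_of_frames(list_batch)
frame_result = predict_from_frame(frame_batch)
```

Both paths yield the same predictions; the second simply skips the per-batch concatenation, which is the unbatch-and-rebatch cost the comment wants to avoid.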
yeah. I agree with this change. Let's do that as part of a separate PR.