Separated pandas and numpy implementations of sklearn. #21803

ryanthompson591 · 2022-06-10T19:15:15Z

Pandas and numpy implementations of sklearn runinference handler are now separate implementations.

For example, of pandas will now call the runinference in this fashion.

beam.RunInference(SklearnModelHandlerPandas(model_uri='gs://project/my_pandas_sklearn_pipeline.pkl')

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

Choose reviewer(s) and mention them in a comment (R: @username).
Mention the appropriate issue in your description (for example: "addresses [BEAM-121] Add DisplayData for IO transforms #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment "fixes #" instead.
Update CHANGES.md with noteworthy changes.
If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

See CI.md for more information about GitHub Actions CI.

ryanthompson591 · 2022-06-10T19:18:21Z

R: @robertwb R: @yeandy R: @TheNeuralBit

TheNeuralBit · 2022-06-10T19:24:09Z

Should we remove the NOTE about backwards incompatibility on the numpy implementation?

asf-ci · 2022-06-10T19:25:18Z

Can one of the admins verify this patch?

asf-ci · 2022-06-10T19:25:18Z

Can one of the admins verify this patch?

ryanthompson591 · 2022-06-10T19:56:03Z

yeah, let's remove the notes about backwards compatability here and everywhere, since the whole point of this is to try and set our API as supported.

TheNeuralBit · 2022-06-10T20:35:38Z

sdks/python/apache_beam/ml/inference/sklearn_inference.py

@@ -94,9 +91,6 @@ class SklearnModelHandlerPandas(ModelHandler[pandas.DataFrame,
                                             BaseEstimator]):
  """ Implementation of the ModelHandler interface for scikit-learn that
      supports pandas dataframes.
-
-      NOTE: This API and its implementation are under development and
-      do not provide backward compatibility guarantees.


I think @robertwb wanted to keep the implementations with named inputs marked experimental

+1. We're moving to marking those things that are experimental at a more fine-grained level, and the named inputs should fall into that class.

OK. I can leave this note here, but I don't really see support for pandas dataframes as something separate. Sklearn users will want it as much as numpy array support, since sklearn models are built on top of have named inputs I'm not really understanding why we would modify our support for numpy but not sklearn arrays.

It's a question of what is likely to change. I am 99% we'll want to change the way we handle dataframes, not so sure about numpy. We could call it safe and mark both (as long as there's still enough meat in the "non-experimental" portions).

To be more specific on why this might change - now that the batching DoFn infrastructure is in, I'd like to make the pandas sklearn implementation leverage it. We'd move to a model where the element type is Beam Row (with schema), and the batch type is a pandas DataFrame. As opposed to the current model where the batch type is a list of single element dataframes.

Once we do that we could pass data from the DataFrame API (under the hood a PCollection[pd.DataFrame]) directly to RunInference, without having to unbatch it and then batch it back up.

yeah. I agree with this change. Let's do that as part of a separate PR.

sdks/python/apache_beam/ml/inference/sklearn_inference.py

TheNeuralBit

LGTM. Unfortunately there are issues with Jenkins right now, and it looks like this needs a rebase/merge.

sdks/python/apache_beam/ml/inference/sklearn_inference.py

codecov · 2022-06-11T01:22:54Z

Codecov Report

Merging #21803 (23650ba) into master (986fb40) will decrease coverage by 0.01%.
The diff coverage is 95.45%.

@@            Coverage Diff             @@
##           master   #21803      +/-   ##
==========================================
- Coverage   74.15%   74.13%   -0.02%     
==========================================
  Files         698      698              
  Lines       92417    92400      -17     
==========================================
- Hits        68530    68503      -27     
- Misses      22636    22646      +10     
  Partials     1251     1251

Flag	Coverage Δ
python	`83.75% <95.45%> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...thon/apache_beam/ml/inference/sklearn_inference.py	`94.54% <95.45%> (+1.21%)`	⬆️
.../python/apache_beam/testing/test_stream_service.py	`88.09% <0.00%> (-4.77%)`	⬇️
...n/apache_beam/ml/gcp/recommendations_ai_test_it.py	`73.46% <0.00%> (-2.05%)`	⬇️
.../python/apache_beam/transforms/periodicsequence.py	`96.72% <0.00%> (-1.64%)`	⬇️
sdks/python/apache_beam/io/source_test_utils.py	`88.01% <0.00%> (-1.39%)`	⬇️
...che_beam/runners/interactive/interactive_runner.py	`90.06% <0.00%> (-1.33%)`	⬇️
...eam/runners/portability/fn_api_runner/execution.py	`92.44% <0.00%> (-0.65%)`	⬇️
...examples/inference/pytorch_image_classification.py	`0.00% <0.00%> (ø)`
sdks/python/apache_beam/ml/inference/api.py
sdks/python/apache_beam/ml/inference/__init__.py	`100.00% <0.00%> (ø)`
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 986fb40...23650ba. Read the comment docs.

Co-authored-by: Brian Hulette <hulettbh@gmail.com>

ryanthompson591 · 2022-06-13T15:35:27Z

This ran well locally for me. Have. one more look and merge if it looks fine.

TheNeuralBit · 2022-06-13T15:38:10Z

Could you also run DirectRunner tests locally, since we don't have Jenkins running PostCommits?

TheNeuralBit · 2022-06-13T17:09:36Z

@ryanthompson591 indicated offfline that those tests (./gradlew :sdks:python:test-suites:direct:py37:inferencePostCommitIT) shouldn't be affected since they only run pytorch examples now. He also confirmed that lint, mypy, and yapf are happy. I will merge when GitHub Actions finishes

TheNeuralBit · 2022-06-13T18:53:21Z

Lint PreCommit is running now and failed:

11:12:05  from apache_beam.ml.inference.base import PredictionResult
11:12:05  from apache_beam.ml.inference.base import RunInference
11:12:05  from apache_beam.ml.inference.sklearn_inference import ModelFileType
11:12:05 +from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy
11:12:05  from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerPandas
11:12:05 -from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy
11:12:05  from apache_beam.testing.test_pipeline import TestPipeline
11:12:05  from apache_beam.testing.util import assert_that
11:12:05  from apache_beam.testing.util import equal_to

ryanthompson591 · 2022-06-13T19:13:40Z

Yep fixed. I'm not sure why my link hooks don't verify import order.

…

On Mon, Jun 13, 2022, 2:53 PM Brian Hulette ***@***.***> wrote: Lint PreCommit is running now and failed: 11:12:05 from apache_beam.ml.inference.base import PredictionResult 11:12:05 from apache_beam.ml.inference.base import RunInference 11:12:05 from apache_beam.ml.inference.sklearn_inference import ModelFileType 11:12:05 +from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy 11:12:05 from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerPandas 11:12:05 -from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy 11:12:05 from apache_beam.testing.test_pipeline import TestPipeline 11:12:05 from apache_beam.testing.util import assert_that 11:12:05 from apache_beam.testing.util import equal_to — Reply to this email directly, view it on GitHub <#21803 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABS23CRJ4GTJBIBN3BQF2OLVO57SXANCNFSM5YOTLHWA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

TheNeuralBit · 2022-06-13T20:24:51Z

I can help trace the code so you know what to add to your hooks :)

The precommit runs the py37-lint tox config:

beam/sdks/python/tox.ini

Lines 107 to 120 in d2fb942

    
           [testenv:py37-lint] 
        
           # Don't set TMPDIR to avoid "AF_UNIX path too long" errors in pylint. 
        
           setenv = 
        
           # keep the version of pylint in sync with the 'rev' in .pre-commit-config.yaml 
        
           deps = 
        
             -r build-requirements.txt 
        
             astroid<2.9,>=2.8.0 
        
             pycodestyle==2.3.1 
        
             pylint==2.11.1 
        
             isort==4.2.15 
        
             flake8==3.5.0 
        
           commands = 
        
             pylint --version 
        
             time {toxinidir}/scripts/run_pylint.sh

Which runs the run_pylint.sh script, which runs isort with this definition:

beam/sdks/python/scripts/run_pylint.sh

Lines 109 to 110 in d2fb942

    
           isort ${MODULE} -p apache_beam --line-width 120 --check-only --order-by-type \ 
        
               --combine-star --force-single-line-imports --diff --recursive ${SKIP_PARAM}

* separated pandas and numpy implementations * separated pandas and numpy implementations * Update sdks/python/apache_beam/ml/inference/sklearn_inference.py Co-authored-by: Brian Hulette <hulettbh@gmail.com> * fixed unit test * merged to fix conflicts * fixed import order Co-authored-by: Brian Hulette <hulettbh@gmail.com>

separated pandas and numpy implementations

cd098c2

github-actions bot added the python label Jun 10, 2022

separated pandas and numpy implementations

aec5d2d

TheNeuralBit reviewed Jun 10, 2022

View reviewed changes

robertwb reviewed Jun 10, 2022

View reviewed changes

sdks/python/apache_beam/ml/inference/sklearn_inference.py Outdated Show resolved Hide resolved

merged and fixed

6487bbd

TheNeuralBit approved these changes Jun 10, 2022

View reviewed changes

sdks/python/apache_beam/ml/inference/sklearn_inference.py Show resolved Hide resolved

TheNeuralBit mentioned this pull request Jun 10, 2022

Split PytorchModelHandler into PytorchModelHandlerTensor and PytorchModelHandlerKeyedTensor #21810

Merged

4 tasks

ryanthompson591 and others added 3 commits June 13, 2022 09:40

merged

efd3c63

Update sdks/python/apache_beam/ml/inference/sklearn_inference.py

aff4d40

Co-authored-by: Brian Hulette <hulettbh@gmail.com>

fixed unit test

64a55fb

ryanthompson591 added 2 commits June 13, 2022 13:47

merged to fix conflicts

aff1d8a

merged to fix conflicts

4e45015

fixed import order

23650ba

TheNeuralBit merged commit 383c08d into apache:master Jun 13, 2022

tvalentyn mentioned this pull request Jun 15, 2022

[Task]: Add input type hint to inference model loader #21769

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Separated pandas and numpy implementations of sklearn. #21803

Separated pandas and numpy implementations of sklearn. #21803

ryanthompson591 commented Jun 10, 2022 •

edited by TheNeuralBit

Loading

ryanthompson591 commented Jun 10, 2022

TheNeuralBit commented Jun 10, 2022

asf-ci commented Jun 10, 2022

asf-ci commented Jun 10, 2022

ryanthompson591 commented Jun 10, 2022

TheNeuralBit Jun 10, 2022

robertwb Jun 10, 2022

ryanthompson591 Jun 10, 2022

robertwb Jun 10, 2022

TheNeuralBit Jun 10, 2022 •

edited

Loading

ryanthompson591 Jun 13, 2022

TheNeuralBit left a comment

codecov bot commented Jun 11, 2022 •

edited

Loading

ryanthompson591 commented Jun 13, 2022

TheNeuralBit commented Jun 13, 2022

TheNeuralBit commented Jun 13, 2022

TheNeuralBit commented Jun 13, 2022

ryanthompson591 commented Jun 13, 2022 via email

TheNeuralBit commented Jun 13, 2022

Separated pandas and numpy implementations of sklearn. #21803

Separated pandas and numpy implementations of sklearn. #21803

Conversation

ryanthompson591 commented Jun 10, 2022 • edited by TheNeuralBit Loading

GitHub Actions Tests Status (on master branch)

ryanthompson591 commented Jun 10, 2022

TheNeuralBit commented Jun 10, 2022

asf-ci commented Jun 10, 2022

asf-ci commented Jun 10, 2022

ryanthompson591 commented Jun 10, 2022

TheNeuralBit Jun 10, 2022

Choose a reason for hiding this comment

robertwb Jun 10, 2022

Choose a reason for hiding this comment

ryanthompson591 Jun 10, 2022

Choose a reason for hiding this comment

robertwb Jun 10, 2022

Choose a reason for hiding this comment

TheNeuralBit Jun 10, 2022 • edited Loading

Choose a reason for hiding this comment

ryanthompson591 Jun 13, 2022

Choose a reason for hiding this comment

TheNeuralBit left a comment

Choose a reason for hiding this comment

codecov bot commented Jun 11, 2022 • edited Loading

Codecov Report

ryanthompson591 commented Jun 13, 2022

TheNeuralBit commented Jun 13, 2022

TheNeuralBit commented Jun 13, 2022

TheNeuralBit commented Jun 13, 2022

ryanthompson591 commented Jun 13, 2022 via email

TheNeuralBit commented Jun 13, 2022

ryanthompson591 commented Jun 10, 2022 •

edited by TheNeuralBit

Loading

TheNeuralBit Jun 10, 2022 •

edited

Loading

codecov bot commented Jun 11, 2022 •

edited

Loading