Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separated pandas and numpy implementations of sklearn. #21803

Merged
merged 9 commits into from
Jun 13, 2022

Conversation

ryanthompson591
Copy link
Contributor

@ryanthompson591 ryanthompson591 commented Jun 10, 2022

Pandas and numpy implementations of sklearn runinference handler are now separate implementations.

For example, of pandas will now call the runinference in this fashion.

beam.RunInference(SklearnModelHandlerPandas(model_uri='gs://project/my_pandas_sklearn_pipeline.pkl')

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@ryanthompson591
Copy link
Contributor Author

R: @robertwb R: @yeandy R: @TheNeuralBit

@TheNeuralBit
Copy link
Member

Should we remove the NOTE about backwards incompatibility on the numpy implementation?

@asf-ci
Copy link

asf-ci commented Jun 10, 2022

Can one of the admins verify this patch?

1 similar comment
@asf-ci
Copy link

asf-ci commented Jun 10, 2022

Can one of the admins verify this patch?

@ryanthompson591
Copy link
Contributor Author

yeah, let's remove the notes about backwards compatability here and everywhere, since the whole point of this is to try and set our API as supported.

@@ -94,9 +91,6 @@ class SklearnModelHandlerPandas(ModelHandler[pandas.DataFrame,
BaseEstimator]):
""" Implementation of the ModelHandler interface for scikit-learn that
supports pandas dataframes.

NOTE: This API and its implementation are under development and
do not provide backward compatibility guarantees.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @robertwb wanted to keep the implementations with named inputs marked experimental

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. We're moving to marking those things that are experimental at a more fine-grained level, and the named inputs should fall into that class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I can leave this note here, but I don't really see support for pandas dataframes as something separate. Sklearn users will want it as much as numpy array support, since sklearn models are built on top of have named inputs I'm not really understanding why we would modify our support for numpy but not sklearn arrays.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a question of what is likely to change. I am 99% we'll want to change the way we handle dataframes, not so sure about numpy. We could call it safe and mark both (as long as there's still enough meat in the "non-experimental" portions).

Copy link
Member

@TheNeuralBit TheNeuralBit Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more specific on why this might change - now that the batching DoFn infrastructure is in, I'd like to make the pandas sklearn implementation leverage it. We'd move to a model where the element type is Beam Row (with schema), and the batch type is a pandas DataFrame. As opposed to the current model where the batch type is a list of single element dataframes.

Once we do that we could pass data from the DataFrame API (under the hood a PCollection[pd.DataFrame]) directly to RunInference, without having to unbatch it and then batch it back up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. I agree with this change. Let's do that as part of a separate PR.

Copy link
Member

@TheNeuralBit TheNeuralBit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Unfortunately there are issues with Jenkins right now, and it looks like this needs a rebase/merge.

@codecov
Copy link

codecov bot commented Jun 11, 2022

Codecov Report

Merging #21803 (23650ba) into master (986fb40) will decrease coverage by 0.01%.
The diff coverage is 95.45%.

@@            Coverage Diff             @@
##           master   #21803      +/-   ##
==========================================
- Coverage   74.15%   74.13%   -0.02%     
==========================================
  Files         698      698              
  Lines       92417    92400      -17     
==========================================
- Hits        68530    68503      -27     
- Misses      22636    22646      +10     
  Partials     1251     1251              
Flag Coverage Δ
python 83.75% <95.45%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...thon/apache_beam/ml/inference/sklearn_inference.py 94.54% <95.45%> (+1.21%) ⬆️
.../python/apache_beam/testing/test_stream_service.py 88.09% <0.00%> (-4.77%) ⬇️
...n/apache_beam/ml/gcp/recommendations_ai_test_it.py 73.46% <0.00%> (-2.05%) ⬇️
.../python/apache_beam/transforms/periodicsequence.py 96.72% <0.00%> (-1.64%) ⬇️
sdks/python/apache_beam/io/source_test_utils.py 88.01% <0.00%> (-1.39%) ⬇️
...che_beam/runners/interactive/interactive_runner.py 90.06% <0.00%> (-1.33%) ⬇️
...eam/runners/portability/fn_api_runner/execution.py 92.44% <0.00%> (-0.65%) ⬇️
...examples/inference/pytorch_image_classification.py 0.00% <0.00%> (ø)
sdks/python/apache_beam/ml/inference/api.py
sdks/python/apache_beam/ml/inference/__init__.py 100.00% <0.00%> (ø)
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 986fb40...23650ba. Read the comment docs.

@ryanthompson591
Copy link
Contributor Author

This ran well locally for me. Have. one more look and merge if it looks fine.

@TheNeuralBit
Copy link
Member

Could you also run DirectRunner tests locally, since we don't have Jenkins running PostCommits?

@TheNeuralBit
Copy link
Member

@ryanthompson591 indicated offfline that those tests (./gradlew :sdks:python:test-suites:direct:py37:inferencePostCommitIT) shouldn't be affected since they only run pytorch examples now. He also confirmed that lint, mypy, and yapf are happy. I will merge when GitHub Actions finishes

@TheNeuralBit
Copy link
Member

Lint PreCommit is running now and failed:

11:12:05  from apache_beam.ml.inference.base import PredictionResult
11:12:05  from apache_beam.ml.inference.base import RunInference
11:12:05  from apache_beam.ml.inference.sklearn_inference import ModelFileType
11:12:05 +from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy
11:12:05  from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerPandas
11:12:05 -from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy
11:12:05  from apache_beam.testing.test_pipeline import TestPipeline
11:12:05  from apache_beam.testing.util import assert_that
11:12:05  from apache_beam.testing.util import equal_to

@ryanthompson591
Copy link
Contributor Author

ryanthompson591 commented Jun 13, 2022 via email

@TheNeuralBit
Copy link
Member

I can help trace the code so you know what to add to your hooks :)

The precommit runs the py37-lint tox config:

beam/sdks/python/tox.ini

Lines 107 to 120 in d2fb942

[testenv:py37-lint]
# Don't set TMPDIR to avoid "AF_UNIX path too long" errors in pylint.
setenv =
# keep the version of pylint in sync with the 'rev' in .pre-commit-config.yaml
deps =
-r build-requirements.txt
astroid<2.9,>=2.8.0
pycodestyle==2.3.1
pylint==2.11.1
isort==4.2.15
flake8==3.5.0
commands =
pylint --version
time {toxinidir}/scripts/run_pylint.sh

Which runs the run_pylint.sh script, which runs isort with this definition:

isort ${MODULE} -p apache_beam --line-width 120 --check-only --order-by-type \
--combine-star --force-single-line-imports --diff --recursive ${SKIP_PARAM}

@TheNeuralBit TheNeuralBit merged commit 383c08d into apache:master Jun 13, 2022
bullet03 pushed a commit to akvelon/beam that referenced this pull request Jun 20, 2022
* separated pandas and numpy implementations

* separated pandas and numpy implementations

* Update sdks/python/apache_beam/ml/inference/sklearn_inference.py

Co-authored-by: Brian Hulette <hulettbh@gmail.com>

* fixed unit test

* merged to fix conflicts

* fixed import order

Co-authored-by: Brian Hulette <hulettbh@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants