
[python-package] require scikit-learn>=0.24.2, make scikit-learn estimators compatible with scikit-learn>=1.6.0dev #6651

Open
wants to merge 38 commits into base: master

Conversation

vnherdeiro
Contributor

@vnherdeiro vnherdeiro commented Sep 11, 2024

Fixes #6653

(edit: taken over by @jameslamb, description re-written below)

  • raises minimum supported version to scikit-learn>=0.24.2
  • implements __sklearn_tags__() (replacement for _more_tags()) for scikit-learn estimators
  • starts using sklearn.utils.validation.validate_data() in fit() and predict()
  • adds tests confirming that scikit-learn estimators reject inputs with the wrong number of features

Notes for Reviewers

see https://scikit-learn.org/dev/whats_new/v1.6.html and scikit-learn/scikit-learn#29677

@vnherdeiro
Contributor Author

Update:

The change introduced in scikit-learn/scikit-learn#29677 makes it hard to subclass a sklearn estimator in a codebase while staying compatible with both sklearn < 1.6.0 and sklearn >= 1.6.0. Essentially, the former looks up ._more_tags() and ignores __sklearn_tags__(), while the latter looks up __sklearn_tags__() and forbids the existence of a ._more_tags() method.

The issue is discussed here:
scikit-learn/scikit-learn#29801

and it looks like the prohibition on having both ._more_tags() and __sklearn_tags__() simultaneously will be relaxed. If that goes through, let's park this PR until lightgbm decides to require scikit-learn>=1.6.0.

@adrinjalali

@vnherdeiro note that it's already possible to support both with this method (scikit-learn/scikit-learn#29677 (comment)); however, the version check and @available_if will become unnecessary once we merge scikit-learn/scikit-learn#29801

@vnherdeiro
Contributor Author

vnherdeiro commented Sep 12, 2024 via email

@jameslamb
Collaborator

jameslamb commented Sep 15, 2024

Thanks for starting on this @vnherdeiro . I've documented it in an issue: #6653 (and added that to the PR description).

Note there that I intentionally put the exact error messages in plain text instead of just referring to _more_tags() ... that helps people find this work from search engines.

Note also that the _more_tags() change is only 1 of 3 breaking changes in scikit-learn that lightgbm will have to adjust to in order to get those tests passing again with scikit-learn==1.6.0.

Collaborator

@jameslamb jameslamb left a comment

Thanks for starting on this! Please see scikit-learn/scikit-learn#29801 (comment):

The story becomes "If you want to support multiple scikit-learn versions, define both."

I think we should leave _more_tags() untouched and add __sklearn_tags__(). And have self.__sklearn_tags__() call self._more_tags() to get its data, so we don't define things like _xfail_checks twice.

Do you have time to do that in the next few days? We need to fix this to unblock CI here, so if you don't have time to fix it this week please let me know and I will work on this.
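The "define both" approach could be sketched roughly as follows. This is a minimal, scikit-learn-free illustration: the stand-in base class and the dict-shaped tags are assumptions made so the sketch is self-contained (on real scikit-learn>=1.6, `__sklearn_tags__()` returns a `Tags` dataclass, not a dict), and the class names are hypothetical.

```python
# Sketch of the "define both" pattern: _more_tags() remains the single
# source of truth, and __sklearn_tags__() translates it for newer
# scikit-learn. A mock base class replaces sklearn.base.BaseEstimator so
# this runs without scikit-learn installed.

class _MockBaseEstimator:
    """Stand-in for sklearn.base.BaseEstimator (assumption for this sketch)."""

    def __sklearn_tags__(self):
        # scikit-learn>=1.6 returns a Tags dataclass; a dict stands in here.
        return {"allow_nan": False, "X_types": ["2darray"]}


class MyEstimator(_MockBaseEstimator):
    def _more_tags(self):
        # Read by scikit-learn<1.6 (and reused by __sklearn_tags__ below).
        return {"allow_nan": True, "X_types": ["2darray", "sparse"]}

    def __sklearn_tags__(self):
        # Read by scikit-learn>=1.6; defer to _more_tags() so values like
        # _xfail_checks are defined only once.
        tags = super().__sklearn_tags__()
        tags.update(self._more_tags())
        return tags


print(MyEstimator().__sklearn_tags__()["allow_nan"])  # True
```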

@jameslamb jameslamb changed the title __sklearn_tags__ replacing sklearn's BaseEstimator._more_tags_ [python-package] make scikit-learn tags compatible with scikit-learn>=1.16 Sep 15, 2024
@jameslamb jameslamb changed the title [python-package] make scikit-learn tags compatible with scikit-learn>=1.16 [python-package] make scikit-learn estimator tags compatible with scikit-learn>=1.16 Sep 15, 2024
@vnherdeiro
Contributor Author

@jameslamb I've just pushed a __sklearn_tags__() implementation that converts the tags from _more_tags(). I added a warning for tag arguments outside the currently handled scope, to catch any future change to the arguments in _more_tags() (they don't seem to change much, though).

@vnherdeiro vnherdeiro changed the title [python-package] make scikit-learn estimator tags compatible with scikit-learn>=1.16 [python-package] make scikit-learn estimator tags compatible with scikit-learn>=1.6.0dev Sep 15, 2024
@adrinjalali adrinjalali left a comment

Not a maintainer here, but coming from sklearn side. Leaving thoughts hoping it'd help.

Collaborator

@jameslamb jameslamb left a comment

Thanks for this.

I've reviewed the dataclasses at https://github.com/scikit-learn/scikit-learn/blob/e2ee93156bd3692722a39130c011eea313628690/sklearn/utils/_tags.py and agree with the choices you've made about how to map the dictionary-formatted values from _more_tags() to the dataclass attributes scikit-learn now prefers.

Please see the other comments about simplifying this.

@jameslamb
Collaborator

This is not ready for another review yet (I've moved it back to draft to make that clearer).

I've just pushed what I have so you can see where I'm going. Calling validate_data() has introduced other forms of complexity... for example, it now matters much more exactly where and when self.n_features_in_ is populated. Or maybe it always mattered, and lightgbm was just silently not quite satisfying scikit-learn's expectations 😫

I'll try to continue with this tomorrow.

@hcho3 @trivialfis if you are not already testing xgboost against scikit-learn==1.6.dev0 nightlies, I recommend trying it... this has opened up a lot of changes required for lightgbm. I can also help with xgboost after this, if you'd like.

@jameslamb jameslamb dismissed their stale review October 5, 2024 05:26

dismissing my own review, now that I've taken over this PR

@jameslamb jameslamb changed the title WIP: [python-package] make scikit-learn estimator tags compatible with scikit-learn>=1.6.0dev [python-package] make scikit-learn estimator tags compatible with scikit-learn>=1.6.0dev Oct 5, 2024
@jameslamb jameslamb marked this pull request as ready for review October 5, 2024 06:25
@jameslamb
Collaborator

Ok, this is ready for another review!

But understand if reviewers would like to wait until CI is fixed first before reviewing (#6663).

@trivialfis trivialfis left a comment

The change looks great! Thank you for the heads-up

Collaborator

@StrikerRUS StrikerRUS left a comment

Very impressive work!
I left some minor comments below:

@@ -144,6 +147,32 @@ def _get_weight_from_constructed_dataset(dataset: Dataset) -> Optional[np.ndarra
return weight


def _num_features_for_raw_input(X: _LGBM_ScikitMatrixLike) -> int:
Collaborator

_num_features() was added in version 0.24:
scikit-learn/scikit-learn@b4d5ad6
I think we can move this into compat.py and try to import _num_features() first, then in case of ImportError emulate it with this function.

This approach will benefit automatically from upstream updates to _num_features() in future versions.

Collaborator

Very good suggestion, thanks! I attempted this and found that it exposed some other complexity, which I've tried to describe in code comments and these inline GitHub comments:

To simplify the implementation a bit, I'm now also proposing:

  • calling validate_data(reset=True), which will internally call _num_features() on scikit-learn>=1.6
  • directly and unconditionally importing _num_features() and calling it in the pre-1.6 validate_data() implemented in compat.py (so no separate implementation to maintain!)
  • raising lightgbm's scikit-learn floor to >=0.24.2 so users will always have a version at runtime with _num_features() defined

The new floor of scikit-learn>=0.24.2 should not impact users much. That version was released in April 2021 and did not have wheels for Python versions newer than 3.9 (PyPI release page), so I think it's unlikely that many people will try to use the next release of lightgbm with such an old version of scikit-learn.

But this is the first time we've had a floor on that dependency, so for awareness: cc @borchero @jmoralez @guolinke @shiyu1994

# _LGBMModelBase.__sklearn_tags__() cannot be called unconditionally,
# because that method isn't defined for scikit-learn<1.6
if not hasattr(_LGBMModelBase, "__sklearn_tags__"):
from sklearn import __version__ as sklearn_version
Collaborator

I think we can safely import this in compat.py.

__version__ was in __init__.py at least in 2011 year:
https://github.com/scikit-learn/scikit-learn/blob/dacdd3ad7b455a46b5e344ecfeaf5a369b554860/sklearn/__init__.py#L50

Collaborator

I was originally thinking that it'd be good for users to not incur the cost of this import when it's only needed in an error message... but I guess since it's a top-level attribute of sklearn, it will already have been imported anyway by the time any other sklearn imports have run.

I've moved this to compat.py, thanks for the suggestion.

# its scikit-learn estimators, for consistency with scikit-learn's own behavior.
@pytest.mark.parametrize("predict_disable_shape_check", [True, False])
def test_predict_rejects_inputs_with_incorrect_number_of_features(predict_disable_shape_check):
X, y, _ = _create_data(task="regression", n_features=4)
Collaborator

I think it's better to use classification here and test predict_proba() as well.

Collaborator

great point, I agree. Thinking about it more, I think there's good reason to just parametrize this over all tasks, since each is slightly different (e.g. classification has 2 prediction methods, predict() and predict_proba(), and ranking inherits directly from LGBMModel with no mixin).

Collaborator

I've updated this test to use classification, regression, and ranking... and to call both predict() and predict_proba(). Let me know if you think it looks too complicated with all the if statements and I'll happily change it to just classification.

# because scikit-learn's 'check_fit1d' estimator check sets the expectation that
# estimators must raise a ValueError when a 1-dimensional input is passed to fit().
#
# So here, lightgbm avoids calling _num_features() on 1-dimensional inputs.
Collaborator

@@ -1067,6 +1137,21 @@ def n_features_in_(self) -> int:
raise LGBMNotFittedError("No n_features_in found. Need to call fit beforehand.")
return self._n_features_in

@n_features_in_.setter
Collaborator

If you pass reset=True to sklearn.utils.validation.validate_data(), it will (among other things) set estimator.n_features_in_.

We want the "set estimator.n_features_in_" behavior, because without it we have to manually set estimator.n_features_in_ in fit().

Doing that requires determining the number of features in X, which requires either re-implementing something like sklearn.utils.validation._num_features() (as I originally tried to do) or just calling that function directly. But that function can't safely be called directly before calling check_array(), because it raises a TypeError on 1-D inputs, which violates the check_fit1d estimator check (code link).

So here, I'm proposing that we do the following:

  • add a setter for n_features_in_ and a deleter for feature_names_in_
  • pass reset=True at fit() time to validate_data()
  • modify the pre-1.6 implementation of validate_data() in compat.py to match

@jameslamb jameslamb changed the title [python-package] make scikit-learn estimator tags compatible with scikit-learn>=1.6.0dev [python-package] require scikit-learn>=0.24.2, make scikit-learn estimators compatible with scikit-learn>=1.6.0dev Oct 6, 2024
@jameslamb
Collaborator

@StrikerRUS your comments were definitely not "minor", they really helped a lot! I've re-thought a lot of this PR based on trying to implement those suggestions.

This is ready for another review. Thank you for all your reviewing effort here, I know this change has become quite complex and there are many competing constraints it's trying to satisfy.


Successfully merging this pull request may close these issues.

[ci] [python-package] scikit-learn compatibility tests fail with scikit-learn 1.6.dev0
5 participants