
[python-package] require scikit-learn>=0.24.2, make scikit-learn estimators compatible with scikit-learn>=1.6.0dev #6651

Open
wants to merge 38 commits into base: master

Conversation

vnherdeiro
Contributor

@vnherdeiro vnherdeiro commented Sep 11, 2024

Fixes #6653

(edit: taken over by @jameslamb, description re-written below)

  • raises minimum supported version to scikit-learn>=0.24.2
  • implements __sklearn_tags__() (replacement for _more_tags()) for scikit-learn estimators
  • starts using sklearn.utils.validation.validate_data() in fit() and predict()
  • adds tests confirming that scikit-learn estimators reject inputs with the wrong number of features

Notes for Reviewers

see https://scikit-learn.org/dev/whats_new/v1.6.html and scikit-learn/scikit-learn#29677

@vnherdeiro
Contributor Author

Update:

The change introduced in scikit-learn/scikit-learn#29677 makes it hard to subclass a sklearn estimator in a codebase while staying compatible with both sklearn < 1.6.0 and sklearn >= 1.6.0. Essentially, the former looks up ._more_tags() and ignores __sklearn_tags__(), while the latter looks up __sklearn_tags__() and forbids the existence of a ._more_tags() method.

The issue is discussed here:
scikit-learn/scikit-learn#29801

and it looks like the prohibition on having both ._more_tags() and __sklearn_tags__() simultaneously will be relaxed. If that goes through, let's park this PR until lightgbm decides to require scikit-learn>=1.6.0.

@adrinjalali

@vnherdeiro note that it's already possible to support both with this method (scikit-learn/scikit-learn#29677 (comment)); however, the version check and @available_if will become unnecessary once we merge scikit-learn/scikit-learn#29801

@vnherdeiro
Contributor Author

vnherdeiro commented Sep 12, 2024 via email

@jameslamb
Collaborator

jameslamb commented Sep 15, 2024

Thanks for starting on this @vnherdeiro . I've documented it in an issue: #6653 (and added that to the PR description).

Note there that I intentionally put the exact error messages in plain text instead of just referring to _more_tags() ... that helps people find this work from search engines.

Note also that the _more_tags() change is only 1 of 3 breaking changes in scikit-learn that lightgbm will have to adjust to in order to get those tests passing again with scikit-learn==1.6.0.

Collaborator

@jameslamb jameslamb left a comment

Thanks for starting on this! Please see scikit-learn/scikit-learn#29801 (comment):

The story becomes "If you want to support multiple scikit-learn versions, define both."

I think we should leave _more_tags() untouched and add __sklearn_tags__(). And have self.__sklearn_tags__() call self._more_tags() to get its data, so we don't define things like _xfail_checks twice.

Do you have time to do that in the next few days? We need to fix this to unblock CI here, so if you don't have time to fix it this week please let me know and I will work on this.
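The "define both" approach could be sketched roughly as follows. This is a minimal, scikit-learn-free illustration: the stand-in base class and the dict-shaped tags are assumptions made so the sketch is self-contained (on real scikit-learn>=1.6, `__sklearn_tags__()` returns a `Tags` dataclass, not a dict), and the class names are hypothetical.

```python
# Sketch of the "define both" pattern: _more_tags() remains the single
# source of truth, and __sklearn_tags__() translates it for newer
# scikit-learn. A mock base class replaces sklearn.base.BaseEstimator so
# this runs without scikit-learn installed.

class _MockBaseEstimator:
    """Stand-in for sklearn.base.BaseEstimator (assumption for this sketch)."""

    def __sklearn_tags__(self):
        # scikit-learn>=1.6 returns a Tags dataclass; a dict stands in here.
        return {"allow_nan": False, "X_types": ["2darray"]}


class MyEstimator(_MockBaseEstimator):
    def _more_tags(self):
        # Read by scikit-learn<1.6 (and reused by __sklearn_tags__ below).
        return {"allow_nan": True, "X_types": ["2darray", "sparse"]}

    def __sklearn_tags__(self):
        # Read by scikit-learn>=1.6; defer to _more_tags() so values like
        # _xfail_checks are defined only once.
        tags = super().__sklearn_tags__()
        tags.update(self._more_tags())
        return tags


print(MyEstimator().__sklearn_tags__()["allow_nan"])  # True
```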

@jameslamb jameslamb changed the title __sklearn_tags__ replacing sklearn's BaseEstimator._more_tags_ [python-package] make scikit-learn tags compatible with scikit-learn>=1.16 Sep 15, 2024
@jameslamb jameslamb changed the title [python-package] make scikit-learn tags compatible with scikit-learn>=1.16 [python-package] make scikit-learn estimator tags compatible with scikit-learn>=1.16 Sep 15, 2024
@vnherdeiro
Contributor Author

@jameslamb I've just pushed a __sklearn_tags__() implementation that converts the tags from _more_tags(). I added a warning for tag arguments outside the currently handled scope, to catch any future change to the arguments in _more_tags() (they don't seem to change much, though).

@vnherdeiro vnherdeiro changed the title [python-package] make scikit-learn estimator tags compatible with scikit-learn>=1.16 [python-package] make scikit-learn estimator tags compatible with scikit-learn>=1.6.0dev Sep 15, 2024
@adrinjalali adrinjalali left a comment

Not a maintainer here, but coming from sklearn side. Leaving thoughts hoping it'd help.

Collaborator

@jameslamb jameslamb left a comment

Thanks for this.

I've reviewed the dataclasses at https://github.com/scikit-learn/scikit-learn/blob/e2ee93156bd3692722a39130c011eea313628690/sklearn/utils/_tags.py and agree with the choices you've made about how to map the dictionary-formatted values from _more_tags() to the dataclass attributes scikit-learn now prefers.

Please see the other comments about simplifying this.

@jameslamb
Collaborator

This is not ready for another review yet (I've moved it back to draft to make that clearer).

I've just pushed what I have so you can see where I'm going. Calling validate_data() has introduced other forms of complexity... for example, it now matters much more exactly where and when self.n_features_in_ is populated. Or maybe it always mattered, and lightgbm was just silently not quite satisfying scikit-learn's expectations 😫

I'll try to continue with this tomorrow.

@hcho3 @trivialfis if you are not already testing xgboost against scikit-learn==1.6.dev0 nightlies, I recommend trying it... this has opened up a lot of changes required for lightgbm. I can also help with xgboost after this, if you'd like.

@jameslamb jameslamb dismissed their stale review October 5, 2024 05:26

dismissing my own review, now that I've taken over this PR

@jameslamb jameslamb changed the title WIP: [python-package] make scikit-learn estimator tags compatible with scikit-learn>=1.6.0dev [python-package] make scikit-learn estimator tags compatible with scikit-learn>=1.6.0dev Oct 5, 2024
@jameslamb jameslamb marked this pull request as ready for review October 5, 2024 06:25
@jameslamb
Collaborator

Ok, this is ready for another review!

But understand if reviewers would like to wait until CI is fixed first before reviewing (#6663).

@trivialfis trivialfis left a comment

The change looks great! Thank you for the heads-up

Collaborator

@StrikerRUS StrikerRUS left a comment

Very impressive work!
I left some minor comments below:

@@ -144,6 +147,32 @@ def _get_weight_from_constructed_dataset(dataset: Dataset) -> Optional[np.ndarra
return weight


def _num_features_for_raw_input(X: _LGBM_ScikitMatrixLike) -> int:
Collaborator

_num_features() was added in version 0.24:
scikit-learn/scikit-learn@b4d5ad6
I think we can move this into compat.py and try to import _num_features() first, then in case of ImportError emulate it with this function.

This approach will benefit automatically from upstream updates to _num_features() in future versions.

Collaborator

Very good suggestion, thanks! I attempted this and found that it exposed some other complexity, which I've tried to describe in code comments and these inline GitHub comments:

To simplify the implementation a bit, I'm now also proposing:

  • calling validate_data(reset=True), which will internally call _num_features() on scikit-learn>=1.6
  • directly and unconditionally importing _num_features() and calling it in the pre-1.6 validate_data() implemented in compat.py (so no separate implementation to maintain!)
  • raising lightgbm's scikit-learn floor to >=0.24.2 so users will always have a version at runtime with _num_features() defined

The new floor of scikit-learn>=0.24.2 should not impact users much. That version was released in April 2021 and did not have wheels for Python versions newer than 3.9 (PyPI release page), so I think it's unlikely that many people will try to use the next release of lightgbm with such an old version of scikit-learn.

But this is the first time we've had a floor on that dependency, so for awareness: cc @borchero @jmoralez @guolinke @shiyu1994

# _LGBMModelBase.__sklearn_tags__() cannot be called unconditionally,
# because that method isn't defined for scikit-learn<1.6
if not hasattr(_LGBMModelBase, "__sklearn_tags__"):
from sklearn import __version__ as sklearn_version
Collaborator

I think we can safely import this in compat.py.

__version__ was in __init__.py at least in 2011 year:
https://github.com/scikit-learn/scikit-learn/blob/dacdd3ad7b455a46b5e344ecfeaf5a369b554860/sklearn/__init__.py#L50

Collaborator

I was originally thinking that it'd be good for users to not incur the cost of this import when it's only needed in an error message... but I guess since it's a top-level attribute of sklearn, it will already have been imported anyway by the time any other sklearn imports have run.

I've moved this to compat.py, thanks for the suggestion.

# its scikit-learn estimators, for consistency with scikit-learn's own behavior.
@pytest.mark.parametrize("predict_disable_shape_check", [True, False])
def test_predict_rejects_inputs_with_incorrect_number_of_features(predict_disable_shape_check):
X, y, _ = _create_data(task="regression", n_features=4)
Collaborator

I think it's better to use classification here and test predict_proba() as well.

Collaborator

great point, I agree. Thinking about it more, I think there's good reason to just parametrize this over all tasks, since each is slightly different (e.g. classification has 2 prediction methods, predict() and predict_proba(), and ranking inherits directly from LGBMModel with no mixin).

Collaborator

I've updated this test to use classification, regression, and ranking... and to call both predict() and predict_proba(). Let me know if you think it looks too complicated with all the if statements and I'll happily change it to just classification.

# because scikit-learn's 'check_fit1d' estimator check sets the expectation that
# estimators must raise a ValueError when a 1-dimensional input is passed to fit().
#
# So here, lightgbm avoids calling _num_features() on 1-dimensional inputs.
Collaborator

@@ -1067,6 +1137,21 @@ def n_features_in_(self) -> int:
raise LGBMNotFittedError("No n_features_in found. Need to call fit beforehand.")
return self._n_features_in

@n_features_in_.setter
Collaborator

If you pass reset=True to sklearn.utils.validation.validate_data(), it will (among other things) set estimator.n_features_in_.

We want the "set estimator.n_features_in_" behavior, because without it we have to manually set estimator.n_features_in_ in fit().

Doing that requires determining the number of features in X, which requires either re-implementing something like sklearn.utils.validation._num_features() (as I originally tried to do) or just calling that function directly. But that function can't safely be called directly before calling check_array(), because it raises a TypeError on 1-D inputs, which violates the check_fit1d estimator check (code link).

So here, I'm proposing that we do the following:

  • add a setter for n_features_in_ and a deleter for feature_names_in_
  • pass reset=True at fit() time to validate_data()
  • modify the pre-1.6 implementation of validate_data() in compat.py to match

@jameslamb jameslamb changed the title [python-package] make scikit-learn estimator tags compatible with scikit-learn>=1.6.0dev [python-package] require scikit-learn>=0.24.2, make scikit-learn estimators compatible with scikit-learn>=1.6.0dev Oct 6, 2024
@jameslamb
Collaborator

@StrikerRUS your comments were definitely not "minor", they really helped a lot! I've re-thought a lot of this PR based on trying to implement those suggestions.

This is ready for another review. Thank you for all your reviewing effort here, I know this change has become quite complex and there are many competing constraints it's trying to satisfy.


Successfully merging this pull request may close these issues.

[ci] [python-package] scikit-learn compatibility tests fail with scikit-learn 1.6.dev0
5 participants