Is xgboostclassifier incompatible with calibratedclassifier? #5887

Closed
zahs123 opened this issue Jul 13, 2020 · 8 comments · Fixed by #5953

@zahs123

zahs123 commented Jul 13, 2020

All,

I do not know why I am getting the following error: ValueError: feature_names mismatch. This is what I am running.
To get my data:

from sklearn.model_selection import train_test_split

target = df['status']
train = df.drop(columns=['status'])
# Hold out 20% for validation, then 20% of the remainder for testing.
x_train, x_valid, y_train, y_valid = train_test_split(train, target, stratify=target, random_state=42, test_size=0.2)
x_train, x_test, y_train, y_test = train_test_split(x_train, y_train, stratify=y_train, random_state=42, test_size=0.2)

Then I run a grid search and calibrate the best estimator:

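from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import GridSearchCV, StratifiedKFold

kfolds = StratifiedKFold(3)
clf = GridSearchCV(models['XGBOOST'], params['XGBOOST'], cv=kfolds.split(x_train, y_train),
                   scoring='roc_auc', return_train_score=True)

# The grid search is fitted on the pandas DataFrame directly.
clf.fit(x_train, y_train)

model = clf.best_estimator_

# Calibrate the already-fitted model on the held-out validation set.
clf_isotonic = CalibratedClassifierCV(model, cv='prefit', method='isotonic')
clf_isotonic.fit(x_valid, y_valid)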

But I get the above error. My x_valid and x_train have the same columns, and the error persists even when I fix the columns using:

#f_names = model.get_booster().feature_names
f_names = x_train.columns.tolist()
x_valid[f_names]

I do not understand why I am getting that error even after fixing the columns as above. I have also tried passing x_valid.values, with no luck. Both sets have the same features, so I really do not know what is happening.

@trivialfis
Member

Which XGBoost version are you using?

@zahs123
Author

zahs123 commented Jul 16, 2020

It's 1.0.2.

@trivialfis
Member

Will investigate this.

@trivialfis trivialfis self-assigned this Jul 24, 2020
@trivialfis
Member

Reproduced.

@trivialfis
Member

I looked into this briefly. sklearn converts the pandas dataframe into a numpy array during its validation (the check_array function), so information such as feature names is lost. Not sure if this is expected behaviour from scikit-learn.
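To illustrate the loss of feature names, here is a minimal sketch. It assumes np.asarray as a stand-in for the conversion that check_array performs; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame; the names are illustrative only.
df = pd.DataFrame({"age": [25, 32], "income": [40000.0, 52000.0]})

# Stand-in for the dataframe-to-array conversion done during validation.
arr = np.asarray(df)

names_before = list(df.columns)           # the dataframe carries column names
names_survive = hasattr(arr, "columns")   # the plain array has no such attribute

print(names_before)
print(type(arr).__name__, names_survive)
```

The values are preserved by the conversion, but nothing downstream of it can recover the original column names from the array alone.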

@trivialfis
Member

There's a related issue: scikit-learn/scikit-learn#5523.

@zahs123
Author

zahs123 commented Jul 29, 2020

I still can't figure this problem out. My validation and train sets have exactly the same features.

@trivialfis
Member

trivialfis commented Jul 29, 2020

@zahs123 A pandas.DataFrame has a columns attribute, which contains the name of each column. Here is the sequence of events:

  1. Scikit-learn's grid search passes the dataframe to XGBoost as is, so XGBoost memorizes the feature names from your dataframe during the grid search.
  2. Scikit-learn's calibrated classifier, however, removes the feature names by converting your dataframe to a numpy array before passing it down to XGBoost. So this time XGBoost receives an array instead of a dataframe.
  3. The error happens when XGBoost compares the feature names of the numpy array (which are generated automatically, since an array has no feature names) with the names it previously memorized from the pandas dataframe.
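The three steps can be sketched with a toy model. This is not XGBoost's implementation: ToyModel and feature_names_of are hypothetical helpers written for this illustration, and the generated names f0, f1, ... are an assumption about what an unnamed array is given.

```python
import numpy as np
import pandas as pd


def feature_names_of(data):
    """Column names for a DataFrame; generated names (assumed f0, f1, ...) for an array."""
    if isinstance(data, pd.DataFrame):
        return [str(c) for c in data.columns]
    return [f"f{i}" for i in range(np.asarray(data).shape[1])]


class ToyModel:
    """Hypothetical stand-in for a model that memorizes feature names when fitted."""

    def fit(self, data):
        self.feature_names_ = feature_names_of(data)
        return self

    def validate(self, data):
        # Compare the incoming names with the memorized ones.
        if feature_names_of(data) != self.feature_names_:
            raise ValueError("feature_names mismatch")
        return True


df = pd.DataFrame({"age": [1.0, 2.0], "income": [3.0, 4.0]})

# Step 1: the model is fitted on the dataframe and memorizes its column names.
model = ToyModel().fit(df)

# Steps 2-3: the same data arrives as an array, so the generated names differ.
try:
    model.validate(np.asarray(df))
    mismatch = False
except ValueError:
    mismatch = True

# Fitting on an array as well keeps both stages consistent.
array_model = ToyModel().fit(df.to_numpy())
consistent = array_model.validate(np.asarray(df))
```

Under these assumptions, one possible workaround on the affected version is to fit the grid search on x_train.values rather than on the dataframe, so that both stages see arrays. This has not been verified here against XGBoost 1.0.2; the issue itself is marked as fixed by #5953.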
