SelectKBest with chi2 throws ValueError: could not convert string to float: 'Self_emp_not_inc'

I am trying to select the best categorical features for a classification problem using chi2 and SelectKBest. I extracted the categorical columns, separated the features from the target, and fit SelectKBest like this:

from sklearn.feature_selection import chi2, SelectKBest

X, y = df_cat_kbest.iloc[:, :-1], df_cat_kbest.iloc[:, -1]
selector = SelectKBest(score_func=chi2, k=3).fit_transform(X, y)

When I run it, I am getting the error:

ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_13272\2211654466.py in <module>
----> 1 selector = SelectKBest(score_func=chi2, k=3).fit_transform(X, y)

E:\Anaconda\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
    853         else:
    854             # fit method of arity 2 (supervised transformation)
--> 855             return self.fit(X, y, **fit_params).transform(X)
    856 
    857 

...
...

E:\Anaconda\lib\site-packages\pandas\core\generic.py in __array__(self, dtype)
   1991 
   1992     def __array__(self, dtype: NpDtype | None = None) -> np.ndarray:
-> 1993         return np.asarray(self._values, dtype=dtype)
   1994 
   1995     def __array_wrap__(

ValueError: could not convert string to float: 'Self_emp_not_inc'

As far as I know, the chi-square test can be applied to categorical columns. Here, all the features are categorical, and so is the target. So why does it say it could not convert a string to float?

Answer:

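Encoding the features will do the job. scikit-learn's chi2 converts X to a numeric array before scoring, so string categories have to be encoded as numbers first. For example: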

from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.pipeline import make_pipeline

X, y = df_cat_kbest.iloc[:, :-1], df_cat_kbest.iloc[:, -1]

selector = make_pipeline(OneHotEncoder(drop='first'), SelectKBest(score_func=chi2, k=3)).fit_transform(X, y)

We have added a preprocessing step: one-hot encoding. You can choose another encoding; the bottom line is that you need to transform your string (object) columns into numerical data ;)

The category_encoders package from scikit-learn-contrib provides additional encoders that might suit your needs.
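As one illustration of swapping the encoder, the sketch below uses scikit-learn's own `OrdinalEncoder` rather than anything from category_encoders, on hypothetical toy data (`df_toy` and its columns are made up). This keeps one column per original feature, so the selected names are the original column names. Treat it as a sketch only: chi2 interprets the values as non-negative frequencies, so scores computed on arbitrary integer codes depend on how the categories happen to be ordered.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; the last column is the target.
df_toy = pd.DataFrame({
    "workclass": ["Private", "Self_emp_not_inc", "Private",
                  "State_gov", "Private", "Self_emp_not_inc"],
    "education": ["Bachelors", "HS_grad", "Masters",
                  "HS_grad", "Bachelors", "Masters"],
    "sex": ["Male", "Female", "Male", "Female", "Female", "Male"],
    "income": ["<=50K", ">50K", ">50K", "<=50K", "<=50K", ">50K"],
})
X, y = df_toy.iloc[:, :-1], df_toy.iloc[:, -1]

# OrdinalEncoder maps each category to an integer code 0..n-1 (one column per feature).
pipe = make_pipeline(
    OrdinalEncoder(),
    SelectKBest(score_func=chi2, k=2),
)
X_selected = pipe.fit_transform(X, y)

# The mask lines up with the original columns because no columns were added.
selected_columns = X.columns[pipe.named_steps["selectkbest"].get_support()]
print(list(selected_columns))
```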