I am trying to select the best categorical features for a classification problem with chi2 and SelectKBest. Here, I've sorted out the categorical columns:
I separated the features and target like this and fit it to SelectKBest:
from sklearn.feature_selection import chi2, SelectKBest
X, y = df_cat_kbest.iloc[:, :-1], df_cat_kbest.iloc[:, -1]
selector = SelectKBest(score_func=chi2, k=3).fit_transform(X, y)
When I run it, I am getting the error:
ValueError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_13272\2211654466.py in <module>
----> 1 selector = SelectKBest(score_func=chi2, k=3).fit_transform(X, y)
E:\Anaconda\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
853 else:
854 # fit method of arity 2 (supervised transformation)
--> 855 return self.fit(X, y, **fit_params).transform(X)
856
857
...
...
E:\Anaconda\lib\site-packages\pandas\core\generic.py in __array__(self, dtype)
1991
1992 def __array__(self, dtype: NpDtype | None = None) -> np.ndarray:
-> 1993 return np.asarray(self._values, dtype=dtype)
1994
1995 def __array_wrap__(
ValueError: could not convert string to float: 'Self_emp_not_inc'
As far as I know, I can apply chi-square to categorical columns. Here, all the features are categorical, and so is the target. So why does it say it could not convert a string to float?
Answer:
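Encoding the features will do the job. scikit-learn's chi2 only works on non-negative numeric input, so the string categories have to be converted to numbers first. For example: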
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.pipeline import make_pipeline
X, y = df_cat_kbest.iloc[:, :-1], df_cat_kbest.iloc[:, -1]
selector = make_pipeline(OneHotEncoder(drop='first'), SelectKBest(score_func=chi2, k=3)).fit_transform(X, y)
We have added a preprocessing step: one-hot encoding. You can choose another encoding; the bottom line is that you need to transform your object (string) columns into numerical data ;)
The scikit-learn-contrib category_encoders package also provides other encoders that might suit your needs.