feature selection after one hot encoding


I one-hot encoded my X_train dataframe to convert its categorical variables to numerical ones. This significantly increased the number of columns, because each category value of a column became its own column. I then ran univariate feature selection (a filter method) with SelectKBest and the chi-square score, and selected the top 15 features that correlate most with my target variable. The problem is that the selected features have confusing names. Here is my code:

X_train = pd.get_dummies(X_train)

X_test = pd.get_dummies(X_test)


y_train = pd.get_dummies(y_train)

y_test = pd.get_dummies(y_test)


from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2


#apply SelectKBest class to extract top 10 best features
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X_train,y_train)

dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X_train.columns)

#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns

print(featureScores.nlargest(15,'Score'))  #print 15 best features


 Specs          Score
4                                weeks worked in year  131890.720755
2                     num_persons_worked_for_employer   10900.486787
1                                     instance_weight    8087.766885
67  major_occupation_code_ Executive admin and man...    7606.586291
29  education_ Prof school degree (MD DDS DVM LLB JD)    5616.479469
75      major_occupation_code_ Professional specialty    5505.713604
28  education_ Masters degree(MA MS MEng MEd MSW MBA)    5019.018784
24              education_ Bachelors degree(BA AB BS)    3692.481274
25               education_ Doctorate degree(PhD EdD)    3587.589683
96                                          sex_ Male    3424.928788
11        class_of_worker_ Self-employed-incorporated    3372.042663
55   major_industry_code_ Not in universe or children    3142.494445
71             major_occupation_code_ Not in universe    3142.494445
9                    class_of_worker_ Not in universe    3125.278635
95                                        sex_ Female    3034.914202

For example, the features 'class_of_worker_ Self-employed-incorporated' (no. 11) and 'class_of_worker_ Not in universe' (no. 9) were selected. However, these are columns that were created from the original 'class_of_worker' column by one-hot encoding, and that column has several more category values, so several more 'class_of_worker_...' features exist that were not selected by the univariate selection method. Is this right? How do I keep just the two selected 'class_of_worker' features and drop the rest?


CodePudding user response:

It seems like you one-hot encoded all your features. Therefore, I am assuming that all your features were categorical.

X_train = pd.get_dummies(X_train)

X_test = pd.get_dummies(X_test)

The one-hot encoding generated multiple columns from each single column. All of these columns are now separate 'features'. Therefore, when you pass them to a feature selection algorithm, it will select the top k relevant features from this entire set individually, and the rest will be dropped. It is entirely normal for some dummies of a parent column to be selected while its other dummies are not.
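As a small illustration (a self-contained sketch with made-up toy data, not the asker's dataset), here is how one categorical column becomes several independent dummy features, of which SelectKBest may keep only some:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Toy data: one categorical column plus one numeric column.
X = pd.DataFrame({
    "class_of_worker": ["Private", "Private", "Self-employed",
                        "Government", "Private", "Government"],
    "weeks_worked": [52, 10, 40, 52, 5, 48],
})
y = [1, 0, 1, 1, 0, 1]

# get_dummies turns 'class_of_worker' into 3 dummy columns,
# so X_enc has 4 features in total.
X_enc = pd.get_dummies(X)

# chi2 scores each dummy independently; k=2 keeps only two of them.
selector = SelectKBest(score_func=chi2, k=2).fit(X_enc, y)
selected = X_enc.columns[selector.get_support()]

print(list(X_enc.columns))  # all 4 encoded features
print(list(selected))       # the 2 highest-scoring ones; dummies from the
                            # same parent column are kept or dropped individually
```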

Meta point: I see that you have also one-hot encoded your labels (y_train, y_test). In that case, you have turned your problem from a multi-class problem into a multi-label one.
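To make the meta point concrete, here is a minimal sketch (with made-up labels) contrasting the two shapes: one-hot encoding y produces a 2-D indicator matrix (a multi-label setup), whereas keeping a single label column, e.g. via LabelEncoder, preserves the multi-class setup:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

y_train = pd.Series(["<=50K", ">50K", "<=50K", ">50K"])

# One-hot encoding the target gives a 2-D indicator matrix:
# estimators will treat this as a multi-label problem.
y_onehot = pd.get_dummies(y_train)
print(y_onehot.shape)  # (4, 2) -> one indicator column per class

# Keeping a 1-D array of class labels preserves the multi-class setup.
y_encoded = LabelEncoder().fit_transform(y_train)
print(y_encoded.shape)  # (4,) -> one label per sample
```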
