I used TargetEncoder on all the nominal categorical features in my dataset. After splitting the DataFrame into train and test sets, I fit an XGBClassifier on the training data.
After the model is trained, I want to plot feature importance; however, the features show up in an "encoded" state. How can I map the features back, so the importance plot is interpretable?
import category_encoders as ce
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

encoder = ce.TargetEncoder(cols=X.select_dtypes(['object']).columns)
X = encoder.fit_transform(X, y)  # assign the result, or the encoding is lost
model = XGBClassifier(use_label_encoder=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
model.fit(X_train, y_train)
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

N_FEATURES = 10
importances = model.feature_importances_
indices = np.argsort(importances)[-N_FEATURES:]
plt.barh(range(N_FEATURES), importances[indices], align='center')
plt.title('Feature Importances')
plt.xlabel('Relative Importance')
plt.show()
CodePudding user response:
As stated in the documentation, you can retrieve the encoded column/feature names with get_feature_names()
and then drop the original feature names.
Also, do you really need to encode your target (y)?
In the example below, I assume you only need to encode the features of dtype 'object' in your X_train dataset.
Finally, it is good practice to first split your dataset into train and test, then call fit_transform
on the training set and only transform
on the test set. This way you prevent target leakage.
object_col = X_train.select_dtypes(['object']).columns
encoder = ce.TargetEncoder(cols=object_col)

# Fit on the training data only; TargetEncoder also needs the target
X_train_enc = encoder.fit_transform(X_train, y_train)
X_train_enc.columns = encoder.get_feature_names()

# Only transform the test set, so no test information leaks into the encoding
X_test_enc = encoder.transform(X_test)
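To answer the plotting part of the question: model.feature_importances_ is positional, so the i-th importance lines up with the i-th column of the encoded training frame, and you can label the bars with those column names directly. A minimal sketch, with hypothetical importances and names standing in for your model.feature_importances_ and X_train_enc.columns:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical values for illustration; in your code use
# model.feature_importances_ and X_train_enc.columns instead.
importances = np.array([0.05, 0.30, 0.10, 0.40, 0.15])
feature_names = np.array(['age', 'city', 'income', 'plan', 'tenure'])

N_FEATURES = 3
# Indices of the top-N importances, ascending so the largest bar ends up on top
indices = np.argsort(importances)[-N_FEATURES:]

plt.barh(range(N_FEATURES), importances[indices], align='center')
plt.yticks(range(N_FEATURES), feature_names[indices])  # readable labels
plt.title('Feature Importances')
plt.xlabel('Relative Importance')
plt.show()
```

Because TargetEncoder replaces the object columns in place and keeps their names, no reverse mapping is needed: the column names of the encoded frame are the original feature names.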