I am preparing a training dataset that contains many high-cardinality categorical columns in order to train a machine learning model. I therefore want to target encode them, converting the categorical columns into numerical ones. Label encoding is not suitable because the categorical features are not ordinal.
My training dataset looks like this (only 4 of the 20 columns are shown):
target | cat_col1 | cat_col2 | cat_col3 | cat_col4 |
---|---|---|---|---|
10 | city1 | james | 25-55 | abc |
20 | city2 | adam | 30-40 | bcc |
15 | city1 | charles | 30-40 | bcc |
I want to write efficient code that target encodes all the categorical columns without having to handle each column individually.
The resulting training dataframe should look like this:
target | cat_col1 | cat_col2 | cat_col3 | cat_col4 |
---|---|---|---|---|
10 | 15 | 10 | 10 | 10 |
20 | 20 | 20 | 17 | 17 |
15 | 15 | 15 | 17 | 17 |
I can get the above output by writing code for each column, but since I have 20 categorical columns, this does not seem efficient:
encoder = TargetEncoder()
train['cat_col1'] = encoder.fit_transform(train['cat_col1'], train['target'])
train['cat_col2'] = encoder.fit_transform(train['cat_col2'], train['target'])
train['cat_col3'] = encoder.fit_transform(train['cat_col3'], train['target'])
train['cat_col4'] = encoder.fit_transform(train['cat_col4'], train['target'])
In addition, I would like to take the target-encoded values learned from the train dataframe and use them to replace the categories in the test dataframe.
Answer:
Assuming you're using the category_encoders implementation, it should accept several columns at once just fine, at least in recent versions:
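from category_encoders import TargetEncoder

cat_cols = ['cat_col1', 'cat_col2', 'cat_col3', 'cat_col4']
encoder = TargetEncoder()
# Fit on train only, then reuse the fitted encoder on test
train[cat_cols] = encoder.fit_transform(train[cat_cols], train['target'])
test[cat_cols] = encoder.transform(test[cat_cols])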
Alternatively, you could use a loop:
for column in cat_cols:
    encoder = TargetEncoder()
    train[column] = encoder.fit_transform(train[column], train['target'])
    test[column] = encoder.transform(test[column])
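If you want to see what is going on under the hood, here is a minimal sketch of plain (unsmoothed) mean encoding with pandas only. It is not what category_encoders does internally, since that encoder also applies smoothing, so its numbers will generally differ. The helper names `fit_mean_encoding` and `apply_mean_encoding`, the test rows, and the choice to fall back to the global training mean for unseen categories are all assumptions made for illustration.

```python
import pandas as pd

def fit_mean_encoding(train, cat_cols, target):
    """Learn the per-category target mean for each column, from train only."""
    global_mean = train[target].mean()
    mappings = {col: train.groupby(col)[target].mean() for col in cat_cols}
    return mappings, global_mean

def apply_mean_encoding(df, mappings, global_mean):
    """Replace categories with their training means.

    Categories never seen in train get the global training mean (an assumption).
    """
    out = df.copy()
    for col, means in mappings.items():
        out[col] = out[col].map(means).fillna(global_mean)
    return out

# The three sample rows from the question
train = pd.DataFrame({
    "target": [10, 20, 15],
    "cat_col1": ["city1", "city2", "city1"],
    "cat_col2": ["james", "adam", "charles"],
    "cat_col3": ["25-55", "30-40", "30-40"],
    "cat_col4": ["abc", "bcc", "bcc"],
})
# Hypothetical test rows; "city3" and "zed" do not occur in train
test = pd.DataFrame({
    "cat_col1": ["city2", "city3"],
    "cat_col2": ["james", "zed"],
    "cat_col3": ["30-40", "25-55"],
    "cat_col4": ["abc", "bcc"],
})

cat_cols = ["cat_col1", "cat_col2", "cat_col3", "cat_col4"]
mappings, global_mean = fit_mean_encoding(train, cat_cols, "target")
train_encoded = apply_mean_encoding(train, mappings, global_mean)
test_encoded = apply_mean_encoding(test, mappings, global_mean)
```

On these three rows alone, city1 encodes to 12.5 and the 30-40 / bcc groups to 17.5, so the figures in the question's expected table are presumably illustrative or computed on the full dataset. Also keep in mind that encoding training rows with their own target values can leak the target into the features, which is one reason library encoders add smoothing.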