Target encoding train and test datasets with many categorical columns

Time:12-27

I am trying to prepare a training dataset that contains many high-cardinality categorical columns in order to train a machine learning model. I therefore want to target encode them, converting the categorical columns into numerical columns. Label encoding is not suitable because the categorical features are not ordinal.

My training dataset looks like this (only 4 of the 20 columns are shown):

target cat_col1 cat_col2 cat_col3 cat_col4
10 city1 james 25-55 abc
20 city2 adam 30-40 bcc
15 city1 charles 30-40 bcc

I want to write efficient code that target encodes all the categorical columns without handling each column individually.

The resulting training dataframe should look like this:

target cat_col1 cat_col2 cat_col3 cat_col4
10 15 10 10 10
20 20 20 17 17
15 15 15 17 17

I can get the above output by writing code for each column, but since I have 20 categorical columns, this does not seem efficient.

from category_encoders import TargetEncoder  # assuming the category_encoders package

encoder = TargetEncoder()
train['cat_col1'] = encoder.fit_transform(train['cat_col1'], train['target'])
train['cat_col2'] = encoder.fit_transform(train['cat_col2'], train['target'])
train['cat_col3'] = encoder.fit_transform(train['cat_col3'], train['target'])
train['cat_col4'] = encoder.fit_transform(train['cat_col4'], train['target'])

In addition, I would like to take the target-encoded values learned from the train dataframe and use them to replace the categories in the test dataframe.

Answer:

Assuming you're using the category_encoders implementation, TargetEncoder should accept several columns at once, at least in recent versions:

from category_encoders import TargetEncoder

cat_cols = ['cat_col1', 'cat_col2', 'cat_col3', 'cat_col4']

# Fit on train only, then reuse the fitted encoder on test
encoder = TargetEncoder(cols=cat_cols)
train[cat_cols] = encoder.fit_transform(train[cat_cols], train['target'])
test[cat_cols] = encoder.transform(test[cat_cols])

Alternatively, you could use a loop:

for column in cat_cols:
    # A fresh encoder per column: fit on train, then apply to test
    encoder = TargetEncoder()
    train[column] = encoder.fit_transform(train[column], train['target'])
    test[column] = encoder.transform(test[column])
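
To make the fit-on-train, transform-on-test idea concrete, here is a minimal pandas-only sketch of plain (unsmoothed) mean target encoding. It is not what category_encoders does internally: TargetEncoder blends each category mean with the global mean, so its numbers will generally differ. The test rows, the train_enc/test_enc names, and the choice to fall back to the global mean for unseen categories are assumptions made for illustration.

```python
import pandas as pd

# Sample rows from the question.
train = pd.DataFrame({
    "target": [10, 20, 15],
    "cat_col1": ["city1", "city2", "city1"],
    "cat_col2": ["james", "adam", "charles"],
    "cat_col3": ["25-55", "30-40", "30-40"],
    "cat_col4": ["abc", "bcc", "bcc"],
})
# Hypothetical test rows; "city3" never appears in train.
test = pd.DataFrame({
    "cat_col1": ["city1", "city3"],
    "cat_col2": ["adam", "james"],
    "cat_col3": ["30-40", "25-55"],
    "cat_col4": ["bcc", "abc"],
})

cat_cols = ["cat_col1", "cat_col2", "cat_col3", "cat_col4"]
global_mean = train["target"].mean()

# Learn one category -> mean(target) mapping per column, from train only.
mappings = {col: train.groupby(col)["target"].mean() for col in cat_cols}

train_enc = train.copy()
test_enc = test.copy()
for col, mapping in mappings.items():
    train_enc[col] = train[col].map(mapping)
    # Categories unseen in train map to NaN; fall back to the global mean.
    test_enc[col] = test[col].map(mapping).fillna(global_mean)

print(train_enc)
print(test_enc)
```

Keeping the mappings in a dict means the same train-derived values can be reapplied to any later dataframe without touching the target again.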