Target encoding multiple columns in pandas python


I have the following table.

id col1 col2 col3 col4  target
1    A    B  A    101   1
2    B    B  A    191   1
3    A    B  A     81   0 
4    C    B  C     67   1
5    B    C  C      3   0

I want to target encode every column except col4.

Expected Output:

e1    e2     e3     target
0.5   0.75   0.667    1
0.5   0.75   0.667    1
0.5   0.75   0.667    0
1.0   0.75   0.5      1
0.5   0.00   0.5      0

EDIT: For each column of col1, col2, col3 I want to get the target encodings.

For example, in col3, A appears 3 times, and 2 of those 3 rows have a target of 1, so the encoding for A in col3 is 0.667. Similarly, C in col3 appears twice with one target of 1, giving 0.5.
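For reference, those per-category means can be checked directly with a groupby (a minimal sketch, reconstructing the sample table from the question):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'col1': ['A', 'B', 'A', 'C', 'B'],
    'col2': ['B', 'B', 'B', 'B', 'C'],
    'col3': ['A', 'A', 'A', 'C', 'C'],
    'col4': [101, 191, 81, 67, 3],
    'target': [1, 1, 0, 1, 0],
})

# mean target per category of col3: A -> 2/3, C -> 1/2
print(df.groupby('col3')['target'].mean())
```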

I've tried something like this one for one column:

encodings = df.groupby('col1')['target'].mean().reset_index()
df = df.merge(encodings, how='left', on='col1')
df.drop('col1', axis=1, inplace=True)
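Extending that merge-based attempt to all three columns could look like the sketch below (the `e1`..`e3` names are taken from the expected output; renaming the encoding before the merge avoids a `target_x`/`target_y` column clash):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'col1': ['A', 'B', 'A', 'C', 'B'],
    'col2': ['B', 'B', 'B', 'B', 'C'],
    'col3': ['A', 'A', 'A', 'C', 'C'],
    'col4': [101, 191, 81, 67, 3],
    'target': [1, 1, 0, 1, 0],
})

for i, col in enumerate(['col1', 'col2', 'col3'], start=1):
    # mean target per category, renamed to match the expected output
    encodings = df.groupby(col)['target'].mean().rename(f'e{i}').reset_index()
    df = df.merge(encodings, how='left', on=col).drop(col, axis=1)

print(df[['e1', 'e2', 'e3', 'target']])
```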

CodePudding user response:

Update after clarification:

You need the same approach as in your original attempt, but using map:

df.update(df[['col1', 'col2', 'col3']]
          .apply(lambda s: s.map(df['target'].groupby(s).mean()))
          )

output:

   id col1  col2      col3  col4  target
0   1  0.5  0.75  0.666667   101       1
1   2  0.5  0.75  0.666667   191       1
2   3  0.5  0.75  0.666667    81       0
3   4  1.0  0.75       0.5    67       1
4   5  0.5   0.0       0.5     3       0
Older answer, prior to OP's clarification:

IIUC, you want to map the normalized value_counts:

df[['col1', 'col2', 'col3']].apply(lambda s: s.map(s.value_counts(normalize=True)))

output:

   col1  col2  col3
0   0.4   0.8   0.6
1   0.4   0.8   0.6
2   0.4   0.8   0.6
3   0.2   0.8   0.4
4   0.4   0.2   0.4
updating the data in place:
df.update(df[['col1', 'col2', 'col3']]
          .apply(lambda s: s.map(s.value_counts(normalize=True)))
          )

updated DataFrame:

   id col1 col2 col3  col4  target
0   1  0.4  0.8  0.6   101       1
1   2  0.4  0.8  0.6   191       1
2   3  0.4  0.8  0.6    81       0
3   4  0.2  0.8  0.4    67       1
4   5  0.4  0.2  0.4     3       0
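Worth noting: `value_counts(normalize=True)` gives a frequency encoding (each category's share of rows), not a target encoding. For instance, A accounts for 2 of the 5 values in col1, hence 0.4. A quick check on the sample data:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'A', 'C', 'B']})

# share of rows per category: A -> 0.4, B -> 0.4, C -> 0.2
freq = df['col1'].value_counts(normalize=True)
print(df['col1'].map(freq).tolist())
```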

CodePudding user response:

You can try transform with a for loop:

l = [df.groupby(col)['target'].transform('mean') for col in ['col1', 'col2', 'col3']]
out = pd.concat(l + [df.target], keys=['e1', 'e2', 'e3', 'target'], axis=1)
out
Out[247]: 
    e1    e2        e3  target
0  0.5  0.75  0.666667       1
1  0.5  0.75  0.666667       1
2  0.5  0.75  0.666667       0
3  1.0  0.75  0.500000       1
4  0.5  0.00  0.500000       0

CodePudding user response:

Use .apply. For each column, calculate the average of target grouped by that column:

df[['col1', 'col2', 'col3']].apply(lambda s: s.map(df['target'].groupby(s).mean()))
   col1  col2      col3
0   0.5  0.75  0.666667
1   0.5  0.75  0.666667
2   0.5  0.75  0.666667
3   1.0  0.75  0.500000
4   0.5  0.00  0.500000

If you also want to have a target column, you can just use .assign() at the end:

df[['col1', 'col2', 'col3']].apply(lambda s: s.map(df['target'].groupby(s).mean())).assign(target=df['target'])
   col1  col2      col3  target
0   0.5  0.75  0.666667       1
1   0.5  0.75  0.666667       1
2   0.5  0.75  0.666667       0
3   1.0  0.75  0.500000       1
4   0.5  0.00  0.500000       0

Note: .apply() and .transform() give identical results here. You can replace one with the other.
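To illustrate that note, both forms produce the same series for a single column of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'B', 'A', 'C', 'B'],
    'target': [1, 1, 0, 1, 0],
})

# map the per-group mean back onto the column
via_map = df['col1'].map(df['target'].groupby(df['col1']).mean())
# broadcast the group mean with transform
via_transform = df.groupby('col1')['target'].transform('mean')

assert via_map.tolist() == via_transform.tolist()
```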
