I have the following table.
id col1 col2 col3 col4 target
1 A B A 101 1
2 B B A 191 1
3 A B A 81 0
4 C B C 67 1
5 B C C 3 0
I want to target encode every column except col4.
Expected Output:
e1 e2 e3 target
0.5 0.75 0.667 1
0.5 0.75 0.667 1
0.5 0.75 0.667 0
1.0 0.75 0.5 1
0.5 0.00 0.5 0
EDIT:
For each of the columns col1, col2, and col3, I want to get the target encodings.
For example, in col3, A appears 3 times, and in 2 of those 3 rows the target is 1, so the encoding for A is 0.667. Similarly, the encoding for C in col3 is 0.5.
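To check the arithmetic above, here is a minimal sketch that rebuilds the sample table and computes the per-category mean for col3. The DataFrame construction is an assumption about how the data is loaded; only the values come from the table.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "col1": ["A", "B", "A", "C", "B"],
    "col2": ["B", "B", "B", "B", "C"],
    "col3": ["A", "A", "A", "C", "C"],
    "col4": [101, 191, 81, 67, 3],
    "target": [1, 1, 0, 1, 0],
})

# Mean of target per category of col3: A -> 2/3, C -> 1/2.
col3_means = df.groupby("col3")["target"].mean()
print(col3_means.round(3))
```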
I've tried something like this for one column:
encodings = df.groupby('col1')['target'].mean().reset_index()
df = df.merge(encodings, how='left', on='col1')
df.drop('col1', axis=1, inplace=True)
CodePudding user response:
update after clarification:
You need to use the same approach as in your original attempt, but with map:
df.update(df[['col1', 'col2', 'col3']]
          .apply(lambda s: s.map(df['target'].groupby(s).mean()))
         )
output:
id col1 col2 col3 col4 target
0 1 0.5 0.75 0.666667 101 1
1 2 0.5 0.75 0.666667 191 1
2 3 0.5 0.75 0.666667 81 0
3 4 1.0 0.75 0.5 67 1
4 5 0.5 0.0 0.5 3 0
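To see why map works in the update call above, it may help to split it into two steps for a single column. This is only a sketch on a trimmed copy of the sample data, and the category "D" is a made-up value used to show what happens to labels that have no group mean.

```python
import pandas as pd

# Trimmed copy of the sample data: one categorical column plus the target.
df = pd.DataFrame({
    "col1": ["A", "B", "A", "C", "B"],
    "target": [1, 1, 0, 1, 0],
})

# Step 1: mean target per category; the result is indexed by category label.
means = df["target"].groupby(df["col1"]).mean()  # A 0.5, B 0.5, C 1.0

# Step 2: map looks each value of col1 up in that index.
encoded = df["col1"].map(means)

# A label with no entry in `means` (the hypothetical "D") maps to NaN.
unseen = pd.Series(["A", "D"]).map(means)
```

If you apply the encoding to new data, you would probably want to decide how to fill those NaN values.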
older answer prior to OP clarification
IIUC, you want to map the normalized value_counts:
df[['col1', 'col2', 'col3']].apply(lambda s: s.map(s.value_counts(normalize=True)))
output:
col1 col2 col3
0 0.4 0.8 0.6
1 0.4 0.8 0.6
2 0.4 0.8 0.6
3 0.2 0.8 0.4
4 0.4 0.2 0.4
updating the data in place:
df.update(df[['col1', 'col2', 'col3']]
.apply(lambda s: s.map(s.value_counts(normalize=True)))
)
updated DataFrame:
id col1 col2 col3 col4 target
0 1 0.4 0.8 0.6 101 1
1 2 0.4 0.8 0.6 191 1
2 3 0.4 0.8 0.6 81 0
3 4 0.2 0.8 0.4 67 1
4 5 0.4 0.2 0.4 3 0
CodePudding user response:
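You can try transform with a for loop:
import pandas as pd

# one mean-target Series per categorical column
l = [df.groupby(col)['target'].transform('mean') for col in ['col1', 'col2', 'col3']]
# append the original target and label the columns
out = pd.concat(l + [df.target], keys=['e1', 'e2', 'e3', 'target'], axis=1)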
out
Out[247]:
e1 e2 e3 target
0 0.5 0.75 0.666667 1
1 0.5 0.75 0.666667 1
2 0.5 0.75 0.666667 0
3 1.0 0.75 0.500000 1
4 0.5 0.00 0.500000 0
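If you need this for a varying list of columns, the same transform idea can be wrapped in a small function. This is just a sketch: target_encode and the e1, e2, ... naming are hypothetical, chosen to match the expected output in the question.

```python
import pandas as pd

def target_encode(df, cols, target="target"):
    """Hypothetical helper: one mean-target column per input column, plus the target."""
    encoded = {
        f"e{i}": df.groupby(col)[target].transform("mean")
        for i, col in enumerate(cols, start=1)
    }
    encoded[target] = df[target]
    return pd.DataFrame(encoded)

# Sample data from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "col1": ["A", "B", "A", "C", "B"],
    "col2": ["B", "B", "B", "B", "C"],
    "col3": ["A", "A", "A", "C", "C"],
    "col4": [101, 191, 81, 67, 3],
    "target": [1, 1, 0, 1, 0],
})

out = target_encode(df, ["col1", "col2", "col3"])
print(out)
```

It returns a new DataFrame and leaves df untouched, which is handy if you still need the original categories later.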
CodePudding user response:
Use .apply. For each column, calculate the average of target grouped by that column:
df[['col1', 'col2', 'col3']].apply(lambda s: s.map(df['target'].groupby(s).mean()))
col1 col2 col3
0 0.5 0.75 0.666667
1 0.5 0.75 0.666667
2 0.5 0.75 0.666667
3 1.0 0.75 0.500000
4 0.5 0.00 0.500000
If you also want to have a target column, you can just use .assign() at the end:
df[['col1', 'col2', 'col3']].apply(lambda s: s.map(df['target'].groupby(s).mean())).assign(target=df['target'])
col1 col2 col3 target
0 0.5 0.75 0.666667 1
1 0.5 0.75 0.666667 1
2 0.5 0.75 0.666667 0
3 1.0 0.75 0.500000 1
4 0.5 0.00 0.500000 0
Note: .apply() and .transform() give identical results here, so you can replace one with the other.
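A quick way to convince yourself of that is to run both on the sample data and compare the results. This is a sketch; the encode function name is only for illustration.

```python
import pandas as pd

# Sample data from the question (only the columns needed here).
df = pd.DataFrame({
    "col1": ["A", "B", "A", "C", "B"],
    "col2": ["B", "B", "B", "B", "C"],
    "col3": ["A", "A", "A", "C", "C"],
    "target": [1, 1, 0, 1, 0],
})
cols = ["col1", "col2", "col3"]

def encode(s):
    # Map each category to the mean target of its group.
    return s.map(df["target"].groupby(s).mean())

via_apply = df[cols].apply(encode)
via_transform = df[cols].transform(encode)

# Both calls run encode column by column, so the frames should match.
print(via_apply.equals(via_transform))
```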