What I have:
target | var1 | var2 |
---|---|---|
1 | jack, jones, phil | en-us |
0 | don, sam, bob | vn-en |
1 | jones, alex, sam | en-us |
What I want:
var1 | target | no_target | var2 |
---|---|---|---|
jones | 2 | 0 | en-us |
target column is how much 'jones' appears when target is 1
so far I have tried to use DictVetorize, and it helped me to count number of times target equal to 1 for jones but I done know how to get var2 column
CodePudding user response:
IIUC, you can use:
df['target'].mul(df['var1'].str.count('jones')).sum()
output: 2
For a more generic method to get all names:
out = (df
.assign(var1=df['var1'].str.split(',\s*').where(df['target'].eq(1)))
.explode('var1')
.groupby(['var1', 'var2'], as_index=False)
.sum()
)
output:
var1 var2 target
0 alex en-us 1
1 jack en-us 1
2 jones en-us 2
3 phil en-us 1
4 sam en-us 1
counting target/no_target
(df
.assign(var1=df['var1'].str.split(',\s*'),
target=np.where(df['target'].eq(1), 'target', 'no_target'),
value=1
)
.explode('var1')
.pivot_table(index=['var1', 'var2'], columns='target', values='value',
aggfunc='sum', fill_value=0)
.reset_index().rename_axis(columns=None)
.groupby(['var1'], as_index=False)
.agg({'target': 'sum', 'no_target': 'sum', 'var2': 'first'})
)
output:
var1 target no_target var2
0 alex 1 0 en-us
1 bob 0 1 vn-en
2 don 0 1 vn-en
3 jack 1 0 en-us
4 jones 2 0 en-us
5 phil 1 0 en-us
6 sam 1 1 en-us