I have a pandas Dataframe, whose columns are
cat1, cat2, cat3, cat4, value1, value2
I would like to add a new column value3 and assing it a value that depends on value1 and the sum of value1 over all values in cat4. Is it possible to do this in a vectorized/efficient way, or do I need to write a for loop?
For example if the values were
cat1, cat2, cat3, cat4, value1, value2
A,A,A,A, 3,1
A,A,A,B, 5,2
A,A,A,C, 22,4
A,A,X,A, 0,5
A,A,X,B, 4,5
A,A,X,C, 6,5
My desired endresult would be
cat1, cat2, cat3, cat4, value1, value2, value3
A,A,A,A, 3,1, 30
A,A,A,B, 5,2, 30
A,A,A,C, 22,4, 30
A,A,X,A, 0,5, 10
A,A,X,B, 4,5, 10
A,A,X,C, 6,5, 10
where value3 is the same sum over the value1 values with different cat4 entries when all entries of cat1, cat2, and cat3 are the same. Thanks!
CodePudding user response:
IIUC, you can transform('sum')
per groups of cat1/cat2/cat3:
df['value3'] = (df.groupby(['cat1', 'cat2', 'cat3'])
['value1'].transform('sum'))
output:
cat1 cat2 cat3 cat4 value1 value2 value3
0 A A A A 3 1 30
1 A A A B 5 2 30
2 A A A C 22 4 30
3 A A X A 0 5 10
4 A A X B 4 5 10
5 A A X C 6 5 10