I use this code to first filter out my dataframe according to a particular column role
.
Next, I calculate the sum of the value
column and add it to the newColumn
. For now, I get the same value in all rows of the newCol
. However, now I want to group by the sum per each value of the id
col. How can I achieve that?
mask = chunk["role"].eq("senior")
chunk.loc[mask, 'newColumn'] = chunk.loc[mask, 'value'].sum()
df
id role value
1 junior 2
1 senior 3
1 senior 4
2 junior 2
2 senior 6
Desired Output:
id role value sum
1 junior 2
1 senior 3 7
1 senior 4 7
2 junior 2
2 senior 6 8
2 senior 2 8
So I am basically only considering rows where the role
is senior
CodePudding user response:
IIUC, you can try query
where role
value is senior then use groupby.transform
df['sum'] = (df.query('role == "senior"')
.groupby('id')['value'].transform('sum'))
print(df)
id role value sum
0 1 junior 2 NaN
1 1 senior 3 7.0
2 1 senior 4 7.0
3 2 junior 2 NaN
4 2 senior 6 8.0
5 2 senior 2 8.0
CodePudding user response:
You should consider adding samples of your dataframe to make it easier for eveyrone to help. Check my sample below - this is what the df structure should be given as when you're asking question.
Something like this should work:
# Sample of the data frame
>>> df = pd.DataFrame({
'id': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2},
'role': {0: 'junior', 1: 'senior', 2: 'senior', 3: 'junior', 4: 'senior'},
'value': {0: 2, 1: 3, 2: 4, 3: 2, 4: 6}})
# Massaging the data to achieve desired result
>>> print(df
.merge(df.groupby(['id','role'], as_index=False)[['value']]
.agg(sum), on=['id','role'])
.rename({'value_x': 'value', 'value_y': 'sum'}, axis=1))
id role value sum
0 1 junior 2 2
1 1 senior 3 7
2 1 senior 4 7
3 2 junior 2 2
4 2 senior 6 6