I've got an ordered DataFrame that I'm trying to aggregate by some grouping columns, based on accumulated values of other columns from previous rows.
import pandas as pd

df = pd.DataFrame({'ID': ['ID1','ID1','ID1','ID1','ID1','ID2','ID2','ID2','ID2'],
                   'Group': ['Group1','Group2','Group2','Group2','Group1','Group2','Group2','Group2','Group1'],
                   'Value1': [0,1,1,1,1,1,0,0,0],
                   'Value2': [1,2,3,4,5,4,3,2,2]})
df
ID Group Value1 Value2
0 ID1 Group1 0 1
1 ID1 Group2 1 2
2 ID1 Group2 1 3
3 ID1 Group2 1 4
4 ID1 Group1 1 5
5 ID2 Group2 1 4
6 ID2 Group2 0 3
7 ID2 Group2 0 2
8 ID2 Group1 0 2
I'd like to aggregate in three different ways using Value1 and Value2, grouped by ID and Group. df is already ordered (based on date, ID and Group).
Output1: count the number of 1s in previous rows of Value1, by ID and Group (excluding the row itself)
Output2: sum the value of previous rows of Value2, by ID and Group (including the row itself)
Output3: sum Value2 of previous rows, by ID and Group, if Value1 of those previous rows is 1 (excluding the row itself)
Here's my desired output:
ID Group Value1 Value2 Output1 Output2 Output3
0 ID1 Group1 0 1 0 1 NaN
1 ID1 Group2 1 2 0 2 NaN
2 ID1 Group2 1 3 1 5 2
3 ID1 Group2 1 4 2 9 5
4 ID1 Group1 1 5 0 6 NaN
5 ID2 Group2 1 4 0 4 NaN
6 ID2 Group2 0 3 1 7 4
7 ID2 Group2 0 2 1 9 4
8 ID2 Group1 0 2 0 2 NaN
To make sure it's clear what I'm trying to do, let's look at the output at index 3 (the fourth row):
3 ID1 Group2 1 4 2 9 5
Output1 = 2 because there are two rows above it in ID1/Group2 that have Value1 = 1.
Output2 = 9 because the sum of Value2 of all rows above it in ID1/Group2, including the row itself, is 2 + 3 + 4 = 9.
Output3 = 5 because there are two previous rows in ID1/Group2 that have Value1 = 1, so the sum of their Value2 is 2 + 3 = 5.
I'd like to add that I'm working on a large dataset, so I'm looking for an efficient, high-performance solution.
CodePudding user response:
Solution
- For Output1 and Output2: we can use groupby + cumsum.
- For Output3: the calculation is a little trickier. First mask the values in column Value2 where the corresponding value in column Value1 is 0, then group the masked column and use cumsum to compute the cumulative sum. Finally, to exclude the current row, subtract the masked column from that cumulative sum.
g = df.groupby(['ID', 'Group'])
# Output1: running count of 1s in Value1, minus the current row to exclude it
df['Output1'] = g['Value1'].cumsum() - df['Value1']
# Output2: running total of Value2, including the current row
df['Output2'] = g['Value2'].cumsum()
# Output3: zero out Value2 where Value1 is 0, take the running total per group,
# then subtract the current row to exclude it
s = df['Value2'].mul(df['Value1'])
df['Output3'] = s.groupby([df['ID'], df['Group']]).cumsum() - s
Result
print(df)
ID Group Value1 Value2 Output1 Output2 Output3
0 ID1 Group1 0 1 0 1 0
1 ID1 Group2 1 2 0 2 0
2 ID1 Group2 1 3 1 5 2
3 ID1 Group2 1 4 2 9 5
4 ID1 Group1 1 5 0 6 0
5 ID2 Group2 1 4 0 4 0
6 ID2 Group2 0 3 1 7 4
7 ID2 Group2 0 2 1 9 4
8 ID2 Group1 0 2 0 2 0
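Note that this produces 0 rather than NaN in Output3 for the first row of each ID/Group pair. If you need NaN there, as in the desired output, a possible follow-up (assuming NaN is wanted exactly on each group's first row) is:

# mask Output3 on the first row of each (ID, Group) group
df['Output3'] = df['Output3'].mask(g.cumcount().eq(0))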
CodePudding user response:
You can add a masked column for the third output and compute a grouped, shifted cumsum, with a forward fill so that rows whose previous value is masked out still carry the last running total:
import numpy as np

# dictionary of shift values (Value1 and Value3 are shifted to exclude the current row)
d_shift = {'Value1': 1, 'Value3': 1}
# dictionary of fill values
d_fill = {'Value1': 0}

df[['Output1', 'Output2', 'Output3']] = (
    df.assign(Value3=df['Value2'].where(df['Value1'].eq(1)))
      .groupby(['ID', 'Group'])
      .transform(lambda x: x.shift(d_shift.get(x.name, 0),
                                   fill_value=d_fill.get(x.name, np.nan))
                            .cumsum()
                            .ffill())
)
Or, in linear form:
g = (df.assign(Value3=df['Value2'].mask(df['Value1'].ne(1)))
       .groupby(['ID', 'Group']))
df['Output1'] = g['Value1'].apply(lambda s: s.shift(fill_value=0).cumsum())
df['Output2'] = g['Value2'].cumsum()
df['Output3'] = g['Value3'].apply(lambda s: s.shift().cumsum().ffill())
output:
ID Group Value1 Value2 Output1 Output2 Output3
0 ID1 Group1 0 1 0 1 NaN
1 ID1 Group2 1 2 0 2 NaN
2 ID1 Group2 1 3 1 5 2.0
3 ID1 Group2 1 4 2 9 5.0
4 ID1 Group1 1 5 0 6 NaN
5 ID2 Group2 1 4 0 4 NaN
6 ID2 Group2 0 3 1 7 4.0
7 ID2 Group2 0 2 1 9 4.0
8 ID2 Group1 0 2 0 2 NaN
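Since the question asks about performance on a large dataset, note that the pure groupby + cumsum approach from the first answer avoids per-group Python lambdas, so it should scale better than the apply-based version. A rough timing sketch for comparison (the frame size, number of groups, and seed below are illustrative assumptions, not from the original post):

import timeit
import numpy as np
import pandas as pd

# build a larger frame for a rough comparison
rng = np.random.default_rng(0)
N = 100_000
big = pd.DataFrame({'ID': rng.integers(0, 100, N).astype(str),
                    'Group': rng.integers(0, 5, N).astype(str),
                    'Value1': rng.integers(0, 2, N),
                    'Value2': rng.integers(0, 10, N)})

def vectorized(df):
    # groupby + cumsum only: no Python-level function calls per group
    g = df.groupby(['ID', 'Group'])
    s = df['Value2'].mul(df['Value1'])
    return (g['Value1'].cumsum() - df['Value1'],
            g['Value2'].cumsum(),
            s.groupby([df['ID'], df['Group']]).cumsum() - s)

def with_apply(df):
    # shift + cumsum inside apply: one Python call per group
    g = (df.assign(Value3=df['Value2'].mask(df['Value1'].ne(1)))
           .groupby(['ID', 'Group']))
    return (g['Value1'].apply(lambda s: s.shift(fill_value=0).cumsum()),
            g['Value2'].cumsum(),
            g['Value3'].apply(lambda s: s.shift().cumsum().ffill()))

print('groupby + cumsum:', timeit.timeit(lambda: vectorized(big), number=5))
print('groupby + apply :', timeit.timeit(lambda: with_apply(big), number=5))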