aggregation of an ordered dataframe based on conditional accumulated previous values in a group (pandas)


I've got an ordered dataframe that I'm trying to aggregate by some grouping columns, based on accumulated previous values of other columns.

import pandas as pd

df = pd.DataFrame({'ID': ['ID1','ID1','ID1','ID1','ID1','ID2','ID2','ID2','ID2'],
                   'Group': ['Group1','Group2','Group2','Group2','Group1','Group2','Group2','Group2','Group1'],
                   'Value1': [0,1,1,1,1,1,0,0,0],
                   'Value2': [1,2,3,4,5,4,3,2,2]})


df
        ID  Group   Value1  Value2
    0   ID1 Group1    0       1
    1   ID1 Group2    1       2
    2   ID1 Group2    1       3
    3   ID1 Group2    1       4
    4   ID1 Group1    1       5
    5   ID2 Group2    1       4
    6   ID2 Group2    0       3
    7   ID2 Group2    0       2
    8   ID2 Group1    0       2

I'd like to aggregate three different ways using Value1 and Value2, grouped by ID and Group. df is already ordered (by date, ID and Group).

Output1: count the number of 1s in previous rows of Value1, by ID and Group (excluding the row itself)

Output2: sum the value of previous rows of Value2, by ID and Group (including the row itself)

Output3: sum Value2 of previous rows, by ID and Group, if Value1 of those previous rows is 1 (excluding the row itself)

Here's my desired output:

    ID  Group   Value1  Value2  Output1 Output2 Output3
0   ID1 Group1    0       1        0      1       NaN
1   ID1 Group2    1       2        0      2       NaN
2   ID1 Group2    1       3        1      5        2
3   ID1 Group2    1       4        2      9        5
4   ID1 Group1    1       5        0      6       NaN 
5   ID2 Group2    1       4        0      4       NaN
6   ID2 Group2    0       3        1      7        4
7   ID2 Group2    0       2        1      9        4
8   ID2 Group1    0       2        0      2       NaN

To make sure it's clear what I'm trying to do, let's look at output index 3 (the fourth row):

3   ID1 Group2    1       4        2      9        5

Output1 = 2 because there are two rows above it in ID1/Group2 that have Value1 = 1.

Output2 = 9 because the sum of Value2 over all rows above it in ID1/Group2, including the row itself, is (2 + 3 + 4 = 9).

Output3 = 5 because there are two previous rows in ID1/Group2 that have Value1 = 1, so the sum of their Value2 is (2 + 3 = 5).
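For instance, here's a quick sanity check of those numbers on the sample frame, using plain pandas indexing just to make the definitions concrete:

prev = df.iloc[1:3]                              # rows above index 3 in ID1/Group2
prev['Value1'].sum()                             # 2 -> Output1
df.iloc[1:4]['Value2'].sum()                     # 9 -> Output2 (includes row 3)
prev.loc[prev['Value1'].eq(1), 'Value2'].sum()   # 5 -> Output3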

I'd like to add that I'm working on a large dataset, so I'm looking for an efficient, high-performance solution.

CodePudding user response:

Solution

  • For Output1 and Output2: we can use groupby + cumsum.
  • For Output3: the calculation is a little trickier. First mask the values in column Value2 where the corresponding value in Value1 is 0, then group the masked column and take its cumulative sum. Finally, to exclude the current row, subtract the masked column from that cumulative sum.
g = df.groupby(['ID', 'Group'])

# cumulative count of 1s, minus the row itself to exclude it
df['Output1'] = g['Value1'].cumsum() - df['Value1']
# cumulative sum including the row itself
df['Output2'] = g['Value2'].cumsum()

# zero out Value2 where Value1 is 0, cumsum within group,
# then subtract the row's own (masked) value to exclude it
s = df['Value2'].mul(df['Value1'])
df['Output3'] = s.groupby([df['ID'], df['Group']]).cumsum() - s

Result

print(df)

    ID   Group  Value1  Value2  Output1  Output2  Output3
0  ID1  Group1       0       1        0        1        0
1  ID1  Group2       1       2        0        2        0
2  ID1  Group2       1       3        1        5        2
3  ID1  Group2       1       4        2        9        5
4  ID1  Group1       1       5        0        6        0
5  ID2  Group2       1       4        0        4        0
6  ID2  Group2       0       3        1        7        4
7  ID2  Group2       0       2        1        9        4
8  ID2  Group1       0       2        0        2        0

CodePudding user response:

You can add a masked column for the third output and compute a grouped, shifted cumsum. One caveat: cumsum leaves the masked rows' NaNs in place, so a per-group forward-fill is needed to carry the running sum past them (e.g. at index 7):

import numpy as np

# dictionary of shift values
d_shift = {'Value1': 1, 'Value3': 1}
# dictionary of fill values
d_fill  = {'Value1': 0}

df[['Output1', 'Output2', 'Output3']] = (df
 # mask Value2 where Value1 is 0
 .assign(Value3=df['Value2'].where(df['Value1'].eq(1)))
 .groupby(['ID', 'Group'])
 # shift Value1/Value3 to exclude the current row, cumsum, then forward-fill
 # so the running sum carries past masked (NaN) rows
 .transform(lambda x: x.shift(d_shift.get(x.name, 0),
                              fill_value=d_fill.get(x.name, np.nan))
             .cumsum().ffill())
)

Or, in a linear form:

g = (df.assign(Value3=df['Value2'].mask(df['Value1'].ne(1)))
       .groupby(['ID', 'Group']))
df['Output1'] = g['Value1'].apply(lambda s: s.shift(fill_value=0).cumsum())
df['Output2'] = g['Value2'].cumsum()
df['Output3'] = g['Value3'].apply(lambda s: s.shift().cumsum().ffill())

output:

    ID   Group  Value1  Value2  Output1  Output2  Output3
0  ID1  Group1       0       1        0        1      NaN
1  ID1  Group2       1       2        0        2      NaN
2  ID1  Group2       1       3        1        5      2.0
3  ID1  Group2       1       4        2        9      5.0
4  ID1  Group1       1       5        0        6      NaN
5  ID2  Group2       1       4        0        4      NaN
6  ID2  Group2       0       3        1        7      4.0
7  ID2  Group2       0       2        1        9      4.0
8  ID2  Group1       0       2        0        2      NaN
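
Since the question mentions a large dataset, it can be worth validating the vectorized logic against a slow but transparent reference on a small sample before scaling up. A minimal sketch for Output3 (the helper name output3_reference is made up for illustration):

import numpy as np
import pandas as pd

def output3_reference(df):
    # Naive O(n^2) reference: for each row, sum Value2 over earlier rows
    # of the same ID/Group whose Value1 is 1; NaN when no such row exists.
    out = {}
    for _, sub in df.groupby(['ID', 'Group'], sort=False):
        for k in range(len(sub)):
            prev = sub.iloc[:k]
            hits = prev.loc[prev['Value1'].eq(1), 'Value2']
            out[sub.index[k]] = hits.sum() if len(hits) else np.nan
    return pd.Series(out).sort_index()

# e.g. output3_reference(df).equals(df['Output3']) on the sample frame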