Home > Mobile >  pandas conditionally include values in aggregation operation
pandas conditionally include values in aggregation operation

Time:10-05

for a dataframe of:

import pandas as pd
df = pd.DataFrame({
    'trigger':[0,0,0, 1,1,1, 2,2,2, 3,3,3,], 
    'score'  :[1,0,0, 0,1,0 ,0,0,1 ,1,1,1], 
    'label'  :[1,0,0, 0,1,0 ,0,0,1 ,1,1,1]
})

# in reality ranked using some other column
df['rank'] = df.groupby(['trigger']).cumcount()
display(df)

I almost want to compute:

d_eval = df[df['rank'] <=2]
d_eval.groupby(['trigger']).agg({'score':'max', 'label':'max'})

However, I do not want to loose the other values (not included in the rank-filter), but only consider the filtered ones in the aggregation.

Is there another way than:

  1. filtering
  2. joining the result back
  3. continue

more straight forward in pandas?

CodePudding user response:

One option is merge:

d_eval = (df[df['rank'] <=2].groupby(['trigger'])
            .agg({'score':'max', 'label':'max'})
         )

df.merge(d_eval, on='trigger', suffixes=['','_max'])

Output:

    trigger  score  label  rank  score_max  label_max
0         0      1      1     0          1          1
1         0      0      0     1          1          1
2         0      0      0     2          1          1
3         1      0      0     0          1          1
4         1      1      1     1          1          1
5         1      0      0     2          1          1
6         2      0      0     0          1          1
7         2      0      0     1          1          1
8         2      1      1     2          1          1
9         3      1      1     0          1          1
10        3      1      1     1          1          1
11        3      1      1     2          1          1

Or sort-of one-liner

df.merge(df.assign(rank=df.groupby('trigger').cumcount())
           .query('rank <=2')
           .groupby('trigger')[['score','label']].max(),
         on='trigger', suffixes=['','_max']
        )
  • Related