Condition aggregate in pandas-CodePudding

import pandas as pd
import numpy as np

cols = ['name', 'performance', 'activity']

data = [
    ['bob',   50, 95],
    ['bob',   0, 80],
    ['bob',   82, 4],
    ['bob',  50, 120],   
    ['joey',  37, 50],
    ['joey', -4, 70],
    ['joey', 95, 70],
    ['joey', 35, 70]]

df = pd.DataFrame(data, columns=cols)

df  = df.groupby(['name']).agg({'performance' : np.median, 'activity': np.median})

Above is the working snippet of aggregate over groupby. I have a different use case, here I wish to take median of performance only when performance is non-negative. Similarly, while taking median of activity will only consider rows with activity greater than or equal to 10. Is there any way to specify such things during groupby?

CodePudding user response：

Assign back to columns in DataFrame.assign with set not matched values to NaNs by Series.where and then aggregate:

df  = (df.assign(performance = df['performance'].where(df['performance'].ge(0)),
                  activity = df['activity'].where(df['activity'].ge(10)))
                .groupby(['name'], as_index=False)
                .agg(performance_pos=('performance' , np.median),
                      activity_10= ('activity', np.median))
                )

print (df)
   name  performance_pos  activity_10
0   bob             50.0         95.0
1  joey             37.0         70.0

Details:

print (df.assign(performance = df['performance'].where(df['performance'].ge(0)),
                 activity = df['activity'].where(df['activity'].ge(10))))
   name  performance  activity
0   bob         50.0      95.0
1   bob          0.0      80.0
2   bob         82.0       NaN
3   bob         50.0     120.0
4  joey         37.0      50.0
5  joey          NaN      70.0
6  joey         95.0      70.0
7  joey         35.0      70.0

If performance is not important you can filter each group separately:

df  = (df.groupby(['name'], as_index=False)
         .agg({'performance' : lambda x: np.median(x[x.ge(0)]), 
               'activity': lambda x: np.median(x[x.ge(10)])}))
print (df)

   name  performance  activity
0   bob           50        95
1  joey           37        70