Home > OS >  Condition aggregate in pandas
Condition aggregate in pandas

Time:11-28

import pandas as pd
import numpy as np

cols = ['name', 'performance', 'activity']

data = [
    ['bob',   50, 95],
    ['bob',   0, 80],
    ['bob',   82, 4],
    ['bob',  50, 120],   
    ['joey',  37, 50],
    ['joey', -4, 70],
    ['joey', 95, 70],
    ['joey', 35, 70]]

df = pd.DataFrame(data, columns=cols)

df  = df.groupby(['name']).agg({'performance' : np.median, 'activity': np.median})

Above is the working snippet of aggregate over groupby. I have a different use case, here I wish to take median of performance only when performance is non-negative. Similarly, while taking median of activity will only consider rows with activity greater than or equal to 10. Is there any way to specify such things during groupby?

CodePudding user response:

Assign back to columns in DataFrame.assign with set not matched values to NaNs by Series.where and then aggregate:

df  = (df.assign(performance = df['performance'].where(df['performance'].ge(0)),
                  activity = df['activity'].where(df['activity'].ge(10)))
                .groupby(['name'], as_index=False)
                .agg(performance_pos=('performance' , np.median),
                      activity_10= ('activity', np.median))
                )

print (df)
   name  performance_pos  activity_10
0   bob             50.0         95.0
1  joey             37.0         70.0

Details:

print (df.assign(performance = df['performance'].where(df['performance'].ge(0)),
                 activity = df['activity'].where(df['activity'].ge(10))))
   name  performance  activity
0   bob         50.0      95.0
1   bob          0.0      80.0
2   bob         82.0       NaN
3   bob         50.0     120.0
4  joey         37.0      50.0
5  joey          NaN      70.0
6  joey         95.0      70.0
7  joey         35.0      70.0

If performance is not important you can filter each group separately:

df  = (df.groupby(['name'], as_index=False)
         .agg({'performance' : lambda x: np.median(x[x.ge(0)]), 
               'activity': lambda x: np.median(x[x.ge(10)])}))
print (df)

   name  performance  activity
0   bob           50        95
1  joey           37        70
  • Related