import pandas as pd
import numpy as np
cols = ['name', 'performance', 'activity']
data = [
['bob', 50, 95],
['bob', 0, 80],
['bob', 82, 4],
['bob', 50, 120],
['joey', 37, 50],
['joey', -4, 70],
['joey', 95, 70],
['joey', 35, 70]]
df = pd.DataFrame(data, columns=cols)
df = df.groupby(['name']).agg({'performance' : np.median, 'activity': np.median})
Above is the working snippet of aggregate over groupby. I have a different use case, here I wish to take median of performance only when performance is non-negative. Similarly, while taking median of activity will only consider rows with activity greater than or equal to 10. Is there any way to specify such things during groupby?
CodePudding user response:
Assign back to columns in DataFrame.assign
with set not matched values to NaN
s by Series.where
and then aggregate:
df = (df.assign(performance = df['performance'].where(df['performance'].ge(0)),
activity = df['activity'].where(df['activity'].ge(10)))
.groupby(['name'], as_index=False)
.agg(performance_pos=('performance' , np.median),
activity_10= ('activity', np.median))
)
print (df)
name performance_pos activity_10
0 bob 50.0 95.0
1 joey 37.0 70.0
Details:
print (df.assign(performance = df['performance'].where(df['performance'].ge(0)),
activity = df['activity'].where(df['activity'].ge(10))))
name performance activity
0 bob 50.0 95.0
1 bob 0.0 80.0
2 bob 82.0 NaN
3 bob 50.0 120.0
4 joey 37.0 50.0
5 joey NaN 70.0
6 joey 95.0 70.0
7 joey 35.0 70.0
If performance is not important you can filter each group separately:
df = (df.groupby(['name'], as_index=False)
.agg({'performance' : lambda x: np.median(x[x.ge(0)]),
'activity': lambda x: np.median(x[x.ge(10)])}))
print (df)
name performance activity
0 bob 50 95
1 joey 37 70