Let's say I have a DataFrame df = pd.DataFrame({'a': [1, 2, np.nan, 3, 4, 5, 3], 'b': [11, 22, 22, 11, 22, 22, 22]}) (assuming the usual import pandas as pd and import numpy as np):
a b
0 1.0 11
1 2.0 22
2 NaN 22
3 3.0 11
4 4.0 22
5 5.0 22
6 3.0 22
I want to compute a reduced DataFrame where I group by b, and where my column depends on the group-wise mean. Specifically, I want the column to contain the number of elements in a that are smaller than the group-wise mean. For the example above, the group means (ignoring the NaN) are 2.0 for b=11 and 3.5 for b=22, so the expected counts are 1 and 2.
For this I found a solution, but it seems like it could be improved: I am guessing it recomputes the mean 2 times for the '11' group and 5 times for the '22' group, i.e. once per element (see the counting sketch after the output below).
Slow solution using groupby, agg and NamedAgg:
dff = df.groupby('b').agg(c=pd.NamedAgg(column='a', aggfunc=lambda x: sum(i < x.mean() for i in x)))
print(dff)
c
b
11 1
22 2
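To check how often the mean is actually evaluated, one can wrap it in a small counting helper (counting_mean is a hypothetical name used only for this illustration); for this example the count should come out to once per element, i.e. 2 + 5 = 7 calls:
calls = 0
def counting_mean(x):
    # hypothetical wrapper around Series.mean that counts how often it is called
    global calls
    calls += 1
    return x.mean()

df.groupby('b').agg(c=pd.NamedAgg(column='a', aggfunc=lambda x: sum(i < counting_mean(x) for i in x)))
print(calls)  # expected: 7 (2 for group 11, 5 for group 22)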
Do you know a better way where the mean is only computed once per group?
I have searched through the parameters of pandas merge, concat, join, agg, apply, etc., but I think there must be a clever combination of these that achieves what I am trying to do.
CodePudding user response:
Don't use Python's sum; use the vectorized counterpart, which lets you compute the mean only once per group:
df.groupby('b')['a'].agg(c=lambda s: s.lt(s.mean()).sum())
output:
c
b
11 1
22 2
Speed comparison
## provided example
# vectorial approach
1.07 ms ± 33.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# loop
2.86 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
## 70k rows
# vectorial approach
3.19 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# loop
7.67 s ± 104 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
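For context, the timings above can be reproduced along these lines in IPython/Jupyter; building the 70k-row frame by tiling the 7-row example is an assumption about the original benchmark:
# sketch: run with %timeit in IPython; the 10_000x tiling is an assumption
big = pd.concat([df] * 10_000, ignore_index=True)  # ~70k rows

# vectorized approach
%timeit big.groupby('b')['a'].agg(c=lambda s: s.lt(s.mean()).sum())

# loop
%timeit big.groupby('b').agg(c=pd.NamedAgg(column='a', aggfunc=lambda x: sum(i < x.mean() for i in x)))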
CodePudding user response:
In my experience, with a large dataset (millions of observations or more), using merge turns out to be more efficient and faster than using a lambda or transform.
# compute each group's mean and merge it back onto the original rows
tem = df.groupby('b')['a'].mean().reset_index(name='mean')
df = pd.merge(df, tem, on='b', how='left')
# keep only the rows below their group mean, then count them per group
# (note: a group in which no value is below the mean will be absent from the result)
df.loc[df['a'] < df['mean']].groupby('b')['a'].count().reset_index(name='c')
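For reference, the transform approach mentioned above could look roughly like this (a sketch, not taken from the answer, starting from the original df): transform('mean') broadcasts each group's mean back to the original rows, so the mean is also computed only once per group:
# broadcast each group's mean to its rows, compare, then count per group
group_mean = df.groupby('b')['a'].transform('mean')
c = (df['a'] < group_mean).groupby(df['b']).sum().rename('c')
print(c)
# b
# 11    1
# 22    2
# Name: c, dtype: int64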