Let's say I have a DataFrame df = pd.DataFrame({'a': [1, 2, np.nan, 3, 4, 5, 3], 'b': [11, 22, 22, 11, 22, 22, 22]}) (assuming the usual import pandas as pd and import numpy as np):
a b
0 1.0 11
1 2.0 22
2 NaN 22
3 3.0 11
4 4.0 22
5 5.0 22
6 3.0 22
I want to compute a reduced DataFrame where I group by b, and where my column depends on the group-wise mean. Specifically, I want the column to contain the number of elements in a that are smaller than the group-wise mean. For the example above, the group means (ignoring the NaN) are 2.0 for b=11 and 3.5 for b=22, so the expected counts are 1 and 2.
For this I found a solution, but it seems like it could be improved: I am guessing it recomputes the mean 2 times for the '11' group and 5 times for the '22' group, i.e. once per element (see the counting sketch after the output below).
Slow solution using groupby, agg and NamedAgg:
dff = df.groupby('b').agg(c=pd.NamedAgg(column='a', aggfunc=lambda x: sum(i < x.mean() for i in x)))
print(dff)
c
b
11 1
22 2
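To check how often the mean is actually evaluated, one can wrap it in a small counting helper (counting_mean is a hypothetical name used only for this illustration); for this example the count should come out to once per element, i.e. 2 + 5 = 7 calls:
calls = 0
def counting_mean(x):
    # hypothetical wrapper around Series.mean that counts how often it is called
    global calls
    calls += 1
    return x.mean()

df.groupby('b').agg(c=pd.NamedAgg(column='a', aggfunc=lambda x: sum(i < counting_mean(x) for i in x)))
print(calls)  # expected: 7 (2 for group 11, 5 for group 22)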
Do you know a better way where the mean is only computed once per group?
I have searched through the parameters of pandas merge, concat, join, agg, apply, etc., but I think there must be a clever combination of these that achieves what I am trying to do.
CodePudding user response:
Don't use Python's sum; use the vectorized counterpart, which lets you compute the mean only once per group:
df.groupby('b')['a'].agg(c=lambda s: s.lt(s.mean()).sum())
output:
c
b
11 1
22 2
Speed comparison
## provided example
# vectorial approach
1.07 ms ± 33.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# loop
2.86 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
## 70k rows
# vectorial approach
3.19 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# loop
7.67 s ± 104 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
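For context, the timings above can be reproduced along these lines in IPython/Jupyter; building the 70k-row frame by tiling the 7-row example is an assumption about the original benchmark:
# sketch: run with %timeit in IPython; the 10_000x tiling is an assumption
big = pd.concat([df] * 10_000, ignore_index=True)  # ~70k rows

# vectorized approach
%timeit big.groupby('b')['a'].agg(c=lambda s: s.lt(s.mean()).sum())

# loop
%timeit big.groupby('b').agg(c=pd.NamedAgg(column='a', aggfunc=lambda x: sum(i < x.mean() for i in x)))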
CodePudding user response:
In my experience, with a large dataset (millions of observations or more), using merge turns out to be more efficient and faster than using a lambda or transform.
# compute each group's mean and merge it back onto the original rows
tem = df.groupby('b')['a'].mean().reset_index(name='mean')
df = pd.merge(df, tem, on='b', how='left')
# keep only the rows below their group mean, then count them per group
# (note: a group in which no value is below the mean will be absent from the result)
df.loc[df['a'] < df['mean']].groupby('b')['a'].count().reset_index(name='c')
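For reference, the transform approach mentioned above could look roughly like this (a sketch, not taken from the answer, starting from the original df): transform('mean') broadcasts each group's mean back to the original rows, so the mean is also computed only once per group:
# broadcast each group's mean to its rows, compare, then count per group
group_mean = df.groupby('b')['a'].transform('mean')
c = (df['a'] < group_mean).groupby(df['b']).sum().rename('c')
print(c)
# b
# 11    1
# 22    2
# Name: c, dtype: int64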