pandas groupby.apply is slow, even on small datasets

I want to aggregate a pandas DataFrame by two grouping variables and run calculations on each group. Because the calculation mixes columns, I use DataFrame.groupby(...).apply. The following code works but is inexplicably slow: about 3 seconds to aggregate 4000 rows. When I group by only one variable, it takes about half the time, maybe a little less. Any idea why it is so slow?

import random
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4000, 4), columns=list('abcd'))
df['group'] = random.choices([0, 0, 1, 1], k=4000)
df['grupp'] = random.choices([2, 3, 4, 2], k=4000)
df

def f(x):
    d = {}
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['c_d_prodsum'])

import time
start = time.time()
%timeit b=df.groupby(['group','grupp']).apply(f)
end = time.time()
print(end - start)

On my machine, %timeit reports 33.2 ms ± 2.03 ms per loop, and the elapsed time printed at the end is 2.77 seconds.

CodePudding user response：

You'll get better performance if you stick to pandas' built-in vectorized operations and aggregations instead of passing a custom Python function to apply.

For instance...

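def totime():
    # Compute the row-wise product once, as a vectorized column operation.
    df['c*d'] = df['c']*df['d']
    # Sum the product per group with the built-in aggregation.
    d = df.groupby(['group','grupp'])['c*d'].sum().rename('c_d_prodsum')
    return d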

%timeit totime()

This shows 842 µs ± 3.67 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
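
To check that the two approaches really compute the same thing, and to time a single call of each outside IPython, something like the following sketch may help. It is not the original poster's code: the helper names with_apply, vectorized and time_once are made up for illustration, the data is generated with a seeded NumPy generator rather than random.choices, and the apply version selects columns c and d explicitly so that it behaves the same across recent pandas versions.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data shaped like the question's frame (seeded, so runs repeat).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((4000, 4)), columns=list("abcd"))
df["group"] = rng.choice([0, 1], size=4000)
df["grupp"] = rng.choice([2, 3, 4], size=4000)


def with_apply(frame):
    # Hypothetical helper: calls a Python function once per group.
    grouped = frame.groupby(["group", "grupp"])[["c", "d"]]
    return grouped.apply(lambda x: (x["c"] * x["d"]).sum())


def vectorized(frame):
    # Hypothetical helper: multiplies whole columns once, then uses the
    # built-in grouped sum. Does not modify the input frame.
    prod = frame["c"] * frame["d"]
    return prod.groupby([frame["group"], frame["grupp"]]).sum()


def time_once(fn, frame):
    # Times exactly one call, unlike %timeit, which repeats the statement.
    start = time.perf_counter()
    result = fn(frame)
    return result, time.perf_counter() - start


slow, slow_seconds = time_once(with_apply, df)
fast, fast_seconds = time_once(vectorized, df)

# Both results are Series indexed by (group, grupp) with matching values.
same_values = np.allclose(slow.to_numpy(), fast.to_numpy())
print(f"apply:      {slow_seconds * 1e3:.2f} ms")
print(f"vectorized: {fast_seconds * 1e3:.2f} ms")
print("same result:", same_values)
```

This also suggests where the 2.77 seconds in the question comes from: the time.time() calls there wrap the whole %timeit line, and %timeit runs the statement many times, so the printed value is most likely the total for all repetitions. The per-call cost is the 33.2 ms that %timeit reports.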