pandas groupby.apply is slow, even on small DataSets

Time:11-17

I want to aggregate a pandas DataFrame by two group variables and do calculations on each group. Because I want to combine several columns, I use DataFrame.groupby(...).apply. The following code works but is inexplicably slow: it takes about 3 seconds to aggregate 4000 rows. When I change the code to use only one group variable, it takes about half the time, maybe a little less. Any ideas why it is so slow?

import random

import numpy as np
import pandas as pd

# 4000 rows of random data plus two grouping columns
df = pd.DataFrame(np.random.rand(4000, 4), columns=list('abcd'))
df['group'] = random.choices([0, 0, 1, 1], k=4000)
df['grupp'] = random.choices([2, 3, 4, 2], k=4000)
df

def f(x):
    d = {}
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['c_d_prodsum'])

import time
start = time.time()
%timeit b=df.groupby(['group','grupp']).apply(f)
end = time.time()
print(end - start)

On my machine, %timeit reports 33.2 ms ± 2.03 ms per loop, and the printed elapsed time is 2.77 seconds.
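
One possible reading of the two numbers, offered as an assumption rather than a confirmed diagnosis: %timeit runs the statement many times, so a time.time() pair wrapped around it measures the total of all those calls, not a single aggregation. At 33.2 ms per call, roughly 70 to 80 calls would already add up to between 2.3 and 2.7 seconds. The sketch below uses the standard-library timeit module with a stand-in workload (work is a hypothetical placeholder, and repeat=7, number=10 are assumed settings) to show the same effect outside a notebook.

```python
import time
import timeit

def work():
    # Hypothetical stand-in for the groupby call; any small task will do.
    sum(range(10_000))

start = time.perf_counter()
# 7 runs of 10 loops each: the function is called 70 times in total.
per_run_totals = timeit.repeat(work, repeat=7, number=10)
elapsed = time.perf_counter() - start

per_loop = min(per_run_totals) / 10   # time for ONE call
total_calls = 7 * 10
print(f"per loop: {per_loop:.6f} s")
print(f"wall clock around all {total_calls} calls: {elapsed:.6f} s")
```

The wall-clock figure is much larger than the per-loop figure simply because it covers every repetition.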

CodePudding user response:

You'll get better performance if you restrict yourself to only those functions provided by pandas.

For instance...

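def totime():
    # Compute the product once for the whole frame (this adds a column to df),
    # then let pandas' built-in sum do the per-group aggregation.
    df['c*d'] = df['c'] * df['d']
    d = df.groupby(['group', 'grupp'])['c*d'].sum().rename('c_d_prodsum')
    return d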

%timeit totime()

shows 842 µs ± 3.67 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
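
To check that the vectorized version really computes the same thing as the apply version, a minimal sketch along these lines can be used. It rebuilds similar random data with a seeded NumPy generator (the seed and the use of rng.choice are assumptions made for reproducibility, not part of the original code) and uses assign so that df itself is not modified.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
df = pd.DataFrame(rng.random((4000, 4)), columns=list("abcd"))
df["group"] = rng.choice([0, 1], size=4000)
df["grupp"] = rng.choice([2, 3, 4], size=4000)

def f(x):
    # Per-group Python function, as in the question.
    return pd.Series({"c_d_prodsum": (x["c"] * x["d"]).sum()})

# Slow path: f is called once per group.
slow = df.groupby(["group", "grupp"])[["c", "d"]].apply(f)["c_d_prodsum"]

# Fast path: one vectorized multiply, then the built-in grouped sum.
fast = (
    df.assign(c_d=df["c"] * df["d"])
      .groupby(["group", "grupp"])["c_d"]
      .sum()
      .rename("c_d_prodsum")
)

# Both results are indexed by (group, grupp) in the same sorted order.
assert np.allclose(slow.to_numpy(), fast.to_numpy())
```

Selecting only the c and d columns before apply keeps the grouping columns out of the frames passed to f, which is all this function needs.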
