Named aggregations with pandas group by agg are super slow. Why?


I have a df with many groups.

import numpy as np
import pandas as pd

N = int(1E6)
df = pd.DataFrame({'A': np.random.randint(300_000, size=N),
                   'B': np.random.rand(N)})
df.loc[::2, ['B']] = np.nan

I want to calculate the sum of each group, provided the group has at least one non-NaN value. I find that the following is very slow:

df.groupby('A').agg(**{
  'newname' : ('B', lambda x: x.sum(min_count=1))
})

(22 seconds)

while the following is fast:

df.groupby('A').sum(min_count=1)

(0.11 seconds).

However, I would like to use named aggregations.

Am I doing something wrong in the named-aggregation approach that is hurting performance? I tried functools.partial as well (instead of the lambda), but it gives the same performance; see the sketch below.
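
For reference, the functools.partial variant I mentioned looks roughly like this (a sketch of what I tried; it performs about the same as the lambda):

from functools import partial

# Same named aggregation, but with partial instead of a lambda;
# it is still a user-defined function, so it is just as slow.
df.groupby('A').agg(**{
  'newname': ('B', partial(pd.Series.sum, min_count=1))
})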

CodePudding user response:

Once you pass in a lambda, the operation is no longer vectorized across the groups, even though it may still be vectorized within each group. For example: df.groupby('A').agg(**{'newname' : ('B', 'sum')}) is comparable to df.groupby('A')['B'].sum() and is much faster than lambda x: x.sum().
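
As a sketch of that point (with the caveat that the string 'sum' defaults to min_count=0, so it does not exactly reproduce your lambda):

# Passing the built-in name as a string keeps the fast, vectorized path.
fast = df.groupby('A').agg(newname=('B', 'sum'))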

That said, I have read that named aggregation can be a bit slower than applying the built-in functions directly. For example, this would be a bit faster than .agg:

d = df.groupby('A')

pd.DataFrame({'new_name': d['B'].sum(min_count=1),
              'other_name': d['B'].size()})

but then your code base does not look as clean as with .agg.
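
If you only need a single named column, one possible middle ground (just a sketch, using the column name from your example) is to rename the result of the fast path directly:

# Fast groupby sum with min_count, then rename the resulting column.
out = df.groupby('A')['B'].sum(min_count=1).to_frame('newname')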

CodePudding user response:

One reason the second solution is faster could be that, internally, it uses Cython, a Python-like language that compiles to C and is known to be much faster for this kind of numeric loop.

GroupBy.sum() calls GroupBy._agg_general(), which in turn calls GroupBy._cython_agg_general()...
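
A rough way to see the difference yourself (timings will vary by machine; this is only a sketch):

import time

start = time.perf_counter()
df.groupby('A').agg(newname=('B', lambda x: x.sum(min_count=1)))
print('lambda agg:', time.perf_counter() - start)

start = time.perf_counter()
df.groupby('A')['B'].sum(min_count=1)
print('cython sum:', time.perf_counter() - start)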
