I have a DataFrame with 200,000 rows, generated by the code below:
In [3]: import numpy as np

In [4]: import pandas as pd

In [5]: def create_df():
   ...:     df = {}
   ...:     for col_name in ['a', 'b', 'c', 'd', 'e', 'f']:
   ...:         df[col_name] = np.tile(list(range(100000)), 2)
   ...:     return pd.DataFrame(df)

In [6]: df = create_df()
Downstream, I need to group by four of the columns and aggregate the remaining columns' values into lists, as follows:
In [7]: df.groupby(['a', 'b', 'c', 'd']).agg(list)
Out[7]:
                                              e               f
a     b     c     d
0     0     0     0                      [0, 0]          [0, 0]
1     1     1     1                      [1, 1]          [1, 1]
2     2     2     2                      [2, 2]          [2, 2]
3     3     3     3                      [3, 3]          [3, 3]
4     4     4     4                      [4, 4]          [4, 4]
...                                         ...             ...
99995 99995 99995 99995          [99995, 99995]  [99995, 99995]
99996 99996 99996 99996          [99996, 99996]  [99996, 99996]
99997 99997 99997 99997          [99997, 99997]  [99997, 99997]
99998 99998 99998 99998          [99998, 99998]  [99998, 99998]
99999 99999 99999 99999          [99999, 99999]  [99999, 99999]

[100000 rows x 2 columns]
However, this operation is far slower than expected (~15 seconds on average), given that the DataFrame is only 200,000 rows:
In [8]: %timeit df.groupby(['a', 'b', 'c', 'd']).agg(list)
14.7 s ± 1.13 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
By contrast, a count over the same groupby takes only ~45 milliseconds on average, which suggests the slowdown is specific to the list aggregation:
In [10]: %timeit df.groupby(['a', 'b', 'c', 'd']).agg('count')
44.7 ms ± 2.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
What can I do to make the list aggregation faster? Right now it is a real bottleneck in my code.
CodePudding user response:
One approach is to use tuple instead of list:
df.groupby(['a', 'b', 'c', 'd']).agg(tuple)
This gave a 2x speedup in my environment:
df = create_df()
%timeit df.groupby(['a', 'b', 'c', 'd']).agg(tuple)
2.38 s ± 52.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.groupby(['a', 'b', 'c', 'd']).agg(list)
5.41 s ± 28.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
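If the downstream code strictly needs list objects rather than tuples, one option is to aggregate into tuples first and convert each cell back afterwards. This is a sketch, not benchmarked here, and the element-wise conversion may claw back part of the speedup:

import numpy as np
import pandas as pd

# same construction as the question's create_df()
df = pd.DataFrame({col: np.tile(np.arange(100000), 2)
                   for col in ['a', 'b', 'c', 'd', 'e', 'f']})

# aggregate into tuples first (the faster path shown above) ...
agg = df.groupby(['a', 'b', 'c', 'd']).agg(tuple)

# ... then convert each cell to a list; Series.map(list) is version-safe
# (DataFrame.applymap was renamed to DataFrame.map in pandas 2.1)
agg = agg.apply(lambda col: col.map(list))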
CodePudding user response:
Try:

df.groupby(['a', 'b', 'c', 'd']).agg(lambda x: x.tolist())
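For reference, a self-contained version of this suggestion on the question's data. Whether x.tolist() beats the plain list builtin depends on the pandas version, so compare both with %timeit; one plausible (unverified here) explanation for a difference is that Series.tolist converts the group's underlying NumPy array in a single call rather than iterating the Series element by element:

import numpy as np
import pandas as pd

# the question's 200,000-row frame
df = pd.DataFrame({col: np.tile(np.arange(100000), 2)
                   for col in ['a', 'b', 'c', 'd', 'e', 'f']})
grouped = df.groupby(['a', 'b', 'c', 'd'])

# both lines produce identical results; time each with %timeit
result_list = grouped.agg(list)
result_tolist = grouped.agg(lambda x: x.tolist())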