Faster alternative to groupby, unstack then fillna

I'm currently performing the following operations on a dataframe (A) made of two columns, each with several thousand unique values.

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'col1': ['foo', 'bar', 'bar', 'foo', 'baz', 'bar', 'baz'],
...     'col2': ['abc', 'def', 'abc', 'abc', 'def', 'abc', 'ghi']
... })
>>> df
  col1 col2
0  foo  abc
1  bar  def
2  bar  abc
3  foo  abc
4  baz  def
5  bar  abc
6  baz  ghi

The operations performed on this dataframe are:

res = df.groupby(['col1', 'col2']).size().unstack().fillna(0)

The output is a table (B) with the unique values of col1 as rows and the unique values of col2 as columns, where each cell holds the count of rows in the original dataframe matching that combination of values.

>>> res
col2  abc  def  ghi
col1               
bar   2.0  1.0  0.0
baz   0.0  1.0  1.0
foo   2.0  0.0  0.0

The time spent in each operation breaks down approximately as follows (one way to measure such a breakdown is sketched after the list):

  • groupby().size() -> 5%
  • unstack() -> 15%
  • fillna(0) -> 80%
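For reference, a minimal sketch of how a per-step breakdown like this can be measured, by materialising each intermediate result and timing the next step on its own (the numbers above come from the question's real dataset, not this small sample; the variable names here are illustrative):

import timeit

# Materialise the intermediate results so each step can be timed in isolation.
grouped = df.groupby(['col1', 'col2']).size()
unstacked = grouped.unstack()

t_group = timeit.timeit(lambda: df.groupby(['col1', 'col2']).size(), number=100)
t_unstack = timeit.timeit(lambda: grouped.unstack(), number=100)
t_fillna = timeit.timeit(lambda: unstacked.fillna(0), number=100)

total = t_group + t_unstack + t_fillna
print(f"groupby().size(): {t_group / total:.0%}")
print(f"unstack():        {t_unstack / total:.0%}")
print(f"fillna(0):        {t_fillna / total:.0%}")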

The whole sequence can take about 30 minutes on a real dataset (same structure as above, just more rows and more unique values).

Is there a better/faster alternative to get from the original dataframe (A) to the end-result table (B)? The most costly operation is by far the final fillna(0), so I'd be interested in an alternative for this step in particular, but an entirely different approach would be welcome as well.

Note: converting the strings to integers in the original df speeds up the groupby().size() operation by about 5x; however, it doesn't really affect the subsequent operations.
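A minimal sketch of that conversion, assuming pd.factorize is used to produce integer codes (which discards the original labels in the result) or, alternatively, the pandas category dtype (which keeps them); the df_codes and df_cat names are illustrative:

import pandas as pd

# Option 1: replace each string column with integer codes via factorize.
# factorize returns (codes, uniques); only the codes are kept here, so the
# resulting table is indexed by codes rather than the original labels.
df_codes = pd.DataFrame({
    'col1': pd.factorize(df['col1'])[0],
    'col2': pd.factorize(df['col2'])[0],
})

# Option 2: convert to the category dtype, which keeps the original labels.
df_cat = df.astype({'col1': 'category', 'col2': 'category'})

res = df_codes.groupby(['col1', 'col2']).size().unstack().fillna(0)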

CodePudding user response:

Take advantage of filling the NAs in the same step as you unstack by setting a fill_value:

>>> df.groupby(['col1', 'col2']).size().unstack(fill_value=0)
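On the sample dataframe this yields the same counts; note that, at least on recent pandas versions, the result also keeps the integer dtype, because no NaNs are ever materialised, whereas a plain unstack() upcasts to float:

col2  abc  def  ghi
col1               
bar     2    1    0
baz     0    1    1
foo     2    0    0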

CodePudding user response:

timeit on Google Colab:

%timeit df.groupby(['col1', 'col2']).size().unstack().fillna(0)
1000 loops, best of 5: 1.54 ms per loop

%timeit df.groupby(['col1', 'col2']).size().unstack(fill_value=0)
1000 loops, best of 5: 1.47 ms per loop

%timeit df.groupby(['col1','col2'])['col2'].count().unstack(fill_value=0)
1000 loops, best of 5: 1.43 ms per loop

%timeit pd.crosstab(index=df.col1, columns=df.col2)
100 loops, best of 5: 8.11 ms per loop

Update: I've included rafaelc's answer as well.
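For completeness, pd.crosstab builds the same table directly from the raw columns with no explicit groupby, though it was the slowest option in the benchmark above; on the sample dataframe:

>>> pd.crosstab(index=df.col1, columns=df.col2)
col2  abc  def  ghi
col1               
bar     2    1    0
baz     0    1    1
foo     2    0    0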
