I'm currently performing the following operations on a dataframe (A) made of two columns, each with several thousand unique values.
>>> df = pd.DataFrame({
...     'col1': ['foo', 'bar', 'bar', 'foo', 'baz', 'bar', 'baz'],
...     'col2': ['abc', 'def', 'abc', 'abc', 'def', 'abc', 'ghi']
... })
>>> df
  col1 col2
0  foo  abc
1  bar  def
2  bar  abc
3  foo  abc
4  baz  def
5  bar  abc
6  baz  ghi
The operations performed on this dataframe are:
res = df.groupby(['col1', 'col2']).size().unstack().fillna(0)
The output is a table (B) with the unique values of col1 as rows and the unique values of col2 as columns; each cell is the count of rows from the original dataframe matching that combination of values.
>>> res
col2  abc  def  ghi
col1
bar   2.0  1.0  0.0
baz   0.0  1.0  1.0
foo   2.0  0.0  0.0
The approximate share of time spent in each operation:

groupby().size() ->  5%
unstack()        -> 15%
fillna(0)        -> 80%
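One way to reproduce this breakdown is to materialise each intermediate result and time the stages separately. A minimal sketch using time.perf_counter (on the toy dataframe the absolute numbers are meaningless, but the same measurement applies to the real data):

import time
import pandas as pd

df = pd.DataFrame({
    'col1': ['foo', 'bar', 'bar', 'foo', 'baz', 'bar', 'baz'],
    'col2': ['abc', 'def', 'abc', 'abc', 'def', 'abc', 'ghi']
})

# Materialise each stage so its cost can be measured in isolation.
t0 = time.perf_counter()
sizes = df.groupby(['col1', 'col2']).size()
t1 = time.perf_counter()
wide = sizes.unstack()
t2 = time.perf_counter()
res = wide.fillna(0)
t3 = time.perf_counter()

print(f"groupby().size(): {t1 - t0:.4f}s")
print(f"unstack():        {t2 - t1:.4f}s")
print(f"fillna(0):        {t3 - t2:.4f}s")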
The whole sequence can take about 30 minutes on a real dataset (same structure as above, just with more rows and more unique values).
Is there a better/faster alternative to get from the original dataframe (A) to the end-result table (B)? The most costly operation is by far the final fillna(0), so I'd be interested in an alternative for that step in particular, but an entirely different approach would be welcome as well.
Note: converting the strings to integers in the original df speeds up the groupby().size() operation by about 5x; however, it doesn't really affect the subsequent operations.
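For example, a minimal sketch of that conversion using pd.factorize (the codes1/uniques1 names are just illustrative, and mapping the codes back to strings at the end is optional):

import pandas as pd

df = pd.DataFrame({
    'col1': ['foo', 'bar', 'bar', 'foo', 'baz', 'bar', 'baz'],
    'col2': ['abc', 'def', 'abc', 'abc', 'def', 'abc', 'ghi']
})

# Encode each string column as integer codes; keep the uniques so the
# result can be relabelled afterwards.
codes1, uniques1 = pd.factorize(df['col1'])
codes2, uniques2 = pd.factorize(df['col2'])
df_int = pd.DataFrame({'col1': codes1, 'col2': codes2})

res = df_int.groupby(['col1', 'col2']).size().unstack().fillna(0)
# Map the integer codes back to the original string labels.
res.index = uniques1[res.index]
res.columns = uniques2[res.columns]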
CodePudding user response:
Take advantage of filling the NAs in the same step as you unstack by setting a fill_value:
>>> df.groupby(['col1', 'col2']).size().unstack(fill_value=0)
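col2  abc  def  ghi
col1
bar     2    1    0
baz     0    1    1
foo     2    0    0

As a bonus, the counts stay as integers: with fill_value there are no intermediate NaNs, so nothing forces the column dtypes to float.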
CodePudding user response:
timeit on Google Colab:
%timeit df.groupby(['col1', 'col2']).size().unstack().fillna(0)
1000 loops, best of 5: 1.54 ms per loop
%timeit df.groupby(['col1', 'col2']).size().unstack(fill_value=0)
1000 loops, best of 5: 1.47 ms per loop
%timeit df.groupby(['col1','col2'])['col2'].count().unstack(fill_value=0)
1000 loops, best of 5: 1.43 ms per loop
%timeit pd.crosstab(index=df.col1, columns=df.col2)
100 loops, best of 5: 8.11 ms per loop
Update: I've included rafaelc's answer.
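For anyone rerunning these timings at a more realistic size, a synthetic dataframe closer to the scale described in the question can be built like this (a hypothetical setup, not what was timed above):

import numpy as np
import pandas as pd

# Hypothetical scale: ~1M rows, a few thousand unique string values per
# column, mimicking the question's description.
rng = np.random.default_rng(0)
n_rows, n_unique = 1_000_000, 2_000
df = pd.DataFrame({
    'col1': rng.integers(0, n_unique, n_rows).astype(str),
    'col2': rng.integers(0, n_unique, n_rows).astype(str),
})

The same %timeit lines above can then be rerun against this df.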