Let's say I have a frame with a dozen columns, col1 through col12. This frame is large, so I want to minimize time spent in grouping. I'd like to first group by 3 columns, then by a 4th column; I'm keeping the aggregated data set and using each group as an entire object. So,
dfNew1 = df.groupby(["col1", "col2", "col3"], as_index=False)
dfNew2 = df.groupby(["col1", "col2", "col3", "col4"], as_index=False)
This is expensive and IRL takes almost 10 minutes. What I'd like to do is retain some of my "work" for the second groupby. In an ideal world, it would look something like:
baseGrouper = df.groupby(["col1", "col2", "col3"])
dfNew1 = baseGrouper.reset_index()
dfNew2 = baseGrouper.groupby(["col4"]).reset_index() # DataFrameGroupBy does not actually have a "groupby" method
Surely there's some way to keep my grouping work and not do it twice?
CodePudding user response:
If you only care about having the groups, you can use ngroup of the first group:
%%timeit
g1 = df.groupby(["col1", "col2", "col3"], as_index=False)
g2 = df.groupby(["col1", "col2", "col3", "col4"], as_index=False)
g1['col12'].sum(), g2['col12'].sum() # needed for faithful timing comparison
# 378 ms ± 57.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
g1 = df.groupby(["col1", "col2", "col3"], as_index=False)
g2 = df.groupby([g1.ngroup(), "col4"], as_index=False)
g1['col12'].sum(), g2['col12'].sum()
# 264 ms ± 58.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
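For intuition, ngroup just labels every row with the integer id of its group, so the second groupby only has to combine one integer column with col4 instead of recomputing a composite key from three columns. A toy illustration (the demo frame is made up):
import pandas as pd

demo = pd.DataFrame({"col1": [1, 1, 2, 2],
                     "col2": [0, 0, 0, 1],
                     "col3": [5, 5, 5, 5]})
# each row gets the integer id of its (col1, col2, col3) group
print(demo.groupby(["col1", "col2", "col3"]).ngroup())
# 0    0
# 1    0
# 2    1
# 3    2
# dtype: int64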
The effect is greater with more columns:
%%timeit
g1 = df.groupby(["col1", "col2", "col3", "col4", "col5", "col6", "col7"], as_index=False)
g2 = df.groupby(["col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8"], as_index=False)
g1['col12'].sum(), g2['col12'].sum()
# 3.93 s ± 843 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
g1 = df.groupby(["col1", "col2", "col3", "col4", "col5", "col6", "col7"], as_index=False)
g2 = df.groupby([g1.ngroup(), "col8"], as_index=False)
g1['col12'].sum(), g2['col12'].sum()
# 1.68 s ± 544 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
NB. Depending on your real aggregation, there might be some post-processing to perform: the second groupby is keyed by the integer group label from ngroup rather than by col1/col2/col3 themselves.
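If the original key columns are needed in the final result, a minimal sketch (assuming g1 from above and a plain sum over col12; the names labels, key, and out are illustrative) is to merge them back in afterwards:
labels = g1.ngroup().rename("grp")
agg = df.groupby([labels, "col4"])["col12"].sum().reset_index()
# every row of a group carries the same (col1, col2, col3) triple,
# so one representative row per label is enough to map it back
key = df[["col1", "col2", "col3"]].assign(grp=labels).drop_duplicates("grp")
out = agg.merge(key, on="grp").drop(columns="grp")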
Used input (1M rows):
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 20, size=(1_000_000, 12)),
                  columns=[f'col{x+1}' for x in range(12)])
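As a quick sanity check: since ngroup numbers the groups in sorted key order (sort=True is the default), both approaches should produce the same sums in the same order:
g1 = df.groupby(["col1", "col2", "col3"])
direct = df.groupby(["col1", "col2", "col3", "col4"])["col12"].sum().to_numpy()
via_ngroup = df.groupby([g1.ngroup(), "col4"])["col12"].sum().to_numpy()
assert (direct == via_ngroup).all()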