I'm currently trying to find a better way to chain multiple groupby-agg operations in pandas. The idea is to reuse the multi-index from a previous groupby-agg, except but one column, to groupby and perform a new aggregation.
Here's a toy example of how I would write spontanously the operations
import random
import pandas as pd
data = pd.DataFrame({
"name" : random.choices(["N1", "N2", "N3"], k=100),
"part" : random.choices(["P1", "P2", "P3"], k=100),
"sub" : random.choices(["S1", "S2", "S3"], k=100),
"val" : range(100),
})
(
data
.groupby(["name", "part", "sub"])
["val"]
.mean()
.groupby(["name", "part", ])
.agg(["mean", "std"])
.groupby(["name",])
.mean()
)
Rather than explicitely referencing the grouping list at each step, I would like to be able to use the current multi-index but one column.
I'm purposely trying to avoid the syntax using groupby(level=[0, 1])
because I dont think it is much different of writing the actual list.
I'm hoping for a syntax that would allow me to do something like this in pseudo-code :
(
data
.groupby(["name", "part", "sub"])
["val"]
.mean()
.groupby(current_index_but('sub'))
.agg(["mean", "std"])
.groupby(current_index_but('part'))
.mean()
)
OR
(
data
.groupby(["name", "part", "sub"])
["val"]
.mean()
.reset_index("sub")
.groupby(current_index)
.agg(["mean", "std"])
.reset_index("part")
.groupby(current_index)
.mean()
)
One of my main issue so far is that using a callable inside pd.groupby seems to actually apply the function to each elements of the index. This leaves me unable to do something like this
groupby(lambda df: df.index.drop("sub"))
Does anyone has a nice way to work around this issue ? Thanks
CodePudding user response:
What about using pipe
?
(
data
.groupby(["name", "part", "sub"])
["val"]
.mean()
.pipe(lambda g: g
.groupby(g.index.names[:-1])
.agg(["mean", "std"])
.pipe(lambda g: g
.groupby(g.index.names[:-1])
.mean()
)
)
)