I have a pandas DataFrame with Data and Group columns, and I want to apply multiple functions using the agg method.
import numpy as np
import pandas as pd
from scipy.stats import iqr

df = pd.DataFrame({'Data': [1, 2, 3, 5, 10, 5, 3, 3, 4, 1],
                   'Group': [1, 2, 2, 1, 1, 1, 2, 2, 2, 1]})
df.groupby('Group')['Data'].agg(['median', iqr])
which works fine. But now I want to slice the groups before I perform the operations.
The problem is that
df.groupby('Group')['Data'].iloc[2:-2].agg(['median', iqr])
throws an error, because slicing is not supported on pandas GroupBy objects. Likewise,
df.groupby('Group')['Data'].nth[2:-2].agg(['median', iqr])
fails, because that combination is not allowed.
This post shows that a lambda function can be used to slice the values. I tried that, but
df.groupby('Group')['Data'].agg(lambda x: [np.median(x.iloc[2:-2]), iqr(x.iloc[2:-2])])
returns a DataFrame with lists in it, which is not what I want.
How can I apply the slicing and get a well-formatted DataFrame as the return value?
Wanted output:
       median  iqr
Group
1        10.0  0.0
2         3.0  0.0
CodePudding user response:
If you have many groups or many functions to apply, one option is a double groupby:
(df.groupby('Group', as_index=False, group_keys=False)
   .apply(lambda g: g.iloc[2:-2])   # trim the first/last two rows of each group
   .groupby('Group')['Data']
   .agg(['median', iqr])
)
Efficiency will depend on the number of functions and groups.
output:
       median  iqr
Group
1        10.0  0.0
2         3.0  0.0
timings
The double groupby is less efficient than named aggregation for few functions and few groups, but more efficient when there are many of either. This is expected: the slicing is performed once for the double groupby, versus as many times as there are groups × aggregation functions (although on smaller slices) in the other case.
The example provided in the question:
# double groupby
2.43 ms ± 554 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# lambdas with named aggregation
2.39 ms ± 867 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100k rows, 2 groups, 2 aggregations:
# double groupby
12.1 ms ± 913 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# lambdas with named aggregation
8.12 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
100k rows, 2 groups, 100 aggregations:
# double groupby
149 ms ± 22.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# lambdas with named aggregation
300 ms ± 34.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
100k rows, 52 groups, 5 aggregations:
# double groupby
30.1 ms ± 4.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# lambdas with named aggregation
52.2 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
100k rows, 52 groups, 100 aggregations:
# double groupby
220 ms ± 24.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# lambdas with named aggregation
837 ms ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
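Comparisons like these can be reproduced with a small timeit harness. The setup below (100k rows, 2 groups, seeded random data) is an illustrative sketch, not necessarily the exact benchmark configuration; also note that, depending on the pandas version, the apply step may emit a deprecation warning about operating on the grouping column.

```python
import timeit

import numpy as np
import pandas as pd
from scipy.stats import iqr

# Illustrative data: 100k rows, 2 groups
rng = np.random.default_rng(0)
df = pd.DataFrame({'Data': rng.integers(0, 100, 100_000),
                   'Group': rng.integers(1, 3, 100_000)})

def double_groupby():
    return (df.groupby('Group', as_index=False, group_keys=False)
              .apply(lambda g: g.iloc[2:-2])
              .groupby('Group')['Data']
              .agg(['median', iqr]))

def named_agg():
    return df.groupby('Group')['Data'].agg(
        median=lambda x: np.median(x.iloc[2:-2]),
        iqr=lambda x: iqr(x.iloc[2:-2]),
    )

# Both strategies should agree on the result
assert double_groupby().equals(named_agg())

for f in (double_groupby, named_agg):
    t = timeit.timeit(f, number=5) / 5
    print(f'{f.__name__}: {t * 1000:.1f} ms per loop')
```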
CodePudding user response:
OK, it looks like this works, but it might be inefficient.
df.groupby('Group')['Data'].agg(
    median=lambda x: np.median(x.iloc[2:-2]),
    iqr=lambda x: iqr(x.iloc[2:-2])
)
Which gives:
       median  iqr
Group
1        10.0  0.0
2         3.0  0.0
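If more aggregations are added, the repeated x.iloc[2:-2] can be factored into a small wrapper so the slice is written only once. A minimal sketch (trimmed is a made-up helper name, not a pandas API):

```python
import numpy as np
import pandas as pd
from scipy.stats import iqr

df = pd.DataFrame({'Data': [1, 2, 3, 5, 10, 5, 3, 3, 4, 1],
                   'Group': [1, 2, 2, 1, 1, 1, 2, 2, 2, 1]})

def trimmed(func):
    # Wrap an aggregation so it sees each group without its first/last two rows
    return lambda x: func(x.iloc[2:-2])

result = df.groupby('Group')['Data'].agg(
    median=trimmed(np.median),
    iqr=trimmed(iqr),
)
print(result)  # medians 10.0 and 3.0, iqr 0.0 for both groups
```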