I have a DataFrame with a 3-level MultiIndex, for example:
df = pd.DataFrame({
'col0': [0,8,3,1,2,2,0,0],
'col1': range(8),
}, index=pd.MultiIndex.from_product([[0,1]] * 3, names=['idx0', 'idx1', 'idx2']))
>>> df
col0 col1
idx0 idx1 idx2
0 0 0 0 0
1 8 1
1 0 3 2
1 1 3
1 0 0 2 4
1 2 5
1 0 0 6
1 0 7
For each idx0
, I want to find the idx1
that has the lowest mean of col0
. This gives me idx0, idx1
pairs. Then I'd like to select all the rows matching those pairs.
In the example above, the pairs are [(0, 1), (1, 1)]
(with means 2
and 0
, respectively) and the desired result is:
col0 col1
idx0 idx1 idx2
0 1 0 3 2
1 1 3
1 1 0 0 6
1 0 7
What I have tried
Step 1: Group by idx0, idx1
and calculate the mean of col0
:
mean_col0 = df.groupby(['idx0', 'idx1'])['col0'].mean()
>>> mean_col0
idx0 idx1
0 0 4.0
1 2.0
1 0 2.0
1 0.0
Step 2: Select the indexmin (idx1
) by group of idx0
:
level_idxs = mean_col0.groupby('idx0').idxmin()
>>> level_idxs
idx0
0 (0, 1)
1 (1, 1)
Step 3: Use that to filter the original dataframe.
That's the main problem. When I simply try df.loc[ix]
, I get a ValueError
due to shape mismatch. I would need the third index value or a wildcard.
I think I have a solution. Putting it all together with the steps above:
mean_col0 = df.groupby(['idx0', 'idx1'])['col0'].mean()
level_idxs = mean_col0.groupby(["idx0"]).idxmin()
result = df[df.index.droplevel(2).isin(level_idxs)]
But it seems quite complicated. Is there a better way?
CodePudding user response:
You can use .apply()
.
For each group of idx0
: query only those idx1
-s which have the smallest mean in col0
:
df.groupby('idx0').apply(lambda g:
g.query(f"idx1 == {g.groupby('idx1')['col0'].mean().idxmin()}")
).droplevel(0)
The same can be written in this (hopefully more readable) way:
def f(df):
chosen_idx1 = df.groupby('idx1')['col0'].mean().idxmin()
return df.query('idx1 == @chosen_idx1')
df.groupby('idx0').apply(f).droplevel(0)