Home > Enterprise >  Select rows with minimum mean column
Select rows with minimum mean column

Time:08-21

I have a DataFrame with a 3-level MultiIndex, for example:

df = pd.DataFrame({
    'col0': [0,8,3,1,2,2,0,0],
    'col1': range(8),
}, index=pd.MultiIndex.from_product([[0,1]] * 3, names=['idx0', 'idx1', 'idx2']))
>>> df
                col0  col1
idx0 idx1 idx2            
0    0    0        0     0
          1        8     1
     1    0        3     2
          1        1     3
1    0    0        2     4
          1        2     5
     1    0        0     6
          1        0     7

For each idx0, I want to find the idx1 that has the lowest mean of col0. This gives me idx0, idx1 pairs. Then I'd like to select all the rows matching those pairs.

In the example above, the pairs are [(0, 1), (1, 1)] (with means 2 and 0, respectively) and the desired result is:

                col0  col1
idx0 idx1 idx2            
0    1    0        3     2
          1        1     3
1    1    0        0     6
          1        0     7

What I have tried

Step 1: Group by idx0, idx1 and calculate the mean of col0:

mean_col0 = df.groupby(['idx0', 'idx1'])['col0'].mean()
>>> mean_col0
idx0  idx1
0     0       4.0
      1       2.0
1     0       2.0
      1       0.0

Step 2: Select the indexmin (idx1) by group of idx0:

level_idxs = mean_col0.groupby('idx0').idxmin()
>>> level_idxs
idx0
0    (0, 1)
1    (1, 1)

Step 3: Use that to filter the original dataframe.

That's the main problem. When I simply try df.loc[ix], I get a ValueError due to shape mismatch. I would need the third index value or a wildcard.

I think I have a solution. Putting it all together with the steps above:

mean_col0 = df.groupby(['idx0', 'idx1'])['col0'].mean() 
level_idxs = mean_col0.groupby(["idx0"]).idxmin()
result = df[df.index.droplevel(2).isin(level_idxs)]

But it seems quite complicated. Is there a better way?

CodePudding user response:

You can use .apply().

For each group of idx0: query only those idx1-s which have the smallest mean in col0:

df.groupby('idx0').apply(lambda g:
        g.query(f"idx1 == {g.groupby('idx1')['col0'].mean().idxmin()}")
    ).droplevel(0)

The same can be written in this (hopefully more readable) way:

def f(df):
    chosen_idx1 = df.groupby('idx1')['col0'].mean().idxmin()
    return df.query('idx1 == @chosen_idx1')

df.groupby('idx0').apply(f).droplevel(0)
  • Related