reduce Pandas DataFrame by selecting specific rows (max/min) groupby-CodePudding

I have a long pandas DataFrame and like to select a single row of a subset if a criterion applies (min of 'value' in my case).

I have a dataframe that starts like this:

    time name_1 name_2  idx     value
0      0      A      B    0  0.927323
1      0      A      B    1  0.417376 
2      0      A      B    2  0.167633 
3      0      A      B    3  0.458307
4      0      A      B    4  0.312337 
5      0      A      B    5  0.876870
6      0      A      B    6  0.096035 
7      0      A      B    7  0.656454
8      0      A      B    8  0.261049 
9      0      A      B    9  0.220294
10     0      A      C    0  0.902397 
11     0      A      C    1  0.887394
12     0      A      C    2  0.593686
13     0      A      C    3  0.394785 
14     0      A      C    4  0.569566
15     0      A      C    5  0.544009 
16     0      A      C    6  0.404803
17     0      A      C    7  0.209683 
18     0      A      C    8  0.309946
19     0      A      C    9  0.049598

I like to select the rows with the minimum of 'value' to a given 'time','name_1' and 'idx'.

This code does what I want:

import pandas as pd
import numpy as np


values = np.array([0.927323  , 0.41737634, 0.16763339, 0.45830677, 0.31233708,
       0.87687015, 0.09603466, 0.65645383, 0.26104928, 0.22029422,
       0.90239674, 0.88739363, 0.59368645, 0.39478497, 0.56956551,
       0.54400922, 0.40480253, 0.20968343, 0.30994597, 0.04959793,
       0.19251744, 0.52135761, 0.25858556, 0.21825577, 0.0371907 ,
       0.09493446, 0.11676115, 0.95710755, 0.20447907, 0.47587798,
       0.51848566, 0.88683689, 0.33567338, 0.55024871, 0.90575771,
       0.80171702, 0.09314208, 0.55236301, 0.84181111, 0.15364926,
       0.98555741, 0.30371372, 0.05154821, 0.83176642, 0.32537832,
       0.75952016, 0.85063717, 0.13447965, 0.2362897 , 0.51945735,
       0.90693226, 0.85405705, 0.43393479, 0.91383604, 0.11018263,
       0.01436286, 0.39829369, 0.66487798, 0.22727205, 0.13352898,
       0.54781443, 0.60894777, 0.35963582, 0.12307987, 0.45876915,
       0.02289212, 0.12621582, 0.42680046, 0.83070886, 0.40761464,
       0.64063501, 0.20836704, 0.17291092, 0.75085509, 0.1570349 ,
       0.03859196, 0.6824537 , 0.84710239, 0.89886199, 0.2094902 ,
       0.58992632, 0.7078019 , 0.16779968, 0.2419259 , 0.73452264,
       0.09091338, 0.10095228, 0.62192591, 0.20698809, 0.29000293,
       0.20460181, 0.01493776, 0.52598607, 0.16651766, 0.89677289,
       0.52880975, 0.67722748, 0.89929363, 0.30735003, 0.40878873,
       0.66854908, 0.4131948 , 0.40704838, 0.59434805, 0.13346655,
       0.47503708, 0.09459362, 0.48804776, 0.90442952, 0.81338104,
       0.17684766, 0.19449489, 0.81657825, 0.76595993, 0.46624606,
       0.27780779, 0.95146104, 0.37054388, 0.69655618, 0.39371977])

df = pd.DataFrame({'time':[j for j in range(2) for i in range(60)],
                   'name_1':[j for j in ['A','B','C']*2 for i in range(20)],
                   'name_2':[j for j in ['B','C','A']*4 for i in range(10)],
                   'idx':[i for j in range(12) for i in range(10)],
                   'value':values})



out_df = pd.DataFrame()

for t in np.unique(df.time):
    a = df[df.time==t]
    
    for n1 in np.unique(df.name_1):
        
        b = a[a.name_1==n1]
        
        for idx in np.unique(df.idx):
            
            
            c = b[b.idx==idx]
            
            # find the minimum index in c of value
            min_idx = np.argmin(c.value)

            out_df=out_df.append(c.iloc[min_idx])

out_df[:10]
        time name_1 name_2  idx     value
    10   0.0      A      C  0.0  0.902397
    1    0.0      A      B  1.0  0.417376
    2    0.0      A      B  2.0  0.167633
    13   0.0      A      C  3.0  0.394785
    4    0.0      A      B  4.0  0.312337
    15   0.0      A      C  5.0  0.544009
    6    0.0      A      B  6.0  0.096035
    17   0.0      A      C  7.0  0.209683
    8    0.0      A      B  8.0  0.261049
    19   0.0      A      C  9.0  0.049598

But this is really slow on the 4Million rows - of cause. How to speed this up?

I tried groupby, but unfortunately this behaves not as expected:

If I take this DataFrame c:

print(c)
    time name_1 name_2  idx     value
0      0      A      B    0  0.927323
10     0      A      C    0  0.902397

groupby should select the second row since value is the minimum here. However groupby behaves different:

c.groupby(by=['time','name_1','idx']).apply(np.min)
                 time name_1 name_2  idx     value
time name_1 idx                                   
0    A      0       0      A      B    0  0.902397

The minimum value is correct, but name_2 should be C not B.

Any suggestions?

CodePudding user response：

you could try to use idxmin() and use the following line of code:

out_df = df.loc[df.loc[:,['time','name_1','idx','value']].groupby(by=['time','name_1','idx']).idxmin()['value'], :]

CodePudding user response：

Use:

df.loc[df.groupby(['time','name_1','idx'])['value']).idxmin()]