I have a long pandas DataFrame and like to select a single row of a subset if a criterion applies (min of 'value' in my case).
I have a dataframe that starts like this:
time name_1 name_2 idx value
0 0 A B 0 0.927323
1 0 A B 1 0.417376
2 0 A B 2 0.167633
3 0 A B 3 0.458307
4 0 A B 4 0.312337
5 0 A B 5 0.876870
6 0 A B 6 0.096035
7 0 A B 7 0.656454
8 0 A B 8 0.261049
9 0 A B 9 0.220294
10 0 A C 0 0.902397
11 0 A C 1 0.887394
12 0 A C 2 0.593686
13 0 A C 3 0.394785
14 0 A C 4 0.569566
15 0 A C 5 0.544009
16 0 A C 6 0.404803
17 0 A C 7 0.209683
18 0 A C 8 0.309946
19 0 A C 9 0.049598
I like to select the rows with the minimum of 'value' to a given 'time','name_1' and 'idx'.
This code does what I want:
import pandas as pd
import numpy as np
values = np.array([0.927323 , 0.41737634, 0.16763339, 0.45830677, 0.31233708,
0.87687015, 0.09603466, 0.65645383, 0.26104928, 0.22029422,
0.90239674, 0.88739363, 0.59368645, 0.39478497, 0.56956551,
0.54400922, 0.40480253, 0.20968343, 0.30994597, 0.04959793,
0.19251744, 0.52135761, 0.25858556, 0.21825577, 0.0371907 ,
0.09493446, 0.11676115, 0.95710755, 0.20447907, 0.47587798,
0.51848566, 0.88683689, 0.33567338, 0.55024871, 0.90575771,
0.80171702, 0.09314208, 0.55236301, 0.84181111, 0.15364926,
0.98555741, 0.30371372, 0.05154821, 0.83176642, 0.32537832,
0.75952016, 0.85063717, 0.13447965, 0.2362897 , 0.51945735,
0.90693226, 0.85405705, 0.43393479, 0.91383604, 0.11018263,
0.01436286, 0.39829369, 0.66487798, 0.22727205, 0.13352898,
0.54781443, 0.60894777, 0.35963582, 0.12307987, 0.45876915,
0.02289212, 0.12621582, 0.42680046, 0.83070886, 0.40761464,
0.64063501, 0.20836704, 0.17291092, 0.75085509, 0.1570349 ,
0.03859196, 0.6824537 , 0.84710239, 0.89886199, 0.2094902 ,
0.58992632, 0.7078019 , 0.16779968, 0.2419259 , 0.73452264,
0.09091338, 0.10095228, 0.62192591, 0.20698809, 0.29000293,
0.20460181, 0.01493776, 0.52598607, 0.16651766, 0.89677289,
0.52880975, 0.67722748, 0.89929363, 0.30735003, 0.40878873,
0.66854908, 0.4131948 , 0.40704838, 0.59434805, 0.13346655,
0.47503708, 0.09459362, 0.48804776, 0.90442952, 0.81338104,
0.17684766, 0.19449489, 0.81657825, 0.76595993, 0.46624606,
0.27780779, 0.95146104, 0.37054388, 0.69655618, 0.39371977])
df = pd.DataFrame({'time':[j for j in range(2) for i in range(60)],
'name_1':[j for j in ['A','B','C']*2 for i in range(20)],
'name_2':[j for j in ['B','C','A']*4 for i in range(10)],
'idx':[i for j in range(12) for i in range(10)],
'value':values})
out_df = pd.DataFrame()
for t in np.unique(df.time):
a = df[df.time==t]
for n1 in np.unique(df.name_1):
b = a[a.name_1==n1]
for idx in np.unique(df.idx):
c = b[b.idx==idx]
# find the minimum index in c of value
min_idx = np.argmin(c.value)
out_df=out_df.append(c.iloc[min_idx])
out_df[:10]
time name_1 name_2 idx value
10 0.0 A C 0.0 0.902397
1 0.0 A B 1.0 0.417376
2 0.0 A B 2.0 0.167633
13 0.0 A C 3.0 0.394785
4 0.0 A B 4.0 0.312337
15 0.0 A C 5.0 0.544009
6 0.0 A B 6.0 0.096035
17 0.0 A C 7.0 0.209683
8 0.0 A B 8.0 0.261049
19 0.0 A C 9.0 0.049598
But this is really slow on the 4Million rows - of cause. How to speed this up?
I tried groupby, but unfortunately this behaves not as expected:
If I take this DataFrame c:
print(c)
time name_1 name_2 idx value
0 0 A B 0 0.927323
10 0 A C 0 0.902397
groupby should select the second row since value is the minimum here. However groupby behaves different:
c.groupby(by=['time','name_1','idx']).apply(np.min)
time name_1 name_2 idx value
time name_1 idx
0 A 0 0 A B 0 0.902397
The minimum value is correct, but name_2 should be C not B.
Any suggestions?
CodePudding user response:
you could try to use idxmin() and use the following line of code:
out_df = df.loc[df.loc[:,['time','name_1','idx','value']].groupby(by=['time','name_1','idx']).idxmin()['value'], :]
CodePudding user response:
Use:
df.loc[df.groupby(['time','name_1','idx'])['value']).idxmin()]