How to select the subset of rows where distance is lowest, grouping by date
and p
columns?
df
v p distance date
0 14.6 sst 22454.1 2021-12-30
1 14.9 sst 24454.1 2021-12-30
2 14.8 sst 33687.4 2021-12-30
3 1.67 wvht 23141.8 2021-12-30
4 1.9 wvht 24454.1 2021-12-30
5 1.8 wvht 24454.1 2021-12-30
6 1.7 wvht 23141.4 2021-12-31
7 2.1 wvht 24454.1 2021-12-31
Ideally, the returned dataframe should contain:
df
v p distance date
0 14.6 sst 22454.1 2021-12-30
3 1.67 wvht 23141.8 2021-12-30
6 1.7 wvht 23141.4 2021-12-31
CodePudding user response:
One way is to use groupby
idxmin
to get the index of the smallest distance per group, then use loc
to get the desired output:
out = df.loc[df.groupby(['date', 'p'])['distance'].idxmin()]
Output:
v p distance date
0 14.60 sst 22454.1 2021-12-30
3 1.67 wvht 23141.8 2021-12-30
6 1.70 wvht 23141.4 2021-12-31
CodePudding user response:
sort values by p and distance. Drop any duplicates keeping first occurance in each p and date
df.sort_values(by=['p', 'distance']).drop_duplicates(subset=['p','date'],keep='first')
v p distance date
0 14.60 sst 22454.1 2021-12-30
6 1.70 wvht 23141.4 2021-12-31
3 1.67 wvht 23141.8 2021-12-30
CodePudding user response:
If you don't need original indexes then you can use .first()
or .min('distance')
and later reset_index()
.
df.groupby(['date', 'p']).first().reset_index()
import pandas as pd
text = '''v p distance date
0 14.6 sst 22454.1 2021-12-30
1 14.9 sst 24454.1 2021-12-30
2 14.8 sst 33687.4 2021-12-30
3 1.67 wvht 23141.8 2021-12-30
4 1.9 wvht 24454.1 2021-12-30
5 1.8 wvht 24454.1 2021-12-30
6 1.7 wvht 23141.4 2021-12-31
7 2.1 wvht 24454.1 2021-12-31'''
import io
df = pd.read_csv(io.StringIO(text), sep='\s ')
df.groupby(['date', 'p']).first().reset_index()
Result:
date p v distance
0 2021-12-30 sst 14.60 22454.1
1 2021-12-30 wvht 1.67 23141.8
2 2021-12-31 wvht 1.70 23141.4