I have the following dataset:
df
NUTS_ID_from ID_inter_start ID_inter_end NUTS_ID_to Total_cost
0 DE111 DEA12 DE403 DE111 1886.332
1 DE111 DEA12 DE403 DE112 1918.088
2 DE111 DEA12 DE403 DE113 1887.726
3 DE111 DEA12 DE403 DE114 1858.129
4 DE111 DEA12 DE403 DE115 1868.074
... ... ... ... ... ...
4663224 DEG0P DE212 DEA23 DEG0K 1434.235
4663225 DEG0P DE212 DEA23 DEG0L 1474.661
4663226 DEG0P DE212 DEA23 DEG0M 1507.811
4663227 DEG0P DE212 DEA23 DEG0N 1259.543
4663228 DEG0P DE212 DEA23 DEG0P 1247.711
There are 401 unique values in each NUTS_ID_from and NUTS_ID_to, therefore 401*401 = 160 801 unique pairs. I need to find minimal total cost for each of these unique pairs.
I tried to accomplish that by using the formula below:
min_cost = df.groupby(['NUTS_ID_from', 'NUTS_ID_to'], as_index=False).agg(min)
It identifies the minimal cost, however the sequence of rows in other two columns (ID_inter_start and ID_inter_end) is disrupted. But for me it's critical to preserve values of rows as in the initial dataframe. I would really appreciate some help with that.
Thank you.
Update
Small chunk of the initial dataset:
df
NUTS_ID_from ID_inter_start ID_inter_end NUTS_ID_to Total_cost
0 DE111 DEA12 DE403 DE120 1886.332
1 DE111 DEA13 DE405 DE120 1918.088
2 DE113 DEA18 DE407 DE131 1887.726
3 DE113 DEA19 DE408 DE131 1858.129
4 DE115 DEA22 DE409 DE166 1868.074
5 DE115 DEA23 DE511 DE166 1434.235
Desired outcome:
df
NUTS_ID_from ID_inter_start ID_inter_end NUTS_ID_to Total_cost
0 DE111 DEA12 DE403 DE120 1886.332
3 DE113 DEA19 DE408 DE131 1858.129
5 DE115 DEA23 DE511 DE166 1434.235
For each pair of 'NUTS_ID_from' and 'NUTS_ID_to' only the row mith minimal cost must remain in the dataset. Values in other columns of the selected rows must remain the same as before groupby.
Update
Running the code below I still get the same issue that other columns are shifted:
# create a sequence id after dataframe is sorted in the desired sequence
df['seq']=df.assign(seq=1)['seq'].cumsum()
# find the min for each group and order result by seq id
(df.groupby(['NUTS_ID_from', 'NUTS_ID_to'], as_index=False)
.agg(min)
.sort_values('seq'))
Outcome:
NUTS_ID_from ID_inter_start ID_inter_end NUTS_ID_to Total_cost
0 DE111 DEA12 DE403 DE120 1886.332
1 DE113 DEA18 DE407 DE131 1858.129
2 DE115 DEA22 DE409 DE166 1434.235
CodePudding user response:
IIUC,
it helped when you posted teh sample data and expected out. You need to specify the column that you need to use for the finding the min value
# with groupby choose the idx where there is a min price for the group
# using loc, list that row
df.loc[df.groupby(['NUTS_ID_from', 'NUTS_ID_to'], as_index=False)['Total_cost'].agg({'idx':'idxmin'})['idx']]
NUTS_ID_from ID_inter_start ID_inter_end NUTS_ID_to Total_cost
0 DE111 DEA12 DE403 DE120 1886.332
3 DE113 DEA19 DE408 DE131 1858.129
5 DE115 DEA23 DE511 DE166 1434.235