Home > OS >  Pandas: How to find minimal value in a column based on pair of values from 2 other columns, keeping
Pandas: How to find minimal value in a column based on pair of values from 2 other columns, keeping

Time:10-31

I have the following dataset:

df
            NUTS_ID_from     ID_inter_start     ID_inter_end    NUTS_ID_to   Total_cost
0           DE111            DEA12              DE403           DE111        1886.332
1           DE111            DEA12              DE403           DE112        1918.088
2           DE111            DEA12              DE403           DE113        1887.726
3           DE111            DEA12              DE403           DE114        1858.129
4           DE111            DEA12              DE403           DE115        1868.074
...         ...              ...                ...             ...          ...
4663224     DEG0P            DE212              DEA23           DEG0K        1434.235
4663225     DEG0P            DE212              DEA23           DEG0L        1474.661
4663226     DEG0P            DE212              DEA23           DEG0M        1507.811
4663227     DEG0P            DE212              DEA23           DEG0N        1259.543
4663228     DEG0P            DE212              DEA23           DEG0P        1247.711

There are 401 unique values in each NUTS_ID_from and NUTS_ID_to, therefore 401*401 = 160 801 unique pairs. I need to find minimal total cost for each of these unique pairs.

I tried to accomplish that by using the formula below:

min_cost = df.groupby(['NUTS_ID_from', 'NUTS_ID_to'], as_index=False).agg(min) 

It identifies the minimal cost, however the sequence of rows in other two columns (ID_inter_start and ID_inter_end) is disrupted. But for me it's critical to preserve values of rows as in the initial dataframe. I would really appreciate some help with that.

Thank you.

Update

Small chunk of the initial dataset:

df
            NUTS_ID_from     ID_inter_start     ID_inter_end    NUTS_ID_to   Total_cost
0           DE111            DEA12              DE403           DE120        1886.332
1           DE111            DEA13              DE405           DE120        1918.088
2           DE113            DEA18              DE407           DE131        1887.726
3           DE113            DEA19              DE408           DE131        1858.129
4           DE115            DEA22              DE409           DE166        1868.074
5           DE115            DEA23              DE511           DE166        1434.235

Desired outcome:

df
            NUTS_ID_from     ID_inter_start     ID_inter_end    NUTS_ID_to   Total_cost
0           DE111            DEA12              DE403           DE120        1886.332
3           DE113            DEA19              DE408           DE131        1858.129
5           DE115            DEA23              DE511           DE166        1434.235

For each pair of 'NUTS_ID_from' and 'NUTS_ID_to' only the row mith minimal cost must remain in the dataset. Values in other columns of the selected rows must remain the same as before groupby.

Update

Running the code below I still get the same issue that other columns are shifted:

# create a sequence id after dataframe is sorted in the desired sequence
df['seq']=df.assign(seq=1)['seq'].cumsum()

# find the min for each group and order result by seq id
(df.groupby(['NUTS_ID_from', 'NUTS_ID_to'], as_index=False)
 .agg(min)
 .sort_values('seq'))

Outcome:

    NUTS_ID_from    ID_inter_start  ID_inter_end    NUTS_ID_to  Total_cost
0   DE111           DEA12           DE403           DE120       1886.332
1   DE113           DEA18           DE407           DE131       1858.129
2   DE115           DEA22           DE409           DE166       1434.235

CodePudding user response:

IIUC,

it helped when you posted teh sample data and expected out. You need to specify the column that you need to use for the finding the min value

# with groupby choose the idx where there is a min price for the group
# using loc, list that row

df.loc[df.groupby(['NUTS_ID_from', 'NUTS_ID_to'], as_index=False)['Total_cost'].agg({'idx':'idxmin'})['idx']]
    NUTS_ID_from    ID_inter_start  ID_inter_end    NUTS_ID_to  Total_cost
0          DE111             DEA12         DE403         DE120  1886.332
3          DE113             DEA19         DE408         DE131  1858.129
5          DE115             DEA23         DE511         DE166  1434.235
  • Related