Pandas - How to drop rows based on a unique column value where another column value is a minimum and

Time:09-22

I have a pandas dataframe with something like the following:

index  order_id  cost
123a   123       5
123b   123       None
123c   123       3
124a   124       None
124b   124       None

For each unique value of order_id, I'd like to drop every row that doesn't have the lowest cost. For any order_id whose cost values are all null, any one row for that order_id can be retained.

I've been struggling with this for a while now.

ol3 = ol3.loc[ol3.groupby('order_id').cost.idxmin()]

This code doesn't play nice with the order_ids that contain only nulls. So I tried to figure out how to drop the nulls I don't want with

ol4 = ol3.loc[ol3['cost'].isna()].drop_duplicates(subset=['order_id', 'cost'], keep='first')

This gives me the list of all-null order_ids I want to retain. I'm not sure where to go from here, and I'm pretty sure I'm looking at this the wrong way. Any help would be appreciated!

CodePudding user response:

You can use transform to get the minimum cost per order_id broadcast back onto each row. We additionally need an isna check for the special order_ids that contain only NaNs:

order_mins = df.groupby('order_id').cost.transform('min')
df[(df.cost == order_mins) | (order_mins.isna())]
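As a quick check, here is a self-contained sketch of this approach, with the sample frame reconstructed from the question (the string row labels are kept as an ordinary column named index, as shown there):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'index': ['123a', '123b', '123c', '124a', '124b'],
    'order_id': [123, 123, 123, 124, 124],
    'cost': [5, None, 3, None, None],
})

# Per-row minimum cost of each row's order_id; NaN when the group is all-NaN
order_mins = df.groupby('order_id').cost.transform('min')

# Keep rows matching their group minimum, plus every row of all-NaN groups
result = df[(df.cost == order_mins) | (order_mins.isna())]
print(result)
```

Note that this keeps every row of an all-NaN group (both 124a and 124b here); if only one such row should survive, you could chain a drop_duplicates on order_id afterwards, along the lines of the attempt in the question.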

CodePudding user response:

You can (temporarily) fill the NA/None with np.inf before taking the idxmin, so a NaN row is only ever selected when its whole group is NaN:

import numpy as np

ol3.loc[ol3['cost'].fillna(np.inf).groupby(ol3['order_id']).idxmin()]

This leaves exactly one row per order_id.

output:

  index  order_id  cost
2  123c       123   3.0
3  124a       124   NaN
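For reference, a self-contained version of this approach, with the sample frame reconstructed from the question (the string row labels are kept as an ordinary column named index):

```python
import numpy as np
import pandas as pd

# Sample data from the question
ol3 = pd.DataFrame({
    'index': ['123a', '123b', '123c', '124a', '124b'],
    'order_id': [123, 123, 123, 124, 124],
    'cost': [5, None, 3, None, None],
})

# Filling NaN with +inf means idxmin never prefers a NaN over a real cost,
# yet still returns the first row of an all-NaN group
result = ol3.loc[ol3['cost'].fillna(np.inf).groupby(ol3['order_id']).idxmin()]
print(result)
```

The fill is only applied to the temporary Series passed to groupby, so the cost column in the result still shows NaN for the all-null order_id.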