I have a pandas dataframe with something like the following:
index | order_id | cost |
---|---|---|
123a | 123 | 5 |
123b | 123 | None |
123c | 123 | 3 |
124a | 124 | None |
124b | 124 | None |
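For reference, here's a minimal way to build this sample (an assumption on my part: index is an ordinary column rather than the dataframe's index, which matches the output shown in the second answer):

import pandas as pd

ol3 = pd.DataFrame({
    'index': ['123a', '123b', '123c', '124a', '124b'],
    'order_id': [123, 123, 123, 124, 124],
    'cost': [5, None, 3, None, None],  # None is read in as NaN (float column)
})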
For each unique value of order_id, I'd like to drop every row that doesn't have the lowest cost. For any order_id that contains only nulls in cost, any one of its rows can be retained.
I've been struggling with this for a while now.
# keep the min-cost row per order_id
ol3 = ol3.loc[ol3.groupby('order_id').cost.idxmin()]
This code doesn't play nicely with the order_ids that have only nulls. So I tried to figure out how to drop the nulls I don't want with
# one null row per order_id
ol4 = ol3.loc[ol3['cost'].isna()].drop_duplicates(subset=['order_id', 'cost'], keep='first')
This gives me the null rows I want to retain, one per order_id. I'm not sure where to go from here; presumably the two pieces could be combined, something like the sketch below, but I'm pretty sure I'm looking at this the wrong way. Any help would be appreciated!
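A sketch of that combination (untested beyond the sample data; it assumes the column is spelled order_id as in the table above):

import pandas as pd

# count of non-null costs per order_id, aligned to every row
n_costs = ol3.groupby('order_id')['cost'].transform('count')

# orders with at least one real cost: keep the min-cost row
best = ol3.loc[ol3[n_costs > 0].groupby('order_id')['cost'].idxmin()]

# all-null orders: keep one row each
nulls = ol3[n_costs == 0].drop_duplicates(subset='order_id', keep='first')

result = pd.concat([best, nulls]).sort_index()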
CodePudding user response:
You can use transform to get the minimum cost per order_id, aligned to every row. We additionally need an isna check for the special order_ids that have only NaNs:
order_mins = df.groupby('order_id').cost.transform('min')
df[(df.cost == order_mins) | (order_mins.isna())]
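Applied to the sample frame (taking df to be the questioner's ol3), this keeps 123c plus both all-NaN rows of order 124. If only one row per all-null order should survive, a drop_duplicates afterwards narrows it, e.g.:

df = ol3  # the sample frame from the question

order_mins = df.groupby('order_id').cost.transform('min')
kept = df[(df.cost == order_mins) | (order_mins.isna())]

# optional: collapse the all-NaN orders to a single row each
kept = kept.drop_duplicates(subset='order_id', keep='first')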
CodePudding user response:
You can (temporarily) fill the NA/None with np.inf before getting the idxmin:
import numpy as np

# all-NaN groups become all-inf, so idxmin still returns their first row
ol3.loc[ol3['cost'].fillna(np.inf).groupby(ol3['order_id']).idxmin()]
You will have exactly one row per order_id. Output:
  index  order_id  cost
2  123c       123   3.0
3  124a       124   NaN
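For comparison, an equivalent sketch without the inf sentinel: sort by cost (NaNs go last by default), keep the first row per order_id, then restore the original row order:

ol3.sort_values('cost').drop_duplicates(subset='order_id', keep='first').sort_index()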