Get the max, min value and the respective owner of that max, min price

I have this df:

df = pd.DataFrame({'code_1':[123,456,789,456,123],
                   'code_2':[333,666,999,666,333],
                   'seller':['Andres','Mike','Paul','Andy','Mike'],
                   'price':[12.2,23.51,12.34,10.2,13.0]})

and I want to get this as output:

dff = pd.DataFrame({'code_1':[123,456,789],
                    'code_2':[333,666,999],
                   'seller_maxprice':['Mike','Mike','Paul'],
                   'max_price':[13,23.51,12.34],
                    'avg_price':[12.6,16.85,12.34],
                   'seller_mainprice':['Andres','Andy','Paul'],
                   'min_price':[12.2,10.20,12.34]})

So, from DF I want to get the maxprice, minprice, avgprice, sellermaxprice and sellerminprice of cod_1 and/or cod_2

And there´s one more issue: I dont have both unique indentifier: cod_1 and cod_2 all the time. Sometimes I Have cod_1 and cod_2 is null and vice-versa

I tried:

result = df.groupby(['code_1']).agg({'price': ['mean', 'min', 'max']})

But this approach has two issues:

1: will fail when code_1 is null

2: Does not shoe the seller max and seller min_price

an anyone help me with this?

CodePudding user response：

You can do that with groupby and apply

dff = df.groupby(['code_1','code_2']).apply(lambda x : pd.Series({'seller_maxprice':x.loc[x['price'].idxmax(),'seller'],
                                                            'max_price':x['price'].max(),
                                                            'seller_minprice': x.loc[x['price'].idxmin(), 'seller'],
                                                            'min_price': x['price'].min(),
                                                            'aver_price':x['price'].mean()
                                                            })).reset_index()
Out[22]: 
   code_1  code_2 seller_maxprice  ...  seller_minprice min_price  aver_price
0     123     333            Mike  ...           Andres     12.20      12.600
1     456     666            Mike  ...             Andy     10.20      16.855
2     789     999            Paul  ...             Paul     12.34      12.340
[3 rows x 7 columns]

CodePudding user response：

Your approach is the right one, we can improve it in a number of small ways:

If you set seller as the index, then your seller min/maxprice become idxmin and idxmax − it’s OK for the index to be non-unique
If you expect code_1 to be null, you can use fillna to add some placeholder values
If you use ['price'] after .groupby you’ll get a SeriesGroupBy instead of a DataFrameGroupBy which means you don’t have to specify which columns to aggregate in .agg()
You can define the column names as keys in .agg()

>>> df.set_index('seller').fillna({'code_1': 'missing'})\
...   .groupby(['code_1', 'code_2'])['price']\
...   .agg(maxprice='max', minprice='min', aver_price='mean',
...        seller_minprice='idxmin', seller_maxprice='idxmax',
...       ).reset_index()
...
   code_1  code_2  maxprice  minprice  aver_price seller_minprice seller_maxprice
0     123     333     13.00     12.20      12.600          Andres            Mike
1     456     666     23.51     10.20      16.855            Andy            Mike
2     789     999     12.34     12.34      12.340            Paul            Paul