I have this df:
df = pd.DataFrame({'code_1':[123,456,789,456,123],
'code_2':[333,666,999,666,333],
'seller':['Andres','Mike','Paul','Andy','Mike'],
'price':[12.2,23.51,12.34,10.2,13.0]})
and I want to get this as output:
dff = pd.DataFrame({'code_1':[123,456,789],
'code_2':[333,666,999],
'seller_maxprice':['Mike','Mike','Paul'],
'max_price':[13,23.51,12.34],
'avg_price':[12.6,16.85,12.34],
'seller_mainprice':['Andres','Andy','Paul'],
'min_price':[12.2,10.20,12.34]})
So, from DF I want to get the maxprice, minprice, avgprice, sellermaxprice and sellerminprice
of cod_1 and/or cod_2
And there´s one more issue: I dont have both unique indentifier: cod_1
and cod_2
all the time. Sometimes I Have cod_1 and cod_2 is null
and vice-versa
I tried:
result = df.groupby(['code_1']).agg({'price': ['mean', 'min', 'max']})
But this approach has two issues:
1: will fail when code_1
is null
2: Does not shoe the seller max and seller min_price
an anyone help me with this?
CodePudding user response:
You can do that with groupby
and apply
dff = df.groupby(['code_1','code_2']).apply(lambda x : pd.Series({'seller_maxprice':x.loc[x['price'].idxmax(),'seller'],
'max_price':x['price'].max(),
'seller_minprice': x.loc[x['price'].idxmin(), 'seller'],
'min_price': x['price'].min(),
'aver_price':x['price'].mean()
})).reset_index()
Out[22]:
code_1 code_2 seller_maxprice ... seller_minprice min_price aver_price
0 123 333 Mike ... Andres 12.20 12.600
1 456 666 Mike ... Andy 10.20 16.855
2 789 999 Paul ... Paul 12.34 12.340
[3 rows x 7 columns]
CodePudding user response:
Your approach is the right one, we can improve it in a number of small ways:
- If you set
seller
as the index, then your seller min/maxprice becomeidxmin
andidxmax
− it’s OK for the index to be non-unique - If you expect
code_1
to benull
, you can usefillna
to add some placeholder values - If you use
['price']
after.groupby
you’ll get a SeriesGroupBy instead of a DataFrameGroupBy which means you don’t have to specify which columns to aggregate in.agg()
- You can define the column names as keys in
.agg()
>>> df.set_index('seller').fillna({'code_1': 'missing'})\
... .groupby(['code_1', 'code_2'])['price']\
... .agg(maxprice='max', minprice='min', aver_price='mean',
... seller_minprice='idxmin', seller_maxprice='idxmax',
... ).reset_index()
...
code_1 code_2 maxprice minprice aver_price seller_minprice seller_maxprice
0 123 333 13.00 12.20 12.600 Andres Mike
1 456 666 23.51 10.20 16.855 Andy Mike
2 789 999 12.34 12.34 12.340 Paul Paul