So let's say I have the following df
pd.DataFrame(
{
'store':['a','a','a','a','b','b','b','b','c','c','c','c'],
'avg_sales':[100,20,30,25,50,60,70,80,90,60,40,50],
'month':[0,1,2,3,0,1,2,3,0,1,2,3]
})
store avg_sales month
0 a 100 0
1 a 20 1
2 a 30 2
3 a 25 3
4 b 50 0
5 b 60 1
6 b 70 2
7 b 80 3
8 c 90 0
9 c 60 1
10 c 40 2
11 c 50 3
What I am trying to do is find all stores where their first month (month==0
) is the greatest avg_sales
value. Thus, my desired output is to just keep store a
and c
since their stores had the greatest avg_sale
on month==0
:
store avg_sales month
0 a 100 0
1 a 20 1
2 a 30 2
3 a 25 3
8 c 90 0
9 c 60 1
10 c 40 2
11 c 50 3
I actually started out with a datetime for each month and used the following code to get the cumcounts:
df['month'] = df.sort_values(by=['store','date']).groupby('store').cumcount()
Now I just need to figure out the logic that only keeps stores where their month==0
is their greatest avg_sales
number. Obviously you can do something simple like .loc[]
, but it seems like you need to do another subset within the loc to find out if month==0
is actually the greatest sales number for each group? I'm a bit confused. I could just use np.where()
but that wouldn't account for the multiple groups out-of-the-box...
I can use the following code to make a column that consists of the max avg_sales
number for each store:
df['max_sales'] = df.groupby('store').avg_sales.transform('max')
store avg_sales month max_sales
0 a 100 0 100
1 a 20 1 100
2 a 30 2 100
3 a 25 3 100
4 b 50 0 80
5 b 60 1 80
6 b 70 2 80
7 b 80 3 80
8 c 90 0 90
9 c 60 1 90
10 c 40 2 90
11 c 50 3 90
Now I am stuck on the last step of the logic.
EDIT: just fixing grammar
CodePudding user response:
You can doing filter
df.groupby('store').filter(lambda x : x['avg_sales'][x['month']==0] == x['avg_sales'].max())
Out[623]:
store avg_sales month
0 a 100 0
1 a 20 1
2 a 30 2
3 a 25 3
8 c 90 0
9 c 60 1
10 c 40 2
11 c 50 3
For your transform
s = df.groupby('store').avg_sales.transform('max')
id = df.loc[(df['avg_sales'] ==s) & (df['month'] ==0),'store']
out = df.loc[df['store'].isin(id)]
Or we try sort_values
with drop_duplicates
s = df.sort_values(['avg_sales']).drop_duplicates('store',keep='last')
df.loc[df.store.isin(s.loc[s['month']==0,'store'])]
Out[630]:
store avg_sales month
0 a 100 0
1 a 20 1
2 a 30 2
3 a 25 3
8 c 90 0
9 c 60 1
10 c 40 2
11 c 50 3