Home > OS >  check to see if first row of a group consist of said groups greatest value in pandas and subset data
check to see if first row of a group consist of said groups greatest value in pandas and subset data

Time:03-08

So let's say I have the following df

pd.DataFrame(
    {
        'store':['a','a','a','a','b','b','b','b','c','c','c','c'],
        'avg_sales':[100,20,30,25,50,60,70,80,90,60,40,50],
        'month':[0,1,2,3,0,1,2,3,0,1,2,3]
     
     })

    store   avg_sales   month
0   a   100 0
1   a   20  1
2   a   30  2
3   a   25  3
4   b   50  0
5   b   60  1
6   b   70  2
7   b   80  3
8   c   90  0
9   c   60  1
10  c   40  2
11  c   50  3

What I am trying to do is find all stores where their first month (month==0) is the greatest avg_sales value. Thus, my desired output is to just keep store a and c since their stores had the greatest avg_sale on month==0:

    store   avg_sales   month
0   a   100 0
1   a   20  1
2   a   30  2
3   a   25  3
8   c   90  0
9   c   60  1
10  c   40  2
11  c   50  3

I actually started out with a datetime for each month and used the following code to get the cumcounts:

df['month'] = df.sort_values(by=['store','date']).groupby('store').cumcount()

Now I just need to figure out the logic that only keeps stores where their month==0 is their greatest avg_sales number. Obviously you can do something simple like .loc[], but it seems like you need to do another subset within the loc to find out if month==0 is actually the greatest sales number for each group? I'm a bit confused. I could just use np.where() but that wouldn't account for the multiple groups out-of-the-box...

I can use the following code to make a column that consists of the max avg_sales number for each store:

df['max_sales'] = df.groupby('store').avg_sales.transform('max')


store   avg_sales   month   max_sales
0   a   100 0   100
1   a   20  1   100
2   a   30  2   100
3   a   25  3   100
4   b   50  0   80
5   b   60  1   80
6   b   70  2   80
7   b   80  3   80
8   c   90  0   90
9   c   60  1   90
10  c   40  2   90
11  c   50  3   90

Now I am stuck on the last step of the logic.

EDIT: just fixing grammar

CodePudding user response:

You can doing filter

df.groupby('store').filter(lambda x : x['avg_sales'][x['month']==0] == x['avg_sales'].max())
Out[623]: 
   store  avg_sales  month
0      a        100      0
1      a         20      1
2      a         30      2
3      a         25      3
8      c         90      0
9      c         60      1
10     c         40      2
11     c         50      3

For your transform

s = df.groupby('store').avg_sales.transform('max')

id = df.loc[(df['avg_sales'] ==s) & (df['month'] ==0),'store'] 

out = df.loc[df['store'].isin(id)]

Or we try sort_values with drop_duplicates

s = df.sort_values(['avg_sales']).drop_duplicates('store',keep='last')
df.loc[df.store.isin(s.loc[s['month']==0,'store'])]
Out[630]: 
   store  avg_sales  month
0      a        100      0
1      a         20      1
2      a         30      2
3      a         25      3
8      c         90      0
9      c         60      1
10     c         40      2
11     c         50      3
  • Related