Home > Blockchain >  Get Top N items per month in pandas
Get Top N items per month in pandas

Time:11-23

I have the following dataframe (which is a pretty reduced sample from my original one).

       year  month        id     revenue
80496  2020      2          16   536000.00
84222  2020      1        2758   430344.66
84223  2020      1        2758   679945.17
84224  2020      1        2758   543982.82
84225  2020      3        2758   287728.00
84226  2020      3        2758   321874.00
84227  2020      3        2758   408128.00
84230  2020      3        2758   458304.00
84231  2020      3        2758   332528.00
85309  2020      2        1560   277233.84
85312  2020      2        1560   419910.83
85876  2020      3        2890   333281.00
90690  2020      1        1304   306297.32
90691  2020      1        1304   391399.67
90698  2020      1        1304   314511.00
90699  2020      2        1304   598394.98
90701  2020      2        1304   391659.34
90702  2020      2        1304   420196.17
90703  2020      2        1304   390432.01
90705  2020      2        1304   485868.16
90706  2020      2        1304   340244.66
91066  2020      2        3738   426222.33
92039  2020      2        1273   269514.00
92040  2020      2        1273   399330.00
92043  2020      3        1273   845502.01
92044  2020      3        1273   285688.00
92045  2020      2        1304   471677.67
92053  2020      2        1304   306994.32
92055  2020      2        1304   514140.01
92058  2020      2        1304   670594.83
92059  2020      2        1304   313311.17
92060  2020      2        1304   264825.32
92061  2020      2        1304   417365.50
92063  2020      3        2758   460628.00
92064  2020      3        2758   270060.00
92074  2020      2        4354   626157.00
92075  2020      2        4354  1225539.99
92078  2020      3        2911   381026.67
92079  2020      3        2911   325471.67
92080  2020      3        2911   383008.33
92081  2020      3        2911   267538.66
92082  2020      3        2911   383789.17
92083  2020      3        2911   352452.50
92084  2020      3        2911   279040.84
92085  2020      3        2911   367950.82
92087  2020      2         130   284714.26
92088  2020      2         130   600318.97
92089  2020      2         130   270437.93
92091  2020      2         130   272350.83
92092  2020      2         130   346533.36
92093  2020      2         130   294939.32
92100  2020      2         134   303719.16
92101  2020      3         134   367001.67
92112  2020      3        1561   276828.00
92113  2020      3        1561   279312.01
92114  2020      3        1625   294794.50
92119  2020      3        1625   592332.50
92126  2020      2        2890   620486.67
92128  2020      3        2890   680190.00
92130  2020      3        2890   418707.33
92131  2020      3        2890   328754.99
92132  2020      3        2890   339958.00
92137  2020      3        2890   554962.00
92138  2020      3        2890   365953.34
92139  2020      3        2890   486639.16
92140  2020      3        2890  1610025.83
92141  2020      3        2890   589236.49

I'm trying to get the top 2 ids for each year AND month. So, for example, the idea was to obtain the below df. My main problem here, is to get the Top n along with the dates, because the nlargest method applies to a pd.Series or a dataframe (but in this case, you cannot pass a list for multiple conditions).

      year  month    id     revenue
0     2020      3  2890  6626149.47
1     2020      3  2758  4428253.33
691   2020      2  1304  6997646.32
692   2020      2   130  2229204.96
1785  2020      1  2758  1749250.49
1786  2020      1  1304  1581520.32

So far, my code is:

df.groupby(by=['year', 'month', 'id'])[['revenue']].agg({'revenue': 'sum'}).sort_values(by=['year', 'month', 'revenue'], ascending=False).reset_index()

CodePudding user response:

One option is to sort the columns before grouping, and use the nth function:

(df.sort_values(['year', 'month','revenue', 'id'], 
                ascending=[True, True, False, False])
   .groupby(['year', 'month'], 
            as_index = False, 
            sort = False)
   .nth([0,1])
)

       year  month    id     revenue
84223  2020      1  2758   679945.17
84224  2020      1  2758   543982.82
92075  2020      2  4354  1225539.99
92058  2020      2  1304   670594.83
92140  2020      3  2890  1610025.83
92043  2020      3  1273   845502.01

CodePudding user response:

You don't need apply, sort the dataframe, groupby then use head:

df.sort_values(['year', 'month', 'revenue'], ascending=[True, True, False])\
  .groupby(['year', 'month']).head(2`)

Output:

       year  month    id     revenue
84223  2020      1  2758   679945.17
84224  2020      1  2758   543982.82
92075  2020      2  4354  1225539.99
92058  2020      2  1304   670594.83
92140  2020      3  2890  1610025.83
92043  2020      3  1273   845502.01

CodePudding user response:

You can use GroupBy.apply to call a function for each group (each group being a separate dataframe).

top_two_per_year_month = df.groupby(['year', 'month'], as_index=False).apply(lambda x: x.sort_values('revenue', ascending=False)[0:2]).droplevel(0)

Output:

>>> top_two_per_year_month
       year  month    id     revenue
84223  2020      1  2758   679945.17
84224  2020      1  2758   543982.82
92075  2020      2  4354  1225539.99
92058  2020      2  1304   670594.83
92140  2020      3  2890  1610025.83
92043  2020      3  1273   845502.01

Or, using nlargest, as @LarryTheLlama suggests:

top_two_per_year_month = df.groupby(['year', 'month']).apply(lambda x: x['revenue'].nlargest(2))

Output:

>>> top_two_per_year_month
year  month       
2020  1      84223     679945.17
             84224     543982.82
      2      92075    1225539.99
             92058     670594.83
      3      92140    1610025.83
             92043     845502.01
Name: revenue, dtype: float64

>>> top_two_per_year_month.reset_index().rename({'level_2': 'id'}, axis=1)
   year  month     id     revenue
0  2020      1  84223   679945.17
1  2020      1  84224   543982.82
2  2020      2  92075  1225539.99
3  2020      2  92058   670594.83
4  2020      3  92140  1610025.83
5  2020      3  92043   845502.01
  • Related