I have the following dataframe (which is a pretty reduced sample from my original one).
year month id revenue
80496 2020 2 16 536000.00
84222 2020 1 2758 430344.66
84223 2020 1 2758 679945.17
84224 2020 1 2758 543982.82
84225 2020 3 2758 287728.00
84226 2020 3 2758 321874.00
84227 2020 3 2758 408128.00
84230 2020 3 2758 458304.00
84231 2020 3 2758 332528.00
85309 2020 2 1560 277233.84
85312 2020 2 1560 419910.83
85876 2020 3 2890 333281.00
90690 2020 1 1304 306297.32
90691 2020 1 1304 391399.67
90698 2020 1 1304 314511.00
90699 2020 2 1304 598394.98
90701 2020 2 1304 391659.34
90702 2020 2 1304 420196.17
90703 2020 2 1304 390432.01
90705 2020 2 1304 485868.16
90706 2020 2 1304 340244.66
91066 2020 2 3738 426222.33
92039 2020 2 1273 269514.00
92040 2020 2 1273 399330.00
92043 2020 3 1273 845502.01
92044 2020 3 1273 285688.00
92045 2020 2 1304 471677.67
92053 2020 2 1304 306994.32
92055 2020 2 1304 514140.01
92058 2020 2 1304 670594.83
92059 2020 2 1304 313311.17
92060 2020 2 1304 264825.32
92061 2020 2 1304 417365.50
92063 2020 3 2758 460628.00
92064 2020 3 2758 270060.00
92074 2020 2 4354 626157.00
92075 2020 2 4354 1225539.99
92078 2020 3 2911 381026.67
92079 2020 3 2911 325471.67
92080 2020 3 2911 383008.33
92081 2020 3 2911 267538.66
92082 2020 3 2911 383789.17
92083 2020 3 2911 352452.50
92084 2020 3 2911 279040.84
92085 2020 3 2911 367950.82
92087 2020 2 130 284714.26
92088 2020 2 130 600318.97
92089 2020 2 130 270437.93
92091 2020 2 130 272350.83
92092 2020 2 130 346533.36
92093 2020 2 130 294939.32
92100 2020 2 134 303719.16
92101 2020 3 134 367001.67
92112 2020 3 1561 276828.00
92113 2020 3 1561 279312.01
92114 2020 3 1625 294794.50
92119 2020 3 1625 592332.50
92126 2020 2 2890 620486.67
92128 2020 3 2890 680190.00
92130 2020 3 2890 418707.33
92131 2020 3 2890 328754.99
92132 2020 3 2890 339958.00
92137 2020 3 2890 554962.00
92138 2020 3 2890 365953.34
92139 2020 3 2890 486639.16
92140 2020 3 2890 1610025.83
92141 2020 3 2890 589236.49
I'm trying to get the top 2 ids for each year AND month. So, for example, the idea was to obtain the below df. My main problem here, is to get the Top n along with the dates, because the nlargest
method applies to a pd.Series or a dataframe (but in this case, you cannot pass a list for multiple conditions).
year month id revenue
0 2020 3 2890 6626149.47
1 2020 3 2758 4428253.33
691 2020 2 1304 6997646.32
692 2020 2 130 2229204.96
1785 2020 1 2758 1749250.49
1786 2020 1 1304 1581520.32
So far, my code is:
df.groupby(by=['year', 'month', 'id'])[['revenue']].agg({'revenue': 'sum'}).sort_values(by=['year', 'month', 'revenue'], ascending=False).reset_index()
CodePudding user response:
One option is to sort the columns before grouping, and use the nth
function:
(df.sort_values(['year', 'month','revenue', 'id'],
ascending=[True, True, False, False])
.groupby(['year', 'month'],
as_index = False,
sort = False)
.nth([0,1])
)
year month id revenue
84223 2020 1 2758 679945.17
84224 2020 1 2758 543982.82
92075 2020 2 4354 1225539.99
92058 2020 2 1304 670594.83
92140 2020 3 2890 1610025.83
92043 2020 3 1273 845502.01
CodePudding user response:
You don't need apply, sort the dataframe, groupby then use head
:
df.sort_values(['year', 'month', 'revenue'], ascending=[True, True, False])\
.groupby(['year', 'month']).head(2`)
Output:
year month id revenue
84223 2020 1 2758 679945.17
84224 2020 1 2758 543982.82
92075 2020 2 4354 1225539.99
92058 2020 2 1304 670594.83
92140 2020 3 2890 1610025.83
92043 2020 3 1273 845502.01
CodePudding user response:
You can use GroupBy.apply
to call a function for each group (each group being a separate dataframe).
top_two_per_year_month = df.groupby(['year', 'month'], as_index=False).apply(lambda x: x.sort_values('revenue', ascending=False)[0:2]).droplevel(0)
Output:
>>> top_two_per_year_month
year month id revenue
84223 2020 1 2758 679945.17
84224 2020 1 2758 543982.82
92075 2020 2 4354 1225539.99
92058 2020 2 1304 670594.83
92140 2020 3 2890 1610025.83
92043 2020 3 1273 845502.01
Or, using nlargest
, as @LarryTheLlama suggests:
top_two_per_year_month = df.groupby(['year', 'month']).apply(lambda x: x['revenue'].nlargest(2))
Output:
>>> top_two_per_year_month
year month
2020 1 84223 679945.17
84224 543982.82
2 92075 1225539.99
92058 670594.83
3 92140 1610025.83
92043 845502.01
Name: revenue, dtype: float64
>>> top_two_per_year_month.reset_index().rename({'level_2': 'id'}, axis=1)
year month id revenue
0 2020 1 84223 679945.17
1 2020 1 84224 543982.82
2 2020 2 92075 1225539.99
3 2020 2 92058 670594.83
4 2020 3 92140 1610025.83
5 2020 3 92043 845502.01