Groupby by sum of revenue and the corresponding highest contributing month

I have a bill details data set and I want to do a groupby of the products based on the sum of their Total value, additionally i want a column which indicates the month which has produced the most revenue for the corresponding product

Data set:

Bill_Id  Month  Product_Id  Net_Value
 1        1       20          100
 2        1       20          100
 3        2       20          100
 4        1       30          200
 5        2       30          200
 6        2       30          200

Desired_Result

Product_Id  Total_revenue  Top_Month
   20          300            1
   30          600            2

This just a sample dataset I have the transaction data of the entire year

CodePudding user response：

Pivot the dataframe with aggfunc=sum, then use sum and idxmax along columns axis to find total revenue and month which has the highest contribution to total revenue, finally concat the individual components along column axis to get the result

s = df.pivot_table('Net_Value', 'Product_Id', 'Month', aggfunc='sum')
pd.concat([s.sum(1), s.idxmax(1)], axis=1, keys=['Total_revenue', 'Top_Month'])

            Total_revenue  Top_Month
Product_Id                          
20                    300          1
30                    600          2

CodePudding user response：

Assuming that your only 1 Top month value is needed based on the maximum sum of Net_Revenue, Below is the code that might work for you.

We can achieve this in 3 stages as mentioned below:

1. Extracting the sum of net revenue based on product id

df_1 = df.groupby(['Product_Id']).agg({'Net_Value' : sum}).reset_index()
df_1 = df_1.rename(columns={'Net_Value' : 'Total_revenue'})

print(df_1)
Product_Id  Total_revenue
0   20  300
1   30  600

2. Extracting the best contibuting month based on max sum net revenue for each product id

df_2 = df.groupby(['Product_Id', 'Month']).agg({'Net_Value' : sum}).sort_values('Net_Value', ascending=False).reset_index()
df_2 = df_2.drop_duplicates(subset=['Product_Id'])[['Product_Id', 'Month']]

print(df_2)
Product_Id  Month
0   30  2
1   20  1

3. Final step is to merge this both dataframes into single based on product id

final_df = df_1.merge(df_2)

print(final_df)
Product_Id  Total_revenue  Month
0          20            300      1
1          30            600      2

Please do upvote the solution if it helps :)

CodePudding user response：

As multiple columns are being interacted, I have used the apply function in addition to groupby:

Net_value is calucated using basic aggregate function sum

Top_month required interaction between columns so, first get the index of max Net_value using idxmax then using loc to find the month

The resultant Pandas Series object has the groupby column (Product_id) as index, so it make it a column I have used reset_index

def f(x):
    d = {}
    d['Net_Value'] = x['Net_Value'].sum()
    d['Top_month'] = df.loc[x['Net_Value'].idxmax(), "Month"]
    return pd.Series(d, index=['Net_Value', 'Top_month'])

df.groupby('Product_Id').apply(f).reset_index()

# Output

  Product_Id    Net_Value   Top_month
0   20             300         1
1   30             600         2

Check out this amazing answer which helped me and can help you in the future as well.

CodePudding user response：

Small modification over @Shubham's approach

result = (
    df.pivot_table("Net_Value", "Product_Id", "Month", aggfunc="sum")
    .agg(["sum", "idxmax"], axis=1)
    .set_axis(["Total_revenue", "Top_Month"], axis=1)
)