I have a bill details data set and I want to do a groupby of the products based on the sum of their Total value, additionally i want a column which indicates the month which has produced the most revenue for the corresponding product
Data set:
Bill_Id Month Product_Id Net_Value
1 1 20 100
2 1 20 100
3 2 20 100
4 1 30 200
5 2 30 200
6 2 30 200
Desired_Result
Product_Id Total_revenue Top_Month
20 300 1
30 600 2
This just a sample dataset I have the transaction data of the entire year
CodePudding user response:
Pivot the dataframe with aggfunc=sum
, then use sum
and idxmax
along columns axis to find total revenue and month which has the highest contribution to total revenue, finally concat
the individual components along column axis to get the result
s = df.pivot_table('Net_Value', 'Product_Id', 'Month', aggfunc='sum')
pd.concat([s.sum(1), s.idxmax(1)], axis=1, keys=['Total_revenue', 'Top_Month'])
Total_revenue Top_Month
Product_Id
20 300 1
30 600 2
CodePudding user response:
Assuming that your only 1 Top month value is needed based on the maximum sum of Net_Revenue, Below is the code that might work for you.
We can achieve this in 3 stages as mentioned below:
1. Extracting the sum of net revenue based on product id
df_1 = df.groupby(['Product_Id']).agg({'Net_Value' : sum}).reset_index()
df_1 = df_1.rename(columns={'Net_Value' : 'Total_revenue'})
print(df_1)
Product_Id Total_revenue
0 20 300
1 30 600
2. Extracting the best contibuting month based on max sum net revenue for each product id
df_2 = df.groupby(['Product_Id', 'Month']).agg({'Net_Value' : sum}).sort_values('Net_Value', ascending=False).reset_index()
df_2 = df_2.drop_duplicates(subset=['Product_Id'])[['Product_Id', 'Month']]
print(df_2)
Product_Id Month
0 30 2
1 20 1
3. Final step is to merge this both dataframes into single based on product id
final_df = df_1.merge(df_2)
print(final_df)
Product_Id Total_revenue Month
0 20 300 1
1 30 600 2
Please do upvote the solution if it helps :)
CodePudding user response:
As multiple columns are being interacted, I have used the apply
function in addition to groupby
:
Net_value is calucated using basic aggregate function sum
Top_month required interaction between columns so, first get the index of max Net_value using idxmax
then using loc
to find the month
The resultant Pandas Series object has the groupby column (Product_id) as index, so it make it a column I have used reset_index
def f(x):
d = {}
d['Net_Value'] = x['Net_Value'].sum()
d['Top_month'] = df.loc[x['Net_Value'].idxmax(), "Month"]
return pd.Series(d, index=['Net_Value', 'Top_month'])
df.groupby('Product_Id').apply(f).reset_index()
# Output
Product_Id Net_Value Top_month
0 20 300 1
1 30 600 2
Check out this amazing answer which helped me and can help you in the future as well.
CodePudding user response:
Small modification over @Shubham's approach
result = (
df.pivot_table("Net_Value", "Product_Id", "Month", aggfunc="sum")
.agg(["sum", "idxmax"], axis=1)
.set_axis(["Total_revenue", "Top_Month"], axis=1)
)