I have a df:
product_id  order_id  category  qty_sold  sale_value
table_1     1         Tables    1         999
chair_1     1         Chairs    4         1000
table_1     2         Tables    1         999
chair_1     2         Chairs    2         500
table_2     3         Tables    1         999
chair_1     3         Chairs    4         1000
table_2     4         Tables    1         999
chair_2     4         Chairs    8         2500
table_1     5         Tables    1         999
chair_1     5         Chairs    2         500
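For reference, the sample above can be rebuilt as a DataFrame like this (a minimal sketch; the values are copied straight from the table):

import pandas as pd

df = pd.DataFrame({
    'product_id': ['table_1', 'chair_1', 'table_1', 'chair_1', 'table_2',
                   'chair_1', 'table_2', 'chair_2', 'table_1', 'chair_1'],
    'order_id':   [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    'category':   ['Tables', 'Chairs'] * 5,
    'qty_sold':   [1, 4, 1, 2, 1, 4, 1, 8, 1, 2],
    'sale_value': [999, 1000, 999, 500, 999, 1000, 999, 2500, 999, 500],
})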
It's data of orders of chairs and tables; each row has an order_id and a product_id. I am trying to answer which combinations of best-selling tables and best-selling chairs are the most popular. So basically, I would take the top 50 table product_ids and, for each table product_id, assign the top n chair product_ids that were bought with that table, where "bought with" means sharing the same order_id.
The desired output I am trying to create would look like this:
df_tables - most popular tables by qty_sold

product_id  qty_sold  chair_product_id  chairs_sold  times_sold
table_1     3         chair_1           6            2
                      chair_3           2            1
table_2     2         chair_1           4            1
                      chair_2           8            1
This simply groups the tables and chairs bought together by matching order_id and counts how many times each combination was seen together.
I tried:
# Getting the list of table product_ids sorted by total qty_sold
top_tables = df[df.category == 'Tables'] \
    .groupby('product_id')[['qty_sold']].sum().reset_index() \
    .sort_values('qty_sold', ascending=False) \
    .product_id.to_list()

# Order ids that contain the most popular table
most_popular_table = df[df.product_id == top_tables[0]].order_id.unique().tolist()

# Getting top chairs bought with the most popular table
df[(df.order_id.isin(most_popular_table)) & (df.category == 'Chairs')] \
    .groupby('product_id').agg({'qty_sold': 'sum'}) \
    .sort_values('qty_sold', ascending=False).reset_index()
Which returns (sample from real data):
product_id  qty_sold
4384        661
9974        247
8310        213
4032        166
9891        138
But I cannot think of a way to create the desired output shown above for each of the top 50 tables, and then a second output where I check the top chair and table combinations. I am wondering if there is a neat way to get the structure I want.
CodePudding user response:
Create one DataFrame for Tables and one for Chairs (adding column prefixes to differentiate them), then join them on order_id as the index. Group the result by the pair of product_ids and aggregate to get the number of items sold:
# Split into chairs and tables, indexed by order_id, with prefixed column names
df_chairs = df[df['category'] == 'Chairs'].set_index('order_id').drop(['category', 'sale_value'], axis=1).add_prefix('chair_')
df_tables = df[df['category'] == 'Tables'].set_index('order_id').drop(['category', 'sale_value'], axis=1).add_prefix('table_')

# Join on order_id so every row is a (table, chair) pair bought in the same order
df = df_tables.join(df_chairs)

# Sum quantities per (table, chair) combination and count how often each occurs
df_out = df.groupby(['table_product_id', 'chair_product_id']).agg('sum')
df_out['times_sold'] = df.groupby(['table_product_id', 'chair_product_id']).size()
print(df_out)
Output:
                                   table_qty_sold  chair_qty_sold  times_sold
table_product_id chair_product_id
table_1          chair_1                        3               8           3
table_2          chair_1                        1               4           1
                 chair_2                        1               8           1
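If you then only need the top 50 tables and the top n chairs per table, one way to trim df_out is sketched below (the 50 and n cut-offs come from the question; n = 3 is just a placeholder, and ranking is done by quantities sold, which is one possible reading of "most popular"):

# Keep only the 50 best-selling tables (by total table quantity across all chair pairings)
top50_tables = (df_out.groupby('table_product_id')['table_qty_sold'].sum()
                      .nlargest(50).index)
df_top = df_out[df_out.index.get_level_values('table_product_id').isin(top50_tables)]

# Within each table, keep the n chairs with the highest quantity bought alongside it
n = 3
df_top = (df_top.sort_values('chair_qty_sold', ascending=False)
                .groupby('table_product_id').head(n)
                .sort_index(level='table_product_id', sort_remaining=False))
print(df_top)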