I have a df:
product_id  order_id  category  qty_sold  sale_value
table_1     1         Tables    1         999
chair_1     1         Chairs    4         1000
table_1     2         Tables    1         999
chair_1     2         Chairs    2         500
table_2     3         Tables    1         999
chair_1     3         Chairs    4         1000
table_2     4         Tables    1         999
chair_2     4         Chairs    8         2500
table_1     5         Tables    1         999
chair_1     5         Chairs    2         500
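For reference, the sample above can be rebuilt as a DataFrame like this (a minimal sketch; the values are copied straight from the table):

import pandas as pd

df = pd.DataFrame({
    'product_id': ['table_1', 'chair_1', 'table_1', 'chair_1', 'table_2',
                   'chair_1', 'table_2', 'chair_2', 'table_1', 'chair_1'],
    'order_id':   [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    'category':   ['Tables', 'Chairs'] * 5,
    'qty_sold':   [1, 4, 1, 2, 1, 4, 1, 8, 1, 2],
    'sale_value': [999, 1000, 999, 500, 999, 1000, 999, 2500, 999, 500],
})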
It's data of orders of chairs and tables; each row has an order_id and a product_id. I am trying to answer which combinations of best-selling tables and best-selling chairs are the most popular. So basically, I would take the top 50 table product_ids and, for each table product_id, assign the top n chair product_ids that were bought with that table, where "bought with" means sharing the same order_id.
The desired output I am trying to create would look like this:
df_tables - most popular tables by qty_sold

product_id  qty_sold  chair_product_id  chairs_sold  times_sold
table_1     3         chair_1           6            2
                      chair_3           2            1
table_2     2         chair_1           4            1
                      chair_2           8            1
This simply groups the tables and chairs bought together by matching order_id and counts how many times each combination was seen together.
I tried:
# Getting the list of table product_ids sorted by total qty_sold
top_tables = df[df.category == 'Tables'] \
    .groupby('product_id')[['qty_sold']].sum().reset_index() \
    .sort_values('qty_sold', ascending=False) \
    .product_id.to_list()

# Order ids that contain the most popular table
most_popular_table = df[df.product_id == top_tables[0]].order_id.unique().tolist()

# Getting top chairs bought with the most popular table
df[(df.order_id.isin(most_popular_table)) & (df.category == 'Chairs')] \
    .groupby('product_id').agg({'qty_sold': 'sum'}) \
    .sort_values('qty_sold', ascending=False).reset_index()
Which returns (sample from real data):
product_id  qty_sold
4384        661
9974        247
8310        213
4032        166
9891        138
But I cannot think of a way to create the desired output shown above for each of the top 50 tables, and then a second output where I check the top chair and table combinations. I am wondering if there is a neat way to get the structure I want.
CodePudding user response:
Create one DataFrame for Tables and one for Chairs (adding column prefixes to differentiate them), then join them on order_id as the index. Group the result by the pair of product_ids and aggregate to get the number of items sold:
# Split into chairs and tables, indexed by order_id, with prefixed column names
df_chairs = df[df['category'] == 'Chairs'].set_index('order_id').drop(['category', 'sale_value'], axis=1).add_prefix('chair_')
df_tables = df[df['category'] == 'Tables'].set_index('order_id').drop(['category', 'sale_value'], axis=1).add_prefix('table_')

# Join on order_id so every row is a (table, chair) pair bought in the same order
df = df_tables.join(df_chairs)

# Sum quantities per (table, chair) combination and count how often each occurs
df_out = df.groupby(['table_product_id', 'chair_product_id']).agg('sum')
df_out['times_sold'] = df.groupby(['table_product_id', 'chair_product_id']).size()
print(df_out)
Output:
                                   table_qty_sold  chair_qty_sold  times_sold
table_product_id chair_product_id
table_1          chair_1                        3               8           3
table_2          chair_1                        1               4           1
                 chair_2                        1               8           1
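If you then only need the top 50 tables and the top n chairs per table, one way to trim df_out is sketched below (the 50 and n cut-offs come from the question; n = 3 is just a placeholder, and ranking is done by quantities sold, which is one possible reading of "most popular"):

# Keep only the 50 best-selling tables (by total table quantity across all chair pairings)
top50_tables = (df_out.groupby('table_product_id')['table_qty_sold'].sum()
                      .nlargest(50).index)
df_top = df_out[df_out.index.get_level_values('table_product_id').isin(top50_tables)]

# Within each table, keep the n chairs with the highest quantity bought alongside it
n = 3
df_top = (df_top.sort_values('chair_qty_sold', ascending=False)
                .groupby('table_product_id').head(n)
                .sort_index(level='table_product_id', sort_remaining=False))
print(df_top)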