Home > database >  Find how often products are sold together in Python DataFrame
Find how often products are sold together in Python DataFrame

Time:03-01

I have a dataframe that is sturctured like below, but with 300 different products and about 20.000 orders.

Order Avocado Mango Chili
1546 500 20 0
861153 200 500 5
1657446 500 20 0
79854 200 500 1
4654 500 20 0
74654 0 500 800

I found out what combinations are often together with this code (abbrivated here to 3 products).

size = df.groupby(['AVOCADO', 'MANGO', 'CHILI'], as_index=False).size().sort_values(by=['size'], ascending=False)

Now I want to know per product how often it is bought solo and how often with other products.

Something like this would be my ideal output (fictional numbers) where the percentage shows what percentage of total orders with that product had the other products as well:

Product Avocado Mango Chili
AVOCADO 100% 20 % 1%
MANGO 20 % 100% 3%
CHILI 20% 30% 100%

CodePudding user response:

First we replace actual quantities by 1s and 0s to indicate if the products were in the order or not:

df2 = 1*(df.set_index('Order') > 0)

Then I think the easiest is just to use matrix algebra wrapped into a dataframe. Also given the size of your data it is a good idea to go directly to numpy rather than try to manipulate the dataframe.

For actual numbers of orders that contain (product1,product2), we can do

df3 = pd.DataFrame(data = [email protected], columns = df2.columns, index = df2.columns)

df3 looks like this:

           Avocado    Mango    Chili
-------  ---------  -------  -------
Avocado          5        5        2
Mango            5        6        3
Chili            2        3        3

eg there are 2 orders that contain Avocado and Chili

If you want percentages as in your question, we need to divide by the total number of orders with the given product. Again I htink going to numpy directly is best:

df4 = pd.DataFrame(data = ( (df2.values/np.sum(df2.values,axis=0))[email protected]), columns = df2.columns, index = df2.columns)

df4 is:

           Avocado    Mango    Chili
-------  ---------  -------  -------
Avocado   1               1      0.4
Mango     0.833333        1      0.5
Chili     0.666667        1      1

the 'main' product is in the index and its companion in column so for example for products with Mango, 0.833333 had avocado and 0.5 had Chili

  • Related