Find how often products are sold together in Python DataFrame-CodePudding

I have a dataframe that is sturctured like below, but with 300 different products and about 20.000 orders.

Order	Avocado	Mango	Chili
1546	500	20	0
861153	200	500	5
1657446	500	20	0
79854	200	500	1
4654	500	20	0
74654	0	500	800

I found out what combinations are often together with this code (abbrivated here to 3 products).

size = df.groupby(['AVOCADO', 'MANGO', 'CHILI'], as_index=False).size().sort_values(by=['size'], ascending=False)

Now I want to know per product how often it is bought solo and how often with other products.

Something like this would be my ideal output (fictional numbers) where the percentage shows what percentage of total orders with that product had the other products as well:

Product	Avocado	Mango	Chili
AVOCADO	100%	20 %	1%
MANGO	20 %	100%	3%
CHILI	20%	30%	100%

CodePudding user response：

First we replace actual quantities by 1s and 0s to indicate if the products were in the order or not:

df2 = 1*(df.set_index('Order') > 0)

Then I think the easiest is just to use matrix algebra wrapped into a dataframe. Also given the size of your data it is a good idea to go directly to numpy rather than try to manipulate the dataframe.

For actual numbers of orders that contain (product1,product2), we can do

df3 = pd.DataFrame(data = [email protected], columns = df2.columns, index = df2.columns)

df3 looks like this:

           Avocado    Mango    Chili
-------  ---------  -------  -------
Avocado          5        5        2
Mango            5        6        3
Chili            2        3        3

eg there are 2 orders that contain Avocado and Chili

If you want percentages as in your question, we need to divide by the total number of orders with the given product. Again I htink going to numpy directly is best:

df4 = pd.DataFrame(data = ( (df2.values/np.sum(df2.values,axis=0))[email protected]), columns = df2.columns, index = df2.columns)

df4 is:

           Avocado    Mango    Chili
-------  ---------  -------  -------
Avocado   1               1      0.4
Mango     0.833333        1      0.5
Chili     0.666667        1      1

the 'main' product is in the index and its companion in column so for example for products with Mango, 0.833333 had avocado and 0.5 had Chili