I have a dataframe that is sturctured like below, but with 300 different products and about 20.000 orders.
Order | Avocado | Mango | Chili |
---|---|---|---|
1546 | 500 | 20 | 0 |
861153 | 200 | 500 | 5 |
1657446 | 500 | 20 | 0 |
79854 | 200 | 500 | 1 |
4654 | 500 | 20 | 0 |
74654 | 0 | 500 | 800 |
I found out what combinations are often together with this code (abbrivated here to 3 products).
size = df.groupby(['AVOCADO', 'MANGO', 'CHILI'], as_index=False).size().sort_values(by=['size'], ascending=False)
Now I want to know per product how often it is bought solo and how often with other products.
Something like this would be my ideal output (fictional numbers) where the percentage shows what percentage of total orders with that product had the other products as well:
Product | Avocado | Mango | Chili |
---|---|---|---|
AVOCADO | 100% | 20 % | 1% |
MANGO | 20 % | 100% | 3% |
CHILI | 20% | 30% | 100% |
CodePudding user response:
First we replace actual quantities by 1s and 0s to indicate if the products were in the order or not:
df2 = 1*(df.set_index('Order') > 0)
Then I think the easiest is just to use matrix algebra wrapped into a dataframe. Also given the size of your data it is a good idea to go directly to numpy rather than try to manipulate the dataframe.
For actual numbers of orders that contain (product1,product2), we can do
df3 = pd.DataFrame(data = [email protected], columns = df2.columns, index = df2.columns)
df3
looks like this:
Avocado Mango Chili
------- --------- ------- -------
Avocado 5 5 2
Mango 5 6 3
Chili 2 3 3
eg there are 2 orders that contain Avocado and Chili
If you want percentages as in your question, we need to divide by the total number of orders with the given product. Again I htink going to numpy directly is best:
df4 = pd.DataFrame(data = ( (df2.values/np.sum(df2.values,axis=0))[email protected]), columns = df2.columns, index = df2.columns)
df4
is:
Avocado Mango Chili
------- --------- ------- -------
Avocado 1 1 0.4
Mango 0.833333 1 0.5
Chili 0.666667 1 1
the 'main' product is in the index and its companion in column so for example for products with Mango, 0.833333 had avocado and 0.5 had Chili