Python get pairs of categories that appears the most in a dataframe-CodePudding

Given the table below is dataset, There are 4 columns and first 14 rows of the dataset is given but there are over 10,000 rows. In a order there are multiple products being sold to a client.

I want to find out, Across all orders which pair of product categories appears the most? example (Cat1 and Cat2)

Use any python libraries like numpy, pandas, etc. solve using python only

Note - there are only 3 categories and product ID is unique column, Order ID and Client Id are not repeated.

OrderID ProdID Prodcategory Client ID
4997    1   Cat 1   21
4997    2   Cat 1   22
4997    3   Cat 2   23
4997    4   Cat 3   24
2001    5   Cat 1   25
2001    6   Cat 2   26
2001    7   Cat 2   27
2001    8   Cat 2   28
2001    9   Cat 3   29
2376    10  Cat 3   30
2376    11  Cat 1   31
2376    12  Cat 2   32
2376    13  Cat 3   33
2376    14  Cat 1   34

I want the pairs that appear the most accross all orders. For example in Order 4997 (Cat 1, Cat 2) (Cat 2, Cat 3) and (Cat 1, Cat 3 ) all appear one time. In order 2001, (Cat 1, Cat 2) (Cat1, Cat 3) (Cat 2, Cat 3) appear once In total, I want among all the orders which pair has most value count i.e appears the most

My approach

# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt



# Read the data into the dataframe
#df = pd.read_clipboard()
df



df.columns



df["Product category "].value_counts()



mylabels = ['Cat 2', 'Cat 1', 'Cat 3']
plt.pie(df["Product category "].value_counts(), labels = mylabels)
plt.show()

But this approach only shows, overall value counts and not repeated pairs

CodePudding user response：

Use crosstab/groupby().size() to get the occurrence of each category within an order. Then matrix mutliplication to count co-occurrences:

ct = pd.crosstab(df['OrderID'], df['Prodcategory']).gt(0).astype(int)
co_occur = (ct.T @ ct)

Output:

Prodcategory  Cat 1  Cat 2  Cat 3
Prodcategory                     
Cat 1             3      3      3
Cat 2             3      3      3
Cat 3             3      3      3

Then you can mask the duplicates, stack and play with the output:

counts = co_occur.where(np.tri(len(co_occur), k=-1, dtype=bool)).stack()

which is:

Prodcategory  Prodcategory
Cat 2         Cat 1           3.0
Cat 3         Cat 1           3.0
              Cat 2           3.0
dtype: float64

and counts.idxmax() will give you the pair that occur the most.

CodePudding user response：

You can aggregate the column to a lit, apply combinations from itertools to the set of the list (to remove duplicates) then eplode the result to a new column upon which to call value_counts

from itertools import combinations
cats=df.groupby('OrderID')['Prodcategory']\
    .agg(list)\
    .apply(lambda x:list(combinations(set(x),2)))\
    .explode()

cats
OrderID
2001    (Cat2, Cat1)
2001    (Cat2, Cat3)
2001    (Cat1, Cat3)
2376    (Cat2, Cat1)
2376    (Cat2, Cat3)
2376    (Cat1, Cat3)
4997    (Cat2, Cat1)
4997    (Cat2, Cat3)
4997    (Cat1, Cat3)

cats.value_counts()

(Cat2, Cat3)    3
(Cat1, Cat3)    3
(Cat2, Cat1)    3

CodePudding user response：

try using a pivot_table to count occurrences. Cat 1 and Cat 2 seem to sell the most.

import pandas as pd
import matplotlib.pyplot as plt
from io import StringIO

data="""
OrderID,ProdID,Prodcategory,ClientID
4997,1,'Cat 1',21
4997,2,'Cat 1',22
4997,3,'Cat 2',23
4997,4,'Cat 3',24
2001,5,'Cat 1',25
2001,6,'Cat 2',26
2001,7,'Cat 2',27
2001,8,'Cat 2',28
2001,9,'Cat 3',29
2376,10,'Cat 3',30
2376,11,'Cat 1',31
2376,12,'Cat 2',32
2376,13,'Cat 3',33
2376,14,'Cat 1',34
"""

 # a. create a dataframe from data
 df = pd.read_csv(StringIO(data), sep=',')

 # b. count the occurrence of Productcategory using a pivot_table
 category_list=df['Prodcategory'].unique()
 order_list=df['OrderID'].unique()

 dict=defaultdict(tuple)
 combinations_object = itertools.combinations(category_list, 2)
 for item in combinations_object:
     for order in order_list:
         filter="OrderID=={} and (Prodcategory=='{}' or Prodcategory=='{}')".format(order,item[0],item[1])
         key=item (order,)
        count=len(df.query(filter))
        if count>0:
             dict[key]=count
for item,value in dict.items():
     print(item,value)

output:

key:(category combinations and orderID)

key                      occurrences
('Cat 1', 'Cat 2', 4997) 3
('Cat 1', 'Cat 2', 2001) 4
('Cat 1', 'Cat 2', 2376) 3
('Cat 1', 'Cat 3', 4997) 3
('Cat 1', 'Cat 3', 2001) 2
('Cat 1', 'Cat 3', 2376) 4
('Cat 2', 'Cat 3', 4997) 2
('Cat 2', 'Cat 3', 2001) 4
('Cat 2', 'Cat 3', 2376) 3