Given the table below is dataset, There are 4 columns and first 14 rows of the dataset is given but there are over 10,000 rows. In a order there are multiple products being sold to a client.
I want to find out, Across all orders which pair of product categories appears the most? example (Cat1 and Cat2)
Use any python libraries like numpy, pandas, etc. solve using python only
Note - there are only 3 categories and product ID is unique column, Order ID and Client Id are not repeated.
OrderID ProdID Prodcategory Client ID
4997 1 Cat 1 21
4997 2 Cat 1 22
4997 3 Cat 2 23
4997 4 Cat 3 24
2001 5 Cat 1 25
2001 6 Cat 2 26
2001 7 Cat 2 27
2001 8 Cat 2 28
2001 9 Cat 3 29
2376 10 Cat 3 30
2376 11 Cat 1 31
2376 12 Cat 2 32
2376 13 Cat 3 33
2376 14 Cat 1 34
I want the pairs that appear the most accross all orders. For example in Order 4997 (Cat 1, Cat 2) (Cat 2, Cat 3) and (Cat 1, Cat 3 ) all appear one time. In order 2001, (Cat 1, Cat 2) (Cat1, Cat 3) (Cat 2, Cat 3) appear once In total, I want among all the orders which pair has most value count i.e appears the most
My approach
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
# Read the data into the dataframe
#df = pd.read_clipboard()
df
df.columns
df["Product category "].value_counts()
mylabels = ['Cat 2', 'Cat 1', 'Cat 3']
plt.pie(df["Product category "].value_counts(), labels = mylabels)
plt.show()
But this approach only shows, overall value counts and not repeated pairs
CodePudding user response:
Use crosstab/groupby().size()
to get the occurrence of each category within an order. Then matrix mutliplication to count co-occurrences:
ct = pd.crosstab(df['OrderID'], df['Prodcategory']).gt(0).astype(int)
co_occur = (ct.T @ ct)
Output:
Prodcategory Cat 1 Cat 2 Cat 3
Prodcategory
Cat 1 3 3 3
Cat 2 3 3 3
Cat 3 3 3 3
Then you can mask the duplicates, stack and play with the output:
counts = co_occur.where(np.tri(len(co_occur), k=-1, dtype=bool)).stack()
which is:
Prodcategory Prodcategory
Cat 2 Cat 1 3.0
Cat 3 Cat 1 3.0
Cat 2 3.0
dtype: float64
and counts.idxmax()
will give you the pair that occur the most.
CodePudding user response:
You can aggregate the column to a lit, apply combinations from itertools to the set of the list (to remove duplicates) then eplode the result to a new column upon which to call value_counts
from itertools import combinations
cats=df.groupby('OrderID')['Prodcategory']\
.agg(list)\
.apply(lambda x:list(combinations(set(x),2)))\
.explode()
cats
OrderID
2001 (Cat2, Cat1)
2001 (Cat2, Cat3)
2001 (Cat1, Cat3)
2376 (Cat2, Cat1)
2376 (Cat2, Cat3)
2376 (Cat1, Cat3)
4997 (Cat2, Cat1)
4997 (Cat2, Cat3)
4997 (Cat1, Cat3)
cats.value_counts()
(Cat2, Cat3) 3
(Cat1, Cat3) 3
(Cat2, Cat1) 3
CodePudding user response:
try using a pivot_table to count occurrences. Cat 1 and Cat 2 seem to sell the most.
import pandas as pd
import matplotlib.pyplot as plt
from io import StringIO
data="""
OrderID,ProdID,Prodcategory,ClientID
4997,1,'Cat 1',21
4997,2,'Cat 1',22
4997,3,'Cat 2',23
4997,4,'Cat 3',24
2001,5,'Cat 1',25
2001,6,'Cat 2',26
2001,7,'Cat 2',27
2001,8,'Cat 2',28
2001,9,'Cat 3',29
2376,10,'Cat 3',30
2376,11,'Cat 1',31
2376,12,'Cat 2',32
2376,13,'Cat 3',33
2376,14,'Cat 1',34
"""
# a. create a dataframe from data
df = pd.read_csv(StringIO(data), sep=',')
# b. count the occurrence of Productcategory using a pivot_table
category_list=df['Prodcategory'].unique()
order_list=df['OrderID'].unique()
dict=defaultdict(tuple)
combinations_object = itertools.combinations(category_list, 2)
for item in combinations_object:
for order in order_list:
filter="OrderID=={} and (Prodcategory=='{}' or Prodcategory=='{}')".format(order,item[0],item[1])
key=item (order,)
count=len(df.query(filter))
if count>0:
dict[key]=count
for item,value in dict.items():
print(item,value)
output:
key:(category combinations and orderID)
key occurrences
('Cat 1', 'Cat 2', 4997) 3
('Cat 1', 'Cat 2', 2001) 4
('Cat 1', 'Cat 2', 2376) 3
('Cat 1', 'Cat 3', 4997) 3
('Cat 1', 'Cat 3', 2001) 2
('Cat 1', 'Cat 3', 2376) 4
('Cat 2', 'Cat 3', 4997) 2
('Cat 2', 'Cat 3', 2001) 4
('Cat 2', 'Cat 3', 2376) 3