I have a table with 8,000 rows of data and a small sample of it here:
Customer  ItemDescription  Invoice  PurchaseDate
1064      Produce          55514    22-01
1064      Snack            55514    22-01
1080      Drink            56511    23-01
1080      Snack            56511    23-01
1230      Drink            55551    26-03
1230      Snack            55551    26-03
1128      Meat             55003    04-03
1128      Snack            55003    04-03
1229      Drink            55100    06-03
1229      Snack            55100    06-03
1230      Meat             55102    07-03
1230      Snack            55102    07-03
I am trying to find the top 3 items that customers have bought along with "Snack".
So the printed result should look like this:
0 Drink
1 Meat
2 Produce
I have tried df.groupby, but it doesn't rank the items by how often they were purchased along with "Snack".
CodePudding user response:
To find the top 3 items that customers have bought along with "Snack", you can combine the groupby() method in pandas with a dictionary of counts. Here is an example of how you can do this:
import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'Customer': [1064, 1064, 1080, 1080, 1230, 1230, 1128, 1128, 1229, 1229, 1230, 1230],
                   'ItemDescription': ['Produce', 'Snack', 'Drink', 'Snack', 'Drink', 'Snack', 'Meat', 'Snack', 'Drink', 'Snack', 'Meat', 'Snack'],
                   'Invoice': [55514, 55514, 56511, 56511, 55551, 55551, 55003, 55003, 55100, 55100, 55102, 55102],
                   'PurchaseDate': ['22-01', '22-01', '23-01', '23-01', '26-03', '26-03', '04-03', '04-03', '06-03', '06-03', '07-03', '07-03']})

# Group the data by Customer
df_grouped = df.groupby('Customer')

# Create a dictionary to store the counts of items bought along with "Snack"
item_counts = {}

# Loop through each customer group
for customer, group in df_grouped:
    # Keep only the rows where the ItemDescription is "Snack"
    snacks = group[group['ItemDescription'] == 'Snack']
    # Loop through each row in the snacks DataFrame
    for index, row in snacks.iterrows():
        # Get the Invoice number for the current row
        invoice = row['Invoice']
        # Get the rows in the original DataFrame that have the same Invoice number
        invoice_rows = df[df['Invoice'] == invoice]
        # Loop through each row in the invoice_rows DataFrame
        for i, r in invoice_rows.iterrows():
            # If the ItemDescription is not "Snack", increment the count for that item
            if r['ItemDescription'] != 'Snack':
                item = r['ItemDescription']
                if item not in item_counts:
                    item_counts[item] = 0
                item_counts[item] += 1

# Sort the item_counts dictionary by value in descending order
sorted_items = sorted(item_counts.items(), key=lambda x: x[1], reverse=True)

# Print the top 3 items that customers have bought along with "Snack"
for i, (item, count) in enumerate(sorted_items[:3]):
    print(i, item)
In the example above, the data in the DataFrame is first grouped by the values in the Customer column. Then, for each customer group, the rows where the ItemDescription is "Snack" are extracted and stored in a new DataFrame. For each row in the snacks DataFrame, the rows in the original DataFrame that have the same Invoice number are extracted and stored in a new DataFrame. Finally, for each row in the invoice_rows DataFrame, the ItemDescription is checked. If the ItemDescription is not "Snack", the count for that item is incremented in the item_counts dictionary. After all the customer groups have been processed, the item_counts dictionary is sorted by value in descending order, and the top 3 items are printed.
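The nested loops above can be slow on 8,000 rows, since iterrows() visits one row at a time. As a rough sketch of the same idea without explicit loops (the names snack_invoices, companions and top3 are made up for this illustration), one could select the invoices that contain a "Snack" and count the other items on them. Note one assumed difference: an invoice with several "Snack" rows is counted once here, whereas the loop would count its other items once per "Snack" row.

```python
import pandas as pd

# Same sample data as in the question (PurchaseDate omitted, it is not needed here)
df = pd.DataFrame({
    'Customer': [1064, 1064, 1080, 1080, 1230, 1230, 1128, 1128, 1229, 1229, 1230, 1230],
    'ItemDescription': ['Produce', 'Snack', 'Drink', 'Snack', 'Drink', 'Snack',
                        'Meat', 'Snack', 'Drink', 'Snack', 'Meat', 'Snack'],
    'Invoice': [55514, 55514, 56511, 56511, 55551, 55551,
                55003, 55003, 55100, 55100, 55102, 55102],
})

# Invoices that contain at least one "Snack" row
snack_invoices = df.loc[df['ItemDescription'] == 'Snack', 'Invoice'].unique()

# Non-Snack rows that share an invoice with a "Snack"
companions = df[df['Invoice'].isin(snack_invoices) & (df['ItemDescription'] != 'Snack')]

# Count each companion item and keep the three most frequent
top3 = companions['ItemDescription'].value_counts().head(3)

for rank, item in enumerate(top3.index):
    print(rank, item)
```

On the sample data this prints Drink, Meat and Produce in that order, matching the output requested in the question.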
CodePudding user response:
You can use groupby to collect the products each customer bought into a list.
dfx=df.groupby('Customer').agg({'ItemDescription':list})
'''
ItemDescription
Customer
1064 [Produce, Snack]
1080 [Drink, Snack]
1128 [Meat, Snack]
1229 [Drink, Snack]
1230 [Drink, Snack, Meat, Snack]
'''
Here we will need to filter out customers who have not purchased a snack.
dfx=dfx[pd.DataFrame(dfx.ItemDescription.tolist()).isin(['Snack']).any(axis=1).values] # https://stackoverflow.com/a/53343080/15415267
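The filter line builds a temporary DataFrame out of the lists just to test for "Snack". A possibly simpler sketch checks each customer's list directly with apply. Customer 9999, who bought only a Drink, is a hypothetical row added here so that the filter has something to remove, and has_snack is a name chosen for this illustration.

```python
import pandas as pd

# Sample data from the question plus one hypothetical customer (9999) without a Snack
df = pd.DataFrame({
    'Customer': [1064, 1064, 1080, 1080, 1230, 1230, 1128, 1128, 1229, 1229, 1230, 1230, 9999],
    'ItemDescription': ['Produce', 'Snack', 'Drink', 'Snack', 'Drink', 'Snack',
                        'Meat', 'Snack', 'Drink', 'Snack', 'Meat', 'Snack', 'Drink'],
})

# One row per customer, holding the list of items that customer bought
dfx = df.groupby('Customer').agg({'ItemDescription': list})

# Boolean mask: True where the customer's list contains "Snack"
has_snack = dfx['ItemDescription'].apply(lambda items: 'Snack' in items)

# Keep only the customers who bought a Snack
dfx = dfx[has_snack]
print(dfx)
```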
Then, convert the remaining rows into a list and get distributions with the Counter function.
products=dfx.explode('ItemDescription')['ItemDescription'].to_list()
#['Produce', 'Snack', 'Drink', 'Snack', 'Meat', 'Snack', 'Drink', 'Snack', 'Drink', 'Snack', 'Meat', 'Snack']
from collections import Counter
occurence_count = Counter(products)
occurence_count.most_common(4) # top 4 products, including Snack itself
#[('Snack', 6), ('Drink', 3), ('Meat', 2), ('Produce', 1)]
To convert the results to a DataFrame:
final =pd.DataFrame(occurence_count.most_common(4),columns=['product','count'])
'''
product count
0 Snack 6
1 Drink 3
2 Meat 2
3 Produce 1
'''
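The counts above include "Snack" itself, while the question asks only for the items bought along with it. One way to get there, sketched with the product list shown earlier typed in by hand (companion_count and top3 are illustrative names), is to skip "Snack" while counting and then take the top 3:

```python
from collections import Counter

# Product list as produced by the explode step above
products = ['Produce', 'Snack', 'Drink', 'Snack', 'Meat', 'Snack',
            'Drink', 'Snack', 'Drink', 'Snack', 'Meat', 'Snack']

# Count every product except "Snack"
companion_count = Counter(p for p in products if p != 'Snack')

# Three most common companions of "Snack"
top3 = companion_count.most_common(3)
print(top3)  # [('Drink', 3), ('Meat', 2), ('Produce', 1)]
```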
or (shorter):
dfx=df.groupby('Customer').agg({'ItemDescription':list})
'''
ItemDescription
Customer
1064 [Produce, Snack]
1080 [Drink, Snack]
1128 [Meat, Snack]
1229 [Drink, Snack]
1230 [Drink, Snack, Meat, Snack]
'''
dfx=dfx[pd.DataFrame(dfx.ItemDescription.tolist()).isin(['Snack']).any(axis=1).values] # https://stackoverflow.com/a/53343080/15415267
products=dfx.explode('ItemDescription')['ItemDescription'].value_counts()[0:4]
'''
ItemDescription
Snack 6
Drink 3
Meat 2
Produce 1
'''