How to group by data in a column with pandas?-CodePudding

I have a table with 8,000 rows of data and a small sample of it here:

Customer         ItemDescription       Invoice     PurchaseDate
  1064               Produce            55514         22-01
  1064               Snack              55514         22-01
  1080               Drink              56511         23-01
  1080               Snack              56511         23-01
  1230               Drink              55551         26-03
  1230               Snack              55551         26-03
  1128               Meat               55003         04-03
  1128               Snack              55003         04-03
  1229               Drink              55100         06-03
  1229               Snack              55100         06-03
  1230               Meat               55102         07-03
  1230               Snack              55102         07-03

I am trying to find the top 3 items that customers have bought along with "Snack".

So the printed result should look like this:

0 Drink
1 Meat
2 Produce

I have tried df.groupby but it doesn't sort them based on what was purchased along with "snacks".

CodePudding user response：

To find the top 3 items that customers have bought along with "Snack", you can use the groupby() and value_counts() methods in pandas. Here is an example of how you can do this:

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'Customer': [1064, 1064, 1080, 1080, 1230, 1230, 1128, 1128, 1229, 1229, 1230, 1230],
                   'ItemDescription': ['Produce', 'Snack', 'Drink', 'Snack', 'Drink', 'Snack', 'Meat', 'Snack', 'Drink', 'Snack', 'Meat', 'Snack'],
                   'Invoice': [55514, 55514, 56511, 56511, 55551, 55551, 55003, 55003, 55100, 55100, 55102, 55102],
                   'PurchaseDate': ['22-01', '22-01', '23-01', '23-01', '26-03', '26-03', '04-03', '04-03', '06-03', '06-03', '07-03', '07-03']})

# Group the data by Customer
df_grouped = df.groupby('Customer')

# Create a dictionary to store the counts of items bought along with "Snack" for each customer
item_counts = {}

# Loop through each customer group
for customer, group in df_grouped:
    # Create a new DataFrame that only includes rows where the ItemDescription is "Snack"
    snacks = group[group['ItemDescription'] == 'Snack']

    # Loop through each row in the snacks DataFrame
    for index, row in snacks.iterrows():
        # Get the Invoice number for the current row
        invoice = row['Invoice']

        # Get the rows in the original DataFrame that have the same Invoice number as the current row
        invoice_rows = df[df['Invoice'] == invoice]

        # Loop through each row in the invoice_rows DataFrame
        for i, r in invoice_rows.iterrows():
            # If the ItemDescription is not "Snack", increment the count for that item in the item_counts dictionary
            if r['ItemDescription'] != 'Snack':
                item = r['ItemDescription']
                if item not in item_counts:
                    item_counts[item] = 0
                item_counts[item]  = 1

# Sort the item_counts dictionary by value in descending order
sorted_items = sorted(item_counts.items(), key=lambda x: x[1], reverse=True)

# Print the top 3 items that customers have bought along with "Snack"
for i in range(3):
    print(i, sorted_items[i][0])

In the example above, the data in the DataFrame is first grouped by the values in the Customer column. Then, for each customer group, the rows where the ItemDescription is "Snack" are extracted and stored in a new DataFrame.

For each row in the snacks DataFrame, the rows in the original DataFrame that have the same Invoice number are extracted and stored in a new DataFrame.

Finally, for each row in the invoice_rows DataFrame, the ItemDescription is checked. If the ItemDescription is not "Snack", the count for that item is incremented in the item_counts dictionary. After all the customer groups have been processed, the item_counts dictionary is sorted by value in descending order, and the top 3 items are printed.

CodePudding user response：

You can use groupby. By using groupby, you can group the products according to the customers and store them in the form of a list.

dfx=df.groupby('Customer').agg({'ItemDescription':list})
'''
                      ItemDescription
Customer                             
1064                 [Produce, Snack]
1080                   [Drink, Snack]
1128                    [Meat, Snack]
1229                   [Drink, Snack]
1230      [Drink, Snack, Meat, Snack]
'''

Here we will need to filter out customers who have not purchased a snack.

dfx=dfx[pd.DataFrame(dfx.ItemDescription.tolist()).isin(['Snack']).any(1).values]  # https://stackoverflow.com/a/53343080/15415267

Then, convert the remaining rows into a list and get distributions with the Counter function.

products=dfx.explode('ItemDescription')['ItemDescription'].to_list()
#['Produce', 'Snack', 'Drink', 'Snack', 'Meat', 'Snack', 'Drink', 'Snack', 'Drink', 'Snack', 'Meat', 'Snack']

from collections import Counter
occurence_count = Counter(top)
occurence_count.most_common(4) #get top 4 product 
#[('Snack', 6), ('Drink', 3), ('Meat', 2), ('Produce', 1)]

If you convert results to dataframe:

final =pd.DataFrame(occurence_count.most_common(4),columns=['product','count'])
'''
    product count
0   Snack   6
1   Drink   3
2   Meat    2
3   Produce 1

'''

or (shorter):

dfx=df.groupby('Customer').agg({'ItemDescription':list})
'''
                      ItemDescription
Customer                             
1064                 [Produce, Snack]
1080                   [Drink, Snack]
1128                    [Meat, Snack]
1229                   [Drink, Snack]
1230      [Drink, Snack, Meat, Snack]
'''

dfx=dfx[pd.DataFrame(dfx.ItemDescription.tolist()).isin(['Snack']).any(1).values]  # https://stackoverflow.com/a/53343080/15415267
products=dfx2.explode('ItemDescription')['ItemDescription'].value_counts()[0:4]
'''
      ItemDescription
Snack   6
Drink   3
Meat    2
Produce 1

'''