How do I find the most frequent pairing with an item in Python


I have a dataframe that shows order transactions, but I am struggling to find a way to list the items most frequently purchased together with red wine. I saw how to do this somewhere before, but I can't seem to recall how.

import pandas as pd
df_orders = pd.DataFrame(
    {'Order':[300,300,301,301,301,302,302,303,303],
    'Item':['Red wine','Chicken chop','Red wine',
            'Hawaiian Pizza','Garden Salad','Chicken chop',
            'Mineral Water','Garden Salad','Red wine']})
   Order           Item  
0   300        Red wine    
1   300    Chicken chop    
2   301        Red wine    
3   301  Hawaiian Pizza
4   301    Garden Salad    
5   302    Chicken chop    
6   302   Mineral Water    
7   303    Garden Salad
8   303        Red wine

My aim is to find the items most frequently ordered at the same time as red wine. Intended output:

              Item             
0     Garden Salad     2    
1     Chicken chop     1   
2   Hawaiian Pizza     1  

I tried using groupby, Counter, and a few other things, but didn't get the desired output. Hopefully someone better than me can help me with this. Thank you.

CodePudding user response:

Filter to the order numbers that contain 'Red wine', then use Series.value_counts to count the items.

df_orders[df_orders.Order.isin(df_orders[df_orders.Item == 'Red wine'].Order)].Item.value_counts().drop('Red wine')
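
If you also want the result in the same two-column shape as your intended output, the same idea extends naturally. This is just a sketch reusing your df_orders; the column name "count" is my choice:

# orders that contain 'Red wine'
wine_orders = df_orders[df_orders.Item == 'Red wine'].Order

# count co-purchased items, drop 'Red wine' itself, and reshape into a DataFrame
pairings = (df_orders[df_orders.Order.isin(wine_orders)]
            .Item.value_counts()
            .drop('Red wine')
            .rename_axis('Item')
            .reset_index(name='count'))
print(pairings)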

CodePudding user response:

Normally the advice is not to loop over a DataFrame, so if your DataFrame grows larger this is probably not an efficient solution.

How I handled it was to go over each unique order, collect that order's items, check whether "Red wine" was one of them, and if so add the other items to a dictionary.

Afterwards, transform the dictionary into a DataFrame.

d = dict()

for order in set(df_orders.Order):
    items = list(df_orders[df_orders.Order == order].Item)
    if 'Red wine' in items:
        for it in items:
            if it != 'Red wine':
                if it in d:
                    d[it] += 1
                else:
                    d[it] = 1
output = pd.DataFrame(d.items(), columns=['item', 'count'])
print(output)
             item  count
0    Chicken chop      1
1  Hawaiian Pizza      1
2    Garden Salad      2
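
For reference, the same loop can be written a little more compactly with collections.Counter, which takes care of the "key already present" check for you. Just a sketch on the same df_orders:

from collections import Counter

import pandas as pd

counts = Counter()
for order in set(df_orders.Order):
    items = list(df_orders[df_orders.Order == order].Item)
    if 'Red wine' in items:
        # count every co-purchased item except red wine itself
        counts.update(it for it in items if it != 'Red wine')

output = pd.DataFrame(counts.items(), columns=['item', 'count'])
print(output)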

CodePudding user response:

You might consider using a filter and a groupby, as follows:

orders_with_red_wine = df_orders[
    df_orders["Item"].eq("Red wine")]["Order"].unique()

df_orders[
    df_orders["Item"].ne("Red wine") & 
    df_orders["Order"].isin(orders_with_red_wine)]\
    .groupby("Item").size()\
    .sort_values(ascending=False)\
    .reset_index(name="count")

Output

             Item  count
0    Garden Salad      2
1  Hawaiian Pizza      1
2    Chicken chop      1

Explanation

  • We first get all orders that include "Red wine" as orders_with_red_wine.
  • Then we consider only these orders and exclude the "Red wine" rows:
df_orders[
    df_orders["Item"].ne("Red wine") & 
    df_orders["Order"].isin(orders_with_red_wine)]
  • Finally it's just a matter of grouping by Item and counting.
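
If you also care about the order of ties (your intended output lists Chicken chop before Hawaiian Pizza, for example), one option is a secondary sort on the item name. A sketch reusing orders_with_red_wine from above:

result = (df_orders[
        df_orders["Item"].ne("Red wine") &
        df_orders["Order"].isin(orders_with_red_wine)]
    .groupby("Item").size()
    .reset_index(name="count")
    .sort_values(["count", "Item"], ascending=[False, True])
    .reset_index(drop=True))
print(result)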

CodePudding user response:

I did this in a few steps without looping: first selecting the 'Red wine' order numbers, then filtering to those orders and removing the 'Red wine' rows, then using value_counts to produce the frequency table, and finally some renaming to match your desired output.

>>> orders = df_orders.loc[df_orders['Item'] == 'Red wine']['Order'].values

>>> orders
array([300, 301, 303])

>>> (df_orders.loc[df_orders['Order'].isin(orders) & ~df_orders['Item'].eq('Red wine')]
...     .Item
...     .value_counts()
...     .rename_axis('item')
...     .reset_index(name='counts'))

             item  counts
0    Garden Salad       2
1  Hawaiian Pizza       1
2    Chicken chop       1

CodePudding user response:

If you just want to count the frequency of a column, you can set it as index and use index.value_counts() on it.

# reset index and save to new DataFrame
df = df_orders.reset_index()

# set index to 'Item' column
df1 = df.set_index('Item')

# get orders, where 'Red wine' is purchased at same time
orders = df1.loc['Red wine']['Order']

# set index to 'Order' column, filter with the orders selected above
df2 = df.set_index('Order').loc[orders]

# reset index to Item again
res = df2.reset_index().set_index('Item')

# optional: remove 'Red wine' if you don't want to show them
res = res.loc[res.index != 'Red wine']

# call index method value_counts
ret = res.index.value_counts()

# print your result: Orders includes 'Red wine' and list the frequency of other items bought together
print(ret)

You can also use index.value_counts(normalize=True) to return relative frequency by dividing all values by the sum of values.
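
For example, with the data above that would give something like this (4 non-wine items across the red-wine orders; exact print formatting depends on your pandas version):

print(res.index.value_counts(normalize=True))
# Garden Salad      0.50
# Chicken chop      0.25
# Hawaiian Pizza    0.25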

PS: see the API documentation for pandas.Index.value_counts().
