I have a dataset that looks as follows:
import pandas as pd

df = pd.DataFrame({'purchase': [1, 1, 2, 2, 2, 3],
                   'item': ['apple', 'banana', 'apple', 'banana', 'pear', 'apple']})
df
   purchase    item
0         1   apple
1         1  banana
2         2   apple
3         2  banana
4         2    pear
5         3   apple
And I need an output such as
item_1 | item_2 | purchase |
---|---|---|
apple | banana | 2 |
apple | pear | 1 |
banana | pear | 1 |
A table counting how many times each pair of fruits was bought in the same purchase.
In this example, the first row is apple, banana, 2 because there are two purchases (see column purchase), purchase ID 1 and purchase ID 2, where the person bought both apple and banana. The apple, pear row has a count of 1 because there is only one purchase (purchase ID 2) where the person bought both apple and pear.
My code so far:
df = pd.DataFrame({'purchase': [1, 1, 2, 2, 2, 3],
                   'item': ['apple', 'banana', 'apple', 'banana', 'pear', 'apple']})
dummies = pd.get_dummies(df['item'])
df2 = pd.concat([df['purchase'], dummies], axis=1)
Creates a table like this:
   purchase  apple  banana  pear
0         1      1       0     0
1         1      0       1     0
2         2      1       0     0
3         2      0       1     0
4         2      0       0     1
5         3      1       0     0
Now I don't know how to proceed to get the desired result (and I'm aware my output is still far from it). I tried a few groupbys, but couldn't get them to work.
CodePudding user response:
This is probably not the most efficient, but it seems to get the job done:
In [3]: from itertools import combinations
In [4]: combos = df.groupby("purchase")["item"].apply(
   ...:     lambda row: list(combinations(row, 2))).explode().value_counts()
In [5]: combos.reset_index()
Out[5]:
             index  item
0  (apple, banana)     2
1    (apple, pear)     1
2   (banana, pear)     1
From there,
In [6]: pd.DataFrame([[*x, y] for x, y in zip(combos.index, combos)], columns=["item_1", "item_2", "combo_qty"])
Out[6]:
   item_1  item_2  combo_qty
0   apple  banana          2
1   apple    pear          1
2  banana    pear          1
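One caveat: `combinations` pairs items in the order they appear within each purchase, so `(apple, banana)` and `(banana, apple)` would be counted as different pairs if the rows aren't consistently ordered. A minimal sketch of the same pipeline made order-robust by sorting the items first (the `sorted` call and the `dropna` for single-item purchases are my additions):

```python
import pandas as pd
from itertools import combinations

df = pd.DataFrame({'purchase': [1, 1, 2, 2, 2, 3],
                   'item': ['apple', 'banana', 'apple', 'banana', 'pear', 'apple']})

# Sort items within each purchase so (apple, banana) and (banana, apple)
# count as the same pair regardless of row order.
combos = (df.groupby("purchase")["item"]
            .apply(lambda items: list(combinations(sorted(items), 2)))
            .explode()
            .dropna()            # single-item purchases yield no pairs
            .value_counts())

result = pd.DataFrame([[a, b, n] for (a, b), n in combos.items()],
                      columns=["item_1", "item_2", "purchase"])
print(result)
```

The `dropna` matters here: purchase 3 contains only one item, so its list of pairs is empty and `explode` turns it into a NaN row.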
CodePudding user response:
Here is another take that uses the behavior of join with a duplicated index:
df2 = df.set_index("purchase")
df2 = (df2.join(df2, rsuffix="_other")
          .groupby(["item", "item_other"])
          .size().rename("count").reset_index())
result = df2[df2.item < df2.item_other].reset_index(drop=True)
#      item item_other  count
# 0   apple     banana      2
# 1   apple       pear      1
# 2  banana       pear      1
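Joining a frame to itself on a duplicated index pairs every row of a purchase with every other row of the same purchase, which is the same thing as a self-merge on the purchase column. A sketch of the equivalent `merge` formulation, in case the index-join trick reads as too magical:

```python
import pandas as pd

df = pd.DataFrame({'purchase': [1, 1, 2, 2, 2, 3],
                   'item': ['apple', 'banana', 'apple', 'banana', 'pear', 'apple']})

# Self-merge on the purchase column: each row is paired with every row
# sharing its purchase ID, including itself.
merged = df.merge(df, on="purchase", suffixes=("", "_other"))

# Keep only one ordering of each pair (this also drops self-pairs),
# then count how many purchases contain each pair.
result = (merged[merged.item < merged.item_other]
          .groupby(["item", "item_other"]).size()
          .rename("count").reset_index())
print(result)
```

The `item < item_other` filter does double duty: it removes the `(apple, apple)` self-pairs and collapses `(banana, apple)` into `(apple, banana)`.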
I get around a 10x speedup over the builtin combinations approach in the following benchmark:
import numpy as np
import pandas as pd

num_orders = 200
max_order_size = 10
num_items = 50
purchases = np.repeat(np.arange(num_orders),
                      np.random.randint(1, max_order_size, num_orders))
items = np.random.randint(1, num_items, size=purchases.size)
test_df = pd.DataFrame({
    "purchase": purchases,
    "item": items,
})
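The answer doesn't show the timing harness itself, so here is one possible sketch using `timeit` (the harness, the function names, and the fixed seed are my own additions; actual speedups will vary with the data shape):

```python
import timeit
from itertools import combinations

import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the benchmark data is reproducible
num_orders, max_order_size, num_items = 200, 10, 50
purchases = np.repeat(np.arange(num_orders),
                      np.random.randint(1, max_order_size, num_orders))
items = np.random.randint(1, num_items, size=purchases.size)
test_df = pd.DataFrame({"purchase": purchases, "item": items})

def via_combinations(df):
    # Per-purchase Python-level pair generation, then a value_counts.
    return (df.groupby("purchase")["item"]
              .apply(lambda row: list(combinations(row, 2)))
              .explode().value_counts())

def via_join(df):
    # Vectorized self-join on the duplicated purchase index.
    d = df.set_index("purchase")
    d = (d.join(d, rsuffix="_other")
          .groupby(["item", "item_other"])
          .size().rename("count").reset_index())
    return d[d.item < d.item_other].reset_index(drop=True)

print("combinations:", timeit.timeit(lambda: via_combinations(test_df), number=10))
print("join:        ", timeit.timeit(lambda: via_join(test_df), number=10))
```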