I have a dataset that looks as follows:
import pandas as pd

df = pd.DataFrame({'purchase': [1, 1, 2, 2, 2, 3],
                   'item': ['apple', 'banana', 'apple', 'banana', 'pear', 'apple']})
df
   purchase    item
0         1   apple
1         1  banana
2         2   apple
3         2  banana
4         2    pear
5         3   apple
And I need an output such as
item_1 | item_2 | purchase |
---|---|---|
apple | banana | 2 |
apple | pear | 1 |
banana | pear | 1 |
A table counting how many times each pair of fruits was bought in the same purchase.
In this example, the first row is apple, banana, 2 because there are two purchases (see column purchase), purchase ID 1 and purchase ID 2, where the person bought both apple and banana. The apple, pear row has a count of 1 because there is only one purchase (purchase ID 2) where the person bought both apple and pear.
My code so far:
df = pd.DataFrame({'purchase': [1, 1, 2, 2, 2, 3],
                   'item': ['apple', 'banana', 'apple', 'banana', 'pear', 'apple']})
dummies = pd.get_dummies(df['item'])
df2 = pd.concat([df['purchase'], dummies], axis=1)
Creates a table like this:
   purchase  apple  banana  pear
0         1      1       0     0
1         1      0       1     0
2         2      1       0     0
3         2      0       1     0
4         2      0       0     1
5         3      1       0     0
Now I don't know how to proceed to get the desired result (and I'm aware my output is still far from it). I tried a few groupbys, but couldn't get them to work.
CodePudding user response:
This is probably not the most efficient, but it seems to get the job done:
In [3]: from itertools import combinations
In [4]: combos = df.groupby("purchase")["item"].apply(
   ...:     lambda row: list(combinations(row, 2))).explode().value_counts()
In [5]: combos.reset_index()
Out[5]:
             index  item
0  (apple, banana)     2
1    (apple, pear)     1
2   (banana, pear)     1
From there,
In [6]: pd.DataFrame([[*x, y] for x, y in zip(combos.index, combos)], columns=["item_1", "item_2", "combo_qty"])
Out[6]:
   item_1  item_2  combo_qty
0   apple  banana          2
1   apple    pear          1
2  banana    pear          1
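One caveat: `combinations` pairs items in the order they appear within each purchase, so `(apple, banana)` and `(banana, apple)` would be counted as different pairs if the rows aren't consistently ordered. A minimal sketch of the same pipeline made order-robust by sorting the items first (the `sorted` call and the `dropna` for single-item purchases are my additions):

```python
import pandas as pd
from itertools import combinations

df = pd.DataFrame({'purchase': [1, 1, 2, 2, 2, 3],
                   'item': ['apple', 'banana', 'apple', 'banana', 'pear', 'apple']})

# Sort items within each purchase so (apple, banana) and (banana, apple)
# count as the same pair regardless of row order.
combos = (df.groupby("purchase")["item"]
            .apply(lambda items: list(combinations(sorted(items), 2)))
            .explode()
            .dropna()            # single-item purchases yield no pairs
            .value_counts())

result = pd.DataFrame([[a, b, n] for (a, b), n in combos.items()],
                      columns=["item_1", "item_2", "purchase"])
print(result)
```

The `dropna` matters here: purchase 3 contains only one item, so its list of pairs is empty and `explode` turns it into a NaN row.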
CodePudding user response:
Here is another take that uses the behavior of join with a duplicated index:
df2 = df.set_index("purchase")
df2 = (df2.join(df2, rsuffix="_other")
          .groupby(["item", "item_other"])
          .size().rename("count").reset_index())
result = df2[df2.item < df2.item_other].reset_index(drop=True)
#      item item_other  count
# 0   apple     banana      2
# 1   apple       pear      1
# 2  banana       pear      1
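Joining a frame to itself on a duplicated index pairs every row of a purchase with every other row of the same purchase, which is the same thing as a self-merge on the purchase column. A sketch of the equivalent `merge` formulation, in case the index-join trick reads as too magical:

```python
import pandas as pd

df = pd.DataFrame({'purchase': [1, 1, 2, 2, 2, 3],
                   'item': ['apple', 'banana', 'apple', 'banana', 'pear', 'apple']})

# Self-merge on the purchase column: each row is paired with every row
# sharing its purchase ID, including itself.
merged = df.merge(df, on="purchase", suffixes=("", "_other"))

# Keep only one ordering of each pair (this also drops self-pairs),
# then count how many purchases contain each pair.
result = (merged[merged.item < merged.item_other]
          .groupby(["item", "item_other"]).size()
          .rename("count").reset_index())
print(result)
```

The `item < item_other` filter does double duty: it removes the `(apple, apple)` self-pairs and collapses `(banana, apple)` into `(apple, banana)`.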
I get around a 10x speedup over the builtin combinations approach in the following benchmark:
import numpy as np
import pandas as pd

num_orders = 200
max_order_size = 10
num_items = 50
purchases = np.repeat(np.arange(num_orders),
                      np.random.randint(1, max_order_size, num_orders))
items = np.random.randint(1, num_items, size=purchases.size)
test_df = pd.DataFrame({
    "purchase": purchases,
    "item": items,
})
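The answer doesn't show the timing harness itself, so here is one possible sketch using `timeit` (the harness, the function names, and the fixed seed are my own additions; actual speedups will vary with the data shape):

```python
import timeit
from itertools import combinations

import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the benchmark data is reproducible
num_orders, max_order_size, num_items = 200, 10, 50
purchases = np.repeat(np.arange(num_orders),
                      np.random.randint(1, max_order_size, num_orders))
items = np.random.randint(1, num_items, size=purchases.size)
test_df = pd.DataFrame({"purchase": purchases, "item": items})

def via_combinations(df):
    # Per-purchase Python-level pair generation, then a value_counts.
    return (df.groupby("purchase")["item"]
              .apply(lambda row: list(combinations(row, 2)))
              .explode().value_counts())

def via_join(df):
    # Vectorized self-join on the duplicated purchase index.
    d = df.set_index("purchase")
    d = (d.join(d, rsuffix="_other")
          .groupby(["item", "item_other"])
          .size().rename("count").reset_index())
    return d[d.item < d.item_other].reset_index(drop=True)

print("combinations:", timeit.timeit(lambda: via_combinations(test_df), number=10))
print("join:        ", timeit.timeit(lambda: via_join(test_df), number=10))
```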