I have the following dataset:
import pandas as pd
from itertools import combinations
d = {'Order_ID': ['001', '001', '002', '003', '003', '003', '004', '004'],
     'Products': ['Apple', 'Pear', 'Banana', 'Apple', 'Pear', 'Banana', 'Apple', 'Pear'],
     'Revenue': [15, 10, 5, 25, 15, 10, 5, 30]}
df = pd.DataFrame(data=d)
df
Yielding:
Order_ID Products Revenue
0 001 Apple 15
1 001 Pear 10
2 002 Banana 5
3 003 Apple 25
4 003 Pear 15
5 003 Banana 10
6 004 Apple 5
7 004 Pear 30
What I'm trying to achieve is a dataset with every pair of products that occurs together in a transaction, along with each pair's frequency and the running sum of its revenue. It should look something like this:
d = {'Groups': ['(Apple, Pear)', '(Banana, Apple)', '(Banana, Pear)'],
     'Frequency': [3, 1, 1],
     'Revenue': [100, 35, 40]}
df2 = pd.DataFrame(data=d)
df2
Which returns:
Groups Frequency Revenue
0 (Apple, Pear) 3 100
1 (Banana, Apple) 1 35
2 (Banana, Pear) 1 40
I was able to get the pairs and their frequency, but I can't figure out how to get the revenue part in the groupby statement that I use:
def find_pairs(x):
    return pd.Series(list(combinations(set(x), 2)))

df_group = df.groupby('Order_ID')['Products'].apply(find_pairs).value_counts()
df_group
I would need to add another step after the function is applied to 'Products', so that 'Revenue' is summed over these 'new' groups created by the find_pairs function. The revenue must be the overall sum for each pair: every time a pair occurs in an order, add the two products' revenue to that pair's running total.
CodePudding user response:
This is a little tricky to do in one line, but it becomes straightforward if you are willing to build an intermediate DataFrame in which the revenue is indexed by pair.
def pair_revenues_by_order(x):
    # for every pair of rows in the order, key the combined revenue
    # by the sorted product pair
    return {
        tuple(sorted([p0.Products, p1.Products])): p0.Revenue + p1.Revenue
        for [_, p0], [_, p1] in combinations(x.iterrows(), 2)
    }
pair_indexed_revenue = df.groupby("Order_ID").apply(pair_revenues_by_order).apply(pd.Series)
#            Apple        Banana
#           Banana  Pear    Pear
# Order_ID
# 001          NaN  25.0     NaN
# 002          NaN   NaN     NaN
# 003         35.0  40.0    25.0
# 004          NaN  35.0     NaN
pair_totals = pd.DataFrame(
    {"total_revenue": pair_indexed_revenue.sum(axis=0),
     "frequency": pair_indexed_revenue.count(axis=0)}
)
# to get simple tuple indices instead of a MultiIndex
pair_totals = pair_totals.set_index(pair_totals.index.to_flat_index())
pair_totals
# total_revenue frequency
# (Apple, Banana) 35.0 1
# (Apple, Pear) 100.0 3
# (Banana, Pear) 25.0 1
Edit: added tuple(sorted(...)). The pairs need to be hashable and unique; otherwise, if an order lists Banana before Apple, you will end up with (Banana, Apple) alongside (Apple, Banana).
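If you would rather skip the intermediate wide DataFrame, the same totals can be computed by materializing one row per (order, pair) and aggregating with a single groupby. This is only a sketch of that alternative, not part of the answer above; the names `rows`, `rev`, and `out` are made up for illustration:

```python
import pandas as pd
from itertools import combinations

df = pd.DataFrame({
    'Order_ID': ['001', '001', '002', '003', '003', '003', '004', '004'],
    'Products': ['Apple', 'Pear', 'Banana', 'Apple', 'Pear', 'Banana', 'Apple', 'Pear'],
    'Revenue': [15, 10, 5, 25, 15, 10, 5, 30],
})

rows = []
for _, order in df.groupby('Order_ID'):
    # revenue of each product within this one order
    rev = dict(zip(order['Products'], order['Revenue']))
    # sorted() keeps the pair orientation stable, e.g. always (Apple, Pear)
    for a, b in combinations(sorted(rev), 2):
        rows.append({'Groups': (a, b), 'Revenue': rev[a] + rev[b]})

# one groupby gives both the frequency and the revenue total per pair
out = (pd.DataFrame(rows)
         .groupby('Groups')['Revenue']
         .agg(Frequency='size', Revenue='sum')
         .reset_index())
```

This agrees with the totals above: (Apple, Pear) appears in three orders for 100 in revenue, (Apple, Banana) once for 35, and (Banana, Pear) once for 25.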
CodePudding user response:
A possible solution:
import pandas as pd
import numpy as np
from itertools import combinations
# create pairs per order id
def pairs_per_id(df):
    pairs = (pd.concat(
        list(map(
            lambda x: pd.DataFrame({'Groups': [x],
                                    'Frequency': sum(df['Products'].isin(x)) // 2,
                                    'Revenue': df.loc[df['Products'].isin(x), 'Revenue'].sum()}),
            combinations(np.unique(df['Products']), 2)))).reset_index(drop=True))
return pairs
# remove groups with a single element
out = df[df.groupby('Order_ID')['Order_ID'].transform('count') != 1]
out = out.groupby('Order_ID', group_keys=False).apply(pairs_per_id)
out = out.groupby('Groups').sum().reset_index()
out
Output:
Groups Frequency Revenue
0 (Apple, Banana) 1 35
1 (Apple, Pear) 3 100
2 (Banana, Pear) 1 25
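As a cross-check on both answers, the same numbers fall out of a plain-Python pass with collections.Counter, without any intermediate DataFrames. A sketch; `freq`, `rev`, and `prices` are illustrative names, not from either answer:

```python
import pandas as pd
from collections import Counter
from itertools import combinations

df = pd.DataFrame({
    'Order_ID': ['001', '001', '002', '003', '003', '003', '004', '004'],
    'Products': ['Apple', 'Pear', 'Banana', 'Apple', 'Pear', 'Banana', 'Apple', 'Pear'],
    'Revenue': [15, 10, 5, 25, 15, 10, 5, 30],
})

freq, rev = Counter(), Counter()
for _, order in df.groupby('Order_ID'):
    prices = dict(zip(order['Products'], order['Revenue']))
    for a, b in combinations(sorted(prices), 2):
        freq[(a, b)] += 1                     # one more order containing this pair
        rev[(a, b)] += prices[a] + prices[b]  # this pair's revenue in this order

out = pd.DataFrame({'Groups': list(freq),
                    'Frequency': list(freq.values()),
                    'Revenue': [rev[p] for p in freq]})
```

Orders with a single product contribute no combinations, so no explicit filtering of one-item groups is needed here.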