I have the following dataset:
import pandas as pd
from itertools import combinations
d = {'Order_ID': ['001', '001', '002', '003', '003', '003', '004', '004'],
     'Products': ['Apple', 'Pear', 'Banana', 'Apple', 'Pear', 'Banana', 'Apple', 'Pear'],
     'Revenue': [15, 10, 5, 25, 15, 10, 5, 30]}
df = pd.DataFrame(data=d)
df
Yielding:
Order_ID Products Revenue
0 001 Apple 15
1 001 Pear 10
2 002 Banana 5
3 003 Apple 25
4 003 Pear 15
5 003 Banana 10
6 004 Apple 5
7 004 Pear 30
What I'm trying to achieve is a dataset with every pair of products that occurs together in a transaction, along with each pair's frequency and the running sum of its revenue. It should look something like this:
d = {'Groups': ['(Apple, Pear)', '(Banana, Apple)', '(Banana, Pear)'],
     'Frequency': [3, 1, 1],
     'Revenue': [100, 35, 40]}
df2 = pd.DataFrame(data=d)
df2
Which returns:
Groups Frequency Revenue
0 (Apple, Pear) 3 100
1 (Banana, Apple) 1 35
2 (Banana, Pear) 1 40
I was able to get the pairs and their frequency, but I can't figure out how to get the revenue part in the groupby statement that I use:
def find_pairs(x):
    return pd.Series(list(combinations(set(x), 2)))

df_group = df.groupby('Order_ID')['Products'].apply(find_pairs).value_counts()
df_group
I would need to add another step after the function is applied to 'Products', so that 'Revenue' is summed over these 'new' groups created by the find_pairs function. The revenue must be the overall sum for each pair: every time a pair occurs in an order, add the two products' revenue to that pair's running total.
CodePudding user response:
This is a little tricky to do in one line, but it becomes straightforward if you are willing to build an intermediate DataFrame in which the revenue is indexed by pair.
def pair_revenues_by_order(x):
    # for every pair of rows in the order, key the combined revenue
    # by the sorted product pair
    return {
        tuple(sorted([p0.Products, p1.Products])): p0.Revenue + p1.Revenue
        for [_, p0], [_, p1] in combinations(x.iterrows(), 2)
    }
pair_indexed_revenue = df.groupby("Order_ID").apply(pair_revenues_by_order).apply(pd.Series)
#            Apple        Banana
#           Banana  Pear    Pear
# Order_ID
# 001          NaN  25.0     NaN
# 002          NaN   NaN     NaN
# 003         35.0  40.0    25.0
# 004          NaN  35.0     NaN
pair_totals = pd.DataFrame(
    {"total_revenue": pair_indexed_revenue.sum(axis=0),
     "frequency": pair_indexed_revenue.count(axis=0)}
)
# to get simple tuple indices instead of a MultiIndex
pair_totals = pair_totals.set_index(pair_totals.index.to_flat_index())
pair_totals
# total_revenue frequency
# (Apple, Banana) 35.0 1
# (Apple, Pear) 100.0 3
# (Banana, Pear) 25.0 1
Edit: added tuple(sorted(...)). The pairs need to be hashable and unique; otherwise, if an order lists Banana before Apple, you will end up with (Banana, Apple) alongside (Apple, Banana).
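If you would rather skip the intermediate wide DataFrame, the same totals can be computed by materializing one row per (order, pair) and aggregating with a single groupby. This is only a sketch of that alternative, not part of the answer above; the names `rows`, `rev`, and `out` are made up for illustration:

```python
import pandas as pd
from itertools import combinations

df = pd.DataFrame({
    'Order_ID': ['001', '001', '002', '003', '003', '003', '004', '004'],
    'Products': ['Apple', 'Pear', 'Banana', 'Apple', 'Pear', 'Banana', 'Apple', 'Pear'],
    'Revenue': [15, 10, 5, 25, 15, 10, 5, 30],
})

rows = []
for _, order in df.groupby('Order_ID'):
    # revenue of each product within this one order
    rev = dict(zip(order['Products'], order['Revenue']))
    # sorted() keeps the pair orientation stable, e.g. always (Apple, Pear)
    for a, b in combinations(sorted(rev), 2):
        rows.append({'Groups': (a, b), 'Revenue': rev[a] + rev[b]})

# one groupby gives both the frequency and the revenue total per pair
out = (pd.DataFrame(rows)
         .groupby('Groups')['Revenue']
         .agg(Frequency='size', Revenue='sum')
         .reset_index())
```

This agrees with the totals above: (Apple, Pear) appears in three orders for 100 in revenue, (Apple, Banana) once for 35, and (Banana, Pear) once for 25.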
CodePudding user response:
A possible solution:
import pandas as pd
import numpy as np
from itertools import combinations
# create pairs per order id
def pairs_per_id(df):
    pairs = (pd.concat(
        list(map(
            lambda x: pd.DataFrame({'Groups': [x],
                                    'Frequency': sum(df['Products'].isin(x)) // 2,
                                    'Revenue': df.loc[df['Products'].isin(x), 'Revenue'].sum()}),
            combinations(np.unique(df['Products']), 2)))).reset_index(drop=True))
return pairs
# remove groups with a single element
out = df[df.groupby('Order_ID')['Order_ID'].transform('count') != 1]
out = out.groupby('Order_ID', group_keys=False).apply(pairs_per_id)
out = out.groupby('Groups').sum().reset_index()
out
Output:
Groups Frequency Revenue
0 (Apple, Banana) 1 35
1 (Apple, Pear) 3 100
2 (Banana, Pear) 1 25
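As a cross-check on both answers, the same numbers fall out of a plain-Python pass with collections.Counter, without any intermediate DataFrames. A sketch; `freq`, `rev`, and `prices` are illustrative names, not from either answer:

```python
import pandas as pd
from collections import Counter
from itertools import combinations

df = pd.DataFrame({
    'Order_ID': ['001', '001', '002', '003', '003', '003', '004', '004'],
    'Products': ['Apple', 'Pear', 'Banana', 'Apple', 'Pear', 'Banana', 'Apple', 'Pear'],
    'Revenue': [15, 10, 5, 25, 15, 10, 5, 30],
})

freq, rev = Counter(), Counter()
for _, order in df.groupby('Order_ID'):
    prices = dict(zip(order['Products'], order['Revenue']))
    for a, b in combinations(sorted(prices), 2):
        freq[(a, b)] += 1                     # one more order containing this pair
        rev[(a, b)] += prices[a] + prices[b]  # this pair's revenue in this order

out = pd.DataFrame({'Groups': list(freq),
                    'Frequency': list(freq.values()),
                    'Revenue': [rev[p] for p in freq]})
```

Orders with a single product contribute no combinations, so no explicit filtering of one-item groups is needed here.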