Get the frequency of all combinations in Pandas-CodePudding

I am trying to get the purchase frequency of all combinations of products.

Suppose my transactions are the following

userid    product
u1        A
u1        B
u1        C
u2        A
u2        C

So the solution should be

combination  count_of_distinct_users
A            2
B            1
C            2
A, B         1
A, C         2
B, C         1 
A, B, C      1

i.e 2 users have purchased product A, one users has purchased product B..., 2 users have purchased products A and C ...

CodePudding user response：

Sefine a function combine to generate all combinations:

from itertools import combinations

def combine(s):
    result = []
    for i in range(1, len(s) 1):
        for c in list(combinations(s, i)):
            result =[c]
    return result

This will give all combinations in a column:

df.groupby('user')['product'].apply(combine)
# Out: 
# user
# 1    [(A,), (B,), (C,), (A, B), (A, C), (B, C), (A,...
# 2                                 [(A,), (C,), (A, C)]
# Name: product, dtype: object

Now use explode():

df.groupby('user')['product'].apply(combine).reset_index(name='product_combos') \
  .explode('product_combos').groupby('product_combos') \
  .size().reset_index(name='user_count')
# Out: 
#   product_combos  user_count
# 0           (A,)           2
# 1         (A, B)           1
# 2      (A, B, C)           1
# 3         (A, C)           2
# 4           (B,)           1
# 5         (B, C)           1
# 6           (C,)           2

Careful with the combinations because the list gets large with many different products!

CodePudding user response：

Here my simple trick is to convert df to dict with list of users like {'A':[u1, u2], 'B':[u1]} then find the combination and merge both products list of users total. like A:[u1, u2] and B:[u1] so merge will be [2,1] and last took the min value pf that list so final count output will be 1.

Code:

from more_itertools import powerset

d = df.groupby('product')['user'].apply(list).to_dict()
##output: {'A': ['u2', 'u1'], 'B': ['u1'], 'C': ['u1', 'u2']}


new= pd.DataFrame([', '.join(i) for i in list(powerset(d.keys()))[1:]], columns =['users'])
## Output: ['A', 'B', 'C', 'A, B', 'A, C', 'B, C', 'A, B, C']


new['count'] = new['users'].apply(lambda x: min([len(d[y.replace(' ','')]) for y in x.split(',')]))
new

Output:

users       count
0   A           2
1   B           1
2   C           2
3   A, B        1
4   A, C        2
5   B, C        1
6   A, B, C     1