Home > Mobile >  Get the frequency of all combinations in Pandas
Get the frequency of all combinations in Pandas

Time:12-12

I am trying to get the purchase frequency of all combinations of products.

Suppose my transactions are the following

userid    product
u1        A
u1        B
u1        C
u2        A
u2        C

So the solution should be

combination  count_of_distinct_users
A            2
B            1
C            2
A, B         1
A, C         2
B, C         1 
A, B, C      1

i.e 2 users have purchased product A, one users has purchased product B..., 2 users have purchased products A and C ...

CodePudding user response:

Sefine a function combine to generate all combinations:

from itertools import combinations

def combine(s):
    result = []
    for i in range(1, len(s) 1):
        for c in list(combinations(s, i)):
            result =[c]
    return result

This will give all combinations in a column:

df.groupby('user')['product'].apply(combine)
# Out: 
# user
# 1    [(A,), (B,), (C,), (A, B), (A, C), (B, C), (A,...
# 2                                 [(A,), (C,), (A, C)]
# Name: product, dtype: object

Now use explode():

df.groupby('user')['product'].apply(combine).reset_index(name='product_combos') \
  .explode('product_combos').groupby('product_combos') \
  .size().reset_index(name='user_count')
# Out: 
#   product_combos  user_count
# 0           (A,)           2
# 1         (A, B)           1
# 2      (A, B, C)           1
# 3         (A, C)           2
# 4           (B,)           1
# 5         (B, C)           1
# 6           (C,)           2

Careful with the combinations because the list gets large with many different products!

CodePudding user response:

Here my simple trick is to convert df to dict with list of users like {'A':[u1, u2], 'B':[u1]} then find the combination and merge both products list of users total. like A:[u1, u2] and B:[u1] so merge will be [2,1] and last took the min value pf that list so final count output will be 1.

Code:

from more_itertools import powerset

d = df.groupby('product')['user'].apply(list).to_dict()
##output: {'A': ['u2', 'u1'], 'B': ['u1'], 'C': ['u1', 'u2']}


new= pd.DataFrame([', '.join(i) for i in list(powerset(d.keys()))[1:]], columns =['users'])
## Output: ['A', 'B', 'C', 'A, B', 'A, C', 'B, C', 'A, B, C']


new['count'] = new['users'].apply(lambda x: min([len(d[y.replace(' ','')]) for y in x.split(',')]))
new

Output:

users       count
0   A           2
1   B           1
2   C           2
3   A, B        1
4   A, C        2
5   B, C        1
6   A, B, C     1
  • Related