Home > OS >  Python product frequently bought with
Python product frequently bought with

Time:11-28

I have retail store transactional data and want to see what categories are bought together. The data is in the below format:

transaction_no product_id category
1 100012 A
1 121111 A
1 121127 B
1 121127 G
2 465222 N
2 121127 M
3 121127 F
3 121127 G
3 121127 F
4 465222 M
4 121127 N

Rules:

  1. The result should be aggregated based on unique transaction numbers.
  2. Also, the order shouldn't matter in this case (e.g. A bought with B is the same as B bought with A).
  3. If a category is repeated within the same transaction, it should be counted as 1 only (e.g. in transaction_no = 1, category A is counted once)

Expected output:

bucket count
A, B, G 1
N, M 2
F, G 1

How do I achieve this?

CodePudding user response:

Use GroupBy.agg for aggregate frozenset, then count values by Series.value_counts and last create DataFrame with join for strings from frozensets:

df1 = (df.groupby('transaction_no')['category']
        .agg(frozenset)
        .value_counts()
        .rename(lambda x: ', '.join(sorted(x)))
        .rename_axis('bucket')
        .reset_index(name='count'))
print (df1)
    bucket  count
0     M, N      2
1     F, G      1
2  A, B, G      1

Another idea:

df1 = (df.groupby('transaction_no')['category']
        .agg(lambda x: ', '.join(sorted(set((x)))))
        .value_counts()
        .rename_axis('bucket')
        .reset_index(name='count')
        )
print (df1)
    bucket  count
0     M, N      2
1     F, G      1
2  A, B, G      1
  • Related