I have retail store transactional data and want to see what categories are bought together. The data is in the below format:
transaction_no | product_id | category |
---|---|---|
1 | 100012 | A |
1 | 121111 | A |
1 | 121127 | B |
1 | 121127 | G |
2 | 465222 | N |
2 | 121127 | M |
3 | 121127 | F |
3 | 121127 | G |
3 | 121127 | F |
4 | 465222 | M |
4 | 121127 | N |
Rules:
- The result should be aggregated based on unique transaction numbers.
- Also, the order shouldn't matter in this case (e.g. A bought with B is the same as B bought with A).
- If a category is repeated within the same transaction, it should be counted as 1 only (e.g. in transaction_no = 1, category A is counted once)
Expected output:
bucket | count |
---|---|
A, B, G | 1 |
N, M | 2 |
F, G | 1 |
How do I achieve this?
CodePudding user response:
Use GroupBy.agg
for aggregate frozenset
, then count values by Series.value_counts
and last create DataFrame with join for strings from frozensets:
df1 = (df.groupby('transaction_no')['category']
.agg(frozenset)
.value_counts()
.rename(lambda x: ', '.join(sorted(x)))
.rename_axis('bucket')
.reset_index(name='count'))
print (df1)
bucket count
0 M, N 2
1 F, G 1
2 A, B, G 1
Another idea:
df1 = (df.groupby('transaction_no')['category']
.agg(lambda x: ', '.join(sorted(set((x)))))
.value_counts()
.rename_axis('bucket')
.reset_index(name='count')
)
print (df1)
bucket count
0 M, N 2
1 F, G 1
2 A, B, G 1