I have the following 2-column DataFrame:
transaction id | item id |
---|---|
1 | 50 |
3 | 57 |
1 | 57 |
1 | 54 |
2 | 52 |
2 | 54 |
And I want to create a squared diagonal matrix of size len(item_id) x len(item_id), such that for every item id we have a vector in which every entry represents the amount of times this item id shared a transaction with another item id.
In our example, I want the following result:
50 52 54 57
50 1 0 1 1
52 0 1 1 0
54 1 1 2 1
57 1 0 1 2
A sanity check can be the diagonal: entry (i, i) should have the number of times the item ids that is i'th in its magnitude has appeared in total. Also the resulting matrix should be diagonal.
A couple of functions I tried to use are pd.get_dummies() and groupby() but with no results.
CodePudding user response:
Try this:
ndf = (df.set_index('transaction id')['item id']
.astype(str)
.str.get_dummies()
.groupby(level=0).sum())
ndf.T.dot(ndf)
Output:
50 52 54 57
50 1 0 1 1
52 0 1 1 0
54 1 1 2 1
57 1 0 1 2