How to create from a 2-column DataFrame a DataFrame that counts occurences of 2nd column by 1st colu-CodePudding

I have the following 2-column DataFrame:

transaction id	item id
1	50
3	57
1	57
1	54
2	52
2	54

And I want to create a squared diagonal matrix of size len(item_id) x len(item_id), such that for every item id we have a vector in which every entry represents the amount of times this item id shared a transaction with another item id.

In our example, I want the following result:

          50     52     54     57
50         1     0      1      1
52         0     1      1      0
54         1     1      2      1
57         1     0      1      2

A sanity check can be the diagonal: entry (i, i) should have the number of times the item ids that is i'th in its magnitude has appeared in total. Also the resulting matrix should be diagonal.

A couple of functions I tried to use are pd.get_dummies() and groupby() but with no results.

CodePudding user response：

Try this:

ndf = (df.set_index('transaction id')['item id']
       .astype(str)
       .str.get_dummies()
       .groupby(level=0).sum())
ndf.T.dot(ndf)

Output:

    50  52  54  57
50   1   0   1   1
52   0   1   1   0
54   1   1   2   1
57   1   0   1   2