Heatmap of common values between columns in dataframe-CodePudding

I have a Pandas dataframe df with two columns, A and B such that all values in B would be different for a value in A, but two different values in A can have duplicate values in B.

     A     B
   ABC1   XYZ XYZ 123
   ABC2   XYZ XYZ 123
   ABC1   XYZ XYZ 135
   ABC3   XYZ PQR 123
   ABC2   XYZ PQR 123
   ABC3   XYZ XYZ 135
   ABC4   XYZ XYZ 135
   ABC2   XYZ PQR 987
   ABC4   XYZ PQR 123
   ABC5   PQR PQR 567

How can I create a heatmap from this dataframe that shows me the number of common B values between any two A values? I presume I'll have to create a pivot that might look something like this -

         ABC1   ABC2   ABC3   ABC4   ABC5 
ABC1      2      1     ...
ABC2      1      3
ABC3      0      1            ...       
ABC4      0      0     ...            ...
ABC5      0      0

Any help on how this can be done efficiently?

CodePudding user response：

I think we can use get_dummies and np.tensordot for this.

First, let’s find out for each value of A which B are present

>>> common = pd.get_dummies(df['A']).groupby(df['B']).max()
>>> common
   Bar  Foo
B          
2    1    0
3    0    1
4    0    1
5    1    1
6    1    0
9    1    0

Then to compute the intersections of each rows we need the dot product of every pair of rows, for that we can use:

>>> np.tensordot(common, common, (0, 0))
array([[4, 1],
       [1, 3]], dtype=uint8)
>>> pd.DataFrame(np.tensordot(common, common, (0, 0)), columns=common.columns, index=common.columns)
     Bar  Foo
Bar    4    1
Foo    1    3

The diagonal shows the number of unique B values for each A column.