I have a dataframe with about 800,000 rows and 16 columns; below is an example of the data:
import pandas as pd
import datetime
start = datetime.datetime.now()
print('Starting time,' + str(start))
dict1 = {'id': ['person1', 'person2', 'person3', 'person4', 'person5'],
         'food1': ['A', 'A', 'A', 'C', 'D'],
         'food2': ['B', 'C', 'B', 'A', 'B'],
         'food3': ['', 'D', 'C', '', ''],
         'food4': ['', '', 'D', '', '']}
demo = pd.DataFrame(dict1)
demo
>>>Out[13]
Starting time,2022-11-30 12:08:41.414807
id food1 food2 food3 food4
0 person1 A B
1 person2 A C D
2 person3 A B C D
3 person4 C A
4 person5 D B
My ideal result is a symmetric matrix counting, for each pair of foods, how many people have both; for example, the (A, B) entry is 2 because person1 and person3 each have both A and B. The format is as follows:
>>>Out[14]
A B C D
A 0 2 3 2
B 2 0 1 2
C 3 1 0 2
D 2 2 2 0
What I have done so far:
I've searched a bit through Stack Overflow and Google, but so far I haven't come across an answer that addresses my problem.
I tried to code it myself: my idea was to first build each pairing, then combine everything into a string, and finally count the duplicates, but limited by my coding ability it is still a work in progress (a rough sketch of the idea is below). Also, a "new" combination formed from the second item of one pair and the first item of another pair may cause errors when counting duplicates.
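Roughly, this is the pairing approach I had in mind (just a sketch; it loops over every row, so it is probably too slow for 800,000 rows, and the variable names are only placeholders):
import itertools
from collections import Counter

# count each unordered pair of foods once per person, skipping empty cells
pair_counts = Counter()
for _, row in demo.iterrows():
    foods = sorted({v for v in row[['food1', 'food2', 'food3', 'food4']] if v})
    pair_counts.update(itertools.combinations(foods, 2))

# turn the pair counts into a symmetric matrix
labels = sorted({f for pair in pair_counts for f in pair})
matrix = pd.DataFrame(0, index=labels, columns=labels)
for (a, b), n in pair_counts.items():
    matrix.loc[a, b] = n
    matrix.loc[b, a] = n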
Thank you for your help.
CodePudding user response:
If I understand your goal correctly, you can use this:
import numpy as np

# unique food labels from all non-id columns, used as both index and columns
uniques = demo[[x for x in demo.columns if 'id' not in x]].stack().unique()
pd.DataFrame(index=uniques, columns=uniques).fillna(np.nan)
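This only builds the empty skeleton, so the counts still need to be filled in. One possible way to do that (a sketch, not part of the original answer: reshape to long form, drop the empty strings, self-join on id, and cross-tabulate; long, pairs, and counts are just placeholder names):
# long format: one (id, food) row per non-empty cell
long = demo.melt(id_vars='id', value_name='food').query("food != ''")[['id', 'food']]
# pair every food with every other food eaten by the same person
pairs = long.merge(long, on='id').query('food_x != food_y')
# count how often each pair occurs; the result is symmetric with a zero diagonal
counts = pd.crosstab(pairs['food_x'], pairs['food_y'])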
CodePudding user response:
You could try this:
import numpy as np

# one-hot encode the foods, then reduce to a 0/1 "has this food" indicator per person
out = demo.iloc[:, 1:].stack().str.get_dummies().groupby(level=0).sum().ne(0).astype(int)
# (foods x persons) @ (persons x foods) gives the co-occurrence counts
final = out.T.dot(out).astype(float)
np.fill_diagonal(final.values, np.nan)  # a food does not pair with itself
>>>final
A B C D
A NaN 2.0 3.0 2.0
B 2.0 NaN 1.0 2.0
C 3.0 1.0 NaN 2.0
D 2.0 2.0 2.0 NaN
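If you prefer zeros on the diagonal, as in the ideal output above, rather than NaN, you could follow this up with something like:
final = final.fillna(0).astype(int)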