I have a dataframe with about 800,000 rows and 16 columns; below is an example of the data:
import pandas as pd
import datetime
start = datetime.datetime.now()
print('Starting time,' + str(start))
dict1 = {'id': ['person1', 'person2', 'person3', 'person4', 'person5'],
         'food1': ['A', 'A', 'A', 'C', 'D'],
         'food2': ['B', 'C', 'B', 'A', 'B'],
         'food3': ['', 'D', 'C', '', ''],
         'food4': ['', '', 'D', '', '']}
demo = pd.DataFrame(dict1)
demo
>>>Out[13]
Starting time,2022-11-30 12:08:41.414807
id food1 food2 food3 food4
0 person1 A B
1 person2 A C D
2 person3 A B C D
3 person4 C A
4 person5 D B
My ideal result is a symmetric matrix counting, for each pair of foods, how many people have both; for example, the (A, B) entry is 2 because person1 and person3 each have both A and B. The format is as follows:
>>>Out[14]
A B C D
A 0 2 3 2
B 2 0 1 2
C 3 1 0 2
D 2 2 2 0
What I have done so far:
I've searched a bit through Stack Overflow and Google, but so far I haven't come across an answer that addresses my problem.
I tried to code it myself: my idea was to first build each pairing, then combine everything into a string, and finally count the duplicates, but limited by my coding ability it is still a work in progress (a rough sketch of the idea is below). Also, a "new" combination formed from the second item of one pair and the first item of another pair may cause errors when counting duplicates.
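Roughly, this is the pairing approach I had in mind (just a sketch; it loops over every row, so it is probably too slow for 800,000 rows, and the variable names are only placeholders):
import itertools
from collections import Counter

# count each unordered pair of foods once per person, skipping empty cells
pair_counts = Counter()
for _, row in demo.iterrows():
    foods = sorted({v for v in row[['food1', 'food2', 'food3', 'food4']] if v})
    pair_counts.update(itertools.combinations(foods, 2))

# turn the pair counts into a symmetric matrix
labels = sorted({f for pair in pair_counts for f in pair})
matrix = pd.DataFrame(0, index=labels, columns=labels)
for (a, b), n in pair_counts.items():
    matrix.loc[a, b] = n
    matrix.loc[b, a] = n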
Thank you for your help.
CodePudding user response:
If I understand your goal correctly, you can use this:
import numpy as np

# unique food labels from all non-id columns, used as both index and columns
uniques = demo[[x for x in demo.columns if 'id' not in x]].stack().unique()
pd.DataFrame(index=uniques, columns=uniques).fillna(np.nan)
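This only builds the empty skeleton, so the counts still need to be filled in. One possible way to do that (a sketch, not part of the original answer: reshape to long form, drop the empty strings, self-join on id, and cross-tabulate; long, pairs, and counts are just placeholder names):
# long format: one (id, food) row per non-empty cell
long = demo.melt(id_vars='id', value_name='food').query("food != ''")[['id', 'food']]
# pair every food with every other food eaten by the same person
pairs = long.merge(long, on='id').query('food_x != food_y')
# count how often each pair occurs; the result is symmetric with a zero diagonal
counts = pd.crosstab(pairs['food_x'], pairs['food_y'])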
CodePudding user response:
You could try this:
import numpy as np

# one-hot encode the foods, then reduce to a 0/1 "has this food" indicator per person
out = demo.iloc[:, 1:].stack().str.get_dummies().groupby(level=0).sum().ne(0).astype(int)
# (foods x persons) @ (persons x foods) gives the co-occurrence counts
final = out.T.dot(out).astype(float)
np.fill_diagonal(final.values, np.nan)  # a food does not pair with itself
>>>final
A B C D
A NaN 2.0 3.0 2.0
B 2.0 NaN 1.0 2.0
C 3.0 1.0 NaN 2.0
D 2.0 2.0 2.0 NaN
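If you prefer zeros on the diagonal, as in the ideal output above, rather than NaN, you could follow this up with something like:
final = final.fillna(0).astype(int)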