How can I generate a correlation matrix of different categories in the same column?
I am working with medical data in which one column holds the disease category assigned to each patient, so a patient with several diseases appears on several rows. For example:
   patient disease
1      101       A
2      101       B
3      102       A
4      102       C
5      102       B
6      103       A
7      104       B
8      104       C
I want to find the correlation between the diseases A, B, and C: if a patient has disease A, how likely are they to also have disease B, and so on for every pair.
Something like this:
     A    B    C
A  ...  ...  ...
B  ...  ...  ...
C  ...  ...  ...
CodePudding user response:
This approach restructures the data into one row per patient with a 0/1 indicator column for each disease, then computes the correlation between those columns.
import pandas as pd

data_main = {'patient': [101, 101, 102, 102, 102, 103, 104, 104],
             'disease': ['A', 'B', 'A', 'C', 'B', 'A', 'B', 'C']}

new_data_main = []  # one dict per patient, with a 0/1 column for every disease
checked = []  # patients already processed, to prevent duplicates
for j in data_main['patient']:
    if j in checked:
        continue
    # positions of all rows that belong to patient j
    indices = [i for i, x in enumerate(data_main['patient']) if x == j]
    temp_data = {'patient': j, 'A': 0, 'B': 0, 'C': 0}
    for k in indices:
        str_dis = data_main['disease'][k]
        temp_data[str_dis] = 1  # mark the disease as present
    checked.append(j)
    new_data_main.append(temp_data)

df = pd.DataFrame(new_data_main)
# patient is an identifier, not a variable, so keep it out of the correlation
print(df.set_index('patient').corr())
'''
          A         B         C
A  1.000000 -0.333333 -0.577350
B -0.333333  1.000000  0.577350
C -0.577350  0.577350  1.000000
'''
CodePudding user response:
I would use pivot_table for this, with one row per patient and one column per disease.
import pandas as pd

df = pd.DataFrame({'patient': [101, 101, 102, 102, 102, 103, 104, 104],
                   'disease': ['A', 'B', 'A', 'C', 'B', 'A', 'B', 'C']})
# index must be the patient (not the disease) so each row describes one patient;
# aggfunc='size' counts the rows for each patient/disease combination
pivot_table = df.pivot_table(index='patient', columns='disease', aggfunc='size', fill_value=0)
# fill_value=0 marks the diseases a patient does not have
correlation_matrix = pivot_table.corr()  # pairwise correlation between the disease columns
print(correlation_matrix)
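Note that a correlation coefficient is not quite "if a patient has disease A, how likely are they to have disease B"; that reads more like a conditional probability. Below is a minimal sketch of that interpretation, assuming the estimate is simply (patients with both A and B) / (patients with A). The names indicator, cooccurrence and conditional are only illustrative, and with four patients the numbers are not statistically meaningful.

```python
import pandas as pd

df = pd.DataFrame({'patient': [101, 101, 102, 102, 102, 103, 104, 104],
                   'disease': ['A', 'B', 'A', 'C', 'B', 'A', 'B', 'C']})

# One row per patient, one 0/1 column per disease.
# clip(upper=1) guards against a patient/disease pair listed more than once.
indicator = pd.crosstab(df['patient'], df['disease']).clip(upper=1)

# cooccurrence.loc[X, Y] = number of patients who have both X and Y
cooccurrence = indicator.T.dot(indicator)

# Number of patients with each disease (same as the diagonal of cooccurrence)
counts = indicator.sum()

# conditional.loc[X, Y] = share of patients with X who also have Y
conditional = cooccurrence.div(counts, axis=0)
print(conditional)
```

Unlike the correlation matrix, this table is not symmetric: the row is the disease the patient is known to have, and the column is the disease being asked about.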