How to find the correlation between categories of the same column in Python?

How can I generate a correlation matrix of different categories in the same column? I am working with medical data in which one column holds the disease categories assigned to each patient. For example:

    patient disease 
1    101      A   
2    101      B  
3    102      A     
4    102      C   
5    102      B
6    103      A   
7    104      B  
8    104      C

I want to find the correlation between diseases A, B, and C: if a patient has disease A, how likely are they to also have disease B, and so on for every pair?

Something like this:

     A     B     C 
A   ...   ...   ...
B   ...   ...   ...
C   ...   ...   ...

CodePudding user response：

This approach restructures the data into one row per patient, with a 0/1 column for each disease, and then computes the correlation between those columns.

import pandas as pd

data_main = {'patient':[101,101,102,102,102,103,104,104],'disease':['A','B','A','C','B','A','B','C']}

new_data_main = []  # one dict per patient, with a 0/1 flag for every disease
checked = []  # patients already processed, to prevent duplicates
for j in data_main['patient']:
    if j in checked:
        continue
    # positions of all rows that belong to patient j
    indices = [i for i, x in enumerate(data_main['patient']) if x == j]
    temp_data = {'patient':j,'A':0,'B':0,'C':0}
    for k in indices:
        str_dis = data_main['disease'][k]
        temp_data[str_dis] = 1  # mark this disease as present
    checked.append(j)
    new_data_main.append(temp_data)

# use patient as the index so the ID is not treated as a variable by corr()
df = pd.DataFrame(new_data_main).set_index('patient')

print(df.corr())
'''
          A         B         C
A  1.000000 -0.333333 -0.577350
B -0.333333  1.000000  0.577350
C -0.577350  0.577350  1.000000
'''
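
The same patient-by-disease table can probably be built without the explicit loop. Here is a minimal sketch, assuming `pd.crosstab` on the same sample data; the names `indicators` and `corr` are only illustrative, and `clip(upper=1)` is there in case a patient is listed with the same disease more than once.

```python
import pandas as pd

df = pd.DataFrame({
    'patient': [101, 101, 102, 102, 102, 103, 104, 104],
    'disease': ['A', 'B', 'A', 'C', 'B', 'A', 'B', 'C'],
})

# One row per patient, one column per disease.
# crosstab counts occurrences; clip turns the counts into 0/1 flags.
indicators = pd.crosstab(df['patient'], df['disease']).clip(upper=1)

# Pearson correlation between the 0/1 disease columns
corr = indicators.corr()
print(corr.round(3))
```

Because `patient` ends up as the index here, it is not included in the correlation matrix.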

CodePudding user response：

I would use pivot_table for this.

import pandas as pd

df = pd.DataFrame({'patient': [101, 101, 102, 102, 102, 103, 104, 104], 
        'disease': ['A', 'B', 'A', 'C', 'B', 'A', 'B', 'C']})
df['present'] = 1  # helper column: 1 for every (patient, disease) row
# one row per patient, one column per disease
pivot_table = pd.pivot_table(df, values='present', index='patient', columns='disease', aggfunc='max', fill_value=0)
# fill_value=0 marks diseases a patient does not have
correlation_matrix = pivot_table.corr()  # correlation between the disease columns
print(correlation_matrix)
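
Since the question is phrased as "if a patient has disease A, how likely are they to have disease B", a conditional probability table may be closer to the goal than Pearson correlation. Here is a rough sketch on the same sample data, assuming each patient should count at most once per disease; the names `indicators`, `co_occurrence` and `cond_prob` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'patient': [101, 101, 102, 102, 102, 103, 104, 104],
    'disease': ['A', 'B', 'A', 'C', 'B', 'A', 'B', 'C'],
})

# 0/1 table: one row per patient, one column per disease
indicators = pd.crosstab(df['patient'], df['disease']).clip(upper=1)

# co_occurrence.loc[i, j] = number of patients with both disease i and disease j
# (the diagonal holds the number of patients with each disease)
co_occurrence = indicators.T @ indicators

# Divide each row i by the number of patients with disease i:
# cond_prob.loc[i, j] estimates P(patient has j | patient has i)
cond_prob = co_occurrence.div(indicators.sum(), axis=0)
print(cond_prob.round(3))
```

Read the result row-wise: row A, column B is the share of patients with A who also have B. Unlike the correlation matrix, this table is generally not symmetric.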