How do i use medical codes to determine what disease a person have using jupyter?-CodePudding

I'm currently trying to use a number of medical codes to find out if a person has a certain disease and would require help as I tried searching for a couple of days but couldn't find any. Hoping someone can help me with this. Considering I've imported excel file 1 into df1 and excel file 2 into df2, how do I use excel file 2 to identify what disease does the patients in excel file 1 have and indicate them with a header? Below is an example of what the data looks like. I'm currently using pandas Jupyter notebook for this.

Excel file 1:

Patient	Primary Diagnosis	Secondary Diagnosis	Secondary Diagnosis 2	Secondary Diagnosis 3
Alex	50322	50111
John	50331	60874	50226	74444
Peter	50226	74444
Peter	50233	88888

Excel File 2:

Primary Diagnosis	Medical Code
Diabetes Type 2	50322
Diabetes Type 2	50331
Diabetes Type 2	50233
Cardiovescular Disease	50226
Hypertension	50111
AIDS	60874
HIV	74444
HIV	88888

Intended output:

Patient	Positive for Diabetes Type 2	Positive for Cardiovascular Disease	Positive for AIDS	Positive for HIV
Alex	1	1	0	0
John	1	1	1	1
Peter	1	1	0	1

CodePudding user response：

IIUC, you could melt df1, then map the codes from reshaped df2, finally pivot_table on the output:

diseases = df2.set_index('Medical Code')['Primary Diagnosis']

(df1
 .reset_index()
 .melt(id_vars=['index', 'Patient'])
 .assign(disease=lambda d: d['value'].map(diseases),
         value=1,
        )
 .pivot_table(index='Patient', columns='disease', values='value', fill_value=0)
)

output:

disease  AIDS  Cardiovescular Disease  Diabetes Type 2  HIV  Hypertension
Patient                                                                  
Alex        0                       0                1    0             1
John        1                       1                1    1             0
Peter       0                       1                1    1             0

CodePudding user response：

Maybe you could convert your excel file 2 to some form of key value pair and then replace the primary diagnostics column in file 1 with the corresponding disease name, later apply some form of encoding like one-hot or something similar to file 1. Not sure if this approach would definitely help, but just sharing my thoughts.

CodePudding user response：

You can use merge and pivot_table

out = (
    df1.melt('Patient', var_name='Diagnosis', value_name='Medical Code').dropna()
       .merge(df2, on='Medical Code').assign(dummy=1)
       .pivot_table('dummy', 'Patient', 'Primary Diagnosis', fill_value=0)
       .add_prefix('Positive for ').rename_axis(columns=None).reset_index()
)

Output:

Patient	Positive for AIDS	Positive for Cardiovescular Disease	Positive for Diabetes Type 2	Positive for HIV	Positive for Hypertension
Alex	0	0	1	0	1
John	1	1	1	1	0
Peter	0	1	1	1	0