I'm currently trying to use a number of medical codes to find out if a person has a certain disease and would require help as I tried searching for a couple of days but couldn't find any. Hoping someone can help me with this. Considering I've imported excel file 1 into df1 and excel file 2 into df2, how do I use excel file 2 to identify what disease does the patients in excel file 1 have and indicate them with a header? Below is an example of what the data looks like. I'm currently using pandas Jupyter notebook for this.
Excel file 1:
Patient | Primary Diagnosis | Secondary Diagnosis | Secondary Diagnosis 2 | Secondary Diagnosis 3 | |
---|---|---|---|---|---|
Alex | 50322 | 50111 | |||
John | 50331 | 60874 | 50226 | 74444 | |
Peter | 50226 | 74444 | |||
Peter | 50233 | 88888 |
Excel File 2:
Primary Diagnosis | Medical Code |
---|---|
Diabetes Type 2 | 50322 |
Diabetes Type 2 | 50331 |
Diabetes Type 2 | 50233 |
Cardiovescular Disease | 50226 |
Hypertension | 50111 |
AIDS | 60874 |
HIV | 74444 |
HIV | 88888 |
Intended output:
Patient | Positive for Diabetes Type 2 | Positive for Cardiovascular Disease | Positive for Hypertension | Positive for AIDS | Positive for HIV |
---|---|---|---|---|---|
Alex | 1 | 1 | 0 | 0 | 0 |
John | 1 | 1 | 0 | 1 | 1 |
Peter | 1 | 1 | 0 | 0 | 1 |
CodePudding user response:
IIUC, you could melt df1
, then map the codes from reshaped df2
, finally pivot_table
on the output:
diseases = df2.set_index('Medical Code')['Primary Diagnosis']
(df1
.reset_index()
.melt(id_vars=['index', 'Patient'])
.assign(disease=lambda d: d['value'].map(diseases),
value=1,
)
.pivot_table(index='Patient', columns='disease', values='value', fill_value=0)
)
output:
disease AIDS Cardiovescular Disease Diabetes Type 2 HIV Hypertension
Patient
Alex 0 0 1 0 1
John 1 1 1 1 0
Peter 0 1 1 1 0
CodePudding user response:
Maybe you could convert your excel file 2 to some form of key value pair and then replace the primary diagnostics column in file 1 with the corresponding disease name, later apply some form of encoding like one-hot or something similar to file 1. Not sure if this approach would definitely help, but just sharing my thoughts.
CodePudding user response:
You can use merge
and pivot_table
out = (
df1.melt('Patient', var_name='Diagnosis', value_name='Medical Code').dropna()
.merge(df2, on='Medical Code').assign(dummy=1)
.pivot_table('dummy', 'Patient', 'Primary Diagnosis', fill_value=0)
.add_prefix('Positive for ').rename_axis(columns=None).reset_index()
)
Output:
Patient | Positive for AIDS | Positive for Cardiovescular Disease | Positive for Diabetes Type 2 | Positive for HIV | Positive for Hypertension |
---|---|---|---|---|---|
Alex | 0 | 0 | 1 | 0 | 1 |
John | 1 | 1 | 1 | 1 | 0 |
Peter | 0 | 1 | 1 | 1 | 0 |