Suppose I have the following DataFrame:
ID | Diagnosis | Test Result |
---|---|---|
1 | Cancer | Positive |
1 | TB | Negative |
1 | Lupus | Indeterminate |
2 | Cancer | Negative |
2 | TB | Negative |
2 | Myopia | Negative |
2 | Hypertension | Negative |
Which is in this wide format marked by ID. I want to truncate the rows to a single entry by creating more columns on the unique values by each ID. I would like my DataFrame to look like:
ID | Diagnosis_1 | Diagnosis_2 | Diagnosis_3 | Diagnosis_4 | Test Result_1 | Test Result_2 | Test Result_3 | Test Result_4 |
---|---|---|---|---|---|---|---|---|
1 | Cancer | TB | Lupus | Positive | Negative | Indeterminate | ||
2 | Cancer | TB | Myopia | Hypertension | Negative | Negative | Negative | Negative |
A couple of things to note:
- It is ok to have extra columns for IDs where the entry does not exist (see ID 1)
- The number of columns is decided by the ID with the most entries (but this may not be the same ID for each column)
- Does Pandas have a clean way to do this with some methods?
CodePudding user response:
In [97]: tmp = df.assign(result_num=df.groupby('ID').cumcount() 1).set_index(['ID', 'result_num']).unstack()
...: tmp.columns = [f'{a}_{b}' for a,b in tmp.columns]
...: tmp
Out[97]:
Diagnosis_1 Diagnosis_2 Diagnosis_3 Diagnosis_4 Test Result_1 Test Result_2 Test Result_3 Test Result_4
ID
1 Cancer TB Lupus NaN Positive Negative Indeterminate NaN
2 Cancer TB Myopia Hypertension Negative Negative Negative Negative
- assign a group serial number using
df.groupby('ID').cumcount() 1
- set index to both ID and the serial number
- then unstack the serial number column to convert it to columns
- the last step flattens the multi-indexed column labels
- you can add a
.fillna('')
to get rid of theNaN
s