Home > Enterprise >  Map unique values by ID to separate columns in Pandas
Map unique values by ID to separate columns in Pandas

Time:09-15

Suppose I have the following DataFrame:

ID Diagnosis Test Result
1 Cancer Positive
1 TB Negative
1 Lupus Indeterminate
2 Cancer Negative
2 TB Negative
2 Myopia Negative
2 Hypertension Negative

Which is in this wide format marked by ID. I want to truncate the rows to a single entry by creating more columns on the unique values by each ID. I would like my DataFrame to look like:

ID Diagnosis_1 Diagnosis_2 Diagnosis_3 Diagnosis_4 Test Result_1 Test Result_2 Test Result_3 Test Result_4
1 Cancer TB Lupus Positive Negative Indeterminate
2 Cancer TB Myopia Hypertension Negative Negative Negative Negative

A couple of things to note:

  • It is ok to have extra columns for IDs where the entry does not exist (see ID 1)
  • The number of columns is decided by the ID with the most entries (but this may not be the same ID for each column)
  • Does Pandas have a clean way to do this with some methods?

CodePudding user response:

In [97]: tmp = df.assign(result_num=df.groupby('ID').cumcount() 1).set_index(['ID', 'result_num']).unstack()
    ...: tmp.columns = [f'{a}_{b}' for a,b in tmp.columns]
    ...: tmp
Out[97]:
   Diagnosis_1 Diagnosis_2 Diagnosis_3   Diagnosis_4 Test Result_1 Test Result_2  Test Result_3 Test Result_4
ID
1       Cancer          TB       Lupus           NaN      Positive      Negative  Indeterminate           NaN
2       Cancer          TB      Myopia  Hypertension      Negative      Negative       Negative      Negative
  1. assign a group serial number using df.groupby('ID').cumcount() 1
  2. set index to both ID and the serial number
  3. then unstack the serial number column to convert it to columns
  4. the last step flattens the multi-indexed column labels
  5. you can add a .fillna('') to get rid of the NaNs
  • Related