How to convert multiple rows for a variable into a single column in python?-CodePudding

I have a dataframe in dynamic format for each ID

df1:

ID   |Start Date|End date |claim_no|claim_type|Admission_date|Discharge_date|Claim_amt|Approved_amt
10   |01-Apr-20 |31-Mar-21|  1123  |CSHLESS   |   23-Aug-2020 | 25-Aug-2020  |   25406 | 19351 
10   |01-Apr-20 |31-Mar-21|  1212  |POSTHOSP  |   30-Aug-2020 | 01-Sep-2020  |   4209  | 3964
10   |01-Apr-20 |31-Mar-21|  1680  |CSHLESS   |   18-Mar-2021 | 23-Mar-2021  |   18002 |  0
11   |12-Dec-20 |11-Dec-21|  1503  |CSHLESS   |   12-Jan-2021 | 15-Jan-2021  |   76137 | 50286
11   |12-Dec-20 |11-Dec-21|  1505  |CSHLESS   |   05-Jan-2021 | 07-Jan-2021  |   30000 | 0

Based on the ID column i am trying to convert all the dynamic variables into a static format so that i can have a single row for each ID. Columns such as ID, Start Date,End date are static in nature and rest of the columns are dynamic in nature for each ID.

I am expecting an output in the below format:

ID   |Start Date|End date |claim_no_1|claim_type_1|Admission_date_1|Discharge_date_1|Claim_amt_1|Approved_amt_1|claim_no_2|claim_type_2|Admission_date_2|Discharge_date_2|Claim_amt_2|Approved_amt_2|claim_no_3|claim_type_3|Admission_date_3|Discharge_date_3|Claim_amt_3|Approved_amt_3
10   |01-Apr-20 |31-Mar-21|  1123    |CSHLESS    | 23-Aug-2020     | 25-Aug-2020    |   25406   | 19351        |  1212    |POSTHOSP    | 30-Aug-2020     | 01-Sep-2020    |   4209   | 3964         |    1680  |CSHLESS     | 18-Mar-2021     | 23-Mar-2021   |   18002   | 0
11   |12-Dec-20 |11-Dec-21|  1503    |CSHLESS    | 12-Jan-2021     | 15-Jan-2021    |   76137   | 50286        |  1505    |CSHLESS     |05-Jan-2021      |07-Jan-2021     |30000     |0

I tried doing:

df1_updated = pd.get_dummies(df1,columns = ['claim_no','claim_type','Admission_date','Discharge_date','Claim_amt','Approved_amt'])

but it creates huge number of columns which is very difficult to read.

Can someone help me out in achieving the expected output.

CodePudding user response：

Annotated code

# Index columns
idx = ['ID', 'Start Date', 'End date']

# Sequential counter to identify unique rows per index columns
cols = df.groupby(idx).cumcount()   1

# Reshape using stack and unstack
df_out = df.set_index([*idx, cols]).stack().unstack([-2, -1])

# Flatten the multiindex columns
df_out.columns = df_out.columns.map('{0[1]}_{0[0]}'.format)

Result

                        claim_no_1 claim_type_1 Admission_date_1 Discharge_date_1 Claim_amt_1 Approved_amt_1 claim_no_2 claim_type_2 Admission_date_2 Discharge_date_2 Claim_amt_2 Approved_amt_2 claim_no_3 claim_type_3 Admission_date_3 Discharge_date_3 Claim_amt_3 Approved_amt_3
ID Start Date End date                                                                                                                                                                                                                                                                
10 01-Apr-20  31-Mar-21       1123      CSHLESS      23-Aug-2020      25-Aug-2020       25406          19351       1212     POSTHOSP      30-Aug-2020      01-Sep-2020        4209           3964       1680      CSHLESS      18-Mar-2021      23-Mar-2021       18002              0
11 12-Dec-20  11-Dec-21       1503      CSHLESS      12-Jan-2021      15-Jan-2021       76137          50286       1505      CSHLESS      05-Jan-2021      07-Jan-2021       30000              0        NaN          NaN              NaN              NaN         NaN            NaN