Home > Software design >  Two columns list data into multiple row with same reference id
Two columns list data into multiple row with same reference id

Time:09-22

I have a data frame like this,

|ID |Pass | Fail | Pass_student_ID |Failed_student_ID|
|:-:|:---:|:----:|:---------------:|:---------------:|
| 1 |  2  |   0  |    [101,102]    |        0        |
| 2 |  1  |   1  |      [102]      |       101       |
| 3 |  0  |   1  |    [101,102]    |        0        |

I want an output like this

| ID|Student_ID|Status|
|:-:|:--------:|:----:|
| 1 |    101   | Pass |
| 1 |    102   | Pass |
| 2 |    101   | Fail |
| 2 |    102   | Pass |
| 3 |    101   | Fail |
| 3 |    102   | Fail |

How do I write in python to get this kind of data frame?

CodePudding user response:

IIUC, you can explode, then reshape with melt:

(df.explode('Pass_student_ID')
 [['ID', 'Pass_student_ID', 'Failed_student_ID']]
 .rename(columns={'Pass_student_ID': 'Pass', 'Failed_student_ID': 'Fail'})
 .melt('ID', value_name='Student_ID', var_name='Status')
 .loc[lambda d: d['Student_ID'].ne(0)]
 .sort_values(by='ID')
)

output:

   ID Status Student_ID
0   1   Pass        101
1   1   Pass        102
2   2   Pass        102
7   2   Fail        101
3   3   Pass        101
4   3   Pass        102

CodePudding user response:

Use DataFrame.drop for remove Pass/Fail columns, unpivot by DataFrame.melt, convert lists to scalars by DataFrame.explode, filter out 0 values and for Status filter first 4 values of columns names, last sorting by DataFrame.sort_values:

import ast

#if lists are string representation
df['Pass_student_ID'] = df['Pass_student_ID'].apply(ast.literal_eval)


df1 = (df.drop(['Pass','Fail'],axis=1)
        .melt('ID', value_name='Student_ID', var_name='Status')
        .explode('Student_ID')
        .loc[lambda x: x['Student_ID'].ne(0)]
        .assign(Status = lambda x: x['Status'].str[:4])
        .sort_values('ID', ignore_index=True))
print (df1)
   ID Status Student_ID
0   1   Pass        101
1   1   Pass        102
2   2   Pass        102
3   2   Fail        101
4   3   Pass        101
5   3   Pass        102

If multiple columns in DataFrame is possible select values by subset instead drop:

df1 = (df[['ID','Pass_student_ID','Failed_student_ID']]
        .melt('ID', value_name='Student_ID', var_name='Status')
        .explode('Student_ID')
        .loc[lambda x: x['Student_ID'].ne(0)]
        .assign(Status = lambda x: x['Status'].str[:4])
        .sort_values('ID', ignore_index=True))
print (df1)
   ID Status Student_ID
0   1   Pass        101
1   1   Pass        102
2   2   Pass        102
3   2   Fail        101
4   3   Pass        101
5   3   Pass        102
  • Related