In my dataset, there is a column that has names of the courses taken by a student. It has multiple values
StudentID:S1, Courses: course1, course2, course3
StudentID:S2, Courses: course1, course4, course5
I want to separate these values (course names) and create column for each course separately and assign values 1 or 0 based on the courses taken by a student. How can I do that using Pandas library ?
Any help is appreciated Thank you
CodePudding user response:
header_list = ['StudentID','Courses','Course1', 'Course2']
df = df.reindex(columns = header_list)
This will include those columns into the dataframe. Including empty columns.
CodePudding user response:
I created the sample dataframe using data that you gave
StudentID Courses
0 S1 course1, course2, course3
1 S2 course1, course4, course5
Split the courses column
df['Courses'] = df.Courses.str.split(', ')
StudentID Courses
0 S1 [course1, course2, course3]
1 S2 [course1, course4, course5]
Explode the courses column
df = df.explode('Courses')
StudentID Courses
0 S1 course1
0 S1 course2
0 S1 course3
1 S2 course1
1 S2 course4
1 S2 course5
Use get_dummies to create a separate column for each course
pd.get_dummies(df.explode('Courses'), columns=['Courses'], prefix='', prefix_sep='')
StudentID course1 course2 course3 course4 course5
0 S1 1 0 0 0 0
0 S1 0 1 0 0 0
0 S1 0 0 1 0 0
1 S2 1 0 0 0 0
1 S2 0 0 0 1 0
1 S2 0 0 0 0 1