Pandas Pivot Table based on number and its frequency in certain column-CodePudding

I am new to Python pandas pivot_table and would like to ask a way to count frequencies of values in one column. I tried myself, but this situation is somewhat complicated.

I have a data frame like this:

Some students in the first column, like A01, A02, etc. Each student will have 5 semesters in school. The second column gives the semester number in which each student takes a course. Some students will take no course at all, some of them will take multiple courses per semester. Which means, if a student does not take courses at all, they will not appear in the original table.

The Data frame looks like the following:

df1 = pd.DataFrame({'student': ['A01','A01','A01','A02','A02','A04','A04','A04'],
                    'course': ['0', '3', '4', '2', '2', '0', '2', '4']})

  student  course
0   A01     0
1   A01     3
2   A01     4
3   A02     2
4   A02     2
5   A04     0
6   A04     2
7   A04     4

I want the table to be like this:

The second column will be the semester number, no matter how many courses that the student take, I want all 5 semesters to be listed, so each student will take 5 rows. The third column will be the number of courses the student took in the corresponding second column semester.

df2 = pd.DataFrame({'student': ['A01','A01','A01','A01','A01','A02','A02','A02','A02','A02','A03','A03','A03','A03','A03','A04','A04','A04','A04','A04'],
                    'semester': ['0', '1', '2', '3', '4', '0', '1', '2', '3', '4','0', '1', '2', '3', '4','0', '1', '2', '3', '4'],
                    'course_count': ['1', '0', '0', '1', '1', '0', '0', '2', '0', '0', '0', '0','0', '0', '0','1','0','1', '0', '1']})

    student  semester  course_count
0   A01       0           1
1   A01       1           0
2   A01       2           0
3   A01       3           1
4   A01       4           1
5   A02       0           0
6   A02       1           0
7   A02       2           2
8   A02       3           0
9   A02       4           0
10  A03       0           0
11  A03       1           0
12  A03       2           0
13  A03       3           0
14  A03       4           0
15  A04       0           1
16  A04       1           0
17  A04       2           1
18  A04       3           0
19  A04       4           1

What's more, some students' data is even missing, in the question, data for A03 is missing, but I still want the final table to list all the students with consecutive numbers, for example, if I loop 100 students, there will be 500 rows total.

Thanks in advance.

CodePudding user response：

You can use crosstab and reshaping:

out = (pd.crosstab(df1['student'], df1['course'])
         .reindex(index=[f'A{x 1:02d}' for x in range(4)],
                  columns=map(str, range(5)), 
                  fill_value=0)
         .stack().reset_index(name='course_count')
         .rename(columns={'course': 'semester'}) # optional
       )

Other approach with reindexing:

cols = ['student', 'course']
out = (df1
   .value_counts()
   .reindex(pd.MultiIndex.from_product([sorted(set(df1[c])) for c in cols],
                                       names=cols
                                      ),
            fill_value=0)
   .reset_index(name='course_count')
   .rename(columns={'course': 'semester'}) # optional
)

output:

   student semester  course_count
0      A01        0             1
1      A01        1             0
2      A01        2             0
3      A01        3             1
4      A01        4             1
5      A02        0             0
6      A02        1             0
7      A02        2             2
8      A02        3             0
9      A02        4             0
10     A03        0             0
11     A03        1             0
12     A03        2             0
13     A03        3             0
14     A03        4             0
15     A04        0             1
16     A04        1             0
17     A04        2             1
18     A04        3             0
19     A04        4             1