I am new to Python pandas pivot_table and would like to ask a way to count frequencies of values in one column. I tried myself, but this situation is somewhat complicated.
I have a data frame like this:
Some students in the first column, like A01, A02, etc. Each student will have 5 semesters in school. The second column gives the semester number in which each student takes a course. Some students will take no course at all, some of them will take multiple courses per semester. Which means, if a student does not take courses at all, they will not appear in the original table.
The Data frame looks like the following:
df1 = pd.DataFrame({'student': ['A01','A01','A01','A02','A02','A04','A04','A04'],
'course': ['0', '3', '4', '2', '2', '0', '2', '4']})
student course
0 A01 0
1 A01 3
2 A01 4
3 A02 2
4 A02 2
5 A04 0
6 A04 2
7 A04 4
I want the table to be like this:
The second column will be the semester number, no matter how many courses that the student take, I want all 5 semesters to be listed, so each student will take 5 rows. The third column will be the number of courses the student took in the corresponding second column semester.
df2 = pd.DataFrame({'student': ['A01','A01','A01','A01','A01','A02','A02','A02','A02','A02','A03','A03','A03','A03','A03','A04','A04','A04','A04','A04'],
'semester': ['0', '1', '2', '3', '4', '0', '1', '2', '3', '4','0', '1', '2', '3', '4','0', '1', '2', '3', '4'],
'course_count': ['1', '0', '0', '1', '1', '0', '0', '2', '0', '0', '0', '0','0', '0', '0','1','0','1', '0', '1']})
student semester course_count
0 A01 0 1
1 A01 1 0
2 A01 2 0
3 A01 3 1
4 A01 4 1
5 A02 0 0
6 A02 1 0
7 A02 2 2
8 A02 3 0
9 A02 4 0
10 A03 0 0
11 A03 1 0
12 A03 2 0
13 A03 3 0
14 A03 4 0
15 A04 0 1
16 A04 1 0
17 A04 2 1
18 A04 3 0
19 A04 4 1
What's more, some students' data is even missing, in the question, data for A03 is missing, but I still want the final table to list all the students with consecutive numbers, for example, if I loop 100 students, there will be 500 rows total.
Thanks in advance.
CodePudding user response:
You can use crosstab
and reshaping:
out = (pd.crosstab(df1['student'], df1['course'])
.reindex(index=[f'A{x 1:02d}' for x in range(4)],
columns=map(str, range(5)),
fill_value=0)
.stack().reset_index(name='course_count')
.rename(columns={'course': 'semester'}) # optional
)
Other approach with reindexing:
cols = ['student', 'course']
out = (df1
.value_counts()
.reindex(pd.MultiIndex.from_product([sorted(set(df1[c])) for c in cols],
names=cols
),
fill_value=0)
.reset_index(name='course_count')
.rename(columns={'course': 'semester'}) # optional
)
output:
student semester course_count
0 A01 0 1
1 A01 1 0
2 A01 2 0
3 A01 3 1
4 A01 4 1
5 A02 0 0
6 A02 1 0
7 A02 2 2
8 A02 3 0
9 A02 4 0
10 A03 0 0
11 A03 1 0
12 A03 2 0
13 A03 3 0
14 A03 4 0
15 A04 0 1
16 A04 1 0
17 A04 2 1
18 A04 3 0
19 A04 4 1