I have an unbalanced data panel where each period a student can have a certain level/type of scholarship:
head(df)
ID student_period scholarship
4567 1 scholarship_level_1
4567 2 scholarship_level_2
4567 3 scholarship_level_2
4567 4 scholarship_level_3
5478 4 scholarship_level_3
5478 5 scholarship_level_3
6758 7 scholarship_level_1
6758 8 scholarship_level_2
6758 9 scholarship_level_2
Basically, I want to create a new variable that plots the statistical mode of the scholarship level for each student ID in this panel. Something like this:
head(df1)
ID student_period scholarship scholarship_mode
4567 1 scholarship_level_1 scholarship_level_2
4567 2 scholarship_level_2 scholarship_level_2
4567 3 scholarship_level_2 scholarship_level_2
4567 4 scholarship_level_3 scholarship_level_2
5478 4 scholarship_level_3 scholarship_level_3
5478 5 scholarship_level_3 scholarship_level_3
6758 7 scholarship_level_1 scholarship_level_2
6758 8 scholarship_level_2 scholarship_level_2
6758 9 scholarship_level_2 scholarship_level_2
Any ideas?
CodePudding user response:
You can use groupby
transform
and value_counts
:
df['scholarship_mode'] = (df.groupby('ID')['scholarship']
.transform(lambda x: x.value_counts().index[0]))
or mode
:
df['scholarship_mode'] = (df.groupby('ID')['scholarship']
.transform(lambda x: x.mode().iloc[0]))
output:
ID student_period scholarship scholarship_mode
0 4567 1 scholarship_level_1 scholarship_level_2
1 4567 2 scholarship_level_2 scholarship_level_2
2 4567 3 scholarship_level_2 scholarship_level_2
3 4567 4 scholarship_level_3 scholarship_level_2
4 5478 4 scholarship_level_3 scholarship_level_3
5 5478 5 scholarship_level_3 scholarship_level_3
6 6758 7 scholarship_level_1 scholarship_level_2
7 6758 8 scholarship_level_2 scholarship_level_2
8 6758 9 scholarship_level_2 scholarship_level_2
NB. be aware that mode
/value_counts
can have ties, in this case only one value will be used.