I have a DataFrame like this:
student marks term
steve 55 1
jordan 66 2
steve 53 1
alan 74 2
jordan 99 1
steve 81 2
alan 78 1
alan 76 2
jordan 48 1
I would like to return highest two scores for each student
student marks term
steve 81 2
steve 55 1
jordan 99 1
jordan 66 2
alan 78 1
alan 76 2
I have tried
df = df.groupby('student')['marks'].max()
but it returns 1 row, I would like each student in the order they are mentioned with top two scores.
CodePudding user response:
You could use groupby
nlargest
to find the 2 largest values; then use loc
to sort in the order they appear in df
:
out = (df.groupby('student')['marks'].nlargest(2)
.droplevel(1)
.loc[df['student'].drop_duplicates()]
.reset_index())
Output:
student marks
0 steve 81
1 steve 55
2 jordan 99
3 jordan 66
4 alan 78
5 alan 76
If you want to keep "terms" as well, you could use the index:
idx = df.groupby('student')['marks'].nlargest(2).index.get_level_values(1)
out = df.loc[idx].set_index('student').loc[df['student'].drop_duplicates()].reset_index()
Output:
student marks term
0 steve 81 2
1 steve 55 1
2 jordan 99 1
3 jordan 66 2
4 alan 78 1
5 alan 76 2
@sammywemmy suggested a better way to derive the second result:
out = (df.loc[df.groupby('student', sort=False)['marks'].nlargest(2)
.index.get_level_values(1)]
.reset_index(drop=True))
CodePudding user response:
Sorting before grouping should suffice, since you need to keep the term
column:
df.sort_values('marks').groupby('student', sort = False).tail(2)
student marks term
0 steve 55 1
1 jordan 66 2
7 alan 76 2
6 alan 78 1
5 steve 81 2
4 jordan 99 1