I have a DataFrame that looks like this:
   Name  Score1  Score2  Score3
0     A      98      72      99
1     A      98      84      91
2     B      34      20      81
3     A      98      93      88
4     B      68      97      12
5     A     NaN      72     NaN
I want to group by Name, then take the top 2 rows by Score1; if there are duplicate values, break the tie on Score2, whichever is larger. Expected output:
   Name  Score1  Score2  Score3
0     A      98      93      88
1     A      98      84      91
2     B      68      97      12
3     B      34      20      81
I tried df.groupby("Name").apply(lambda x: x.nlargest(2, ["Score1", "Score2"])).reset_index(drop=True). What I get is:
   Name  Score1  Score2  Score3
0     A    98.0      93    88.0
1     A    98.0      84    91.0
2     A    98.0      72    99.0
3     A     NaN      72     NaN
4     B    68.0      97    12.0
5     B    34.0      20    81.0
I found that because of the NaN, it returns more than 2 rows for Name A. Is dropna the only way to fix it?
CodePudding user response:
You can also do it like this:
out = df.sort_values(['Score1', 'Score2'], ascending=False).groupby('Name').head(2)
print(out)
  Name  Score1  Score2  Score3
3    A    98.0      93    88.0
1    A    98.0      84    91.0
4    B    68.0      97    12.0
2    B    34.0      20    81.0
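For reference, a self-contained run of this approach, with the data rebuilt from the question. The key point is that sort_values places NaN last by default (na_position='last'), so the NaN row can never reach the top 2 of its group:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name":   ["A", "A", "B", "A", "B", "A"],
    "Score1": [98, 98, 34, 98, 68, np.nan],
    "Score2": [72, 84, 20, 93, 97, 72],
    "Score3": [99, 91, 81, 88, 12, np.nan],
})

# Sort by Score1, tie-break on Score2 (both descending); NaN sorts last,
# then keep the first 2 rows of each Name group.
out = df.sort_values(["Score1", "Score2"], ascending=False).groupby("Name").head(2)
print(out)
```

Because head(2) preserves the sorted order, no extra re-sorting is needed afterwards.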
CodePudding user response:
You can fill the NaN values or drop those rows before using nlargest:
cols = ["Score1", "Score2"]
df[cols] = df[cols].fillna(float("-inf"))  # sentinel smaller than any real score
# or: df = df.dropna(subset=cols)
out = df.groupby("Name").apply(lambda g: g.nlargest(2, cols)).reset_index(drop=True)
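A runnable version of this fill-then-nlargest variant, with the question's data rebuilt inline. Filling with -inf (rather than 0) guarantees the NaN rows rank below every real score:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name":   ["A", "A", "B", "A", "B", "A"],
    "Score1": [98, 98, 34, 98, 68, np.nan],
    "Score2": [72, 84, 20, 93, 97, 72],
    "Score3": [99, 91, 81, 88, 12, np.nan],
})

cols = ["Score1", "Score2"]
# Replace NaN with -inf so nlargest always ranks those rows last.
df[cols] = df[cols].fillna(float("-inf"))

# Top 2 per group by Score1, ties broken by Score2.
out = (
    df.groupby("Name")
      .apply(lambda g: g.nlargest(2, cols))
      .reset_index(drop=True)
)
print(out)
```

Note that recent pandas versions warn about the grouping column being included in apply; the result here is unaffected.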