I have a large dataframe where I want to use groupby and nlargest to look for the second, third, fourth and fifth largest value of each group. I have over 500 groups, and each group has over 1000 values. I also have other columns in the dataframe that I want to keep after applying groupby and nlargest. My dataframe looks like this:
df = pd.DataFrame({
'group': [1,2,3,3,4, 5,6,7,7,8],
'a': [4, 5, 3, 1, 2, 20, 10, 40, 50, 30],
'b': [20, 10, 40, 50, 30, 4, 5, 3, 1, 2],
'c': [25, 20, 5, 15, 10, 25, 20, 5, 15, 10]
})
To find the second, third, fourth largest (and so on) of each group for column a, I use:
secondlargest = df.groupby(['group'], as_index=False)['a'].apply(lambda grp: grp.nlargest(2).min())
which returns
group a
0 1 4
1 2 5
2 3 1
3 4 2
4 5 20
5 6 10
6 7 40
7 8 30
I need columns b and c present in the resulting dataframe. I use the following to subset the original dataframe, but it returns an empty dataframe. How should I modify the code?
secondsubset = df[df.groupby(['group'])['a'].apply(lambda grp: grp.nlargest(2).min())]
CodePudding user response:
If I understand your goal correctly, you should be able to just drop as_index=False, use idxmin instead of min, and pass the result to df.loc:
df.loc[df.groupby('group')['a'].apply(lambda grp: grp.nlargest(2).idxmin())]
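As a quick check (a sketch assuming the sample frame from the question), the idxmin variant returns index labels rather than values, so df.loc pulls back the full original rows and columns b and c come along for free:

```python
import pandas as pd

df = pd.DataFrame({
    'group': [1, 2, 3, 3, 4, 5, 6, 7, 7, 8],
    'a': [4, 5, 3, 1, 2, 20, 10, 40, 50, 30],
    'b': [20, 10, 40, 50, 30, 4, 5, 3, 1, 2],
    'c': [25, 20, 5, 15, 10, 25, 20, 5, 15, 10]
})

# nlargest(2).idxmin() yields the index label of the second-largest value
# in each group (or of the only value, for single-row groups), so df.loc
# selects the original rows, keeping b and c intact.
idx = df.groupby('group')['a'].apply(lambda grp: grp.nlargest(2).idxmin())
secondsubset = df.loc[idx]
```

For group 3, for example, this selects the row with a=1, b=50, c=15, i.e. the actual row holding a's second-largest value.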
CodePudding user response:
You can use agg with a lambda. It is neater:
df.groupby('group').agg(lambda grp: grp.nlargest(2).min())
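One thing worth noting (shown below, assuming the sample frame from the question): agg applies the lambda to each column independently, so the b and c values in the result are each column's own second-largest, not the b and c from the row holding a's second-largest value:

```python
import pandas as pd

df = pd.DataFrame({
    'group': [1, 2, 3, 3, 4, 5, 6, 7, 7, 8],
    'a': [4, 5, 3, 1, 2, 20, 10, 40, 50, 30],
    'b': [20, 10, 40, 50, 30, 4, 5, 3, 1, 2],
    'c': [25, 20, 5, 15, 10, 25, 20, 5, 15, 10]
})

# agg runs the lambda once per column, so a, b and c are each reduced
# independently within every group.
res = df.groupby('group').agg(lambda grp: grp.nlargest(2).min())
```

For group 3, this gives b=40 (the second-largest of b's own values in that group), whereas the row containing a's second-largest value has b=50. Whether that is the desired semantics depends on the use case.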