Say I have such Pandas dataframe
df = pd.DataFrame({
'a': [4, 5, 3, 1, 2],
'b': [20, 10, 40, 50, 30],
'c': [25, 20, 5, 15, 10]
})
so df
looks like:
print(df)
a b c
0 4 20 25
1 5 10 20
2 3 40 5
3 1 50 15
4 2 30 10
And I want to get the column name of the 2nd largest value in each row. Borrowing the answer from Felex Le in this thread, I can now get the 2nd largest value by:
def second_largest(l = []):
return (l.nlargest(2).min())
print(df.apply(second_largest, axis = 1))
which gives me:
0 20
1 10
2 5
3 15
4 10
dtype: int64
But what I really want is the column names for those values, or to say:
0 b
1 b
2 c
3 c
4 c
Pandas
has a function idxmax
which can do the job for the largest value:
df.idxmax(axis = 1)
0 c
1 c
2 b
3 b
4 b
dtype: object
Is there any elegant way to do the same job but for the 2nd largest value?
CodePudding user response:
If efficiency is important, numpy.argpartition
is quite efficient:
N = 2
cols = df.columns.to_numpy()
pd.Series(cols[np.argpartition(df.to_numpy().T, -N, axis=0)[-N]], index=df.index)
If you want a pure pandas (less efficient):
out = df.stack().groupby(level=0).apply(lambda s: s.nlargest(2).index[-1][1])
Output:
0 b
1 b
2 c
3 c
4 c
dtype: object
CodePudding user response:
Use numpy.argsort
for positions of second largest values:
df['new'] = df['new'] = df.columns.to_numpy()[np.argsort(df.to_numpy())[:, -2]]
print(df)
a b c new
0 4 20 25 b
1 5 10 20 b
2 3 40 5 c
3 1 50 15 c
4 2 30 10 c
Your solution should working, but is slow:
def second_largest(l = []):
return (l.nlargest(2).idxmin())
print(df.apply(second_largest, axis = 1))