For example I have a dataframe df
:
trade_date | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2010-01-04 00:00:00 | 5 | 4 | 2 | 1 | 3 | 6 | 8 | 9 | 10 | 7 | 11 | 12 |
2010-01-05 00:00:00 | 5 | 4 | 3 | 1 | 2 | 6 | 8 | 9 | 10 | 7 | 12 | 11 |
2010-01-06 00:00:00 | 5 | 4 | 3 | 1 | 2 | 6 | 8 | 9 | 10 | 7 | 12 | 11 |
2010-01-07 00:00:00 | 5 | 4 | 3 | 1 | 2 | 6 | 8 | 9 | 10 | 7 | 12 | 11 |
2010-01-08 00:00:00 | 5 | 4 | 3 | 1 | 2 | 6 | 7 | 9 | 10 | 8 | 12 | 11 |
2010-01-11 00:00:00 | 5 | 4 | 3 | 1 | 2 | 6 | 7 | 9 | 10 | 8 | 12 | 11 |
2010-01-12 00:00:00 | 5 | 4 | 3 | 1 | 2 | 6 | 7 | 9 | 10 | 8 | 12 | 11 |
2010-01-13 00:00:00 | 6 | 4 | 3 | 1 | 2 | 5 | 7 | 9 | 10 | 8 | 12 | 11 |
2010-01-14 00:00:00 | 6 | 4 | 3 | 1 | 2 | 5 | 7 | 9 | 10 | 8 | 12 | 11 |
2010-01-15 00:00:00 | 6 | 5 | 3 | 1 | 2 | 4 | 7 | 9 | 10 | 8 | 12 | 11 |
and I want to get this result:
df.apply(lambda r: r.nlargest(2).index.max(), axis=1)
All functions used in apply are numpy/pandas's built-in function, so I think there should be some way to get rid of python level for loop and make this transform much faster.
How can I do that?
CodePudding user response:
For improve performance use numpy - get position of sorted values by numpy.argsort
with filter first top 2 values with indexing and converting value to negative, then get columns names with maximal:
print (df)
01 02 03 04 05 06 07 08 09 10 11 12
trade_date
2010-01-04 00:00:00 5 40 20 1 3 6 8 9 10 7 11 12
2010-01-05 00:00:00 5 4 3 1 2 6 8 9 10 7 12 11
2010-01-06 00:00:00 5 4 3 1 2 6 8 9 100 7 12 11
2010-01-07 00:00:00 5 4 3 1 2 6 8 9 10 7 12 11
2010-01-08 00:00:00 5 4 3 1 20 6 7 9 10 8 12 11
2010-01-11 00:00:00 5 40 3 10 20 6 7 9 10 8 12 11
2010-01-12 00:00:00 5 4 3 1 2 6 7 9 10 8 12 11
2010-01-13 00:00:00 6 4 3 1 2 5 7 9 10 8 12 11
2010-01-14 00:00:00 600 40 3 1 2 5 7 9 10 8 12 11
2010-01-15 00:00:00 6 5 3 1 200 4 70 9 10 8 12 11
print (df.apply(lambda r: r.nlargest(2).index.max(), axis=1))
trade_date
2010-01-04 00:00:00 03
2010-01-05 00:00:00 12
2010-01-06 00:00:00 11
2010-01-07 00:00:00 12
2010-01-08 00:00:00 11
2010-01-11 00:00:00 05
2010-01-12 00:00:00 12
2010-01-13 00:00:00 12
2010-01-14 00:00:00 02
2010-01-15 00:00:00 07
dtype: object
c = df.columns.to_numpy()
print (np.max(c[np.argsort(-df.to_numpy())[:, :2]], axis=1))
['03' '12' '11' '12' '11' '05' '12' '12' '02' '07']