How do I select the 3 columns with highest values from a row in a Pandas dataframe?-CodePudding

So I have a dataframe as follows:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([[1, 2, 3, 3, 2, 1], [4, 3, 6, 6 ,3 ,4], [7, 2, 9, 9, 2, 7]]),
                   columns=['a', 'b', 'c', 'a_select','b_select','c_select'])

df

Now, I may need to reorganize the dataframe (or use two) to accomplish this, but...

I'd like to select the 2 largest values from each '_select' column per row, then use that to mean the corresponding column.

For example, row 1 would mean the values from a & b, row 2 a & c (NOT the values from the _select columns that we're looking at).

Currently I'm just iterating each row - as that seems rather simple, but slow with a large dataset - however I can't figure out how to use an apply or lambda function to do the equivelant (or if it's even possible).

CodePudding user response：

Simple oneliner using nlargest

>>> df.filter(like='select').apply(lambda s: s.nlargest(2), 1).mean(1)

For performance, maybe numpy is useful:

>>> np.sort(df.filter(like='select').to_numpy(), 1)[:, -2:].mean(1)

To get values from the first columns, use argsort

>>> arr = df.filter(like='select').to_numpy()
>>> df[['a', 'b', 'c']].to_numpy()[[[x] for x in np.arange(len(arr))], 
np.argsort(arr, 1)][:, -2:].mean(1)

array([1.5, 5. , 8. ])