pandas: find maximum across column range; use second column range for tie breaks-CodePudding

I have a data frame with two corresponding sets of columns, e.g. like this sample containing people and their rating of three fruits as well as their ability to detect a fruit ('corresponding' means that banana_rati corresponds to banana_reco etc.).

import pandas as pd    
df_raw = pd.DataFrame(data=[     ["name1", 10, 10,  9, 10, 10, 10],
                                 ["name2", 10, 10,  8, 10,  8,  4],
                                 ["name3", 10,  8,  8, 10,  8,  8],
                                 ["name4",  5, 10, 10,  5, 10,  8]],
                       columns=["name", "banana_rati", "mango_rati", "orange_rati",
                                "banana_reco", "mango_reco", "orange_reco"])

Suppose I now want to find each respondent's favorite fruit, which I define was the highest rated fruit.

I do this via:

cols_find_max = ["banana_rati", "mango_rati", "orange_rati"]  # columns to find the maximum in
mxs = df_raw[cols_find_max].eq(df_raw[cols_find_max].max(axis=1), axis=0)  # bool indicator if the cell contains the row-wise maximum value across cols_find_max

However, some respondents rated more than one fruit with the highes value:

df_raw['highest_rated_fruits'] = mxs.dot(mxs.columns   ' ').str.rstrip(', ').str.replace("_rati", "").str.split()

df_raw['highest_rated_fruits']
# Out:
# [banana, mango]  
# [banana, mango]  
#        [banana]  
# [mango, orange]

I now want to use the maximum of ["banana_reco", "mango_reco", "orange_reco"] for tie breaks. If this also gives no tie break, I want a random selection of fruits from the so-determined favorite ones.

Can someone help me with this?

The expected output is:

df_raw['fav_fruit']
# Out
# mango  # <- random selection from banana (rating: 10, recognition: 10) and mango (same values)
# banana # <- highest ratings: banana, mango; highest recognition: banana
# banana # <- highest rating: banana
# mango  # <- highest ratings: mango, orange; highest recognition: mango

CodePudding user response：

UPDATED

Here's a way to do what your question asks:

from random import sample
df = pd.DataFrame({
    'name':[c[:-len('_rati')] for c in df_raw.columns if c.endswith('_rati')]})
df = df.assign(rand=df.name   '_rand', tupl=df.name   '_tupl')
num = len(df)
df_raw[df.rand] = [sample(range(num), k=num) for _ in range(len(df_raw))]
df_ord = pd.DataFrame(
    {f'{fr}_tupl':df_raw.apply(
        lambda x: tuple(x[(f'{fr}_{suff}' for suff in ('rati','reco','rand'))]), axis=1) 
        for fr in df.name})
df_raw['fav_fruit'] = df_ord.apply(lambda x: df.name[list(x==x.max())].squeeze(), axis=1)
df_raw = df_raw.drop(columns=df.rand)

Sample output:

    name  banana_rati  mango_rati  orange_rati  banana_reco  mango_reco  orange_reco fav_fruit
0  name1           10          10            9           10          10           10    banana
1  name2           10          10            8           10           8            4    banana
2  name3           10           8            8           10           8            8    banana
3  name4            5          10           10            5          10            8     mango

Explanation:

create one new column per fruit ending in rand to collectively hold a random shuffled sequence of those fruits (0 through number of fruits) for each row
create one new column per fruit ending in tupl containing 3-tuples of rati, reco, rand corresponding to that fruit
because the rand value for each fruit in a given row is distinct, the 3-tuples will break ties, and therefore, for each row we can simply look up the favorite fruit, namely the fruit whose tuple matches the row's max tuple
drop intermediate columns and we're done.

CodePudding user response：

Try:

import numpy as np
mxs.dot(mxs.columns   ' ').str.rstrip(', ').str.replace("_rati", "").str.split().apply(lambda x: x[np.random.randint(len(x))])

This adds .apply(lambda x: x[np.random.randint(len(x))]) to the end of your last statement and randomly selects an element from the list.

Run 1:

0    banana
1    banana
2    banana
3    orange

Run 2:

0     mango
1    banana
2    banana
3    orange