Home > Mobile >  pandas: find maximum across column range; use second column range for tie breaks
pandas: find maximum across column range; use second column range for tie breaks

Time:01-12

I have a data frame with two corresponding sets of columns, e.g. like this sample containing people and their rating of three fruits as well as their ability to detect a fruit ('corresponding' means that banana_rati corresponds to banana_reco etc.).

import pandas as pd    
df_raw = pd.DataFrame(data=[     ["name1", 10, 10,  9, 10, 10, 10],
                                 ["name2", 10, 10,  8, 10,  8,  4],
                                 ["name3", 10,  8,  8, 10,  8,  8],
                                 ["name4",  5, 10, 10,  5, 10,  8]],
                       columns=["name", "banana_rati", "mango_rati", "orange_rati",
                                "banana_reco", "mango_reco", "orange_reco"])

Suppose I now want to find each respondent's favorite fruit, which I define was the highest rated fruit.

I do this via:

cols_find_max = ["banana_rati", "mango_rati", "orange_rati"]  # columns to find the maximum in
mxs = df_raw[cols_find_max].eq(df_raw[cols_find_max].max(axis=1), axis=0)  # bool indicator if the cell contains the row-wise maximum value across cols_find_max

However, some respondents rated more than one fruit with the highes value:

df_raw['highest_rated_fruits'] = mxs.dot(mxs.columns   ' ').str.rstrip(', ').str.replace("_rati", "").str.split()

df_raw['highest_rated_fruits']
# Out:
# [banana, mango]  
# [banana, mango]  
#        [banana]  
# [mango, orange] 

I now want to use the maximum of ["banana_reco", "mango_reco", "orange_reco"] for tie breaks. If this also gives no tie break, I want a random selection of fruits from the so-determined favorite ones.

Can someone help me with this?

The expected output is:

df_raw['fav_fruit']
# Out
# mango  # <- random selection from banana (rating: 10, recognition: 10) and mango (same values)
# banana # <- highest ratings: banana, mango; highest recognition: banana
# banana # <- highest rating: banana
# mango  # <- highest ratings: mango, orange; highest recognition: mango

CodePudding user response:

UPDATED

Here's a way to do what your question asks:

from random import sample
df = pd.DataFrame({
    'name':[c[:-len('_rati')] for c in df_raw.columns if c.endswith('_rati')]})
df = df.assign(rand=df.name   '_rand', tupl=df.name   '_tupl')
num = len(df)
df_raw[df.rand] = [sample(range(num), k=num) for _ in range(len(df_raw))]
df_ord = pd.DataFrame(
    {f'{fr}_tupl':df_raw.apply(
        lambda x: tuple(x[(f'{fr}_{suff}' for suff in ('rati','reco','rand'))]), axis=1) 
        for fr in df.name})
df_raw['fav_fruit'] = df_ord.apply(lambda x: df.name[list(x==x.max())].squeeze(), axis=1)
df_raw = df_raw.drop(columns=df.rand)

Sample output:

    name  banana_rati  mango_rati  orange_rati  banana_reco  mango_reco  orange_reco fav_fruit
0  name1           10          10            9           10          10           10    banana
1  name2           10          10            8           10           8            4    banana
2  name3           10           8            8           10           8            8    banana
3  name4            5          10           10            5          10            8     mango

Explanation:

  • create one new column per fruit ending in rand to collectively hold a random shuffled sequence of those fruits (0 through number of fruits) for each row
  • create one new column per fruit ending in tupl containing 3-tuples of rati, reco, rand corresponding to that fruit
  • because the rand value for each fruit in a given row is distinct, the 3-tuples will break ties, and therefore, for each row we can simply look up the favorite fruit, namely the fruit whose tuple matches the row's max tuple
  • drop intermediate columns and we're done.

CodePudding user response:

Try:

import numpy as np
mxs.dot(mxs.columns   ' ').str.rstrip(', ').str.replace("_rati", "").str.split().apply(lambda x: x[np.random.randint(len(x))])

This adds .apply(lambda x: x[np.random.randint(len(x))]) to the end of your last statement and randomly selects an element from the list.

Run 1:

0    banana
1    banana
2    banana
3    orange

Run 2:

0     mango
1    banana
2    banana
3    orange
  • Related