I have a data frame with two corresponding sets of columns, e.g. like this sample containing people and their rating of three fruits as well as their ability to detect a fruit ('corresponding' means that banana_rati
corresponds to banana_reco
etc.).
import pandas as pd
df_raw = pd.DataFrame(data=[ ["name1", 10, 10, 9, 10, 10, 10],
["name2", 10, 10, 8, 10, 8, 4],
["name3", 10, 8, 8, 10, 8, 8],
["name4", 5, 10, 10, 5, 10, 8]],
columns=["name", "banana_rati", "mango_rati", "orange_rati",
"banana_reco", "mango_reco", "orange_reco"])
Suppose I now want to find each respondent's favorite fruit, which I define was the highest rated fruit.
I do this via:
cols_find_max = ["banana_rati", "mango_rati", "orange_rati"] # columns to find the maximum in
mxs = df_raw[cols_find_max].eq(df_raw[cols_find_max].max(axis=1), axis=0) # bool indicator if the cell contains the row-wise maximum value across cols_find_max
However, some respondents rated more than one fruit with the highes value:
df_raw['highest_rated_fruits'] = mxs.dot(mxs.columns ' ').str.rstrip(', ').str.replace("_rati", "").str.split()
df_raw['highest_rated_fruits']
# Out:
# [banana, mango]
# [banana, mango]
# [banana]
# [mango, orange]
I now want to use the maximum of ["banana_reco", "mango_reco", "orange_reco"]
for tie breaks. If this also gives no tie break, I want a random selection of fruits from the so-determined favorite ones.
Can someone help me with this?
The expected output is:
df_raw['fav_fruit']
# Out
# mango # <- random selection from banana (rating: 10, recognition: 10) and mango (same values)
# banana # <- highest ratings: banana, mango; highest recognition: banana
# banana # <- highest rating: banana
# mango # <- highest ratings: mango, orange; highest recognition: mango
CodePudding user response:
UPDATED
Here's a way to do what your question asks:
from random import sample
df = pd.DataFrame({
'name':[c[:-len('_rati')] for c in df_raw.columns if c.endswith('_rati')]})
df = df.assign(rand=df.name '_rand', tupl=df.name '_tupl')
num = len(df)
df_raw[df.rand] = [sample(range(num), k=num) for _ in range(len(df_raw))]
df_ord = pd.DataFrame(
{f'{fr}_tupl':df_raw.apply(
lambda x: tuple(x[(f'{fr}_{suff}' for suff in ('rati','reco','rand'))]), axis=1)
for fr in df.name})
df_raw['fav_fruit'] = df_ord.apply(lambda x: df.name[list(x==x.max())].squeeze(), axis=1)
df_raw = df_raw.drop(columns=df.rand)
Sample output:
name banana_rati mango_rati orange_rati banana_reco mango_reco orange_reco fav_fruit
0 name1 10 10 9 10 10 10 banana
1 name2 10 10 8 10 8 4 banana
2 name3 10 8 8 10 8 8 banana
3 name4 5 10 10 5 10 8 mango
Explanation:
- create one new column per fruit ending in
rand
to collectively hold a random shuffled sequence of those fruits (0 through number of fruits) for each row - create one new column per fruit ending in
tupl
containing 3-tuples ofrati, reco, rand
corresponding to that fruit - because the rand value for each fruit in a given row is distinct, the 3-tuples will break ties, and therefore, for each row we can simply look up the favorite fruit, namely the fruit whose tuple matches the row's max tuple
- drop intermediate columns and we're done.
CodePudding user response:
Try:
import numpy as np
mxs.dot(mxs.columns ' ').str.rstrip(', ').str.replace("_rati", "").str.split().apply(lambda x: x[np.random.randint(len(x))])
This adds .apply(lambda x: x[np.random.randint(len(x))])
to the end of your last statement and randomly selects an element from the list.
Run 1:
0 banana
1 banana
2 banana
3 orange
Run 2:
0 mango
1 banana
2 banana
3 orange