Find the max value(s) for specific columns in every row and match the corresponding column values to-CodePudding

I have a table like this,

Hour represents the hour of the day agent reached the customer successfully.

Cus1 to Cus4 are the top 4 time slots provided to the agent to call the customer.

Cus1_Score to Cus4_Score represents the probability of success for the call during the hours corresponding to Cus1 to Cus4.

I need to get YES or NO values for Match1 and Match2 columns.

Match1 represents this - Check for the highest score in the columns Cus1_Score to Cus4_Score. If no duplicates, check if Cus1 = Hour and if it's a match we write YES else NO. If duplicates exist, check all the columns with highest score and and check if the Hour number matches to any of the values in the high score represented Cus columns(Cus1 to Cus4). Again, if it's match YES else NO.

Match2 represents this - Check for the second highest score in the columns Cus1_Score to Cus4_Score. If no duplicates, check if Cus2 = Hour and if it's a match we write YES else NO. If duplicates exist, check all the columns with second highest score and and check if the Hour number matches to any of the values in the second high score represented Cus columns(Cus1 to Cus4). Again, if it's match YES else NO.

ID   Hour  Cus1  Cus2  Cus3  Cus4  Cus1_Score  Cus2_Score  Cus3_Score  Cus4_Score  Match1  Match2

1     11    8     10    11    14      0.62        0.59        0.59        0.54       NO     YES 

2     13    15    16    18    13      0.57        0.57        0.57        0.57       YES    NO

3     16    09    14    16    12      0.67        0.54        0.48        0.34       NO     NO

4     08    11    08    12    17      0.58        0.55        0.43        0.25       NO     YES

I tried using idxmax() and nlargest(2) functions and I have no luck as I am not very strong in Python. Highly appreciate your help.

CodePudding user response：

You need to write a function for row operation with all the criterion that you explained. Then you can apply this function to the rows using, .iterrows, for, or .apply method which you can find them here.

CodePudding user response：

Hopefully I got the conditions right; the calculations below rely on Pandas aligning on the index before assigning values; so I create a long dataframe, create conditions off the long dataframe, reduce the size to unique indices and assign the outcome to the original dataframe:

Create a long dataframe to work out the conditions(I'm using pyjanitor pivot_longer to reshape here because of the convenience; you can do this in a couple of steps without pyjanitor):

# pip install git https://github.com/pyjanitor-devs/pyjanitor.git
import janitor as jn
import pandas as pd
res = (df.pivot_longer(['ID', 'Hour'], 
                       names_pattern= [r'. \d$', r'. Score$'], 
                       names_to = ['cus', 'score'], 
                       ignore_index = False)
       )

print(res)

   ID  Hour  cus  score
0   1    11    8   0.62
1   2    13   15   0.57
2   3    16    9   0.67
3   4     8   11   0.58
0   1    11   10   0.59
1   2    13   16   0.57
2   3    16   14   0.54
3   4     8    8   0.55
0   1    11   11   0.59
1   2    13   18   0.57
2   3    16   16   0.48
3   4     8   12   0.43
0   1    11   14   0.54
1   2    13   13   0.57
2   3    16   12   0.34
3   4     8   17   0.25

Create conditions for match1:

# get booleans for max and no duplicates
res = res.sort_index()
max1 = res.groupby(level=0).score.transform('max')
# max without duplicates
cond1 = res.score.eq(max1).groupby(level=0).sum().eq(1)
cond1 = cond1 & df.Hour.eq(df.Cus1)
# max with duplicates
cond2 = res.score.eq(max1).groupby(level=0).transform('sum').gt(1) 
cond2 &= res.Hour.eq(res.cus)
cond2 = cond2.groupby(level=0).any()
df['match1'] = np.where(cond1 | cond2, 'YES', 'NO')

match2:


second_largest = (res.sort_values('score', ascending=False)
                     .groupby(level=0, sort = False)
                     .score
                     .transform('nth', 1)
                  )
second_largest = second_largest.sort_index()
# second largest without duplicates
cond_1 = res.score.eq(second_largest).groupby(level=0).sum().eq(1)
cond_1 &= df.Hour.eq(df.Cus2)
# second largest with duplicates
cond_2 = res.score.eq(second_largest).groupby(level=0).transform('sum').gt(1)
cond_2 &= res.Hour.eq(res.cus)
cond_2 = cond_2.groupby(level=0).any()
df['match2'] = np.where(cond_1 | cond_2, 'YES', 'NO')

df

      ID  Hour  Cus1  Cus2  Cus3  Cus4  Cus1_Score  Cus2_Score  Cus3_Score  Cus4_Score match1 match2
0   1    11     8    10    11    14        0.62        0.59        0.59        0.54     NO    YES
1   2    13    15    16    18    13        0.57        0.57        0.57        0.57    YES    YES
2   3    16     9    14    16    12        0.67        0.54        0.48        0.34     NO     NO
3   4     8    11     8    12    17        0.58        0.55        0.43        0.25     NO    YES