Home > Enterprise >  Select Best Value Based on Highest Score
Select Best Value Based on Highest Score

Time:09-30

Need to create 2 new column(expected_value and expected_score) by identifying maximum score between Score and Score.1 and returning corresponding value from Value and Value.1 columns whose score is that maximum.

dict = {'Alias': ['JOB_NUMBER','GOOD_DATE','SALARY','SALARY_DATE','NAME','NUMBER','NEXT_DATE','COMMISSION_AMOUNT',
'BALANCE','ADDRESS'],'Value': ['123456789','8/10/2022','$373.28','8/10/2022','Group Group','08764510160','8/10/2022','$9.22','$125,770.04','11111111 N Test Rd'],'Score': [98.0,98.0,95.0,98.0,92.0,99.0,51.0,94.0,82.0,95.0],'page': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],'Alias.1': ['JOB_NUMBER','GOOD_DATE','SALARY','SALARY_DATE','NAME','NUMBER','NEXT_DATE',
'COMMISSION_AMOUNT','BALANCE','ADDRESS'],'Value.1': [9090909090,np.nan,np.nan,np.nan,np.nan,
np.nan,'8/1/2022',np.nan,np.nan,np.nan],'Score.1': [99.0,0.0,0.0,0.0,0.0,0.0,98.0,0.0,0.0,0.0],
'page.1': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

df = pd.DataFrame(dict)

Original DataFrame: enter image description here

Expected Result: enter image description here

My Attempt not able to make any inroad:

# column name with max duration value
max_col_name = df.filter(like='Score', axis=1).max(axis = 1).idxmax()

# index of max_col_name
max_col_idx =df.columns.get_loc(max_col_name)

# row index of max value in max_col_name
max_row_idx = df[max_col_name].idxmax()

# output with .loc
df.iloc[max_row_idx, [0, max_col_idx, max_col_idx   1 ]]

CodePudding user response:

With dynamic columns for Score (e.g. Score, Score.1, Score.2 ... Score.n) and same with Value columns, try as follows:

# don't use "dict" as a variable to store a `dict`, it will overwrite
# its built-in functionality!

df = pd.DataFrame(my_dict)

# group all cols that start with `Score`|`Value`

scores = df.filter(like='Score', axis=1).columns
values = df.filter(like='Value', axis=1).columns

# create a mask with `True` for each cell in `df[scores]` that
# matches the max of its row (`axis=1`)

max_values = df[scores].max(axis=1)
m = df[scores].eq(max_values, axis=0).to_numpy()

# apply (inverse of) mask to `df[values]` and fill up any `NaN` value
# in 1st column with the first value that is found in 2nd, 3rd, etc column
# and then select only that first column; will now always be filled

df['expected_value'] = df[values].mask(~m).fillna(method='bfill', axis=1).iloc[:,0]
df['expected_score'] = max_values

print(df)

               Alias               Value  ...      expected_value  expected_score
0         JOB_NUMBER           123456789  ...          9090909090            99.0
1          GOOD_DATE           8/10/2022  ...           8/10/2022            98.0
2             SALARY             $373.28  ...             $373.28            95.0
3        SALARY_DATE           8/10/2022  ...           8/10/2022            98.0
4               NAME         Group Group  ...         Group Group            92.0
5             NUMBER         08764510160  ...         08764510160            99.0
6          NEXT_DATE           8/10/2022  ...            8/1/2022            98.0
7  COMMISSION_AMOUNT               $9.22  ...               $9.22            94.0
8            BALANCE         $125,770.04  ...         $125,770.04            82.0
9            ADDRESS  11111111 N Test Rd  ...  11111111 N Test Rd            95.0

CodePudding user response:

here is one way to do it

# identify the column names that has a max score in a row
df['maxSc']=df.filter(like='Score').idxmax(axis=1)

#Value columns are similarly named, so translate Score to Value columns
df['maxVal']=df['maxSc'].replace(r'Score','Value', regex=True)

# pick the score and value based on identified columns, create a string of the two
# then split to create the desire columns
# finally remove teh temp columns

(df[['expectedScore','expectedValue']] = 
    df.apply(lambda x: str(x[x['maxSc']])  "|"  str(x[x['maxVal']])  , axis=1)
    .str
    .split('|', expand=True)
)
df.drop(columns=['maxSc', 'maxVal'])

    Alias               Value               Score   page    Alias.1         Value.1     Score.1     page.1  expectedScore   expectedValue
0   JOB_NUMBER          123456789           98.0    0   JOB_NUMBER          9090909090  99.0            1       99.0        9090909090
1   GOOD_DATE           8/10/2022           98.0    0   GOOD_DATE           NaN         0.0             1       98.0        8/10/2022
2   SALARY              $373.28             95.0    0   SALARY              NaN         0.0             1       95.0        $373.28
3   SALARY_DATE         8/10/2022           98.0    0   SALARY_DATE         NaN         0.0             1       98.0        8/10/2022
4   NAME                Group Group         92.0    0   NAME                NaN         0.0             1       92.0        Group Group
5   NUMBER              08764510160         99.0    0   NUMBER              NaN         0.0             1       99.0        08764510160
6   NEXT_DATE           8/10/2022           51.0    0   NEXT_DATE           8/1/2022    98.0            1       98.0        8/1/2022
7   COMMISSION_AMOUNT   $9.22               94.0    0   COMMISSION_AMOUNT   NaN         0.0             1       94.0        $9.22
8   BALANCE             $125,770.04         82.0    0   BALANCE             NaN         0.0             1       82.0        $125,770.04
9   ADDRESS             11111111 N Test Rd  95.0    0   ADDRESS             NaN         0.0             1       95.0        11111111 N Test Rd

CodePudding user response:

Dump into numpy as much as possible - push your loops into C where possible; this also assumes an alignment column wise: score comes before score.1, value comes before value.1; you can always ensure that by sorting the columns index:

Get the scores and values arrays:

scores = df.filter(like='Score').to_numpy()
values = df.filter(like='Value').to_numpy()

Get the index positions for the maximum scores:

cols = np.argmax(scores, axis = 1)

Index scores and values to get the actual values:

rows = np.arange(len(scores))
scores = scores[rows, cols]
values = values[rows, cols]

Create final dataframe:

df.assign(
    expected_value = values,
    expected_score = scores)

               Alias               Value  Score  page            Alias.1     Value.1  Score.1  page.1      expected_value  expected_score
0         JOB_NUMBER           123456789   98.0     0         JOB_NUMBER  9090909090     99.0       1          9090909090            99.0
1          GOOD_DATE           8/10/2022   98.0     0          GOOD_DATE         NaN      0.0       1           8/10/2022            98.0
2             SALARY             $373.28   95.0     0             SALARY         NaN      0.0       1             $373.28            95.0
3        SALARY_DATE           8/10/2022   98.0     0        SALARY_DATE         NaN      0.0       1           8/10/2022            98.0
4               NAME         Group Group   92.0     0               NAME         NaN      0.0       1         Group Group            92.0
5             NUMBER         08764510160   99.0     0             NUMBER         NaN      0.0       1         08764510160            99.0
6          NEXT_DATE           8/10/2022   51.0     0          NEXT_DATE    8/1/2022     98.0       1            8/1/2022            98.0
7  COMMISSION_AMOUNT               $9.22   94.0     0  COMMISSION_AMOUNT         NaN      0.0       1               $9.22            94.0
8            BALANCE         $125,770.04   82.0     0            BALANCE         NaN      0.0       1         $125,770.04            82.0
9            ADDRESS  11111111 N Test Rd   95.0     0            ADDRESS         NaN      0.0       1  11111111 N Test Rd            95.0
  • Related