Need to create 2 new column(expected_value and expected_score) by identifying maximum score between Score and Score.1 and returning corresponding value from Value and Value.1 columns whose score is that maximum.
dict = {'Alias': ['JOB_NUMBER','GOOD_DATE','SALARY','SALARY_DATE','NAME','NUMBER','NEXT_DATE','COMMISSION_AMOUNT',
'BALANCE','ADDRESS'],'Value': ['123456789','8/10/2022','$373.28','8/10/2022','Group Group','08764510160','8/10/2022','$9.22','$125,770.04','11111111 N Test Rd'],'Score': [98.0,98.0,95.0,98.0,92.0,99.0,51.0,94.0,82.0,95.0],'page': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],'Alias.1': ['JOB_NUMBER','GOOD_DATE','SALARY','SALARY_DATE','NAME','NUMBER','NEXT_DATE',
'COMMISSION_AMOUNT','BALANCE','ADDRESS'],'Value.1': [9090909090,np.nan,np.nan,np.nan,np.nan,
np.nan,'8/1/2022',np.nan,np.nan,np.nan],'Score.1': [99.0,0.0,0.0,0.0,0.0,0.0,98.0,0.0,0.0,0.0],
'page.1': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(dict)
My Attempt not able to make any inroad:
# column name with max duration value
max_col_name = df.filter(like='Score', axis=1).max(axis = 1).idxmax()
# index of max_col_name
max_col_idx =df.columns.get_loc(max_col_name)
# row index of max value in max_col_name
max_row_idx = df[max_col_name].idxmax()
# output with .loc
df.iloc[max_row_idx, [0, max_col_idx, max_col_idx 1 ]]
CodePudding user response:
With dynamic columns for Score
(e.g. Score
, Score.1
, Score.2
... Score.n
) and same with Value
columns, try as follows:
# don't use "dict" as a variable to store a `dict`, it will overwrite
# its built-in functionality!
df = pd.DataFrame(my_dict)
# group all cols that start with `Score`|`Value`
scores = df.filter(like='Score', axis=1).columns
values = df.filter(like='Value', axis=1).columns
# create a mask with `True` for each cell in `df[scores]` that
# matches the max of its row (`axis=1`)
max_values = df[scores].max(axis=1)
m = df[scores].eq(max_values, axis=0).to_numpy()
# apply (inverse of) mask to `df[values]` and fill up any `NaN` value
# in 1st column with the first value that is found in 2nd, 3rd, etc column
# and then select only that first column; will now always be filled
df['expected_value'] = df[values].mask(~m).fillna(method='bfill', axis=1).iloc[:,0]
df['expected_score'] = max_values
print(df)
Alias Value ... expected_value expected_score
0 JOB_NUMBER 123456789 ... 9090909090 99.0
1 GOOD_DATE 8/10/2022 ... 8/10/2022 98.0
2 SALARY $373.28 ... $373.28 95.0
3 SALARY_DATE 8/10/2022 ... 8/10/2022 98.0
4 NAME Group Group ... Group Group 92.0
5 NUMBER 08764510160 ... 08764510160 99.0
6 NEXT_DATE 8/10/2022 ... 8/1/2022 98.0
7 COMMISSION_AMOUNT $9.22 ... $9.22 94.0
8 BALANCE $125,770.04 ... $125,770.04 82.0
9 ADDRESS 11111111 N Test Rd ... 11111111 N Test Rd 95.0
CodePudding user response:
here is one way to do it
# identify the column names that has a max score in a row
df['maxSc']=df.filter(like='Score').idxmax(axis=1)
#Value columns are similarly named, so translate Score to Value columns
df['maxVal']=df['maxSc'].replace(r'Score','Value', regex=True)
# pick the score and value based on identified columns, create a string of the two
# then split to create the desire columns
# finally remove teh temp columns
(df[['expectedScore','expectedValue']] =
df.apply(lambda x: str(x[x['maxSc']]) "|" str(x[x['maxVal']]) , axis=1)
.str
.split('|', expand=True)
)
df.drop(columns=['maxSc', 'maxVal'])
Alias Value Score page Alias.1 Value.1 Score.1 page.1 expectedScore expectedValue
0 JOB_NUMBER 123456789 98.0 0 JOB_NUMBER 9090909090 99.0 1 99.0 9090909090
1 GOOD_DATE 8/10/2022 98.0 0 GOOD_DATE NaN 0.0 1 98.0 8/10/2022
2 SALARY $373.28 95.0 0 SALARY NaN 0.0 1 95.0 $373.28
3 SALARY_DATE 8/10/2022 98.0 0 SALARY_DATE NaN 0.0 1 98.0 8/10/2022
4 NAME Group Group 92.0 0 NAME NaN 0.0 1 92.0 Group Group
5 NUMBER 08764510160 99.0 0 NUMBER NaN 0.0 1 99.0 08764510160
6 NEXT_DATE 8/10/2022 51.0 0 NEXT_DATE 8/1/2022 98.0 1 98.0 8/1/2022
7 COMMISSION_AMOUNT $9.22 94.0 0 COMMISSION_AMOUNT NaN 0.0 1 94.0 $9.22
8 BALANCE $125,770.04 82.0 0 BALANCE NaN 0.0 1 82.0 $125,770.04
9 ADDRESS 11111111 N Test Rd 95.0 0 ADDRESS NaN 0.0 1 95.0 11111111 N Test Rd
CodePudding user response:
Dump into numpy as much as possible - push your loops into C where possible; this also assumes an alignment column wise: score comes before score.1, value comes before value.1; you can always ensure that by sorting the columns index:
Get the scores and values arrays:
scores = df.filter(like='Score').to_numpy()
values = df.filter(like='Value').to_numpy()
Get the index positions for the maximum scores:
cols = np.argmax(scores, axis = 1)
Index scores and values to get the actual values:
rows = np.arange(len(scores))
scores = scores[rows, cols]
values = values[rows, cols]
Create final dataframe:
df.assign(
expected_value = values,
expected_score = scores)
Alias Value Score page Alias.1 Value.1 Score.1 page.1 expected_value expected_score
0 JOB_NUMBER 123456789 98.0 0 JOB_NUMBER 9090909090 99.0 1 9090909090 99.0
1 GOOD_DATE 8/10/2022 98.0 0 GOOD_DATE NaN 0.0 1 8/10/2022 98.0
2 SALARY $373.28 95.0 0 SALARY NaN 0.0 1 $373.28 95.0
3 SALARY_DATE 8/10/2022 98.0 0 SALARY_DATE NaN 0.0 1 8/10/2022 98.0
4 NAME Group Group 92.0 0 NAME NaN 0.0 1 Group Group 92.0
5 NUMBER 08764510160 99.0 0 NUMBER NaN 0.0 1 08764510160 99.0
6 NEXT_DATE 8/10/2022 51.0 0 NEXT_DATE 8/1/2022 98.0 1 8/1/2022 98.0
7 COMMISSION_AMOUNT $9.22 94.0 0 COMMISSION_AMOUNT NaN 0.0 1 $9.22 94.0
8 BALANCE $125,770.04 82.0 0 BALANCE NaN 0.0 1 $125,770.04 82.0
9 ADDRESS 11111111 N Test Rd 95.0 0 ADDRESS NaN 0.0 1 11111111 N Test Rd 95.0