Home > Back-end >  extract conditional max based on a reference data frame
extract conditional max based on a reference data frame

Time:01-06

my reference data frame is of the following type:

tbl = pd.DataFrame([['Afghanistan', 'AFN',  4],
                   ['Albania',  'ALL',  2],
                   ['France',   'EUR',  1]],
                     columns=['country',    'currency', 'score'])
df = pd.DataFrame(
    [['France','AFN'],['France','ALL'],['France','EUR'],
    ['Albania','AFN'],['Albania','ALL'],['Albania','EUR']], 
    columns=['country','currency'])

However, my working df may have country and currency differently, for example country =='France' and 'currency'=='AFN'. I would like to create a column with max score based on either, i.e., the above combo would imply a score of 4.
Desired output:

df_out = pd.DataFrame(
    [['France','AFN', 4],['France','ALL', 2],['France','EUR', 1],
    ['Albania','AFN', 4],['Albania','ALL', 2],['Albania','EUR', 2],
    ['Afghanistan','AFN', 4],['Afghanistan','ALL', 4],['Afghanistan','EUR', 4]], 
    columns=['country','currency', 'score'])

Out[102]: 
       country currency  score
0       France      AFN      4
1       France      ALL      2
2       France      EUR      1
3      Albania      AFN      4
4      Albania      ALL      2
5      Albania      EUR      2
6  Afghanistan      AFN      4
7  Afghanistan      ALL      4
8  Afghanistan      EUR      4

Here is what I have so far and it's extremely clunky:

df = pd.merge(df, tbl[['country', 'score']],
             how='left', on='country')
df['em_score'] = df['score']
df = df.drop('score', axis=1)

df = pd.merge(df, tbl[['currency', 'score']],
             how='left', on='currency')
df['em_score'] = df[['em_score', 'score']].max(axis=1)
df = df.drop('score', axis=1)

CodePudding user response:

Here's a way to do it:

byCol = {col:tbl[[col,'score']].set_index(col) for col in tbl.columns if col != 'score'}
df['em_score'] = pd.concat([
    df.join(byCol[col], on=col).score.rename('score_'   col) for col in byCol
    ], axis=1).max(axis=1)

Explanation:

  • for each column in tbl other than score (in your case, country and currency), create a Series with that column as index
  • use pd.concat() to create a new dataframe with multiple columns, each a Series object created using join() between the working df and one of the Series objects from the previous step
  • use max() on each row to get the desired em_score.

Full test code with sample df:

import pandas as pd
tbl = pd.DataFrame([['Afghanistan', 'AFN',  4],
                   ['Albania',  'ALL',  2],
                   ['France',   'EUR',  1]],
                     columns=['country',    'currency', 'score'])

df = pd.DataFrame(
    [['France','AFN'],['France','ALL'],['France','EUR'],
    ['Albania','AFN'],['Albania','ALL'],['Albania','EUR']], 
    columns=['country','currency'])
print('','tbl',tbl,sep='\n')
print('','df',df,sep='\n')

byCol = {col:tbl[[col,'score']].set_index(col) for col in tbl.columns if col != 'score'}
df['em_score'] = pd.concat([
    df.join(byCol[col], on=col).score.rename('score_'   col) for col in byCol
    ], axis=1).max(axis=1)
print('','output',df,sep='\n')

Output:

tbl
       country currency  score
0  Afghanistan      AFN      4
1      Albania      ALL      2
2       France      EUR      1

df
   country currency
0   France      AFN
1   France      ALL
2   France      EUR
3  Albania      AFN
4  Albania      ALL
5  Albania      EUR

output
   country currency  em_score
0   France      AFN         4
1   France      ALL         2
2   France      EUR         1
3  Albania      AFN         4
4  Albania      ALL         2
5  Albania      EUR         2

CodePudding user response:

So, for case if you have

tbl = pd.DataFrame([['Afghanistan', 'AFN',  4],
                   ['Albania',  'ALL',  2],
                   ['France',   'EUR',  1],
                   ['France',   'AFN',  0]],
                     columns=['country',    'currency', 'score'])

This code will find the max of either the max score for the country or the currency of each row:

np.maximum(np.array(tbl.groupby(['country']).max().loc[tbl['country'], 'score']),
           np.array(tbl.groupby(['currency']).max().loc[tbl['currency'], 'score']))
  • Related