my reference data frame is of the following type:
tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
['Albania', 'ALL', 2],
['France', 'EUR', 1]],
columns=['country', 'currency', 'score'])
df = pd.DataFrame(
[['France','AFN'],['France','ALL'],['France','EUR'],
['Albania','AFN'],['Albania','ALL'],['Albania','EUR']],
columns=['country','currency'])
However, my working df
may have country and currency differently, for example country =='France'
and 'currency'=='AFN'
. I would like to create a column with max score based on either, i.e., the above combo would imply a score of 4.
Desired output:
df_out = pd.DataFrame(
[['France','AFN', 4],['France','ALL', 2],['France','EUR', 1],
['Albania','AFN', 4],['Albania','ALL', 2],['Albania','EUR', 2],
['Afghanistan','AFN', 4],['Afghanistan','ALL', 4],['Afghanistan','EUR', 4]],
columns=['country','currency', 'score'])
Out[102]:
country currency score
0 France AFN 4
1 France ALL 2
2 France EUR 1
3 Albania AFN 4
4 Albania ALL 2
5 Albania EUR 2
6 Afghanistan AFN 4
7 Afghanistan ALL 4
8 Afghanistan EUR 4
Here is what I have so far and it's extremely clunky:
df = pd.merge(df, tbl[['country', 'score']],
how='left', on='country')
df['em_score'] = df['score']
df = df.drop('score', axis=1)
df = pd.merge(df, tbl[['currency', 'score']],
how='left', on='currency')
df['em_score'] = df[['em_score', 'score']].max(axis=1)
df = df.drop('score', axis=1)
CodePudding user response:
Here's a way to do it:
byCol = {col:tbl[[col,'score']].set_index(col) for col in tbl.columns if col != 'score'}
df['em_score'] = pd.concat([
df.join(byCol[col], on=col).score.rename('score_' col) for col in byCol
], axis=1).max(axis=1)
Explanation:
- for each column in
tbl
other thanscore
(in your case,country
andcurrency
), create a Series with that column as index - use
pd.concat()
to create a new dataframe with multiple columns, each a Series object created usingjoin()
between the workingdf
and one of the Series objects from the previous step - use
max()
on each row to get the desiredem_score
.
Full test code with sample df
:
import pandas as pd
tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
['Albania', 'ALL', 2],
['France', 'EUR', 1]],
columns=['country', 'currency', 'score'])
df = pd.DataFrame(
[['France','AFN'],['France','ALL'],['France','EUR'],
['Albania','AFN'],['Albania','ALL'],['Albania','EUR']],
columns=['country','currency'])
print('','tbl',tbl,sep='\n')
print('','df',df,sep='\n')
byCol = {col:tbl[[col,'score']].set_index(col) for col in tbl.columns if col != 'score'}
df['em_score'] = pd.concat([
df.join(byCol[col], on=col).score.rename('score_' col) for col in byCol
], axis=1).max(axis=1)
print('','output',df,sep='\n')
Output:
tbl
country currency score
0 Afghanistan AFN 4
1 Albania ALL 2
2 France EUR 1
df
country currency
0 France AFN
1 France ALL
2 France EUR
3 Albania AFN
4 Albania ALL
5 Albania EUR
output
country currency em_score
0 France AFN 4
1 France ALL 2
2 France EUR 1
3 Albania AFN 4
4 Albania ALL 2
5 Albania EUR 2
CodePudding user response:
So, for case if you have
tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
['Albania', 'ALL', 2],
['France', 'EUR', 1],
['France', 'AFN', 0]],
columns=['country', 'currency', 'score'])
This code will find the max of either the max score for the country or the currency of each row:
np.maximum(np.array(tbl.groupby(['country']).max().loc[tbl['country'], 'score']),
np.array(tbl.groupby(['currency']).max().loc[tbl['currency'], 'score']))