I have 2 dataframes, and I want to update the score of rows with the same 2 column values.
How can I do that?
df 1:
DEP ID | Team ID | Group | Score
001 | 002 | A | 50
001 | 004 | A | 70
002 | 002 | A | 50
002 | 007 | A | 90
df 2 (a subset of one department):
DEP ID | Team ID | Group | Result
001 | 002 | A | 80
001 | 003 | A | 60
001 | 004 | A | 70
OUTPUT: All columns with the same TeamID and Group update the score
DEP ID | Team ID | Group | Score
001 | 002 | A | 80
001 | 004 | A | 70
002 | 002 | A | 80
002 | 007 | A | 90
I've tried doing pd merge left join but I'm not really getting the expected result.
Any suggestions?
CodePudding user response:
Here's a way to do it:
df1 = df1.join(df2.drop(columns='DEP ID').set_index(['Team ID', 'Group']), on=['Team ID', 'Group'])
df1.loc[df1.Result.notna(), 'Score'] = df1.Result
df1 = df1.drop(columns='Result')
Explanation:
- modify df2 so it has
Team ID, Group
as its index and its only column isResult
- use
join
to bring the new scores from df2 into aResult
column in df1 - use
loc
to updateScore
values for rows whereResult
is not null (i.e., rows for which an updatedScore
is available) - drop the
Result
column.
Full test code:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'DEP ID':['001','001','002','002'],
'Team ID':['002','004','002','007'],
'Group':['A','A','A','A'],
'Score':[50,70,50,90]})
df2 = pd.DataFrame({
'DEP ID':['001','001','001'],
'Team ID':['002','003','004'],
'Group':['A','A','A'],
'Result':[80,60,70]})
print(df1)
print(df2)
df1 = df1.join(df2.drop(columns='DEP ID').set_index(['Team ID', 'Group']), on=['Team ID', 'Group'])
df1.loc[df1.Result.notna(), 'Score'] = df1.Result
df1 = df1.drop(columns='Result')
print(df1)
Output:
index DEP ID Team ID Group Score
0 0 001 002 A 80
1 1 001 004 A 70
2 2 002 002 A 80
3 3 002 007 A 90
UPDATE:
If Result
column in df2 is instead named Score
, as asked by OP in a comment, then the code can be adjusted slightly as follows:
df1 = df1.join(df2.drop(columns='DEP ID').set_index(['Team ID', 'Group']), on=['Team ID', 'Group'], rsuffix='_NEW')
df1.loc[df1.Score_NEW.notna(), 'Score'] = df1.Score_NEW
df1 = df1.drop(columns='Score_NEW')