How can I join/merge 2 DataFrames, keeping values from the second?-CodePudding

I have two DataFrames with user details and scores. Some users have a second score and will be present in the second DataFrame. What I want to do is to join or merge them together (don't really mind which) to get their final score i.e. if they exist in the second DataFrame then take it from there, otherwise from the first.

DataFrame one:

    first_name  last_name              email  score               feedback
0         Bill      First  [email protected]   57.0  Needs to try again
1         Anna     Second  [email protected]   10.0  Not enough to proceed
2         Fred      Third  [email protected]  100.0  Excellent!

DataFrame two:

    first_name  last_name              email  score      feedback
0         Bill      First  [email protected]  100.0  Much better!

Desired result:

    first_name  last_name              email  score               feedback
0         Bill      First  [email protected]  100.0  Much better!
1         Anna     Second  [email protected]   10.0  Not enough to proceed
2         Fred      Third  [email protected]  100.0  Excellent!

The closest I've got is df1.merge(df2, on=['email', 'first_name', 'last_name', ], how='left') but that gives me columns: first_name last_name email score_x feedback_x score_y feedback_y

CodePudding user response：

Use pd.concat and drop_duplicates:

out = pd.concat([df2, df1]).drop_duplicates(['first_name', 'last_name', 'email'])
print(out)

# Output
  first_name last_name              email  score               feedback
0       Bill     First  [email protected]  100.0           Much better!
1       Anna    Second  [email protected]   10.0  Not enough to proceed
2       Fred     Third  [email protected]  100.0             Excellent!

Assuming first_name, last_name and email fields are the primary key to identify records like your merge.

Just before drop_duplicates, your dataframe looks like:

>>> pd.concat([df2, df1])
  first_name last_name              email  score               feedback
0       Bill     First  [email protected]  100.0           Much better!
0       Bill     First  [email protected]   57.0     Needs to try again  # dupe
1       Anna    Second  [email protected]   10.0  Not enough to proceed
2       Fred     Third  [email protected]  100.0             Excellent!

CodePudding user response：

Alternative using combine_first:

cols = ['email', 'first_name', 'last_name']
out = df2.set_index(cols).combine_first(df1.set_index(cols)).reset_index()

output:

               email first_name last_name  score               feedback
0  [email protected]       Bill     First  100.0           Much better!
1  [email protected]       Anna    Second   10.0  Not enough to proceed
2  [email protected]       Fred     Third  100.0             Excellent!