Home > Software engineering >  overwrite dataframe rows with merge
overwrite dataframe rows with merge

Time:12-03

I am trying to overwrite specific rows and columns from one dataframe with a second dataframe rows and columns. I can't give the actual data but I will use a proxy here.

Here is an example and what I have tried:

df1
    UID   B     C     D     
0   X14   cat   red   One
1   X26   cat   blue  Two
2   X99   cat   pink  One
3   X54   cat   pink  One


df2
   UID    B     C      EX2
0   X14   dog   blue   coat
1   X88   rat   green  jacket
2   X99   bat   red    glasses
3   X29   bat   red    shoes

What I want to do here is overwrite column B and C in df1 with the values in df2 based upon UID. Therefore in this example X88 and X29 from df2 would not appear in df2. Also column D would not be affected and EX2 not

The outcome would looks as such:

df1
    UID   B     C     D     
0   X14   dog   blue  One
1   X26   cat   blue  Two
2   X99   bat   red   One
3   X54   cat   pink  One

I looked at this solution : Pandas merge two dataframe and overwrite rows However this appears to only update null values whereas I want an overwrite.

My attempt looked this like:

df = df1.merge(df2.filter(['B', 'C']), on=['B', 'C'], how='left')

For my data this actually doesn't seem to overwrite anything. Please could someone explain why this would not work?

Thanks

CodePudding user response:

One approach could be as follows:

  • First, use df.set_index to make column UID your index (inplace).
  • Next, use df.update with parameter overwrite set to True (also use set_index here for the "other" df: df2). This will overwrite all the columns that the two dfs have in common (i.e. B and C) based on index matches (i.e. now UID).
  • Finally, restore the standard index using df.reset_index.
df1.set_index('UID', inplace=True)
df1.update(df2.set_index('UID'), overwrite=True)
df1.reset_index(inplace=True)
print(df1)

   UID    B     C    D
0  X14  dog  blue  One
1  X26  cat  blue  Two
2  X99  bat   red  One
3  X54  cat  pink  One

CodePudding user response:

You can approach this by using reindex_like and combine_first.

Try this :

out = (
        df2.set_index("UID")
           .reindex_like(df1.set_index("UID"))
           .combine_first(df1.set_index("UID"))
           .reset_index()
       )

# Output :

print(out)

   UID    B     C    D
0  X14  dog  blue  One
1  X26  cat  blue  Two
2  X99  bat   red  One
3  X54  cat  pink  One

CodePudding user response:

Using Update function

df1.set_index('UID', inplace=True)
df2.set_index('UID', inplace=True)

df1.update(df2)
df1.reset_index(inplace=True)
print(df1)
Output
   UID    B     C    D
0  X14  dog  blue  One
1  X26  cat  blue  Two
2  X99  bat   red  One
3  X54  cat  pink  One
  • Related