Home > OS >  pandas - replace rows in dataframe with rows of another dataframe by latest matching column entry
pandas - replace rows in dataframe with rows of another dataframe by latest matching column entry

Time:05-16

I have 2 dataframes, df1

   A  B  C
0  a  1  x
1  b  2  y
2  c  3  z
3  d  4  g
4  e  5  h

and df2:

   0  A   B  C
0  1  a   6  i
1  2  a   7  j
2  3  b   8  k
3  3  d  10  k

What I want to do is the following:
Whenever an entry in column A of df1 matches an entry in column A of df2, replace the matching row in df1 with parts of the row in df2

In my approach, in the below code, I tried to replace the first row (a,1,x) by (a,6,i) and consecutively with (a,7,j).
Also all other matching rows should be replaced:
So: (b,2,y) with (b,8,k) and (d,4,g) with (d,10,k)

Meaning that every row in df1 should be replaced by the latest match of column A in df2.

import numpy as np
import pandas as pd


columns = ["0","A", "B", "C"]

s1 = pd.Series(['a', 1, 'x'])
s2 = pd.Series(['b', 2, 'y'])
s3 = pd.Series(['c', 3, 'z'])
s4 = pd.Series(['d', 4, 'g'])
s5 = pd.Series(['e', 5, 'h'])
df1 = pd.DataFrame([list(s1), list(s2),list(s3),list(s4),list(s5)],  columns =  columns[1::])

s1 = pd.Series([1, 'a', 6, 'i'])
s2 = pd.Series([2, 'a', 7, 'j'])
s3 = pd.Series([3, 'b', 8, 'k'])
s4 = pd.Series([3, 'd', 10, 'k'])
df2 = pd.DataFrame([list(s1), list(s2),list(s3),list(s4)],  columns =  columns)

cols =  ["A", "B", "C"]

print(df1[columns[1::]])
print("---")

print(df2[columns])
print("---")

df1.loc[df1["A"].isin(df2["A"]), columns[1::]] = df2[columns[1::]]
print(df1)

The expected result would therefor be:

   A   B  C
0  a   7  j
1  b   2  y
2  c   3  z
3  d  10  k
4  e   5  h

But the above approach results in:

   A   B  C
0  a   6  i
1  a   7  j
2  c   3  z
3  d  10  k
4  e   5  h

I know i could do what I want with iterrows() but I don't think this is the supposed way of doing this right? (Also I have quite some data to process so I think this would not be the most effective - but please correct me If I'm wrong here, and in this case it would be ok to use it)

Or is there there any other easy approach to achieve this?

CodePudding user response:

Use:

df = pd.concat([df1, df2]).drop_duplicates(['A'], keep='last').sort_values('A').drop('0', axis=1)
print (df)
   A   B  C
1  a   7  j
2  b   8  k
2  c   3  z
3  d  10  k
4  e   5  h

CodePudding user response:

You can try merge then update

df1.update(df1[['A']].merge(df2.drop_duplicates('A', keep='last'), on='A', how='left')[['B', 'C']])
print(df1)

   A     B  C
0  a   7.0  j
1  b   8.0  k
2  c   3.0  z
3  d  10.0  k
4  e   5.0  h
  • Related