I have 2 dataframes, df1
A B C
0 a 1 x
1 b 2 y
2 c 3 z
3 d 4 g
4 e 5 h
and df2
:
0 A B C
0 1 a 6 i
1 2 a 7 j
2 3 b 8 k
3 3 d 10 k
What I want to do is the following:
Whenever an entry in column A
of df1
matches an entry in column A
of df2
, replace the matching row in df1
with parts of the row in df2
In my approach, in the below code, I tried to replace the first row
(a,1,x)
by (a,6,i)
and consecutively with (a,7,j)
.
Also all other matching rows should be replaced:
So: (b,2,y)
with (b,8,k)
and (d,4,g)
with (d,10,k)
Meaning that every row in df1
should be replaced by the latest match of column A
in df2
.
import numpy as np
import pandas as pd
columns = ["0","A", "B", "C"]
s1 = pd.Series(['a', 1, 'x'])
s2 = pd.Series(['b', 2, 'y'])
s3 = pd.Series(['c', 3, 'z'])
s4 = pd.Series(['d', 4, 'g'])
s5 = pd.Series(['e', 5, 'h'])
df1 = pd.DataFrame([list(s1), list(s2),list(s3),list(s4),list(s5)], columns = columns[1::])
s1 = pd.Series([1, 'a', 6, 'i'])
s2 = pd.Series([2, 'a', 7, 'j'])
s3 = pd.Series([3, 'b', 8, 'k'])
s4 = pd.Series([3, 'd', 10, 'k'])
df2 = pd.DataFrame([list(s1), list(s2),list(s3),list(s4)], columns = columns)
cols = ["A", "B", "C"]
print(df1[columns[1::]])
print("---")
print(df2[columns])
print("---")
df1.loc[df1["A"].isin(df2["A"]), columns[1::]] = df2[columns[1::]]
print(df1)
The expected result would therefor be:
A B C
0 a 7 j
1 b 2 y
2 c 3 z
3 d 10 k
4 e 5 h
But the above approach results in:
A B C
0 a 6 i
1 a 7 j
2 c 3 z
3 d 10 k
4 e 5 h
I know i could do what I want with iterrows()
but I don't think this is the supposed way of doing this right? (Also I have quite some data to process so I think this would not be the most effective - but please correct me If I'm wrong here, and in this case it would be ok to use it)
Or is there there any other easy approach to achieve this?
CodePudding user response:
Use:
df = pd.concat([df1, df2]).drop_duplicates(['A'], keep='last').sort_values('A').drop('0', axis=1)
print (df)
A B C
1 a 7 j
2 b 8 k
2 c 3 z
3 d 10 k
4 e 5 h
CodePudding user response:
You can try merge
then update
df1.update(df1[['A']].merge(df2.drop_duplicates('A', keep='last'), on='A', how='left')[['B', 'C']])
print(df1)
A B C
0 a 7.0 j
1 b 8.0 k
2 c 3.0 z
3 d 10.0 k
4 e 5.0 h