I have 2 large dataframes as follows:
df1
C1 C2 C3 V1
a e k 1
b f j 2
d g h 3
df2
C1 C2 C3 V1
a e m 21
d g p 1
e f q 2
f g r 3
I want to get the following:
C1 C2 C3 V1
a e k 1
b f j 2
d g h 3
e f q 2
f g r 3
Where I have only included rows from df2
where C1 and C2 are not present in df1
. i.e. I dropped first row of df2
as a, e
was already present in df1
.
I am trying:
df1['id'] = df1[['C1', 'C2']].astype(str).apply('_'.join, axis=1)
df2['id'] = df2[['C1', 'C2']].astype(str).apply('_'.join, axis=1)
df3 = df2[~df2['id'].isin(df1['id'])]
pd.concat([df1, df3])
Is there a better way to do this using some inbuilt function as I run into this kind of situation a lot with varying number of columns.
CodePudding user response:
You can use combine_first
:
(df1
.set_index(['C1', 'C2'])
.combine_first(df2.set_index(['C1', 'C2']))
.reset_index()
)
output:
C1 C2 C3 V1
0 a e k 1
1 b f j 2
2 d g h 3
3 e f q 2
4 f g r 3
using merge
This is a bit more complex, you need to perform a negative merge with help of indicator=True
:
cols = ['C1', 'C2']
df2_only = (df1[cols]
.merge(df2, on=cols, indicator=True, how='right')
.query('_merge == "right_only"').drop(columns='_merge')
)
out = pd.concat([df1, df2_only])
CodePudding user response:
append df2 to df1 and drop duplicates
new=df1.append(df2).drop_duplicates(subset=['C1','C2'],keep='first')
print(new)
C1 C2 C3 V1
0 a e k 1
1 b f j 2
2 d g h 3
3 e f q 2
4 f g r 3