Home > Software design >  Concat dataframes using rows from second dataframe which don't exist in first dataframe based o
Concat dataframes using rows from second dataframe which don't exist in first dataframe based o

Time:04-07

I have 2 large dataframes as follows:

df1

C1  C2  C3  V1
a   e   k   1
b   f   j   2
d   g   h   3    

df2

C1  C2  C3  V1
a   e   m   21
d   g   p   1
e   f   q   2
f   g   r   3

I want to get the following:

C1  C2  C3  V1
a   e   k   1
b   f   j   2
d   g   h   3
e   f   q   2
f   g   r   3

Where I have only included rows from df2 where C1 and C2 are not present in df1. i.e. I dropped first row of df2 as a, e was already present in df1.

I am trying:

df1['id'] = df1[['C1', 'C2']].astype(str).apply('_'.join, axis=1)
df2['id'] = df2[['C1', 'C2']].astype(str).apply('_'.join, axis=1)
df3 = df2[~df2['id'].isin(df1['id'])]
pd.concat([df1, df3])

Is there a better way to do this using some inbuilt function as I run into this kind of situation a lot with varying number of columns.

CodePudding user response:

You can use combine_first:

(df1
 .set_index(['C1', 'C2'])
 .combine_first(df2.set_index(['C1', 'C2']))
 .reset_index()
)

output:

  C1 C2 C3  V1
0  a  e  k   1
1  b  f  j   2
2  d  g  h   3
3  e  f  q   2
4  f  g  r   3
using merge

This is a bit more complex, you need to perform a negative merge with help of indicator=True:

cols = ['C1', 'C2']
df2_only = (df1[cols]
            .merge(df2, on=cols, indicator=True, how='right')
            .query('_merge == "right_only"').drop(columns='_merge')
           )

out = pd.concat([df1, df2_only])

CodePudding user response:

append df2 to df1 and drop duplicates

new=df1.append(df2).drop_duplicates(subset=['C1','C2'],keep='first')

print(new)

 C1 C2 C3  V1
0  a  e  k   1
1  b  f  j   2
2  d  g  h   3
3  e  f  q   2
4  f  g  r   3
  • Related