Comparing two dataframes on given columns


I have two dataframes, DF1 and DF2, with different column sets:

DF1
col1 col2 col3  col4 col5 col6 col7
-----------------------------------------
Asr  Suh  dervi xyz  yes  NY   2022-04-11
Bas  nav  dervi xyz  yes  CA   2022-04-11
Naz  adn  otc   xyz  no   NJ   2022-05-01

DF2
col1 col2 col3  col4
--------------------
Asr  Suh  dervi xyz
Bas  nav  dervi xyz

I want to compare these two dataframes on col2 and col3 and filter the non-matching rows out of DF1.

So the compare operation should give the row below:

df3
Naz  adn  otc  xyz  no  NJ  2022-05-01

CodePudding user response:

There's no built-in left anti join in pandas, but you can emulate one with an outer join and indicator=True, then keep the rows flagged left_only.

import pandas as pd

cols = ['col2', 'col3']
# Outer-join the key columns only; indicator=True adds a _merge column
# tagging each row 'left_only', 'right_only', or 'both'.
_df = df1[cols].merge(df2[cols], how='outer', indicator=True)
# Keep the key pairs that exist only in df1, then drop the helper column.
_df = _df[_df['_merge'] == 'left_only'].drop('_merge', axis=1)
# Join back to df1 to recover the full rows for those keys.
df3 = df1.merge(_df, on=cols)
df3
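
An equivalent sketch, assuming the (col2, col3) pairs in df2 are exactly the keys you want to exclude, is a MultiIndex membership test instead of the helper merge:

# Alternative sketch: keep rows of df1 whose (col2, col3) pair
# does not appear anywhere in df2.
mask = ~df1.set_index(cols).index.isin(df2.set_index(cols).index)
df3 = df1[mask]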

CodePudding user response:

You can use PySpark's left anti join to get your desired result. A left anti join ("leftanti") returns only the columns of the left DataFrame, for the rows that have no match in the right DataFrame.
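
The snippets below assume an active SparkSession bound to the name spark; a minimal sketch to create one (the app name is arbitrary):

from pyspark.sql import SparkSession

# Minimal SparkSession setup assumed by the rest of this answer.
spark = SparkSession.builder.appName("left-anti-join").getOrCreate()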

Preparing the dataframes,

data_1 = [('Asr', 'Suh', 'dervi', 'xyz', 'yes', 'NY', '2022-04-11'),
          ('Bas', 'nav', 'dervi', 'xyz', 'yes', 'CA', '2022-04-11'),
          ('Naz', 'adn', 'otc',   'xyz', 'no',  'NJ', '2022-05-01')]

col_1 = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7']

df_1 = spark.createDataFrame(data=data_1, schema=col_1)

df_1.show()
+----+----+-----+----+----+----+----------+
|col1|col2| col3|col4|col5|col6|      col7|
+----+----+-----+----+----+----+----------+
| Asr| Suh|dervi| xyz| yes|  NY|2022-04-11|
| Bas| nav|dervi| xyz| yes|  CA|2022-04-11|
| Naz| adn|  otc| xyz|  no|  NJ|2022-05-01|
+----+----+-----+----+----+----+----------+

data_2 = [('Asr', 'Suh', 'dervi', 'xyz'),
          ('Bas', 'nav', 'dervi', 'xyz')]

col_2 = ['col1', 'col2', 'col3', 'col4']

df_2 = spark.createDataFrame(data=data_2, schema=col_2)

df_2.show()
+----+----+-----+----+
|col1|col2| col3|col4|
+----+----+-----+----+
| Asr| Suh|dervi| xyz|
| Bas| nav|dervi| xyz|
+----+----+-----+----+

So, in your case, you could try something like this,

df_1.join(df_2, (df_1.col2 == df_2.col2) & (df_1.col3 == df_2.col3), "leftanti").show(truncate=False)
+----+----+----+----+----+----+----------+
|col1|col2|col3|col4|col5|col6|col7      |
+----+----+----+----+----+----+----------+
|Naz |adn |otc |xyz |no  |NJ  |2022-05-01|
+----+----+----+----+----+----+----------+

or

df_1.join(df_2, ['col2', 'col3'], "leftanti").select(df_1.columns).show(truncate=False)
+----+----+----+----+----+----+----------+
|col1|col2|col3|col4|col5|col6|col7      |
+----+----+----+----+----+----+----------+
|Naz |adn |otc |xyz |no  |NJ  |2022-05-01|
+----+----+----+----+----+----+----------+

Both are the same operation, just with different syntax; in the second form, select(df_1.columns) keeps DF1's original column order. Use whichever you prefer.
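
One caveat, in case your real keys can contain nulls: with the expression-based condition, rows of df_1 with a null col2 or col3 never match and are therefore always kept by the anti join. If you want null keys to match null keys instead, PySpark's eqNullSafe comparison is a possible substitute:

# Null-safe sketch: eqNullSafe treats NULL == NULL as true, so a null
# key pair that also appears in df_2 gets filtered out as well.
cond = df_1.col2.eqNullSafe(df_2.col2) & df_1.col3.eqNullSafe(df_2.col3)
df_1.join(df_2, cond, "leftanti").show(truncate=False)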
