I have two dataframes, DF1 and DF2, with different column sets:
DF1
col1  col2  col3   col4  col5  col6  col7
------------------------------------------------
Asr   Suh   dervi  xyz   yes   NY    2022-04-11
Bas   nav   dervi  xyz   yes   CA    2022-04-11
Naz   adn   otc    xyz   no    NJ    2022-05-01

DF2
col1  col2  col3   col4
-----------------------
Asr   Suh   dervi  xyz
Bas   nav   dervi  xyz
I want to compare these two dataframes on col2 and col3 and filter out the non-matching rows from DF1. So the comparison should give the following row as the result:
df3
Naz   adn   otc   xyz   no   NJ   2022-05-01
CodePudding user response:
There's no built-in left anti join in pandas; use an outer join with indicator=True first, then keep the left-only rows.
cols = ['col2', 'col3']
# Outer-join only the key columns; indicator=True adds a '_merge' column
# marking each row as 'left_only', 'right_only' or 'both'.
_df = df1[cols].merge(df2[cols], how='outer', indicator=True)
# Keep the keys that exist only in df1 and drop the helper column.
_df = _df[_df._merge == 'left_only'].drop('_merge', axis=1)
# Merge back with df1 to recover the full rows for those keys.
df3 = df1.merge(_df, on=cols)
df3
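An equivalent approach without the second merge (just a sketch, assuming the same df1 and df2 as above) is to collect the (col2, col3) pairs of df2 into a set and filter df1 with a boolean mask:
cols = ['col2', 'col3']
# (col2, col3) pairs that appear in df2, as a set of plain tuples.
keys_in_df2 = set(df2[cols].itertuples(index=False, name=None))
# Keep only the df1 rows whose (col2, col3) pair is not present in df2.
df3 = df1[~df1[cols].apply(tuple, axis=1).isin(keys_in_df2)]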
CodePudding user response:
You can use PySpark's left anti join to get your desired result. When you join two DataFrames with a left anti join (leftanti), the result contains only the columns of the left DataFrame, and only the rows that have no match in the right DataFrame.
Preparing the dataframes:
data_1 = [('Asr', 'Suh', 'dervi', 'xyz', 'yes', 'NY', '2022-04-11'),
          ('Bas', 'nav', 'dervi', 'xyz', 'yes', 'CA', '2022-04-11'),
          ('Naz', 'adn', 'otc', 'xyz', 'no', 'NJ', '2022-05-01')]
col_1 = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7']
df_1 = spark.createDataFrame(data=data_1, schema=col_1)
df_1.show()
+----+----+-----+----+----+----+----------+
|col1|col2| col3|col4|col5|col6|      col7|
+----+----+-----+----+----+----+----------+
| Asr| Suh|dervi| xyz| yes|  NY|2022-04-11|
| Bas| nav|dervi| xyz| yes|  CA|2022-04-11|
| Naz| adn|  otc| xyz|  no|  NJ|2022-05-01|
+----+----+-----+----+----+----+----------+
data_2 = [('Asr', 'Suh', 'dervi', 'xyz'),
          ('Bas', 'nav', 'dervi', 'xyz')]
col_2 = ['col1', 'col2', 'col3', 'col4']
df_2 = spark.createDataFrame(data=data_2, schema=col_2)
df_2.show()
+----+----+-----+----+
|col1|col2| col3|col4|
+----+----+-----+----+
| Asr| Suh|dervi| xyz|
| Bas| nav|dervi| xyz|
+----+----+-----+----+
So, in your case, you could try something like this:
df_1.join(df_2, (df_1.col2 == df_2.col2) & (df_1.col3 == df_2.col3), "leftanti").show(truncate=False)
+----+----+----+----+----+----+----------+
|col1|col2|col3|col4|col5|col6|col7      |
+----+----+----+----+----+----+----------+
|Naz |adn |otc |xyz |no  |NJ  |2022-05-01|
+----+----+----+----+----+----+----------+
or
df_1.join(df_2, ['col2', 'col3'], "leftanti").select(df_1.columns).show(truncate=False)
+----+----+----+----+----+----+----------+
|col1|col2|col3|col4|col5|col6|col7      |
+----+----+----+----+----+----+----------+
|Naz |adn |otc |xyz |no  |NJ  |2022-05-01|
+----+----+----+----+----+----+----------+
Both of them produce the same result, just with different syntax; use whichever you prefer.
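If you are more comfortable with SQL, the same left anti join can also be written through Spark SQL. This is just a sketch, assuming the df_1 and df_2 DataFrames created above; the view names t1 and t2 are arbitrary:
df_1.createOrReplaceTempView("t1")
df_2.createOrReplaceTempView("t2")
# LEFT ANTI JOIN keeps the t1 rows that have no (col2, col3) match in t2.
df_3 = spark.sql("""
    SELECT t1.*
    FROM t1
    LEFT ANTI JOIN t2
      ON t1.col2 = t2.col2 AND t1.col3 = t2.col3
""")
df_3.show(truncate=False)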