Problem Statement
Here's my use case:
I have 2 tables, Today_data and Yesterday_data, for example:
Today_data:
Id Value
1 1_data
2 2_data
3 3_data
Yesterday_data:
Id Value
2 2_data
4 4_data
8 8_data
I want to delete a row from the Today_data DataFrame if the same row exists in Yesterday_data.
Expected Result
Id Value
1 1_data
3 3_data
Approach Taken
I was thinking it should be an easy left join with Today_data on the left. However, after reading through all the join operations in PySpark here: https://sparkbyexamples.com/pyspark/pyspark-join-explained-with-examples/#pyspark-join-types, I don't see one that solves my problem. Any ideas?
CodePudding user response:
You're looking for a left_anti join:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

today_df = spark.createDataFrame([(1, "1_data"), (2, "2_data"), (3, "3_data")], ["Id", "Value"])
yesterday_df = spark.createDataFrame([(2, "2_data"), (4, "4_data"), (8, "8_data")], ["Id", "Value"])

# keep only the rows of today_df whose Id has no match in yesterday_df
result = today_df.join(yesterday_df, ["Id"], "left_anti")
result.show()
# +---+------+
# | Id| Value|
# +---+------+
# |  1|1_data|
# |  3|3_data|
# +---+------+
This one is explained in this section of the link you provided.
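For intuition, the left-anti semantics can be sketched in plain Python without Spark; this is just an illustration of what the join does, assuming the match is on the Id column only:

```python
# Left-anti join semantics: keep rows of the left table
# whose key does not appear anywhere in the right table.
today = [(1, "1_data"), (2, "2_data"), (3, "3_data")]
yesterday = [(2, "2_data"), (4, "4_data"), (8, "8_data")]

yesterday_ids = {row[0] for row in yesterday}
result = [row for row in today if row[0] not in yesterday_ids]
print(result)  # [(1, '1_data'), (3, '3_data')]
```

Note that the answer above joins on Id only. If a row should be deleted only when the entire row matches (both Id and Value), pass both columns to the join: `today_df.join(yesterday_df, ["Id", "Value"], "left_anti")`.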