Home > front end >  Pyspark Delete rows in table one which matches rows in table two
Pyspark Delete rows in table one which matches rows in table two

Time:01-26

Problem Statement

Here's my use case: I have 2 tables, Today_data and Yesterday_data, for example:

Today_data:

Id    Value
1     1_data
2     2_data
3     3_data 

Yesterday_data:

Id    Value
2     2_data
4     4_data
8     8_data

I want to delete Today_data df rows if the row matches Yesterday_data row.

Expected Result

Id    Value
1     1_data
3     3_data 

Approach Taken

I was thinking it should be a easy left join where Today_data will be on the left, however after I read through all the join operations in pyspark here: https://sparkbyexamples.com/pyspark/pyspark-join-explained-with-examples/#pyspark-join-types, I don't see any of them can solve my problem. Any ideas?

CodePudding user response:

You're looking for left_anti join:

today_df = spark.createDataFrame([(1, "1_data"), (2, "2_data"), (3, "3_data")], ["Id", "Value"])
yesterday_df = spark.createDataFrame([(2, "2_data"), (4, "4_data"), (8, "8_data")], ["Id", "Value"])

result = today_df.join(yesterday_df, ["Id"], "left_anti")

result.show()
# --- ------ 
#| Id| Value|
# --- ------ 
#|  1|1_data|
#|  3|3_data|
# --- ------ 

This one is explained in this section of the link you provided.

  •  Tags:  
  • Related