Using .join to remove certain lines in a db in spark scala


If I had these two datasets:

db1:  
{item: "keyboard", cus_id: 1234}  
{item: "mouse", cus_id: 2345}  
{item: "laptop", cus_id: 3456}  
{item: "charger", cus_id: 4567}  
{item: "headset", cus_id: 5678}  

db2:
{item: "keyboard", cus_id: 1234}  
{item: "mouse", cus_id: 2345}  
{item: "laptop", cus_id: 3456}  
{item: "charger", cus_id: 1234}  

I want to return the dataset

{item: "charger", cus_id: 4567}  
{item: "headset", cus_id: 5678}  

because they are entries that are contained in db1 but not db2 based on the item and cus_id. Here's what I have so far:

db1.join(db2, db1.col("item") =!= db2.col("item") && db1.col("cus_id") === db2.col("cus_id"), "inner")

but I think the logic is flawed somewhere in the join function. How can I do this correctly?

CodePudding user response:

I think you should instead left join db2 on the condition that the cus_ids are equal and the items are equal too. If a row exists in db1 but not in db2, it will still appear in the result, but with the right-hand columns NULL, and you can use that to filter out the matched rows:

scala> df1.join(df2, df1.col("cus_id") === df2.col("cus_id") && df1.col("item") === df2.col("item"), "left").toDF("item1", "id1", "item2", "id2").select("item1", "id1").where("id2 is null").toDF("item", "id").show
 ------- ---- 
|   item|  id|
 ------- ---- 
|charger|4567|
|headset|5678|
 ------- ---- 

This is a cleaner version of the same code:

df1
  .join(
    df2,
    df1.col("cus_id") === df2.col("cus_id") && df1.col("item") === df2.col("item"),
    "left"
  )
  .toDF("item1", "id1", "item2", "id2")
  .select("item1", "id1")
  .where("id2 is null")
  .toDF("item", "id")
  .show

CodePudding user response:

Are you looking for a left anti join?

db1.join(db2, Seq("item", "cus_id"), "left_anti")
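A left anti join keeps exactly the rows of the left side that have no match on the right. As a sanity check independent of Spark, here is a minimal sketch of the same semantics using plain Scala collections (the `Row` case class and the sample data are illustrative, not from a real schema):

```scala
// Plain-Scala illustration of left-anti semantics: keep a row from db1
// only when no row in db2 matches on BOTH item and cus_id.
case class Row(item: String, cusId: Int)

object AntiJoinSketch extends App {
  val db1 = Seq(
    Row("keyboard", 1234), Row("mouse", 2345), Row("laptop", 3456),
    Row("charger", 4567), Row("headset", 5678)
  )
  val db2 = Seq(
    Row("keyboard", 1234), Row("mouse", 2345),
    Row("laptop", 3456), Row("charger", 1234)
  )

  // Build the set of (item, cus_id) keys present in db2 ...
  val db2Keys = db2.map(r => (r.item, r.cusId)).toSet
  // ... and drop every db1 row whose key appears in that set.
  val result = db1.filterNot(r => db2Keys((r.item, r.cusId)))

  println(result) // charger/4567 and headset/5678 survive
}
```

Note that `Row("charger", 4567)` survives even though db2 contains a "charger" row, because the pair (item, cus_id) must match as a whole, which is why the single-column join in the question misbehaved.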