If I had these two datasets:
db1:
{item: "keyboard", cus_id: 1234}
{item: "mouse", cus_id: 2345}
{item: "laptop", cus_id: 3456}
{item: "charger", cus_id: 4567}
{item: "headset", cus_id: 5678}
db2:
{item: "keyboard", cus_id: 1234}
{item: "mouse", cus_id: 2345}
{item: "laptop", cus_id: 3456}
{item: "charger", cus_id: 1234}
I want to return the dataset
{item: "charger", cus_id: 4567}
{item: "headset", cus_id: 5678}
because they are entries that are contained in db1
but not db2
based on the item
and cus_id
. Here's what I have so far:
db1.join(db2, db1.col("item") =!= bd2.col("item") && db1.col("cus_id") === db2.col("cus_id"), "inner")
but I think the logic is flawed somewhere in the join
function. How can I do this correctly?
CodePudding user response:
I think you should instead perform a left join db2
on the condition that cus_id
s are equal and item
s are equal too, so if there is a row that exists in db1
and not in db2
, will appear in the result but the right columns will be NULL
and you can use this to filter the desired data:
scala> df1.join(df2, df1.col("cus_id") === df2.col("cus_id") && df1.col("item") === df2.col("item"), "left").toDF("id1", "item1", "id2", "item2").select("id1", "item1").where("id2 is null").toDF("id", "item").show
------- ----
| id|item|
------- ----
|charger|4567|
|headset|5678|
------- ----
This is a cleaner version of the same code:
df1
.join(
df2,
df1.col("cus_id") === df2.col("cus_id") && df1.col("item") === df2.col("item"),
"left"
)
.toDF("id1", "item1", "id2", "item2")
.select("id1", "item1").where("id2 is null")
.toDF("id", "item")
.show
CodePudding user response:
Are you looking for a left anti join?
db1.join(db2, on=["item", "cus_id"], how="left_anti")