I want to find all rows in df1 whose id does not appear in df2. In pandas I can do it with the following code:
df1.merge(df2, on='id', how='outer', indicator=True).loc[lambda x: x['_merge'] == 'left_only']
How can I do this in PySpark?
CodePudding user response:
Use a left_anti join.

df1:
df1 = spark.createDataFrame([
(1, 'a'),
(1, 'b'),
(1, 'c'),
(2, 'd'),
(2, 'e'),
(3, 'f'),
], ['id', 'col'])
+---+---+
| id|col|
+---+---+
|  1|  a|
|  1|  b|
|  1|  c|
|  2|  d|
|  2|  e|
|  3|  f|
+---+---+
df2:
df2 = spark.createDataFrame([
(1, 'a'),
(1, 'b'),
(1, 'c'),
], ['id', 'col'])
+---+---+
| id|col|
+---+---+
|  1|  a|
|  1|  b|
|  1|  c|
+---+---+
left_anti join:
df1.join(df2, on=['id'], how='left_anti').show()
+---+---+
| id|col|
+---+---+
|  2|  d|
|  2|  e|
|  3|  f|
+---+---+
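For reference, a left_anti join keeps every row of df1 whose id has no match in df2, and never adds columns from df2. A minimal pure-Python sketch of that semantics, using the same sample data (illustration only, not how Spark implements it):

```python
# Sample data matching the DataFrames above.
df1 = [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'd'), (2, 'e'), (3, 'f')]
df2 = [(1, 'a'), (1, 'b'), (1, 'c')]

# Collect the distinct join keys present on the right side.
right_ids = {row[0] for row in df2}

# Keep only left rows whose key is absent from the right side.
left_anti = [row for row in df1 if row[0] not in right_ids]

print(left_anti)  # [(2, 'd'), (2, 'e'), (3, 'f')]
```

The same result is also available in Spark SQL as `SELECT * FROM df1 LEFT ANTI JOIN df2 ON df1.id = df2.id` (after registering both DataFrames as temp views).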