I want to find all rows in df1 whose id does not appear in df2. In pandas I can do it with the following code:
df1.merge(df2, on='id', how='outer', indicator=True).loc[lambda x: x['_merge'] == 'left_only']
How can I do this in PySpark?
CodePudding user response:
Use a left_anti join.

df1:
df1 = spark.createDataFrame([
(1, 'a'),
(1, 'b'),
(1, 'c'),
(2, 'd'),
(2, 'e'),
(3, 'f'),
], ['id', 'col'])
+---+---+
| id|col|
+---+---+
|  1|  a|
|  1|  b|
|  1|  c|
|  2|  d|
|  2|  e|
|  3|  f|
+---+---+
df2:
df2 = spark.createDataFrame([
(1, 'a'),
(1, 'b'),
(1, 'c'),
], ['id', 'col'])
+---+---+
| id|col|
+---+---+
|  1|  a|
|  1|  b|
|  1|  c|
+---+---+
left_anti join:
df1.join(df2, on=['id'], how='left_anti').show()
+---+---+
| id|col|
+---+---+
|  2|  d|
|  2|  e|
|  3|  f|
+---+---+
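For reference, a left_anti join keeps every row of df1 whose id has no match in df2, and never adds columns from df2. A minimal pure-Python sketch of that semantics, using the same sample data (illustration only, not how Spark implements it):

```python
# Sample data matching the DataFrames above.
df1 = [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'd'), (2, 'e'), (3, 'f')]
df2 = [(1, 'a'), (1, 'b'), (1, 'c')]

# Collect the distinct join keys present on the right side.
right_ids = {row[0] for row in df2}

# Keep only left rows whose key is absent from the right side.
left_anti = [row for row in df1 if row[0] not in right_ids]

print(left_anti)  # [(2, 'd'), (2, 'e'), (3, 'f')]
```

The same result is also available in Spark SQL as `SELECT * FROM df1 LEFT ANTI JOIN df2 ON df1.id = df2.id` (after registering both DataFrames as temp views).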