How to find the complement of a DataFrame with respect to another DataFrame?


I want to find all rows in df1 whose id does not appear in df2. In pandas I can do it with the following code:

df1.merge(df2, on='id', how='outer', indicator=True).loc[lambda x: x['_merge'] == 'left_only']

how I can do it in pyspark?
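
For context, here is a minimal runnable sketch of that pandas approach, using made-up sample frames (the values are just for illustration):

import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 2, 3], 'col': ['a', 'b', 'd', 'f']})
df2 = pd.DataFrame({'id': [1], 'col': ['a']})

# Outer merge on id with an indicator column, then keep only the
# rows that came from df1 alone
left_only = (df1.merge(df2[['id']].drop_duplicates(), on='id',
                       how='outer', indicator=True)
                .loc[lambda x: x['_merge'] == 'left_only']
                .drop(columns='_merge'))
print(left_only)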

CodePudding user response:

Use a left_anti join: it returns only the rows of the left DataFrame that have no match in the right DataFrame on the join key.

df1
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([
    (1, 'a'),
    (1, 'b'),
    (1, 'c'),
    (2, 'd'),
    (2, 'e'),
    (3, 'f'),
], ['id', 'col'])
df1.show()

+---+---+
| id|col|
+---+---+
|  1|  a|
|  1|  b|
|  1|  c|
|  2|  d|
|  2|  e|
|  3|  f|
+---+---+
df2
df2 = spark.createDataFrame([
    (1, 'a'),
    (1, 'b'),
    (1, 'c'),
], ['id', 'col'])
df2.show()

+---+---+
| id|col|
+---+---+
|  1|  a|
|  1|  b|
|  1|  c|
+---+---+
left_anti join
df1.join(df2, on=['id'], how='left_anti').show()

+---+---+
| id|col|
+---+---+
|  2|  d|
|  2|  e|
|  3|  f|
+---+---+
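
As an aside, if you want something closer in spirit to the pandas indicator trick, a left join plus a null filter produces the same rows. This is only a sketch assuming the same df1/df2 as above, and the _in_df2 marker column is just a made-up name; the left_anti join remains the idiomatic way:

from pyspark.sql import functions as F

# Tag every id that occurs in df2, left-join the tags onto df1,
# and keep only the rows whose tag is null (id absent from df2)
ids_in_df2 = df2.select('id').distinct().withColumn('_in_df2', F.lit(True))
(df1.join(ids_in_df2, on='id', how='left')
    .where(F.col('_in_df2').isNull())
    .drop('_in_df2')
    .show())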