For example, I have two PySpark dataframes, each with 100 columns. I want to compare dataframe A with dataframe B and check whether there are rows with the same values across all 100 columns. If so, those rows should be removed.
I don't have any code because I don't know where to begin; please help. I am wondering whether I could use an outer join between the two dataframes.
CodePudding user response:
The left_anti join removes matching rows from the left table, which is well suited to your use case. Here's an example.
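For reference, here's a minimal setup that reproduces the sample dataframes below (the session bootstrap and the func alias for pyspark.sql.functions are assumptions, since the original snippet omits them):
from pyspark.sql import SparkSession
from pyspark.sql import functions as func

spark = SparkSession.builder.getOrCreate()

# sample data matching the show() outputs that follow
data1_sdf = spark.createDataFrame(
    [(1, 2, 3, 4), (1, 2, 3, 4), (3, 2, 1, 4)],
    ['c1', 'c2', 'c3', 'c4']
)
data2_sdf = spark.createDataFrame(
    [(1, 2, 3, 4), (5, 7, 3, 4)],
    ['c1', 'c2', 'c3', 'c4']
)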
data1_sdf.show()
# +---+---+---+---+
# | c1| c2| c3| c4|
# +---+---+---+---+
# |  1|  2|  3|  4|
# |  1|  2|  3|  4|
# |  3|  2|  1|  4|
# +---+---+---+---+
data2_sdf.show()
# +---+---+---+---+
# | c1| c2| c3| c4|
# +---+---+---+---+
# |  1|  2|  3|  4|
# |  5|  7|  3|  4|
# +---+---+---+---+
# left anti join to remove matching rows from data1_sdf
data1_sdf. \
    withColumn('allcol_concat', func.concat_ws('_', *data1_sdf.columns)). \
    join(data2_sdf.withColumn('allcol_concat', func.concat_ws('_', *data2_sdf.columns)).
         select('allcol_concat'),
         'allcol_concat',
         'leftanti'
         ). \
    show()
# +-------------+---+---+---+---+
# |allcol_concat| c1| c2| c3| c4|
# +-------------+---+---+---+---+
# |      3_2_1_4|  3|  2|  1|  4|
# +-------------+---+---+---+---+
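As an alternative sketch (not part of the original answer): since both dataframes share the same column names, you can pass the full column list as the join key and skip the concat step entirely. This avoids accidental key collisions from concat_ws, e.g. the value pairs ('1_2', '3') and ('1', '2_3') both concatenate to '1_2_3'. One caveat either way: rows containing nulls won't match under standard join equality, and concat_ws silently skips nulls.
# left anti join on all columns; keeps only rows of data1_sdf
# that have no exact match in data2_sdf
data1_sdf.join(data2_sdf, on=data1_sdf.columns, how='leftanti').show()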