For example, I have two PySpark dataframes, each with 100 columns. I want to compare dataframe A with dataframe B and check whether there are rows with the same values across all 100 columns. If so, those rows should be removed.
I don't have any code because I don't know where to begin; please help. I am wondering whether I could use an outer join between the two dataframes.
CodePudding user response:
The left_anti join removes matching rows from the left table, which is well suited to your use case. Here's an example.
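For reference, here's a minimal setup that reproduces the sample dataframes below (the session bootstrap and the func alias for pyspark.sql.functions are assumptions, since the original snippet omits them):
from pyspark.sql import SparkSession
from pyspark.sql import functions as func

spark = SparkSession.builder.getOrCreate()

# sample data matching the show() outputs that follow
data1_sdf = spark.createDataFrame(
    [(1, 2, 3, 4), (1, 2, 3, 4), (3, 2, 1, 4)],
    ['c1', 'c2', 'c3', 'c4']
)
data2_sdf = spark.createDataFrame(
    [(1, 2, 3, 4), (5, 7, 3, 4)],
    ['c1', 'c2', 'c3', 'c4']
)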
data1_sdf.show()
# +---+---+---+---+
# | c1| c2| c3| c4|
# +---+---+---+---+
# |  1|  2|  3|  4|
# |  1|  2|  3|  4|
# |  3|  2|  1|  4|
# +---+---+---+---+
data2_sdf.show()
# +---+---+---+---+
# | c1| c2| c3| c4|
# +---+---+---+---+
# |  1|  2|  3|  4|
# |  5|  7|  3|  4|
# +---+---+---+---+
# left anti join to remove matching rows from data1_sdf
data1_sdf. \
    withColumn('allcol_concat', func.concat_ws('_', *data1_sdf.columns)). \
    join(data2_sdf.withColumn('allcol_concat', func.concat_ws('_', *data2_sdf.columns)).
         select('allcol_concat'),
         'allcol_concat',
         'leftanti'
         ). \
    show()
# +-------------+---+---+---+---+
# |allcol_concat| c1| c2| c3| c4|
# +-------------+---+---+---+---+
# |      3_2_1_4|  3|  2|  1|  4|
# +-------------+---+---+---+---+
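As an alternative sketch (not part of the original answer): since both dataframes share the same column names, you can pass the full column list as the join key and skip the concat step entirely. This avoids accidental key collisions from concat_ws, e.g. the value pairs ('1_2', '3') and ('1', '2_3') both concatenate to '1_2_3'. One caveat either way: rows containing nulls won't match under standard join equality, and concat_ws silently skips nulls.
# left anti join on all columns; keeps only rows of data1_sdf
# that have no exact match in data2_sdf
data1_sdf.join(data2_sdf, on=data1_sdf.columns, how='leftanti').show()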