I have noticed that 0 gets joined with null when using a null-safe join (eqNullSafe).
df1 = spark.createDataFrame([(1, ), (None, )], ['df1_id'])
df2 = spark.createDataFrame([(None, ), (0, )], ['df2_id'])
df1.join(df2, df1.df1_id.eqNullSafe(df2.df2_id), 'right').show()
# +------+------+
# |df1_id|df2_id|
# +------+------+
# |  null|     0|
# |  null|  null|
# +------+------+
df2.join(df1, df1.df1_id.eqNullSafe(df2.df2_id), 'left').show()
# +------+------+
# |df2_id|df1_id|
# +------+------+
# |     0|  null|
# |  null|  null|
# +------+------+
How do I make null join only with null?
CodePudding user response:
You need to do an inner join here:
df1.join(df2, df1.df1_id.eqNullSafe(df2.df2_id), 'inner').show()
Right now the 0 in the right DataFrame has no match in the left DataFrame; because you are doing a right join, PySpark keeps that 0 row from the right DataFrame and fills df1_id with null.
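For reference, here is a self-contained sketch of the same example ending with the inner join (the SparkSession setup lines are my addition). The commented output is what show() should print: eqNullSafe only matches the null row of df1 to the null row of df2, so the unmatched 1 and 0 rows are dropped instead of appearing next to null.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, ), (None, )], ['df1_id'])
df2 = spark.createDataFrame([(None, ), (0, )], ['df2_id'])

# Inner join keeps only rows where the null-safe equality holds,
# so rows without a counterpart (1 on the left, 0 on the right) disappear.
df1.join(df2, df1.df1_id.eqNullSafe(df2.df2_id), 'inner').show()
# +------+------+
# |df1_id|df2_id|
# +------+------+
# |  null|  null|
# +------+------+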