from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; the original snippet assumed an existing `spark`
spark = SparkSession.builder.getOrCreate()

columns = ['id1', 'id2', 'val']
ids = ['id1', 'id2']

# Both tables contain a row with the key (4, NULL)
vals1 = [
    (1, 'a', 1),
    (2, 'b', 4),
    (4, None, 1),
]
df1 = spark.createDataFrame(data=vals1, schema=columns)

vals2 = [
    (1, 'a', 5),
    (3, 'c', 2),
    (4, None, 2),
]
df2 = spark.createDataFrame(data=vals2, schema=columns)

# Rename the value columns so they stay distinguishable after the join
df1 = df1.withColumnRenamed('val', 'val1')
df2 = df2.withColumnRenamed('val', 'val2')

# Full outer join on the key columns
res_df = df1.join(df2, ids, 'full')
res_df.sort(ids).show()
And the result is:
+---+----+----+----+
|id1| id2|val1|val2|
+---+----+----+----+
|  1|   a|   1|   5|
|  2|   b|   4|null|
|  3|   c|null|   2|
|  4|null|   1|null|
|  4|null|null|   2|
+---+----+----+----+
i.e. there are two rows with the key (4, null).
Is this behavior normal?
Answer:
This is expected behaviour. NULL does not equal NULL in comparisons, as explained in the NULL Semantics section of the documentation:

Apache Spark supports the standard comparison operators such as '>', '>=', '=', '<' and '<='. The result of these operators is unknown or NULL when one of the operands or both the operands are unknown or NULL.
(AFAIK, this is standard SQL behaviour and not specific to Spark)
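You can verify this directly in a Spark session; a minimal sketch, reusing the spark session from the snippet above:

# NULL = NULL evaluates to NULL, while the null-safe operator <=> evaluates to true
spark.sql("SELECT NULL = NULL AS eq, NULL <=> NULL AS null_safe_eq").show()

# Expected output:
# +----+------------+
# |  eq|null_safe_eq|
# +----+------------+
# |null|        true|
# +----+------------+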
Furthermore:

WHERE, HAVING operators filter rows based on the user specified condition. A JOIN operator is used to combine rows from two tables based on a join condition. For all the three operators, a condition expression is a boolean expression and can return True, False or Unknown (NULL). They are "satisfied" if the result of the condition is True.
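The same rule applies to DataFrame filters; a short sketch using the question's df1 (col and lit are from pyspark.sql.functions):

from pyspark.sql.functions import col, lit

# The row (4, None, 1) is NOT returned: id2 = NULL evaluates to NULL, not True
df1.filter(col('id2') == lit(None)).show()

# To actually match NULL values in a filter, use isNull instead
df1.filter(col('id2').isNull()).show()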
Comparing NULL to NULL produces NULL, which is not True, so the (4, NULL) key in the first table does not match the (4, NULL) key in the second table. The join therefore produces two distinct rows: one from joining the left row to a missing right row (| 4|null| 1|null|) and one from joining the right row to a missing left row (| 4|null|null| 2|).
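If you want the two (4, NULL) keys to be treated as equal, one option is a null-safe join condition via Column.eqNullSafe (the <=> operator); a sketch, coalescing the duplicated key columns back together afterwards:

from pyspark.sql.functions import coalesce

# Null-safe equality: NULL <=> NULL is true, so the (4, NULL) keys match
cond = [df1['id1'].eqNullSafe(df2['id1']), df1['id2'].eqNullSafe(df2['id2'])]

res_df = df1.join(df2, cond, 'full').select(
    coalesce(df1['id1'], df2['id1']).alias('id1'),
    coalesce(df1['id2'], df2['id2']).alias('id2'),
    'val1',
    'val2',
)
res_df.sort(ids).show()

# Expected output: a single row for the (4, NULL) key
# +---+----+----+----+
# |id1| id2|val1|val2|
# +---+----+----+----+
# |  1|   a|   1|   5|
# |  2|   b|   4|null|
# |  3|   c|null|   2|
# |  4|null|   1|   2|
# +---+----+----+----+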