from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; the original snippet assumed an existing `spark`
spark = SparkSession.builder.getOrCreate()

columns = ['id1', 'id2', 'val']
ids = ['id1', 'id2']

# Both tables contain a row with the key (4, NULL)
vals1 = [
    (1, 'a', 1),
    (2, 'b', 4),
    (4, None, 1),
]
df1 = spark.createDataFrame(data=vals1, schema=columns)

vals2 = [
    (1, 'a', 5),
    (3, 'c', 2),
    (4, None, 2),
]
df2 = spark.createDataFrame(data=vals2, schema=columns)

# Rename the value columns so they stay distinguishable after the join
df1 = df1.withColumnRenamed('val', 'val1')
df2 = df2.withColumnRenamed('val', 'val2')

# Full outer join on the key columns
res_df = df1.join(df2, ids, 'full')
res_df.sort(ids).show()
And the result is:
+---+----+----+----+
|id1| id2|val1|val2|
+---+----+----+----+
|  1|   a|   1|   5|
|  2|   b|   4|null|
|  3|   c|null|   2|
|  4|null|   1|null|
|  4|null|null|   2|
+---+----+----+----+
i.e. there are two rows with the key (4, null).
Is this behavior normal?
Answer:
This is expected behaviour. NULL does not equal NULL in comparisons, as explained in the NULL Semantics section of the documentation:

Apache Spark supports the standard comparison operators such as '>', '>=', '=', '<' and '<='. The result of these operators is unknown or NULL when one of the operands or both the operands are unknown or NULL.
(AFAIK, this is standard SQL behaviour and not specific to Spark)
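You can verify this directly in a Spark session; a minimal sketch, reusing the spark session from the snippet above:

# NULL = NULL evaluates to NULL, while the null-safe operator <=> evaluates to true
spark.sql("SELECT NULL = NULL AS eq, NULL <=> NULL AS null_safe_eq").show()

# Expected output:
# +----+------------+
# |  eq|null_safe_eq|
# +----+------------+
# |null|        true|
# +----+------------+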
Furthermore:

WHERE, HAVING operators filter rows based on the user specified condition. A JOIN operator is used to combine rows from two tables based on a join condition. For all the three operators, a condition expression is a boolean expression and can return True, False or Unknown (NULL). They are "satisfied" if the result of the condition is True.
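The same rule applies to DataFrame filters; a short sketch using the question's df1 (col and lit are from pyspark.sql.functions):

from pyspark.sql.functions import col, lit

# The row (4, None, 1) is NOT returned: id2 = NULL evaluates to NULL, not True
df1.filter(col('id2') == lit(None)).show()

# To actually match NULL values in a filter, use isNull instead
df1.filter(col('id2').isNull()).show()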
Comparing NULL to NULL produces NULL, which is not True, so the (4, NULL) key in the first table does not match the (4, NULL) key in the second table. The join therefore produces two distinct rows: one from joining the left row to a missing right row (| 4|null| 1|null|) and one from joining the right row to a missing left row (| 4|null|null| 2|).
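If you want the two (4, NULL) keys to be treated as equal, one option is a null-safe join condition via Column.eqNullSafe (the <=> operator); a sketch, coalescing the duplicated key columns back together afterwards:

from pyspark.sql.functions import coalesce

# Null-safe equality: NULL <=> NULL is true, so the (4, NULL) keys match
cond = [df1['id1'].eqNullSafe(df2['id1']), df1['id2'].eqNullSafe(df2['id2'])]

res_df = df1.join(df2, cond, 'full').select(
    coalesce(df1['id1'], df2['id1']).alias('id1'),
    coalesce(df1['id2'], df2['id2']).alias('id2'),
    'val1',
    'val2',
)
res_df.sort(ids).show()

# Expected output: a single row for the (4, NULL) key
# +---+----+----+----+
# |id1| id2|val1|val2|
# +---+----+----+----+
# |  1|   a|   1|   5|
# |  2|   b|   4|null|
# |  3|   c|null|   2|
# |  4|null|   1|   2|
# +---+----+----+----+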