My question is related to this one, but a new problem came up.
Why does the empty array have a non-zero size?
import pyspark.sql.functions as F
import pyspark.sql.types as T
new_customers = spark.createDataFrame(data=[["Karen", ["a"]], ["Penny", ["b"]], ["John", [None]], ["Cosimo", ["d"]]], schema=["name", "val"])
new_customers.printSchema()
new_customers.show(5, False)
new_customers = new_customers.withColumn("new_val", F.size("val"))
new_customers.show(10, truncate=False)
The results:
root
|-- name: string (nullable = true)
|-- val: array (nullable = true)
| |-- element: string (containsNull = true)
+------+---+
|name  |val|
+------+---+
|Karen |[a]|
|Penny |[b]|
|John  |[] |
|Cosimo|[d]|
+------+---+
+------+---+-------+
|name  |val|new_val|
+------+---+-------+
|Karen |[a]|1      |
|Penny |[b]|1      |
|John  |[] |1      |   <- why is it 1?
|Cosimo|[d]|1      |
+------+---+-------+
The PySpark version is 2.3.2.
Thanks.
CodePudding user response:
I would say it is just a display issue, since you have a list with one element, None. Spark 3.3.0 will correctly display it as:
|John |[null]|1 |
If you remove None, you get:
|John |[] |0 |
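If you want to confirm this on 2.3.2 yourself, a minimal check (assuming the new_customers DataFrame from the question) is to look at whether the first element of val is null, instead of trusting the printed []:
import pyspark.sql.functions as F

# For John this prints first_element_is_null = true and size = 1,
# showing that val is [null] rather than an empty array.
new_customers.select(
    "name",
    F.col("val").getItem(0).isNull().alias("first_element_is_null"),
    F.size("val").alias("size"),
).show(truncate=False)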
CodePudding user response:
You are getting 1 because None is a valid element: size counts it like any other value (see the sketch after the examples below).
["aa"] -> size 1
[None] -> size 1
[] -> size 0
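For reference, a quick self-contained sketch of those rules using literal arrays (assuming the same spark session; the array contents are just placeholders):
import pyspark.sql.functions as F

spark.range(1).select(
    F.size(F.expr("array('aa')")).alias("one_element"),   # 1
    F.size(F.expr("array(null)")).alias("null_element"),  # 1 -- a null element still counts
    F.size(F.expr("array()")).alias("empty_array"),       # 0
).show()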
If you change your initial data to
new_customers = spark.createDataFrame(data=[
["Karen", ["a"]],
["Penny", ["b"]],
["John", []], # Empty array instead of array with None element.
["Cosimo", ["d"]]
], schema=["name", "val"])
You will get your expected result.
+------+---+-------+
|name  |val|new_val|
+------+---+-------+
|Karen |[a]|1      |
|Penny |[b]|1      |
|John  |[] |0      |
|Cosimo|[d]|1      |
+------+---+-------+
If you cannot change your input DataFrame, you can try removing the None elements before calling the size function.
new_customers = new_customers.withColumn('new_val', F.size(F.expr('filter(val, x -> x is not null)')))
new_customers.show()
# +------+---+-------+
# |name  |val|new_val|
# +------+---+-------+
# |Karen |[a]|1      |
# |Penny |[b]|1      |
# |John  |[] |0      |
# |Cosimo|[d]|1      |
# +------+---+-------+
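One caveat: as far as I know, the filter higher-order function in Spark SQL was only added in Spark 2.4, so on 2.3.2 the expression above will likely fail to parse. A rough alternative sketch for 2.3.2 using a plain Python UDF (the helper name count_non_null is made up for this example):
import pyspark.sql.functions as F
import pyspark.sql.types as T

# Hypothetical helper: count only the non-null elements of an array column.
count_non_null = F.udf(
    lambda arr: None if arr is None else sum(1 for x in arr if x is not None),
    T.IntegerType(),
)

new_customers = new_customers.withColumn("new_val", count_non_null("val"))
new_customers.show(truncate=False)
A UDF is slower than a native expression, so if you can upgrade to 2.4+, prefer the filter expression shown above.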