My question is related to this one, but a new problem came up.
Why does the empty array have a non-zero size?
import pyspark.sql.functions as F
import pyspark.sql.types as T
new_customers = spark.createDataFrame(data=[["Karen", ["a"]], ["Penny", ["b"]], ["John", [None]], ["Cosimo", ["d"]]], schema=["name", "val"])
new_customers.printSchema()
new_customers.show(5, False)
new_customers = new_customers.withColumn("new_val", F.size("val"))
new_customers.show(10, truncate=False)
The results:
root
|-- name: string (nullable = true)
|-- val: array (nullable = true)
| |-- element: string (containsNull = true)
+------+---+
|name  |val|
+------+---+
|Karen |[a]|
|Penny |[b]|
|John  |[] |
|Cosimo|[d]|
+------+---+
+------+---+-------+
|name  |val|new_val|
+------+---+-------+
|Karen |[a]|1      |
|Penny |[b]|1      |
|John  |[] |1      |   <- why is it 1?
|Cosimo|[d]|1      |
+------+---+-------+
The PySpark version is 2.3.2.
Thanks.
CodePudding user response:
I would say it is just a display issue, since you have a list with one element, None. Spark 3.3.0 will correctly display it as:
|John |[null]|1 |
If you remove None, you get:
|John |[] |0 |
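If you want to confirm this on 2.3.2 yourself, a minimal check (assuming the new_customers DataFrame from the question) is to look at whether the first element of val is null, instead of trusting the printed []:
import pyspark.sql.functions as F

# For John this prints first_element_is_null = true and size = 1,
# showing that val is [null] rather than an empty array.
new_customers.select(
    "name",
    F.col("val").getItem(0).isNull().alias("first_element_is_null"),
    F.size("val").alias("size"),
).show(truncate=False)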
CodePudding user response:
You are getting 1 because None is a valid element: size counts it like any other value (see the sketch after the examples below).
["aa"] -> size 1
[None] -> size 1
[] -> size 0
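For reference, a quick self-contained sketch of those rules using literal arrays (assuming the same spark session; the array contents are just placeholders):
import pyspark.sql.functions as F

spark.range(1).select(
    F.size(F.expr("array('aa')")).alias("one_element"),   # 1
    F.size(F.expr("array(null)")).alias("null_element"),  # 1 -- a null element still counts
    F.size(F.expr("array()")).alias("empty_array"),       # 0
).show()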
If you change your initial data to
new_customers = spark.createDataFrame(data=[
["Karen", ["a"]],
["Penny", ["b"]],
["John", []], # Empty array instead of array with None element.
["Cosimo", ["d"]]
], schema=["name", "val"])
You will get your expected result.
+------+---+-------+
|name  |val|new_val|
+------+---+-------+
|Karen |[a]|1      |
|Penny |[b]|1      |
|John  |[] |0      |
|Cosimo|[d]|1      |
+------+---+-------+
If you cannot change your input DataFrame, you can try removing the None elements before calling the size function.
new_customers = new_customers.withColumn('new_val', F.size(F.expr('filter(val, x -> x is not null)')))
new_customers.show()
# +------+---+-------+
# |name  |val|new_val|
# +------+---+-------+
# |Karen |[a]|1      |
# |Penny |[b]|1      |
# |John  |[] |0      |
# |Cosimo|[d]|1      |
# +------+---+-------+
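One caveat: as far as I know, the filter higher-order function in Spark SQL was only added in Spark 2.4, so on 2.3.2 the expression above will likely fail to parse. A rough alternative sketch for 2.3.2 using a plain Python UDF (the helper name count_non_null is made up for this example):
import pyspark.sql.functions as F
import pyspark.sql.types as T

# Hypothetical helper: count only the non-null elements of an array column.
count_non_null = F.udf(
    lambda arr: None if arr is None else sum(1 for x in arr if x is not None),
    T.IntegerType(),
)

new_customers = new_customers.withColumn("new_val", count_non_null("val"))
new_customers.show(truncate=False)
A UDF is slower than a native expression, so if you can upgrade to 2.4+, prefer the filter expression shown above.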