I have the following schema:
>>> df.printSchema()
root
... SNIP ...
|-- foo: array (nullable = true)
| |-- element: struct (containsNull = true)
... SNIP ...
| | |-- value: double (nullable = true)
In this case, I have only one row in the dataframe and one element in the foo array:
>>> df.count()
1
>>> df.select(explode('foo').alias("fooColumn")).count()
1
value is null:
>>> df.select(explode('foo').alias("fooColumn")).select('fooColumn.value').show()
+-----+
|value|
+-----+
| null|
+-----+
I want to edit value and make a new dataframe. I can explode foo and set value:
>>> from pyspark.sql.functions import explode, lit
>>> fooUpdated = df.select(explode("foo").alias("fooColumn")).select("fooColumn.*").withColumn("value", lit(10)).select("value")
>>> fooUpdated.show()
+-----+
|value|
+-----+
|   10|
+-----+
How do I collapse this dataframe to put fooUpdated back in as an array of structs, or is there a way to do this without exploding foo?
In the end, I want to have the following:
>>> dfUpdated.select(explode('foo').alias("fooColumn")).select('fooColumn.value').show()
+-----+
|value|
+-----+
|   10|
+-----+
CodePudding user response:
You can use the transform higher-order function (available since Spark 2.4) to update each struct in the foo array.
Here's an example:
import pyspark.sql.functions as F

df.printSchema()
#root
# |-- foo: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- value: double (nullable = true)

# Rebuild each struct with value set to 10; the cast keeps the double type
# from the original schema.
df1 = df.withColumn("foo", F.expr("transform(foo, x -> struct(CAST(10 AS double) AS value))"))
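Note that transform builds a brand-new struct for each array element, so struct(... AS value) keeps only the fields you list. If your structs carry other fields (the SNIP in your schema suggests they do), copy them across explicitly. A minimal sketch, where otherField is a hypothetical stand-in for one of those fields:
# Hypothetical: otherField stands in for whatever other fields the structs
# contain; copy each one across and replace only value.
df1 = df.withColumn(
    "foo",
    F.expr("transform(foo, x -> struct(x.otherField AS otherField, CAST(10 AS double) AS value))")
)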
Now, you can show the value in df1 to verify it was updated:
df1.select(F.expr("inline(foo)")).show()
#+-----+
#|value|
#+-----+
#|   10|
#+-----+
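If you are on Spark 3.1 or later, the same update can also be written with the DataFrame API instead of a SQL string. This is a minimal sketch, not the only option; Column.withField replaces a single struct field and leaves the rest of the struct intact:
import pyspark.sql.functions as F

# Spark 3.1+: transform maps over the array, and withField replaces just the
# value field while preserving every other field in each struct.
df1 = df.withColumn(
    "foo",
    F.transform("foo", lambda x: x.withField("value", F.lit(10.0)))
)
Either way, foo stays an array of structs, so there is no need to explode it and collapse it back.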