I have the following schema:
>>> df.printSchema()
root
... SNIP ...
|-- foo: array (nullable = true)
| |-- element: struct (containsNull = true)
... SNIP ...
| | |-- value: double (nullable = true)
In this case, I have only one row in the dataframe and one element in the foo array:
>>> df.count()
1
>>> df.select(explode('foo').alias("fooColumn")).count()
1
value is null:
>>> df.select(explode('foo').alias("fooColumn")).select('fooColumn.value').show()
+-----+
|value|
+-----+
| null|
+-----+
I want to edit value and make a new dataframe. I can explode foo and set value:
>>> from pyspark.sql.functions import explode, lit
>>> fooUpdated = df.select(explode("foo").alias("fooColumn")).select("fooColumn.*").withColumn("value", lit(10)).select("value")
>>> fooUpdated.show()
+-----+
|value|
+-----+
|   10|
+-----+
How do I collapse this dataframe to put fooUpdated back in as an array of structs, or is there a way to do this without exploding foo?
In the end, I want to have the following:
>>> dfUpdated.select(explode('foo').alias("fooColumn")).select('fooColumn.value').show()
+-----+
|value|
+-----+
|   10|
+-----+
CodePudding user response:
You can use the transform higher-order function (available since Spark 2.4) to update each struct in the foo array.
Here's an example:
import pyspark.sql.functions as F

df.printSchema()
#root
# |-- foo: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- value: double (nullable = true)

# Rebuild each struct with value set to 10; the cast keeps the double type
# from the original schema.
df1 = df.withColumn("foo", F.expr("transform(foo, x -> struct(CAST(10 AS double) AS value))"))
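Note that transform builds a brand-new struct for each array element, so struct(... AS value) keeps only the fields you list. If your structs carry other fields (the SNIP in your schema suggests they do), copy them across explicitly. A minimal sketch, where otherField is a hypothetical stand-in for one of those fields:
# Hypothetical: otherField stands in for whatever other fields the structs
# contain; copy each one across and replace only value.
df1 = df.withColumn(
    "foo",
    F.expr("transform(foo, x -> struct(x.otherField AS otherField, CAST(10 AS double) AS value))")
)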
Now, you can show the value in df1 to verify it was updated:
df1.select(F.expr("inline(foo)")).show()
#+-----+
#|value|
#+-----+
#|   10|
#+-----+
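If you are on Spark 3.1 or later, the same update can also be written with the DataFrame API instead of a SQL string. This is a minimal sketch, not the only option; Column.withField replaces a single struct field and leaves the rest of the struct intact:
import pyspark.sql.functions as F

# Spark 3.1+: transform maps over the array, and withField replaces just the
# value field while preserving every other field in each struct.
df1 = df.withColumn(
    "foo",
    F.transform("foo", lambda x: x.withField("value", F.lit(10.0)))
)
Either way, foo stays an array of structs, so there is no need to explode it and collapse it back.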