Schema of the dataframe:
root
|-- parentColumn: array
| |-- element: struct
| | |-- colA: string
| | |-- colB: string
| | |-- colTimestamp: string
The values inside the dataframe look like this:
"parentColumn": [
{
"colA": "Test",
"colB": "Test",
"colTimestamp": "2020-08-17T03:28:44.986000"
},
{
"colA": "UNREAD",
"colB": "USER",
"colTimestamp": "2020-08-17T03:28:44.986000"
}
]
df.withColumn("parentColumn", ?)
Here I want to convert every colTimestamp inside the array to UTC format. I have seen many examples of updating values inside an array, but I am not able to find a way to update a dict inside an array.
CodePudding user response:
If you're on Spark 3.1+, you can use the transform function with withField within a lambda function.
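For context, here is a minimal reproduction of the input used below; the variable and column names (data_sdf, parent_col, col_a, col_b, col_ts) are taken from the output further down, and the func import alias is an assumption:

from pyspark.sql import SparkSession, functions as func

spark = SparkSession.builder.getOrCreate()

# sample dataframe matching the question's schema; timestamps are plain strings
data_sdf = spark.createDataFrame(
    [(1, [('test', 'testB', '2020-08-17T03:28:44.986000'),
          ('UNREAD', 'USER', '2020-08-17T03:28:44.986000')])],
    'id int, parent_col array<struct<col_a: string, col_b: string, col_ts: string>>'
)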
# Spark 3's stricter parser rejects the trailing microseconds for this pattern,
# so fall back to the legacy (SimpleDateFormat) parser, which ignores them
spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY')

data_sdf. \
    withColumn('parent_col_new',
               func.transform('parent_col',
                              lambda x: x.withField('col_ts',
                                                    func.to_timestamp(x.col_ts, "yyyy-MM-dd'T'HH:mm:ss")
                                                    )
                              )
               ). \
    show(truncate=False)
# +---+-----------------------------------------------------------------------------------------+--------------------------------------------------------------------------+
# |id |parent_col                                                                               |parent_col_new                                                            |
# +---+-----------------------------------------------------------------------------------------+--------------------------------------------------------------------------+
# |1  |[{test, testB, 2020-08-17T03:28:44.986000}, {UNREAD, USER, 2020-08-17T03:28:44.986000}] |[{test, testB, 2020-08-17 03:28:44}, {UNREAD, USER, 2020-08-17 03:28:44}]|
# +---+-----------------------------------------------------------------------------------------+--------------------------------------------------------------------------+
# root
# |-- id: integer (nullable = false)
# |-- parent_col: array (nullable = false)
# | |-- element: struct (containsNull = false)
# | | |-- col_a: string (nullable = true)
# | | |-- col_b: string (nullable = true)
# | | |-- col_ts: string (nullable = true)
# |-- parent_col_new: array (nullable = false)
# | |-- element: struct (containsNull = false)
# | | |-- col_a: string (nullable = true)
# | | |-- col_b: string (nullable = true)
# | | |-- col_ts: timestamp (nullable = true)
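As a side note, parsing with the full pattern "yyyy-MM-dd'T'HH:mm:ss.SSSSSS" should also match the microseconds and may remove the need for the legacy parser setting; parser behavior varies between Spark versions, so verify on yours.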
If withField and/or transform isn't available in your Spark version, you can use expr and recreate the struct. It'll result in the same output.
data_sdf. \
    withColumn('parent_col_new',
               func.expr('''
                   transform(parent_col,
                             x -> struct(x.col_a as col_a,
                                         x.col_b as col_b,
                                         to_timestamp(x.col_ts, "yyyy-MM-dd'T'HH:mm:ss") as col_ts
                                        )
                            )
               ''')
               ). \
    show(truncate=False)
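The question asks for UTC specifically, while to_timestamp above only parses the string. If the source timestamps are wall-clock times in a known timezone, a variation of the first snippet could additionally shift them to UTC with to_utc_timestamp; this is a sketch, and the 'Asia/Kolkata' zone is purely a placeholder assumption:

# sketch: parse the strings, then shift the timestamps to UTC
data_sdf. \
    withColumn('parent_col_utc',
               func.transform('parent_col',
                              lambda x: x.withField('col_ts',
                                                    func.to_utc_timestamp(
                                                        func.to_timestamp(x.col_ts, "yyyy-MM-dd'T'HH:mm:ss"),
                                                        'Asia/Kolkata'  # assumed source timezone
                                                    ))
                              )
               ). \
    show(truncate=False)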
CodePudding user response:
You can use the transform function to apply a function to each element of the array. I don't see any other option but to recreate the struct, like this (in Scala):
// Scala; the Column-based transform requires Spark 3.0+
import org.apache.spark.sql.functions._
import spark.implicits._  // enables the 'parentColumn symbol syntax

df.withColumn("parentColumn", transform('parentColumn, x => struct(
  x.getField("colA") as "colA",
  x.getField("colB") as "colB",
  to_utc_timestamp(x.getField("colTimestamp"), "GMT+2") as "colTimestamp"
)))
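Note that to_utc_timestamp interprets its input as a wall-clock time in the given timezone ("GMT+2" here; adjust it to wherever the timestamps were produced) and shifts it to UTC. The string column is implicitly cast to a timestamp first, which works for the ISO-8601 strings in the question.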