How to update a dataframe column which contains arrays of dictionaries


Schema of the dataframe:

  root
    |-- parentColumn: array
    |    |-- element: struct
    |    |    |-- colA: string
    |    |    |-- colB: string
    |    |    |-- colTimestamp: string

The data inside the dataframe looks like this:

"parentColumn": [
        {
            "colA": "Test",
            "colB": "Test",
            "colTimestamp": "2020-08-17T03:28:44.986000"
        },
        {
            "colA": "UNREAD",
            "colB": "USER",
            "colTimestamp": "2020-08-17T03:28:44.986000"
        }
    ]

df.withColumn("parentColumn", ?)

Here I want to convert all colTimestamp values inside the array to UTC format. I've seen many examples of updating values inside an array, but I'm not able to find a way to update a dict inside an array.

CodePudding user response:

You can use the transform function to apply a function to each element of the array. I don't see any other option but to recreate the struct like this:

df.withColumn("parentColumn", transform('parentColumn, x => struct(
    x.getField("colA") as "colA",
    x.getField("colB") as "colB",
    to_utc_timestamp(x.getField("colTimestamp") , "GMT 2") as "colTimestamp"
)))

CodePudding user response:

If you're on Spark 3.1+, you can use the transform function with withField inside the lambda. The legacy time parser policy is set so that the to_timestamp pattern, which omits the fractional seconds, still parses the input strings.

spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY')
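
The snippets below assume pyspark.sql.functions imported as func and a dataframe data_sdf, neither of which is shown in the answer. A minimal setup could look like this; the exact construction is an assumption, with the column and field names taken from the output further down:

from pyspark.sql import functions as func

# sample data matching the schema shown below (construction assumed for illustration)
data_sdf = spark.createDataFrame(
    [(1, [('test', 'testB', '2020-08-17T03:28:44.986000'),
          ('UNREAD', 'USER', '2020-08-17T03:28:44.986000')])],
    'id int, parent_col array<struct<col_a: string, col_b: string, col_ts: string>>'
)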

# withField rewrites just the col_ts field of each struct element, keeping the other fields intact
data_sdf. \
    withColumn('parent_col_new',
               func.transform('parent_col',
                              lambda x: x.withField('col_ts',
                                                    func.to_timestamp(x.col_ts, "yyyy-MM-dd'T'HH:mm:ss")
                                                    )
                              )
               ). \
    show(truncate=False)

# +---+---------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
# |id |parent_col                                                                             |parent_col_new                                                           |
# +---+---------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
# |1  |[{test, testB, 2020-08-17T03:28:44.986000}, {UNREAD, USER, 2020-08-17T03:28:44.986000}]|[{test, testB, 2020-08-17 03:28:44}, {UNREAD, USER, 2020-08-17 03:28:44}]|
# +---+---------------------------------------------------------------------------------------+-------------------------------------------------------------------------+

# root
#  |-- id: integer (nullable = false)
#  |-- parent_col: array (nullable = false)
#  |    |-- element: struct (containsNull = false)
#  |    |    |-- col_a: string (nullable = true)
#  |    |    |-- col_b: string (nullable = true)
#  |    |    |-- col_ts: string (nullable = true)
#  |-- parent_col_new: array (nullable = false)
#  |    |-- element: struct (containsNull = false)
#  |    |    |-- col_a: string (nullable = true)
#  |    |    |-- col_b: string (nullable = true)
#  |    |    |-- col_ts: timestamp (nullable = true)
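
Note that the question asked for UTC specifically, while to_timestamp above only parses the string. If the source strings are in a non-UTC zone, you could additionally wrap the parsed value in to_utc_timestamp inside the same lambda. A sketch, with the 'GMT+2' source zone borrowed from the first answer as an assumption:

data_sdf. \
    withColumn('parent_col_new',
               func.transform('parent_col',
                              lambda x: x.withField('col_ts',
                                                    # 'GMT+2' is an assumed source time zone; adjust to your data
                                                    func.to_utc_timestamp(
                                                        func.to_timestamp(x.col_ts, "yyyy-MM-dd'T'HH:mm:ss"),
                                                        'GMT+2')))). \
    show(truncate=False)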

If withField and/or the transform function isn't available in your Spark version, you can use expr with the SQL transform function and recreate the struct. It'll result in the same output.

data_sdf. \
    withColumn('parent_col_new',
               func.expr('''
                         transform(parent_col, 
                                   x -> struct(x.col_a as col_a, 
                                               x.col_b as col_b, 
                                               to_timestamp(x.col_ts, "yyyy-MM-dd'T'HH:mm:ss") as col_ts
                                               )
                                   )
                         ''')
               ). \
    show(truncate=False)
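
And to update the column in place, as in the question's df.withColumn("parentColumn", ?), you'd write the result back under the original column name, for example:

# overwrite parent_col instead of adding a new column
data_sdf = data_sdf. \
    withColumn('parent_col',
               func.transform('parent_col',
                              lambda x: x.withField('col_ts',
                                                    func.to_timestamp(x.col_ts, "yyyy-MM-dd'T'HH:mm:ss"))))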