How to fetch value from an array of struct dataframe after comparing one of its attribute containing

Time:09-23

Schema of the dataframe:

root
    |-- parentColumn: array
    |    |-- element: struct
    |    |    |-- colA: string
    |    |    |-- colB: string
    |    |    |-- colTimestamp: string

The values inside the dataframe look like this:

"parentColumn": [
        {
            "colA": "LatestValueA",
            "colB": "LatestValueB",
            "colTimestamp": "2020-08-18T04:00:44.986000"
        },
        {
            "colA": "OldValueA",
            "colB": "OldValueB",
            "colTimestamp": "2020-08-17T03:28:44.986000"
        }
    ]

I want to fetch the value of colA from the array element with the latest colTimestamp. In the given scenario, LatestValueA should be returned after the comparison, because its colTimestamp is the latest.

I want to add this value as a new dataframe column:

df.withColumn("newColumn", ?)

CodePudding user response:

You can sort the array in descending order of colTimestamp and then take colA from the first element (the comparator form of array_sort requires Spark 3.0 or later):

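from pyspark.sql import functions as F

# The comparator returns 1 when l is older than r, so newer structs sort first.
# Element 0 of the sorted array is then the latest entry, and we read its colA.
df.withColumn('sorted', F.expr("""array_sort(parentColumn, (l, r) -> case
          when l.colTimestamp < r.colTimestamp then 1
          when l.colTimestamp > r.colTimestamp then -1
          else 0 end)""")) \
  .withColumn('newColumn', F.col('sorted')[0].colA) \
  .show()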
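
To see what the comparator does without a Spark session, the same logic can be sketched in plain Python on the sample data from the question. This is only an illustration: the helper names newest_first and latest_col_a are hypothetical, and it relies on the assumption that every colTimestamp string uses the same fixed ISO-8601-style format, so that comparing the strings gives chronological order.

```python
from functools import cmp_to_key

# Sample data from the question, deliberately listed oldest-first here
# so that the sort has to reorder it.
parent_column = [
    {"colA": "OldValueA", "colB": "OldValueB",
     "colTimestamp": "2020-08-17T03:28:44.986000"},
    {"colA": "LatestValueA", "colB": "LatestValueB",
     "colTimestamp": "2020-08-18T04:00:44.986000"},
]


def newest_first(l, r):
    """Mirror of the SQL lambda: positive means l sorts after r."""
    if l["colTimestamp"] < r["colTimestamp"]:
        return 1
    if l["colTimestamp"] > r["colTimestamp"]:
        return -1
    return 0


def latest_col_a(structs):
    """Return colA of the newest struct, or None for an empty array."""
    if not structs:
        return None
    return sorted(structs, key=cmp_to_key(newest_first))[0]["colA"]


print(latest_col_a(parent_column))  # prints LatestValueA
```

If the timestamps came in mixed formats or with different time zone offsets, string comparison would no longer be reliable and the values would need to be parsed into real timestamps first.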