Schema of the dataframe:
root
|-- parentColumn: array
| |-- element: struct
| | |-- colA: string
| | |-- colB: string
| | |-- colTimestamp: string
The values inside the dataframe look like this:
"parentColumn": [
{
"colA": "LatestValueA",
"colB": "LatestValueB",
"colTimestamp": "2020-08-18T04:00:44.986000"
},
{
"colA": "OldValueA",
"colB": "OldValueB",
"colTimestamp": "2020-08-17T03:28:44.986000"
}
]
I want to fetch the value of colA from the element with the latest colTimestamp. In the given scenario, LatestValueA
should be returned because its colTimestamp is the latest.
I want to add this value as a new dataframe column:
df.withColumn("newColumn", ?)
CodePudding user response:
You can sort the array in descending order by colTimestamp and then take colA of the first element:
from pyspark.sql import functions as F

# Sort newest first; the comparator form of array_sort needs Spark 3.0+.
df.withColumn('sorted', F.expr("""array_sort(parentColumn, (l,r) -> case
    when l.colTimestamp < r.colTimestamp then 1
    when l.colTimestamp > r.colTimestamp then -1
    else 0 end)""")) \
  .withColumn('newColumn', F.col('sorted')[0].colA) \
  .show()
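
To see why this comparator puts the newest element first, here is a minimal plain-Python sketch of the same three-branch logic. It is not Spark code, only an illustration. It assumes every colTimestamp uses the same fixed-width ISO-8601 format as in the question, so that comparing the strings gives chronological order. The names parent_column and newest_first are made up for this example.

```python
from functools import cmp_to_key

# Sample rows copied from the question; plain dicts stand in for Spark structs.
# The older element is listed first on purpose, so the sort has work to do.
parent_column = [
    {"colA": "OldValueA", "colB": "OldValueB",
     "colTimestamp": "2020-08-17T03:28:44.986000"},
    {"colA": "LatestValueA", "colB": "LatestValueB",
     "colTimestamp": "2020-08-18T04:00:44.986000"},
]


def newest_first(l, r):
    """Hypothetical helper mirroring the SQL lambda: later timestamps sort earlier."""
    if l["colTimestamp"] < r["colTimestamp"]:
        return 1   # l is older, so it goes after r
    if l["colTimestamp"] > r["colTimestamp"]:
        return -1  # l is newer, so it goes before r
    return 0


sorted_rows = sorted(parent_column, key=cmp_to_key(newest_first))
new_column = sorted_rows[0]["colA"]
print(new_column)  # LatestValueA
```

If the timestamps could arrive in mixed formats, the string comparison would no longer be reliable, and parsing them into real timestamps before comparing would be the safer choice.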