I have a dataframe that looks like this -
root
|-- query_results: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- result_type: string (nullable = true)
| | |-- title: string (nullable = true)
| | |-- attributes: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- key: string (nullable = true)
| | | | |-- value: string (nullable = true)
| | | |
The json data looks like this
{
"root": {
"queryresults": [
{
"title": "",
"resulttype": "other",
"attributes": [
{
"key": "A",
"value": "123"
},
{
"key": "B",
"value": "456"
}
]
},
{
"title": "title",
"resulttype": "answer",
"attributes": []
}
]
}
}
I have multiple query_results in root and multiple attributes in query_results
I want to create a new column titles
of type array which will depend on result_type, if result_type == 'answer'
then pick title
else pick value from attributes whose key == 'A' i.e "123"
Ex: {"titles" : {"123", "title"}}
There can be nulls in both title and attributes. In that case I want to skip picking any value.
I am trying trying to create a concatenated string initially and plan to convert it to array later but this ain't working.
df.selectExpr("CAST(aggregate(root.query_results,'', (a, item) -> concat(a,',', CASE WHEN item.resultType =='X' AND item.attributes IS NOT NULL THEN filter(item.attributes, x -> x.key = 'a')[0]["value"] ELSE item.title.s END )) as string) as titles")
CodePudding user response:
with the following input dataframe based on your sample json and schema
------------------------------------------------------
|queryresults |
------------------------------------------------------
|[{other, , [{A, 123}, {B, 456}]}, {answer, title, []}]|
------------------------------------------------------
you can use transform
function to iterate over each element of the array and filter
function on attributes array field
data_arr_sdf. \
withColumn('titles',
func.expr('''
transform(queryresults, x -> if(x.result_type="answer",
x.title,
if(size(x.attributes)>0,
filter(x.attributes, y -> y.key="A")[0].value,
null
)
)
)
''')
). \
show(truncate=False)
# ------------------------------------------------------ ------------
# |queryresults |titles |
# ------------------------------------------------------ ------------
# |[{other, , [{A, 123}, {B, 456}]}, {answer, title, []}]|[123, title]|
# ------------------------------------------------------ ------------