Home > Net >  merge or concatenate map/array values to a new column based on condition
merge or concatenate map/array values to a new column based on condition

Time:10-06

I have a dataframe that looks like this -

root
|-- query_results: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |  |-- result_type: string (nullable = true)
|    |  |-- title: string (nullable = true)
|    |  |-- attributes: array (nullable = true)
|    |  |  |-- element: struct (containsNull = true)
|    |  |  |  |-- key: string (nullable = true)
|    |  |  |  |-- value: string (nullable = true)
|    |  |  |

The json data looks like this

{
  "root": {
    "queryresults": [
      {
        "title": "",
        "resulttype": "other",
        "attributes": [
          {
            "key": "A",
            "value": "123"
          },
          {
            "key": "B",
            "value": "456"
          }
        ]
      },
      {
        "title": "title",
        "resulttype": "answer",
        "attributes": []
      }
    ]
  }
}



I have multiple query_results in root and multiple attributes in query_results

I want to create a new column titles of type array which will depend on result_type, if result_type == 'answer' then pick title else pick value from attributes whose key == 'A' i.e "123"

Ex: {"titles" : {"123", "title"}}

There can be nulls in both title and attributes. In that case I want to skip picking any value.

I am trying trying to create a concatenated string initially and plan to convert it to array later but this ain't working.

df.selectExpr("CAST(aggregate(root.query_results,'', (a, item) -> concat(a,',', CASE WHEN item.resultType =='X' AND item.attributes IS NOT NULL THEN filter(item.attributes, x -> x.key = 'a')[0]["value"] ELSE item.title.s END )) as string) as titles")

CodePudding user response:

with the following input dataframe based on your sample json and schema

 ------------------------------------------------------ 
|queryresults                                          |
 ------------------------------------------------------ 
|[{other, , [{A, 123}, {B, 456}]}, {answer, title, []}]|
 ------------------------------------------------------ 

you can use transform function to iterate over each element of the array and filter function on attributes array field

data_arr_sdf. \
    withColumn('titles', 
               func.expr('''
                         transform(queryresults, x -> if(x.result_type="answer", 
                                                         x.title, 
                                                         if(size(x.attributes)>0, 
                                                            filter(x.attributes, y -> y.key="A")[0].value, 
                                                            null
                                                            )
                                                          )
                                    )
                         ''')
               ). \
    show(truncate=False)

#  ------------------------------------------------------ ------------ 
# |queryresults                                          |titles      |
#  ------------------------------------------------------ ------------ 
# |[{other, , [{A, 123}, {B, 456}]}, {answer, title, []}]|[123, title]|
#  ------------------------------------------------------ ------------ 
  • Related