In Databricks, using PySpark, I am working on a DataFrame that has a column where each row is a list of multiple dictionaries.
I would like to unpack/flatten this column so that there is a separate column for each dictionary key. However, the issue is that the data type of this column is string. How can I unpack the column?
For reference, here is an example of a value:
[{"long_name":"Sofia","short_name":"Sofia","types":["locality","political"]},{"long_name":"Sofia City Province","short_name":"Sofia City Province","types":["administrative_area_level_1","political"]},{"long_name":"Bulgaria","short_name":"BG","types":["country","political"]}]
Answer:
Your string column can be converted to an array of structs using from_json with the appropriate schema, and then inline can be used to explode the array into separate columns.
df = spark.createDataFrame(
    [('[{"long_name":"Sofia","short_name":"Sofia","types":["locality","political"]},{"long_name":"Sofia City Province","short_name":"Sofia City Province","types":["administrative_area_level_1","political"]},{"long_name":"Bulgaria","short_name":"BG","types":["country","political"]}]',)],
    ['address_components']
)

df = df.selectExpr(
    "inline(from_json(address_components, 'array<struct<long_name:string,short_name:string,types:array<string>>>'))"
)
df.show(truncate=0)
# +-------------------+-------------------+----------------------------------------+
# |long_name          |short_name         |types                                   |
# +-------------------+-------------------+----------------------------------------+
# |Sofia              |Sofia              |[locality, political]                   |
# |Sofia City Province|Sofia City Province|[administrative_area_level_1, political]|
# |Bulgaria           |BG                 |[country, political]                    |
# +-------------------+-------------------+----------------------------------------+
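
If your real DataFrame has other columns you want to keep next to the flattened ones, inline can be combined with them in the same select (Spark allows only one generator function per select list). A minimal sketch reusing the df above, assuming none of the existing columns clash with the names long_name, short_name or types:

df_flat = df.selectExpr(
    "*",  # keep every existing column
    "inline(from_json(address_components, 'array<struct<long_name:string,short_name:string,types:array<string>>>'))"
)

If you would rather not write the schema by hand, schema_of_json can infer it from one sample value, and the resulting DDL string can then be passed to from_json.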